Hotelling's T-squared test is a classical tool for testing whether the mean of a multivariate normal distribution equals a specified vector, or whether the means of two multivariate normal distributions are equal. When the dimension exceeds the sample size, the test is no longer applicable. In this situation, we revisit the tests proposed by Srivastava and Du (2008), who modify Hotelling's statistics by replacing the Wishart matrices with their diagonals and show that the revised statistics are asymptotically normal. We reexamine their statistics using random matrix theory and find that this asymptotic normality is only part of the picture. In fact, we prove that their statistics, depending on the Euclidean norm of the population correlation matrix, can converge to a normal distribution, to a mixture of chi-squared distributions, or to a convolution of the two. Examples are provided to illustrate the phase transition between the normal and mixed chi-squared limits. Our second contribution is a rigorous derivation of an asymptotically ratio-unbiased estimator of the squared Euclidean norm of the correlation matrix.
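As an illustration, here is a minimal sketch of the diagonally corrected one-sample statistic in the spirit of Srivastava and Du (2008). The exact finite-sample centering and scaling constants of their paper are omitted, so this simplified form is for intuition only, not the authors' estimator.

```python
import numpy as np

def srivastava_du_statistic(X):
    """One-sample test of H0: mu = 0 for an (n, p) data matrix X,
    replacing the full sample covariance by its diagonal."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)           # sample covariance (singular when p > n)
    d = np.diag(S)                        # its diagonal, still invertible
    R = S / np.sqrt(np.outer(d, d))       # sample correlation matrix
    quad = n * np.sum(xbar**2 / d)        # n * xbar' diag(S)^{-1} xbar
    # Center by p and scale by the naive standard deviation sqrt(2 tr(R^2));
    # the limiting law depends on how tr(R^2) grows with p.
    return (quad - p) / np.sqrt(2.0 * np.trace(R @ R))
```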
This paper studies the $\tau$-coherence of an $(n \times p)$ observation matrix in a Gaussian framework. The $\tau$-coherence is defined as the largest magnitude, outside a diagonal band of width $\tau$, of the empirical correlation coefficients associated with our observations. Using the Chen-Stein method, we derive the limiting law of the normalized coherence and show convergence towards a Gumbel distribution. We generalize the results of Cai and Jiang [CJ11a], assuming that the covariance matrix of the model is banded. Moreover, we provide numerical considerations highlighting issues that arise from the high-dimensional setting. We numerically illustrate the asymptotic behaviour of the coherence with Monte Carlo experiments, using an HPC splitting strategy for high-dimensional correlation matrices.
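The definition translates directly into code; the following sketch computes the (unnormalized) $\tau$-coherence and makes no attempt at the HPC splitting strategy mentioned above.

```python
import numpy as np

def tau_coherence(X, tau):
    """Largest |empirical correlation| outside a diagonal band of width tau,
    for an (n, p) observation matrix X."""
    R = np.corrcoef(X, rowvar=False)       # (p, p) empirical correlation matrix
    p = R.shape[0]
    i, j = np.triu_indices(p, k=tau + 1)   # index pairs with j - i > tau
    return np.abs(R[i, j]).max()
```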
We consider the problem of change point detection for high-dimensional distributions in a location family when the dimension can be much larger than the sample size. In change point analysis, the widely used cumulative sum (CUSUM) statistics are sensitive to outliers and heavy-tailed distributions. In this paper, we propose a robust, tuning-free (i.e., fully data-dependent), and easy-to-implement change point test that enjoys strong theoretical guarantees. To achieve robustness in a nonparametric setting, we formulate change point detection in a multivariate $U$-statistics framework with anti-symmetric and nonlinear kernels. Specifically, the within-sample noise is canceled out by the anti-symmetry of the kernel, while the signal distortion under certain nonlinear kernels can be controlled so that the magnitude of the between-sample change point signal is preserved. A (half) jackknife multiplier bootstrap (JMB) tailored to the change point detection setting is proposed to calibrate the distribution of our $\ell^{\infty}$-norm aggregated test statistic. Under mild moment conditions on the kernels, we derive uniform rates of convergence for the JMB approximation to the sampling distribution of the test statistic, and we analyze its size and power properties. Extensions to multiple change point testing and estimation are discussed and illustrated with numerical studies.
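The core scan statistic can be sketched as follows, using the coordinatewise sign kernel as one example of an anti-symmetric kernel. The JMB calibration is omitted, and the weighting is a generic CUSUM-type choice rather than the paper's exact construction.

```python
import numpy as np

def u_stat_scan(X, kernel):
    """l-infinity aggregated two-sample U-statistic scanned over candidate
    change points.  X: (n, p) array; kernel: anti-symmetric map, e.g.
    lambda x, y: np.sign(x - y)."""
    n = X.shape[0]
    stats = []
    for t in range(1, n):                  # split: samples [0, t) vs [t, n)
        # average the kernel over all (before, after) pairs
        H = np.mean([kernel(X[i], X[j])
                     for i in range(t) for j in range(t, n)], axis=0)
        w = np.sqrt(t * (n - t) / n)       # generic CUSUM-type weight
        stats.append(w * np.max(np.abs(H)))
    return max(stats)                      # large values suggest a change point
```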
Data harmonization is the process by which an equivalence is developed between two variables measuring a common trait. Our problem is motivated by dementia research, in which multiple tests are used in practice to measure the same underlying cognitive ability, such as language or memory. We connect this statistical problem to mixing distribution estimation. We introduce and study a nonparametric latent trait model, develop a method that enforces uniqueness of the regularized maximum likelihood estimator, show that a nonparametric EM algorithm converges weakly to its maximizer, and additionally propose a faster algorithm for learning a discretized approximation of the latent distribution. Furthermore, we develop methods to assess goodness of fit for the mixing likelihood, an aspect neglected in most mixing distribution estimation problems. We apply our method to the National Alzheimer's Coordinating Center Uniform Data Set and show that we can use it to convert between score measurements while accounting for measurement error. We show that this method outperforms standard techniques commonly used in dementia research. Full code is available at //github.com/SteveJWR/Data-Harmonization-Nonparametric.
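For the discretized approximation, the classical fixed-support EM update for the mixing weights can be sketched as below. This is a generic NPMLE-style illustration, not the paper's regularized or accelerated algorithm.

```python
import numpy as np

def mixing_weights_em(L, n_iter=500):
    """EM for the weights of a discretized mixing distribution.
    L: (n, K) matrix with L[i, k] = likelihood of observation i at
    latent grid point k.  Returns estimated weights over the grid."""
    n, K = L.shape
    w = np.full(K, 1.0 / K)                    # start from uniform weights
    for _ in range(n_iter):
        post = L * w                           # E-step: unnormalized posteriors
        post /= post.sum(axis=1, keepdims=True)
        w = post.mean(axis=0)                  # M-step: average responsibilities
    return w
```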
We consider the problem of making inference about the population mean of an outcome variable subject to nonignorable missingness. By leveraging a so-called shadow variable for the outcome, we propose a novel condition that ensures nonparametric identification of the outcome mean, even though the full data distribution is not identified. The identifying condition requires the existence of a function, as a solution to a representer equation, that connects the shadow variable to the outcome mean. Under this condition, we use sieves to nonparametrically solve the representer equation and propose an estimator that avoids modeling the propensity score or the outcome regression. We establish the asymptotic properties of the proposed estimator. We also show that the estimator is locally efficient and attains the semiparametric efficiency bound for the shadow variable model under certain regularity conditions. We illustrate the proposed approach via simulations and a real data application on home pricing.
We study the limiting behavior of the familywise error rate (FWER) of the Bonferroni procedure in a multiple testing problem. We establish that, in the equicorrelated normal setup, the FWER of Bonferroni's method tends to zero asymptotically (i.e., as the number of hypotheses grows) for any positive equicorrelation. We extend this result to generalized familywise error rates.
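This vanishing can be observed numerically. The sketch below estimates the FWER of two-sided Bonferroni tests under the equicorrelated global null, using the exchangeable one-factor representation of equicorrelated normals; increasing m with fixed rho > 0 drives the estimate toward zero.

```python
import numpy as np
from scipy.stats import norm

def bonferroni_fwer(m, rho, alpha=0.05, n_rep=20_000, seed=0):
    """Monte Carlo FWER of Bonferroni for m equicorrelated standard
    normal test statistics under the global null."""
    rng = np.random.default_rng(seed)
    c = norm.ppf(1 - alpha / (2 * m))            # two-sided Bonferroni cutoff
    W = rng.standard_normal(n_rep)               # shared factor
    E = rng.standard_normal((n_rep, m))          # idiosyncratic noise
    Z = np.sqrt(rho) * W[:, None] + np.sqrt(1 - rho) * E
    return np.mean(np.abs(Z).max(axis=1) > c)    # P(at least one rejection)

# e.g. compare bonferroni_fwer(10, 0.5) with bonferroni_fwer(10_000, 0.5)
```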
Modeling and drawing inference on the joint associations between single nucleotide polymorphisms and a disease has sparked interest in genome-wide association studies. In the motivating Boston Lung Cancer Survival Cohort (BLCSC) data, the presence of a large number of single nucleotide polymorphisms of interest, though smaller than the sample size, challenges inference on their joint associations with the disease outcome. In similar settings, we find that neither the de-biased lasso approach (van de Geer et al. 2014), which assumes sparsity of the inverse information matrix, nor the standard maximum likelihood method yields confidence intervals with satisfactory coverage probabilities for generalized linear models. Under this "large $n$, diverging $p$" scenario, we propose an alternative de-biased lasso approach that directly inverts the Hessian matrix without imposing the matrix sparsity assumption; this further reduces bias relative to the original de-biased lasso and ensures valid confidence intervals with nominal coverage probabilities. We establish the asymptotic distributions of arbitrary linear combinations of the parameter estimates, which lays the theoretical groundwork for drawing inference. Simulations show that the proposed refined de-biasing method performs well in removing bias and yields honest confidence interval coverage. We use the proposed method to analyze the aforementioned BLCSC data, a large-scale hospital-based epidemiological cohort study investigating the joint effects of genetic variants on lung cancer risk.
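A schematic of the Hessian-inversion de-biasing step for logistic regression appears below. The lasso tuning, the standard-error theory, and the paper's precise conditions are not reproduced; function and parameter names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def debiased_lasso_logistic(X, y, C=1.0):
    """One-step de-biasing of a lasso logistic fit (y coded 0/1) by
    directly inverting the (p+1) x (p+1) Hessian, feasible when p < n."""
    n = X.shape[0]
    fit = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    beta = np.concatenate([fit.intercept_, fit.coef_.ravel()])
    Z = np.hstack([np.ones((n, 1)), X])            # design with intercept
    mu = 1.0 / (1.0 + np.exp(-Z @ beta))           # fitted probabilities
    score = Z.T @ (y - mu) / n                     # score at the lasso fit
    H = Z.T @ (Z * (mu * (1 - mu))[:, None]) / n   # negative log-lik. Hessian
    beta_db = beta + np.linalg.solve(H, score)     # de-biased estimate
    cov = np.linalg.inv(H) / n                     # plug-in covariance estimate
    return beta_db, cov
```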
Motivated by the case fatality rate (CFR) of COVID-19, in this paper we develop a fully parametric quantile regression model based on the generalized three-parameter beta (GB3) distribution. Beta regression models are primarily used to model rates and proportions. However, these models are usually specified in terms of a conditional mean; therefore, they may be inadequate when the observed response variable follows an asymmetric distribution, as CFR data do. In addition, beta regression models do not consider the effect of the covariates across the spectrum of the dependent variable, which is possible through the conditional quantile approach. To introduce the proposed GB3 regression model, we first reparameterize the GB3 distribution by inserting a quantile parameter, and then we develop the proposed quantile model. We also propose a simple interpretation of the predictor-response relationship in terms of percentage increases or decreases of the quantile. A Monte Carlo study is carried out to evaluate the performance of the maximum likelihood estimates and the choice of link functions. Finally, a real COVID-19 dataset from Chile is analyzed and discussed to illustrate the proposed approach.
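To illustrate the kind of percentage interpretation referred to above, suppose the conditional quantile is modeled through a log link, $Q(y \mid x) = \exp(x^\top \beta)$ (an assumption for illustration; the paper's own link functions are selected via its Monte Carlo study). A one-unit increase in covariate $j$ then multiplies the quantile by $e^{\beta_j}$:

```python
import numpy as np

beta_j = 0.08                            # hypothetical coefficient
pct = 100 * (np.exp(beta_j) - 1)         # percentage change of the quantile
print(f"{pct:.2f}% increase per unit of covariate j")   # about 8.33%
```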
Let $\mathbf{X} = (X_i)_{1\leq i \leq n}$ be an i.i.d. sample of square-integrable variables in $\mathbb{R}^d$, with common expectation $\mu$ and covariance matrix $\Sigma$, both unknown. We consider the problem of testing if $\mu$ is $\eta$-close to zero, i.e. $\|\mu\| \leq \eta$, against $\|\mu\| \geq (\eta + \delta)$; we also tackle the more general two-sample mean closeness (also known as {\em relevant difference}) testing problem. The aim of this paper is to obtain nonasymptotic upper and lower bounds on the minimal separation distance $\delta$ such that we can control both the Type I and Type II errors at a given level. The main technical tools are concentration inequalities, first for a suitable estimator of $\|\mu\|^2$ used as a test statistic, and secondly for estimating the operator and Frobenius norms of $\Sigma$ entering the quantiles of said test statistic. These properties are obtained for Gaussian and bounded distributions. Particular attention is given to the dependence on the pseudo-dimension $d_*$ of the distribution, defined as $d_* := \|\Sigma\|_2^2/\|\Sigma\|_\infty^2$. In particular, for $\eta=0$, the minimum separation distance is ${\Theta}( d_*^{\frac{1}{4}}\sqrt{\|\Sigma\|_\infty/n})$, in contrast with the minimax estimation distance for $\mu$, which is ${\Theta}(d_e^{\frac{1}{2}}\sqrt{\|\Sigma\|_\infty/n})$ (where $d_e:=\|\Sigma\|_1/\|\Sigma\|_\infty$). This generalizes a phenomenon spelled out in particular by Baraud (2002).
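A natural unbiased estimator of $\|\mu\|^2$ of the kind alluded to above is the pairwise U-statistic $\frac{1}{n(n-1)}\sum_{i \neq j} \langle X_i, X_j \rangle$; the sketch below computes it (the paper's actual statistic and quantile calibration may differ).

```python
import numpy as np

def squared_mean_norm(X):
    """Unbiased U-statistic estimate of ||mu||^2 from an (n, d) sample:
    the average of <X_i, X_j> over all ordered pairs i != j."""
    n = X.shape[0]
    G = X @ X.T                                # Gram matrix of inner products
    return (G.sum() - np.trace(G)) / (n * (n - 1))
```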
Estimation of the mean vector and covariance matrix is of central importance in the analysis of multivariate data. In the framework of generalized linear models, the variances are usually certain functions of the means, with the normal distribution being an exception. We study some implications of functional relationships between the covariance and the mean by focusing on maximum likelihood and Bayesian estimation of the mean and covariance under the joint constraint $\bm{\Sigma}\bm{\mu} = \bm{\mu}$ for a multivariate normal distribution. A novel structured covariance is proposed through reparameterization of the spectral decomposition of $\bm{\Sigma}$ involving its eigenvalues and $\bm{\mu}$. This is designed to address the challenging issue of positive-definiteness and to reduce the number of covariance parameters from a quadratic to a linear function of the dimension. We propose a fast (noniterative) method for approximating the maximum likelihood estimator by maximizing a lower bound for the profile likelihood function, which is concave. We use normal and inverse gamma priors on the mean and eigenvalues, and approximate the maximum a posteriori estimators both by Metropolis-Hastings within Gibbs sampling and by a faster iterative method. A simulation study shows good performance of our estimators.
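The constraint $\bm{\Sigma}\bm{\mu} = \bm{\mu}$ says that $\mu$ is an eigenvector of $\Sigma$ with eigenvalue 1. One way to see the reparameterization at work is to build such a covariance explicitly: complete $\mu/\|\mu\|$ to an orthonormal basis and attach eigenvalue 1 to it. This is only a sketch of the idea, not the paper's estimation procedure.

```python
import numpy as np

def structured_cov(mu, lambdas, seed=0):
    """Covariance with Sigma @ mu = mu: mu/||mu|| gets eigenvalue 1,
    and lambdas are the remaining d-1 positive eigenvalues."""
    d = mu.size
    u = mu / np.linalg.norm(mu)
    # complete u to an orthonormal basis via QR
    M = np.column_stack([u, np.random.default_rng(seed).standard_normal((d, d - 1))])
    Q, _ = np.linalg.qr(M)                      # first column spans u
    eigs = np.concatenate([[1.0], np.asarray(lambdas)])
    return Q @ np.diag(eigs) @ Q.T

# sanity check: Sigma @ mu == mu
mu = np.array([1.0, 2.0, 3.0])
Sigma = structured_cov(mu, [0.5, 2.0])
assert np.allclose(Sigma @ mu, mu)
```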
UMAP (Uniform Manifold Approximation and Projection) is a novel manifold learning technique for dimension reduction. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical, scalable algorithm that applies to real-world data. The UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance. Furthermore, UMAP has no computational restrictions on embedding dimension, making it viable as a general-purpose dimension reduction technique for machine learning.
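A minimal usage sketch with the umap-learn reference implementation (assuming `pip install umap-learn`); parameter names follow that library, and the data here is a random placeholder.

```python
import numpy as np
import umap

X = np.random.default_rng(0).standard_normal((1000, 50))   # placeholder data
emb = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2).fit_transform(X)
print(emb.shape)   # (1000, 2)
```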