The problem of testing the equality of mean vectors for high-dimensional data has been intensively investigated in the literature. However, most existing tests impose strong assumptions on the underlying group covariance matrices, which may not be satisfied, or can hardly be checked, in practice. In this article, an F-type test for the two-sample Behrens--Fisher problem for high-dimensional data is proposed and studied. When the two samples are normally distributed and the null hypothesis holds, the proposed F-type test statistic is shown to be an F-type mixture, that is, a ratio of two independent chi-square-type mixtures. Under some regularity conditions and the null hypothesis, the proposed F-type test statistic and the above F-type mixture are shown to have the same normal and non-normal limits. It is therefore justified to approximate the null distribution of the proposed F-type test statistic by that of the F-type mixture, resulting in the so-called normal reference F-type test. Since the F-type mixture is a ratio of two independent chi-square-type mixtures, we apply the Welch--Satterthwaite chi-square approximation to the distributions of the numerator and the denominator of the F-type mixture, resulting in an approximate F-distribution whose degrees of freedom can be consistently estimated from the data. The asymptotic power of the proposed F-type test is established. Two simulation studies show that, in terms of size control, the proposed F-type test outperforms two existing competitors. The proposed F-type test is also illustrated with a real-data example.
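As a generic illustration of the Welch--Satterthwaite step described above (notation ours, not taken from the article): a chi-square-type mixture is approximated by a scaled chi-square distribution whose scale and degrees of freedom match its first two moments,
\[
T \;=\; \sum_{i} \lambda_i \chi^2_{1,i} \;\approx\; \beta\,\chi^2_{d},
\qquad
\beta \;=\; \frac{\sum_i \lambda_i^2}{\sum_i \lambda_i},
\qquad
d \;=\; \frac{\bigl(\sum_i \lambda_i\bigr)^2}{\sum_i \lambda_i^2},
\]
so that both sides have mean $\sum_i\lambda_i$ and variance $2\sum_i\lambda_i^2$. Applying this separately to the independent numerator and denominator, each normalized by its mean, gives an $F_{d_1,d_2}$ approximation; in practice the $\lambda_i$'s, and hence $d_1$ and $d_2$, must be estimated from the data.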
We present a simple method to approximate Rao's distance between multivariate normal distributions, based on discretizing curves joining normal distributions and approximating the Rao distances between successive nearby normal distributions on the curve by the Jeffreys divergence. We experimentally consider the linear interpolation curves in the ordinary, natural, and expectation parameterizations of the normal distributions, and compare these curves with a curve derived from Calvo and Oller's isometric embedding of the Fisher--Rao $d$-variate normal manifold into the cone of $(d+1)\times (d+1)$ symmetric positive-definite matrices [Journal of Multivariate Analysis 35.2 (1990): 223--242]. We report on our experiments and assess the quality of our approximation technique by comparing the numerical approximations with lower and upper bounds. Finally, we present some information-geometric properties of Calvo and Oller's isometric embedding.
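A minimal numerical sketch (ours) of this curve-discretization strategy, using the linear interpolation curve in the ordinary $(\mu,\Sigma)$ parameterization: between nearby normals the Jeffreys divergence approximates the squared Rao distance, so its square root is summed along the discretized curve. The step count and example inputs below are arbitrary, and the paper's exact recipe and choice of curves may differ.
\begin{verbatim}
import numpy as np

def jeffreys_mvn(mu1, S1, mu2, S2):
    # Closed-form Jeffreys (symmetrized KL) divergence between
    # N(mu1, S1) and N(mu2, S2); the log-determinant terms cancel.
    d = len(mu1)
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    dm = mu2 - mu1
    return 0.5 * (np.trace(S2i @ S1) + np.trace(S1i @ S2) - 2 * d
                  + dm @ (S1i + S2i) @ dm)

def rao_approx(mu_a, S_a, mu_b, S_b, T=1000):
    # Discretize the linear interpolation curve in (mu, Sigma) and sum
    # sqrt(Jeffreys) over successive nearby normals along the curve.
    ts = np.linspace(0.0, 1.0, T + 1)
    mus = [(1 - t) * mu_a + t * mu_b for t in ts]
    Ss = [(1 - t) * S_a + t * S_b for t in ts]
    return sum(np.sqrt(jeffreys_mvn(mus[k], Ss[k], mus[k + 1], Ss[k + 1]))
               for k in range(T))

# Toy example with two bivariate normals.
mu_a, S_a = np.zeros(2), np.eye(2)
mu_b, S_b = np.array([1.0, 0.0]), np.diag([2.0, 0.5])
print(rao_approx(mu_a, S_a, mu_b, S_b))
\end{verbatim}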
Semi-supervised learning is extensively applied these days to estimate classifiers from training data in which not all of the labels of the feature vectors are available. With the use of generative models that propose a form for the joint distribution of a feature vector and its ground-truth label, the Bayes' classifier can be estimated via maximum likelihood on partially classified training data. To increase the accuracy of this sample classifier, \cite{ahfock2020apparent} proposed that a missing-label mechanism be adopted and that the Bayes' classifier be estimated on the basis of the full likelihood, formed in a framework that models the probability of a missing label given its observed feature vector in terms of its entropy. In the case of two Gaussian classes with a common covariance matrix, it was shown that the classifier so estimated from the partially classified training data can even have a lower error rate than the one estimated from the completely classified sample. Here, we focus on an algorithm for estimating the Bayes' classifier via the full likelihood in the case of multiple Gaussian classes with arbitrary covariance matrices. Different strategies for initializing the algorithm are discussed and illustrated. A new \proglang{R} package implementing these tools, \texttt{gmmsslm}, is demonstrated on real data.
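To make the structure of such a full likelihood concrete, here is one form consistent with the description above; the particular logistic-in-log-entropy parameterization is our illustration and may differ in detail from \cite{ahfock2020apparent}. Writing $\tau_k(x;\theta)$ for the posterior probability of class $k$ given feature vector $x$ under the Gaussian mixture, and $e(x;\theta)=-\sum_k \tau_k(x;\theta)\log\tau_k(x;\theta)$ for its entropy, the label of $x_j$ is taken to be missing with probability $q(x_j;\theta,\xi)$, for instance
\[
\mathrm{logit}\, q(x;\theta,\xi) \;=\; \xi_0 + \xi_1 \log e(x;\theta),
\]
so that the full likelihood multiplies classified-observation terms of the form $f(x_j, z_j;\theta)\{1-q(x_j;\theta,\xi)\}$ with unclassified-observation terms of the form $f(x_j;\theta)\,q(x_j;\theta,\xi)$.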
Estimating the entropy rate of discrete time series is a challenging problem with important applications in numerous areas including neuroscience, genomics, image processing and natural language processing. A number of approaches have been developed for this task, typically based either on universal data compression algorithms, or on statistical estimators of the underlying process distribution. In this work, we propose a fully Bayesian approach to entropy estimation. Building on the recently introduced Bayesian Context Trees (BCT) framework for modelling discrete time series as variable-memory Markov chains, we show that it is possible to sample directly from the induced posterior on the entropy rate. This can be used to estimate the entire posterior distribution, providing much richer information than point estimates. We develop theoretical results for the posterior distribution of the entropy rate, including proofs of consistency and asymptotic normality. The practical utility of the method is illustrated on both simulated and real-world data, where it is found to outperform state-of-the-art alternatives.
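As a small illustration of the quantity being sampled (our sketch, using a fixed-order chain rather than the variable-memory context-tree models of the BCT framework): each posterior draw of the model parameters induces a Markov transition matrix, and applying the plug-in entropy-rate formula to many such draws yields Monte Carlo samples from the posterior of the entropy rate.
\begin{verbatim}
import numpy as np

def entropy_rate(P):
    # Entropy rate (in bits) of an ergodic Markov chain with transition
    # matrix P: H = -sum_s pi_s sum_t P[s, t] log2 P[s, t],
    # where pi is the stationary distribution (left eigenvector of P).
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi = pi / pi.sum()
    logP = np.where(P > 0, np.log2(np.where(P > 0, P, 1.0)), 0.0)
    return -float(np.sum(pi[:, None] * P * logP))

# Each posterior draw of the model induces such a transition matrix; applying
# entropy_rate to many draws gives Monte Carlo samples of the entropy rate.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
print(entropy_rate(P))
\end{verbatim}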
We analyze the performance of the least absolute shrinkage and selection operator (Lasso) for the linear model when the number of regressors $N$ grows large while the true support size $d$ stays finite, i.e., the ultra-sparse case. The result is based on a novel treatment of the non-rigorous replica method of statistical physics, which has previously been applied only to settings where $N$, $d$, and the number of observations $M$ tend to infinity at the same rate. Our analysis makes it possible to assess the average performance of the Lasso with Gaussian sensing matrices without assumptions on the scaling of $N$ and $M$, the noise distribution, or the profile of the true signal. Under mild conditions on the noise distribution, the analysis also offers a lower bound on the sample complexity necessary for partial and perfect support recovery when $M$ diverges as $M = O(\log N)$. The obtained bound for perfect support recovery generalizes that given in previous literature, which only considers the case of Gaussian noise and diverging $d$. Extensive numerical experiments strongly support our analysis.
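A toy experiment in the spirit of this setting (ours; all parameter choices are arbitrary illustrations, not those of the paper): a Gaussian sensing matrix, a signal with finite support, and a check of whether the Lasso recovers that support.
\begin{verbatim}
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, d, M = 1000, 5, 200              # regressors, true support size, observations
beta = np.zeros(N)
beta[:d] = 1.0                       # ultra-sparse true signal
X = rng.standard_normal((M, N))      # Gaussian sensing matrix
y = X @ beta + 0.5 * rng.standard_normal(M)

fit = Lasso(alpha=0.2, fit_intercept=False).fit(X, y)
support = np.flatnonzero(fit.coef_)
print("recovered support:", support,
      "perfect recovery:", set(support) == set(range(d)))
\end{verbatim}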
In this paper, we target the problem of sufficient dimension reduction with responses that are symmetric positive definite matrices. We propose the intrinsic minimum average variance estimation method and the intrinsic outer product gradient method, which fully exploit the geometric structure of the Riemannian manifold on which the responses lie. We present algorithms for our newly developed methods under the log-Euclidean metric and the log-Cholesky metric. Each of the two metrics is linked to an abelian Lie group structure that transforms our model defined on a manifold into a Euclidean one. The proposed methods are then further extended to general Riemannian manifolds. We establish rigorous asymptotic results for the proposed estimators, including the rate of convergence and asymptotic normality. We also develop a cross-validation algorithm, with theoretical guarantees, for estimating the structural dimension. Comprehensive simulation studies and an application to the New York taxi network data demonstrate the superiority of the proposed methods.
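A minimal sketch (ours) of the log-Euclidean device mentioned above: the matrix logarithm maps an SPD response to a symmetric matrix, after which a Euclidean dimension-reduction method such as OPG or MAVE can be applied to the half-vectorized transforms. This is only the transformation step, not the full intrinsic estimators of the paper.
\begin{verbatim}
import numpy as np

def spd_log(S):
    # Matrix logarithm of an SPD matrix via its eigendecomposition;
    # under the log-Euclidean metric this maps the SPD manifold onto
    # the Euclidean space of symmetric matrices.
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def vech(A):
    # Half-vectorization: stack the upper-triangular entries.
    return A[np.triu_indices_from(A)]

# A toy SPD response; Euclidean dimension-reduction methods (e.g. OPG/MAVE)
# can then be run on the vech(spd_log(.)) transforms of the responses.
S = np.array([[2.0, 0.3],
              [0.3, 1.0]])
print(vech(spd_log(S)))
\end{verbatim}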
Bayes factors for composite hypotheses have difficulty encoding vague prior knowledge, leading to conflicts between objectivity and sensitivity, including the Jeffreys--Lindley paradox. To address these issues we revisit the posterior Bayes factor, in which the posterior distribution from the data at hand is re-used in the Bayes factor for the same data. We argue that this is biased when calibrated against proper Bayes factors, but propose bias adjustments to allow interpretation on the same scale. In the important case of a regular normal model, the bias on the log scale is half the number of parameters. The resulting empirical Bayes factor is closely related to the widely applicable information criterion. We develop test-based empirical Bayes factors for several standard tests and propose an extension to multiple testing closely related to the optimal discovery procedure. When only a P-value is available, such as in non-parametric tests, we obtain a Bayes factor calibration of $10p$. We propose interpreting the strength of Bayes factors on a logarithmic scale with base 3.73, reflecting the sharpest distinction between weaker and stronger belief. Empirical Bayes factors are a frequentist--Bayesian compromise expressing an evidential view of hypothesis testing.
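Reading the two numerical statements above literally (a worked example of ours, not an excerpt from the paper): for a regular normal model with $p$ parameters the bias adjustment amounts to subtracting $p/2$ on the log scale, and strength can be reported in units of log base $3.73$,
\[
\log \mathrm{EBF} \;=\; \log \mathrm{PBF} \;-\; \frac{p}{2},
\qquad
\text{strength} \;=\; \log_{3.73} \mathrm{EBF} \;=\; \frac{\ln \mathrm{EBF}}{\ln 3.73}.
\]
For instance, with $p=2$ parameters a posterior Bayes factor of $20$ becomes $\mathrm{EBF} = 20\,e^{-1} \approx 7.4$, i.e.\ about $\log_{3.73} 7.4 \approx 1.5$ units of evidence.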
Non-asymptotic statistical analysis is often missing for modern geometry-aware machine learning algorithms due to the possibly intricate non-linear manifold structure. This paper studies an intrinsic mean model on the manifold of restricted positive semi-definite matrices and provides a non-asymptotic statistical analysis of the Karcher mean. We also consider a general extrinsic signal-plus-noise model, under which a deterministic error bound of the Karcher mean is provided. As an application, we show that the distributed principal component analysis algorithm, LRC-dPCA, achieves the same performance as the full sample PCA algorithm. Numerical experiments lend strong support to our theories.
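For context (our sketch, using the standard affine-invariant construction on full-rank SPD matrices, which is not the restricted positive semi-definite setting analyzed in the paper), the Karcher mean can be computed by a simple fixed-point iteration:
\begin{verbatim}
import numpy as np

def _sym_fun(S, f):
    # Apply a scalar function to a symmetric matrix through its eigenvalues.
    w, V = np.linalg.eigh(S)
    return (V * f(w)) @ V.T

def karcher_mean(mats, iters=50):
    # Fixed-point iteration for the Karcher (Frechet) mean of SPD matrices
    # under the affine-invariant metric, initialized at the Euclidean mean.
    X = sum(mats) / len(mats)
    for _ in range(iters):
        Xh = _sym_fun(X, np.sqrt)
        Xih = _sym_fun(X, lambda w: 1.0 / np.sqrt(w))
        T = sum(_sym_fun(Xih @ A @ Xih, np.log) for A in mats) / len(mats)
        X = Xh @ _sym_fun(T, np.exp) @ Xh
    return X

rng = np.random.default_rng(0)
mats = [np.eye(3) + C @ C.T for C in rng.standard_normal((5, 3, 3))]
print(karcher_mean(mats))
\end{verbatim}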
Grid-free Monte Carlo methods such as \emph{walk on spheres} can be used to solve elliptic partial differential equations without mesh generation or global solves. However, such methods independently estimate the solution at every point, and hence do not take advantage of the high spatial regularity of solutions to elliptic problems. We propose a fast caching strategy which first estimates solution values and derivatives at randomly sampled points along the boundary of the domain (or a local region of interest). These cached values then provide cheap, output-sensitive evaluation of the solution (or its gradient) at interior points, via a boundary integral formulation. Unlike classic boundary integral methods, our caching scheme introduces zero statistical bias and does not require a dense global solve. Moreover we can handle imperfect geometry (e.g., with self-intersections) and detailed boundary/source terms without repairing or resampling the boundary representation. Overall, our scheme is similar in spirit to \emph{virtual point light} methods from photorealistic rendering: it suppresses the typical salt-and-pepper noise characteristic of independent Monte Carlo estimates, while still retaining the many advantages of Monte Carlo solvers: progressive evaluation, trivial parallelization, geometric robustness, etc. We validate our approach using test problems from visual and geometric computing.
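A bare-bones 2-D walk-on-spheres estimator (ours), without the boundary caching scheme that is the paper's contribution: each walk repeatedly jumps to a uniformly random point on the largest empty circle around the current point until it lands within $\varepsilon$ of the boundary, and the Dirichlet data found there are averaged.
\begin{verbatim}
import numpy as np

def walk_on_spheres(x, boundary_value, dist_to_boundary, eps=1e-3,
                    n_walks=2000, seed=0):
    # Estimate the harmonic function u(x) solving Laplace's equation with
    # Dirichlet data boundary_value: each walk jumps to a uniform point on
    # the largest empty circle around the current point until it is within
    # eps of the boundary, and the boundary values hit are averaged.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_walks):
        p = np.array(x, dtype=float)
        while (r := dist_to_boundary(p)) > eps:
            theta = rng.uniform(0.0, 2.0 * np.pi)
            p = p + r * np.array([np.cos(theta), np.sin(theta)])
        total += boundary_value(p)
    return total / n_walks

# Unit disk with boundary data g(x, y) = x; the harmonic extension is u = x,
# so the estimate at (0.3, 0.2) should be close to 0.3.
dist = lambda p: 1.0 - np.linalg.norm(p)
g = lambda p: p[0]
print(walk_on_spheres([0.3, 0.2], g, dist))
\end{verbatim}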
We consider the problem of testing the equality of the conditional distributions of a response variable given a vector of covariates between two populations. Such a hypothesis testing problem can be motivated by various machine learning and statistical inference scenarios, including transfer learning and causal predictive inference. We develop a nonparametric test procedure inspired by the conformal prediction framework. The construction of our test statistic combines recent developments in conformal prediction with a novel choice of conformity score, resulting in a weighted rank-sum test statistic that is valid and powerful under general settings. To our knowledge, this is the first successful attempt to use conformal prediction for testing statistical hypotheses beyond exchangeability. Our method is suitable for modern machine learning scenarios where the data are high-dimensional and the sample sizes are large, and it can be effectively combined with existing classification algorithms to find good conformity score functions. The performance of the proposed method is demonstrated in various numerical examples.
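A schematic version of the idea (ours): fit a conditional model on part of one sample, score held-out points from both samples with a conformity score, and compare the two score samples with a rank-sum test. This unweighted sketch does not reproduce the paper's weighted rank-sum statistic or its validity guarantees.
\begin{verbatim}
import numpy as np
from scipy.stats import ranksums
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two samples of (X, Y); under the null the conditional law of Y given X
# is the same in both populations (true in this toy example).
X1, X2 = rng.standard_normal((500, 3)), rng.standard_normal((500, 3))
Y1 = (X1[:, 0] + 0.5 * rng.standard_normal(500) > 0).astype(int)
Y2 = (X2[:, 0] + 0.5 * rng.standard_normal(500) > 0).astype(int)

# Learn a conditional model on half of sample 1, then score held-out points
# from both samples with a conformity score: here, the predicted probability
# of the observed label.  Compare the two score samples with a rank-sum test.
clf = LogisticRegression().fit(X1[:250], Y1[:250])
score = lambda X, Y: clf.predict_proba(X)[np.arange(len(Y)), Y]
print(ranksums(score(X1[250:], Y1[250:]), score(X2, Y2)))
\end{verbatim}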
The receiver operating characteristic (ROC) curve is a powerful statistical tool that has been widely applied in medical research. In ROC curve estimation, a commonly used assumption is that the larger the biomarker value, the greater the severity of the disease. In this paper, we mathematically interpret ``greater severity of the disease'' as ``larger probability of being diseased''. This in turn is equivalent to assuming the likelihood ratio ordering of the biomarker between the diseased and healthy individuals. Under this assumption, we first propose a Bernstein polynomial method to model the distributions of both samples; we then estimate the distributions by the maximum empirical likelihood principle. The ROC curve estimate and the associated summary statistics are obtained subsequently. Theoretically, we establish the asymptotic consistency of our estimators. Via extensive numerical studies, we compare the performance of our method with that of competing methods. The application of our method is illustrated by a real-data example.
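For orientation (our sketch), the unconstrained empirical ROC estimate below ignores the likelihood-ratio-ordering constraint and the Bernstein-polynomial modelling that the paper adds; it only illustrates the basic target quantity $\mathrm{ROC}(t) = 1 - G\{F^{-1}(1-t)\}$, where $F$ and $G$ are the biomarker distributions of the healthy and diseased groups and $t$ is the false-positive rate.
\begin{verbatim}
import numpy as np

def empirical_roc(healthy, diseased, grid=np.linspace(0.0, 1.0, 101)):
    # ROC(t) = 1 - G(F^{-1}(1 - t)) with F, G the biomarker CDFs of the
    # healthy and diseased groups and t the false-positive rate.
    cutoffs = np.quantile(healthy, 1.0 - grid)
    tpr = np.array([np.mean(diseased > c) for c in cutoffs])
    return grid, tpr

rng = np.random.default_rng(0)
healthy, diseased = rng.normal(0.0, 1.0, 300), rng.normal(1.0, 1.0, 300)
fpr, tpr = empirical_roc(healthy, diseased)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoid rule
print("empirical AUC:", auc)
\end{verbatim}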