In this study, we propose a projection estimation method for large-dimensional matrix factor models with cross-sectionally spiked eigenvalues. By projecting the observation matrix onto the row or column factor space, we simplify factor analysis for matrix series to that for a lower-dimensional tensor. This method also reduces the magnitudes of the idiosyncratic error components, thereby increasing the signal-to-noise ratio, because the projection matrix linearly filters the idiosyncratic error matrix. We theoretically prove that the projected estimators of the factor loading matrices achieve faster convergence rates than existing estimators under similar conditions. Asymptotic distributions of the projected estimators are also presented. A novel iterative procedure is given to specify the pair of row and column factor numbers. Extensive numerical studies verify the empirical performance of the projection method. Two real examples in finance and macroeconomics reveal factor patterns across rows and columns that coincide with financial, economic, or geographical interpretations.
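As a minimal illustration of the projection idea, consider a synthetic matrix factor model X_t = R F_t C' + E_t (the notation, dimensions, and noise level below are our own choices, not fixed by the abstract): the column loading space is first estimated spectrally, the observations are projected onto it, and the row loading space is then re-estimated from the lower-dimensional projected data.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, k1, k2, T = 30, 20, 3, 2, 200

R = rng.standard_normal((p1, k1))          # row factor loadings
C = rng.standard_normal((p2, k2))          # column factor loadings
X = [R @ rng.standard_normal((k1, k2)) @ C.T
     + 0.01 * rng.standard_normal((p1, p2)) for _ in range(T)]

# Step 1: initial estimate of the column factor space from averaged X_t' X_t
M2 = sum(x.T @ x for x in X) / (T * p1 * p2)
C_hat = np.linalg.eigh(M2)[1][:, -k2:]     # top-k2 eigenvectors

# Step 2: project each observation onto the estimated column space;
# the projected data Y_t are p1 x k2 instead of p1 x p2
Y = [x @ C_hat for x in X]
M1 = sum(y @ y.T for y in Y) / (T * p1)
R_hat = np.linalg.eigh(M1)[1][:, -k1:]     # re-estimated row loading space

def subspace_dist(A, B):
    """Distance between column spaces via orthogonal projector difference."""
    Qa, Qb = np.linalg.qr(A)[0], np.linalg.qr(B)[0]
    return np.linalg.norm(Qa @ Qa.T - Qb @ Qb.T, 2)

err = subspace_dist(R, R_hat)
```

With weak idiosyncratic noise the projected estimator recovers the row loading space up to rotation, which is the invariance the abstract's convergence rates are stated under.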
Inhomogeneous phase-type (IPH) distributions are a broad class of laws that arise as the absorption times of Markov jump processes. In the time-homogeneous special case, we recover phase-type (PH) distributions. In matrix notation, various functionals corresponding to their distributional properties are explicitly available and succinctly described. As the number of parameters increases, IPH distributions can approximate, in the sense of weak convergence, any probability measure on the positive real line, making them particularly attractive candidates for statistical modelling. In contrast to PH distributions, the IPH class allows for a wide range of tail behaviours, which often leads to adequate estimation with a moderate number of parameters. One of the main difficulties in estimating PH and IPH distributions is their large number of matrix parameters. This drawback is best handled through the expectation-maximisation (EM) algorithm, which exploits the underlying and unobserved Markov structure. The matrixdist package provides tools for IPH distributions to efficiently evaluate functionals, simulate, and carry out maximum likelihood estimation through a three-step EM algorithm. The fitting routines support aggregated and right-censored data; in particular, one may fit to time-to-event data, histograms, or discretised theoretical distributions.
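matrixdist itself is an R package, so the following is only a language-neutral Python sketch of the matrix functionals involved: the PH density f(y) = π' exp(Ty) t, and a matrix-Weibull IPH variant f(y) = π' exp(T y^θ) θ y^(θ-1) t. The sub-intensity matrix, initial vector, and θ below are arbitrary choices of ours.

```python
import numpy as np

def expm(A, terms=30):
    """Matrix exponential via scaling-and-squaring with a truncated Taylor series."""
    s = max(0, int(np.ceil(np.log2(max(np.linalg.norm(A, 1), 1e-16)))) + 1)
    B = A / 2.0**s
    E = np.eye(len(A))
    term = np.eye(len(A))
    for k in range(1, terms):
        term = term @ B / k
        E = E + term
    for _ in range(s):
        E = E @ E
    return E

# A 2-phase example: sub-intensity matrix T (non-negative off-diagonal,
# non-positive row sums) and the implied exit-rate vector t
T = np.array([[-3.0,  1.0],
              [ 0.0, -5.0]])
pi = np.array([1.0, 0.0])          # initial distribution over transient states
t = -T @ np.ones(2)                # exit rates, here [2, 5]

def ph_density(y):
    """Phase-type density f(y) = pi' exp(T y) t."""
    return pi @ expm(T * y) @ t

def matrix_weibull_density(y, theta=1.5):
    """An IPH example (matrix-Weibull): f(y) = pi' exp(T y^theta) theta y^(theta-1) t."""
    return pi @ expm(T * y**theta) @ t * theta * y**(theta - 1)
```

Both functions integrate to one over the positive half-line; the matrix-Weibull case shows how a deterministic time change of the same Markov structure produces Weibull-like tails, which is the kind of tail flexibility the IPH class offers.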
This paper investigates the limiting properties of eigenvalues of multivariate sample spatial-sign covariance matrices when both the number of variables and the sample size grow to infinity. The underlying p-variate populations are general enough to include the popular independent components model and the family of elliptical distributions. A first result of the paper establishes that the distribution of the eigenvalues converges to a deterministic limit belonging to the family of generalized Marchenko-Pastur distributions. Furthermore, a new central limit theorem is established for a class of linear spectral statistics. We develop two applications of these results to robust statistics for a high-dimensional shape matrix. First, two statistics are proposed for testing sphericity. Next, a spectrum-corrected estimator based on the sample spatial-sign covariance matrix is proposed. Simulation experiments show that, in high dimensions, the sample spatial-sign covariance matrix provides a valid and robust tool for mitigating the influence of outliers.
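A short sketch of the sample spatial-sign covariance matrix makes the robustness claim concrete (the coordinatewise-median centering below is a simple choice of ours; the paper's centering may differ): because each observation is normalized to unit length before the outer products are averaged, gross outliers cannot inflate the spectrum.

```python
import numpy as np

def spatial_sign_cov(X, center=None):
    """Sample spatial-sign covariance matrix of the rows of X (n x p)."""
    if center is None:
        center = np.median(X, axis=0)   # a simple robust centering choice
    U = X - center
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # spatial signs: unit vectors
    return U.T @ U / len(X)

rng = np.random.default_rng(1)
n, p = 2000, 10
X = rng.standard_normal((n, p))
S = spatial_sign_cov(X)   # for a spherical population, S concentrates near I/p
```

Its trace is exactly 1 by construction, and for spherical data the eigenvalues cluster near 1/p with Marchenko-Pastur-type fluctuations; contaminating a few observations barely moves S, while it blows up the classical sample covariance.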
We present a novel class of projected methods for performing statistical analysis on a data set of probability distributions on the real line, endowed with the 2-Wasserstein metric. We focus in particular on Principal Component Analysis (PCA) and regression. To define these models, we exploit a representation of the Wasserstein space closely related to its weak Riemannian structure: we map the data to a suitable linear space and use a metric projection operator to constrain the results to the Wasserstein space. By carefully choosing the tangent point, we derive fast empirical methods that exploit a constrained B-spline approximation. As a byproduct of our approach, we also obtain faster routines for previous work on PCA for distributions. Through simulation studies, we compare our approaches to previously proposed methods, showing that the projected PCA achieves similar performance at a fraction of the computational cost and that the projected regression is extremely flexible even under misspecification. Several theoretical properties of the models are investigated, and asymptotic consistency is proven. Two real-world applications to COVID-19 mortality in the US and wind speed forecasting are discussed.
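A toy sketch of the projected-PCA idea, under our own simplifying assumptions (distributions represented by quantile functions on a fixed grid, no B-splines): PCA is run in the linear space of quantile functions, and the metric projection back onto Wasserstein space reduces to isotonic regression, i.e. the pool-adjacent-violators algorithm.

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: the L2 metric projection onto nondecreasing vectors."""
    blocks = []
    for v in y:
        blocks.append([float(v), 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2])
    return np.concatenate([[m] * w for m, w in blocks])

rng = np.random.default_rng(2)
grid = np.linspace(0.01, 0.99, 99)   # common quantile grid
# toy data set: quantile functions of uniform laws with random location/scale
Q = np.array([a + b * grid
              for a, b in zip(rng.normal(0, 1, 50), rng.uniform(0.5, 2, 50))])

mean = Q.mean(axis=0)
U, s, Vt = np.linalg.svd(Q - mean, full_matrices=False)
k = 2
scores = U[:, :k] * s[:k]
recon = mean + scores @ Vt[:k]                # PCA reconstruction in linear space
recon = np.array([pava(r) for r in recon])    # project back: enforce monotonicity
err = np.max(np.abs(recon - Q))
```

On this exactly two-dimensional toy family the two-component reconstruction is exact, and every projected output is a valid (nondecreasing) quantile function.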
A huge number of applications in various fields, such as gene expression analysis and computer vision, involve high-dimensional, low-sample-size (HDLSS) data sets, which pose great challenges for standard statistical and modern machine learning methods. In this paper, we propose a novel classification criterion for HDLSS data, tolerance similarity, which emphasizes maximizing the within-class variance on the premise of class separability. Based on this criterion, we design a novel linear binary classifier, the No-separated Data Maximum Dispersion classifier (NPDMD). The objective of NPDMD is to find a projecting direction w along which all training samples scatter over as large an interval as possible. NPDMD has several advantages over state-of-the-art classification methods. First, it works well on HDLSS data. Second, it combines sample statistical information and local structural information (supporting vectors) in the objective function to find the projecting direction over the whole feature space. Third, it computes the inverse of a high-dimensional matrix in a low-dimensional space. Fourth, it is relatively simple to implement via quadratic programming. Fifth, it is robust to model specification across a variety of real applications. We derive the theoretical properties of NPDMD and conduct a series of evaluations on one simulated and six real-world benchmark data sets, including face classification and mRNA classification. NPDMD outperforms widely used approaches in most cases, or at least obtains comparable results.
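The NPDMD direction itself is found by a quadratic program; as a toy illustration of the tolerance-similarity criterion only (the scoring function below is our own, not the paper's solver), one can compare candidate directions by the spread of the projections, admitting a direction only if it keeps the classes separated.

```python
import numpy as np

def tolerance_score(w, X, y):
    """Total spread of the projections along w, admissible only if
    the two classes are separated along that direction."""
    z = X @ w / np.linalg.norm(w)
    z0, z1 = z[y == 0], z[y == 1]
    separated = z0.max() < z1.min() or z1.max() < z0.min()
    return z.max() - z.min() if separated else -np.inf

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(0.0, 0.3, (50, 2)) + [4.0, 0.0]])
y = np.r_[np.zeros(50), np.ones(50)]

s_sep = tolerance_score(np.array([1.0, 0.0]), X, y)  # separating axis: large spread
s_bad = tolerance_score(np.array([0.0, 1.0]), X, y)  # classes overlap: rejected
```

Among separating directions, the criterion prefers the one along which the training samples occupy the largest interval, which is the "maximum dispersion" idea in the abstract.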
In this paper, we study variable-order (VO) time-fractional diffusion equations. For a VO function $\alpha(t)\in(0,1)$, we develop an exponential-sum-approximation (ESA) technique to approximate the VO Caputo fractional derivative. The ESA technique keeps both the quadrature exponents and the number of exponentials in the summation unchanged across time levels. The approximation parameters are properly selected to achieve the desired accuracy efficiently. Compared with the general direct method, the proposed method reduces the storage requirement from $\mathcal{O}(n)$ to $\mathcal{O}(\log^2 n)$ and the computational cost from $\mathcal{O}(n^2)$ to $\mathcal{O}(n\log^2 n)$, where $n$ is the number of time levels. When this fast algorithm is used to construct a fast ESA scheme for VO time-fractional diffusion equations, the computational complexity of the scheme is only $\mathcal{O}(mn\log^2 n)$, with an $\mathcal{O}(m\log^2 n)$ storage requirement, where $m$ denotes the number of spatial grid points. We prove the unconditional stability of the fast ESA scheme and provide an error analysis. The effectiveness of the proposed algorithm is verified by numerical examples.
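The core of any ESA-type method is approximating the weakly singular kernel $t^{-\alpha}$ by a sum of exponentials. A minimal sketch (quadrature rule, step size, and truncation interval are our own illustrative choices, not the paper's optimized parameters) uses trapezoidal quadrature of the Gamma-function integral representation after an exponential change of variables:

```python
import numpy as np
from math import gamma

def esa_kernel(alpha, h=0.1, xmin=-30.0, xmax=30.0):
    """Exponential-sum approximation t**(-alpha) ~ sum_j w[j] * exp(-s[j] * t),
    from trapezoidal quadrature of
        t**(-alpha) = (1 / Gamma(alpha)) * int_0^inf s**(alpha - 1) * exp(-s * t) ds
    after the substitution s = exp(x)."""
    x = np.arange(xmin, xmax + h, h)
    s = np.exp(x)                              # quadrature exponents
    w = h * np.exp(alpha * x) / gamma(alpha)   # quadrature weights
    return w, s

alpha = 0.5
w, s = esa_kernel(alpha)
t = np.linspace(0.1, 1.0, 50)
approx = (w * np.exp(-np.outer(t, s))).sum(axis=1)
rel_err = np.max(np.abs(approx - t**(-alpha)) * t**alpha)
```

Once the kernel is in exponential-sum form, each history term can be advanced by one time step with a single multiplication by $e^{-s_j \Delta t}$ plus a local increment, so the nodes and weights staying fixed across time levels is precisely what enables the fast time stepping.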
Robust estimation is an important problem in statistics that aims at providing a reasonable estimator when the data-generating distribution lies within an appropriately defined ball around an uncontaminated distribution. Although minimax rates of estimation have been established in recent years, many existing robust estimators with provably optimal convergence rates are computationally intractable. In this paper, we study several estimation problems under a Wasserstein contamination model and present computationally tractable estimators motivated by generative adversarial networks (GANs). Specifically, we analyze the properties of Wasserstein GAN-based estimators for location estimation, covariance matrix estimation, and linear regression, and show that our proposed estimators are minimax optimal in many scenarios. Finally, we present numerical results demonstrating the effectiveness of our estimators.
We show that for the problem of testing if a matrix $A \in F^{n \times n}$ has rank at most $d$, or requires changing an $\epsilon$-fraction of entries to have rank at most $d$, there is a non-adaptive query algorithm making $\widetilde{O}(d^2/\epsilon)$ queries. Our algorithm works for any field $F$. This improves upon the previous $O(d^2/\epsilon^2)$ bound (SODA'03), and bypasses an $\Omega(d^2/\epsilon^2)$ lower bound of (KDD'14) which holds if the algorithm is required to read a submatrix. Our algorithm is the first such algorithm which does not read a submatrix, and instead reads a carefully selected non-adaptive pattern of entries in rows and columns of $A$. We complement our algorithm with a matching query complexity lower bound for non-adaptive testers over any field. We also give tight bounds of $\widetilde{\Theta}(d^2)$ queries in the sensing model, for which query access comes in the form of $\langle X_i, A\rangle := \mathrm{tr}(X_i^\top A)$; perhaps surprisingly, these bounds do not depend on $\epsilon$. We next develop a novel property testing framework for testing numerical properties of a real-valued matrix $A$ more generally, which includes the stable rank, Schatten-$p$ norms, and SVD entropy. Specifically, we propose a bounded entry model, where $A$ is required to have entries bounded by $1$ in absolute value. We give upper and lower bounds for a wide range of problems in this model, and discuss connections to the sensing model above.
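For reference, the three numerical matrix properties named above have standard spectral definitions (the SVD-entropy normalization below is one common convention; the paper may use another). A direct sketch:

```python
import numpy as np

def spectral_stats(A, p=3):
    """Stable rank, Schatten-p norm, and SVD entropy of A."""
    s = np.linalg.svd(A, compute_uv=False)
    stable_rank = (s**2).sum() / s[0]**2        # ||A||_F^2 / ||A||_2^2
    schatten_p = (s**p).sum() ** (1.0 / p)      # (sum_i sigma_i^p)^(1/p)
    q = s**2 / (s**2).sum()                     # normalized squared spectrum
    svd_entropy = -(q[q > 0] * np.log(q[q > 0])).sum()
    return stable_rank, schatten_p, svd_entropy

A = np.eye(4)   # a bounded-entry matrix: all entries in [-1, 1]
sr, sp, ent = spectral_stats(A)
```

The identity matrix has the flattest possible spectrum (stable rank equal to the dimension, maximal SVD entropy), while a rank-one matrix sits at the opposite extreme; these are the quantities the bounded entry model asks to approximate from few queries.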
Implicit probabilistic models are models defined naturally in terms of a sampling procedure, and they often induce a likelihood function that cannot be expressed explicitly. We develop a simple method for estimating parameters in implicit models that does not require knowledge of the form of the likelihood function or any derived quantities, yet can be shown to be equivalent to maximizing the likelihood under some conditions. Our result holds in the non-asymptotic parametric setting, where both the capacity of the model and the number of data examples are finite. We also demonstrate encouraging experimental results.
We introduce negative binomial matrix factorization (NBMF), a matrix factorization technique specially designed for analyzing over-dispersed count data. It can be viewed as an extension of Poisson matrix factorization (PF) perturbed by a multiplicative term which models exposure. This term introduces a degree of freedom for controlling the dispersion, making NBMF more robust to outliers. We show that NBMF makes it possible to skip traditional pre-processing stages, such as binarization, which lead to a loss of information. Two estimation approaches are presented: maximum likelihood and variational Bayes inference. We test our model on a recommendation task and show its ability to predict user tastes with better precision than PF.
This paper describes a suite of algorithms for constructing low-rank approximations of an input matrix from a random linear image of the matrix, called a sketch. These methods can preserve structural properties of the input matrix, such as positive-semidefiniteness, and they can produce approximations with a user-specified rank. The algorithms are simple, accurate, numerically stable, and provably correct. Moreover, each method is accompanied by an informative error bound that allows users to select parameters a priori to achieve a given approximation quality. These claims are supported by numerical experiments with real and synthetic data.
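One well-known instance of such sketch-based reconstruction (a sketch of the generic two-sided scheme, not necessarily the paper's exact algorithm or parameter choices): capture a range sketch Y = A Ω and a co-range sketch W = Ψ A, then reconstruct a low-rank approximation from the sketches alone.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, r = 100, 80, 5
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # exactly rank r

k, l = 10, 15                       # sketch sizes: k >= target rank, l > k
Omega = rng.standard_normal((n, k)) # random test matrices
Psi = rng.standard_normal((l, m))
Y = A @ Omega                       # range sketch (m x k)
W = Psi @ A                         # co-range sketch (l x n)

# Reconstruction uses only Y, W, and Psi -- the original A is no longer needed:
Q, _ = np.linalg.qr(Y)                            # orthonormal range basis
X = np.linalg.lstsq(Psi @ Q, W, rcond=None)[0]    # least-squares core factor
A_hat = Q @ X                                     # rank-<=k approximation
err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
```

When the input is exactly rank r and the sketch sizes exceed r, the reconstruction is exact up to floating-point error; for general inputs, a priori error bounds of the kind the abstract describes guide the choice of k and l for a target accuracy.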