The asymptotic behaviour of Linear Spectral Statistics (LSS) of the smoothed periodogram estimator of the spectral coherency matrix of a complex Gaussian high-dimensional time series $(\y_n)_{n \in \mathbb{Z}}$ with independent components is studied under the asymptotic regime where the sample size $N$ converges towards $+\infty$ while the dimension $M$ of $\y$ and the smoothing span of the estimator grow to infinity at the same rate in such a way that $\frac{M}{N} \rightarrow 0$. It is established that, at each frequency, the estimated spectral coherency matrix is close from the sample covariance matrix of an independent identically $\mathcal{N}_{\mathbb{C}}(0,\I_M)$ distributed sequence, and that its empirical eigenvalue distribution converges towards the Marcenko-Pastur distribution. This allows to conclude that each LSS has a deterministic behaviour that can be evaluated explicitly. Using concentration inequalities, it is shown that the order of magnitude of the supremum over the frequencies of the deviation of each LSS from its deterministic approximation is of the order of $\frac{1}{M} + \frac{\sqrt{M}}{N}+ (\frac{M}{N})^{3}$ where $N$ is the sample size. Numerical simulations supports our results.
In health-pollution cohort studies, accurate predictions of pollutant concentrations at new locations are needed, since the locations of fixed monitoring sites and study participants are often spatially misaligned. For multi-pollution data, principal component analysis (PCA) is often incorporated to obtain low-rank (LR) structure of the data prior to spatial prediction. Recently developed predictive PCA modifies the traditional algorithm to improve the overall predictive performance by leveraging both LR and spatial structures within the data. However, predictive PCA requires complete data or an initial imputation step. Nonparametric imputation techniques without accounting for spatial information may distort the underlying structure of the data, and thus further reduce the predictive performance. We propose a convex optimization problem inspired by the LR matrix completion framework and develop a proximal algorithm to solve it. Missing data are imputed and handled concurrently within the algorithm, which eliminates the necessity of a separate imputation step. We show that our algorithm has low computational burden and leads to reliable predictive performance as the severity of missing data increases.
We study sparse linear regression over a network of agents, modeled as an undirected graph and no server node. The estimation of the $s$-sparse parameter is formulated as a constrained LASSO problem wherein each agent owns a subset of the $N$ total observations. We analyze the convergence rate and statistical guarantees of a distributed projected gradient tracking-based algorithm under high-dimensional scaling, allowing the ambient dimension $d$ to grow with (and possibly exceed) the sample size $N$. Our theory shows that, under standard notions of restricted strong convexity and smoothness of the loss functions, suitable conditions on the network connectivity and algorithm tuning, the distributed algorithm converges globally at a {\it linear} rate to an estimate that is within the centralized {\it statistical precision} of the model, $O(s\log d/N)$. When $s\log d/N=o(1)$, a condition necessary for statistical consistency, an $\varepsilon$-optimal solution is attained after $\mathcal{O}(\kappa \log (1/\varepsilon))$ gradient computations and $O (\kappa/(1-\rho) \log (1/\varepsilon))$ communication rounds, where $\kappa$ is the restricted condition number of the loss function and $\rho$ measures the network connectivity. The computation cost matches that of the centralized projected gradient algorithm despite having data distributed; whereas the communication rounds reduce as the network connectivity improves. Overall, our study reveals interesting connections between statistical efficiency, network connectivity \& topology, and convergence rate in high dimensions.
This study presents PRISM, a probabilistic simplex component analysis approach to identifying the vertices of a data-circumscribing simplex from data. The problem has a rich variety of applications, the most notable being hyperspectral unmixing in remote sensing and non-negative matrix factorization in machine learning. PRISM uses a simple probabilistic model, namely, uniform simplex data distribution and additive Gaussian noise, and it carries out inference by maximum likelihood. The inference model is sound in the sense that the vertices are provably identifiable under some assumptions, and it suggests that PRISM can be effective in combating noise when the number of data points is large. PRISM has strong, but hidden, relationships with simplex volume minimization, a powerful geometric approach for the same problem. We study these fundamental aspects, and we also consider algorithmic schemes based on importance sampling and variational inference. In particular, the variational inference scheme is shown to resemble a matrix factorization problem with a special regularizer, which draws an interesting connection to the matrix factorization approach. Numerical results are provided to demonstrate the potential of PRISM.
This paper studies the inference of the regression coefficient matrix under multivariate response linear regressions in the presence of hidden variables. A novel procedure for constructing confidence intervals of entries of the coefficient matrix is proposed. Our method first utilizes the multivariate nature of the responses by estimating and adjusting the hidden effect to construct an initial estimator of the coefficient matrix. By further deploying a low-dimensional projection procedure to reduce the bias introduced by the regularization in the previous step, a refined estimator is proposed and shown to be asymptotically normal. The asymptotic variance of the resulting estimator is derived with closed-form expression and can be consistently estimated. In addition, we propose a testing procedure for the existence of hidden effects and provide its theoretical justification. Both our procedures and their analyses are valid even when the feature dimension and the number of responses exceed the sample size. Our results are further backed up via extensive simulations and a real data analysis.
This paper studies the spectral estimation problem of estimating the locations of a fixed number of point sources given multiple snapshots of Fourier measurements collected by a uniform array of sensors. We prove novel stability bounds for MUSIC and ESPRIT as a function of the noise standard deviation, number of snapshots, source amplitudes, and support. Our most general result is a perturbation bound of the signal space in terms of the minimum singular value of Fourier matrices. When the point sources are located in several separated clumps, we provide an explicit upper bound of the noise-space correlation perturbation error in MUSIC and the support error in ESPRIT in terms of a Super-Resolution Factor (SRF). The upper bound for ESPRIT is then compared with a new Cram\'er-Rao lower bound for the clumps model. As a result, we show that ESPRIT is comparable to that of the optimal unbiased estimator(s) in terms of the dependence on noise, number of snapshots and SRF. As a byproduct of our analysis, we discover several fundamental differences between the single-snapshot and multi-snapshot problems. Our theory is validated by numerical experiments.
In the theory of linear switching systems with discrete time, as in other areas of mathematics, the problem of studying the growth rate of the norms of all possible matrix products $A_{\sigma_{n}}\cdots A_{\sigma_{0}}$ with factors from a set of matrices $\mathscr{A}$ arises. So far, only for a relatively small number of classes of matrices $\mathscr{A}$ has it been possible to accurately describe the sequences of matrices that guarantee the maximum rate of increase of the corresponding norms. Moreover, in almost all cases studied theoretically, the index sequences $\{\sigma_{n}\}$ of matrices maximizing the norms of the corresponding matrix products have been shown to be periodic or so-called Sturmian, which entails a whole set of "good" properties of the sequences $\{A_{\sigma_{n}}\}$, in particular the existence of a limiting frequency of occurrence of each matrix factor $A_{i}\in\mathscr{A}$ in them. In the paper it is shown that this is not always the case: a class of matrices is defined consisting of two $2\times 2$ matrices, similar to rotations in the plane, in which the sequence $\{A_{\sigma_{n}}\}$ maximizing the growth rate of the norms $\|A_{\sigma_{n}}\cdots A_{\sigma_{0}}\|$ is not Sturmian. All considerations are based on numerical modeling and cannot be considered mathematically rigorous in this part; rather, they should be interpreted as a set of questions for further comprehensive theoretical analysis.
The recently proposed statistical finite element (statFEM) approach synthesises measurement data with finite element models and allows for making predictions about the true system response. We provide a probabilistic error analysis for a prototypical statFEM setup based on a Gaussian process prior under the assumption that the noisy measurement data are generated by a deterministic true system response function that satisfies a second-order elliptic partial differential equation for an unknown true source term. In certain cases, properties such as the smoothness of the source term may be misspecified by the Gaussian process model. The error estimates we derive are for the expectation with respect to the measurement noise of the $L^2$-norm of the difference between the true system response and the mean of the statFEM posterior. The estimates imply polynomial rates of convergence in the numbers of measurement points and finite element basis functions and depend on the Sobolev smoothness of the true source term and the Gaussian process model. A numerical example for Poisson's equation is used to illustrate these theoretical results.
Statistical inference on the explained variation of an outcome by a set of covariates is of particular interest in practice. When the covariates are of moderate to high-dimension and the effects are not sparse, several approaches have been proposed for estimation and inference. One major problem with the existing approaches is that the inference procedures are not robust to the normality assumption on the covariates and the residual errors. In this paper, we propose an estimating equation approach to the estimation and inference on the explained variation in the high-dimensional linear model. Unlike the existing approaches, the proposed approach does not rely on the restrictive normality assumptions for inference. It is shown that the proposed estimator is consistent and asymptotically normally distributed under reasonable conditions. Simulation studies demonstrate better performance of the proposed inference procedure in comparison with the existing approaches. The proposed approach is applied to studying the variation of glycohemoglobin explained by environmental pollutants in a National Health and Nutrition Examination Survey data set.
We show that for the problem of testing if a matrix $A \in F^{n \times n}$ has rank at most $d$, or requires changing an $\epsilon$-fraction of entries to have rank at most $d$, there is a non-adaptive query algorithm making $\widetilde{O}(d^2/\epsilon)$ queries. Our algorithm works for any field $F$. This improves upon the previous $O(d^2/\epsilon^2)$ bound (SODA'03), and bypasses an $\Omega(d^2/\epsilon^2)$ lower bound of (KDD'14) which holds if the algorithm is required to read a submatrix. Our algorithm is the first such algorithm which does not read a submatrix, and instead reads a carefully selected non-adaptive pattern of entries in rows and columns of $A$. We complement our algorithm with a matching query complexity lower bound for non-adaptive testers over any field. We also give tight bounds of $\widetilde{\Theta}(d^2)$ queries in the sensing model for which query access comes in the form of $\langle X_i, A\rangle:=tr(X_i^\top A)$; perhaps surprisingly these bounds do not depend on $\epsilon$. We next develop a novel property testing framework for testing numerical properties of a real-valued matrix $A$ more generally, which includes the stable rank, Schatten-$p$ norms, and SVD entropy. Specifically, we propose a bounded entry model, where $A$ is required to have entries bounded by $1$ in absolute value. We give upper and lower bounds for a wide range of problems in this model, and discuss connections to the sensing model above.
In this paper, we study the optimal convergence rate for distributed convex optimization problems in networks. We model the communication restrictions imposed by the network as a set of affine constraints and provide optimal complexity bounds for four different setups, namely: the function $F(\xb) \triangleq \sum_{i=1}^{m}f_i(\xb)$ is strongly convex and smooth, either strongly convex or smooth or just convex. Our results show that Nesterov's accelerated gradient descent on the dual problem can be executed in a distributed manner and obtains the same optimal rates as in the centralized version of the problem (up to constant or logarithmic factors) with an additional cost related to the spectral gap of the interaction matrix. Finally, we discuss some extensions to the proposed setup such as proximal friendly functions, time-varying graphs, improvement of the condition numbers.