We develop a dimension reduction framework for data consisting of matrices of counts. Our model assumes the existence of a small number of independent normal latent variables that drive the dependency structure of the observed data, and it can be seen as an exact discrete analogue of a contaminated low-rank matrix normal model. We derive estimators for the model parameters and establish their root-$n$ consistency. An extension of a recent proposal from the literature is used to estimate the latent dimension of the model. Additionally, a sparsity-accommodating variant of the model is considered. The method is shown to surpass both its vectorization-based competitors and matrix methods that assume a continuous data distribution in analysing simulated data and real abundance data.
The recently developed matrix-based Rényi's entropy enables the measurement of information in data simply from the eigenspectrum of a symmetric positive semi-definite (PSD) matrix in a reproducing kernel Hilbert space, without estimating the underlying data distribution. This intriguing property has led to the wide adoption of this information measure in multiple statistical inference and learning tasks. However, computing this quantity involves the trace of a PSD matrix $G$ raised to the power $\alpha$ (i.e., $\mathrm{tr}(G^\alpha)$), with a typical complexity of nearly $O(n^3)$, which severely hampers its practical usage when the number of samples (i.e., $n$) is large. In this work, we present computationally efficient approximations to this entropy functional that reduce its complexity to significantly less than $O(n^2)$. To this end, we first develop randomized approximations to $\mathrm{tr}(G^\alpha)$ that transform the trace estimation into a matrix-vector multiplication problem. We extend this strategy to arbitrary values of $\alpha$ (integer or non-integer). We then establish the connection between the matrix-based Rényi's entropy and PSD matrix approximation, which enables us to exploit both the clustering and the block low-rank structure of $G$ to further reduce the computational cost. We provide theoretical guarantees on approximation accuracy and illustrate the properties of the different approximations. Large-scale experimental evaluations on both synthetic and real-world data corroborate our theoretical findings, showing promising speedups with negligible loss in accuracy.
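To make the randomized trace idea concrete, here is a minimal NumPy sketch of a Hutchinson-type estimator of $\mathrm{tr}(G^\alpha)$ for integer $\alpha$, using only matrix-vector products with $G$. The function name, probe count, and the restriction to integer powers are choices made for this illustration and do not reflect the paper's exact algorithms; handling non-integer $\alpha$ would require additional machinery (e.g., polynomial approximations of $x^\alpha$) not shown here.

```python
import numpy as np

def randomized_trace_power(G, alpha, num_probes=50, rng=None):
    """Estimate tr(G^alpha) for integer alpha >= 1 with Hutchinson's
    estimator: tr(A) ~ mean_i x_i^T A x_i using Rademacher probes x_i.
    Each probe costs alpha matrix-vector products with G."""
    rng = np.random.default_rng(rng)
    n = G.shape[0]
    estimates = []
    for _ in range(num_probes):
        x = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        v = x.copy()
        for _ in range(alpha):                # apply G alpha times
            v = G @ v
        estimates.append(x @ v)               # x^T G^alpha x
    return float(np.mean(estimates))

# toy check on a small PSD Gram matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
G = X @ X.T / 200.0
print(randomized_trace_power(G, alpha=2, rng=1),
      np.trace(np.linalg.matrix_power(G, 2)))
```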
Many economic and scientific problems involve the analysis of high-dimensional functional time series, where the number of functional variables ($p$) diverges as the number of serially dependent observations ($n$) increases. In this paper, we present a novel functional factor model for high-dimensional functional time series that maintains and makes use of the functional and dynamic structure to achieve substantial dimension reduction and uncover the latent factor structure. To estimate the number of functional factors and the factor loadings, we propose a fully functional estimation procedure based on an eigenanalysis of a nonnegative definite matrix. Our proposal involves a weight matrix to improve estimation efficiency and tackle the issue of heterogeneity, the rationale for which is illustrated by formulating the estimation from a novel regression perspective. Asymptotic properties of the proposed method are studied when $p$ diverges at some polynomial rate as $n$ increases. To provide a parsimonious model and enhance interpretability for near-zero factor loadings, we impose sparsity assumptions on the factor loading space and then develop a regularized estimation procedure with theoretical guarantees when $p$ grows exponentially fast relative to $n$. Finally, we demonstrate that our proposed estimators significantly outperform the competing methods through both simulations and applications to a U.K. temperature dataset and a Japanese mortality dataset.
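As a rough scalar-valued analogue of the eigenanalysis described above (ignoring the functional nature of the data, the weight matrix, and the sparsity extension), one can form a nonnegative definite matrix from lagged autocovariances, eigen-decompose it, and pick the number of factors with an eigenvalue-ratio rule. The following sketch is illustrative only; the function name, the lag window, and the ratio rule are assumptions, not the paper's procedure.

```python
import numpy as np

def factor_loadings_by_eigenanalysis(X, max_lag=2, n_factors=None):
    """X: (n, p) multivariate time series. Form M = sum_k Sigma_k Sigma_k^T
    from lag-k autocovariances Sigma_k, eigen-decompose M, and use an
    eigenvalue-ratio rule to choose the number of factors."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    M = np.zeros((p, p))
    for k in range(1, max_lag + 1):
        Sigma_k = Xc[:-k].T @ Xc[k:] / n      # lag-k autocovariance
        M += Sigma_k @ Sigma_k.T              # nonnegative definite by construction
    evals, evecs = np.linalg.eigh(M)
    evals, evecs = evals[::-1], evecs[:, ::-1]  # sort descending
    if n_factors is None:
        ratios = evals[1:p // 2] / evals[:p // 2 - 1]
        n_factors = int(np.argmin(ratios)) + 1  # largest drop in the spectrum
    return evecs[:, :n_factors], n_factors
```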
We propose a novel scheme that allows a MIMO system to modulate a set of permutation matrices in order to send more information bits, extending our initial work on the topic. This system is called Permutation Matrix Modulation (PMM). The basic idea is to employ a permutation matrix as a precoder and treat it as a modulated symbol. We continue the evolution of index modulation in MIMO by adopting all-antenna activation and obtaining a set of unique symbols by altering the positions of the antenna transmit power. We analyse the achievable rate of PMM under a Gaussian Mixture Model (GMM) distribution and evaluate the numerical results by comparing them with other existing systems. The results show that PMM outperforms the existing systems under a fair parameter setting. We also present a way to attain the optimal achievable rate of PMM by solving a maximization problem via an interior-point method. A low-complexity detection scheme based on zero-forcing (ZF) is proposed, and maximum likelihood (ML) detection is discussed. We demonstrate the trade-off between symbol error rate (SER) performance and computational complexity: ZF performs worse in the SER simulations but requires much less computation than ML.
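A minimal sketch of the PMM idea in its simplest form: enumerate the $n!$ permutation matrices of an $n$-antenna system and use the matrix selected by the information index to permute a fixed power-allocation vector. All names below are hypothetical; rate analysis, constellation design, and ZF/ML detection are not shown.

```python
import numpy as np
from itertools import permutations

def pmm_codebook(n_antennas):
    """All n! permutation matrices; each selected matrix conveys
    floor(log2(n!)) information bits when used as a precoder."""
    mats = []
    for perm in permutations(range(n_antennas)):
        P = np.zeros((n_antennas, n_antennas))
        P[np.arange(n_antennas), np.array(perm)] = 1.0
        mats.append(P)
    return mats

def pmm_transmit(power_vector, index, codebook):
    """Precode the power/symbol vector with the permutation matrix
    selected by the information index."""
    return codebook[index] @ power_vector

codebook = pmm_codebook(4)                      # 24 matrices -> 4 extra bits
s = np.array([1.0, 0.5, 0.25, 0.125])           # toy per-antenna power vector
x = pmm_transmit(s, index=5, codebook=codebook) # positions of the powers carry bits
print(len(codebook), x)
```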
Nearest neighbor (NN) matching as a tool to align data sampled from different groups is both conceptually natural and widely used in practice. In a landmark paper, Abadie and Imbens (2006) provided the first large-sample analysis of NN matching, albeit under the crucial assumption that the number of NNs, $M$, is fixed. This manuscript draws a new insight from their study and shows that, once $M$ is allowed to diverge with the sample size, an intrinsic statistic in their analysis actually constitutes a consistent estimator of the density ratio. Furthermore, by selecting a suitable $M$, this statistic can attain the minimax lower bound of estimation over a Lipschitz density function class. Consequently, with a diverging $M$, NN matching provably yields a doubly robust estimator of the average treatment effect and is semiparametrically efficient if the density functions are sufficiently smooth and the outcome model is appropriately specified. It can thus be viewed as a precursor of double machine learning estimators.
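The density-ratio ingredient can be illustrated with a standard $M$-nearest-neighbour construction: estimate each density by an $M$-NN estimate and take the ratio, so that the normalizing constants cancel. This is only a sketch in the spirit of the discussion above, not the specific matching statistic analysed in the paper; the function name and the default $M$ are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_density_ratio(x_query, sample_num, sample_den, M=50):
    """Estimate f_num / f_den at the query points by the ratio of M-NN
    distance-based density estimates f_hat(x) ~ M / (n * r_M(x)^d)."""
    d = x_query.shape[1]
    r_num = cKDTree(sample_num).query(x_query, k=M)[0][:, -1]  # M-th NN distance
    r_den = cKDTree(sample_den).query(x_query, k=M)[0][:, -1]
    n_num, n_den = len(sample_num), len(sample_den)
    return (M / (n_num * r_num ** d)) / (M / (n_den * r_den ** d))

# toy usage: ratio of a N(1,1) density to a N(0,1) density
rng = np.random.default_rng(0)
num = rng.normal(1.0, 1.0, size=(2000, 1))
den = rng.normal(0.0, 1.0, size=(2000, 1))
print(knn_density_ratio(num[:5], num, den, M=100))
```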
When are inferences (whether Direct-Likelihood, Bayesian, or Frequentist) obtained from partial data valid? This paper answers this question by offering a new theory of inference with missing data. It proves that, as the sample size increases and the extent of missingness decreases, the mean log-likelihood function generated by partial data while ignoring the missingness mechanism almost surely converges uniformly to the one that would have been generated by complete data; and if the data are Missing at Random (or "partially missing at random"), this convergence depends only on the sample size. Thus, inferences from partial data, such as posterior modes, uncertainty estimates, confidence intervals, likelihood ratios, and indeed all quantities or features derived from the partial-data log-likelihood function, are consistently estimated: they approximate their complete-data analogues. This adds to previous research, which had only proved the consistency of the posterior mode. Practical implications of this result are discussed, and the theory is verified using a previous study of International Human Rights Law.
Covariance matrix estimation is a fundamental statistical task in many applications, but the sample covariance matrix is sub-optimal when the sample size is comparable to or less than the number of features. Such high-dimensional settings are common in modern genomics, where covariance matrix estimation is frequently employed as a method for inferring gene networks. To achieve estimation accuracy in these settings, existing methods typically either assume that the population covariance matrix has some particular structure, for example sparsity, or apply shrinkage to better estimate the population eigenvalues. In this paper, we study a new approach to estimating high-dimensional covariance matrices. We first frame covariance matrix estimation as a compound decision problem. This motivates defining a class of decision rules and using a nonparametric empirical Bayes g-modeling approach to estimate the optimal rule in the class. Simulation results and gene network inference in an RNA-seq experiment in mouse show that our approach is comparable to or can outperform a number of state-of-the-art proposals, particularly when the sample eigenvectors are poor estimates of the population eigenvectors.
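To make the notion of eigenvalue shrinkage (mentioned above among existing approaches) concrete, here is a simple rotation-equivariant rule that keeps the sample eigenvectors and linearly shrinks the sample eigenvalues toward their mean. It is not the compound-decision / empirical Bayes g-modeling estimator proposed in the paper; the function name and the shrinkage weight are arbitrary illustrative choices.

```python
import numpy as np

def linear_eigenvalue_shrinkage(X, alpha=0.5):
    """Keep the sample eigenvectors, shrink sample eigenvalues toward
    their grand mean; a basic illustration of eigenvalue shrinkage."""
    S = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(S)
    shrunk = (1 - alpha) * evals + alpha * evals.mean()
    return evecs @ np.diag(shrunk) @ evecs.T
```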
Matrix-valued data have become increasingly prevalent in many applications. Most existing clustering methods for this type of data are tailored to the mean model and do not account for the dependence structure of the features, which can be highly informative, especially in high-dimensional settings. To extract the information in the dependence structure for clustering, we propose a new latent variable model for the features arranged in matrix form, with unknown membership matrices representing the clusters for the rows and columns. Under this model, we further propose a class of hierarchical clustering algorithms that use differences of weighted covariance matrices as the dissimilarity measure. Theoretically, we show that under mild conditions our algorithm attains clustering consistency in the high-dimensional setting. While this consistency result holds for our algorithm with a broad class of weighted covariance matrices, the conditions for this result depend on the choice of the weight. To investigate how the weight affects the theoretical performance of our algorithm, we establish the minimax lower bound for clustering under our latent variable model. Given these results, we identify the optimal weight, in the sense that using this weight guarantees our algorithm to be minimax rate-optimal in terms of the magnitude of a cluster separation metric. The practical implementation of our algorithm with the optimal weight is also discussed. Finally, we conduct simulation studies to evaluate the finite-sample performance of our algorithm and apply the method to a genomic dataset.
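A minimal sketch of hierarchical clustering with a covariance-based dissimilarity: cluster the columns of a data matrix using, as the distance between two features, the norm of the difference between their weighted covariance profiles. The uniform weighting and average linkage below are placeholders, not the optimal weight derived in the paper, and the function name is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def covariance_profile_clustering(X, n_clusters, weights=None):
    """X: (n_samples, p). Dissimilarity between features j and k is the
    Euclidean norm of the difference of their weighted covariance rows."""
    p = X.shape[1]
    S = np.cov(X, rowvar=False)
    W = np.ones(p) if weights is None else np.asarray(weights)
    profiles = S * W                               # weighted covariance profiles
    D = np.linalg.norm(profiles[:, None, :] - profiles[None, :, :], axis=2)
    Z = linkage(squareform(D, checks=False), method='average')
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```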
We propose a new wavelet-based method for density estimation when the data are size-biased. More specifically, we consider a power of the density of interest, where the power is greater than or equal to one half. Warped wavelet bases are employed, where the warping is induced by some continuous cumulative distribution function. This provides a general framework in which conventional orthonormal wavelet estimation is the special case corresponding to the standard uniform c.d.f. We show that both linear and nonlinear wavelet estimators are consistent, with optimal and/or near-optimal rates. Monte Carlo simulations are performed to compare four special set-ups that are easy to interpret in practice. A real dataset application illustrates the method. We observe that warped bases provide more flexible and better estimates for both simulated and real data. Moreover, estimating a power of the density (for instance, its square root) further improves the results.
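The warping ingredient alone can be sketched as follows: map the data through a chosen c.d.f. $G$ with density $g$, estimate the density $h$ of the warped sample on $[0,1]$ with a linear Haar estimator (equivalently, a dyadic histogram), and map back via $f(x) = h(G(x))\,g(x)$. This omits the size-biased correction and the density-power step of the paper; the function name, warping choice, and resolution level are assumptions.

```python
import numpy as np
from scipy import stats

def warped_haar_density(x, warp_cdf, warp_pdf, level=4):
    """Linear Haar (dyadic histogram) density estimate after warping by
    the c.d.f. G; returns a callable estimate of the original density."""
    u = warp_cdf(x)                                    # warped sample on [0,1]
    bins = 2 ** level
    counts, _ = np.histogram(u, bins=bins, range=(0.0, 1.0), density=True)

    def f_hat(t):
        ut = np.clip(warp_cdf(t), 0.0, 1.0 - 1e-12)
        idx = np.minimum((ut * bins).astype(int), bins - 1)
        return counts[idx] * warp_pdf(t)               # f(x) = h(G(x)) g(x)
    return f_hat

# toy usage with a standard normal warping (hypothetical choice)
data = stats.norm.rvs(size=500, random_state=0)
f_hat = warped_haar_density(data, stats.norm.cdf, stats.norm.pdf, level=4)
print(f_hat(np.array([-1.0, 0.0, 1.0])))
```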
The problem of Approximate Nearest Neighbor (ANN) search is fundamental in computer science and has benefited from significant progress over the past couple of decades. However, most work has been devoted to point sets, whereas complex shapes have not been sufficiently treated. Here, we focus on distance functions between discretized curves in Euclidean space: they appear in a wide range of applications, from road segments to time series in general dimension. For $\ell_p$-products of Euclidean metrics, for any $p$, we design simple and efficient data structures for ANN based on randomized projections, which are of independent interest. They serve to solve proximity problems under a notion of distance between discretized curves that generalizes both the discrete Fr\'echet and Dynamic Time Warping distances, the most popular and practical approaches to comparing such curves. We offer the first data structures and query algorithms for ANN with arbitrarily good approximation factor, at the expense of increased space usage and preprocessing time over existing methods. Query time complexity is comparable to or significantly improved upon by our algorithms, which are especially efficient when the length of the curves is bounded.
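The randomized-projection ingredient, in isolation, can be sketched as follows: project the (vectorized) curves with a Gaussian matrix and answer near-neighbour queries in the reduced space. This is a generic Johnson-Lindenstrauss-style sketch, not the paper's data structure for Fr\'echet or DTW distances; the class name and target dimension are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

class ProjectedANN:
    """Random Gaussian projection followed by an exact nearest-neighbour
    search in the reduced space (illustrative only)."""
    def __init__(self, data, target_dim=16, rng=None):
        rng = np.random.default_rng(rng)
        d = data.shape[1]
        self.R = rng.normal(size=(d, target_dim)) / np.sqrt(target_dim)
        self.tree = cKDTree(data @ self.R)   # index the projected points
        self.data = data                     # kept for optional exact re-ranking

    def query(self, x, k=1):
        _, idx = self.tree.query(x @ self.R, k=k)
        return idx
```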
This paper describes a suite of algorithms for constructing low-rank approximations of an input matrix from a random linear image of the matrix, called a sketch. These methods can preserve structural properties of the input matrix, such as positive-semidefiniteness, and they can produce approximations with a user-specified rank. The algorithms are simple, accurate, numerically stable, and provably correct. Moreover, each method is accompanied by an informative error bound that allows users to select parameters a priori to achieve a given approximation quality. These claims are supported by numerical experiments with real and synthetic data.
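One basic member of this family is the randomized range-finder followed by projection and truncation. The sketch below, with assumed names and an arbitrary oversampling parameter, illustrates the sketch-then-factor pattern but not the structure-preserving variants or the a priori error bounds described above.

```python
import numpy as np

def randomized_low_rank(A, rank, oversample=10, rng=None):
    """Sketch-based low-rank approximation: form the random sketch
    Y = A @ Omega, orthonormalize it, project A onto that basis, and
    truncate to the requested rank. Returns U, s, Vt with
    A ~ U @ np.diag(s) @ Vt."""
    rng = np.random.default_rng(rng)
    Omega = rng.normal(size=(A.shape[1], rank + oversample))  # test matrix
    Q, _ = np.linalg.qr(A @ Omega)        # orthonormal basis for the sketched range
    B = Q.T @ A                           # small projected matrix
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_b[:, :rank], s[:rank], Vt[:rank]

# toy usage on a numerically low-rank matrix
rng = np.random.default_rng(0)
A = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 300))
U, s, Vt = randomized_low_rank(A, rank=20)
print(np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A))
```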