Sparse principal component analysis (SPCA) is widely used for dimensionality reduction and feature extraction in high-dimensional data analysis. Despite many methodological and theoretical developments in the past two decades, the theoretical guarantees of the popular SPCA algorithm proposed by Zou, Hastie & Tibshirani (2006) are still unknown. This paper aims to address this critical gap. We first revisit the SPCA algorithm of Zou et al. (2006) and present our implementation. We also study a computationally more efficient variant of the SPCA algorithm in Zou et al. (2006) that can be considered a limiting case of SPCA. We provide guarantees of convergence to a stationary point for both algorithms and prove that, under a sparse spiked covariance model, both algorithms can recover the principal subspace consistently under mild regularity conditions. We show that their estimation error bounds match the best available bounds of existing works or the minimax rates up to logarithmic factors. Moreover, we demonstrate the competitive performance of both algorithms in numerical studies.
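For concreteness, here is a minimal numpy sketch of the thresholded, lambda-to-infinity variant alluded to above: the sparse loadings are obtained by soft-thresholding the sample covariance times the current orthonormal directions, and the directions are updated by a Procrustes (SVD) step. Parameter names, initialization, and the stopping rule are illustrative choices, not the authors' implementation.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def spca_limit(X, k, lam1, n_iter=200, tol=1e-8):
    """Sketch of the 'lambda -> infinity' SPCA variant: alternate a
    soft-thresholding update of the sparse loadings B with a Procrustes
    (SVD) update of the orthonormal matrix A."""
    n, d = X.shape
    S = X.T @ X / n                       # sample covariance (X assumed centered)
    A = np.linalg.svd(S)[0][:, :k]        # initialize with ordinary PCA loadings
    B = A.copy()
    for _ in range(n_iter):
        B_old = B.copy()
        B = soft_threshold(S @ A, lam1 / 2.0)        # B-step, column by column
        U, _, Vt = np.linalg.svd(S @ B, full_matrices=False)
        A = U @ Vt                                   # A-step: Procrustes solution
        if np.linalg.norm(B - B_old) < tol:
            break
    norms = np.linalg.norm(B, axis=0)
    B[:, norms > 0] /= norms[norms > 0]              # unit-norm nonzero loadings
    return B

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
X -= X.mean(axis=0)
loadings = spca_limit(X, k=3, lam1=0.5)
print(np.count_nonzero(loadings, axis=0))            # sparsity of each component
```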
Topological data analysis (TDA) approaches are becoming increasingly popular for studying the dependence patterns in multivariate time series data. In particular, various dependence patterns in brain networks may be linked to specific tasks and cognitive processes, and they can be altered by neurological impairments such as epileptic seizures. Existing TDA approaches build their graph filtrations on a notion of distance between data points that is symmetric by definition. For brain dependence networks, this is a major limitation that constrains practitioners to using only symmetric dependence measures, such as correlation or coherence. However, it is known that the brain dependence network may be very complex and can contain a directed flow of information from one brain region to another. Such networks are usually captured by more advanced dependence measures, such as partial directed coherence, a Granger-causality-based measure. These measures result in a non-symmetric distance function, especially during epileptic seizures. In this paper we address this limitation by decomposing the weighted connectivity network into its symmetric and anti-symmetric components via matrix decomposition and comparing the anti-symmetric component before and after the seizure. Our analysis of epileptic seizure EEG data shows promising results.
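The symmetric/anti-symmetric split mentioned above is the standard matrix decomposition W = (W + W^T)/2 + (W - W^T)/2; a minimal sketch follows, with a hypothetical pre/post-seizure comparison on random matrices standing in for estimated connectivity.

```python
import numpy as np

def decompose_connectivity(W):
    """Split a weighted (possibly directed) connectivity matrix into its
    symmetric and anti-symmetric parts, W = W_sym + W_asym."""
    W_sym = (W + W.T) / 2.0
    W_asym = (W - W.T) / 2.0
    return W_sym, W_asym

# Hypothetical example: compare the anti-symmetric component before and after a seizure
rng = np.random.default_rng(1)
W_pre, W_post = rng.random((8, 8)), rng.random((8, 8))
_, A_pre = decompose_connectivity(W_pre)
_, A_post = decompose_connectivity(W_post)
# Frobenius norm of the change in directed (anti-symmetric) information flow
print(np.linalg.norm(A_post - A_pre, "fro"))
```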
The impact of an extreme climate event depends strongly on its geographical scale. Max-stable processes can be used for the statistical investigation of climate extremes and their spatial dependence over a continuous area. Most existing parametric models of max-stable processes assume spatial stationarity and are therefore not suitable for application to data that cover a large and heterogeneous area. For this reason, it has recently been proposed to use a clustering algorithm to divide the area of investigation into smaller regions and to fit parametric max-stable processes to the data within those regions. We investigate this clustering algorithm further and point out that there are cases in which it produces regions on which spatial stationarity is not a reasonable assumption. We propose an alternative clustering algorithm and demonstrate in a simulation study that it can lead to improved results.
In this work we propose tailored model order reduction for varying boundary optimal control problems governed by parametric partial differential equations. By varying boundary control, we mean that a specific parameter determines where the boundary control acts on the system. This particular formulation can benefit from model order reduction: fast and reliable simulations of this model are of great use in many applied fields, such as geophysics and energy engineering. However, varying boundary control induces complicated and diverse parametric behaviour in the state and adjoint variables. The state solution, for example, may exhibit transport phenomena as the boundary control parameter changes. Moreover, the problem loses its affine structure. It is well known that classical model order reduction techniques fail in this setting, both in accuracy and in efficiency. Thus, we propose reduced approaches inspired by those used for wave-like phenomena. Specifically, we compare standard proper orthogonal decomposition with two tailored strategies: geometric recasting and local proper orthogonal decomposition. Geometric recasting solves the optimization system in a reference domain, simplifying the problem at hand and avoiding hyper-reduction, while local proper orthogonal decomposition builds local bases to increase the accuracy of the reduced solution in very general settings (where geometric recasting is infeasible). We compare the various approaches on two numerical experiments based on geometries of increasing complexity.
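As a point of reference for the baseline strategy, here is a minimal numpy sketch of a standard POD compression step: the reduced basis is taken from the leading left singular vectors of a snapshot matrix, truncated by a relative energy criterion. The snapshot data and tolerance are placeholders, not the paper's setup.

```python
import numpy as np

def pod_basis(snapshots, tol=1e-4):
    """Proper orthogonal decomposition: return the leading left singular
    vectors of the snapshot matrix, truncated by relative energy."""
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    r = int(np.searchsorted(energy, 1.0 - tol)) + 1
    return U[:, :r]

# Hypothetical snapshots: columns are state solutions for different boundary-control parameters
S = np.random.default_rng(2).standard_normal((1000, 40))
V = pod_basis(S)
print(V.shape)   # reduced basis dimension r is much smaller than 1000
```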
This paper presents a new distance metric to compare two continuous probability density functions. The main advantage of this metric is that, unlike other statistical measures, it admits an analytic, closed-form expression for mixtures of Gaussian distributions while satisfying all metric properties. These characteristics enable fast, stable, and efficient calculations, which are highly desirable in real-world signal processing applications. The application in mind is Gaussian Mixture Reduction (GMR), which is widely used in density estimation, recursive tracking, and belief propagation. To address this problem, we developed a novel algorithm dubbed the Optimization-based Greedy GMR (OGGMR), which employs our metric as a criterion to approximate a high-order Gaussian mixture with a lower-order one. Experimental results show that the OGGMR algorithm is significantly faster and more efficient than state-of-the-art GMR algorithms while retaining the geometric shape of the original mixture.
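The paper's specific metric is not reproduced here; as an illustration of what a closed form over Gaussian mixtures looks like, the sketch below computes the L2 distance between two mixtures, which also admits an analytic expression because the integral of a product of two Gaussian densities is itself a Gaussian density evaluation.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def gaussian_product_integral(m1, S1, m2, S2):
    """Closed form of the integral of N(x; m1, S1) * N(x; m2, S2) over x."""
    return mvn.pdf(m1, mean=m2, cov=S1 + S2)

def l2_distance_gmm(w1, mus1, covs1, w2, mus2, covs2):
    """Closed-form L2 distance between two Gaussian mixtures
    (an illustrative metric, not the one proposed in the paper)."""
    def cross(wa, ma, Sa, wb, mb, Sb):
        return sum(wa[i] * wb[j] * gaussian_product_integral(ma[i], Sa[i], mb[j], Sb[j])
                   for i in range(len(wa)) for j in range(len(wb)))
    d2 = (cross(w1, mus1, covs1, w1, mus1, covs1)
          + cross(w2, mus2, covs2, w2, mus2, covs2)
          - 2.0 * cross(w1, mus1, covs1, w2, mus2, covs2))
    return np.sqrt(max(d2, 0.0))

# Two toy 2-D mixtures
w1 = np.array([0.6, 0.4]); mus1 = [np.zeros(2), np.ones(2)]; covs1 = [np.eye(2)] * 2
w2 = np.array([1.0]);      mus2 = [0.5 * np.ones(2)];        covs2 = [np.eye(2)]
print(l2_distance_gmm(w1, mus1, covs1, w2, mus2, covs2))
```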
We introduce a transformation framework that can be used to develop online algorithms with low $\epsilon$-approximate regret in the random-order model from offline approximation algorithms. We first give a general reduction theorem that transforms an offline approximation algorithm with low average sensitivity into an online algorithm with low $\epsilon$-approximate regret. We then demonstrate that offline approximation algorithms can be transformed into low-sensitivity versions using a coreset construction method. To showcase the versatility of our approach, we apply it to various problems, including online $(k,z)$-clustering, online matrix approximation, and online regression, and achieve polylogarithmic $\epsilon$-approximate regret for each. Moreover, we show that in all three cases our algorithm also enjoys low inconsistency, which may be desirable in some online applications.
A new data-driven bilateral generalized two-dimensional quaternion principal component analysis (BiG2DQPCA) is presented to extract the features of matrix samples from both the row and column directions. This general framework works directly on 2D color images without vectorization and well preserves the spatial and color information, which makes it flexible enough to fit various real-world applications. A generalized ridge regression model of BiG2DQPCA is first proposed with orthogonality constraints on the target features. Applying the deflation technique and the minorization-maximization framework, a new quaternion optimization algorithm is proposed to compute the optimal features of BiG2DQPCA, with a closed-form solution obtained at each iteration. A new approach based on BiG2DQPCA is presented for color face recognition and image reconstruction with a new data-driven weighting technique. Extensive numerical experiments on real-world color face databases demonstrate the superiority of BiG2DQPCA over state-of-the-art methods in terms of recognition accuracy and image reconstruction rates.
Principal component analysis (PCA) is one of the most popular methods for dimension reduction. In light of the rapidly growing large-scale data in federated ecosystems, the traditional PCA method is often not applicable due to privacy considerations and the large computational burden. Algorithms have been proposed to lower the computational cost, but few can handle both high dimensionality and massive sample size in the distributed setting. In this paper, we propose the FAst DIstributed (FADI) PCA method for federated data when both the dimension $d$ and the sample size $n$ are ultra-large, by simultaneously performing parallel computing along $d$ and distributed computing along $n$. Specifically, we utilize $L$ parallel copies of $p$-dimensional fast sketches to divide the computing burden along $d$ and aggregate the results distributively along the split samples. We present FADI under a general framework applicable to multiple statistical problems and establish comprehensive theoretical results under this framework. We show that FADI enjoys the same non-asymptotic error rate as traditional PCA when $Lp \ge d$. We also derive inferential results that characterize the asymptotic distribution of FADI and show a phase-transition phenomenon as $Lp$ increases. We perform extensive simulations to show that FADI substantially outperforms existing methods in computational efficiency while preserving accuracy, and we validate the distributional phase-transition phenomenon through numerical experiments. We apply FADI to the 1000 Genomes data to study the population structure.
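To make the sketch-and-aggregate idea concrete, here is an illustrative single-machine mock-up: each of $L$ random $p$-dimensional sketches of the sample covariance is accumulated over the sample splits, the sketches are stacked, and an SVD of the stacked matrix gives an estimated principal subspace. This is only a schematic of the idea; the exact FADI aggregation and inference steps are those in the paper.

```python
import numpy as np

def sketch_and_aggregate_pca(X_splits, K, L, p, seed=0):
    """Schematic mock-up: accumulate L random p-dimensional covariance
    sketches over the sample splits, stack them, and take the leading
    left singular vectors. Not the exact FADI procedure."""
    rng = np.random.default_rng(seed)
    d = X_splits[0].shape[1]
    n = sum(Xm.shape[0] for Xm in X_splits)
    sketches = []
    for _ in range(L):
        Omega = rng.standard_normal((d, p))       # one parallel copy along d
        Y = np.zeros((d, p))
        for Xm in X_splits:                       # distributed along n
            Y += Xm.T @ (Xm @ Omega) / n          # accumulates (X^T X / n) @ Omega
        sketches.append(Y)
    Y_all = np.hstack(sketches)                   # d x (L * p)
    U, _, _ = np.linalg.svd(Y_all, full_matrices=False)
    return U[:, :K]

rng = np.random.default_rng(3)
X = rng.standard_normal((2000, 300))
splits = np.array_split(X, 4)                     # four "sites"
U_hat = sketch_and_aggregate_pca(splits, K=5, L=8, p=20)
print(U_hat.shape)
```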
We approximate the d complex zeros of a univariate polynomial p(x) of degree d, or those zeros that lie in a fixed region of interest in the complex plane such as a disc or a square. Our divide-and-conquer algorithm of STOC 1995 solves this problem in optimal Boolean time (up to a poly-logarithmic factor), that is, it runs nearly as fast as one can access the coefficients of p with the precision necessary to support the required output accuracy. That record complexity has not been matched by any other algorithm yet, but our 1995 root-finder is quite involved and has never been implemented. We present alternative nearly optimal root-finders based on our novel variants of the classical subdivision iterations. Unlike our 1995 predecessor, we require randomization of Las Vegas type, which allows us to detect any output error at a dominated computational cost, but our new root-finders are much simpler to implement than their 1995 predecessor. According to extensive tests with standard test polynomials of a preliminary version, which incorporates only part of our novel techniques, the new root-finders are competitive with, and for a large class of inputs significantly outperform, the root-finding package MPSolve, which for decades has been the users' package of choice. Unlike our 1995 predecessor and all known fast algorithms for the cited root-finding tasks, our new algorithms can also be applied to a polynomial given by a black-box oracle for its evaluation rather than by its coefficients. This makes our root-finders particularly efficient for polynomials p(x) that can be evaluated fast, such as the Mandelbrot polynomials or those given as a sum of a small number of shifted monomials. Our algorithm can be readily extended to fast approximation of the eigenvalues of a matrix or a matrix polynomial.
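A basic building block of subdivision root-finders that works with a black-box evaluation oracle is counting zeros inside a disc via the argument principle (the winding number of p along the boundary circle). The sketch below is a schematic of that one step only, not the paper's algorithm; the Mandelbrot-style iterated polynomial, the disc, and the sample count are illustrative and assume no zeros lie on the boundary circle.

```python
import numpy as np

def winding_root_count(p_eval, center, radius, m=2048):
    """Estimate the number of zeros of a black-box polynomial inside a disc
    from the winding number of p along the boundary circle (argument principle)."""
    theta = np.linspace(0.0, 2.0 * np.pi, m, endpoint=False)
    z = center + radius * np.exp(1j * theta)
    vals = p_eval(z)
    # total change of argument along the circle, divided by 2*pi
    dphase = np.diff(np.angle(vals), append=np.angle(vals[0]))
    dphase = (dphase + np.pi) % (2.0 * np.pi) - np.pi   # wrap each step to (-pi, pi]
    return int(np.rint(dphase.sum() / (2.0 * np.pi)))

def mandelbrot_like_poly(z, depth=6):
    """Black-box evaluation of an iterated, Mandelbrot-style polynomial."""
    p = np.ones_like(z)
    for _ in range(depth):
        p = z * p * p + 1.0
    return p

print(winding_root_count(mandelbrot_like_poly, center=-1.0, radius=0.5))
```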
In this paper, we propose a novel, computationally efficient reduced order method to solve linear parabolic inverse source problems. Our approach provides accurate numerical solutions without relying on specific training data. The forward solution is constructed using a Krylov sequence, while the source term is recovered via the conjugate gradient (CG) method. Under a weak regularity assumption on the solution of the parabolic partial differential equations (PDEs), we establish convergence of the forward solution and provide a rigorous error estimate for our method. Numerical results demonstrate that our approach offers substantial computational savings compared to the traditional finite element method (FEM) and retains equivalent accuracy.
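To illustrate the source recovery step, the sketch below applies conjugate gradients to regularized normal equations of a generic discrete forward map; the matrix F, noise level, and Tikhonov parameter are placeholders standing in for the Krylov-reduced parabolic forward solver described above.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

# Hypothetical discrete forward map F: source vector -> observation vector.
# Here F is a generic matrix standing in for the (reduced) parabolic solve.
rng = np.random.default_rng(4)
n_obs, n_src = 120, 80
F = rng.standard_normal((n_obs, n_src)) / np.sqrt(n_obs)
f_true = np.zeros(n_src)
f_true[10:20] = 1.0
y = F @ f_true + 1e-3 * rng.standard_normal(n_obs)

# Recover the source with CG on (regularized) normal equations, mirroring the
# CG recovery step of the abstract; alpha is an illustrative Tikhonov parameter.
alpha = 1e-4
normal_op = LinearOperator((n_src, n_src),
                           matvec=lambda v: F.T @ (F @ v) + alpha * v)
f_hat, info = cg(normal_op, F.T @ y, maxiter=500)
print(info, np.linalg.norm(f_hat - f_true) / np.linalg.norm(f_true))
```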
In this paper, we present a notion of differential privacy (DP) for data that come from different classes, where class membership is private information that needs to be protected. The proposed method is an output perturbation mechanism that adds noise to the released query response so that the analyst is unable to infer the underlying class label. The proposed DP method not only protects the privacy of class-based data but also maintains accuracy and is computationally efficient and practical. We illustrate the efficacy of the proposed method empirically, outperforming the baseline additive Gaussian noise mechanism. We also examine a real-world application, applying the proposed DP method to the autoregressive moving average (ARMA) forecasting method to protect the privacy of the underlying data source. Case studies on real-world advanced metering infrastructure (AMI) measurements of household power consumption validate the strong performance of the proposed DP method while preserving the accuracy of the forecasted power consumption.
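For reference, here is a minimal sketch of the classical Gaussian output-perturbation mechanism used as the comparison baseline above; the AMI-style readings, sensitivity bound, and privacy parameters are illustrative, and this is not the proposed class-based mechanism.

```python
import numpy as np

def gaussian_mechanism(query_value, l2_sensitivity, epsilon, delta, rng=None):
    """Classical (epsilon, delta)-DP Gaussian output perturbation: add noise
    with standard deviation calibrated to the query's L2 sensitivity."""
    rng = rng or np.random.default_rng()
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return query_value + rng.normal(0.0, sigma, size=np.shape(query_value))

# Hypothetical AMI-style query: average household consumption over a window (kWh)
readings = np.array([1.2, 0.9, 1.5, 2.1, 0.7])
true_mean = readings.mean()
# Assumes each reading is bounded by 3 kWh, so the mean has sensitivity 3/n
private_mean = gaussian_mechanism(true_mean, l2_sensitivity=3.0 / len(readings),
                                  epsilon=1.0, delta=1e-5)
print(true_mean, private_mean)
```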