Functional principal component analysis (FPCA) can become invalid when the data exhibit non-Gaussian features. We therefore aim to develop a general FPCA method that adapts to such non-Gaussian cases. We construct a Kendall's $\tau$ function, which shares the same eigenfunctions as the covariance function. The particular formulation of the Kendall's $\tau$ function makes it insensitive to the data distribution. We further apply it to the estimation of FPCA and study the corresponding asymptotic consistency. Moreover, the effectiveness of the proposed method is demonstrated through a comprehensive simulation study and an application to physical activity data collected by a wearable accelerometer monitor.
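To make the construction concrete, here is a minimal numpy sketch of one common Kendall's $\tau$-type operator for functional data observed on a shared grid: it averages normalized outer products of pairwise differences, so only the direction of each difference matters, not its magnitude. The function name and the specific pairwise-difference form are illustrative assumptions; the paper's exact operator may differ in detail.

```python
import numpy as np

def kendall_tau_fpca(X, n_components=3):
    """Sketch of FPCA based on a Kendall's tau-type operator.

    X : (n, p) array; rows are functions observed on a common grid.
    The pairwise-difference construction depends on directions only,
    which is what makes it insensitive to the magnitude distribution.
    """
    n, p = X.shape
    K = np.zeros((p, p))
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = X[i] - X[j]
            norm2 = d @ d
            if norm2 > 0:
                K += np.outer(d, d) / norm2
                count += 1
    K /= count
    # Eigenfunctions of K coincide (in theory) with those of the covariance.
    vals, vecs = np.linalg.eigh(K)
    order = np.argsort(vals)[::-1]
    return vals[order][:n_components], vecs[:, order][:, :n_components]
```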
We revisit the classical problem of nonparametric density estimation, but impose local differential privacy constraints. Under such constraints, the original multivariate data $X_1,\ldots,X_n \in \mathbb{R}^d$ cannot be directly observed, and all estimators are functions of the randomised output of a suitable privacy mechanism. The statistician is free to choose the form of the privacy mechanism, and in this work we propose to add Laplace-distributed noise to a discretisation of the location of each vector $X_i$. Based on these randomised data, we design a novel estimator of the density function, which can be viewed as a privatised version of the well-studied histogram density estimator. Our theoretical results include universal pointwise consistency and strong universal $L_1$-consistency. In addition, a convergence rate over classes of Lipschitz functions is derived, which is complemented by a matching minimax lower bound. We illustrate the trade-off between data utility and privacy by means of a small simulation study.
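The mechanism described above admits a short sketch: each data point is reduced to a one-hot bin indicator, and independent Laplace noise is added to every coordinate before release. The bin count `m`, the noise scale $2/\alpha$ (the standard calibration for an $\alpha$-locally-differentially-private histogram release), and the function names are illustrative choices, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def privatise(X, m, alpha):
    """Laplace mechanism on bin-membership indicators (sketch).

    X : (n, d) data in [0, 1]^d;  m : bins per axis;  alpha : privacy level.
    Each user releases only Z_i: a one-hot indicator of the bin containing
    X_i, plus i.i.d. Laplace(2 / alpha) noise on every coordinate.
    """
    n, d = X.shape
    idx = np.minimum((X * m).astype(int), m - 1)       # per-axis bin index
    flat = np.ravel_multi_index(idx.T, (m,) * d)       # linearised bin index
    Z = np.zeros((n, m ** d))
    Z[np.arange(n), flat] = 1.0
    return Z + rng.laplace(scale=2.0 / alpha, size=Z.shape)

def histogram_estimator(Z, m, d):
    """Privatised histogram: average noisy indicators, divide by bin volume."""
    return Z.mean(axis=0) * (m ** d)    # estimated density value on each bin
```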
In this paper, we introduce a computational framework for recovering a high-resolution approximation of an unknown function from its low-resolution indirect measurements as well as from high-resolution training observations, by merging the frameworks of generalized sampling and functional principal component analysis. In particular, we increase the signal resolution via a data-driven approach, which models the function of interest as a realization of a random field and leverages a training set of observations generated by the same underlying random process. We study the performance of the resulting estimation procedure and show that high-resolution recovery is indeed possible provided appropriate low-rank and angle conditions hold and the training set is sufficiently large relative to the desired resolution. Moreover, we show that the size of the training set can be reduced by leveraging sparse representations of the functional principal components. Finally, the effectiveness of the proposed reconstruction procedure is illustrated by various numerical examples.
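A minimal sketch of the two-step recovery this describes, under the assumption that the low-resolution measurements are linear (an operator `A`) and the training functions share a common grid: learn a principal component basis from the training set, then solve a least-squares problem in that reduced basis, in the spirit of generalized sampling.

```python
import numpy as np

def gs_fpca_recover(Y_train, A, y, rank):
    """Sketch: high-resolution recovery in a learned FPCA basis.

    Y_train : (K, p) high-resolution training observations on a common grid;
    A       : (m, p) linear low-resolution measurement operator;
    y       : (m,)  low-resolution measurements of the unknown function.
    """
    mean = Y_train.mean(axis=0)
    # Empirical FPCA: leading right singular vectors of the centred data.
    _, _, Vt = np.linalg.svd(Y_train - mean, full_matrices=False)
    V = Vt[:rank].T                                    # (p, rank) basis
    # Generalized-sampling step: least squares in the reduced basis.
    c, *_ = np.linalg.lstsq(A @ V, y - A @ mean, rcond=None)
    return mean + V @ c                                # high-resolution estimate
```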
We study the problem of estimating a rank-$1$ signal in the presence of rotationally invariant noise, a class of perturbations more general than Gaussian noise. Principal Component Analysis (PCA) provides a natural estimator, and sharp results on its performance have been obtained in the high-dimensional regime. Recently, an Approximate Message Passing (AMP) algorithm has been proposed as an alternative estimator with the potential to improve the accuracy of PCA. However, the existing analysis of AMP requires an initialization that is both correlated with the signal and independent of the noise, which is often unrealistic in practice. In this work, we combine the two methods, and propose to initialize AMP with PCA. Our main result is a rigorous asymptotic characterization of the performance of this estimator. Both the AMP algorithm and its analysis differ from those previously derived in the Gaussian setting: at every iteration, our AMP algorithm requires a specific term to account for PCA initialization, while in the Gaussian case, PCA initialization affects only the first iteration of AMP. The proof is based on a two-phase artificial AMP that first approximates the PCA estimator and then mimics the true AMP. Our numerical simulations show an excellent agreement between AMP results and theoretical predictions, and suggest an interesting open direction on achieving Bayes-optimal performance.
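For orientation, a toy sketch of a PCA-initialized AMP iteration is given below, written for the familiar Gaussian-noise setting with $f = \tanh$ as denoiser and the usual Onsager correction. Per the abstract, the rotationally invariant case requires an additional PCA-dependent correction term at every iteration, which this sketch deliberately omits; it is not the paper's algorithm.

```python
import numpy as np

def pca_init_amp(Y, iters=10):
    """Toy AMP for a rank-1 spiked symmetric matrix Y, initialised with PCA.

    Assumes Y is scaled so its noise part has O(1) spectral norm. Uses
    f = tanh and the Gaussian-case Onsager term; the rotationally invariant
    case needs an extra PCA-dependent term at each step (omitted here).
    """
    n = Y.shape[0]
    _, vecs = np.linalg.eigh(Y)
    u = np.sqrt(n) * vecs[:, -1]                       # PCA initialisation
    u_prev = np.zeros(n)
    for _ in range(iters):
        b = np.mean(1.0 - np.tanh(u) ** 2)             # Onsager coefficient <f'(u)>
        u, u_prev = Y @ np.tanh(u) - b * np.tanh(u_prev), u
    return np.tanh(u) / np.linalg.norm(np.tanh(u))     # unit-norm signal estimate
```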
We consider the phase retrieval problem, in which the observer wishes to recover an $n$-dimensional real or complex signal $\mathbf{X}^\star$ from the (possibly noisy) observation of $|\mathbf{\Phi} \mathbf{X}^\star|$, where $\mathbf{\Phi}$ is a matrix of size $m \times n$. We consider a \emph{high-dimensional} setting where $n,m \to \infty$ with $m/n = \mathcal{O}(1)$, and a large class of (possibly correlated) random matrices $\mathbf{\Phi}$ and observation channels. Spectral methods are a powerful tool to obtain approximate estimates of the signal $\mathbf{X}^\star$, which can then be used as an initialization for a subsequent algorithm, at a low computational cost. In this paper, we extend and unify previous results and approaches on spectral methods for the phase retrieval problem. More precisely, we combine the linearization of message-passing algorithms with the analysis of the \emph{Bethe Hessian}, a classical tool of statistical physics. Using this toolbox, we show how to derive optimal spectral methods for arbitrary channel noise and right-unitarily invariant matrices $\mathbf{\Phi}$, in an automated manner (i.e. with no optimization over any hyperparameter or preprocessing function).
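A generic spectral method of the kind being unified here can be sketched in a few lines: apply a preprocessing function $T$ to the observed magnitudes, form the matrix $\mathbf{\Phi}^* \mathrm{diag}(T(y)) \mathbf{\Phi}$, and take its leading eigenvector. The particular $T$ below is a well-known choice from the Gaussian noiseless literature, used only as a placeholder; the paper derives the optimal preprocessing automatically.

```python
import numpy as np

def spectral_init(Phi, y):
    """Sketch of a spectral initialiser for phase retrieval.

    Phi : (m, n) sensing matrix;  y : (m,) magnitudes |Phi x*|.
    T(y) = 1 - 1/y^2 is a known good preprocessing for Gaussian noiseless
    measurements, used here only as a placeholder.
    """
    T = 1.0 - 1.0 / np.maximum(y ** 2, 1e-8)
    M = Phi.conj().T @ (T[:, None] * Phi) / len(y)     # weighted covariance
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -1]        # leading eigenvector: x* up to a global phase
```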
The liver has a unique blood supply system and plays an important role in the human blood circulatory system. Hemodynamic problems related to the liver therefore play an important part in clinical diagnosis and treatment. Although estimating the parameters of these hemodynamic models is essential to the study of liver models, the limitations of medical measurement methods and the ethical constraints on clinical studies make it impossible to directly measure the parameters of blood vessels in the liver. Furthermore, since the liver is an integral part of the systemic circulation, it should be studied in conjunction with the other blood vessels. In this article, we present an innovative method to estimate the parameters of an individual liver within the human blood circulation from non-invasive clinical measurements. The method combines a 1-D blood flow model of human arteries and veins, a 0-D model reflecting the peripheral resistance of capillaries, and a lumped-parameter circuit model for the human liver. We numerically solve these fluid mechanics models with the finite element method, based on non-invasive blood-related measurements from 33 individuals. The estimated characteristics of human blood vessels and liver model parameters are validated against Stroke Volume Variation, which demonstrates the effectiveness of our estimation method.
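As a toy illustration of the 0-D (lumped-parameter) component, the sketch below integrates a single two-element Windkessel compartment, $C\,dP/dt = Q_{\mathrm{in}}(t) - P/R$, with scipy; the resistance, compliance, and inflow values are hypothetical placeholders. The paper's full model couples such compartments to a 1-D arterial/venous network and estimates the parameters from the non-invasive measurements.

```python
import numpy as np
from scipy.integrate import solve_ivp

def windkessel(t, P, R, C, Q_in):
    """0-D lumped compartment: C dP/dt = Q_in(t) - P / R."""
    return [(Q_in(t) - P[0] / R) / C]

# Hypothetical placeholder values for one liver compartment.
R, C = 1.2, 1.5                                        # resistance, compliance
Q_in = lambda t: 5.0 + 2.0 * np.sin(2.0 * np.pi * t)   # pulsatile inflow
sol = solve_ivp(windkessel, (0.0, 10.0), [10.0],
                args=(R, C, Q_in), max_step=0.01)
# Parameter estimation would adjust (R, C) until sol matches measurements.
```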
Labeling patients in electronic health records with respect to their status of having a disease or condition, i.e., case or control status, has increasingly relied on prediction models using high-dimensional variables derived from structured and unstructured electronic health record data. A major hurdle at present is the lack of valid statistical inference methods for the case probability. In this paper, considering high-dimensional sparse logistic regression models for prediction, we propose a novel bias-corrected estimator of the case probability through the development of linearization and variance-enhancement techniques. We establish asymptotic normality of the proposed estimator for any loading vector in high dimensions. We construct a confidence interval for the case probability and propose a hypothesis testing procedure for patient case-control labeling. We demonstrate the proposed method via extensive simulation studies and an application to real-world electronic health record data.
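A heavily simplified sketch of the bias-correction idea follows: fit an $\ell_1$-penalised logistic model, then apply a one-step correction to the plug-in linear predictor using a projection direction. The paper obtains that direction from a dedicated quadratic program with variance enhancement; the ridge-regularised inverse Gram matrix below is only a stand-in for it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def debiased_case_prob(X, y, x_new, ridge=0.1):
    """Sketch of a bias-corrected case-probability estimate (simplified).

    Fits an l1-penalised logistic model, then corrects the plug-in linear
    predictor x_new' beta_hat with a one-step estimating-equation update.
    """
    n, p = X.shape
    fit = LogisticRegression(penalty="l1", solver="liblinear",
                             fit_intercept=False).fit(X, y)
    beta = fit.coef_.ravel()
    resid = y - 1.0 / (1.0 + np.exp(-X @ beta))          # working residuals
    u = np.linalg.solve(X.T @ X / n + ridge * np.eye(p), x_new)  # direction
    theta = x_new @ beta + u @ (X.T @ resid) / n         # debiased predictor
    return 1.0 / (1.0 + np.exp(-theta))                  # corrected probability
```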
We consider a sparse high-dimensional varying coefficients model with random effects, a flexible linear model that allows the covariates and coefficients to depend functionally on time. For each individual, we observe discretely sampled responses and covariates as functions of time, as well as time-invariant covariates. Under sampling times that are either fixed and common to all individuals, or random and independent across individuals, we propose a projection procedure for the empirical estimation of all varying coefficients. We extend this estimator to construct confidence bands for a fixed number of varying coefficients.
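A minimal version of such a projection estimator can be sketched as follows, assuming pooled observations on $[0,1]$ and a Fourier basis for the varying coefficients (the basis choice and function names are illustrative):

```python
import numpy as np

def fourier_basis(t, n_basis=5):
    """Fourier basis on [0, 1]; one column per basis function."""
    cols = [np.ones_like(t)]
    for k in range(1, (n_basis - 1) // 2 + 1):
        cols += [np.sin(2 * np.pi * k * t), np.cos(2 * np.pi * k * t)]
    return np.column_stack(cols)

def varying_coef_fit(T, X, Y, n_basis=5):
    """T: (N,) pooled observation times; X: (N, q) covariates; Y: (N,) responses.

    Each beta_j(t) = fourier_basis(t) @ C[j]; all rows of C solve one
    least-squares problem on the interacted design matrix.
    """
    B = fourier_basis(T, n_basis)                              # (N, L)
    Z = (X[:, :, None] * B[:, None, :]).reshape(len(T), -1)    # (N, q*L)
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return coef.reshape(X.shape[1], -1)    # row j: coefficients of beta_j

# Usage: C = varying_coef_fit(T, X, Y); beta_1 evaluated at times t is
# fourier_basis(t) @ C[1].
```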
Principal Subspace Analysis (PSA) -- and its sibling, Principal Component Analysis (PCA) -- is one of the most popular approaches for dimensionality reduction in signal processing and machine learning. But centralized PSA/PCA solutions are fast becoming irrelevant in the modern era of big data, in which the number of samples and/or the dimensionality of samples often exceed the storage and/or computational capabilities of individual machines. This has led to the study of distributed PSA/PCA solutions, in which the data are partitioned across multiple machines and an estimate of the principal subspace is obtained through collaboration among the machines. It is in this vein that this paper revisits the problem of distributed PSA/PCA under the general framework of an arbitrarily connected network of machines that lacks a central server. The main contributions of the paper in this regard are threefold. First, two algorithms are proposed in the paper that can be used for distributed PSA/PCA, with one in the case of data partitioned across samples and the other in the case of data partitioned across (raw) features. Second, in the case of sample-wise partitioned data, the proposed algorithm and a variant of it are analyzed, and their convergence to the true subspace at linear rates is established. Third, extensive experiments on both synthetic and real-world data are carried out to validate the usefulness of the proposed algorithms. In particular, in the case of sample-wise partitioned data, an MPI-based distributed implementation is carried out to study the interplay between network topology and communications cost as well as to study the effects of straggler machines on the proposed algorithms.
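As a schematic of the server-free setting (not the paper's exact algorithms), the sketch below runs orthogonal (power) iteration for sample-wise partitioned data, where each machine multiplies by its local covariance and then mixes its iterate with its neighbours through a doubly stochastic matrix `W`:

```python
import numpy as np

def distributed_psa(local_data, W, r=2, iters=100, seed=0):
    """Sketch: consensus-based orthogonal iteration for sample-wise
    partitioned PSA (illustrative, not the paper's exact algorithms).

    local_data : list of (n_k, d) arrays, one per machine;
    W          : (K, K) doubly stochastic mixing matrix of the network.
    """
    rng = np.random.default_rng(seed)
    K, d = len(local_data), local_data[0].shape[1]
    covs = [A.T @ A / A.shape[0] for A in local_data]   # local covariances
    Q0 = np.linalg.qr(rng.standard_normal((d, r)))[0]   # shared random start
    Q = [Q0.copy() for _ in range(K)]
    for _ in range(iters):
        V = [C @ Qk for C, Qk in zip(covs, Q)]          # local power step
        V = [sum(W[k, l] * V[l] for l in range(K))      # gossip averaging
             for k in range(K)]
        Q = [np.linalg.qr(Vk)[0] for Vk in V]           # re-orthonormalise
    return Q[0]    # each machine's estimate of the principal subspace
```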
We consider the problem of making inference about the population mean of an outcome variable subject to nonignorable missingness. By leveraging a so-called shadow variable for the outcome, we propose a novel condition that ensures nonparametric identification of the outcome mean, even though the full data distribution is not identified. The identifying condition requires the existence of a function as a solution to a representer equation that connects the shadow variable to the outcome mean. Under this condition, we use sieves to nonparametrically solve the representer equation and propose an estimator that avoids modeling the propensity score or the outcome regression. We establish the asymptotic properties of the proposed estimator. We also show that the estimator is locally efficient and attains the semiparametric efficiency bound for the shadow variable model under certain regularity conditions. We illustrate the proposed approach via simulations and a real data application on home pricing.
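Schematically, the sieve step reduces to a finite linear system: expand the unknown function in basis functions, turn the representer equation into empirical moment equations, and solve by least squares. The sketch below is generic; the right-hand side `rho` must come from the problem's actual representer equation, and the basis and instrument functions are illustrative.

```python
import numpy as np

def sieve_solve(phi, g, Z, V, rho):
    """Generic sieve solver for a linear representer equation (sketch).

    The unknown function is expanded as b(z) = sum_k c_k phi_k(z); a moment
    restriction indexed by instrument functions g_l is turned into the
    linear system G c = rho via empirical averages. phi, g are lists of
    callables evaluated on the data arrays Z and V.
    """
    G = np.array([[np.mean(pk(Z) * gl(V)) for pk in phi] for gl in g])  # (L, K)
    c, *_ = np.linalg.lstsq(G, rho, rcond=None)
    return lambda z: sum(ck * pk(z) for ck, pk in zip(c, phi))
```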
From only positive (P) and unlabeled (U) data, a binary classifier can be trained with PU learning, in which the state of the art is unbiased PU learning. However, if the model is very flexible, the empirical risk on the training data will go negative, and we will suffer from serious overfitting. In this paper, we propose a non-negative risk estimator for PU learning: when it is minimized, it is more robust against overfitting, and thus we are able to use very flexible models (such as deep neural networks) given limited P data. Moreover, we analyze the bias, consistency, and mean-squared-error reduction of the proposed risk estimator, and bound the estimation error of the resulting empirical risk minimizer. Experiments demonstrate that our risk estimator fixes the overfitting problem of its unbiased counterparts.
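The estimator itself is a one-line formula, sketched here in numpy: the unbiased PU risk estimates the negative-class risk as $\hat{R}_u^- - \pi \hat{R}_p^-$, and the non-negative version simply clamps that difference at zero.

```python
import numpy as np

def nn_pu_risk(g_p, g_u, pi, loss=lambda z: np.maximum(0.0, 1.0 - z)):
    """Non-negative PU risk (sketch; hinge-type surrogate loss assumed).

    g_p : classifier scores g(x) on positive data;
    g_u : scores on unlabeled data;  pi : class prior P(Y = +1).
    The max(0, .) clamp keeps the empirical risk from going negative
    when the model is flexible enough to overfit.
    """
    risk_p_pos = loss(g_p).mean()      # positives predicted as positive
    risk_p_neg = loss(-g_p).mean()     # positives predicted as negative
    risk_u_neg = loss(-g_u).mean()     # unlabeled predicted as negative
    return pi * risk_p_pos + max(0.0, risk_u_neg - pi * risk_p_neg)
```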