We study the problem of estimating non-linear functionals of discrete distributions in the context of local differential privacy. The initial data $x_1,\ldots,x_n \in [K]$ are assumed to be i.i.d. and distributed according to an unknown discrete distribution $p = (p_1,\ldots,p_K)$. Only $\alpha$-locally differentially private (LDP) samples $z_1,\ldots,z_n$ are publicly available, where the term 'local' means that each $z_i$ is produced using only the individual attribute $x_i$. We exhibit privacy mechanisms (PM) that are interactive (i.e. allowed to use previously published private data) or non-interactive. We describe the behavior of the quadratic risk for estimating the power sum functional $F_{\gamma} = \sum_{k=1}^K p_k^{\gamma}$, $\gamma > 0$, as a function of $K$, $n$ and $\alpha$. In the non-interactive case, we study two plug-in type estimators of $F_{\gamma}$, for all $\gamma > 0$, that are similar to the MLE analyzed by Jiao et al. (2017) in the multinomial model. However, due to the privacy constraint, the rates we attain are slower and similar to those obtained in the Gaussian model by Collier et al. (2020). In the interactive case, we introduce for all $\gamma > 1$ a two-step procedure which attains the faster parametric rate $(n \alpha^2)^{-1/2}$ when $\gamma \geq 2$. We give lower bounds over all $\alpha$-LDP mechanisms and all estimators using the private samples.
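As a rough illustration of the non-interactive setting, the sketch below (Python/NumPy) releases one-hot encodings of the raw samples perturbed by Laplace noise, a standard non-interactive $\alpha$-LDP channel for histograms rather than the paper's specific mechanism, and plugs the resulting frequency estimates into $F_{\gamma}$; the clipping step is an extra simplification.

import numpy as np

def ldp_power_sum(x, K, gamma, alpha, rng):
    """Toy non-interactive plug-in estimate of F_gamma = sum_k p_k^gamma.

    Each sample releases its one-hot vector plus Laplace(2/alpha) noise per
    coordinate, which is alpha-LDP because the L1 sensitivity of a one-hot
    vector is 2.  Clipping the averaged frequencies to [0, 1] is a crude
    simplification, not part of the paper's estimators.
    """
    n = len(x)
    one_hot = np.zeros((n, K))
    one_hot[np.arange(n), x] = 1.0
    z = one_hot + rng.laplace(scale=2.0 / alpha, size=(n, K))   # private views
    p_hat = np.clip(z.mean(axis=0), 0.0, 1.0)                   # noisy frequencies
    return np.sum(p_hat ** gamma)

# Example: estimate F_2 (the collision probability) under alpha = 1.
rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])
x = rng.choice(3, size=5000, p=p)
print(ldp_power_sum(x, K=3, gamma=2.0, alpha=1.0, rng=rng), np.sum(p ** 2))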
Portfolio managers faced with limited sample sizes must use factor models to estimate the covariance matrix of a high-dimensional returns vector. For the simplest one-factor market model, success rests on the quality of the estimated leading eigenvector "beta". When only the returns themselves are observed, the practitioner has available the "PCA" estimate equal to the leading eigenvector of the sample covariance matrix. This estimator performs poorly in various ways. To address this problem in the high-dimension, limited-sample-size asymptotic regime and in the context of estimating the minimum variance portfolio, Goldberg, Papanicolaou, and Shkolnik developed a shrinkage method (the "GPS estimator") that improves the PCA estimator of beta by shrinking it toward a constant target unit vector. In this paper we continue their work to develop a more general framework of shrinkage targets that allows the practitioner to make use of further information to improve the estimator. Examples include sector separation of stock betas, and recent information from prior estimates. We prove some precise statements and illustrate the resulting improvements over the GPS estimator with some numerical experiments.
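A minimal sketch of the shrinkage idea follows, assuming only a generic convex combination of the PCA eigenvector and a user-chosen unit target; this is not the GPS formula itself, whose shrinkage weight is derived from the data.

import numpy as np

def shrunk_beta(returns, target, weight):
    """Shrink the PCA estimate of beta toward a unit-norm target vector.

    `returns` is a T x p matrix of returns, `target` a p-vector (e.g. the
    constant vector, or a sector-informed guess), and `weight` in [0, 1]
    controls how far we move toward the target.  This is a plain convex
    combination, not the data-driven GPS shrinkage.
    """
    cov = np.cov(returns, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    h_pca = eigvecs[:, -1]                      # leading eigenvector
    if h_pca.sum() < 0:                         # fix the sign convention
        h_pca = -h_pca
    q = target / np.linalg.norm(target)
    h = (1 - weight) * h_pca + weight * q
    return h / np.linalg.norm(h)

# Example: shrink toward the constant target on a simulated one-factor market.
rng = np.random.default_rng(0)
p, T = 50, 30
beta = 1.0 + 0.3 * rng.standard_normal(p)
factor = rng.standard_normal(T)
X = np.outer(factor, beta) + 0.5 * rng.standard_normal((T, p))
print(shrunk_beta(X, target=np.ones(p), weight=0.3)[:5])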
An irreducible stochastic matrix with rational entries has a stationary distribution given by a vector of rational numbers. We give an upper bound on the lowest common denominator of the entries of this vector. Bounds of this kind are used to study the complexity of algorithms for solving stochastic mean payoff games. They are usually derived using the Hadamard inequality, but this leads to suboptimal results. We replace the Hadamard inequality with the Markov chain tree formula in order to obtain optimal bounds. We also adapt our approach to obtain bounds on the absorption probabilities of finite Markov chains and on the gains and bias vectors of Markov chains with rewards.
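For concreteness, the following sketch (assuming SymPy is available) computes the stationary distribution of a small rational stochastic matrix exactly and reports the common denominator of its entries, i.e. the quantity the bounds control; it does not implement the Markov chain tree formula itself.

from math import lcm
from sympy import Matrix, Rational, eye

# An irreducible stochastic matrix with rational entries.
P = Matrix([[Rational(1, 2), Rational(1, 2), Rational(0)],
            [Rational(1, 3), Rational(1, 3), Rational(1, 3)],
            [Rational(0),    Rational(1, 4), Rational(3, 4)]])

# Stationary distribution: left null space of (P - I), normalised to sum to 1.
v = (P.T - eye(3)).nullspace()[0]
pi = v / sum(v)
print(list(pi))                                  # exact rational entries
print(lcm(*[int(entry.q) for entry in pi]))      # their common denominator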
Motivated by establishing theoretical foundations for various manifold learning algorithms, we study the problem of estimating the Mahalanobis distance (MD), and the associated precision matrix, from high-dimensional noisy data. By relying on recent transformative results in covariance matrix estimation, we demonstrate the sensitivity of the MD and the associated precision matrix to measurement noise, determining the exact asymptotic signal-to-noise ratio at which the MD fails, and quantifying its performance otherwise. In addition, for an appropriate loss function, we propose an asymptotically optimal shrinker, which is shown to be beneficial over the classical implementation of the MD, both analytically and in simulations. The result is extended to the manifold setup, where the nonlinear interaction between curvature and high-dimensional noise is taken into account. The developed solution is applied to study a multiscale reduction problem in dynamical system analysis.
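The sketch below illustrates the plug-in structure of such an estimator: Mahalanobis distances are computed after an arbitrary user-supplied shrinkage of the sample-covariance eigenvalues. The clipping rule in the example is a placeholder assumption, not the asymptotically optimal shrinker of the paper.

import numpy as np

def mahalanobis_with_shrinkage(X, shrink):
    """Squared Mahalanobis distances of the rows of X to the sample mean.

    `shrink(eigenvalues)` maps sample-covariance eigenvalues to shrunken
    ones before inversion; the identity map recovers the classical MD.
    This is a generic plug-in, not the paper's optimal shrinker.
    """
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    precision = vecs @ np.diag(1.0 / shrink(vals)) @ vecs.T
    return np.einsum("ij,jk,ik->i", Xc, precision, Xc)

# Example: clip small noise eigenvalues before inverting (an illustrative rule).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50)) @ np.diag(np.linspace(0.2, 3.0, 50))
d_classic = mahalanobis_with_shrinkage(X, shrink=lambda v: v)
d_shrunk = mahalanobis_with_shrinkage(X, shrink=lambda v: np.maximum(v, 1.0))
print(d_classic[:3], d_shrunk[:3])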
This paper addresses the task of estimating a covariance matrix under a patternless sparsity assumption. In contrast to existing approaches based on thresholding or shrinkage penalties, we propose a likelihood-based method that regularizes the distance from the covariance estimate to a symmetric sparsity set. This formulation avoids unwanted shrinkage induced by more common norm penalties and enables optimization of the resulting non-convex objective by solving a sequence of smooth, unconstrained subproblems. These subproblems are generated and solved via the proximal distance version of the majorization-minimization principle. The resulting algorithm executes rapidly, gracefully handles settings where the number of parameters exceeds the number of cases, yields a positive definite solution, and enjoys desirable convergence properties. Empirically, we demonstrate that our approach outperforms competing methods by several metrics across a suite of simulated experiments. Its merits are illustrated on an international migration dataset and a classic case study on flow cytometry. Our findings suggest that the marginal and conditional dependency networks for the cell signalling data are more similar than previously concluded.
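A minimal sketch of two ingredients follows: the projection onto a symmetric sparsity set (keep the k largest off-diagonal pairs in magnitude) and a single proximal-distance-style averaging step. The full MM algorithm, its handling of the likelihood term and the annealing of the penalty weight are not reproduced here.

import numpy as np

def project_to_sparsity_set(S, k):
    """Project a symmetric matrix onto the set of symmetric matrices with at
    most k nonzero off-diagonal pairs by keeping the k largest in absolute
    value and zeroing the rest; diagonal entries are left untouched."""
    P = np.array(S, dtype=float)
    iu = np.triu_indices_from(P, k=1)
    off = np.abs(P[iu])
    cutoff = np.sort(off)[-k] if 0 < k < off.size else (np.inf if k == 0 else -np.inf)
    kill = off < cutoff
    P[iu[0][kill], iu[1][kill]] = 0.0
    P[iu[1][kill], iu[0][kill]] = 0.0
    return P

# One proximal-distance-flavoured step: pull an estimate toward its projection.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
S = (A + A.T) / 2                               # stand-in for a raw estimate
rho = 10.0                                      # distance penalty weight
Sigma = (S + rho * project_to_sparsity_set(S, k=3)) / (1 + rho)
print(np.round(Sigma, 2))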
The rapid finding of effective therapeutics requires the efficient use of available resources in clinical trials. The use of covariate adjustment can yield statistical estimates with improved precision, resulting in a reduction in the number of participants required to draw futility or efficacy conclusions. We focus on time-to-event and ordinal outcomes. A key question for covariate adjustment in randomized studies is how to fit a model relating the outcome and the baseline covariates so as to maximize precision. We present a novel theoretical result establishing conditions for asymptotic normality of a variety of covariate-adjusted estimators that rely on machine learning (e.g., l1-regularization, Random Forests, XGBoost, and Multivariate Adaptive Regression Splines), under the assumption that outcome data are missing completely at random. We further present a consistent estimator of the asymptotic variance. Importantly, the conditions do not require the machine learning methods to converge to the true outcome distribution conditional on baseline variables, as long as they converge to some (possibly incorrect) limit. We conducted a simulation study to evaluate the performance of the aforementioned prediction methods in COVID-19 trials, using longitudinal data from over 1,500 patients hospitalized with COVID-19 at Weill Cornell Medicine New York Presbyterian Hospital. We found that using l1-regularization led to estimators and corresponding hypothesis tests that control type 1 error and are more precise than an unadjusted estimator across all sample sizes tested. We also show that when covariates are not prognostic of the outcome, l1-regularization remains as precise as the unadjusted estimator, even at small sample sizes (n = 100). We provide an R package, adjrct, that performs model-robust covariate adjustment for ordinal and time-to-event outcomes.
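As a simplified illustration of covariate adjustment with an l1-regularized working model, the sketch below computes a standardization-type risk difference for a binary endpoint using scikit-learn; the paper's estimators for ordinal and time-to-event outcomes, and the adjrct implementation, are more involved, and all names here are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def adjusted_risk_difference(y, treat, X, C=1.0):
    """Covariate-adjusted risk difference via standardization.

    Fit an l1-regularized outcome model with treatment and baseline
    covariates, predict every participant's outcome under both arms,
    and average; a binary-outcome sketch of the general idea only.
    """
    design = np.column_stack([treat, X])
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(design, y)
    X1 = np.column_stack([np.ones_like(treat), X])
    X0 = np.column_stack([np.zeros_like(treat), X])
    return model.predict_proba(X1)[:, 1].mean() - model.predict_proba(X0)[:, 1].mean()

# Simulated trial: prognostic covariates sharpen the treatment contrast.
rng = np.random.default_rng(0)
n = 500
X = rng.standard_normal((n, 10))
treat = rng.integers(0, 2, n)
logit = -0.5 + 0.6 * treat + X[:, 0] - 0.8 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
print(adjusted_risk_difference(y, treat, X))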
In this paper, we develop a general framework to design differentially private expectation-maximization (EM) algorithms in high-dimensional latent variable models, based on noisy iterative hard thresholding. We derive the statistical guarantees of the proposed framework and apply it to three specific models: Gaussian mixture, mixture of regression, and regression with missing covariates. In each model, we establish the near-optimal rate of convergence under differential privacy constraints, and show that the proposed algorithm is minimax rate optimal up to logarithmic factors. The technical tools developed for the high-dimensional setting are then extended to classic low-dimensional latent variable models, and we propose a nearly rate-optimal EM algorithm with differential privacy guarantees in this setting. Simulation studies and real data analysis are conducted to support our results.
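A generic noisy iterative-hard-thresholding update of the kind referred to above is sketched below; the gradient, step size, sparsity level and noise scale are placeholders, and calibrating the noise to a specific differential privacy budget is model-dependent and not shown.

import numpy as np

def noisy_iht_step(beta, grad, step, s, noise_scale, rng):
    """One noisy iterative-hard-thresholding update.

    Take a gradient step, add Gaussian noise for privacy, then keep only
    the s largest coordinates in magnitude.  The noise scale that yields a
    given privacy guarantee depends on the model and is treated as an input.
    """
    beta_new = beta - step * grad + rng.normal(scale=noise_scale, size=beta.shape)
    keep = np.argsort(np.abs(beta_new))[-s:]
    sparse = np.zeros_like(beta_new)
    sparse[keep] = beta_new[keep]
    return sparse

# Example: sparse mean estimation as a stand-in for the M-step target.
rng = np.random.default_rng(0)
d, s = 200, 5
truth = np.zeros(d)
truth[:s] = 3.0
X = truth + rng.standard_normal((100, d))
beta = np.zeros(d)
for _ in range(20):
    grad = beta - X.mean(axis=0)                # gradient of 0.5*||beta - xbar||^2
    beta = noisy_iht_step(beta, grad, step=1.0, s=s, noise_scale=0.05, rng=rng)
print(np.nonzero(beta)[0], np.round(beta[beta != 0], 2))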
The covariance matrix plays a fundamental role in many modern exploratory and inferential statistical procedures, including dimensionality reduction, hypothesis testing, and regression. In low-dimensional regimes, where the number of observations far exceeds the number of variables, the optimality of the sample covariance matrix as an estimator of this parameter is well-established. High-dimensional regimes do not admit such a convenience, however. As such, a variety of estimators have been derived to overcome the shortcomings of the sample covariance matrix in these settings. Yet, the question of selecting an optimal estimator from among the plethora available remains largely unaddressed. Using the framework of cross-validated loss-based estimation, we develop the theoretical underpinnings of just such an estimator selection procedure. In particular, we propose a general class of loss functions for covariance matrix estimation and establish finite-sample risk bounds and conditions for the asymptotic optimality of the cross-validated estimator selector with respect to these loss functions. We evaluate our proposed approach via a comprehensive set of simulation experiments and demonstrate its practical benefits by application in the exploratory analysis of two single-cell transcriptome sequencing datasets. A free and open-source software implementation of the proposed methodology, the cvCovEst R package, is briefly introduced.
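A bare-bones analogue of cross-validated estimator selection with a Frobenius-type loss is sketched below; the loss class, candidate library, and theory in the paper (and the cvCovEst package) go well beyond this, and the candidate estimators here are illustrative.

import numpy as np

def cv_select(X, estimators, n_folds=5, seed=0):
    """Pick the covariance estimator with the smallest cross-validated
    Frobenius-type risk against the held-out sample covariance.
    A bare-bones analogue of loss-based estimator selection, not cvCovEst."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    risks = {}
    for name, est in estimators.items():
        losses = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(len(X)), held_out)
            S_val = np.cov(X[held_out], rowvar=False)
            losses.append(np.linalg.norm(est(X[train]) - S_val, "fro") ** 2)
        risks[name] = np.mean(losses)
    return min(risks, key=risks.get), risks

estimators = {
    "sample": lambda X: np.cov(X, rowvar=False),
    "diagonal": lambda X: np.diag(np.var(X, axis=0, ddof=1)),
    "ridge_0.5": lambda X: 0.5 * np.cov(X, rowvar=False)
                 + 0.5 * np.trace(np.cov(X, rowvar=False)) / X.shape[1] * np.eye(X.shape[1]),
}
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 40)) @ np.diag(np.linspace(0.5, 2.0, 40))
print(cv_select(X, estimators))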
In a growing number of applications, a quantity of interest may depend on several covariates, with at least one of them infinite-dimensional (e.g. a curve). To select the relevant covariates in this context, we propose an adaptation of the Lasso method. Two estimation methods are defined. The first one consists in the minimisation of a criterion inspired by classical Lasso inference under group sparsity (Yuan and Lin, 2006; Lounici et al., 2011) on the whole multivariate functional space H. The second one minimises the same criterion but on a finite-dimensional subspace of H whose dimension is chosen by a penalized least-squares method based on the work of Barron et al. (1999). Sparsity-oracle inequalities are proven in the case of fixed or random design in our infinite-dimensional context. To compute the solutions of both criteria, we propose a coordinate-wise descent algorithm, inspired by the glmnet algorithm (Friedman et al., 2007). A numerical study on simulated and experimental datasets illustrates the behavior of the estimators.
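The coordinate-wise (block) descent update underlying such algorithms is sketched below for basis-expanded covariates with orthonormalised blocks; it is a generic group-Lasso update in the spirit of glmnet, not the functional-data implementation of the paper.

import numpy as np

def group_lasso_bcd(X_blocks, y, lam, n_iter=100):
    """Block coordinate descent for
        min_b  (1/2n) * ||y - sum_j X_j b_j||^2 + lam * sum_j ||b_j||_2,
    assuming each block satisfies X_j^T X_j / n = I, so that every block
    update is a closed-form group soft-thresholding step."""
    n = len(y)
    betas = [np.zeros(Xj.shape[1]) for Xj in X_blocks]
    for _ in range(n_iter):
        for j, Xj in enumerate(X_blocks):
            partial = y - sum(Xk @ bk for k, (Xk, bk)
                              in enumerate(zip(X_blocks, betas)) if k != j)
            z = Xj.T @ partial / n
            norm = np.linalg.norm(z)
            betas[j] = max(0.0, 1.0 - lam / norm) * z if norm > 0 else 0.0 * z
    return betas

# Example: three basis-expanded covariates, only the first one is relevant.
rng = np.random.default_rng(0)
n, q = 200, 4
def block():
    Q, _ = np.linalg.qr(rng.standard_normal((n, q)))
    return Q * np.sqrt(n)                       # enforces X_j^T X_j / n = I
X_blocks = [block() for _ in range(3)]
y = X_blocks[0] @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(n)
print([round(float(np.linalg.norm(b)), 3) for b in group_lasso_bcd(X_blocks, y, lam=0.1)])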
We introduce a new method for analyzing the cumulative sum (CUSUM) procedure in sequential change-point detection. When the observations are phase-type distributed and the post-change distribution is given by exponential tilting of the pre-change distribution, the first-passage analysis of the CUSUM statistic reduces to that of a certain Markov additive process. By using the theory of the so-called scale matrix and further developing it, we derive exact expressions for the average run length, average detection delay, and false alarm probability under the CUSUM procedure. The proposed method is robust and applicable in a general setting with non-i.i.d. observations. Numerical results are also given.
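For reference, the CUSUM recursion being analyzed is $W_t = \max(0, W_{t-1} + \log\frac{f_1(x_t)}{f_0(x_t)})$ with an alarm at the first threshold crossing; the sketch below runs it on an exponential (hence phase-type) pre-change distribution and an exponentially tilted post-change distribution. The paper's exact scale-matrix computations of run length and detection delay are not reproduced.

import numpy as np

def cusum_alarm(obs, loglik_ratio, threshold):
    """Run the CUSUM recursion W_t = max(0, W_{t-1} + log LR(x_t)) and
    return the first time it crosses `threshold` (None if it never does)."""
    w = 0.0
    for t, x in enumerate(obs, start=1):
        w = max(0.0, w + loglik_ratio(x))
        if w >= threshold:
            return t
    return None

# Example: Exp(rate 1) observations before the change at time 200, and an
# exponentially tilted Exp(rate 1/2) afterwards (both are phase-type).
rng = np.random.default_rng(0)
obs = np.concatenate([rng.exponential(scale=1.0, size=200),
                      rng.exponential(scale=2.0, size=200)])
llr = lambda x: np.log(0.5) + 0.5 * x           # log f1(x) - log f0(x)
print(cusum_alarm(obs, llr, threshold=5.0))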
Recent work of Erlingsson, Feldman, Mironov, Raghunathan, Talwar, and Thakurta [EFMRTT19] demonstrates that random shuffling amplifies differential privacy guarantees of locally randomized data. Such amplification implies substantially stronger privacy guarantees for systems in which data is contributed anonymously [BEMMRLRKTS17] and has led to significant interest in the shuffle model of privacy [CSUZZ19; EFMRTT19]. We show that random shuffling of $n$ data records that are input to $\varepsilon_0$-differentially private local randomizers results in an $(O((1-e^{-\varepsilon_0})\sqrt{\frac{e^{\varepsilon_0}\log(1/\delta)}{n}}), \delta)$-differentially private algorithm. This significantly improves over previous work and achieves the asymptotically optimal dependence in $\varepsilon_0$. Our result is based on a new approach that is simpler than previous work and extends to approximate differential privacy with nearly the same guarantees. Importantly, our work also yields an algorithm for deriving tighter bounds on the resulting $\varepsilon$ and $\delta$ as well as R\'enyi differential privacy guarantees. We show numerically that our algorithm gets to within a small constant factor of the optimal bound. As a direct corollary of our analysis we derive a simple and nearly optimal algorithm for frequency estimation in the shuffle model of privacy. We also observe that our result implies the first asymptotically optimal privacy analysis of noisy stochastic gradient descent that applies to sampling without replacement.
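The sketch below evaluates the stated amplification bound numerically (the constant hidden in the $O(\cdot)$ is unspecified and left as a parameter) and pairs it with binary randomized response, a standard $\varepsilon_0$-DP local randomizer, followed by a uniform shuffle.

import numpy as np

def shuffle_eps_bound(eps0, n, delta, c=1.0):
    """Asymptotic form of the amplified epsilon from the abstract,
    c * (1 - exp(-eps0)) * sqrt(exp(eps0) * log(1/delta) / n);
    the constant c is not pinned down by the O(.) notation."""
    return c * (1 - np.exp(-eps0)) * np.sqrt(np.exp(eps0) * np.log(1 / delta) / n)

def randomized_response(bits, eps0, rng):
    """eps0-DP local randomizer for bits: report truthfully with probability
    e^eps0 / (e^eps0 + 1), then shuffle the reports to erase the order."""
    p_true = np.exp(eps0) / (np.exp(eps0) + 1)
    flip = rng.random(len(bits)) > p_true
    reports = np.where(flip, 1 - bits, bits)
    return rng.permutation(reports)

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, 10_000)
print(shuffle_eps_bound(eps0=1.0, n=10_000, delta=1e-6))   # ~0.04 * c
print(randomized_response(bits, eps0=1.0, rng).mean())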