We present \textit{universal} estimators for the statistical mean, variance, and scale (in particular, the interquartile range) under pure differential privacy. These estimators are universal in the sense that they work on an arbitrary, unknown continuous distribution $\mathcal{P}$ over $\mathbb{R}$, while yielding strong utility guarantees except for ill-behaved $\mathcal{P}$. For certain distribution families like Gaussians or heavy-tailed distributions, we show that our universal estimators match or improve existing estimators, which are often specifically designed for the given family and under \textit{a priori} boundedness assumptions on the mean and variance of $\mathcal{P}$. This is the first time these boundedness assumptions are removed under pure differential privacy. The main technical tools in our development are instance-optimal empirical estimators for the mean and quantiles over the unbounded integer domain, which can be of independent interest.
We propose EB-TC$\varepsilon$, a novel sampling rule for $\varepsilon$-best arm identification in stochastic bandits. It is the first instance of Top Two algorithm analyzed for approximate best arm identification. EB-TC$\varepsilon$ is an *anytime* sampling rule that can therefore be employed without modification for fixed confidence or fixed budget identification (without prior knowledge of the budget). We provide three types of theoretical guarantees for EB-TC$\varepsilon$. First, we prove bounds on its expected sample complexity in the fixed confidence setting, notably showing its asymptotic optimality in combination with an adaptive tuning of its exploration parameter. We complement these findings with upper bounds on its probability of error at any time and for any error parameter, which further yield upper bounds on its simple regret at any time. Finally, we show through numerical simulations that EB-TC$\varepsilon$ performs favorably compared to existing algorithms, in different settings.
In this work, we propose an efficient two-stage algorithm solving a joint problem of correlation detection and partial alignment recovery between two Gaussian databases. Correlation detection is a hypothesis testing problem; under the null hypothesis, the databases are independent, and under the alternate hypothesis, they are correlated, under an unknown row permutation. We develop bounds on the type-I and type-II error probabilities, and show that the analyzed detector performs better than a recently proposed detector, at least for some specific parameter choices. Since the proposed detector relies on a statistic, which is a sum of dependent indicator random variables, then in order to bound the type-I probability of error, we develop a novel graph-theoretic technique for bounding the $k$-th order moments of such statistics. When the databases are accepted as correlated, the algorithm also recovers some partial alignment between the given databases. We also propose two more algorithms: (i) One more algorithm for partial alignment recovery, whose reliability and computational complexity are both higher than those of the first proposed algorithm. (ii) An algorithm for full alignment recovery, which has a reduced amount of calculations and a not much lower error probability, when compared to the optimal recovery procedure.
Generating differentially private (DP) synthetic data that closely resembles the original private data without leaking sensitive user information is a scalable way to mitigate privacy concerns in the current data-driven world. In contrast to current practices that train customized models for this task, we aim to generate DP Synthetic Data via APIs (DPSDA), where we treat foundation models as blackboxes and only utilize their inference APIs. Such API-based, training-free approaches are easier to deploy as exemplified by the recent surge in the number of API-based apps. These approaches can also leverage the power of large foundation models which are accessible via their inference APIs while the model weights are unreleased. However, this comes with greater challenges due to strictly more restrictive model access and the additional need to protect privacy from the API provider. In this paper, we present a new framework called Private Evolution (PE) to solve this problem and show its initial promise on synthetic images. Surprisingly, PE can match or even outperform state-of-the-art (SOTA) methods without any model training. For example, on CIFAR10 (with ImageNet as the public data), we achieve FID<=7.9 with privacy cost epsilon=0.67, significantly improving the previous SOTA from epsilon=32. We further demonstrate the promise of applying PE on large foundation models such as Stable Diffusion to tackle challenging private datasets with a small number of high-resolution images.
Decision trees are interpretable models that are well-suited to non-linear learning problems. Much work has been done on extending decision tree learning algorithms with differential privacy, a system that guarantees the privacy of samples within the training data. However, current state-of-the-art algorithms for this purpose sacrifice much utility for a small privacy benefit. These solutions create random decision nodes that reduce decision tree accuracy or spend an excessive share of the privacy budget on labeling leaves. Moreover, many works do not support or leak information about feature values when data is continuous. We propose a new method called PrivaTree based on private histograms that chooses good splits while consuming a small privacy budget. The resulting trees provide a significantly better privacy-utility trade-off and accept mixed numerical and categorical data without leaking additional information. Finally, while it is notoriously hard to give robustness guarantees against data poisoning attacks, we prove bounds for the expected success rates of backdoor attacks against differentially-private learners. Our experimental results show that PrivaTree consistently outperforms previous works on predictive accuracy and significantly improves robustness against backdoor attacks compared to regular decision trees.
Kaplan-Meier estimators capture the survival behavior of a cohort. They are one of the key statistics in survival analysis. As with any estimator, they become more accurate in presence of larger datasets. This motivates multiple data holders to share their data in order to calculate a more accurate Kaplan-Meier estimator. However, these survival datasets often contain sensitive information of individuals and it is the responsibility of the data holders to protect their data, thus a naive sharing of data is often not viable. In this work, we propose two novel differentially private schemes that are facilitated by our novel synthetic dataset generation method. Based on these scheme we propose various paths that allow a joint estimation of the Kaplan-Meier curves with strict privacy guarantees. Our contribution includes a taxonomy of methods for this task and an extensive experimental exploration and evaluation based on this structure. We show that we can construct a joint, global Kaplan-Meier estimator which satisfies very tight privacy guarantees and with no statistically-significant utility loss compared to the non-private centralized setting.
In many life science experiments or medical studies, subjects are repeatedly observed and measurements are collected in factorial designs with multivariate data. The analysis of such multivariate data is typically based on multivariate analysis of variance (MANOVA) or mixed models, requiring complete data, and certain assumption on the underlying parametric distribution such as continuity or a specific covariance structure, e.g., compound symmetry. However, these methods are usually not applicable when discrete data or even ordered categorical data are present. In such cases, nonparametric rank-based methods that do not require stringent distributional assumptions are the preferred choice. However, in the multivariate case, most rank-based approaches have only been developed for complete observations. It is the aim of this work is to develop asymptotic correct procedures that are capable of handling missing values, allowing for singular covariance matrices and are applicable for ordinal or ordered categorical data. This is achieved by applying a wild bootstrap procedure in combination with quadratic form-type test statistics. Beyond proving their asymptotic correctness, extensive simulation studies validate their applicability for small samples. Finally, two real data examples are analyzed.
Suppose we want to train text prediction models in email clients or word processors. The models must preserve the privacy of user data and adhere to a specific fixed size to meet memory and inference time requirements. We introduce a generic framework to solve this problem. Specifically, we are given a public dataset $D_\text{pub}$ and a private dataset $D_\text{priv}$ corresponding to a downstream task $T$. How should we pre-train a fixed-size model $M$ on $D_\text{pub}$ and fine-tune it on $D_\text{priv}$ such that performance of $M$ with respect to $T$ is maximized and $M$ satisfies differential privacy with respect to $D_\text{priv}$? We show that pre-training on a {\em subset} of dataset $D_\text{pub}$ that brings the public distribution closer to the private distribution is a crucial ingredient to maximize the transfer learning abilities of $M$ after pre-training, especially in the regimes where model sizes are relatively small. Besides performance improvements, our framework also shows that with careful pre-training and private fine-tuning, {\em smaller models} can match the performance of much larger models, highlighting the promise of differentially private training as a tool for model compression and efficiency.
The Inverse-Wishart (IW) distribution is a standard and popular choice of priors for covariance matrices and has attractive properties such as conditional conjugacy. However, the IW family of priors has crucial drawbacks, including the lack of effective choices for non-informative priors. Several classes of priors for covariance matrices that alleviate these drawbacks, while preserving computational tractability, have been proposed in the literature. These priors can be obtained through appropriate scale mixtures of IW priors. However, the high-dimensional posterior consistency of models which incorporate such priors has not been investigated. We address this issue for the multi-response regression setting ($q$ responses, $n$ samples) under a wide variety of IW scale mixture priors for the error covariance matrix. Posterior consistency and contraction rates for both the regression coefficient matrix and the error covariance matrix are established in the ``large $q$, large $n$" setting under mild assumptions on the true data-generating covariance matrix and relevant hyperparameters. In particular, the number of responses $q_n$ is allowed to grow with $n$, but with $q_n = o(n)$. Also, some results related to the inconsistency of posterior mean for $q_n/n \to \gamma$, where $\gamma \in (0,\infty)$ are provided.
Privacy estimation techniques for differentially private (DP) algorithms are useful for comparing against analytical bounds, or to empirically measure privacy loss in settings where known analytical bounds are not tight. However, existing privacy auditing techniques usually make strong assumptions on the adversary (e.g., knowledge of intermediate model iterates or the training data distribution), are tailored to specific tasks and model architectures, and require retraining the model many times (typically on the order of thousands). These shortcomings make deploying such techniques at scale difficult in practice, especially in federated settings where model training can take days or weeks. In this work, we present a novel "one-shot" approach that can systematically address these challenges, allowing efficient auditing or estimation of the privacy loss of a model during the same, single training run used to fit model parameters, and without requiring any a priori knowledge about the model architecture or task. We show that our method provides provably correct estimates for privacy loss under the Gaussian mechanism, and we demonstrate its performance on a well-established FL benchmark dataset under several adversarial models.
We construct differentially private estimators with low sample complexity that estimate the median of an arbitrary distribution over $\mathbb{R}$ satisfying very mild moment conditions. Our result stands in contrast to the surprising negative result of Bun et al. (FOCS 2015) that showed there is no differentially private estimator with any finite sample complexity that returns any non-trivial approximation to the median of an arbitrary distribution.