In this work, we introduce statistical testing under distributional shifts. We are interested in the hypothesis $P^* \in H_0$ for a target distribution $P^*$, but observe data from a different distribution $Q^*$. We assume that $P^*$ is related to $Q^*$ through a known shift $\tau$ and formally introduce hypothesis testing in this setting. We propose a general testing procedure that first resamples from the observed data to construct an auxiliary data set and then applies an existing test in the target domain. We prove that if the size of the resample is of order $o(\sqrt{n})$ and the resampling weights are well-behaved, this procedure inherits the pointwise asymptotic level and power from the target test. If the map $\tau$ is estimated from data, we can maintain the above guarantees under mild conditions if the estimation works sufficiently well. We further extend our results to uniform asymptotic level and a different resampling scheme. Testing under distributional shifts allows us to tackle a diverse set of problems. We argue that it may prove useful in reinforcement learning and covariate shift, show how it reduces conditional to unconditional independence testing, and provide example applications in causal inference.
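The resample-then-test procedure described above can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the abstract: the observed distribution is $N(0,1)$, the target is $N(1,1)$, the shift is the known density ratio $r(x) = e^{x - 1/2}$, the resample size is $\lfloor n^{0.45} \rfloor$, and the target test is a two-sided z-test. Weighted sampling without replacement (via random keys) stands in for the resampling scheme; all function names are hypothetical.

```python
import math
import random


def shift_weight(x):
    # Assumed known shift: density ratio N(1,1) / N(0,1) = exp(x - 1/2).
    return math.exp(x - 0.5)


def resample_distinct(data, weights, m, rng):
    # Weighted sampling without replacement: give each point the key
    # u**(1/w) with u uniform on [0, 1) and keep the m largest keys.
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return [data[i] for _, i in keys[:m]]


def z_test_mean(sample, mu0, sigma=1.0):
    # Two-sided z-test of H0: mean == mu0 with known standard deviation.
    m = len(sample)
    z = (sum(sample) / m - mu0) * math.sqrt(m) / sigma
    return math.erfc(abs(z) / math.sqrt(2.0))


rng = random.Random(0)
n = 10_000
observed = [rng.gauss(0.0, 1.0) for _ in range(n)]  # data drawn from Q*
m = int(n ** 0.45)                                  # resample size, o(sqrt(n))
weights = [shift_weight(x) for x in observed]
auxiliary = resample_distinct(observed, weights, m, rng)
p_value = z_test_mean(auxiliary, mu0=1.0)           # test applied in the target domain
print(m, p_value)
```

The exponent 0.45 is only one choice that keeps the resample size below $\sqrt{n}$; the abstract does not prescribe a specific rate.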
A membership inference attack (MIA) against a machine-learning model enables an attacker to determine whether a given data record was part of the model's training data or not. In this paper, we provide an in-depth study of the phenomenon of disparate vulnerability against MIAs: unequal success rates of MIAs against different population subgroups. We first establish necessary and sufficient conditions for MIAs to be prevented, both on average and for population subgroups, using a notion of distributional generalization. Second, we derive connections of disparate vulnerability to algorithmic fairness and to differential privacy. We show that fairness can only prevent disparate vulnerability against limited classes of adversaries. Differential privacy bounds disparate vulnerability but can significantly reduce the accuracy of the model. We show that estimating disparate vulnerability to MIAs by na\"ively applying existing attacks can lead to overestimation. We then establish which attacks are suitable for estimating disparate vulnerability, and provide a statistical framework for doing so reliably. We conduct experiments on synthetic and real-world data, finding statistically significant evidence of disparate vulnerability in realistic settings.
Goodness-of-fit (GoF) testing is ubiquitous in statistics, with direct ties to model selection, confidence interval construction, conditional independence testing, and multiple testing, just to name a few applications. While testing the GoF of a simple (point) null hypothesis provides an analyst great flexibility in the choice of test statistic while still ensuring validity, most GoF tests for composite null hypotheses are far more constrained, as the test statistic must have a tractable distribution over the entire null model space. A notable exception is co-sufficient sampling (CSS): resampling the data conditional on a sufficient statistic for the null model guarantees valid GoF testing using any test statistic the analyst chooses. But CSS testing requires the null model to have a compact (in an information-theoretic sense) sufficient statistic, which only holds for a very limited class of models; even for a null model as simple as logistic regression, CSS testing is powerless. In this paper, we leverage the concept of approximate sufficiency to generalize CSS testing to essentially any parametric model with an asymptotically-efficient estimator; we call our extension "approximate CSS" (aCSS) testing. We quantify the finite-sample Type I error inflation of aCSS testing and show that it is vanishing under standard maximum likelihood asymptotics, for any choice of test statistic. We apply our proposed procedure both theoretically and in simulation to a number of models of interest to demonstrate its finite-sample Type I error and power.
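To make the co-sufficient sampling idea concrete, here is a minimal sketch of exact CSS (not the aCSS extension) for a null model chosen purely for illustration: i.i.d. $N(\theta, 1)$ data with unknown $\theta$, whose sufficient statistic is the sample mean. Conditional on the sample mean, copies can be drawn by recentering fresh standard normal draws. The test statistic, sample sizes, and function names are assumptions made for this example.

```python
import random


def css_copy(x, rng):
    # Draw a copy of x conditional on its sample mean under N(theta, 1):
    # recentre fresh N(0, 1) noise so the copy has the same mean as x.
    n = len(x)
    xbar = sum(x) / n
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    zbar = sum(z) / n
    return [xbar + zi - zbar for zi in z]


def max_abs_deviation(x):
    # Any statistic may be used; this one is sensitive to outliers.
    xbar = sum(x) / len(x)
    return max(abs(xi - xbar) for xi in x)


def css_p_value(x, statistic, n_copies, rng):
    # Rank the observed statistic among the statistics of the copies.
    t_obs = statistic(x)
    exceed = sum(statistic(css_copy(x, rng)) >= t_obs for _ in range(n_copies))
    return (1 + exceed) / (1 + n_copies)


rng = random.Random(1)
data = [rng.gauss(2.0, 1.0) for _ in range(50)]
p_value = css_p_value(data, max_abs_deviation, 199, rng)
print(p_value)
```

Because the copies are exchangeable with the data under the null, the p-value is valid for any choice of `statistic`, which is the flexibility the abstract refers to.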
Property inference attacks reveal statistical properties about a training set but are difficult to distinguish from the primary purpose of statistical machine learning, which is to produce models that capture statistical properties about a distribution. Motivated by Yeom et al.'s membership inference framework, we propose a formal and generic definition of property inference attacks. The proposed notion describes attacks that can distinguish between possible training distributions, extending beyond previous property inference attacks that infer the ratio of a particular type of data in the training data set. In this paper, we show how our definition captures previous property inference attacks as well as a new attack that reveals the average degree of nodes of a training graph, and we report on experiments giving insight into the potential risks of property inference attacks.
We present a framework for speeding up the time it takes to sample from discrete distributions $\mu$ defined over subsets of size $k$ of a ground set of $n$ elements, in the regime $k\ll n$. We show that having estimates of marginals $\mathbb{P}_{S\sim \mu}[i\in S]$, the task of sampling from $\mu$ can be reduced to sampling from distributions $\nu$ supported on size $k$ subsets of a ground set of only $n^{1-\alpha}\cdot \operatorname{poly}(k)$ elements. Here, $1/\alpha\in [1, k]$ is the parameter of entropic independence for $\mu$. Further, the sparsified distributions $\nu$ are obtained by applying a sparse (mostly $0$) external field to $\mu$, an operation that often retains algorithmic tractability of sampling from $\nu$. This phenomenon, which we dub domain sparsification, allows us to pay a one-time cost of estimating the marginals of $\mu$, and in return reduce the amortized cost needed to produce many samples from the distribution $\mu$, as is often needed in upstream tasks such as counting and inference. For a wide range of distributions where $\alpha=\Omega(1)$, our result reduces the domain size, and as a corollary, the cost-per-sample, by a $\operatorname{poly}(n)$ factor. Examples include monomers in a monomer-dimer system, non-symmetric determinantal point processes, and partition-constrained Strongly Rayleigh measures. Our work significantly extends the reach of prior work of Anari and Derezi\'nski who obtained domain sparsification for distributions with a log-concave generating polynomial (corresponding to $\alpha=1$). As a corollary of our new analysis techniques, we also obtain a less stringent requirement on the accuracy of marginal estimates even for the case of log-concave polynomials; roughly speaking, we show that constant-factor approximation is enough for domain sparsification, improving over $O(1/k)$ relative error established in prior work.
We consider the estimation of an $n$-dimensional vector $\mathbf{s}$ from noisy element-wise measurements of $\mathbf{s}\mathbf{s}^T$, a generic problem that arises in statistics and machine learning. We study a mismatched Bayesian inference setting, where some of the parameters are not known to the statistician. We derive the full exact analytic expression of the asymptotic mean squared error (MSE) in the large system size limit for the particular case of Gaussian priors and additive noise. From our formulas, we see that estimation is still possible in the mismatched case and that the minimum MSE (MMSE) can be achieved if the statistician chooses suitable parameters. Our technique relies on the asymptotics of the spherical integrals and can be applied as long as the statistician chooses a rotationally invariant prior.
Property inference attacks reveal statistical properties about a training set but are difficult to distinguish from the intrinsic purpose of statistical machine learning, namely to produce models that capture statistical properties about a distribution. Motivated by Yeom et al.'s membership inference framework, we propose a formal and general definition of property inference attacks. The proposed notion describes attacks that can distinguish between possible training distributions, extending beyond previous property inference attacks that infer the ratio of a particular type of data in the training data set such as the proportion of females. We show how our definition captures previous property inference attacks as well as a new attack that can reveal the average node degree or clustering coefficient of a training graph. Our definition also enables a theorem that connects the maximum possible accuracy of inference attacks distinguishing between distributions to the effective size of dataset leaked by the model. To quantify and understand property inference risks, we conduct a series of experiments across a range of different distributions using both black-box and white-box attacks. Our results show that inexpensive attacks are often as effective as expensive meta-classifier attacks, and that there are surprising asymmetries in the effectiveness of attacks. We also extend the state-of-the-art property inference attack to work on convolutional neural networks, and propose techniques to help identify parameters in a model that leak the most information, thus significantly lowering resource requirements for meta-classifier attacks.
Based on the covert communication framework, we consider a covert queueing problem that has a Markovian statistic. Willie jobs arrive according to a Poisson process and require service from server Bob. Bob does not have a queue for jobs to wait in, and hence when the server is busy, arriving Willie jobs are lost. Willie and Bob enter a contract under which Bob should only serve Willie jobs. As part of the usage statistic, for a sequence of $N$ consecutive jobs that arrived, Bob informs Willie whether each job was served or lost (this is the Markovian statistic). Bob is assumed to be violating the contract and admitting non-Willie (Nillie) jobs according to a Poisson process. For such a setting, we identify the hypothesis test to be performed (given the Markovian data) by Willie to detect the presence or absence of Nillie jobs. We also characterize the upper bound on the arrival rate of Nillie jobs such that the error of Willie's hypothesis test is arbitrarily large, ensuring covertness in admitting Nillie jobs.
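The served/lost record that Willie receives can be simulated with a short sketch. The abstract does not specify the service-time distribution, so exponential service times are an assumption here, as are the rates `lam` (Willie), `nu` (Nillie), and `mu` (service) and the function name. Under these assumptions, Willie's loss fraction without Nillie jobs should be close to $\lambda/(\lambda+\mu)$, and it increases once Nillie jobs are admitted.

```python
import random


def served_lost_sequence(n_jobs, lam, mu, nu, rng):
    """Return 1 (served) or 0 (lost) for n_jobs consecutive Willie jobs.

    lam: Willie arrival rate, nu: Nillie arrival rate (0 means none),
    mu: service rate. Exponential service is an assumption of this sketch.
    """
    t = 0.0
    busy_until = 0.0
    record = []
    total = lam + nu
    while len(record) < n_jobs:
        # Next arrival of the superposed Poisson process (Willie + Nillie).
        t += rng.expovariate(total)
        is_willie = rng.random() < lam / total
        admitted = t >= busy_until  # no waiting room: busy server means loss
        if admitted:
            busy_until = t + rng.expovariate(mu)
        if is_willie:
            record.append(1 if admitted else 0)
    return record


rng = random.Random(42)
n_jobs = 20_000
honest = served_lost_sequence(n_jobs, lam=1.0, mu=2.0, nu=0.0, rng=rng)
covert = served_lost_sequence(n_jobs, lam=1.0, mu=2.0, nu=1.0, rng=rng)
loss_honest = 1.0 - sum(honest) / n_jobs
loss_covert = 1.0 - sum(covert) / n_jobs
print(loss_honest, loss_covert)
```

Willie's test would compare records such as `honest` and `covert`; the sketch only generates the data and does not implement the test characterized in the paper.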
Modern neural networks can assign high confidence to inputs drawn from outside the training distribution, posing threats to models in real-world deployments. While much research attention has been placed on designing new out-of-distribution (OOD) detection methods, the precise definition of OOD is often left vague and falls short of the desired notion of OOD in reality. In this paper, we present a new formalization and model the data shifts by taking into account both the invariant and environmental (spurious) features. Under such formalization, we systematically investigate how spurious correlation in the training set impacts OOD detection. Our results suggest that the detection performance is severely worsened when the correlation between spurious features and labels is increased in the training set. We further provide insights into detection methods that are more effective in reducing the impact of spurious correlation and provide a theoretical analysis of why reliance on environmental features leads to high OOD detection error. Our work aims to facilitate a better understanding of OOD samples and their formalization, as well as the exploration of methods that enhance OOD detection.
We study data-driven decision-making problems in the Bayesian framework, where the expectation in the Bayes risk is replaced by a risk-sensitive entropic risk measure. We focus on problems where calculating the posterior distribution is intractable, a typical situation in modern applications with large datasets and complex data generating models. We leverage a dual representation of the entropic risk measure to introduce a novel risk-sensitive variational Bayesian (RSVB) framework for jointly computing a risk-sensitive posterior approximation and the corresponding decision rule. The proposed RSVB framework can be used to extract computational methods for risk-sensitive approximate Bayesian inference. We show that our general framework includes two well-known computational methods for approximate Bayesian inference, namely naive VB and loss-calibrated VB. We also study the impact of these computational approximations on the predictive performance of the inferred decision rules and values. We compute the convergence rates of the RSVB approximate posterior and also of the corresponding optimal value and decision rules. We illustrate our theoretical findings in both parametric and nonparametric settings with the help of three examples: the single and multi-product newsvendor model and Gaussian process classification.
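The dual representation mentioned above can be checked numerically on a small discrete example. The sketch below uses the standard variational (Donsker-Varadhan) form, $\frac{1}{\gamma}\log \mathbb{E}_{\pi}[e^{\gamma R}] = \sup_{q} \{\mathbb{E}_q[R] - \frac{1}{\gamma}\mathrm{KL}(q\|\pi)\}$ for $\gamma > 0$; the sign and scaling conventions of the paper itself are not given in the abstract, so this form, the toy numbers, and the function names are assumptions.

```python
import math


def entropic_risk(pi, r, gamma):
    # (1 / gamma) * log E_pi[exp(gamma * R)] for a discrete distribution pi.
    return math.log(sum(p * math.exp(gamma * x) for p, x in zip(pi, r))) / gamma


def dual_objective(q, pi, r, gamma):
    # E_q[R] - KL(q || pi) / gamma, the quantity maximised over q.
    mean = sum(qi * x for qi, x in zip(q, r))
    kl = sum(qi * math.log(qi / p) for qi, p in zip(q, pi) if qi > 0)
    return mean - kl / gamma


def tilted(pi, r, gamma):
    # Maximiser of the dual objective: q proportional to pi * exp(gamma * R).
    w = [p * math.exp(gamma * x) for p, x in zip(pi, r)]
    total = sum(w)
    return [wi / total for wi in w]


pi = [0.5, 0.3, 0.2]     # toy "posterior" over three outcomes
r = [1.0, 2.0, 4.0]      # toy risk values
gamma = 0.7
primal = entropic_risk(pi, r, gamma)
dual_at_optimum = dual_objective(tilted(pi, r, gamma), pi, r, gamma)
dual_at_pi = dual_objective(pi, pi, r, gamma)  # equals the plain expectation
print(primal, dual_at_optimum, dual_at_pi)
```

Restricting the supremum to a tractable family of distributions `q` is what turns this identity into a variational objective; how the RSVB framework does so is described in the paper, not in this sketch.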
Discrete random structures are important tools in Bayesian nonparametrics and the resulting models have proven effective in density estimation, clustering, topic modeling and prediction, among others. In this paper, we consider nested processes and study the dependence structures they induce. Dependence ranges between homogeneity, corresponding to full exchangeability, and maximum heterogeneity, corresponding to (unconditional) independence across samples. The popular nested Dirichlet process is shown to degenerate to the fully exchangeable case when there are ties across samples at the observed or latent level. To overcome this drawback, inherent to nesting general discrete random measures, we introduce a novel class of latent nested processes. These are obtained by adding common and group-specific completely random measures and then normalising to yield dependent random probability measures. We provide results on the partition distributions induced by latent nested processes, and develop a Markov chain Monte Carlo sampler for Bayesian inference. A test for distributional homogeneity across groups is obtained as a by-product. The results and their inferential implications are showcased on synthetic and real data.