We introduce a family of pairwise stochastic gradient estimators for gradients of expectations, which are related to the log-derivative trick, but involve pairwise interactions between samples. The simplest example of our new estimator, dubbed the fundamental trick estimator, is shown to arise from either a) introducing and approximating an integral representation based on the fundamental theorem of calculus, or b) applying the reparameterisation trick to an implicit parameterisation under infinitesimal perturbation of the parameters. From the former perspective we generalise to a reproducing kernel Hilbert space representation, giving rise to a locality parameter in the pairwise interactions mentioned above, yielding our representer trick estimator. The resulting estimators are unbiased and shown to offer an independent component of useful information in comparison with the log-derivative estimator. We provide a novel theoretical analysis which further characterises the variance reduction afforded by the new techniques. Promising analytical and numerical examples confirm the theory and intuitions behind the new estimators.
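For orientation, the log-derivative trick that the new estimators are compared against can be sketched as follows. This is only the standard baseline, not the pairwise estimators of the abstract; the Gaussian sampling distribution, the test function $f(x) = x^2$, and the function name are illustrative assumptions.

```python
import random

def log_derivative_gradient(f, theta, n_samples, rng):
    """Score-function estimate of d/dtheta E_{x ~ N(theta, 1)}[f(x)]."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)
        score = x - theta  # d/dtheta log N(x; theta, 1)
        total += f(x) * score
    return total / n_samples

# Assumed toy problem: E[x^2] = theta^2 + 1, so the exact gradient is 2 * theta.
rng = random.Random(0)
estimate = log_derivative_gradient(lambda x: x * x, theta=1.5, n_samples=200_000, rng=rng)
print(estimate)  # should be close to 2 * 1.5 = 3.0, up to Monte Carlo error
```

Each sample contributes independently here, which is the point of contrast with estimators built from pairwise interactions between samples.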
Algorithms such as Differentially Private SGD enable training machine learning models with formal privacy guarantees. However, there is a discrepancy between the protection that such algorithms guarantee in theory and the protection they afford in practice. An emerging strand of work empirically estimates the protection afforded by differentially private training as a confidence interval for the privacy budget $\varepsilon$ spent on training a model. Existing approaches derive confidence intervals for $\varepsilon$ from confidence intervals for the false positive and false negative rates of membership inference attacks. Unfortunately, obtaining narrow high-confidence intervals for $\varepsilon$ using this method requires an impractically large sample size and training as many models as samples. We propose a novel Bayesian method that greatly reduces sample size, and adapt and validate a heuristic to draw more than one sample per trained model. Our Bayesian method exploits the hypothesis testing interpretation of differential privacy to obtain a posterior for $\varepsilon$ (not just a confidence interval) from the joint posterior of the false positive and false negative rates of membership inference attacks. For the same sample size and confidence, we derive confidence intervals for $\varepsilon$ around 40% narrower than prior work. The heuristic, which we adapt from label-only DP, can be used to further reduce the number of trained models needed to get enough samples by up to 2 orders of magnitude.
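The hypothesis testing interpretation can be made concrete with a small sketch. It uses the standard $(\varepsilon, \delta)$-DP constraints $\mathrm{FPR} + e^{\varepsilon}\,\mathrm{FNR} \ge 1 - \delta$ and $\mathrm{FNR} + e^{\varepsilon}\,\mathrm{FPR} \ge 1 - \delta$. The independent uniform priors on the two error rates, the function names, and the attack counts below are assumptions made for illustration and are not taken from the paper.

```python
import math
import random

def epsilon_lower_bound(fpr, fnr, delta=0.0):
    """Smallest epsilon consistent with the given attack error rates under (epsilon, delta)-DP."""
    candidates = [0.0]
    for a, b in ((fpr, fnr), (fnr, fpr)):
        numerator = 1.0 - delta - a
        if numerator <= 0.0:
            continue  # this constraint holds for every epsilon >= 0
        if b == 0.0:
            return math.inf  # an error-free attack rules out any finite epsilon
        candidates.append(math.log(numerator / b))
    return max(candidates)

def epsilon_credible_lower_bound(fp, n_neg, fn, n_pos, delta=0.0,
                                 level=0.95, draws=20_000, seed=0):
    """Monte Carlo lower credible bound for epsilon.

    Assumption: independent Beta(1, 1) priors on FPR and FNR, so each
    posterior is a Beta distribution given the observed error counts.
    """
    rng = random.Random(seed)
    samples = sorted(
        epsilon_lower_bound(
            rng.betavariate(1 + fp, 1 + n_neg - fp),
            rng.betavariate(1 + fn, 1 + n_pos - fn),
            delta,
        )
        for _ in range(draws)
    )
    return samples[int((1.0 - level) * draws)]

# Hypothetical attack outcome: 10 errors of each kind out of 100 trials each.
print(epsilon_lower_bound(0.1, 0.1))                    # point estimate, log(9)
print(epsilon_credible_lower_bound(10, 100, 10, 100))   # 95% lower credible bound
```

Pushing posterior draws of the error rates through the bound yields a full distribution over $\varepsilon$ rather than a single interval, which is the general idea the abstract describes.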
Active learning enables the efficient construction of a labeled dataset by labeling informative samples from an unlabeled dataset. In a real-world active learning scenario, considering the diversity of the selected samples is crucial because many redundant or highly similar samples exist. The core-set approach is a promising diversity-based method that selects diverse samples based on the distance between samples. However, the approach performs poorly compared to uncertainty-based approaches that select the most difficult samples, on which neural models show low confidence. In this work, we analyze the feature space through the lens of density and, interestingly, observe that locally sparse regions tend to have more informative samples than dense regions. Motivated by our analysis, we empower the core-set approach with density-awareness and propose a density-aware core-set (DACS). The strategy is to estimate the density of the unlabeled samples and select diverse samples mainly from sparse regions. To reduce the computational bottlenecks in estimating the density, we also introduce a new density approximation based on locality-sensitive hashing. Experimental results clearly demonstrate the efficacy of DACS in both classification and regression tasks and specifically show that DACS can produce state-of-the-art performance in a practical scenario. Since DACS is weakly dependent on neural architectures, we present a simple yet effective combination method to show that the existing methods can be beneficially combined with DACS.
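The distance-based selection behind the plain core-set approach is commonly implemented as greedy k-center selection, sketched below. This is the baseline without the density-awareness of DACS; the function name, the Euclidean distance, and the toy points are assumptions for illustration.

```python
import math

def k_center_greedy(points, labeled_idx, budget):
    """Pick `budget` indices, each time taking the point farthest from all chosen centers.

    `labeled_idx` must be non-empty: it lists the already-labeled points.
    """
    # min_dist[i] = distance from point i to its nearest already-selected point
    min_dist = [min(math.dist(p, points[j]) for j in labeled_idx) for p in points]
    selected = []
    for _ in range(budget):
        farthest = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(farthest)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[farthest]))
    return selected

# Toy feature vectors: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0), (10.0, 0.0)]
print(k_center_greedy(points, labeled_idx=[0], budget=2))  # → [4, 2]
```

The selection avoids picking both members of a near-duplicate pair, which is the diversity property the abstract refers to.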
We study the problem of learning generalized linear models under adversarial corruptions. We analyze a classical heuristic called the iterative trimmed maximum likelihood estimator which is known to be effective against label corruptions in practice. Under label corruptions, we prove that this simple estimator achieves minimax near-optimal risk on a wide range of generalized linear models, including Gaussian regression, Poisson regression and Binomial regression. Finally, we extend the estimator to the more challenging setting of label and covariate corruptions and demonstrate its robustness and optimality in that setting as well.
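For Gaussian regression the negative log-likelihood is monotone in the squared residual, so an iterative trimmed maximum likelihood procedure reduces to alternating a least-squares fit with keeping the best-fitting samples. The sketch below illustrates that special case in one dimension; the function names, the stopping rule, and the toy data are assumptions, and it is not claimed to match the exact estimator analysed in the paper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (the Gaussian maximum likelihood fit)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def iterative_trimmed_mle(xs, ys, corruption_fraction, n_iters=20):
    """Alternate between fitting on the kept set and re-selecting the lowest-loss samples."""
    keep = int(round((1.0 - corruption_fraction) * len(xs)))
    kept = list(range(len(xs)))  # start from the full, possibly corrupted, sample
    for _ in range(n_iters):
        a, b = fit_line([xs[i] for i in kept], [ys[i] for i in kept])
        # Rank all samples by squared residual, i.e. by Gaussian negative log-likelihood.
        by_loss = sorted(range(len(xs)), key=lambda i: (ys[i] - a * xs[i] - b) ** 2)
        new_kept = sorted(by_loss[:keep])
        if new_kept == kept:
            break  # the kept set is stable
        kept = new_kept
    return fit_line([xs[i] for i in kept], [ys[i] for i in kept])

# Toy data: y = 2x + 1 with three corrupted labels.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
ys[3], ys[11], ys[17] = 100.0, -50.0, 80.0
slope, intercept = iterative_trimmed_mle(xs, ys, corruption_fraction=0.15)
print(slope, intercept)  # close to 2.0 and 1.0
```

For other generalized linear models the same loop would rank samples by their model-specific negative log-likelihood instead of the squared residual.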
Fluid-structure systems occur in a range of scientific and engineering applications. The immersed boundary (IB) method is a widely recognized and effective modeling paradigm for simulating fluid-structure interaction (FSI) in such systems, but a difficulty of the IB formulation is that the pressure and viscous stress are generally discontinuous at the interface. The conventional IB method regularizes these discontinuities, which typically yields low-order accuracy at these interfaces. The immersed interface method (IIM) is an IB-like approach to FSI that sharply imposes stress jump conditions, enabling higher-order accuracy, but prior applications of the IIM have been largely restricted to methods that rely on smooth representations of the interface geometry. This paper introduces an IIM that uses only a C0 representation of the interface, such as those provided by standard nodal Lagrangian FE methods. Verification examples for models with prescribed motion demonstrate that the method sharply resolves stress discontinuities along the IB while avoiding the need for analytic information of the interface geometry. We demonstrate that only the lowest-order jump conditions for the pressure and velocity gradient are required to realize global 2nd-order accuracy. Specifically, we show 2nd-order global convergence rate along with nearly 2nd-order local convergence in the Eulerian velocity, and between 1st- and 2nd-order global convergence rates along with 1st-order local convergence for the Eulerian pressure. We also show 2nd-order local convergence in the interfacial displacement and velocity along with 1st-order local convergence in the fluid traction. As a demonstration of the method's ability to tackle complex geometries, this approach is also used to simulate flow in an anatomical model of the inferior vena cava.
The optimal moment to start renal replacement therapy in a patient with acute kidney injury (AKI) remains a challenging problem in intensive care nephrology. Multiple randomised controlled trials have tried to answer this question, but these can, by definition, only analyse a limited number of treatment initiation strategies. In view of this, we use routinely collected observational data from the Ghent University Hospital intensive care units (ICUs) to investigate different pre-specified timing strategies for renal replacement therapy initiation based on time-updated levels of serum potassium, pH and fluid balance in critically ill patients with AKI, with the aim of minimizing 30-day ICU mortality. For this purpose, we apply statistical techniques for evaluating the impact of specific dynamic treatment regimes in the presence of ICU discharge as a competing event. We discuss two approaches, a non-parametric one - using an inverse probability weighted Aalen-Johansen estimator - and a semiparametric one - using dynamic-regime marginal structural models. Furthermore, we suggest an easy-to-implement cross-validation technique that can be used for the out-of-sample performance assessment of the optimal dynamic treatment regime. Our work illustrates the potential of data-driven medical decision support based on routinely collected observational data.
Minimax optimization has served as the backbone of many machine learning (ML) problems. Although the convergence behavior of optimization algorithms has been extensively studied in minimax settings, their generalization guarantees in the stochastic setting, i.e., how the solution trained on empirical data performs on the unseen testing data, have been relatively underexplored. A fundamental question remains elusive: What is a good metric to study generalization of minimax learners? In this paper, we aim to answer this question by first showing that primal risk, a universal metric to study generalization in minimization, fails in simple examples of minimax problems. Furthermore, another popular metric, the primal-dual risk, also fails to characterize the generalization behavior for minimax problems with nonconvexity, due to non-existence of saddle points. We thus propose a new metric to study generalization of minimax learners: the primal gap, to circumvent these issues. Next, we derive generalization bounds for the primal gap in nonconvex-concave settings. As byproducts of our analysis, we also solve two open questions: establishing generalization bounds for primal risk and primal-dual risk in the strong sense, i.e., without strong concavity or assuming that the maximization and expectation can be interchanged, while either of these assumptions was needed in the literature. Finally, we leverage this new metric to compare the generalization behavior of two popular algorithms -- gradient descent-ascent (GDA) and gradient descent-max (GDMax) in stochastic minimax optimization.
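To fix ideas, the two algorithms and the primal gap can be sketched on a toy problem. The deterministic objective $f(x, y) = \tfrac{1}{2}x^2 + xy - \tfrac{1}{2}y^2$, the step size, and the function names are assumptions chosen for illustration. The paper studies stochastic, nonconvex-concave settings, which this strongly-convex-strongly-concave toy does not capture.

```python
def grad_x(x, y):
    return x + y  # d/dx of f(x, y) = 0.5 x^2 + x y - 0.5 y^2

def grad_y(x, y):
    return x - y  # d/dy of the same f

def gda(x, y, step, n_steps):
    """Gradient descent-ascent: simultaneous descent in x and ascent in y."""
    for _ in range(n_steps):
        x, y = x - step * grad_x(x, y), y + step * grad_y(x, y)
    return x, y

def gdmax(x, step, n_steps):
    """Gradient descent-max: solve the inner maximisation exactly, then descend in x."""
    for _ in range(n_steps):
        y_star = x  # exact maximiser of f(x, .) for this toy objective
        x = x - step * grad_x(x, y_star)
    return x

def primal_gap(x):
    """max_y f(x, y) - min_x' max_y f(x', y); here max_y f(x, y) = x^2 with minimum 0."""
    return x * x

x_gda, y_gda = gda(1.0, -1.0, step=0.1, n_steps=500)
x_gdmax = gdmax(1.0, step=0.1, n_steps=500)
print(primal_gap(x_gda), primal_gap(x_gdmax))  # both near zero
```

Unlike the primal-dual risk, the primal gap is defined through the outer minimisation only, so it does not require a saddle point to exist.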
We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb R^d$ with "certifiably bounded" $t$-th moments and sufficiently light tails, our algorithm achieves error of $O(\epsilon^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error of $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \mathrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based proofs-to-algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively the best possible.
In inverse problems, the parameters of a model are estimated based on observations of the model response. The Bayesian approach is powerful for solving such problems; one formulates a prior distribution for the parameter state that is updated with the observations to compute the posterior parameter distribution. Solving for the posterior distribution can be challenging when, e.g., prior and posterior significantly differ from one another and/or the parameter space is high-dimensional. We use a sequence of importance sampling measures that arise by tempering the likelihood to approach inverse problems exhibiting a significant distance between prior and posterior. Each importance sampling measure is identified by cross-entropy minimization as proposed in the context of Bayesian inverse problems in Engel et al. (2021). To efficiently address problems with high-dimensional parameter spaces we set up the minimization procedure in a low-dimensional subspace of the original parameter space. The principal idea is to analyse the spectrum of the second-moment matrix of the gradient of the log-likelihood function to identify a suitable subspace. Following Zahm et al. (2021), an upper bound on the Kullback-Leibler-divergence between full-dimensional and subspace posterior is provided, which can be utilized to determine the effective dimension of the inverse problem corresponding to a prescribed approximation error bound. We suggest heuristic criteria for optimally selecting the number of model and model gradient evaluations in each iteration of the importance sampling sequence. We investigate the performance of this approach using examples from engineering mechanics set in various parameter space dimensions.
Estimating the conditional quantile of the variable of interest with respect to changes in the covariates is common in many economic applications, as it can offer comprehensive insight. In this paper, we propose a novel semiparametric model averaging method to predict the conditional quantile even if all models under consideration are potentially misspecified. Specifically, we first build a series of non-nested partially linear sub-models, each with a different nonlinear component. Then a leave-one-out cross-validation criterion is applied to choose the model weights. Under some regularity conditions, we prove that the resulting model averaging estimator is asymptotically optimal in terms of minimizing the out-of-sample average quantile prediction error. Our modelling strategy not only effectively avoids the problem of specifying which covariate should be nonlinear when one fits a partially linear model, but also results in more accurate prediction than traditional model-based procedures because of the optimality of the weights selected by the cross-validation criterion. Simulation experiments and an illustrative application show that our proposed model averaging method is superior to other commonly used alternatives.
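The weight-selection step can be illustrated with the quantile check loss $\rho_\tau(u) = u\,(\tau - \mathbb{1}\{u < 0\})$. The sketch below assumes only two candidate models whose leave-one-out predictions are already available, and searches a grid of weights. The function names, the grid search, and the toy numbers are illustrative assumptions, not the paper's procedure.

```python
def check_loss(residual, tau):
    """Quantile (pinball) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def choose_weight(ys, preds_a, preds_b, tau, grid_size=101):
    """Grid-search the weight w in [0, 1] on model A that minimises the average check loss.

    preds_a and preds_b are assumed to be leave-one-out predictions of the
    tau-th conditional quantile from two candidate sub-models.
    """
    best_w, best_loss = 0.0, float("inf")
    for k in range(grid_size):
        w = k / (grid_size - 1)
        loss = sum(
            check_loss(y - (w * a + (1.0 - w) * b), tau)
            for y, a, b in zip(ys, preds_a, preds_b)
        ) / len(ys)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w

# Toy check: model A predicts perfectly, model B always predicts zero.
ys = [1.0, 2.0, 3.0, 4.0]
print(choose_weight(ys, preds_a=ys, preds_b=[0.0] * 4, tau=0.5))  # → 1.0
```

With more than two sub-models the weights live on a simplex, and a grid search would normally be replaced by a convex optimisation routine.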
This PhD thesis contains several contributions to the field of statistical causal modeling. Statistical causal models are statistical models embedded with causal assumptions that allow for the inference and reasoning about the behavior of stochastic systems affected by external manipulation (interventions). This thesis contributes to the research areas concerning the estimation of causal effects, causal structure learning, and distributionally robust (out-of-distribution generalizing) prediction methods. We present novel and consistent linear and non-linear causal effects estimators in instrumental variable settings that employ data-dependent mean squared prediction error regularization. Our proposed estimators show, in certain settings, mean squared error improvements compared to both canonical and state-of-the-art estimators. We show that recent research on distributionally robust prediction methods has connections to well-studied estimators from econometrics. This connection leads us to prove that general K-class estimators possess distributional robustness properties. We, furthermore, propose a general framework for distributional robustness with respect to intervention-induced distributions. In this framework, we derive sufficient conditions for the identifiability of distributionally robust prediction methods and present impossibility results that show the necessity of several of these conditions. We present a new structure learning method applicable in additive noise models with directed trees as causal graphs. We prove consistency in a vanishing identifiability setup and provide a method for testing substructure hypotheses with asymptotic family-wise error control that remains valid post-selection. Finally, we present heuristic ideas for learning summary graphs of nonlinear time-series models.
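For reference, the K-class family mentioned above is usually written as follows in the econometrics literature. The notation is an assumption introduced here rather than taken from the thesis: $X$ denotes the regressors, $Z$ the instruments, and $y$ the response.

$$
\hat\beta_\kappa = \bigl(X^\top (I - \kappa M_Z) X\bigr)^{-1} X^\top (I - \kappa M_Z)\, y,
\qquad M_Z = I - Z (Z^\top Z)^{-1} Z^\top .
$$

Setting $\kappa = 0$ recovers ordinary least squares and $\kappa = 1$ recovers two-stage least squares, so the parameter $\kappa$ interpolates between a purely predictive fit and an instrumental variable fit.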