In randomized experiments and observational studies, weighting methods are often used to generalize and transport treatment effect estimates to target populations. Traditional methods construct the weights by separately modeling the treatment assignment and study selection probabilities and then multiplying functions (e.g., inverses) of their estimates. In this work, we provide a justification and an implementation for weighting in a single step. We show a formal connection between this one-step method and inverse probability and inverse odds weighting. We demonstrate that the resulting estimator for the target average treatment effect is consistent, asymptotically Normal, multiply robust, and semiparametrically efficient.
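To make the contrast concrete, here is a toy numerical sketch of the general idea, not the authors' estimator: two-step weights multiply an inverse selection probability with an inverse treatment probability, whereas a one-step weight can be read off a single model for the joint class of selection and treatment. The simulated data and model choices below are hypothetical.

```python
# Hedged sketch: two-step (selection model x treatment model) versus one-step
# (single multinomial model over the joint selection/treatment classes) weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))
S = rng.binomial(1, 1 / (1 + np.exp(-0.5 * X[:, 0])))                  # study selection
A = np.where(S == 1, rng.binomial(1, 1 / (1 + np.exp(-X[:, 1]))), 0)   # treatment within the study

# Two-step: separate models for selection and for treatment assignment among the selected.
ps_sel = LogisticRegression().fit(X, S).predict_proba(X)[:, 1]
ps_trt = LogisticRegression().fit(X[S == 1], A[S == 1]).predict_proba(X)[:, 1]
w_two_step = ((S == 1) & (A == 1)) / (ps_sel * ps_trt)

# One-step: a single model for the composite label (unselected, selected control, selected treated).
joint = np.select([S == 0, (S == 1) & (A == 0)], [0, 1], default=2)
p_joint = LogisticRegression(max_iter=1000).fit(X, joint).predict_proba(X)
w_one_step = (joint == 2) / p_joint[:, 2]

print(np.corrcoef(w_two_step[joint == 2], w_one_step[joint == 2])[0, 1])
```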
The impact of player age on performance has received attention across sports. Most research has focused on players' performance at each age, ignoring the reality that age also influences which players receive opportunities to perform. Our manuscript makes two contributions. First, we highlight how selection bias is linked to both (i) which players receive opportunities to perform in sport, and (ii) the ages at which we observe these players perform. This approach is used to generate underlying distributions of how players move in and out of sport organizations. Second, motivated by methods for missing data, we propose novel methods for estimating age curves that use both observed and unobserved (imputed) data. We use simulations to compare several candidate approaches for estimating aging curves. Imputation-based methods, as well as models that account for individual player skill, tend to yield lower RMSE and age-curve shapes that better match the truth. We implement our approach using data from the National Hockey League.
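A toy simulation, not the paper's design, illustrates why imputation helps: when weaker seasons go unobserved, the naive observed-mean curve is biased upward away from the peak, and filling in a player's unobserved seasons from that player's estimated skill pulls the curve back toward the truth. All quantities below are hypothetical.

```python
# Hedged sketch: selection on performance biases the naive age curve; a simple
# skill-based imputation of unobserved player-seasons reduces the bias.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
ages = np.arange(20, 36)
true_curve = -0.02 * (ages - 27) ** 2                  # peak performance at age 27
frames = []
for i in range(2000):
    skill = rng.normal()
    perf = true_curve + skill + rng.normal(0, 0.3, len(ages))
    frames.append(pd.DataFrame({"player": i, "age": ages, "perf": perf,
                                "obs": perf > -1.5}))  # weaker seasons go unobserved
df = pd.concat(frames, ignore_index=True)

naive = df[df.obs].groupby("age").perf.mean()          # biased away from the peak ages
dev = df.perf - naive.reindex(df.age).to_numpy()       # deviation from the naive curve
skill_hat = df[df.obs].assign(dev=dev[df.obs]).groupby("player").dev.mean()
imputed = np.where(df.obs, df.perf,
                   skill_hat.reindex(df.player).to_numpy() + naive.reindex(df.age).to_numpy())
curve_imp = df.assign(val=imputed).groupby("age").val.mean()
print(pd.DataFrame({"true": true_curve, "naive": naive, "imputed": curve_imp}, index=ages))
```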
A treatment policy defines when and what treatments are applied to affect some outcome of interest. Data-driven decision-making requires the ability to predict what happens if a policy is changed. Existing methods that predict how the outcome evolves under different scenarios assume that the tentative sequences of future treatments are fixed in advance, whereas in practice the treatments are determined stochastically by a policy and may depend, for example, on the efficiency of previous treatments. Consequently, current methods are not applicable if the treatment policy is unknown or a counterfactual analysis is needed. To address these limitations, we model the treatments and outcomes jointly in continuous time by combining Gaussian processes and point processes. Our model enables the estimation of a treatment policy from observational sequences of treatments and outcomes, and it can predict the interventional and counterfactual progression of the outcome after an intervention on the treatment policy (in contrast with the causal effect of a single treatment). We show with real-world and semi-synthetic data on blood glucose progression that our method can answer causal queries more accurately than existing alternatives.
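A minimal roll-forward sketch, deliberately much simpler than the paper's Gaussian-process/point-process model, shows why intervening on a policy differs from fixing a treatment sequence: the policy and the outcome dynamics must be simulated jointly. The dynamics and policy form below are placeholder assumptions.

```python
# Hedged sketch: simulate outcomes under a stochastic treatment policy, then
# under an intervened policy, to contrast policy-level with single-treatment effects.
import numpy as np

rng = np.random.default_rng(2)

def outcome_step(y, treated):
    # Placeholder outcome dynamics (assumed, not the paper's model).
    return 0.9 * y + 1.5 * treated + rng.normal(0, 0.1)

def policy(y, threshold):
    # Stochastic policy: treatment is more likely when the outcome is below the threshold.
    return rng.random() < 1 / (1 + np.exp(y - threshold))

def rollout(policy_threshold, y0=0.0, horizon=20):
    y, path = y0, []
    for _ in range(horizon):
        a = policy(y, policy_threshold)
        y = outcome_step(y, a)
        path.append(y)
    return np.array(path)

factual = rollout(policy_threshold=0.0)
counterfactual = rollout(policy_threshold=2.0)   # intervene on the policy itself
print(factual.mean(), counterfactual.mean())
```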
We consider the estimation of average treatment effects in observational studies and propose a new framework of robust causal inference with unobserved confounders. Our approach is based on distributionally robust optimization and proceeds in two steps. We first specify the maximal degree to which the distribution of unobserved potential outcomes may deviate from that of observed outcomes. We then derive sharp bounds on the average treatment effects under this assumption. Our framework encompasses the popular marginal sensitivity model as a special case, and we demonstrate how the proposed methodology can address a primary challenge of the marginal sensitivity model, namely that it produces uninformative results when unobserved confounders substantially affect both treatment and outcome. Specifically, we develop an alternative sensitivity model, called the distributional sensitivity model, under the assumption that the heterogeneity of treatment effects due to unobserved variables is relatively small. Unlike the marginal sensitivity model, the distributional sensitivity model allows for potential lack of overlap and often produces informative bounds even when unobserved variables substantially affect both treatment and outcome. Finally, we show how to extend the distributional sensitivity model to difference-in-differences designs and settings with instrumental variables. Through simulation and empirical studies, we demonstrate the applicability of the proposed methodology.
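For orientation, here is a simplified, non-sharp sketch of the classic marginal sensitivity bound, the special case mentioned above; it is not the distributional sensitivity model proposed in the paper, and the non-normalized estimator form and inputs are assumptions.

```python
# Hedged sketch: bounds on E[Y(1)] under a marginal-sensitivity-style constraint that
# the true propensity odds lie within a factor gamma of the estimated odds.
import numpy as np

def msm_bounds_y1(y, a, e_hat, gamma):
    """Non-normalized IPW bounds on E[Y(1)]; repeat analogously for E[Y(0)] to bound the ATE."""
    odds = (1 - e_hat) / e_hat
    w_lo = 1 + odds / gamma        # lower bound on 1 / e_true
    w_hi = 1 + odds * gamma        # upper bound on 1 / e_true
    treated = a == 1
    yt = y[treated]
    lower = np.mean(np.where(yt >= 0, yt * w_lo[treated], yt * w_hi[treated])) * treated.mean()
    upper = np.mean(np.where(yt >= 0, yt * w_hi[treated], yt * w_lo[treated])) * treated.mean()
    return lower, upper
```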
Heterogeneous treatment effects (HTE) based on patients' genetic or clinical factors are of significant interest in precision medicine. Simultaneously modeling HTE and the corresponding main effects for randomized clinical trials with high-dimensional predictive markers is challenging. Motivated by the modified covariates approach, we propose a two-stage statistical learning procedure for estimating HTE with optimal efficiency augmentation, generalizing to arbitrary interaction models and exploiting powerful extreme gradient boosting trees (XGBoost). Target estimands for HTE are defined on the scale of mean difference for quantitative outcomes, or risk ratio for binary outcomes, and are the minimizers of specialized loss functions. The first stage estimates the main effect of the baseline markers on the outcome, which is then used as an augmentation term in the second-stage estimation of HTE. The proposed two-stage procedure is robust to mis-specification of the main-effect model and improves efficiency for estimating HTE through nonparametric function estimation, e.g., XGBoost. A permutation test is proposed for a global assessment of evidence for HTE. An analysis of a genetic study in the Prostate Cancer Prevention Trial, led by the SWOG Cancer Research Network, is conducted to showcase the properties and utility of the two-stage method.
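A hedged sketch of the two-stage idea in a randomized trial follows, using a transformed-outcome flavor rather than the paper's exact loss functions: stage one fits the main effect with XGBoost, stage two regresses the main-effect-augmented transformed outcome on the markers. The simulated data, tuning parameters, and in-sample fitting (no cross-fitting) are simplifying assumptions.

```python
# Hedged sketch: two-stage, main-effect-augmented HTE estimation with XGBoost in an RCT.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(3)
n, p, pi = 4000, 10, 0.5                      # pi: known randomization probability
X = rng.normal(size=(n, p))
A = rng.binomial(1, pi, n)
tau = 0.5 * X[:, 0]                           # true HTE (mean-difference scale)
main = X[:, 1] + np.sin(X[:, 2])              # main effect of the baseline markers
Y = main + A * tau + rng.normal(0, 1, n)

# Stage 1: estimate the main effect m(X) = E[Y | X].
m_hat = XGBRegressor(n_estimators=200, max_depth=3).fit(X, Y).predict(X)

# Stage 2: regress the augmented transformed outcome on X; its conditional mean is tau(X)
# even if m_hat is mis-specified, because (A - pi) has mean zero given X.
pseudo = (A - pi) / (pi * (1 - pi)) * (Y - m_hat)
tau_hat = XGBRegressor(n_estimators=200, max_depth=3).fit(X, pseudo).predict(X)
print(np.corrcoef(tau, tau_hat)[0, 1])
```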
We study the problem of robust distribution estimation under the Wasserstein metric, a popular discrepancy measure between probability distributions rooted in optimal transport (OT) theory. We introduce a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from its input distributions, and show that minimum distance estimation under $\mathsf{W}_p^\varepsilon$ achieves minimax optimal robust estimation risk. Our analysis is rooted in several new results for partial OT, including an approximate triangle inequality, which may be of independent interest. To address computational tractability, we derive a dual formulation for $\mathsf{W}_p^\varepsilon$ that adds a simple penalty term to the classic Kantorovich dual objective. As such, $\mathsf{W}_p^\varepsilon$ can be implemented via an elementary modification to standard, duality-based OT solvers. Our results are extended to sliced OT, where distributions are projected onto low-dimensional subspaces, and applications to homogeneity and independence testing are explored. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets.
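As a rough numerical illustration of removing epsilon outlier mass, one can use partial optimal transport in the POT library; this is a proxy for the construction above, not the paper's penalized dual solver, and the contamination setup is invented for the example.

```python
# Hedged sketch: classic OT cost blows up under contamination, while partial OT that
# transports only 1 - eps of the mass remains stable.
import numpy as np
import ot                                    # pip install pot
from ot.partial import partial_wasserstein

rng = np.random.default_rng(4)
n, eps = 200, 0.05
clean = rng.normal(0, 1, (n, 2))
contaminated = np.vstack([rng.normal(0, 1, (int(n * (1 - eps)), 2)),
                          rng.normal(20, 1, (int(n * eps), 2))])   # 5% far-away outliers

a = np.ones(len(clean)) / len(clean)
b = np.ones(len(contaminated)) / len(contaminated)
M = ot.dist(clean, contaminated)             # squared Euclidean cost matrix

full_cost = ot.emd2(a, b, M)                 # classic cost, inflated by the outliers
plan = partial_wasserstein(a, b, M, m=1 - eps)   # transport only 1 - eps of the mass
partial_cost = np.sum(plan * M) / (1 - eps)
print(full_cost, partial_cost)
```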
Predictive models -- such as those from machine learning -- can underpin causal inference, estimating the effects of an intervention at the population or individual level. This opens the door to a plethora of models, useful for matching the increasing complexity of health data, but also the Pandora's box of model selection: which of these models yields the most valid causal estimates? Classic machine-learning cross-validation procedures are not directly applicable. Indeed, an appropriate selection procedure for causal inference should weight the outcome errors equally for each individual, treated or not treated, whereas one of the two outcomes may be seldom observed in a sub-population. We study how more elaborate risks benefit causal model selection. We show theoretically that simple risks are brittle to weak overlap between treated and non-treated individuals as well as to heterogeneous errors between populations. Rather, a more elaborate metric, the R-risk, appears as a proxy of the oracle error on causal estimates, observable at the cost of an overlap re-weighting. As the R-risk is defined not only from model predictions but also using the conditional mean outcome and the treatment probability, using it for model selection requires adapting cross-validation. Extensive experiments show that the resulting procedure gives the best causal model selection.
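The empirical R-risk itself is simple once nuisance estimates are in hand; a minimal sketch is below, assuming cross-fitted estimates of the conditional mean outcome and the treatment probability are available (the variable names and candidate models are hypothetical).

```python
# Hedged sketch: empirical R-risk for selecting among candidate CATE models.
import numpy as np

def r_risk(y, a, m_hat, e_hat, tau_pred):
    """Mean of ((Y - m_hat(X)) - (A - e_hat(X)) * tau_pred(X))^2, estimated on held-out folds."""
    return np.mean(((y - m_hat) - (a - e_hat) * tau_pred) ** 2)

# Model selection: keep the candidate with the smallest R-risk, e.g.
# best = min(candidates, key=lambda f: r_risk(y, a, m_hat, e_hat, f.predict(X)))
```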
Causal inference on populations embedded in social networks poses technical challenges, since the typical no-interference assumption may no longer hold. For instance, in the context of social research, the outcome of a study unit will likely be affected by an intervention or treatment received by close neighbors. While inverse probability-of-treatment weighted (IPW) estimators have been developed for this setting, they are often highly inefficient. In this work, we assume that the network is a union of disjoint components and propose doubly robust (DR) estimators that combine models for treatment and outcome and are consistent and asymptotically normal if either model is correctly specified. We present empirical results that illustrate the DR property and the efficiency gain of DR over IPW estimators when both the outcome and treatment models are correctly specified. Simulations are conducted for networks with equal and unequal component sizes and outcome data with and without a multilevel structure. We apply these methods in an illustrative analysis using the Add Health network, examining the direct and indirect effects of maternal college education on adolescent school performance.
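As a reference point, the standard no-interference AIPW/DR estimator is sketched below; the paper's estimators extend this idea to neighborhood treatments within network components, which this sketch does not attempt.

```python
# Hedged sketch: classical AIPW doubly robust ATE estimator under no interference.
import numpy as np

def aipw_ate(y, a, e_hat, mu1_hat, mu0_hat):
    """Consistent if either the outcome regressions (mu1_hat, mu0_hat) or the propensity (e_hat) is correct."""
    return np.mean(mu1_hat - mu0_hat
                   + a * (y - mu1_hat) / e_hat
                   - (1 - a) * (y - mu0_hat) / (1 - e_hat))
```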
A fundamental problem in manifold learning is to approximate a functional relationship in data chosen randomly from a probability distribution supported on a low-dimensional sub-manifold of a high-dimensional ambient Euclidean space. The manifold is essentially defined by the data set itself and, typically, designed so that the data are dense on the manifold in some sense. The notion of a data space is an abstraction of a manifold encapsulating the essential properties that allow for function approximation. The problem of transfer learning (meta-learning) is to use the learning of a function on one data set to learn a similar function on a new data set. In terms of function approximation, this means lifting a function on one data space (the base data space) to another (the target data space). This viewpoint enables us to connect some inverse problems in applied mathematics (such as the inverse Radon transform) with transfer learning. In this paper we examine the question of such lifting when the data are assumed to be known only on a part of the base data space. We are interested in determining subsets of the target data space on which the lifting can be defined, and in how the local smoothness of the function and its lifting are related.
Regression models that ignore measurement error in predictors may produce highly biased estimates, leading to erroneous inferences. It is well known that it is extremely difficult to account for measurement error in Gaussian nonparametric regression, and the problem becomes tremendously more difficult for other families such as logistic, Poisson, and negative-binomial regression. For the first time, we present a method that corrects for measurement error when estimating regression functions flexibly, covering virtually all distributions and link functions regularly considered in generalized linear models. The approach relies on approximating the first and second moments of the response after integrating out the true unobserved predictors in a semiparametric generalized linear model. Unlike previous methods, this method is not restricted to truncated splines and can utilize various basis functions. Through extensive simulation studies, we assess the performance of our method under many scenarios.
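The attenuation problem, and a classical partial remedy, can be seen in a few lines; the sketch below uses regression calibration for a parametric logistic model purely as a familiar baseline, and is not the semiparametric moment-based method proposed in the paper. The simulation settings are assumptions.

```python
# Hedged sketch: measurement error attenuates the logistic slope; regression calibration
# (replacing the noisy predictor with E[X | W]) substantially reduces the bias.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n, beta, sx2, su2 = 20000, 1.0, 1.0, 0.5
x = rng.normal(0, np.sqrt(sx2), n)            # true predictor (unobserved in practice)
w = x + rng.normal(0, np.sqrt(su2), n)        # error-prone measurement
y = rng.binomial(1, 1 / (1 + np.exp(-beta * x)))

naive = LogisticRegression(C=1e6).fit(w.reshape(-1, 1), y).coef_[0, 0]
x_hat = (sx2 / (sx2 + su2)) * w               # regression calibration: E[X | W] with known variances
corrected = LogisticRegression(C=1e6).fit(x_hat.reshape(-1, 1), y).coef_[0, 0]
print(f"naive {naive:.2f}, regression-calibrated {corrected:.2f}, true {beta}")
```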
Modern neural network training relies heavily on data augmentation for improved generalization. After the initial success of label-preserving augmentations, there has been a recent surge of interest in label-perturbing approaches, which combine features and labels across training samples to smooth the learned decision surface. In this paper, we propose a new augmentation method that leverages the first and second moments extracted and re-injected by feature normalization. We replace the moments of the learned features of one training image by those of another, and also interpolate the target labels. As our approach is fast, operates entirely in feature space, and mixes different signals than prior methods, one can effectively combine it with existing augmentation methods. We demonstrate its efficacy across benchmark data sets in computer vision, speech, and natural language processing, where it consistently improves the generalization performance of highly competitive baseline networks.
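A minimal feature-space sketch of the idea follows, stated in numpy on a batch of flattened features; the actual method operates on intermediate network activations through a normalization layer, and the mixing coefficient and helper below are illustrative assumptions.

```python
# Hedged sketch: swap one sample's per-feature-map mean/std with another's and mix the labels.
import numpy as np

def moment_exchange(feats, labels, lam=0.9, seed=0):
    """feats: (batch, features); labels: (batch, classes) one-hot. Returns an augmented pair."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(feats))
    mu = feats.mean(axis=1, keepdims=True)
    sigma = feats.std(axis=1, keepdims=True) + 1e-6
    # Normalize with each sample's own moments, then re-inject the permuted sample's moments.
    swapped = (feats - mu) / sigma * sigma[perm] + mu[perm]
    mixed_labels = lam * labels + (1 - lam) * labels[perm]
    return swapped, mixed_labels
```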