Large observational data are increasingly available in disciplines such as health, economics, and the social sciences, where researchers are interested in causal questions rather than prediction. In this paper, we examine the problem of estimating heterogeneous treatment effects using non-parametric regression-based methods, starting from an empirical study aimed at investigating the effect of participation in school meal programs on health indicators. First, we introduce the setup and the issues related to conducting causal inference with observational or non-fully randomized data, and how these issues can be tackled with the help of statistical learning tools. Then, we review and develop a unifying taxonomy of the existing state-of-the-art frameworks that allow for individual treatment effect estimation via non-parametric regression models. After presenting a brief overview of the problem of model selection, we illustrate the performance of some of the methods in three different simulation studies. We conclude by demonstrating the use of some of the methods in an empirical analysis of the school meal program data.
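As a concrete illustration of what "individual treatment effect estimation via non-parametric regression models" can look like, the following minimal sketch implements a simple T-learner with random forests on hypothetical simulated data; it is an illustrative example, not the paper's own analysis of the school meal program.

```python
# Minimal T-learner sketch: nonparametric regression-based CATE estimation.
# Hypothetical simulated data; not the school meal program analysis.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 2000, 5
X = rng.normal(size=(n, p))
T = rng.binomial(1, 0.5, size=n)                    # randomized treatment
tau = 1.0 + X[:, 0]                                 # true heterogeneous effect
Y = X[:, 1] + tau * T + rng.normal(size=n)          # outcome

# Fit separate outcome regressions on treated and control units.
m1 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[T == 1], Y[T == 1])
m0 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[T == 0], Y[T == 0])

# Individual treatment effect estimates: difference of predicted outcomes.
tau_hat = m1.predict(X) - m0.predict(X)
print("mean estimated effect:", tau_hat.mean().round(2))
```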
Accelerometers enable objective measurement of physical activity levels among groups of individuals in free-living environments, providing high-resolution detail about changes in physical activity at different time scales. Current approaches in the literature for analyzing such data typically rely on summary measures such as total inactivity time or compositional metrics. However, at the conceptual level, these methods have the potential disadvantage of discarding important information from the recorded data, since the summaries and metrics typically depend on cut-offs related to exercise intensity zones that are chosen subjectively or even arbitrarily. Furthermore, much of the data collected in these studies follow complex survey designs, so estimation strategies adapted to the particular sampling mechanism are required. The aim of this paper is two-fold. First, a new functional representation of accelerometer data of a distributional nature is introduced to build a complete individualized profile of each subject's physical activity levels. Second, we extend two nonparametric functional regression models, kernel smoothing and kernel ridge regression, to handle survey data and obtain reliable conclusions about the influence of physical activity in the different analyses performed on the NHANES cohort, which follows a complex sampling design, thereby demonstrating the advantages of the proposed representation.
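To fix ideas, the sketch below shows one simple way a Nadaraya-Watson kernel smoother can incorporate survey weights: each observation's kernel contribution is multiplied by its design weight. The data, the scalar predictor (standing in for the functional activity profile), and the weights are all hypothetical; this is not the exact estimator or NHANES processing pipeline of the paper.

```python
# Survey-weighted Nadaraya-Watson kernel smoother (illustrative sketch).
import numpy as np

def nw_weighted(x0, x, y, w, h):
    """Kernel estimate of E[y | x = x0] with a Gaussian kernel and survey weights w."""
    k = np.exp(-0.5 * ((x - x0) / h) ** 2)   # Gaussian kernel evaluations
    return np.sum(w * k * y) / np.sum(w * k)

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 10, n)                    # scalar summary of the activity profile
y = np.sin(x) + rng.normal(scale=0.3, size=n)
w = rng.uniform(0.5, 3.0, n)                 # hypothetical sampling (design) weights

grid = np.linspace(0, 10, 5)
print([round(nw_weighted(g, x, y, w, h=0.5), 2) for g in grid])
```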
The Student-$t$ distribution is widely used in statistical modeling of datasets involving outliers, since its longer-than-normal tails provide a robust way to handle such data. Furthermore, data collected over time may contain censored or missing observations, making it impossible to use standard statistical procedures. This paper proposes an algorithm to estimate the parameters of a censored linear regression model when the regression errors are autocorrelated and the innovations follow a Student-$t$ distribution. To fit the proposed model, maximum likelihood estimates are obtained via the SAEM algorithm, a stochastic approximation of the EM algorithm that is useful for models in which the E-step does not have an analytic form. The methods are illustrated by the analysis of a real dataset containing left-censored and missing observations. We also conducted two simulation studies to examine the asymptotic properties of the estimates and the robustness of the model.
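The SAEM recursion alternates a simulation step for the unobserved (censored) responses with a stochastic-approximation update of the complete-data sufficient statistics, followed by a standard M-step. The sketch below illustrates this loop for a deliberately simplified model, a left-censored linear regression with i.i.d. normal errors, rather than the Student-$t$ autoregressive model of the paper.

```python
# SAEM sketch for left-censored linear regression with iid normal errors
# (a simplification of the paper's Student-t autoregressive model).
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true, cens = np.array([1.0, 2.0]), 1.0, 0.5
y_full = X @ beta_true + sigma_true * rng.normal(size=n)
cens_ind = y_full < cens                       # left-censored observations
y_obs = np.where(cens_ind, cens, y_full)       # only the detection limit is recorded

beta, sigma2 = np.zeros(2), 1.0
s1, s2 = np.zeros(2), 0.0                      # SA estimates of sufficient statistics
for it in range(200):
    gamma = 1.0 if it < 50 else 1.0 / (it - 49)    # step sizes: burn-in, then decreasing
    # S-step: impute censored y from the normal distribution truncated above at `cens`.
    mu = X @ beta
    b = (cens - mu[cens_ind]) / np.sqrt(sigma2)
    y_imp = y_obs.copy()
    y_imp[cens_ind] = truncnorm.rvs(-np.inf, b, loc=mu[cens_ind],
                                    scale=np.sqrt(sigma2), random_state=rng)
    # SA-step: update the sufficient statistics X'y and y'y.
    s1 = s1 + gamma * (X.T @ y_imp - s1)
    s2 = s2 + gamma * (y_imp @ y_imp - s2)
    # M-step: closed-form updates given the approximated statistics.
    beta = np.linalg.solve(X.T @ X, s1)
    sigma2 = (s2 - 2 * beta @ s1 + beta @ X.T @ X @ beta) / n

print(np.round(beta, 2), round(np.sqrt(sigma2), 2))
```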
Across research disciplines, cluster randomized trials (CRTs) are commonly implemented to evaluate interventions delivered to groups of participants, such as communities and clinics. Despite advances in the design and analysis of CRTs, several challenges remain. First, there are many possible ways to specify the intervention effect (e.g., at the individual level or at the cluster level). Second, the theoretical and practical performance of common methods for CRT analysis remains poorly understood. Here, we use causal models to formally define an array of causal effects as summary measures of counterfactual outcomes. Next, we provide a comprehensive overview of well-known CRT estimators, including the t-test and generalized estimating equations (GEE), as well as lesser-known methods, including augmented GEE and targeted maximum likelihood estimation (TMLE). In finite-sample simulations, we illustrate the performance of these estimators and the importance of effect specification, especially when cluster size varies. Finally, our application to data from the Preterm Birth Initiative (PTBi) study demonstrates the real-world importance of selecting an analytic approach corresponding to the research question. Given its flexibility to estimate a variety of effects and its ability to adaptively adjust for covariates for precision gains while maintaining Type I error control, we conclude that TMLE is a promising tool for CRT analysis.
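To make the effect-specification issue concrete, the following sketch (on hypothetical simulated CRT data) contrasts a cluster-level effect, which weights each cluster equally, with an individual-level effect, which weights clusters by their size; the two can differ noticeably when cluster size varies and is related to the effect.

```python
# Individual-level vs cluster-level effect specification in a simulated CRT.
import numpy as np

rng = np.random.default_rng(3)
J = 40                                           # number of clusters
size = rng.integers(10, 200, size=J)             # variable cluster sizes
A = rng.binomial(1, 0.5, size=J)                 # cluster-level treatment assignment
effect = 0.5 + 0.005 * size                      # effect depends on cluster size (informative)

y_bar = []                                       # observed cluster means
for j in range(J):
    y = rng.normal(loc=A[j] * effect[j], scale=1.0, size=size[j])
    y_bar.append(y.mean())
y_bar = np.array(y_bar)

# Cluster-level effect: difference of unweighted means of cluster means.
cluster_level = y_bar[A == 1].mean() - y_bar[A == 0].mean()
# Individual-level effect: cluster means weighted by cluster size (pooled individuals).
individual_level = (np.average(y_bar[A == 1], weights=size[A == 1])
                    - np.average(y_bar[A == 0], weights=size[A == 0]))
print(round(cluster_level, 2), round(individual_level, 2))
```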
Autoregressive models are a class of time series models that are important in both applied and theoretical statistics. Typically, inferential devices such as confidence sets and hypothesis tests for time series models require nuanced asymptotic arguments and constructions. We present a simple alternative to such arguments that allows for the construction of finite-sample valid inferential devices, using a data splitting approach. We prove the validity of our constructions, as well as the validity of related sequential inference tools. A set of simulation studies is presented to demonstrate the applicability of our methodology.
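One concrete way to turn data splitting into finite-sample valid inference is a split likelihood-ratio (universal-inference-style) construction: estimate the parameter on the first half of the series, then invert the likelihood ratio computed on the second half. The sketch below applies this idea to the AR(1) coefficient with the innovation variance plugged in for simplicity; it is an illustrative construction under these assumptions and need not coincide with the authors' exact procedure.

```python
# Split likelihood-ratio confidence set for an AR(1) coefficient (illustrative sketch).
import numpy as np

rng = np.random.default_rng(4)
T, phi_true = 400, 0.6
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi_true * y[t - 1] + rng.normal()

def cond_loglik(phi, sigma2, y):
    """Gaussian AR(1) log-likelihood conditional on the first observation."""
    resid = y[1:] - phi * y[:-1]
    return -0.5 * len(resid) * np.log(2 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2

# Split: estimate (phi, sigma2) on the first half by conditional least squares.
y0, y1 = y[: T // 2], y[T // 2 :]
phi_hat = (y0[1:] @ y0[:-1]) / (y0[:-1] @ y0[:-1])
sigma2_hat = np.mean((y0[1:] - phi_hat * y0[:-1]) ** 2)

# Confidence set: keep phi0 whose split likelihood ratio on the second half is <= 1/alpha
# (the variance is plugged in rather than profiled, purely for simplicity).
alpha = 0.05
grid = np.linspace(-0.99, 0.99, 199)
lr = np.array([cond_loglik(phi_hat, sigma2_hat, y1) - cond_loglik(p, sigma2_hat, y1)
               for p in grid])
conf_set = grid[lr <= np.log(1 / alpha)]
print(round(conf_set.min(), 2), round(conf_set.max(), 2))
```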
Simulation studies are commonly used to evaluate the performance of newly developed meta-analysis methods. For methodology developed for aggregated data meta-analysis, researchers often resort to simulating the aggregated data directly, instead of simulating the individual participant data from which the aggregated data would be calculated in reality. Clearly, distributional characteristics of the aggregated data statistics can be derived from distributional assumptions on the underlying individual data, but these assumptions are often not made explicit in publications. This paper provides the distribution of the aggregated data statistics derived from a heteroscedastic mixed effects model for continuous individual data. As a result, we provide a procedure for directly simulating the aggregated data statistics. We also compare our distributional findings with other simulation approaches for aggregated data used in the literature, by describing their theoretical differences and by conducting a simulation study for three meta-analysis methods: DerSimonian and Laird's pooled estimate, and the Trim & Fill and PET-PEESE methods for adjustment of publication bias. We demonstrate that the choice of simulation model for aggregated data may have a relevant impact on the performance of the meta-analysis method and on the conclusions drawn from it. We recommend either using multiple aggregated data simulation models when investigating new methodology, in order to assess sensitivity, or making explicit the individual participant data model that leads to the distributional choices for the aggregated data statistics used in the simulation.
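Under a normal model for the individual data, the per-study sample mean and sample variance have known sampling distributions (normal and scaled chi-square, respectively), so the aggregated statistics can be drawn directly instead of simulating every participant. The sketch below contrasts the two routes for a single study arm with fixed, hypothetical parameters; the paper's heteroscedastic mixed effects model adds further structure not shown here.

```python
# Simulating aggregated data: from individual participant data vs directly.
import numpy as np

rng = np.random.default_rng(5)
mu_i, sigma_i, n_i = 0.3, 1.2, 50        # study-specific mean, SD, sample size (hypothetical)

# Route 1: simulate the individual participant data, then aggregate.
y = rng.normal(mu_i, sigma_i, size=n_i)
mean_ipd, var_ipd = y.mean(), y.var(ddof=1)

# Route 2: simulate the aggregated statistics directly from their sampling distributions:
#   sample mean     ~ Normal(mu_i, sigma_i^2 / n_i)
#   sample variance ~ sigma_i^2 * ChiSq(n_i - 1) / (n_i - 1)
mean_agg = rng.normal(mu_i, sigma_i / np.sqrt(n_i))
var_agg = sigma_i ** 2 * rng.chisquare(n_i - 1) / (n_i - 1)

print(round(mean_ipd, 2), round(var_ipd, 2), round(mean_agg, 2), round(var_agg, 2))
```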
Understanding treatment effect heterogeneity in observational studies is of great practical importance to many scientific fields. Quantile regression provides a natural framework for modeling such heterogeneity. In this paper, we propose a new method for inference on heterogeneous quantile treatment effects in the presence of high-dimensional covariates. Our estimator combines an $\ell_1$-penalized regression adjustment with a quantile-specific bias correction scheme based on quantile regression rank scores. We present a comprehensive study of the theoretical properties of this estimator, including weak convergence of the heterogeneous quantile treatment effect process to a Gaussian process. We illustrate the finite-sample performance of our approach through Monte Carlo experiments and an empirical example concerning the differential effect of statin usage on lowering low-density lipoprotein cholesterol levels among Alzheimer's disease patients who participated in the UK Biobank study.
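The regression-adjustment component can be illustrated with off-the-shelf $\ell_1$-penalized quantile regression: fit penalized conditional quantile models separately in the treated and control arms and difference their predictions. The rank-score bias correction that delivers valid inference is omitted, so this is only a plug-in sketch on hypothetical data.

```python
# Plug-in sketch of l1-penalized quantile regression adjustment for a quantile
# treatment effect (the paper's rank-score bias correction is not shown).
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(6)
n, p, tau = 1000, 50, 0.5
X = rng.normal(size=(n, p))
T = rng.binomial(1, 0.5, size=n)
Y = X[:, 0] + T * (1.0 + 0.5 * X[:, 1]) + rng.normal(size=n)

# l1-penalized conditional quantile models in each arm.
q1 = QuantileRegressor(quantile=tau, alpha=0.01, solver="highs").fit(X[T == 1], Y[T == 1])
q0 = QuantileRegressor(quantile=tau, alpha=0.01, solver="highs").fit(X[T == 0], Y[T == 0])

# Heterogeneous quantile treatment effect estimates at the observed covariates.
qte_hat = q1.predict(X) - q0.predict(X)
print(round(qte_hat.mean(), 2))
```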
Personalized decision-making, which aims to derive optimal individualized treatment rules (ITRs) based on individual characteristics, has recently attracted increasing attention in many fields, such as medicine, social services, and economics. The current literature mainly focuses on estimating ITRs from a single source population. In real-world applications, however, the distribution of a target population can differ from that of the source population, so ITRs learned by existing methods may not generalize well to the target population. Due to privacy concerns and other practical issues, individual-level data from the target population are often not available, which makes ITR learning more challenging. We consider an ITR estimation problem where the source and target populations may be heterogeneous, individual-level data are available from the source population, and only summary information about covariates, such as moments, is accessible from the target population. We develop a weighting framework that tailors an ITR to a given target population by leveraging the available summary statistics. Specifically, we propose a calibrated augmented inverse probability weighted estimator of the value function for the target population and estimate an optimal ITR by maximizing this estimator within a class of pre-specified ITRs. We show that the proposed calibrated estimator is consistent and asymptotically normal even with flexible semi/nonparametric models for nuisance function approximation, and that the variance of the value estimator can be consistently estimated. We demonstrate the empirical performance of the proposed method using simulation studies and a real application with an eICU dataset as the source sample and a MIMIC-III dataset as the target sample.
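A core ingredient is a set of calibration weights on the source sample that match the target population's covariate summary statistics (e.g., first moments). The sketch below solves an entropy-tilting calibration and plugs the weights into an inverse-probability-weighted value estimate of a fixed rule; the augmentation terms and nuisance models of the full calibrated AIPW estimator are omitted, and all data and parameter values are hypothetical.

```python
# Calibration weighting of the source sample to target covariate moments,
# then a weighted value estimate of a fixed ITR (simplified: no augmentation).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n, p = 2000, 3
X = rng.normal(size=(n, p))                          # source covariates
A = rng.binomial(1, 0.5, size=n)                     # randomized treatment in the source
Y = X @ np.array([1.0, 0.5, 0.0]) + A * (0.5 + X[:, 0]) + rng.normal(size=n)
target_means = np.array([0.5, 0.0, -0.2])            # summary statistics from the target

# Entropy-tilting calibration: weights w_i proportional to exp(lam' X_i) whose weighted
# covariate mean equals target_means; solve the convex dual for lam.
def dual(lam):
    return np.log(np.mean(np.exp(X @ lam))) - lam @ target_means

lam = minimize(dual, np.zeros(p)).x
w = np.exp(X @ lam)
w = w / w.mean()

# Value of a fixed ITR d(X) under the target population, via calibrated IPW.
d = (X[:, 0] > 0).astype(int)                        # example rule: treat if X1 > 0
e = 0.5                                              # known propensity in the source
value = np.mean(w * Y * (A == d) / np.where(d == 1, e, 1 - e))
print(round(value, 2))
```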
The selection of essential variables in logistic regression is vital because of the model's extensive use in medical studies, finance, economics, and related fields. In this paper, we explore four main typologies (test-based, penalty-based, screening-based, and tree-based) of frequentist variable selection methods in the logistic regression setup. The primary objective of this work is to give practitioners a comprehensive overview of the existing literature. Underlying assumptions and theory, along with the specifics of implementation, are detailed as well. Next, we conduct a thorough simulation study to explore the performance of fifteen different methods in terms of variable selection, estimation of coefficients, prediction accuracy, and time complexity under various settings. We consider low-, moderate-, and high-dimensional setups and different correlation structures for the covariates. A real-life application using high-dimensional gene expression data is also included to further assess the efficacy and consistency of the methods. Finally, based on our findings from the simulated and real data, we provide recommendations for practitioners on the choice of variable selection methods in various contexts.
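As a small illustration of two of the typologies on hypothetical data, the sketch below contrasts a penalty-based selector ($\ell_1$-regularized logistic regression) with a screening-based selector (univariate tests followed by thresholding); the specific tuning values are arbitrary.

```python
# Penalty-based vs screening-based variable selection in logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(8)
n, p = 300, 100
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.5                                       # only the first 5 covariates matter
prob = 1 / (1 + np.exp(-(X @ beta)))
y = rng.binomial(1, prob)

# Penalty-based: l1-penalized logistic regression; selected = nonzero coefficients.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.2).fit(X, y)
selected_lasso = np.flatnonzero(lasso.coef_[0])

# Screening-based: keep the k covariates with the strongest univariate association.
screen = SelectKBest(f_classif, k=10).fit(X, y)
selected_screen = np.flatnonzero(screen.get_support())

print("lasso:", selected_lasso[:10], " screening:", selected_screen)
```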
The average treatment effect (ATE) measures the change in overall social welfare; yet even if the ATE is positive, there is a risk that the treatment has a negative effect on, say, some 10% of the population. Assessing such risk is difficult, however, because any one individual treatment effect (ITE) is never observed, so the 10% worst-affected cannot be identified, while distributional treatment effects only compare the first deciles within each treatment group, which does not correspond to any 10%-subpopulation. In this paper we consider how to nonetheless assess this important risk measure, formalized as the conditional value at risk (CVaR) of the ITE distribution. We leverage the availability of pre-treatment covariates and characterize the tightest-possible upper and lower bounds on the ITE-CVaR given by the covariate-conditional average treatment effect (CATE) function. Some of these bounds can also be interpreted as summarizing a complex CATE function into a single metric and are of interest independently of being bounds. We then study how to estimate these bounds efficiently from data and how to construct confidence intervals. This is challenging even in randomized experiments, as it requires understanding the distribution of the unknown CATE function, which can be very complex if rich covariates are used so as to best control for heterogeneity. We develop a debiasing method that overcomes this challenge and prove that it enjoys favorable statistical properties even when the CATE and other nuisances are estimated by black-box machine learning or even inconsistently. Studying a hypothetical change to French job-search counseling services, our bounds and inference demonstrate that a small social benefit can entail a negative impact on a substantial subpopulation.
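As a plug-in illustration of the CATE-based summary mentioned above, the sketch below estimates a CATE function on hypothetical data and computes the conditional value at risk of its distribution, i.e., the average effect over the worst-affected $\alpha$-fraction as ranked by the estimated CATE. The paper's debiased estimator and confidence intervals are not shown.

```python
# Plug-in CVaR of an estimated CATE distribution (debiased estimation not shown).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(9)
n, p, alpha = 4000, 5, 0.10
X = rng.normal(size=(n, p))
T = rng.binomial(1, 0.5, size=n)
tau = 0.3 + X[:, 0]                                   # true CATE; some individuals are harmed
Y = X[:, 1] + tau * T + rng.normal(size=n)

# T-learner estimate of the CATE function.
m1 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[T == 1], Y[T == 1])
m0 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[T == 0], Y[T == 0])
tau_hat = m1.predict(X) - m0.predict(X)

# CVaR_alpha of the CATE distribution: mean effect over the worst-affected alpha-fraction.
k = int(np.ceil(alpha * n))
cvar_alpha = np.sort(tau_hat)[:k].mean()
print("ATE:", round(tau_hat.mean(), 2), " CVaR_0.10:", round(cvar_alpha, 2))
```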
Recently, there has been great interest in estimating the conditional average treatment effect using flexible machine learning methods. In practice, however, investigators often have working hypotheses about effect heterogeneity across pre-defined subgroups of individuals; we call this the groupwise approach. This paper compares two modern ways to estimate groupwise treatment effects, a nonparametric approach and a semiparametric approach, with the goal of better informing practice. Specifically, we compare (a) the underlying assumptions, (b) efficiency characteristics, and (c) a way to combine the two approaches. We also discuss how to obtain cluster-robust standard errors when the study units in the same subgroups are not independent and identically distributed. We conclude by reanalyzing the National Educational Longitudinal Study of 1988.
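One common recipe for groupwise effects is to form doubly robust (AIPW) pseudo-outcomes and regress them on subgroup indicators; cluster-robust standard errors can then be requested from the regression fit when units within the same cluster (e.g., school) are dependent. The sketch below illustrates this on hypothetical data with a known propensity score; it is a generic construction, not the paper's exact estimator.

```python
# Groupwise treatment effects via AIPW pseudo-outcomes, with cluster-robust SEs.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(10)
n = 3000
cluster = rng.integers(0, 100, size=n)               # e.g., school identifiers
group = rng.integers(0, 3, size=n)                   # pre-defined subgroups
X = rng.normal(size=(n, 3))
e = 0.5                                              # known propensity (randomized design)
T = rng.binomial(1, e, size=n)
Y = X[:, 0] + (0.5 + 0.5 * group) * T + rng.normal(size=n)

# Outcome regressions for the two arms (any flexible learner can be plugged in).
m1 = GradientBoostingRegressor().fit(X[T == 1], Y[T == 1])
m0 = GradientBoostingRegressor().fit(X[T == 0], Y[T == 0])
mu1, mu0 = m1.predict(X), m0.predict(X)

# AIPW pseudo-outcome: its conditional mean given covariates is the treatment effect.
psi = (mu1 - mu0
       + T * (Y - mu1) / e
       - (1 - T) * (Y - mu0) / (1 - e))

df = pd.DataFrame({"psi": psi, "group": group.astype(str), "cluster": cluster})
# Regress pseudo-outcomes on subgroup indicators; request cluster-robust SEs.
fit = smf.ols("psi ~ C(group) - 1", data=df).fit(cov_type="cluster",
                                                 cov_kwds={"groups": df["cluster"]})
print(fit.params.round(2))
```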