Causal mediation analysis with random interventions has become an area of significant interest for understanding time-varying effects with longitudinal and survival outcomes. To tackle the causal and statistical challenges posed by complex longitudinal data structures with time-varying confounders, competing risks, and informative censoring, it is desirable to combine machine learning techniques with semiparametric theory. In this manuscript, we focus on targeted maximum likelihood estimation (TMLE) of longitudinal natural direct and indirect effects defined via random interventions. The proposed estimators are multiply robust and locally efficient, and they directly estimate and update the conditional densities that factorize the data likelihood. We use the highly adaptive lasso (HAL) and projection representations to derive new estimators (HAL-EIC) of the efficient influence curves for longitudinal mediation problems, and we propose a fast one-step TMLE algorithm based on HAL-EIC that preserves the asymptotic properties of the estimator. The proposed method generalizes to other longitudinal causal parameters that are smooth functions of the data likelihood, and thereby provides a novel and flexible statistical toolbox.
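For intuition only, the sketch below shows the generic TMLE targeting mechanics for a point-treatment mean counterfactual outcome, not the longitudinal mediation estimator or the HAL-EIC construction described in the abstract; the function name, learner choices, and truncation constants are illustrative assumptions.

```python
# Minimal TMLE sketch for psi = E[Y(1)] with Y pre-scaled to [0, 1].
# Illustrates the targeting (fluctuation) step only; names are ours.
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def tmle_mean_treated(W, A, Y):
    """W: (n, p) covariates, A: (n,) binary treatment, Y: (n,) outcome in [0, 1]."""
    n = len(Y)
    # Initial nuisance fits; a HAL-type learner would play this role in the abstract's approach.
    Q = GradientBoostingRegressor().fit(np.column_stack([A, W]), Y)
    g = GradientBoostingClassifier().fit(W, A)
    g1 = np.clip(g.predict_proba(W)[:, 1], 0.01, 0.99)            # P(A = 1 | W)
    QA = np.clip(Q.predict(np.column_stack([A, W])), 1e-3, 1 - 1e-3)
    Q1 = np.clip(Q.predict(np.column_stack([np.ones(n), W])), 1e-3, 1 - 1e-3)
    # Targeting step: logistic fluctuation along the clever covariate H = A / g1.
    H = (A / g1).reshape(-1, 1)
    eps = sm.GLM(Y, H, offset=np.log(QA / (1 - QA)),
                 family=sm.families.Binomial()).fit().params[0]
    Q1_star = 1.0 / (1.0 + np.exp(-(np.log(Q1 / (1 - Q1)) + eps / g1)))
    psi = Q1_star.mean()                                           # plug-in estimate
    eic = (A / g1) * (Y - Q1_star) + Q1_star - psi                 # efficient influence curve
    se = np.sqrt(np.var(eic) / n)
    return psi, se
```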
Functional quadratic regression models postulate a polynomial, rather than a linear, relationship between a scalar response and a functional predictor. As in functional linear regression, vertical and especially high-leverage outliers may affect the classical estimators, so robust procedures that provide reliable estimators in such situations are an important issue. Taking into account that the functional polynomial model is equivalent to a regression model that is a polynomial of the same order in the functional principal component scores of the predictor process, our proposal combines robust estimators of the principal directions with robust regression estimators based on a bounded loss function and a preliminary residual scale estimator. Fisher-consistency of the proposed method is derived under mild assumptions. The results of a numerical study show, for finite samples, the benefits of the robust proposal over the one based on sample principal directions and least squares. The usefulness of the proposed approach is also illustrated through the analysis of a real data set, which reveals that once the potential outliers are removed the classical and robust methods behave very similarly.
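For concreteness, the quadratic (order-2) case and the score-space equivalence invoked above can be written as follows; the notation is illustrative and assumes, for simplicity, a mean-zero predictor process.

```latex
% Quadratic functional regression model:
Y = \alpha + \int \beta(t)\,X(t)\,dt
    + \int\!\!\int \gamma(s,t)\,X(s)\,X(t)\,ds\,dt + \varepsilon .

% With X(t) = \sum_{j \ge 1} \xi_j \phi_j(t) expanded in the eigenbasis of its
% covariance operator, this becomes a quadratic regression in the functional
% principal component scores \xi_j:
Y = \alpha + \sum_{j \ge 1} \beta_j\,\xi_j
    + \sum_{j \ge 1}\sum_{k \ge 1} \gamma_{jk}\,\xi_j\,\xi_k + \varepsilon ,
\qquad
\beta_j = \int \beta(t)\,\phi_j(t)\,dt , \quad
\gamma_{jk} = \int\!\!\int \gamma(s,t)\,\phi_j(s)\,\phi_k(t)\,ds\,dt .
```

The proposal replaces the sample principal directions and least squares in this representation with robust counterparts.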
To navigate in an environment safely and autonomously, robots must accurately estimate where obstacles are and how they move. Instead of using expensive traditional 3D sensors, we explore the use of a much cheaper, faster, and higher resolution alternative: programmable light curtains. Light curtains are controllable depth sensors that sense only along a user-selected surface. We adapt a probabilistic method based on particle filters and occupancy grids to explicitly estimate the position and velocity of 3D points in the scene using partial measurements made by light curtains. The central challenge is to decide where to place the light curtain to accurately perform this task. We propose multiple curtain placement strategies guided by maximizing information gain and verifying predicted object locations. Then, we combine these strategies using an online learning framework. We propose a novel self-supervised reward function that evaluates the accuracy of current velocity estimates using future light curtain placements. We use a multi-armed bandit framework to intelligently switch between placement policies in real time, outperforming fixed policies. We develop a full-stack navigation system that uses position and velocity estimates from light curtains for downstream tasks such as localization, mapping, path-planning, and obstacle avoidance. This work paves the way for controllable light curtains to accurately, efficiently, and purposefully perceive and navigate complex and dynamic environments. Project website: //siddancha.github.io/projects/active-velocity-estimation/
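As a minimal sketch of the policy-switching idea, the snippet below runs a standard UCB1 bandit over a few hypothetical placement policies; the arm names and the simulated reward are placeholders for the paper's self-supervised reward, which is not reproduced here.

```python
# UCB1 bandit over curtain-placement policies (illustrative arm names).
import math
import random

POLICIES = ["info_gain", "verify_forecast", "random_scan"]  # hypothetical arms

def ucb1_select(counts, totals, t):
    """Pick the arm maximizing empirical mean reward + exploration bonus."""
    for arm, c in counts.items():
        if c == 0:
            return arm                    # play every arm once first
    return max(counts, key=lambda a: totals[a] / counts[a]
               + math.sqrt(2.0 * math.log(t) / counts[a]))

counts = {a: 0 for a in POLICIES}
totals = {a: 0.0 for a in POLICIES}
for t in range(1, 501):                   # one iteration per sensing cycle
    arm = ucb1_select(counts, totals, t)
    # In the real system, the chosen policy places the curtain and the
    # self-supervised reward scores the resulting velocity estimates.
    reward = random.random()              # simulated stand-in reward in [0, 1]
    counts[arm] += 1
    totals[arm] += reward
```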
In linear models, omitting a covariate that is orthogonal to the covariates in the model does not bias coefficient estimation. This does not hold in general for longitudinal data, where, beyond orthogonality between the omitted longitudinal covariates and those in the model, additional assumptions are needed for unbiased coefficient estimation. We propose methods that mitigate the omitted-variable bias under weaker assumptions. A two-step estimation procedure is proposed for inference about the asynchronous longitudinal covariates when such covariates are observed. For mixed synchronous and asynchronous longitudinal covariates, the two-step method attains the parametric rate of convergence for the coefficients of the synchronous longitudinal covariates. Extensive simulation studies provide numerical support for the theoretical findings. We illustrate the performance of our method on a dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study.
We consider varying-coefficient models for mixed synchronous and asynchronous longitudinal covariates, where asynchronicity refers to the misalignment of longitudinal measurement times within an individual. We propose three methods of parameter estimation and inference. The first is a one-step approach that estimates the non-parametric regression functions for the synchronous and asynchronous longitudinal covariates simultaneously. The second is a two-step approach in which the longitudinal response is first regressed on the centered synchronous longitudinal covariates and, in the second step, the residuals from the first step are regressed on the asynchronous longitudinal covariates. The third is the same as the second except that, in the first step, we omit the asynchronous longitudinal covariate and include a non-parametric intercept in the regression of the longitudinal response on the synchronous longitudinal covariates. We further construct simultaneous confidence bands for the non-parametric regression functions to quantify the overall magnitude of variation. Extensive simulation studies provide numerical support for the theoretical findings. The practical utility of the methods is illustrated on a dataset from the ADNI study.
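To make the two-step idea concrete, here is a deliberately simplified, constant-coefficient toy version: step 1 is a least-squares fit on the centered synchronous covariates, and step 2 regresses the residuals on the asynchronous covariate using kernel weights over the time gap between measurements. It is not the varying-coefficient estimator, bandwidth rule, or inference procedure analyzed in the paper, and all names are illustrative.

```python
# Toy two-step fit with kernel weighting for asynchronous measurements.
import numpy as np

def gauss_kernel(u, h):
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

def two_step(t_y, Y, X_sync, t_z, Z, h):
    """t_y, Y, X_sync: response times, responses, synchronous covariates
    (observed at the same times); t_z, Z: asynchronous covariate times and
    values; h: kernel bandwidth."""
    # Step 1: least squares of the centered response on centered synchronous covariates.
    Xc = X_sync - X_sync.mean(axis=0)
    beta = np.linalg.lstsq(Xc, Y - Y.mean(), rcond=None)[0]
    resid = Y - Y.mean() - Xc @ beta
    # Step 2: kernel-weighted least squares of the residuals on Z, pairing
    # each response time with each asynchronous measurement time.
    W = gauss_kernel(t_y[:, None] - t_z[None, :], h)       # (n_y, n_z) weights
    num = (W * (resid[:, None] * Z[None, :])).sum()
    den = (W * (Z[None, :] ** 2)).sum()
    gamma = num / den                                       # coefficient for Z
    return beta, gamma
```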
This chapter presents an overview of a specific form of limited dependent variable models, namely discrete choice models, where the dependent (response or outcome) variable takes values which are discrete, inherently ordered, and characterized by an underlying continuous latent variable. Within this setting, the dependent variable may take only two discrete values (such as 0 and 1), giving rise to binary models (e.g., probit and logit models), or more than two values (say $j=1,2, \ldots, J$, where $J$ is some integer, typically small), giving rise to ordinal models (e.g., ordinal probit and ordinal logit models). In these models, the primary goal is to model the probability of responses/outcomes conditional on the covariates. We connect the outcomes of a discrete choice model to the random utility framework in economics, discuss estimation techniques, present the calculation of covariate effects, and describe measures for assessing model fit. Some recent advances in discrete data modeling are also discussed. Following the theoretical review, we utilize the binary and ordinal models to analyze public opinion on marijuana legalization and the extent of legalization -- a socially relevant but controversial topic in the United States. We obtain several interesting results, including that past use of marijuana, beliefs about legalization, and political partisanship are important factors shaping public opinion.
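For reference, the latent-variable formulation underlying these models can be written as follows (standard textbook notation, not specific to this chapter's exposition), where $F$ is the standard normal cdf for probit models and the logistic cdf for logit models.

```latex
% Binary case: latent utility and observation rule.
z_i = x_i'\beta + \varepsilon_i, \qquad
y_i = \mathbf{1}\{z_i > 0\}, \qquad
P(y_i = 1 \mid x_i) = F(x_i'\beta).

% Ordinal case: cut-points -\infty = \gamma_0 < \gamma_1 < \cdots < \gamma_J = \infty.
y_i = j \iff \gamma_{j-1} < z_i \le \gamma_j, \qquad
P(y_i = j \mid x_i) = F(\gamma_j - x_i'\beta) - F(\gamma_{j-1} - x_i'\beta),
\quad j = 1, \ldots, J.
```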
To achieve unbiased and efficient estimation of causal effects from observational data, covariate selection for confounding adjustment is an important task in causal inference. Despite recent advances in graphical criteria for constructing valid and efficient adjustment sets, these methods often rely on assumptions that may not hold in practice. We examine the properties of existing graph-free covariate selection methods with respect to both validity and efficiency, highlighting the danger of producing invalid adjustment sets when hidden variables are present. To address this issue, we propose a novel graph-free method, referred to as CMIO, adapted from Mixed Integer Optimization (MIO) with a set of causal constraints. Our results demonstrate that CMIO outperforms existing state-of-the-art methods and provides theoretically sound outputs. Furthermore, we present a revised version of CMIO that handles settings without causal sufficiency or graphical information, offering efficient and valid covariate adjustment for causal inference.
Noise plagues many numerical datasets: recorded values may fail to match the true underlying values for reasons including erroneous sensors, data entry or processing mistakes, and imperfect human estimates. Here we consider estimating \emph{which} data values are incorrect along a numerical column. We present a model-agnostic approach that can utilize \emph{any} regressor (i.e.\ statistical or machine learning model) fit to predict values in this column from the other variables in the dataset. By accounting for various uncertainties, our approach distinguishes genuine anomalies from natural data fluctuations, conditioned on the available information in the dataset. We establish theoretical guarantees for our method and show that other approaches, such as conformal inference, struggle to detect errors. We also contribute a new error detection benchmark involving 5 regression datasets with real-world numerical errors (for which the true values are also known). In this benchmark and additional simulation studies, our method identifies incorrect values with better precision/recall than other approaches.
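The following is an illustrative, simplified stand-in for this kind of model-agnostic scoring: out-of-fold predictions from an arbitrary regressor, with residuals rescaled by a rough dataset-level spread. The scoring rule, threshold, and names are assumptions for illustration, not the scoring procedure developed in the paper.

```python
# Residual-based error scores for one numerical column (illustrative sketch).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def error_scores(X_other, y_col, n_splits=5):
    """X_other: other columns (n, p); y_col: the numerical column to audit."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    y_hat = cross_val_predict(model, X_other, y_col, cv=n_splits)  # out-of-fold predictions
    resid = np.abs(y_col - y_hat)
    scale = np.median(resid) + 1e-12        # crude uncertainty scale for the column
    return resid / scale                    # larger score => more likely an error

# Usage: flag values whose score exceeds a chosen cutoff, e.g.
# flagged = np.where(error_scores(X_other, y_col) > 5.0)[0]
```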
Existing heterogeneous treatment effect learners, also known as conditional average treatment effect (CATE) learners, lack a general mechanism for end-to-end inter-treatment information sharing; the data must be split among the potential outcome functions to train the learners, which can lead to biased estimates when observational data are limited. To address this issue, we propose a novel deep learning-based framework for training CATE learners that facilitates dynamic end-to-end information sharing among treatment groups. The framework is based on \textit{soft weight sharing} via \textit{hypernetworks}, which offers advantages such as parameter efficiency, faster training, and improved results. The proposed framework complements existing CATE learners and introduces a new class of uncertainty-aware CATE learners that we refer to as \textit{HyperCATE}. We develop HyperCATE versions of commonly used CATE learners and evaluate them on the IHDP, ACIC-2016, and Twins benchmarks. Our experimental results show that the proposed framework reduces the CATE estimation error, as measured via counterfactual inference, with increasing effectiveness for smaller datasets.
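As a minimal sketch of soft weight sharing via a hypernetwork (not the HyperCATE architecture or training setup of the paper; sizes and names are illustrative), a small network maps a treatment embedding to the weights of an outcome head, so every unit's factual outcome trains the shared hypernetwork end to end.

```python
# PyTorch sketch: hypernetwork generates a per-treatment linear outcome head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperHead(nn.Module):
    def __init__(self, x_dim, n_treatments=2, emb_dim=8, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(n_treatments, emb_dim)
        # Hypernetwork outputs the weights and bias of a linear outcome head.
        self.hyper = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim + 1),
        )

    def forward(self, x, a):
        """x: (n, x_dim) covariates; a: (n,) integer treatment indicators."""
        params = self.hyper(self.emb(a))           # (n, x_dim + 1) generated parameters
        w, b = params[:, :-1], params[:, -1]
        return (x * w).sum(dim=1) + b              # predicted outcome under treatment a

# One training step on the full sample: gradients from every treatment group
# flow through the shared hypernetwork, which is how information is shared.
model = HyperHead(x_dim=10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 10); a = torch.randint(0, 2, (64,)); y = torch.randn(64)
loss = F.mse_loss(model(x, a), y)
opt.zero_grad(); loss.backward(); opt.step()
```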
There is increasing interest in modeling high-dimensional longitudinal outcomes in applications such as developmental neuroimaging research. Growth curve models offer a useful tool to capture both the mean growth pattern across individuals and the dynamic changes of outcomes over time within each individual. However, when the number of outcomes is large, it becomes challenging, and often infeasible, to handle the large covariance matrix of the random effects involved in the model. In this article, we propose a high-dimensional response growth curve model with three novel components: a low-rank factor model structure that substantially reduces the number of parameters in the large covariance matrix; a re-parameterization coupled with a sparsity penalty that selects important fixed and random effect terms; and a computational trick that turns the inversion of a large matrix into the inversion of a stack of small matrices, considerably speeding up the computation. We develop an efficient expectation-maximization type estimation algorithm, and demonstrate the competitive performance of the proposed method through both simulations and a longitudinal study of brain structural connectivity in association with human immunodeficiency virus.
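One standard route to this kind of computational saving, shown below for illustration, is the Woodbury identity under a low-rank-plus-diagonal covariance: inverting a large matrix reduces to inverting a small one whose size equals the number of factors. This illustrates the general principle; the paper's exact trick may differ.

```python
# Woodbury inversion for Sigma = Lambda Lambda' + Psi (Psi diagonal, rank r << q).
import numpy as np

rng = np.random.default_rng(0)
q, r = 500, 5                                   # many outcomes, few factors
Lam = rng.normal(size=(q, r))
psi = rng.uniform(0.5, 1.5, size=q)             # diagonal of Psi

Sigma = Lam @ Lam.T + np.diag(psi)

# Sigma^{-1} = Psi^{-1} - Psi^{-1} Lam (I_r + Lam' Psi^{-1} Lam)^{-1} Lam' Psi^{-1}
Psi_inv_Lam = Lam / psi[:, None]
core = np.linalg.inv(np.eye(r) + Lam.T @ Psi_inv_Lam)       # only an r x r inversion
Sigma_inv = np.diag(1.0 / psi) - Psi_inv_Lam @ core @ Psi_inv_Lam.T

assert np.allclose(Sigma_inv, np.linalg.inv(Sigma), atol=1e-8)
```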
Outcome phenotype measurement error is rarely corrected in comparative effect estimation studies in observational pharmacoepidemiology. Quantitative bias analysis (QBA) is a misclassification correction method that algebraically adjusts person counts in exposure-outcome contingency tables to reflect the magnitude of misclassification. The extent to which QBA minimizes bias is unclear because few systematic evaluations have been reported. We empirically evaluated the impact of QBA on odds ratios (OR) in several comparative effect estimation scenarios. We estimated non-differential and differential phenotype errors with internal validation studies using a probabilistic reference. Further, we synthesized an analytic space defined by outcome incidence, uncorrected ORs, and phenotype errors to identify which combinations produce invalid results indicative of input errors. We evaluated impact using the relative bias, $[(\mathrm{OR}-\mathrm{OR}_{\mathrm{QBA}})/\mathrm{OR}]\times 100\%$. Results were considered invalid if any contingency table cell was corrected to a negative count. Empirical bias correction was greatest in lower-incidence scenarios where uncorrected ORs were larger. Similarly, synthetic bias correction was greater in lower-incidence settings with larger uncorrected estimates. The proportion of invalid synthetic scenarios increased as uncorrected estimates increased. Results were invalid in common, low-incidence scenarios, indicating problematic inputs. This demonstrates the importance of accurately and precisely estimating phenotype errors before implementing QBA in comparative effect estimation studies.
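For illustration, the sketch below applies the standard simple (non-probabilistic) back-calculation for outcome misclassification in a 2x2 exposure-outcome table, allowing sensitivity and specificity to differ by exposure group; it flags the "invalid" case where a corrected cell becomes non-positive. The example numbers and names are ours, and the probabilistic reference and implementation details of the study are not reproduced.

```python
# Simple QBA correction of a 2x2 exposure-outcome table for outcome misclassification.
def qba_correct(a, b, c, d, se1, sp1, se0, sp0):
    """a, b: outcome yes/no among exposed; c, d: outcome yes/no among unexposed.
    se1, sp1 (se0, sp0): outcome sensitivity/specificity among the (un)exposed."""
    n1, n0 = a + b, c + d
    A = (a - (1 - sp1) * n1) / (se1 + sp1 - 1)     # corrected exposed cases
    C = (c - (1 - sp0) * n0) / (se0 + sp0 - 1)     # corrected unexposed cases
    B, D = n1 - A, n0 - C
    if min(A, B, C, D) <= 0:
        raise ValueError("invalid correction: a corrected cell is non-positive")
    return A, B, C, D, (A * D) / (B * C)           # corrected cells and OR_QBA

# Example: uncorrected OR vs corrected OR, and relative bias [(OR - OR_QBA)/OR] x 100%.
a, b, c, d = 40, 960, 20, 980
or_raw = (a * d) / (b * c)
*_, or_qba = qba_correct(a, b, c, d, se1=0.7, sp1=0.99, se0=0.7, sp0=0.99)
rel_bias = (or_raw - or_qba) / or_raw * 100
```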