The estimation of Average Treatment Effect (ATE) as a causal parameter is carried out in two steps, wherein the first step, the treatment, and outcome are modeled to incorporate the potential confounders, and in the second step, the predictions are inserted into the ATE estimators such as the Augmented Inverse Probability Weighting (AIPW) estimator. Due to the concerns regarding the nonlinear or unknown relationships between confounders and the treatment and outcome, there has been an interest in applying non-parametric methods such as Machine Learning (ML) algorithms instead. \cite{farrell2018deep} proposed to use two separate Neural Networks (NNs) where there's no regularization on the network's parameters except the Stochastic Gradient Descent (SGD) in the NN's optimization. Our simulations indicate that the AIPW estimator suffers extensively if no regularization is utilized. We propose the normalization of AIPW (referred to as nAIPW) which can be helpful in some scenarios. nAIPW, provably, has the same properties as AIPW, that is double-robustness and orthogonality \citep{chernozhukov2018double}. Further, if the first step algorithms converge fast enough, under regulatory conditions \citep{chernozhukov2018double}, nAIPW will be asymptotically normal.
This PhD thesis contains several contributions to the field of statistical causal modeling. Statistical causal models are statistical models embedded with causal assumptions that allow for the inference and reasoning about the behavior of stochastic systems affected by external manipulation (interventions). This thesis contributes to the research areas concerning the estimation of causal effects, causal structure learning, and distributionally robust (out-of-distribution generalizing) prediction methods. We present novel and consistent linear and non-linear causal effects estimators in instrumental variable settings that employ data-dependent mean squared prediction error regularization. Our proposed estimators show, in certain settings, mean squared error improvements compared to both canonical and state-of-the-art estimators. We show that recent research on distributionally robust prediction methods has connections to well-studied estimators from econometrics. This connection leads us to prove that general K-class estimators possess distributional robustness properties. We, furthermore, propose a general framework for distributional robustness with respect to intervention-induced distributions. In this framework, we derive sufficient conditions for the identifiability of distributionally robust prediction methods and present impossibility results that show the necessity of several of these conditions. We present a new structure learning method applicable in additive noise models with directed trees as causal graphs. We prove consistency in a vanishing identifiability setup and provide a method for testing substructure hypotheses with asymptotic family-wise error control that remains valid post-selection. Finally, we present heuristic ideas for learning summary graphs of nonlinear time-series models.
It is increasingly common to encounter prediction tasks in the biomedical sciences for which multiple datasets are available for model training. Common approaches such as pooling datasets and applying standard statistical learning methods can result in poor out-of-study prediction performance when datasets are heterogeneous. Theoretical and applied work has shown $\textit{multi-study ensembling}$ to be a viable alternative that leverages the variability across datasets in a manner that promotes model generalizability. Multi-study ensembling uses a two-stage $\textit{stacking}$ strategy which fits study-specific models and estimates ensemble weights separately. This approach ignores, however, the ensemble properties at the model-fitting stage, potentially resulting in a loss of efficiency. We therefore propose $\textit{optimal ensemble construction}$, an $\textit{all-in-one}$ approach to multi-study stacking whereby we jointly estimate ensemble weights as well as parameters associated with each study-specific model. We prove that limiting cases of our approach yield existing methods such as multi-study stacking and pooling datasets before model fitting. We propose an efficient block coordinate descent algorithm to optimize the proposed loss function. We compare our approach to standard methods by applying it to a multi-country COVID-19 dataset for baseline mortality prediction. We show that when little data is available for a country before the onset of the pandemic, leveraging data from other countries can substantially improve prediction accuracy. Importantly, our approach outperforms multi-study stacking and other standard methods in this application. We further characterize the method's performance in simulations. Our method remains competitive with or outperforms multi-study stacking and other earlier methods across a range of between-study heterogeneity levels.
Voting systems have a wide range of applications including recommender systems, web search, product design and elections. Limited by the lack of general-purpose analytical tools, it is difficult to hand-engineer desirable voting rules for each use case. For this reason, it is appealing to automatically discover voting rules geared towards each scenario. In this paper, we show that set-input neural network architectures such as Set Transformers, fully-connected graph networks and DeepSets are both theoretically and empirically well-suited for learning voting rules. In particular, we show that these network models can not only mimic a number of existing voting rules to compelling accuracy -- both position-based (such as Plurality and Borda) and comparison-based (such as Kemeny, Copeland and Maximin) -- but also discover near-optimal voting rules that maximize different social welfare functions. Furthermore, the learned voting rules generalize well to different voter utility distributions and election sizes unseen during training.
As data-driven methods are deployed in real-world settings, the processes that generate the observed data will often react to the decisions of the learner. For example, a data source may have some incentive for the algorithm to provide a particular label (e.g. approve a bank loan), and manipulate their features accordingly. Work in strategic classification and decision-dependent distributions seeks to characterize the closed-loop behavior of deploying learning algorithms by explicitly considering the effect of the classifier on the underlying data distribution. More recently, works in performative prediction seek to classify the closed-loop behavior by considering general properties of the mapping from classifier to data distribution, rather than an explicit form. Building on this notion, we analyze repeated risk minimization as the perturbed trajectories of the gradient flows of performative risk minimization. We consider the case where there may be multiple local minimizers of performative risk, motivated by situations where the initial conditions may have significant impact on the long-term behavior of the system. We provide sufficient conditions to characterize the region of attraction for the various equilibria in this settings. Additionally, we introduce the notion of performative alignment, which provides a geometric condition on the convergence of repeated risk minimization to performative risk minimizers.
We consider the problem of detecting jumps in an otherwise smoothly evolving trend whilst the covariance and higher-order structures of the system can experience both smooth and abrupt changes over time. The number of jump points is allowed to diverge to infinity with the jump sizes possibly shrinking to zero. The method is based on a multiscale application of an optimal jump-pass filter to the time series, where the scales are dense between admissible lower and upper bounds. For a wide class of non-stationary time series models and trend functions, the proposed method is shown to be able to detect all jump points within a nearly optimal range with a prescribed probability asymptotically under mild conditions. For a time series of length $n$, the computational complexity of the proposed method is $O(n)$ for each scale and $O(n\log^{1+\epsilon} n)$ overall, where $\epsilon$ is an arbitrarily small positive constant. Numerical studies show that the proposed jump testing and estimation method performs robustly and accurately under complex temporal dynamics.
Linear thresholding models postulate that the conditional distribution of a response variable in terms of covariates differs on the two sides of a (typically unknown) hyperplane in the covariate space. A key goal in such models is to learn about this separating hyperplane. Exact likelihood or least squares methods to estimate the thresholding parameter involve an indicator function which make them difficult to optimize and are, therefore, often tackled by using a surrogate loss that uses a smooth approximation to the indicator. In this paper, we demonstrate that the resulting estimator is asymptotically normal with a near optimal rate of convergence: $n^{-1}$ up to a log factor, in both classification and regression thresholding models. This is substantially faster than the currently established convergence rates of smoothed estimators for similar models in the statistics and econometrics literatures. We also present a real-data application of our approach to an environmental data set where $CO_2$ emission is explained in terms of a separating hyperplane defined through per-capita GDP and urban agglomeration.
In this paper, we establish the almost sure convergence of two-timescale stochastic gradient descent algorithms in continuous time under general noise and stability conditions, extending well known results in discrete time. We analyse algorithms with additive noise and those with non-additive noise. In the non-additive case, our analysis is carried out under the assumption that the noise is a continuous-time Markov process, controlled by the algorithm states. The algorithms we consider can be applied to a broad class of bilevel optimisation problems. We study one such problem in detail, namely, the problem of joint online parameter estimation and optimal sensor placement for a partially observed diffusion process. We demonstrate how this can be formulated as a bilevel optimisation problem, and propose a solution in the form of a continuous-time, two-timescale, stochastic gradient descent algorithm. Furthermore, under suitable conditions on the latent signal, the filter, and the filter derivatives, we establish almost sure convergence of the online parameter estimates and optimal sensor placements to the stationary points of the asymptotic log-likelihood and asymptotic filter covariance, respectively. We also provide numerical examples, illustrating the application of the proposed methodology to a partially observed Bene\v{s} equation, and a partially observed stochastic advection-diffusion equation.
Stein's method for Gaussian process approximation can be used to bound the differences between the expectations of smooth functionals $h$ of a c\`adl\`ag random process $X$ of interest and the expectations of the same functionals of a well understood target random process $Z$ with continuous paths. Unfortunately, the class of smooth functionals for which this is easily possible is very restricted. Here, we prove an infinite dimensional Gaussian smoothing inequality, which enables the class of functionals to be greatly expanded -- examples are Lipschitz functionals with respect to the uniform metric, and indicators of arbitrary events -- in exchange for a loss of precision in the bounds. Our inequalities are expressed in terms of the smooth test function bound, an expectation of a functional of $X$ that is closely related to classical tightness criteria, a similar expectation for $Z$, and, for the indicator of a set $K$, the probability $\mathbb{P}(Z \in K^\theta \setminus K^{-\theta})$ that the target process is close to the boundary of $K$.
We study structured multi-armed bandits, which is the problem of online decision-making under uncertainty in the presence of structural information. In this problem, the decision-maker needs to discover the best course of action despite observing only uncertain rewards over time. The decision-maker is aware of certain structural information regarding the reward distributions and would like to minimize their regret by exploiting this information, where the regret is its performance difference against a benchmark policy that knows the best action ahead of time. In the absence of structural information, the classical upper confidence bound (UCB) and Thomson sampling algorithms are well known to suffer only minimal regret. As recently pointed out, neither algorithms are, however, capable of exploiting structural information that is commonly available in practice. We propose a novel learning algorithm that we call DUSA whose worst-case regret matches the information-theoretic regret lower bound up to a constant factor and can handle a wide range of structural information. Our algorithm DUSA solves a dual counterpart of the regret lower bound at the empirical reward distribution and follows its suggested play. Our proposed algorithm is the first computationally viable learning policy for structured bandit problems that has asymptotic minimal regret.
Heatmap-based methods dominate in the field of human pose estimation by modelling the output distribution through likelihood heatmaps. In contrast, regression-based methods are more efficient but suffer from inferior performance. In this work, we explore maximum likelihood estimation (MLE) to develop an efficient and effective regression-based methods. From the perspective of MLE, adopting different regression losses is making different assumptions about the output density function. A density function closer to the true distribution leads to a better regression performance. In light of this, we propose a novel regression paradigm with Residual Log-likelihood Estimation (RLE) to capture the underlying output distribution. Concretely, RLE learns the change of the distribution instead of the unreferenced underlying distribution to facilitate the training process. With the proposed reparameterization design, our method is compatible with off-the-shelf flow models. The proposed method is effective, efficient and flexible. We show its potential in various human pose estimation tasks with comprehensive experiments. Compared to the conventional regression paradigm, regression with RLE bring 12.4 mAP improvement on MSCOCO without any test-time overhead. Moreover, for the first time, especially on multi-person pose estimation, our regression method is superior to the heatmap-based methods. Our code is available at //github.com/Jeff-sjtu/res-loglikelihood-regression