Motivated by recent findings that within-subject (WS) variability of longitudinal biomarkers is a risk factor for many health outcomes, this paper introduces and studies a new joint model of a longitudinal biomarker with heterogeneous WS variability and a competing risks time-to-event outcome. Specifically, our joint model consists of a linear mixed-effects multiple location-scale submodel for the individual mean trajectory and WS variability of the longitudinal biomarker and a semiparametric cause-specific Cox proportional hazards submodel for the competing risks survival outcome. The submodels are linked together via shared random effects. We derive an expectation-maximization (EM) algorithm for semiparametric maximum likelihood estimation and a profile-likelihood method for standard error estimation. We implement computationally efficient algorithms that scale to biobank data with tens of thousands of subjects. Our simulation results demonstrate that the proposed method has superior performance and that classical joint models with homogeneous WS variability can suffer from estimation bias, invalid inference, and poor prediction accuracy in the presence of heterogeneous WS variability. An application of the developed method to the large Multi-Ethnic Study of Atherosclerosis (MESA) data not only revealed that subject-specific WS variability in systolic blood pressure (SBP) is highly predictive of heart failure and death, but also yielded more accurate dynamic prediction of heart failure or death by accounting for both the mean trajectory and WS variability of SBP. Our user-friendly R package \textbf{JMH} is publicly available at \url{https://github.com/shanpengli/JMH}.
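To make the location-scale idea concrete, here is a minimal Python data-generating sketch, not the JMH implementation, of a longitudinal biomarker whose within-subject residual variance differs across subjects; all parameter values, variable names, and the single-slope trajectory are illustrative assumptions.

import numpy as np

# Minimal sketch (illustrative parameters, not the JMH implementation): each subject
# gets its own mean trajectory AND its own within-subject (WS) residual variance.
rng = np.random.default_rng(3)
n_sub, n_obs = 200, 10
t = np.tile(np.linspace(0, 5, n_obs), (n_sub, 1))
b0 = rng.normal(120, 10, n_sub)       # random intercepts (e.g., SBP level)
b1 = rng.normal(0.5, 0.3, n_sub)      # random slopes
omega = rng.normal(0.0, 0.5, n_sub)   # random effect in the log WS-variance (scale) submodel
sd_ws = np.exp(1.5 + omega)           # heterogeneous WS standard deviation
eps = rng.standard_normal((n_sub, n_obs)) * sd_ws[:, None]
y = b0[:, None] + b1[:, None] * t + eps
# In the joint model, (b0, b1, omega) are shared with the cause-specific hazard
# submodels, so both the mean trajectory and the WS variability predict event risk.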
Rather than refining individual candidate solutions for a general non-convex optimization problem, by analogy to evolution, we consider minimizing the average loss for a parametric distribution over hypotheses. In this setting, we prove that Fisher-Rao natural gradient descent (FR-NGD) optimally approximates the continuous-time replicator equation (an essential model of evolutionary dynamics) by minimizing the mean-squared error for the relative fitness of competing hypotheses. We term this finding "conjugate natural selection" and demonstrate its utility by numerically solving an example non-convex optimization problem over a continuous strategy space. Next, by developing known connections between discrete-time replicator dynamics and Bayes's rule, we show that when absolute fitness corresponds to the negative KL-divergence of a hypothesis's predictions from actual observations, FR-NGD provides the optimal approximation of continuous Bayesian inference. We use this result to demonstrate a novel method for estimating the parameters of stochastic processes.
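As a toy illustration of these dynamics (our own simplified example, not the paper's experiments), the sketch below evolves a categorical distribution over a finite hypothesis set under the continuous-time replicator equation with fitness equal to negative loss, which, for a categorical distribution with exact gradients, is the flow that Fisher-Rao natural gradient descent on the expected loss follows in continuous time; the probability mass concentrates on the lowest-loss hypothesis.

import numpy as np

# Replicator ODE dp_i/dt = p_i * (f_i - sum_j p_j f_j) with fitness f_i = -loss_i,
# integrated by Euler steps; illustrative losses over 10 candidate hypotheses.
rng = np.random.default_rng(0)
losses = rng.uniform(0.0, 1.0, size=10)
p = np.full(10, 0.1)                            # uniform initial distribution
dt = 0.05
for _ in range(2000):
    fitness = -losses
    p = p + dt * p * (fitness - p @ fitness)    # Euler step of the replicator flow
    p = np.clip(p, 1e-12, None); p /= p.sum()   # numerical safeguard
print("lowest-loss hypothesis:", losses.argmin(), "final mass:", round(p[losses.argmin()], 3))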
Making causal inferences from observational studies can be challenging when confounders are missing not at random. In such cases, identifying causal effects is often not guaranteed. Motivated by a real example, we consider a treatment-independent missingness assumption under which we establish the identification of causal effects when confounders are missing not at random. We propose a weighted estimating equation (WEE) approach for estimating model parameters and introduce three estimators for the average causal effect, based on regression, propensity score weighting, and doubly robust methods. We evaluate the performance of these estimators through simulations, and provide a real data analysis to illustrate our proposed method.
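For orientation, the sketch below implements the three standard average-causal-effect estimators the abstract refers to (regression, propensity score weighting, and doubly robust) on simulated data with fully observed confounders; the paper's weighted estimating equations additionally reweight for confounders that are missing not at random, which this illustrative sketch omits.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=(n, 2))                                   # confounders (fully observed here)
a = rng.binomial(1, 1 / (1 + np.exp(-(x @ [0.8, -0.5]))))     # treatment
y = 2.0 * a + x @ [1.0, 1.0] + rng.normal(size=n)             # outcome, true ACE = 2

ps = LogisticRegression().fit(x, a).predict_proba(x)[:, 1]    # propensity score model
m1 = LinearRegression().fit(x[a == 1], y[a == 1]).predict(x)  # outcome model, treated
m0 = LinearRegression().fit(x[a == 0], y[a == 0]).predict(x)  # outcome model, control

ace_reg = np.mean(m1 - m0)                                                # regression estimator
ace_ipw = np.mean(a * y / ps) - np.mean((1 - a) * y / (1 - ps))           # weighting estimator
ace_dr = np.mean(a * (y - m1) / ps + m1) - np.mean((1 - a) * (y - m0) / (1 - ps) + m0)
print(ace_reg, ace_ipw, ace_dr)                                           # all near 2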
Wikipedia plays a crucial role in the integrity of the Web. This work analyzes the reliability of this global encyclopedia through the lens of its references. We operationalize the notion of reference quality by defining reference need (RN), i.e., the percentage of sentences missing a citation, and reference risk (RR), i.e., the proportion of non-authoritative references. We release Citation Detective, a tool for automatically calculating the RN score, and discover that the RN score has dropped by 20 percentage points over the last decade, with more than half of verifiable statements now accompanied by references. The RR score has remained below 1% over the years as a result of the community's efforts to eliminate unreliable references. We propose pairing novice and experienced editors on the same Wikipedia article as a strategy to enhance reference quality. Our quasi-experiment indicates that such a co-editing experience can result in a lasting advantage in identifying unreliable sources in future edits. As Wikipedia is frequently used as the ground truth for numerous Web applications, our findings and suggestions on its reliability can have a far-reaching impact. We discuss the possibility of other Web services adopting Wiki-style user collaboration to eliminate unreliable content.
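As a concrete, hedged sketch of how the two scores can be operationalized once per-sentence citation-need labels (predicted by Citation Detective in the paper) and reference source domains are available; the function names, denominator choices, and blocklist input below are our illustrative assumptions.

from typing import List, Set

def reference_need(needs_citation: List[bool], has_citation: List[bool]) -> float:
    # RN: share of verifiable (citation-needing) sentences that lack a citation.
    needed = sum(needs_citation)
    missing = sum(n and not h for n, h in zip(needs_citation, has_citation))
    return missing / needed if needed else 0.0

def reference_risk(ref_domains: List[str], non_authoritative: Set[str]) -> float:
    # RR: share of references whose source domain appears on a non-authoritative list.
    if not ref_domains:
        return 0.0
    return sum(d in non_authoritative for d in ref_domains) / len(ref_domains)

print(reference_need([True, True, False], [True, False, False]))                  # 0.5
print(reference_risk(["nature.com", "example-blog.com"], {"example-blog.com"}))   # 0.5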
Dynamic treatment regimens (DTRs), also known as treatment algorithms or adaptive interventions, play an increasingly important role in many health domains. DTRs aim to address the unique and changing needs of individuals by delivering the type of treatment needed, when it is needed, while minimizing unnecessary treatment. Practically, a DTR is a sequence of decision rules that specify, for each of several points in time, how available information about the individual's status and progress should be used in practice to decide which treatment (e.g., type or intensity) to deliver. The sequential multiple assignment randomized trial (SMART) is an experimental design widely used to empirically inform the development of DTRs. Sample size planning resources for SMARTs have been developed for continuous, binary, and survival outcomes. However, an important gap exists in sample size estimation methodology for SMARTs with longitudinal count outcomes. Further, in many health domains, count data are overdispersed, having variance greater than their mean. We propose a Monte Carlo-based approach to sample size estimation applicable to many types of longitudinal outcomes and provide a case study with longitudinal overdispersed count outcomes. A SMART for engaging alcohol- and cocaine-dependent patients in treatment is used as motivation.
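To sketch the Monte Carlo approach (simplified here to a single end-of-study two-arm contrast on overdispersed counts rather than a full SMART; the means, dispersion, and test choice are illustrative), one simulates trials at a candidate sample size, takes the rejection rate as the power estimate, and increases the sample size until the target power is reached.

import numpy as np
from scipy import stats

def mc_power(n_per_arm, mean0=5.0, mean1=4.0, disp=2.0, n_sims=2000, alpha=0.05, seed=0):
    """Monte Carlo power for comparing negative-binomial (overdispersed) counts
    between two regimens; parameter values are illustrative."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # NegBin parameterized so that the mean is m and the dispersion is disp: p = disp / (disp + m)
        y0 = rng.negative_binomial(disp, disp / (disp + mean0), n_per_arm)
        y1 = rng.negative_binomial(disp, disp / (disp + mean1), n_per_arm)
        _, pval = stats.mannwhitneyu(y0, y1)
        rejections += pval < alpha
    return rejections / n_sims

print(mc_power(300))   # estimated power at 300 subjects per arm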
We introduce a class of networked Markov potential games where agents are associated with nodes in a network. Each agent has its own local potential function, and the reward of each agent depends only on the states and actions of agents within a $\kappa$-hop neighborhood. In this context, we propose a localized actor-critic algorithm. The algorithm is scalable since each agent uses only local information and does not need access to the global state. Further, the algorithm overcomes the curse of dimensionality through the use of function approximation. Our main results provide finite-sample guarantees up to a localization error and a function approximation error. Specifically, we achieve an $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity measured by the averaged Nash regret. This is the first finite-sample bound for multi-agent competitive games that does not depend on the number of agents.
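As a small illustration of the locality structure only (not the actor-critic algorithm itself), the sketch below computes a $\kappa$-hop neighborhood from an adjacency matrix; in the algorithm, each agent's critic would condition only on the states and actions of the agents in this set.

import numpy as np

def khop_neighborhood(adj: np.ndarray, agent: int, kappa: int) -> np.ndarray:
    """Indices of agents within kappa hops of `agent` (including itself)."""
    reach = np.zeros(len(adj), dtype=bool)
    reach[agent] = True
    for _ in range(kappa):
        reach = reach | adj[reach].any(axis=0)   # expand the frontier by one hop
    return np.flatnonzero(reach)

# path graph 0 - 1 - 2 - 3: the 2-hop neighborhood of agent 0 is {0, 1, 2}
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=bool)
print(khop_neighborhood(adj, agent=0, kappa=2))   # [0 1 2]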
In many longitudinal settings, time-varying covariates may not be measured at the same time as responses and are often prone to measurement error. Naive last-observation-carried-forward methods incur estimation biases, and existing kernel-based methods suffer from slow convergence rates and large variations. To address these challenges, we propose a new functional calibration approach to efficiently learn longitudinal covariate processes based on sparse functional data with measurement error. Our approach, stemming from functional principal component analysis, calibrates the unobserved synchronized covariate values from the observed asynchronous and error-prone covariate values, and is broadly applicable to asynchronous longitudinal regression with time-invariant or time-varying coefficients. For regression with time-invariant coefficients, our estimator is asymptotically unbiased, root-n consistent, and asymptotically normal; for time-varying coefficient models, our estimator has the optimal varying coefficient model convergence rate with inflated asymptotic variance from the calibration. In both cases, our estimators present asymptotic properties superior to the existing methods. The feasibility and usability of the proposed methods are verified by simulations and an application to the Study of Women's Health Across the Nation, a large-scale multi-site longitudinal study on women's health during mid-life.
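A minimal sketch of the conditional-expectation (PACE-style) step underlying such a calibration, assuming the FPCA components, i.e., the mean function, eigenfunctions, eigenvalues, and measurement-error variance, have already been estimated from the pooled sparse covariate data; the function name and numbers are illustrative, and the paper's estimator and its asymptotic analysis involve more than this single step.

import numpy as np

def calibrate(t_obs, w_obs, t0, mu, phis, lams, sigma2):
    """Predict the latent covariate at time t0 from a subject's asynchronous,
    error-prone observations (t_obs, w_obs), given estimated FPCA components."""
    Phi = np.column_stack([phi(t_obs) for phi in phis])           # n_i x K eigenfunction matrix
    Lam = np.diag(lams)
    Sigma_w = Phi @ Lam @ Phi.T + sigma2 * np.eye(len(t_obs))     # covariance of the observations
    xi_hat = Lam @ Phi.T @ np.linalg.solve(Sigma_w, w_obs - mu(t_obs))  # best linear prediction of scores
    phi0 = np.array([phi(t0) for phi in phis])
    return mu(t0) + phi0 @ xi_hat                                 # calibrated covariate value at t0

# toy one-component example on [0, 1]
mu = lambda t: 1.0 + 0.5 * np.asarray(t)
phi1 = lambda t: np.sqrt(2.0) * np.sin(np.pi * np.asarray(t))
print(calibrate(np.array([0.1, 0.6, 0.9]), np.array([1.3, 1.8, 1.1]),
                0.5, mu, [phi1], lams=[0.6], sigma2=0.25))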
The test-negative design (TND) has become a standard approach to evaluate vaccine effectiveness against the risk of acquiring infectious diseases in real-world settings, such as influenza, rotavirus, dengue fever, and more recently COVID-19. In a TND study, individuals who experience symptoms and seek care are recruited and tested for the infectious disease, which defines cases and controls. Despite the TND's potential to reduce unobserved differences in healthcare seeking behavior (HSB) between vaccinated and unvaccinated subjects, it remains subject to various potential biases. First, residual confounding may remain due to unobserved HSB, occupation as a healthcare worker, or previous infection history. Second, because selection into the TND sample is a common consequence of infection and HSB, collider stratification bias may exist when conditioning the analysis on the tested sample, which further induces confounding by latent HSB. In this paper, we present a novel approach to identify and estimate vaccine effectiveness in the target population by carefully leveraging a pair of negative control exposure and outcome variables to account for potential hidden bias in TND studies. We illustrate our proposed method with extensive simulations and an application to study COVID-19 vaccine effectiveness using data from the University of Michigan Health System.
We propose a method of sufficient dimension reduction for functional data using distance covariance. We consider the case where the response variable is a scalar but the predictor is a random function. Our method has several advantages. It requires very mild conditions on the predictor, unlike existing methods, which require the restrictive linear conditional mean and constant covariance assumptions. It also does not involve the inverse of the covariance operator, which is unbounded. The link function between the response and the predictor can be arbitrary, and our method retains this model-free advantage without estimating the link function. Moreover, our method is naturally applicable to sparse longitudinal data. We use functional principal component analysis with truncation as the regularization mechanism in the development. The justification for the validity of the proposed method is provided, and under some regularity conditions, statistical consistency of our estimator is established. Simulation studies and real data analysis are also provided to demonstrate the performance of our method.
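The following toy sketch, our own illustration with the FPCA score extraction assumed already done, shows the core quantity: the sample distance covariance between a one-dimensional projection of truncated FPCA scores and the scalar response, here maximized by crude random search over directions; the paper's method estimates a multi-dimensional reduction and supplies the accompanying theory.

import numpy as np

def dcov2(x, y):
    """Squared sample distance covariance (V-statistic) between samples x and y."""
    x = x.reshape(len(x), -1); y = y.reshape(len(y), -1)
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()   # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return (A * B).mean()

rng = np.random.default_rng(2)
n, K = 300, 4
xi = rng.normal(size=(n, K))                            # truncated FPCA scores (assumed given)
y = np.sin(xi[:, 0]) + 0.1 * rng.normal(size=n)         # arbitrary, unestimated link
unit = lambda b: b / np.linalg.norm(b)
best = max((unit(rng.normal(size=K)) for _ in range(500)), key=lambda b: dcov2(xi @ b, y))
print("estimated direction:", np.round(best, 2))        # close to +/- (1, 0, 0, 0)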
The only pharmacologic treatment for gestational diabetes (GDM) approved by the U.S. Food and Drug Administration is insulin. However, due to improved ease of use and lower cost, oral antidiabetic medications, such as glyburide, are prescribed more commonly than insulin. We investigate glyburide's impact on two adverse perinatal outcomes compared to medical nutritional therapy, the universal first-line therapy, in a large, population-based cohort. At the design stage, we employ matching to select comparable treated subjects (who received glyburide) and controls (who received medical nutritional therapy). Multiple background variables were associated with GDM treatment modality and perinatal outcomes; however, there is ambiguity about which of the many potential confounding variables should be prioritized in matching. Standard selection methods based on treatment imbalance alone neglect variables' relationships with the outcome. Thus, we propose the joint variable importance plot (jointVIP) to guide variable prioritization for this study. This plot adds outcome associations on a second dimension to better contextualize standard imbalance measures, further enhances variable comparisons using unadjusted bias curves derived under the omitted variable bias framework, and can produce recommended values for tuning parameters in existing methods. After forming matched pairs, we conduct inference for adverse effects of glyburide and perform sensitivity analyses to assess the potential role of unmeasured confounding. Our findings of no reliable adverse effect of glyburide inform future pharmacologic treatment strategies to manage GDM.
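A hedged sketch of the two quantities a jointVIP-style plot places on its axes, computed here per candidate confounder: treatment imbalance as a standardized mean difference and an unadjusted outcome association in the comparison group. The definitions, the use of the control-group correlation, and the function name are our illustrative choices; the actual jointVIP package additionally draws the omitted-variable-bias curves described above.

import numpy as np
import pandas as pd

def joint_importance(X: pd.DataFrame, treat: np.ndarray, y: np.ndarray) -> pd.DataFrame:
    """One row per candidate confounder: treatment imbalance vs. outcome association."""
    rows = []
    for col in X.columns:
        x = X[col].to_numpy(float)
        pooled_sd = np.sqrt((x[treat == 1].var(ddof=1) + x[treat == 0].var(ddof=1)) / 2)
        smd = (x[treat == 1].mean() - x[treat == 0].mean()) / pooled_sd
        out_corr = np.corrcoef(x[treat == 0], y[treat == 0])[0, 1]  # association among controls
        rows.append({"variable": col, "smd": smd, "outcome_corr": out_corr})
    return pd.DataFrame(rows)   # scatter smd against outcome_corr to prioritize matching variables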
This PhD thesis contains several contributions to the field of statistical causal modeling. Statistical causal models are statistical models embedded with causal assumptions that allow for inference and reasoning about the behavior of stochastic systems affected by external manipulation (interventions). This thesis contributes to the research areas concerning the estimation of causal effects, causal structure learning, and distributionally robust (out-of-distribution generalizing) prediction methods. We present novel and consistent linear and non-linear causal effect estimators in instrumental variable settings that employ data-dependent mean squared prediction error regularization. Our proposed estimators show, in certain settings, mean squared error improvements compared to both canonical and state-of-the-art estimators. We show that recent research on distributionally robust prediction methods has connections to well-studied estimators from econometrics. This connection leads us to prove that general K-class estimators possess distributional robustness properties. Furthermore, we propose a general framework for distributional robustness with respect to intervention-induced distributions. In this framework, we derive sufficient conditions for the identifiability of distributionally robust prediction methods and present impossibility results that show the necessity of several of these conditions. We present a new structure learning method applicable in additive noise models with directed trees as causal graphs. We prove consistency in a vanishing identifiability setup and provide a method for testing substructure hypotheses with asymptotic family-wise error control that remains valid post-selection. Finally, we present heuristic ideas for learning summary graphs of nonlinear time-series models.