Motivated by recent findings that within-subject (WS) visit-to-visit variabilities of longitudinal biomarkers can be strong risk factors for health outcomes, this paper introduces and examines a new joint model of a longitudinal biomarker with heterogeneous WS variability and a competing risks time-to-event outcome. Specifically, our joint model consists of a linear mixed-effects multiple location-scale submodel for the individual mean trajectory and WS variability of the longitudinal biomarker and a semiparametric cause-specific Cox proportional hazards submodel for the competing risks survival outcome. The submodels are linked together via shared random effects. We derive an expectation-maximization algorithm for semiparametric maximum likelihood estimation and a profile-likelihood method for standard error estimation. We implement efficient computational algorithms that scale to biobank-scale data with tens of thousands of subjects. Our simulation results demonstrate that, in the presence of heterogeneous WS variability, the proposed method has superior performance for estimation, inference, and prediction over the classical joint model with homogeneous WS variability. An application of our method to data from the Multi-Ethnic Study of Atherosclerosis (MESA) reveals that there is substantial heterogeneity in systolic blood pressure (SBP) WS variability across MESA individuals and that SBP WS variability is an important predictor for heart failure and death, independent of (or in addition to) the individual mean SBP level. Furthermore, by accounting for both the mean trajectory and WS variability of SBP, our method leads to a more accurate dynamic prediction model for heart failure or death. A user-friendly R package \textbf{JMH} has been developed and is publicly available at \url{https://github.com/shanpengli/JMH}.
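As a concrete illustration of the location-scale structure described above, the following self-contained R sketch simulates a longitudinal biomarker whose within-subject variability differs across subjects through a random effect on the log WS standard deviation. It is only a toy data-generating example with illustrative parameter values, not the \textbf{JMH} implementation, and it omits the linked competing-risks submodel.

\begin{verbatim}
## Minimal R sketch (not the JMH implementation): simulate a biomarker with
## heterogeneous within-subject (WS) variability via a location-scale structure.
set.seed(1)
n  <- 200                      # subjects
m  <- 6                        # visits per subject
b0 <- rnorm(n, 0, 1)           # random intercepts (location part)
w  <- rnorm(n, 0, 0.5)         # random effects on the log WS SD (scale part)
dat <- do.call(rbind, lapply(seq_len(n), function(i) {
  t    <- 0:(m - 1)
  sd_i <- exp(-0.5 + w[i])     # subject-specific WS standard deviation
  data.frame(id = i, time = t,
             y = 120 + 0.8 * t + b0[i] + rnorm(m, 0, sd_i))
}))
## Crude per-subject residual SD tracks the true WS SD across subjects:
ws_sd <- sapply(split(dat, dat$id), function(d) sd(resid(lm(y ~ time, data = d))))
cor(ws_sd, exp(-0.5 + w))
\end{verbatim}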
We develop methods, based on extreme value theory, for analysing observations in the tails of longitudinal data, i.e., a data set consisting of a large number of short time series, which are typically irregularly and non-simultaneously sampled, yet have some commonality in the structure of each series and exhibit independence between time series. Extreme value theory has not previously been considered for the unique features of longitudinal data. Across time series the data are assumed to follow a common generalised Pareto distribution above a high threshold. To account for the temporal dependence of such data, we require a model that describes (i) the variation between the different time series properties, (ii) the changes in distribution over time, and (iii) the temporal dependence within each series. Our methodology has the flexibility to capture both asymptotic dependence and asymptotic independence, with this characteristic determined by the data. Bayesian inference is used given the need for inference of parameters that are unique to each time series. Our novel methodology is illustrated through the analysis of data from elite swimmers in the men's 100m breaststroke. Unlike previous analyses of personal-best data in this event, we are able to make inference about the careers of individual swimmers, such as the probability that an individual will break the world record or swim the fastest time next year.
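For readers unfamiliar with the peaks-over-threshold ingredient used above, the following R sketch fits a common generalised Pareto distribution to exceedances of a high threshold by direct maximum likelihood on toy pooled data. The threshold choice, the simulated data, and the single pooled fit are simplifying assumptions; the sketch ignores the between-series and temporal-dependence modelling that is central to the paper.

\begin{verbatim}
## Minimal R sketch: fit a generalised Pareto distribution (GPD) to
## exceedances of a high threshold by maximum likelihood.
set.seed(2)
x <- rexp(5000)                      # toy pooled observations across series
u <- quantile(x, 0.95)               # high threshold
z <- x[x > u] - u                    # threshold exceedances
ngpd <- function(par) {              # negative GPD log-lik, par = (log sigma, xi)
  sigma <- exp(par[1]); xi <- par[2]
  if (abs(xi) < 1e-8) return(sum(log(sigma) + z / sigma))  # exponential limit
  t <- 1 + xi * z / sigma
  if (any(t <= 0)) return(Inf)
  sum(log(sigma) + (1 / xi + 1) * log(t))
}
fit <- optim(c(0, 0.1), ngpd)
c(sigma = exp(fit$par[1]), xi = fit$par[2])   # common scale and shape estimates
\end{verbatim}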
Understanding individual treatment effects in extreme regimes is important for characterizing the risks associated with different interventions. This is hindered by the fact that data from extreme regimes can be hard to collect, as such data are rarely observed in practice. To address this issue, we propose a new framework for estimating the individual treatment effect in extreme regimes (ITE$_2$). Specifically, we quantify this effect by the changes in the tail decay rates of potential outcomes in the presence or absence of the treatment. We then establish conditions under which ITE$_2$ can be calculated and develop algorithms for its computation. We demonstrate the efficacy of our proposed method on various synthetic and semi-synthetic datasets.
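The following toy R sketch illustrates the general idea of contrasting tail decay rates of two outcome samples using the classical Hill estimator. It is only a population-level illustration under assumed Pareto-type tails and is not the individual-level ITE$_2$ estimator proposed in the paper.

\begin{verbatim}
## Toy R sketch: compare tail decay rates via Hill estimates of the tail index.
hill <- function(y, k) {              # Hill estimate from the top k order statistics
  ys <- sort(y, decreasing = TRUE)
  1 / mean(log(ys[1:k] / ys[k + 1]))
}
set.seed(3)
y1 <- (1 - runif(2000))^(-1 / 3)      # Pareto-type tail, index 3 ("treated", say)
y0 <- (1 - runif(2000))^(-1 / 2)      # heavier tail, index 2 ("control", say)
c(treated = hill(y1, 200), control = hill(y0, 200))  # lower index = heavier tail
\end{verbatim}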
The Plackett--Luce model is a popular approach for ranking data analysis, where a utility vector is employed to determine the probability of each outcome based on Luce's choice axiom. In this paper, we investigate the asymptotic theory of utility vector estimation by maximizing different types of likelihood, such as the full-, marginal-, and quasi-likelihood. We provide a rank-matching interpretation for the estimating equations of these estimators and analyze their asymptotic behavior as the number of items being compared tends to infinity. In particular, we establish the uniform consistency of these estimators under conditions characterized by the topology of the underlying comparison graph sequence and demonstrate that the proposed conditions are sharp for common sampling scenarios such as the nonuniform random hypergraph model and the hypergraph stochastic block model; we also obtain the asymptotic normality of these estimators and discuss the trade-off between statistical efficiency and computational complexity for practical uncertainty quantification. Both results allow for nonuniform and inhomogeneous comparison graphs with varying edge sizes and different asymptotic orders of edge probabilities. We verify our theoretical findings by conducting detailed numerical experiments.
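As a small worked example of likelihood-based estimation of the utility vector, the following R sketch maximizes the Plackett--Luce likelihood restricted to pairwise comparisons (the Bradley--Terry special case) on a simulated complete comparison graph. The comparison design, the number of comparisons per pair, and the identification constraint $u_1 = 0$ are illustrative assumptions.

\begin{verbatim}
## Minimal R sketch: maximum likelihood for the utility vector with pairwise
## comparisons (Plackett-Luce restricted to edges of size two).
set.seed(4)
n <- 6; u <- c(0, 0.5, 1, 1.5, 2, 2.5)          # true utilities
pairs <- t(combn(n, 2))                          # complete comparison graph
wins  <- rbinom(nrow(pairs), 20,                 # times item i beats item j
                plogis(u[pairs[, 1]] - u[pairs[, 2]]))
negll <- function(v) {                           # v = utilities of items 2..n
  uu <- c(0, v)                                  # fix u_1 = 0 for identifiability
  d  <- uu[pairs[, 1]] - uu[pairs[, 2]]
  -sum(wins * d - 20 * log1p(exp(d)))
}
fit <- optim(rep(0, n - 1), negll, method = "BFGS")
round(c(0, fit$par), 2)                          # estimated utilities
\end{verbatim}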
Sequential Multiple-Assignment Randomized Trials (SMARTs) play an increasingly important role in psychological and behavioral health research. This experimental approach enables researchers to answer scientific questions about how to sequence and match interventions to the unique, changing needs of individuals. A variety of sample size planning resources for SMART studies have been developed in recent years; these enable researchers to plan SMARTs for addressing different types of scientific questions. However, relatively limited attention has been given to planning SMARTs with binary (dichotomous) outcomes, which often require larger sample sizes than continuous outcomes. Existing resources for estimating sample size requirements for SMARTs with binary outcomes do not consider the potential to improve power by including a baseline measurement and/or multiple repeated outcome measurements. The current paper addresses this issue by providing sample size simulation code and approximate formulas for two-wave repeated measures binary outcomes (i.e., two measurement times for the outcome variable, before and after receiving the intervention). The simulation results agree well with the formulas. We also discuss how to use simulations to calculate power for studies with more than two outcome measurement occasions. The results show that having at least one repeated measurement of the outcome can substantially improve power under certain conditions.
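As a minimal illustration of simulation-based power calculation for a binary end-of-study outcome, the following R sketch estimates power for comparing two response probabilities by Monte Carlo. It is a deliberately simplified stand-in (a single two-group comparison with assumed response probabilities) and not the SMART-specific simulation code or formulas provided in the paper.

\begin{verbatim}
## Minimal R sketch: Monte Carlo power for comparing two response probabilities.
power_sim <- function(n_per_arm, p1, p2, nsim = 2000, alpha = 0.05) {
  rej <- replicate(nsim, {
    x1 <- rbinom(1, n_per_arm, p1)               # responders, group 1
    x2 <- rbinom(1, n_per_arm, p2)               # responders, group 2
    prop.test(c(x1, x2), c(n_per_arm, n_per_arm))$p.value < alpha
  })
  mean(rej)                                      # empirical rejection rate = power
}
set.seed(5)
power_sim(200, 0.40, 0.55)   # approximate power for a 0.15 difference in proportions
\end{verbatim}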
Inferring the means in the multivariate normal model $X \sim N_n(\theta, I)$ with unknown mean vector $\theta=(\theta_1,\ldots,\theta_n)' \in \mathbb{R}^n$ and observed data $X=(X_1,\ldots,X_n)'\in {\mathbb R}^n$ is a challenging task, known as the problem of many normal means (MNMs). This paper tackles two fundamental kinds of MNMs within the framework of Inferential Models (IMs). The first kind, referred to as the {\it classic} kind, takes the model as stated, with no further assumptions on the $\theta_i$'s. The second kind, referred to as the {\it empirical Bayes} kind, assumes that the individual means $\theta_i$'s are drawn independently {\it a priori} from an unknown distribution $G(\cdot)$. The IM formulation for the empirical Bayes kind utilizes numerical deconvolution, enabling prior-free probabilistic inference with over-parameterization for $G(\cdot)$. The IM formulation for the classic kind, on the other hand, utilizes a latent random permutation, providing a novel approach for reasoning with uncertainty and deeper understanding. For uncertainty quantification within the familiar frequentist inference framework, the IM method of maximum plausibility is used for point estimation. Conservative interval estimation is obtained based on plausibility, using a Monte Carlo-based adaptive adjustment approach to construct shorter confidence intervals with targeted coverage. These methods are demonstrated through simulation studies and a real-data example. The numerical results show that the proposed methods for point estimation outperform the traditional James-Stein estimator and Efron's $g$-modeling in terms of mean squared error, and that the adaptive intervals are satisfactory in both coverage and efficiency. The paper concludes with suggestions for future developments and extensions of the proposed methods.
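For reference, the following R sketch computes the classical James-Stein shrinkage estimator used above as a benchmark and compares its mean squared error with that of the maximum likelihood estimator $X$ on simulated data. The simulation design is an illustrative assumption, and the sketch does not implement the proposed IM methods.

\begin{verbatim}
## R sketch of the James-Stein benchmark for the many-normal-means problem.
set.seed(6)
n <- 50
theta <- rnorm(n, 0, 1)                     # true means
X <- rnorm(n, theta, 1)                     # X ~ N_n(theta, I)
js <- (1 - (n - 2) / sum(X^2)) * X          # James-Stein shrinkage toward zero
c(mse_mle = mean((X - theta)^2), mse_js = mean((js - theta)^2))
\end{verbatim}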
We present new results on average causal effects in settings with unmeasured exposure-outcome confounding. Our results are motivated by a class of estimands, frequently of interest in medicine and public health, that are currently not targeted by standard approaches for average causal effects. We recognize these estimands as queries about the average causal effect of an intervening variable. We anchor our introduction of these estimands in an investigation of the role of chronic pain and opioid prescription patterns in the opioid epidemic, and illustrate how conventional approaches lead to unreplicable estimates with ambiguous policy implications. We argue that our alternative effects are replicable and have clear policy implications, and furthermore are non-parametrically identified by the classical frontdoor formula. As an independent contribution, we derive a new semiparametric efficient estimator of the frontdoor formula with a uniform sample boundedness guarantee. This property is unique among previously described estimators in its class, and we demonstrate superior performance in finite-sample settings. The theoretical results are applied to data from the National Health and Nutrition Examination Survey.
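For intuition about the identifying functional, the following R sketch evaluates the classical frontdoor formula by naive plug-in on toy binary data with an unmeasured confounder. The data-generating model is an illustrative assumption, and this plug-in is not the semiparametric efficient, sample-bounded estimator derived in the paper.

\begin{verbatim}
## Minimal R sketch: plug-in frontdoor adjustment with treatment X, mediator M,
## outcome Y, and unmeasured confounder U (U affects X and Y, but not M given X).
set.seed(7)
n <- 5000
U <- rbinom(n, 1, 0.5)
X <- rbinom(n, 1, plogis(-0.5 + 1.5 * U))
M <- rbinom(n, 1, plogis(-1 + 2 * X))
Y <- rbinom(n, 1, plogis(-1 + 1.5 * M + U))
frontdoor <- function(x) {                  # E[Y | do(X = x)] by plug-in
  px <- prop.table(table(X))                # P(X = x')
  sum(sapply(0:1, function(m) {
    mean(M[X == x] == m) *                  # P(M = m | X = x)
      sum(sapply(0:1, function(xp)
        mean(Y[X == xp & M == m]) * px[as.character(xp)]))
  }))
}
frontdoor(1) - frontdoor(0)                 # average causal effect via the frontdoor formula
\end{verbatim}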
In recent years we have been able to gather large amounts of genomic data at a fast rate, creating situations where the number of variables greatly exceeds the number of observations. In these situations, most models that can handle a moderately high dimension become computationally infeasible. Hence, there is a need for pre-screening of variables to reduce the dimension efficiently and accurately to a more moderate scale. Much work has been done to develop such screening procedures for independent outcomes. However, much less work has been done for high-dimensional longitudinal data, in which the observations can no longer be assumed to be independent. In addition, in many of these longitudinal studies it is of interest to capture possible interactions between the genomic variables and time. This calls for the development of new screening procedures for high-dimensional longitudinal data, where the focus is on interactions with time. In this work, we propose a novel conditional screening procedure that ranks variables according to the likelihood value at the maximum likelihood estimates in a semi-marginal linear mixed model, where the genomic variable and its interaction with time are included in the model. To our knowledge, this is the first conditional screening approach for clustered data. We prove that this approach enjoys the sure screening property, and we assess the finite sample performance of the method through simulations, with a comparison to an existing screening approach based on generalized estimating equations.
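The following R sketch conveys the flavor of likelihood-based conditional screening with a gene-by-time interaction: for each candidate variable it fits a linear mixed model with a random intercept (via the \textbf{lme4} package) and ranks variables by the maximized log-likelihood. The simple random-intercept model and simulated data are illustrative assumptions and do not reproduce the semi-marginal model or theory of the paper.

\begin{verbatim}
## Minimal R sketch: rank candidate variables by the maximized log-likelihood
## of a mixed model including the variable and its interaction with time.
library(lme4)
set.seed(8)
n <- 100; m <- 4; p <- 50
id   <- rep(1:n, each = m)
time <- rep(0:(m - 1), n)
G    <- matrix(rnorm(n * p), n, p)[id, ]     # p candidate genomic variables
y    <- 1 + 0.5 * time + 0.8 * G[, 1] * time +    # only variable 1 is active
        rnorm(n, 0, 1)[id] + rnorm(n * m)
scores <- sapply(1:p, function(j)
  as.numeric(logLik(lmer(y ~ G[, j] * time + (1 | id), REML = FALSE))))
head(order(scores, decreasing = TRUE))       # top-ranked variables (truth: variable 1)
\end{verbatim}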
A very classical problem in statistics is to test the stochastic superiority of one distribution over another. However, many existing approaches were developed for independent samples and, moreover, do not take censored data into account. We develop a new estimand-driven method to compare the effectiveness of two treatments in the context of right-censored survival data with matched pairs. With the help of competing risks techniques, the so-called relative treatment effect is estimated. It quantifies the probability that the individual undergoing the first treatment survives the matched individual undergoing the second treatment. Hypothesis tests and confidence intervals are based on a studentized version of the estimator, with resampling-based inference established by means of a randomization method. In a simulation study, we found that the developed test exhibits good power when compared to competitors that actually test the simpler null hypothesis of the equality of both marginal survival functions. Finally, we apply the methodology to a well-known benchmark data set from a trial with patients suffering from diabetic retinopathy.
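The following toy R sketch illustrates the relative treatment effect estimand $p = P(T_1 > T_2) + \tfrac{1}{2} P(T_1 = T_2)$ for matched pairs, together with a within-pair label-swapping randomization test of $p = 1/2$. Censoring is ignored here, so this is only a caricature of the paper's competing-risks-based, studentized procedure.

\begin{verbatim}
## Toy R sketch: relative treatment effect for matched pairs, with a
## within-pair randomization test (no censoring handled here).
set.seed(9)
n  <- 100
T1 <- rexp(n, rate = 0.8)                    # survival times under treatment 1
T2 <- rexp(n, rate = 1.0)                    # matched partners under treatment 2
p_hat <- mean(T1 > T2) + 0.5 * mean(T1 == T2)
perm <- replicate(2000, {                    # swap labels independently within pairs
  s <- sample(c(TRUE, FALSE), n, replace = TRUE)
  a <- ifelse(s, T2, T1); b <- ifelse(s, T1, T2)
  mean(a > b) + 0.5 * mean(a == b)
})
c(estimate = p_hat,
  p_value = mean(abs(perm - 0.5) >= abs(p_hat - 0.5)))   # test of p = 1/2
\end{verbatim}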
We develop a methodology for conducting inference on extreme quantiles of unobserved individual heterogeneity (heterogeneous coefficients, heterogeneous treatment effects, etc.) in a panel data or meta-analysis setting. Inference in such settings is challenging: only noisy estimates of the unobserved heterogeneity are available, and approximations based on the central limit theorem work poorly for extreme quantiles. For this situation, we derive an extreme value theorem and an intermediate order theorem for the noisy estimates under weak assumptions and appropriate rate and moment conditions. Both theorems are then used to construct confidence intervals for extremal quantiles. The intervals are simple to construct and require no optimization. Inference based on intermediate order statistics relies on a novel self-normalized intermediate order theorem. In simulations, our extremal confidence intervals have favorable coverage properties in the tail. Our methodology is illustrated with an application to firm productivity in denser and less dense areas.
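The following R sketch merely illustrates the inferential difficulty described above: when only noisy estimates of the heterogeneous coefficients are observed, naive extreme sample quantiles of the estimates can be biased for the corresponding quantiles of the true heterogeneity. The data-generating choices are illustrative assumptions, and the sketch does not implement the proposed EVT-based confidence intervals.

\begin{verbatim}
## Minimal R sketch: extreme sample quantiles of noisy estimates versus the
## quantiles of the true (unobserved) heterogeneity.
set.seed(10)
n <- 2000
theta     <- rt(n, df = 4)                         # heavy-tailed heterogeneity
theta_hat <- theta + rnorm(n, 0, 0.5)              # noisy unit-level estimates
rbind(true      = quantile(theta,     c(0.95, 0.99, 0.999)),
      estimated = quantile(theta_hat, c(0.95, 0.99, 0.999)))
\end{verbatim}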
The problem of generalization and transportation of treatment effect estimates from a study sample to a target population is central to empirical research and statistical methodology. In both randomized experiments and observational studies, weighting methods are often used with this objective. Traditional methods construct the weights by separately modeling the treatment assignment and study selection probabilities and then multiplying functions (e.g., inverses) of their estimates. In this work, we provide a justification and an implementation for weighting in a single step. We show a formal connection between this one-step method and inverse probability and inverse odds weighting. We demonstrate that the resulting estimator for the target average treatment effect is consistent, asymptotically Normal, multiply robust, and semiparametrically efficient. We evaluate the performance of the one-step estimator in a simulation study. We illustrate its use in a case study on the effects of physician racial diversity on preventive healthcare utilization among Black men in California. We provide R code implementing the methodology.
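As a point of reference for the traditional two-model weighting mentioned above, the following R sketch separately fits a study-selection model and a treatment model, multiplies the inverse selection probabilities by the inverse treatment probabilities, and computes a Hajek-type weighted estimate of the whole-population average treatment effect on simulated data. The models and data are illustrative assumptions, and the sketch is not the one-step estimator proposed in the paper (for which the authors provide their own R code).

\begin{verbatim}
## Minimal R sketch: traditional multiplied weights (inverse selection
## probability times inverse treatment probability) for the target ATE.
set.seed(11)
N <- 10000
X <- rnorm(N)
S <- rbinom(N, 1, plogis(0.5 * X))                        # selection into the study
ps <- glm(S ~ X, family = binomial)$fitted                # P(S = 1 | X), fit on everyone
d <- data.frame(X = X[S == 1], ps = ps[S == 1])           # study sample only
d$A <- rbinom(nrow(d), 1, plogis(-0.3 * d$X))             # treatment in the study
d$Y <- 1 + 0.5 * d$X + d$A + rnorm(nrow(d))               # outcome (true effect = 1)
d$e <- glm(A ~ X, family = binomial, data = d)$fitted     # P(A = 1 | X, S = 1)
w1 <- d$A / (d$ps * d$e)                                  # multiplied inverse weights
w0 <- (1 - d$A) / (d$ps * (1 - d$e))
sum(w1 * d$Y) / sum(w1) - sum(w0 * d$Y) / sum(w0)         # Hajek-type target ATE estimate
\end{verbatim}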