In assessing the prediction accuracy of multivariable prediction models, optimism corrections are essential for preventing biased results. However, in most published papers on clinical prediction models, the point estimates of the prediction accuracy measures are corrected by adequate bootstrap-based correction methods, but their confidence intervals are not; for example, DeLong's confidence interval is usually reported for the C-statistic. These naive methods neither adjust for the optimism bias nor account for the statistical variability in estimating the parameters of the prediction model. Consequently, their coverage probabilities for the true value of the prediction accuracy measure can fall seriously below the nominal level (e.g., 95%). In this article, we provide two generic bootstrap methods, namely (1) location-shifted bootstrap confidence intervals and (2) two-stage bootstrap confidence intervals, that can be applied generally to the bootstrap-based optimism correction methods, i.e., Harrell's bias correction and the 0.632 and 0.632+ methods. They can also be applied to a wide range of model development strategies, including modern shrinkage methods such as ridge and lasso regression. In numerical evaluations by simulation, the proposed confidence intervals showed favourable coverage, whereas the current standard practice based on optimism-uncorrected methods showed serious undercoverage. To avoid erroneous results, optimism-uncorrected confidence intervals should not be used in practice, and the adjusted methods are recommended instead. We also developed the R package predboot implementing these methods (https://github.com/nomahi/predboot). The effectiveness of the proposed methods is illustrated through applications to the GUSTO-I clinical trial.
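As a rough Python sketch of the kind of correction involved, the code below computes Harrell's bootstrap optimism correction for the C-statistic (AUC) of a logistic model and then shifts a naive bootstrap percentile interval by the estimated optimism. It is a simplified stand-in for the location-shifted construction, not the predboot implementation; the data, model, and number of bootstrap replicates are hypothetical.

```python
# Minimal sketch: Harrell's bootstrap optimism correction for the C-statistic,
# plus a crude location-shifted interval (naive percentile interval shifted by
# the estimated optimism). Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.normal(size=(n, p))
prob = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))
y = rng.binomial(1, prob)

def fit_auc(X_tr, y_tr, X_te, y_te):
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

apparent = fit_auc(X, y, X, y)

B = 200
optimism, boot_auc = [], []
for _ in range(B):
    idx = rng.integers(0, n, n)
    Xb, yb = X[idx], y[idx]
    auc_boot = fit_auc(Xb, yb, Xb, yb)   # apparent AUC within the bootstrap sample
    auc_orig = fit_auc(Xb, yb, X, y)     # bootstrap model evaluated on the original data
    optimism.append(auc_boot - auc_orig)
    boot_auc.append(auc_boot)

corrected = apparent - np.mean(optimism)
lo, hi = np.percentile(boot_auc, [2.5, 97.5]) - np.mean(optimism)
print(f"apparent AUC = {apparent:.3f}, optimism-corrected = {corrected:.3f}, "
      f"shifted 95% interval = ({lo:.3f}, {hi:.3f})")
```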
In clinical trials and other applications, we often see regions of the feature space that appear to exhibit interesting behaviour, but it is unclear whether these observed phenomena are reflected at the population level. Focusing on a regression setting, we consider the subgroup selection challenge of identifying a region of the feature space on which the regression function exceeds a pre-determined threshold. We formulate the problem as one of constrained optimisation, where we seek a low-complexity, data-dependent selection set on which, with a guaranteed probability, the regression function is uniformly at least as large as the threshold; subject to this constraint, we would like the region to contain as much mass under the marginal feature distribution as possible. This leads to a natural notion of regret, and our main contribution is to determine the minimax optimal rate for this regret in both the sample size and the Type I error probability. The rate involves a delicate interplay between parameters that control the smoothness of the regression function, as well as exponents that quantify the extent to which the optimal selection set at the population level can be approximated by families of well-behaved subsets. Finally, we expand the scope of our previous results by illustrating how they may be generalised to a treatment and control setting, where interest lies in the heterogeneous treatment effect.
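To make the selection problem concrete, the following toy Python sketch selects bins of a one-dimensional feature space on which a lower confidence bound for the regression function exceeds a threshold. It is a naive plug-in heuristic with no multiplicity control, not the minimax-optimal procedure analysed in the paper; the data and threshold are hypothetical.

```python
# Toy illustration of subgroup selection: keep the bins where a per-bin lower
# confidence bound on the regression function exceeds the threshold tau.
import numpy as np

rng = np.random.default_rng(1)
n, tau = 2000, 0.5
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)   # regression function sin(2*pi*x)

bins = np.linspace(0, 1, 21)
selected = []
for lo, hi in zip(bins[:-1], bins[1:]):
    in_bin = (x >= lo) & (x < hi)
    if in_bin.sum() < 10:
        continue
    m = y[in_bin].mean()
    se = y[in_bin].std(ddof=1) / np.sqrt(in_bin.sum())
    if m - 2.58 * se > tau:          # ~99% per-bin lower bound, no multiplicity control
        selected.append((round(lo, 2), round(hi, 2)))
print("bins selected above tau:", selected)
```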
We study approaches to robust model-based design of experiments in the context of maximum-likelihood estimation. These approaches robustify model-based methodologies for optimal experiment design by accounting for the effect of parametric uncertainty. We consider robust optimal experiment design in the framework of nonlinear least-squares parameter estimation using linearized confidence regions. We investigate several well-known robustification frameworks in this respect and propose a novel methodology based on multi-stage robust optimization. The proposed methodology targets problems in which the experiments are designed sequentially, with the possibility of re-estimating the parameters between experiments. The multi-stage formalism helps identify experiments that are better conducted in the early phase of experimentation, when parameter knowledge is poor. We demonstrate the findings and the effectiveness of the proposed methodology on four case studies of varying complexity.
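As one small illustration of robustifying a design against parametric uncertainty, the Python sketch below picks a worst-case (maximin) D-optimal three-point design for a simple exponential-decay model, maximising the minimum log-determinant of the Fisher information over a handful of parameter scenarios. The model, candidate time points, and scenarios are hypothetical, and the sketch does not implement the multi-stage formulation proposed in the paper.

```python
# Maximin (worst-case) D-optimal design sketch for y = theta1 * exp(-theta2 * t).
import itertools
import numpy as np

def log_det_fim(times, theta):
    """Log-determinant of the Fisher information for unit-variance observations."""
    th1, th2 = theta
    J = np.column_stack([np.exp(-th2 * times), -th1 * times * np.exp(-th2 * times)])
    return np.linalg.slogdet(J.T @ J)[1]

candidate_times = np.linspace(0.1, 5.0, 10)
scenarios = [(1.0, 0.5), (1.0, 1.0), (1.0, 2.0)]     # plausible parameter values

best_design, best_worst_case = None, -np.inf
for design in itertools.combinations(candidate_times, 3):
    worst = min(log_det_fim(np.array(design), th) for th in scenarios)
    if worst > best_worst_case:
        best_design, best_worst_case = design, worst
print("maximin D-optimal 3-point design:", [round(t, 2) for t in best_design])
```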
Motivation: Combining the results of different experiments to exhibit complex patterns or to improve statistical power is a typical aim of data integration. The starting point of the statistical analysis often comes as sets of p-values resulting from previous analyses, which need to be combined in a flexible way to explore complex hypotheses while guaranteeing a low proportion of false discoveries. Results: We introduce the generic concept of a composed hypothesis, which corresponds to an arbitrarily complex combination of simple hypotheses. We rephrase the problem of testing a composed hypothesis as a classification task and show that finding items for which the composed null hypothesis is rejected boils down to fitting a mixture model and classifying the items according to their posterior probabilities. We show that inference can be performed efficiently and provide a classification rule that controls the type I error. The performance and usefulness of the approach are illustrated on simulations and on two different applications. The method is scalable, does not require any parameter tuning, and provides valuable biological insight in the application cases considered. Availability: The QCH methodology is implemented in the qch R package hosted on CRAN.
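The classification view can be illustrated with a deliberately simplified Python sketch for a single series of p-values: transform them to probit scores, fit a two-component Gaussian mixture, and call the items whose posterior probability of belonging to the signal component is high. This is not the qch implementation (which handles composed hypotheses across several p-value series), and the simulated data and cut-off are hypothetical.

```python
# Simplified sketch: multiple testing as classification via a two-component mixture.
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
p_null = rng.uniform(size=900)                      # items under the null
p_signal = norm.sf(rng.normal(loc=3, size=100))     # signal items give small p-values
pvals = np.concatenate([p_null, p_signal])

z = norm.isf(pvals).reshape(-1, 1)                  # probit transform of the p-values
gm = GaussianMixture(n_components=2, random_state=0).fit(z)
signal_comp = int(np.argmax(gm.means_.ravel()))     # component with the larger mean = signal
posterior = gm.predict_proba(z)[:, signal_comp]
rejected = posterior > 0.95                         # classify by posterior probability
print("items called:", int(rejected.sum()), "out of", len(pvals))
```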
In cluster randomized trials, patients are recruited after clusters are randomized, and the recruiters and patients may not be blinded to the assignment. This often leads to differential recruitment and systematic differences in the baseline characteristics of recruited patients between the intervention and control arms, inducing post-randomization selection bias. We aim to rigorously define causal estimands in the presence of selection bias. We elucidate the conditions under which standard covariate adjustment methods can validly estimate these estimands. We further discuss the additional data and assumptions necessary for estimating causal effects when such conditions are not met. Adopting the principal stratification framework of causal inference, we clarify that there are two average treatment effect (ATE) estimands in cluster randomized trials: one for the overall population and one for the recruited population. We derive analytical formulas for the two estimands in terms of principal-stratum-specific causal effects. Further, using simulation studies, we assess the empirical performance of multivariable regression adjustment under different data-generating processes leading to selection bias. When treatment effects are heterogeneous across principal strata, the ATE on the overall population generally differs from the ATE on the recruited population. A naive intention-to-treat analysis of the recruited sample leads to biased estimates of both ATEs. In the presence of post-randomization selection and without additional data on the non-recruited subjects, the ATE on the recruited population is estimable only when the treatment effects are homogeneous between principal strata, and the ATE on the overall population is generally not estimable. The extent to which covariate adjustment can remove selection bias depends on the degree of effect heterogeneity across principal strata.
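The mechanism can be mimicked with a small Python simulation in which recruitment in the intervention arm depends on a patient covariate, so that a naive comparison of recruited patients is biased even though the individual-level treatment effect is constant. The data-generating values below are hypothetical and are not those used in the paper's simulation studies.

```python
# Minimal simulation of post-randomization selection bias in a cluster randomized trial.
import numpy as np

rng = np.random.default_rng(3)
n_clusters, m = 200, 50
arm = rng.binomial(1, 0.5, n_clusters)               # cluster-level randomization

treated, control = [], []
for c in range(n_clusters):
    x = rng.normal(size=m)                           # patient-level covariate
    lin = 1.5 * x * arm[c]                           # intervention-arm recruiters favour high-x patients
    recruited = rng.binomial(1, 1 / (1 + np.exp(-lin))).astype(bool)
    y = 1.0 * arm[c] + 0.8 * x + rng.normal(size=m)  # constant treatment effect = 1.0
    (treated if arm[c] == 1 else control).append(y[recruited])

naive = np.concatenate(treated).mean() - np.concatenate(control).mean()
print(f"naive recruited-sample estimate = {naive:.2f} (true overall ATE = 1.0)")
```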
Traditionally, least squares regression is concerned mainly with studying the effects of individual predictor variables, but strongly correlated variables generate multicollinearity, which makes it difficult to study their effects. Existing methods for handling multicollinearity, such as ridge regression, are complicated. To resolve the multicollinearity issue without abandoning simple least squares regression, for situations where the predictor variables form groups with strong within-group correlations but weak between-group correlations, we propose to study the effects of the groups through a group approach to least squares regression. Using an all-positive-correlations arrangement of the strongly correlated variables, we first characterize group effects that are meaningful and can be accurately estimated. We then present the group approach with numerical examples and demonstrate its advantages over existing methods for handling multicollinearity. We also address a common misconception about the prediction accuracy of the least squares estimated model.
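The phenomenon motivating the group approach can be seen in a short Python simulation: with three near-duplicate predictors, the individual least squares coefficients vary wildly across replications, while a group effect (here the summed coefficient, i.e. the effect of raising all three variables together by one unit) is estimated precisely. The simulation settings are hypothetical and the sketch does not reproduce the paper's characterization of estimable group effects.

```python
# Individual coefficients vs. a group effect under strong within-group correlation.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 500
first_coef, group_effect = [], []
for _ in range(reps):
    z = rng.normal(size=n)
    X = np.column_stack([z + 0.1 * rng.normal(size=n) for _ in range(3)])  # correlation ~0.99
    y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)
    design = np.column_stack([np.ones(n), X])
    beta = np.linalg.lstsq(design, y, rcond=None)[0][1:]   # drop the intercept
    first_coef.append(beta[0])
    group_effect.append(beta.sum())

print("sd of an individual coefficient:", round(np.std(first_coef), 2))
print("sd of the summed group effect: ", round(np.std(group_effect), 2))
```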
Longitudinal and high-dimensional measurements have become increasingly common in biomedical research. However, methods to predict survival outcomes using covariates that are both longitudinal and high-dimensional are currently missing. In this article, we propose penalized regression calibration (PRC), a method that can be employed to predict survival in such situations. PRC comprises three modeling steps: first, the trajectories described by the longitudinal predictors are flexibly modeled through the specification of multivariate mixed effects models; second, subject-specific summaries of the longitudinal trajectories are derived from the fitted mixed models; third, the time-to-event outcome is predicted using the subject-specific summaries as covariates in a penalized Cox model. To ensure proper internal validation of the fitted PRC models, we furthermore develop a cluster bootstrap optimism correction procedure (CBOCP) that corrects for the optimistic bias of apparent measures of predictiveness. PRC and the CBOCP are implemented in the R package pencal, available from CRAN. After studying the behavior of PRC via simulations, we conclude by illustrating an application of PRC to data from an observational study of patients affected by Duchenne muscular dystrophy, where the goal is to predict time to loss of ambulation using longitudinal blood biomarkers.
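A stripped-down Python sketch of the two-stage logic (not the pencal implementation) is given below: each subject's biomarker trajectory is summarised by a per-subject intercept and slope, standing in for the mixed-model step, and the summaries then enter a ridge-penalized Cox model via lifelines. The simulated data, penalty value, and biomarker model are hypothetical.

```python
# Two-stage sketch: per-subject trajectory summaries -> penalized Cox model.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(5)
n_subj, n_visits = 150, 5
rows = []
for _ in range(n_subj):
    slope = rng.normal()
    t = np.arange(n_visits)
    biomarker = 1.0 + slope * t + rng.normal(scale=0.5, size=n_visits)
    b1, b0 = np.polyfit(t, biomarker, 1)                       # per-subject slope and intercept
    time = rng.exponential(scale=np.exp(-0.7 * slope)) + 0.1   # steeper trajectories fail sooner
    rows.append({"traj_intercept": b0, "traj_slope": b1,
                 "time": time, "event": rng.binomial(1, 0.8)})

df = pd.DataFrame(rows)
cph = CoxPHFitter(penalizer=0.1)                               # ridge-penalized Cox step
cph.fit(df, duration_col="time", event_col="event")
print(cph.summary[["coef", "exp(coef)"]])
```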
Existing techniques for mitigating dataset bias often leverage a biased model to identify biased instances. The role of these biased instances is then reduced during the training of the main model to enhance its robustness to out-of-distribution data. A common core assumption of these techniques is that the main model handles biased instances similarly to the biased model, in that it will resort to biases whenever they are available. In this paper, we show that this assumption does not hold in general. We carry out a critical investigation on two well-known datasets in the domain, MNLI and FEVER, along with two biased instance detection methods, partial-input and limited-capacity models. Our experiments show that for around a third to a half of the instances, the biased model is unable to predict the main model's behavior, as highlighted by the significantly different parts of the input on which the two models base their decisions. Based on a manual validation, we also show that this estimate is highly in line with human interpretation. Our findings suggest that down-weighting the instances detected by bias detection methods, a widely practiced procedure, is an unnecessary waste of training data. We release our code to facilitate reproducibility and future research.
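For concreteness, the Python sketch below reproduces, on toy NLI-style data, the partial-input recipe being critiqued: a hypothesis-only classifier is fit, the instances it can already solve from the partial input are flagged as potentially biased, and those instances are down-weighted for training the main model. The toy sentences, labels, and down-weighting factor are hypothetical.

```python
# Partial-input bias detection and instance down-weighting on toy data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

hypotheses = ["a man is sleeping", "nobody is outside", "a dog runs in the park",
              "the woman is not eating", "children are playing football",
              "there is no cat on the mat"]
labels = np.array([1, 0, 1, 0, 1, 0])                  # 1 = entailment, 0 = contradiction

X = TfidfVectorizer().fit_transform(hypotheses)        # hypothesis-only (partial-input) features
held_out_preds = cross_val_predict(LogisticRegression(), X, labels, cv=3)
suspected_biased = held_out_preds == labels            # solvable from the partial input alone

weights = np.where(suspected_biased, 0.2, 1.0)         # down-weight suspected biased instances
print("instance weights for main-model training:", weights)
```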
The inverse probability of treatment weighting (IPTW) approach is commonly used in propensity score analysis to infer causal effects in regression models. Because of oversized IPTW weights and errors associated with propensity score estimation, the IPTW approach can underestimate the standard error of the causal effect. To remedy this, bootstrap standard errors have been recommended to replace the IPTW standard error, but the ordinary bootstrap (OB) procedure may still underestimate the standard error because of its inefficient resampling scheme and unstabilized weights. In this paper, we develop a generalized bootstrap (GB) procedure for estimating the standard error of the IPTW approach. Compared with the OB procedure, the GB procedure has a much lower risk of underestimating the standard error and is more efficient for both the point and standard error estimates. The GB procedure also has a smaller risk of standard error underestimation than the ordinary bootstrap procedure with trimmed weights, with comparable efficiency. We demonstrate the effectiveness of the GB procedure via a simulation study and a dataset from the National Educational Longitudinal Study-1988 (NELS-88).
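As a point of reference, the Python sketch below shows the baseline the paper improves on: a Hajek-type IPTW point estimate with an ordinary bootstrap standard error, re-estimating the propensity score inside each resample. The generalized bootstrap of the paper is not reproduced here; the simulated data and effect size are hypothetical.

```python
# IPTW point estimate with an ordinary bootstrap standard error (the baseline, not the GB procedure).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 1000
x = rng.normal(size=(n, 3))
a = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))                   # treatment depends on x[:, 0]
y = 1.0 * a + x @ np.array([0.5, 0.3, 0.0]) + rng.normal(size=n)  # true effect = 1.0

def iptw_estimate(x, a, y):
    ps = LogisticRegression(max_iter=1000).fit(x, a).predict_proba(x)[:, 1]
    w = a / ps + (1 - a) / (1 - ps)                               # inverse-probability weights
    return np.average(y, weights=a * w) - np.average(y, weights=(1 - a) * w)

point = iptw_estimate(x, a, y)
boot = []
for _ in range(200):
    i = rng.integers(0, n, n)
    boot.append(iptw_estimate(x[i], a[i], y[i]))
print(f"IPTW estimate = {point:.3f}, ordinary bootstrap SE = {np.std(boot, ddof=1):.3f}")
```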
Random subspaces $X$ of $\mathbb{R}^n$ of dimension proportional to $n$ are, with high probability, well-spread with respect to the $\ell_p$-norm (for $p \in [1,2]$). Namely, every nonzero $x \in X$ is "robustly non-sparse" in the following sense: $x$ is $\varepsilon \|x\|_p$-far in $\ell_p$-distance from all $\delta n$-sparse vectors, for positive constants $\varepsilon, \delta$ bounded away from $0$. This "$\ell_p$-spread" property is the natural counterpart, for subspaces over the reals, of the minimum distance of linear codes over finite fields, and, for $p = 2$, corresponds to $X$ being a Euclidean section of the $\ell_1$ unit ball. Explicit $\ell_p$-spread subspaces of dimension $\Omega(n)$, however, are not known except for $p=1$. The construction for $p=1$, as well as the best known constructions for $p \in (1,2]$ (which achieve weaker spread properties), are analogs of low density parity check (LDPC) codes over the reals, i.e., they are kernels of sparse matrices. We study the spread properties of the kernels of sparse random matrices. Rather surprisingly, we prove that with high probability such subspaces contain vectors $x$ that are $o(1)\cdot \|x\|_2$-close to $o(n)$-sparse with respect to the $\ell_2$-norm, and in particular are not $\ell_2$-spread. On the other hand, for $p < 2$ we prove that such subspaces are $\ell_p$-spread with high probability. Moreover, we show that a random sparse matrix has the stronger restricted isometry property (RIP) with respect to the $\ell_p$ norm, and this follows solely from the unique expansion of a random biregular graph, yielding a somewhat unexpected generalization of a similar result for the $\ell_1$ norm [BGI+08]. Instantiating this with explicit expanders, we obtain the first explicit constructions of $\ell_p$-spread subspaces and $\ell_p$-RIP matrices for $1 \leq p < p_0$, where $1 < p_0 < 2$ is an absolute constant.
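The spread quantity under study can be probed numerically: the Python sketch below builds a sparse random matrix, samples vectors from its kernel, and measures their relative $\ell_1$ and $\ell_2$ distances to $k$-sparse vectors (the norm of the vector after its $k$ largest-magnitude entries are removed). It is purely exploratory, with hypothetical dimensions; random sampling will not exhibit the worst-case near-sparse kernel vectors whose existence the paper establishes.

```python
# Exploratory check of l_p-spread for the kernel of a sparse random matrix.
import numpy as np

rng = np.random.default_rng(7)
n, m, d = 400, 200, 8                           # ambient dimension, rows, nonzeros per column
A = np.zeros((m, n))
for j in range(n):
    A[rng.choice(m, d, replace=False), j] = rng.normal(size=d)

_, s, vt = np.linalg.svd(A)                     # rows of vt beyond the rank span the kernel
kernel_basis = vt[np.sum(s > 1e-10):]

def tail_ratio(x, k, p):
    """Relative l_p distance of x to the set of k-sparse vectors."""
    tail = np.sort(np.abs(x))[:-k]              # drop the k largest-magnitude entries
    return np.linalg.norm(tail, p) / np.linalg.norm(x, p)

k = n // 20
samples = kernel_basis.T @ rng.normal(size=(kernel_basis.shape[0], 1000))
print("min l1 tail ratio over samples:", min(tail_ratio(x, k, 1) for x in samples.T))
print("min l2 tail ratio over samples:", min(tail_ratio(x, k, 2) for x in samples.T))
```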
Some years ago, Snapinn and Jiang [1] considered the interpretation and pitfalls of absolute versus relative treatment effect measures in analyses of time-to-event outcomes. Through specific examples and analytical considerations based solely on the exponential and Weibull distributions, they reached two conclusions: 1) that the commonly used criteria for clinical effectiveness, the absolute risk reduction (ARR) and the median survival time difference (MD), directly contradict each other, and 2) that cost-effectiveness depends only on the hazard ratio (HR) and the shape parameter (in the Weibull case) but not on the overall baseline risk of the population. Though provocative, the first conclusion does not hold in either of the two special cases considered, nor more generally, while the second conclusion is strictly correct only in the exponential case. Therefore, the implication drawn by the authors, i.e., that all measures of absolute treatment effect are of little value compared with the relative measure of the hazard ratio, is not of general validity, and hence both absolute and relative measures should continue to be used when appraising clinical evidence.
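To fix notation, the quantities at issue can be written explicitly under the exponential model: with baseline hazard $\lambda$ and hazard ratio $\psi$, the ARR at horizon $t$ is $e^{-\psi\lambda t} - e^{-\lambda t}$ and the median survival difference is $\ln 2\,(1/(\psi\lambda) - 1/\lambda)$. The Python sketch below evaluates both for hypothetical values of $\lambda$, $\psi$, and $t$; how these quantities should be interpreted and compared across baseline risks is precisely the point of contention discussed above.

```python
# ARR at a fixed horizon vs. median survival difference under exponential survival.
import numpy as np

hr, t = 0.7, 5.0                                    # hazard ratio and evaluation horizon (years)
for lam in (0.05, 0.30):                            # low-risk and high-risk baseline hazards
    arr = np.exp(-hr * lam * t) - np.exp(-lam * t)  # absolute risk reduction at time t
    md = np.log(2) / (hr * lam) - np.log(2) / lam   # difference in median survival times
    print(f"baseline hazard {lam:.2f}: ARR at {t:g}y = {arr:.3f}, median difference = {md:.2f}y")
```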