In survival analysis, complex machine learning algorithms have been increasingly used for predictive modeling. Given a collection of features available for inclusion in a predictive model, it may be of interest to quantify the relative importance of a subset of features for the prediction task at hand. In particular, in HIV vaccine trials, participant baseline characteristics are used to predict the probability of infection over the intended follow-up period, and investigators may wish to understand how much certain types of predictors, such as behavioral factors, contribute toward overall predictiveness. Time-to-event outcomes such as time to infection are often subject to right censoring, and existing methods for assessing variable importance are typically not intended to be used in this setting. We describe a broad class of algorithm-agnostic variable importance measures for prediction in the context of survival data. We propose a nonparametric efficient estimation procedure that incorporates flexible learning of nuisance parameters, yields asymptotically valid inference, and enjoys double-robustness. We assess the performance of our proposed procedure via numerical simulations and analyze data from the HVTN 702 study to inform enrollment strategies for future HIV vaccine trials.
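As a rough illustration of the plug-in idea behind algorithm-agnostic variable importance, the sketch below compares the concordance index of a survival model fitted with and without a hypothetical group of behavioral features, using the `lifelines` package on synthetic data. It is only a conceptual stand-in: the paper's estimator additionally corrects the predictiveness measure for right censoring, uses flexible nuisance estimation with sample splitting, and yields efficient, doubly robust inference, none of which is reproduced here.

```python
# Conceptual sketch (not the paper's estimator): plug-in variable importance for a
# right-censored outcome as the drop in C-index when a feature group is removed.
# Column names and the data-generating process are illustrative.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "behavioral": rng.normal(size=n),   # hypothetical behavioral factor
    "demographic": rng.normal(size=n),  # hypothetical demographic factor
})
risk = 0.8 * df["behavioral"] + 0.3 * df["demographic"]
event_time = rng.exponential(np.exp(-risk))
censor_time = rng.exponential(1.0, size=n)
df["T"] = np.minimum(event_time, censor_time)
df["E"] = (event_time <= censor_time).astype(int)

def c_index(features):
    """Fit a Cox model on `features` and return its concordance index."""
    cph = CoxPHFitter().fit(df[features + ["T", "E"]], duration_col="T", event_col="E")
    # Negate partial hazards: a higher hazard means shorter survival.
    return concordance_index(df["T"], -cph.predict_partial_hazard(df), df["E"])

full = c_index(["behavioral", "demographic"])
reduced = c_index(["demographic"])
print(f"estimated importance of the behavioral feature group: {full - reduced:.3f}")
```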
Partial differential equations (PDEs) have become an essential tool for modeling complex physical systems. Such equations are typically solved numerically via mesh-based methods, such as finite element methods, which yield solutions over the spatial domain. However, obtaining these solutions is often prohibitively costly, limiting the feasibility of exploring the PDE parameters. In this paper, we propose an efficient emulator that simultaneously predicts the solutions over the spatial domain, with theoretical justification of its uncertainty quantification. The novelty of the proposed method lies in the incorporation of the mesh node coordinates into the statistical model. In particular, the proposed method segments the mesh nodes into multiple clusters via a Dirichlet process prior and fits Gaussian process models with the same hyperparameters within each cluster. Most importantly, by revealing the underlying clustering structure, the proposed method can provide valuable insights into qualitative features of the resulting dynamics that can guide further investigations. Real examples demonstrate that our proposed method has smaller prediction errors than its main competitors, with competitive computation time, and identifies interesting clusters of mesh nodes that possess physical significance, such as satisfying boundary conditions. An R package for the proposed methodology is provided in an open repository.
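For intuition only, the following sketch shows the generic partition-and-fit pattern: mesh nodes are grouped by their coordinates and a separate Gaussian process is fitted within each group. It uses k-means and GPs over node coordinates purely as stand-ins; the paper instead infers the clustering with a Dirichlet process prior and shares GP hyperparameters within each cluster, which is not reproduced here.

```python
# Simplified sketch of the cluster-then-emulate idea; k-means is a stand-in for the
# Dirichlet process clustering used in the paper, and the data are synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
nodes = rng.uniform(0, 1, size=(400, 2))                      # mesh node coordinates (illustrative)
solution = np.sin(4 * nodes[:, 0]) * np.cos(3 * nodes[:, 1])  # field values at the nodes

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(nodes)
emulators = {}
for k in range(km.n_clusters):
    idx = km.labels_ == k
    emulators[k] = GaussianProcessRegressor(kernel=RBF(0.2), alpha=1e-6).fit(nodes[idx], solution[idx])

# Predict at new locations with the emulator of the cluster each point falls in.
new_points = rng.uniform(0, 1, size=(5, 2))
preds = [emulators[c].predict(p[None, :])[0] for c, p in zip(km.predict(new_points), new_points)]
print(np.round(preds, 3))
```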
Neural operators (NOs) are discretization-invariant deep learning methods with functional output that can approximate any continuous operator. NOs have demonstrated superior performance in solving partial differential equations (PDEs) compared with other deep learning methods. However, the spatial domain of the input function needs to be identical to that of the output, which limits their applicability. For instance, the widely used Fourier neural operator (FNO) fails to approximate the operator that maps a boundary condition to the PDE solution. To address this issue, we propose a novel framework called the resolution-invariant deep operator (RDO), which decouples the spatial domains of the input and the output. RDO is motivated by the deep operator network (DeepONet) but, unlike DeepONet, does not require retraining the network when the input/output resolution is changed. RDO takes functional input, and its output is also functional, so it retains the resolution-invariance property of NOs. It can also solve PDEs with complex geometries, whereas NOs fail to do so. Various numerical experiments demonstrate the advantage of our method over DeepONet and FNO.
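Since RDO is motivated by DeepONet, a minimal DeepONet-style sketch may help fix ideas. It is not an implementation of RDO (or FNO); the toy antiderivative operator and all dimensions are purely illustrative.

```python
# Minimal DeepONet-style sketch: a branch net encodes the input function sampled at
# fixed sensors, a trunk net encodes an output coordinate, and their inner product
# approximates the operator output at that coordinate.
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    def __init__(self, n_sensors=50, width=64, p=32):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(n_sensors, width), nn.Tanh(), nn.Linear(width, p))
        self.trunk = nn.Sequential(nn.Linear(1, width), nn.Tanh(), nn.Linear(width, p))

    def forward(self, u_sensors, y):
        # u_sensors: (batch, n_sensors) input function values; y: (batch, 1) query point
        return (self.branch(u_sensors) * self.trunk(y)).sum(dim=-1, keepdim=True)

# Toy operator: map u(x) = sin(f x) to its antiderivative evaluated at y.
n_sensors, batch = 50, 256
x = torch.linspace(0, 1, n_sensors)
freq = torch.rand(batch, 1) * 5 + 1
u = torch.sin(freq * x)                       # input functions at the sensor locations
y = torch.rand(batch, 1)
target = (1 - torch.cos(freq * y)) / freq     # exact antiderivative at y

model = DeepONet(n_sensors)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):
    loss = ((model(u, y) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final MSE:", loss.item())
```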
SDRDPy is a desktop application that provides experts with an intuitive graphical and tabular representation of the knowledge extracted by any supervised descriptive rule discovery algorithm. The application provides an analysis of the data, showing the relevant information of the data set and the relationships between the rules, the data, and the quality measures associated with each rule, regardless of the tool in which the algorithm was executed. All information is presented in a user-friendly application to facilitate expert analysis and the export of reports in different formats.
Among semiparametric regression models, partially linear additive models provide a useful tool for including additive nonparametric components as well as a parametric component when explaining the relationship between the response and a set of explanatory variables. This paper concerns such models under sparsity assumptions for the covariates included in the linear component. Sparse covariates are frequent in regression problems, where the task of variable selection is usually of interest. As in other settings, outliers, either in the residuals or in the covariates involved in the linear component, have a harmful effect. To simultaneously achieve model selection for the parametric component of the model and resistance to outliers, we combine preliminary robust estimators of the additive components with robust linear $MM$-regression estimators that incorporate a penalty, such as SCAD, on the coefficients of the parametric part. Under mild assumptions, consistency results and rates of convergence for the proposed estimators are derived. A Monte Carlo study is carried out to compare, under different models and contamination schemes, the performance of the robust proposal with its classical counterpart. The results show the advantage of using the robust approach. Through the analysis of a real data set, we also illustrate the benefits of the proposed procedure.
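For readers unfamiliar with the penalty mentioned above, a minimal implementation of the SCAD penalty of Fan and Li (2001) is given below; it only evaluates the penalty and is not the proposed robust MM-estimation procedure.

```python
# The SCAD penalty, evaluated elementwise, with the conventional choice a = 3.7.
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lambda(|theta|): linear near zero, quadratic blend, then constant."""
    t = np.abs(theta)
    small = lam * t
    middle = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    large = np.full_like(t, lam**2 * (a + 1) / 2)
    return np.where(t <= lam, small, np.where(t <= a * lam, middle, large))

print(scad_penalty(np.array([0.0, 0.5, 2.0, 10.0]), lam=1.0))
```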
Posterior predictive p-values (ppps) have become popular tools for Bayesian model assessment, being general-purpose and easy to use. However, their interpretation can be difficult because their distribution is not uniform under the hypothesis that the model did generate the data. Calibrated ppps (cppps) can be obtained via a bootstrap-like procedure, yet they have remained out of reach in practice due to their high computational cost. This paper introduces methods that enable efficient approximation of cppps and their uncertainty for fast model assessment. We first investigate the computational trade-off between the number of calibration replicates and the number of MCMC samples per replicate. Provided that the MCMC chain from the real data has converged, using short MCMC chains per calibration replicate can save substantial computation time compared to naive implementations, without a significant loss in accuracy. We propose different variance estimators for the cppp approximation, which can be used to quickly confirm the lack of evidence against model misspecification. Because variance estimation relies on the effective sample sizes of many short MCMC chains, we show that these can be approximated well from the real-data MCMC chain. The cppp procedure is implemented in NIMBLE, a flexible framework for hierarchical modeling that supports many models and discrepancy measures.
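The structure of the calibration can be illustrated on a toy conjugate model where posterior sampling is exact, so the "short MCMC chains per replicate" reduce to direct posterior draws. This schematic assumes that calibration data sets are generated from parameters drawn from the posterior fitted to the observed data; it is written in plain numpy rather than NIMBLE and is not the paper's implementation.

```python
# Schematic cppp for y_i ~ N(mu, 1) with a N(0, 100) prior on mu.
# Discrepancy: sample variance (checks dispersion against the assumed sigma = 1).
import numpy as np
rng = np.random.default_rng(2)

def posterior_draws(y, n_draws, prior_var=100.0):
    v = 1.0 / (len(y) + 1.0 / prior_var)          # posterior variance of mu
    return rng.normal(v * y.sum(), np.sqrt(v), size=n_draws)

def ppp(y, n_draws=200):
    """Posterior predictive p-value for T(y) = var(y)."""
    mus = posterior_draws(y, n_draws)
    t_rep = np.array([np.var(rng.normal(mu, 1.0, size=len(y))) for mu in mus])
    return np.mean(t_rep >= np.var(y))

y_obs = rng.normal(0.0, 1.8, size=60)             # data more dispersed than the model assumes
ppp_obs = ppp(y_obs)

# Calibration replicates: simulate data sets from the fitted model and recompute the ppp.
cal_ppps = [ppp(rng.normal(mu0, 1.0, size=len(y_obs)))
            for mu0 in posterior_draws(y_obs, n_draws=100)]
cppp = np.mean(np.array(cal_ppps) <= ppp_obs)
print(f"ppp = {ppp_obs:.3f}, calibrated ppp = {cppp:.3f}")
```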
A statistical emulator can be used as a surrogate for complex physics-based calculations to drastically reduce the computational cost. Its successful implementation hinges on an accurate representation of the nonlinear response surface over a high-dimensional input space. Conventional "space-filling" designs, including random sampling and Latin hypercube sampling, become inefficient as the dimensionality of the input variables increases, and the predictive accuracy of the emulator can degrade substantially for a test input distant from the training input set. To address this fundamental challenge, we develop a reliable emulator for predicting complex functionals by active learning with error control (ALEC). The algorithm is applicable to infinite-dimensional mappings, with high-fidelity predictions and a controlled predictive error. Its computational efficiency is demonstrated by emulating classical density functional theory (cDFT) calculations, a statistical-mechanical method widely used in modeling the equilibrium properties of complex molecular systems. We show that ALEC is much more accurate than conventional emulators based on Gaussian processes with "space-filling" designs and alternative active learning methods. Moreover, it is computationally more efficient than direct cDFT calculations. ALEC can be a reliable building block for emulating expensive functionals owing to its minimal computational cost, controllable predictive error, and fully automatic features.
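A generic variance-based active learning loop for a Gaussian process emulator, sketched below with scikit-learn on a toy functional, conveys the flavor of the approach; ALEC's specific error-control criterion and the cDFT application are not reproduced, and the tolerance and toy function are assumptions for this example.

```python
# Generic active learning sketch: query the candidate input with the largest predictive
# standard deviation until it falls below a tolerance (not ALEC's actual criterion).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_functional(x):                      # stand-in for an expensive cDFT-style calculation
    return np.sin(3 * x[:, 0]) + x[:, 1] ** 2

rng = np.random.default_rng(3)
candidates = rng.uniform(-1, 1, size=(2000, 2))   # pool of unlabeled inputs
X = candidates[:5].copy()                         # small initial design
y = expensive_functional(X)

tol = 0.05
for _ in range(200):
    gp = GaussianProcessRegressor(kernel=RBF(0.5), alpha=1e-8).fit(X, y)
    _, sd = gp.predict(candidates, return_std=True)
    if sd.max() < tol:                            # predictive uncertainty under control
        break
    x_new = candidates[np.argmax(sd)][None, :]    # query where the emulator is least certain
    X = np.vstack([X, x_new])
    y = np.append(y, expensive_functional(x_new))
print(f"stopped with {len(X)} training points, max predictive sd = {sd.max():.3f}")
```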
Regression with random data objects is becoming increasingly common in modern data analysis. Unfortunately, as in the traditional regression setting with Euclidean data, random response regression is not immune to the trouble caused by unusual observations. We propose a metric Cook's distance that extends the classical Cook's distance of Cook (1977) to general metric-valued response objects. The performance of the metric Cook's distance in both Euclidean and non-Euclidean response regression with Euclidean predictors is demonstrated in an extensive experimental study. A real data analysis of county-level COVID-19 transmission in the United States also illustrates the usefulness of this method in practice.
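For reference, the classical Euclidean-response quantity that the proposal generalizes can be computed as follows; the metric-valued extension itself is not shown.

```python
# Classical Cook's distance for an OLS fit (Cook, 1977), on synthetic data.
import numpy as np

def cooks_distance(X, y):
    """Cook's distances for an OLS fit with design X (including an intercept column)."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
    h = np.diag(H)
    resid = y - H @ y
    s2 = resid @ resid / (n - p)
    return resid**2 * h / (p * s2 * (1 - h) ** 2)

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=100)
y[0] += 8.0                                       # inject an unusual observation
print(np.argsort(cooks_distance(X, y))[-3:])      # indices of the most influential points
```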
The analysis of survey data frequently arises in clinical trials, particularly when capturing quantities that are difficult to measure. Typical examples are questionnaires about patients' well-being, pain, or consent to an intervention. In these, data are captured on a discrete scale containing only a limited number of possible answers, from which respondents pick the one that best fits their personal opinion. These data generally lie on an ordinal scale, as the answers can usually be arranged in ascending order, e.g., "bad", "neutral", "good" for well-being. Since responses are usually stored numerically for data-processing purposes, ordinary linear regression models are commonly applied to analyze survey data. However, the assumptions of these models are often not met, as linear regression requires constant variability of the response variable and can yield predictions outside the range of response categories. Moreover, linear models only provide insights about the mean response, which may affect representativeness. In contrast, ordinal regression models can provide probability estimates for all response categories and yield information about the full response scale beyond the mean. In this work, we provide a concise overview of the fundamentals of latent-variable-based ordinal models, applications to a real data set, and an outline of the use of state-of-the-art software for this purpose. Moreover, we discuss strengths, limitations, and typical pitfalls. This is a companion work to a current vignette-based structured interview study in paediatric anaesthesia.
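As a small illustration of the latent-variable formulation, the sketch below fits a proportional-odds (cumulative logit) model with statsmodels to made-up data; the variable names, the data-generating process, and the choice of software are assumptions for this example only, and any real analysis should follow the software discussed in the paper.

```python
# Proportional-odds model for an ordinal survey outcome, assuming a recent statsmodels
# release that provides OrderedModel. Data are synthetic and purely illustrative.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(5)
n = 300
age = rng.uniform(3, 16, size=n)                  # hypothetical predictor
latent = 0.3 * age + rng.logistic(size=n)         # latent-variable formulation
wellbeing = pd.Series(pd.cut(latent, bins=[-np.inf, 2, 4, np.inf],
                             labels=["bad", "neutral", "good"], ordered=True))

res = OrderedModel(wellbeing, pd.DataFrame({"age": age}), distr="logit").fit(method="bfgs", disp=False)
print(res.params)                                 # slope for age plus threshold parameters

# Unlike a linear model fit to numeric codes, the ordinal model returns a probability
# for every response category at any covariate value.
print(res.predict(pd.DataFrame({"age": [5.0, 15.0]})))
```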
This study presents a Bayesian regression framework to model the relationship between scalar outcomes and brain functional connectivity represented as symmetric positive definite (SPD) matrices. Unlike many proposals that simply vectorize the connectivity predictors, thereby ignoring their matrix structure, our method respects the Riemannian geometry of SPD matrices by modeling them in a tangent space. We perform dimension reduction in the tangent space, relating the resulting low-dimensional representations to the responses. The dimension-reduction matrix is learned in a supervised manner with a sparsity-inducing prior imposed on a Stiefel manifold to prevent overfitting. Our method yields a parsimonious regression model that allows uncertainty quantification of the estimates and identification of key brain regions that predict the outcomes. We demonstrate the performance of our approach in simulation settings and through a case study predicting Picture Vocabulary scores using data from the Human Connectome Project.
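The tangent-space step can be sketched as follows: each SPD matrix is mapped through the matrix logarithm (taken here at the identity, a simplification) and vectorized before regression. The paper's supervised dimension reduction with a Stiefel-manifold prior and its Bayesian uncertainty quantification are not reproduced, and the ridge fit below is only a placeholder regression.

```python
# Tangent-space featurization of SPD matrices on synthetic data, then a plain ridge fit.
import numpy as np
from scipy.linalg import logm
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(6)
n, d = 80, 10                                     # subjects and brain regions (illustrative)

def random_spd(d):
    A = rng.normal(size=(d, d))
    return A @ A.T + d * np.eye(d)

spd_mats = [random_spd(d) for _ in range(n)]
iu = np.triu_indices(d)                           # vectorize the upper triangle
X = np.array([np.real(logm(S))[iu] for S in spd_mats])
y = X @ rng.normal(size=X.shape[1]) + rng.normal(size=n)   # synthetic scalar outcome

print("in-sample R^2 of ridge fit on tangent-space features:",
      RidgeCV().fit(X, y).score(X, y))
```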
The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting. We survey recent theoretical progress that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behavior of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favorable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
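A small numerical illustration of the implicit-regularization principle in the overparametrized linear setting discussed above: gradient descent on the squared loss, started from zero, converges to the minimum-norm interpolating solution (the pseudoinverse solution). The dimensions and step size are arbitrary choices for this sketch.

```python
# Gradient descent on overparametrized least squares converges to the minimum-norm
# interpolant when initialized at zero, since the iterates stay in the row space of X.
import numpy as np

rng = np.random.default_rng(7)
n, p = 20, 200                                    # many more parameters than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

w = np.zeros(p)
lr = 0.01
for _ in range(20_000):
    w -= lr * X.T @ (X @ w - y) / n               # gradient of (1/2n)||Xw - y||^2

w_min_norm = np.linalg.pinv(X) @ y                # minimum-norm interpolating solution
print("training error of GD solution:   ", np.linalg.norm(X @ w - y))
print("distance to min-norm interpolant:", np.linalg.norm(w - w_min_norm))
```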