When studying the association between treatment and a clinical outcome, a parametric multivariable model of the conditional outcome expectation is often used to adjust for covariates. The treatment coefficient of the outcome model targets a conditional treatment effect. Model-based standardization is typically applied to average the model predictions over the target covariate distribution and to generate a covariate-adjusted estimate of the marginal treatment effect. The standard approach to model-based standardization involves maximum-likelihood estimation and use of the non-parametric bootstrap. We introduce a novel, general-purpose, model-based standardization method based on multiple imputation that is easily applicable when the outcome model is a generalized linear model. We term our proposed approach multiple imputation marginalization (MIM). MIM consists of two main stages: the generation of synthetic datasets and their analysis. MIM accommodates a Bayesian statistical framework, which naturally allows for the principled propagation of uncertainty, integrates the analysis into a probabilistic framework, and allows for the incorporation of prior evidence. We conduct a simulation study to benchmark the finite-sample performance of MIM in conjunction with a parametric outcome model. The simulations provide proof-of-principle in scenarios with binary outcomes, continuous-valued covariates, a logistic outcome model and the marginal log odds ratio as the target effect measure. When parametric modeling assumptions hold, MIM yields unbiased estimation in the target covariate distribution, valid coverage rates, and precision and efficiency similar to those of the standard approach to model-based standardization.
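The distinction between the conditional treatment coefficient and the standardized marginal effect can be illustrated with a minimal Python sketch. It shows plain model-based standardization for a logistic outcome model, not MIM itself. The coefficient values, the single covariate, and the function name `marginal_log_odds_ratio` are assumptions made for illustration; in practice the coefficients would come from a fitted model.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1.0 - p))

def marginal_log_odds_ratio(beta0, beta_x, beta_t, covariates):
    """Standardize a logistic outcome model over a target covariate sample."""
    # Predicted risk for every covariate value under treatment and under control
    risk_treated = [expit(beta0 + beta_x * x + beta_t) for x in covariates]
    risk_control = [expit(beta0 + beta_x * x) for x in covariates]
    # Average on the probability scale, then move to the log-odds scale
    p1 = sum(risk_treated) / len(covariates)
    p0 = sum(risk_control) / len(covariates)
    return logit(p1) - logit(p0)

# Hypothetical coefficients and target covariate sample (assumptions)
covariates = [-2.0, -1.0, 0.0, 1.0, 2.0]
conditional = 1.0  # treatment coefficient of the outcome model
marginal = marginal_log_odds_ratio(-0.5, 1.5, conditional, covariates)
print(conditional, marginal)
```

Because the odds ratio is non-collapsible, the standardized marginal log odds ratio in this example is smaller in magnitude than the conditional coefficient whenever the covariate is prognostic.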
Given the high incidence of cardio- and cerebrovascular diseases (CVD) and their association with morbidity and mortality, their prevention is a major public health issue. A high level of blood pressure is a well-known risk factor for these events, and an increasing number of studies suggest that blood pressure variability may also be an independent risk factor. However, these studies suffer from significant methodological weaknesses. In this work we propose a new location-scale joint model for the repeated measures of a marker and competing events. This joint model combines a mixed model including a subject-specific and time-dependent residual variance modeled through random effects, and cause-specific proportional intensity models for the competing events. The risk of events may depend simultaneously on the current value of the variance as well as on the current value and the current slope of the marker trajectory. The model is estimated by maximizing the likelihood function using the Marquardt-Levenberg algorithm. The estimation procedure is implemented in an R package and is validated through a simulation study. This model is applied to study the association between blood pressure variability and the risk of CVD and death from other causes. Using data from a large clinical trial on the secondary prevention of stroke, we find that the current individual variability of blood pressure is associated with the risk of CVD and death. Moreover, the comparison with a model without heterogeneous variance shows the importance of taking this variability into account for goodness-of-fit and for dynamic predictions.
Despite the success of deep-learning models in many tasks, there have been concerns about such models learning shortcuts, and their lack of robustness to irrelevant confounders. When it comes to models directly trained on human faces, a sensitive confounder is that of human identities. Many face-related tasks should ideally be identity-independent, and perform uniformly across different individuals (i.e., be fair). One way to obtain such robustness and performance uniformity is to enforce them during training, assuming identity-related information is available at scale. However, due to privacy concerns and the cost of collecting such information, this is often not the case, and most face datasets simply contain input images and their corresponding task-related labels. Thus, improving identity-related robustness without the need for such annotations is of great importance. Here, we explore using face-recognition embedding vectors, as proxies for identities, to enforce such robustness. We propose to use the structure in the face-recognition embedding space to implicitly emphasize rare samples within each class. We do so by weighting samples according to their conditional inverse density (CID) in the proxy embedding space. Our experiments suggest that such a simple sample-weighting scheme not only improves training robustness but often also improves overall performance as a result. We also show that employing such constraints during training results in models that are significantly less sensitive to different levels of bias in the dataset.
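The idea of weighting samples by their inverse density within each class can be sketched as follows. The abstract does not specify the density estimator, so this sketch assumes a Gaussian kernel density estimate with a hypothetical `bandwidth` parameter and toy two-dimensional vectors standing in for face-recognition embeddings; the function name and the per-class normalization are likewise assumptions.

```python
import math
from collections import defaultdict

def inverse_density_weights(embeddings, labels, bandwidth=1.0):
    """Weight each sample by the inverse of its kernel density within its own class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    weights = [0.0] * len(embeddings)
    for indices in by_class.values():
        raw = {}
        for i in indices:
            # Gaussian kernel density of sample i among samples of the same class
            density = sum(
                math.exp(-math.dist(embeddings[i], embeddings[j]) ** 2
                         / (2.0 * bandwidth ** 2))
                for j in indices
            ) / len(indices)
            raw[i] = 1.0 / density
        # Normalize so that the weights within each class average to 1
        mean_raw = sum(raw.values()) / len(indices)
        for i in indices:
            weights[i] = raw[i] / mean_raw
    return weights

# Toy proxy embeddings (assumption): three clustered samples and one rare sample
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0), (1.0, 1.0), (1.1, 1.0)]
labels = ["smile", "smile", "smile", "smile", "neutral", "neutral"]
weights = inverse_density_weights(embeddings, labels)
print(weights)
```

In this toy example the isolated "smile" sample receives the largest weight in its class, which is the intended emphasis on rare samples; such weights would then multiply the per-sample training loss.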
It is well known that the Euler method for approximating the solutions of a random ordinary differential equation $\mathrm{d}X_t/\mathrm{d}t = f(t, X_t, Y_t)$ driven by a stochastic process $\{Y_t\}_t$ with $\theta$-H\"older sample paths is estimated to be of strong order $\theta$ with respect to the time step, provided $f=f(t, x, y)$ is sufficiently regular and with suitable bounds. Here, it is proved that, in many typical cases, further conditions on the noise can be exploited so that the strong convergence is actually of order 1, regardless of the H\"older regularity of the sample paths. This applies for instance to additive or multiplicative It\^o process noises (such as Wiener, Ornstein-Uhlenbeck, and geometric Brownian motion processes); to point-process noises (such as Poisson point processes and Hawkes self-exciting processes, which even have jump-type discontinuities); and to transport-type processes with sample paths of bounded variation. The result is based on a novel approach, estimating the global error as an iterated integral over both large and small mesh scales, and switching the order of integration to move the critical regularity to the large scale. The work is complemented with numerical simulations illustrating the strong order 1 convergence in those cases, and with an example with fractional Brownian motion noise with Hurst parameter $0 < H < 1/2$ for which the order of convergence is $H + 1/2$, hence lower than the attained order 1 in the examples above, but still higher than the order $H$ of convergence expected from previous works.
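For concreteness, the Euler scheme for a random ordinary differential equation can be sketched in a few lines of Python. The scheme $X_{n+1} = X_n + \Delta t\, f(t_n, X_n, Y_{t_n})$ is standard; the test equation $\mathrm{d}X_t/\mathrm{d}t = -X_t + Y_t$ with Wiener noise, the step count, and the helper names `wiener_path` and `euler_rode` are assumptions chosen for illustration and are not taken from the paper.

```python
import math
import random

def wiener_path(n_steps, dt, rng):
    """Sample a Wiener process on a uniform grid, starting at 0."""
    path = [0.0]
    for _ in range(n_steps):
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path

def euler_rode(f, x0, t_end, noise):
    """Euler scheme X_{n+1} = X_n + dt * f(t_n, X_n, Y_{t_n}) on a uniform grid.

    `noise` holds the sample path Y at the grid points t_0, ..., t_N.
    """
    n_steps = len(noise) - 1
    dt = t_end / n_steps
    x = x0
    for n in range(n_steps):
        x += dt * f(n * dt, x, noise[n])
    return x

rng = random.Random(0)
n_steps, t_end = 1000, 1.0
noise = wiener_path(n_steps, t_end / n_steps, rng)
# Hypothetical test equation with additive noise: dX/dt = -X + Y_t
x_end = euler_rode(lambda t, x, y: -x + y, 1.0, t_end, noise)
print(x_end)
```

An empirical strong order could be estimated by repeating this over many sample paths and several step sizes, comparing against a fine-grid reference solution driven by the same path.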
We propose a new nonparametric modeling framework for causal inference when outcomes depend on how agents are linked in a social or economic network. Such network interference describes a large literature on treatment spillovers, social interactions, social learning, information diffusion, disease and financial contagion, social capital formation, and more. Our approach works by first characterizing how an agent is linked in the network using the configuration of other agents and connections nearby as measured by path distance. The impact of a policy or treatment assignment is then learned by pooling outcome data across similarly configured agents. We demonstrate the approach by proposing an asymptotically valid test for the hypothesis of policy irrelevance/no treatment effects and bounding the mean-squared error of a k-nearest-neighbor estimator for the average or distributional policy effect/treatment response.
Modern advancements in large-scale machine learning would be impossible without the paradigm of data-parallel distributed computing. Since distributed computing with large-scale models imparts excessive pressure on communication channels, significant recent research has been directed toward co-designing communication compression strategies and training algorithms with the goal of reducing communication costs. While pure data parallelism allows better data scaling, it suffers from poor model scaling properties. Indeed, compute nodes are severely limited by memory constraints, preventing further increases in model size. For this reason, the latest achievements in training giant neural network models also rely on some form of model parallelism. In this work, we take a closer theoretical look at Independent Subnetwork Training (IST), which is a recently proposed and highly effective technique for solving the aforementioned problems. We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication, and provide a precise analysis of its optimization performance on a quadratic model.
In the classic implementation of the LOBPCG method, orthogonalization and the Rayleigh-Ritz (R-R) procedure consume a non-negligible amount of CPU time, and this cost becomes especially high for large block sizes. In this paper, we propose an orthogonalization-free framework for implementing the LOBPCG method for self-consistent field (SCF) iterations in solving the Kohn-Sham equation. In this framework, orthogonalization is avoided in the calculations, which can decrease the computational complexity, and the R-R procedure is parallelized through OpenMP, which can further reduce computational time. During the numerical experiments, an effective preconditioning strategy is designed, which can accelerate the LOBPCG method remarkably. Consequently, the efficiency of the LOBPCG method can be significantly improved, and the SCF iteration can solve the Kohn-Sham equation efficiently. A series of numerical experiments is conducted to demonstrate the effectiveness of our implementation, in which significant improvements in computational time can be observed.
Clinical trials often involve the assessment of multiple endpoints to comprehensively evaluate the efficacy and safety of interventions. In this work, we consider a global nonparametric testing procedure based on multivariate ranks for the analysis of multiple endpoints in clinical trials. Unlike existing approaches that rely on pairwise comparisons for each individual endpoint, the proposed method directly incorporates the multivariate ranks of the observations. By considering the joint ranking of all endpoints, the proposed approach provides robustness against diverse data distributions and censoring mechanisms commonly encountered in clinical trials. Through extensive simulations, we demonstrate the superior performance of the multivariate rank-based approach in controlling type I error and achieving higher power compared to existing rank-based methods. The simulations illustrate the advantages of leveraging multivariate ranks and highlight the robustness of the approach in various settings. The proposed method offers an effective tool for the analysis of multiple endpoints in clinical trials, enhancing the reliability and efficiency of outcome evaluations.
While there exist several inferential methods for analyzing functional data in factorial designs, there is a lack of statistical tests that (i) are valid in general designs, (ii) hold under non-restrictive assumptions on the data-generating process and (iii) allow for coherent post-hoc analyses. In particular, most existing methods assume Gaussianity or equal covariance functions across groups (homoscedasticity) and are applicable only to specific study designs that do not allow for evaluation of interactions. Moreover, all available strategies are designed only for testing global hypotheses and do not directly allow a more in-depth analysis of multiple local hypotheses. To address the first two problems (i)-(ii), we propose flexible integral-type test statistics that are applicable in general factorial designs under minimal assumptions on the data-generating process. In particular, we postulate neither homoscedasticity nor Gaussianity. To approximate the statistics' null distribution, we adopt a resampling approach and validate it methodologically. Finally, we use our flexible testing framework to (iii) infer several local null hypotheses simultaneously. To allow for powerful data analysis, we thereby take the complex dependencies of the different local test statistics into account. In extensive simulations we confirm that the new methods are flexibly applicable. Two illustrative data analyses complete our study. The new testing procedures are implemented in the R package multiFANOVA, which will be available on CRAN soon.
This paper studies inference in two-stage randomized experiments under covariate-adaptive randomization. In the initial stage of this experimental design, clusters (e.g., households, schools, or graph partitions) are stratified and randomly assigned to control or treatment groups based on cluster-level covariates. Subsequently, an independent second-stage design is carried out, wherein units within each treated cluster are further stratified and randomly assigned to either control or treatment groups, based on individual-level covariates. Under the homogeneous partial interference assumption, I establish conditions under which the proposed difference-in-"average of averages" estimators are consistent and asymptotically normal for the corresponding average primary and spillover effects and develop consistent estimators of their asymptotic variances. Combining these results establishes the asymptotic validity of tests based on these estimators. My findings suggest that ignoring covariate information in the design stage can result in efficiency loss, and commonly used inference methods that ignore or improperly use covariate information can lead to either conservative or invalid inference. Finally, I apply these results to studying optimal use of covariate information under covariate-adaptive randomization in large samples, and demonstrate that a specific generalized matched-pair design achieves minimum asymptotic variance for each proposed estimator. The practical relevance of the theoretical results is illustrated through a simulation study and an empirical application.
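A difference-in-"average of averages" estimator can be sketched as follows. The abstract does not define the estimators, so this Python sketch assumes one natural reading: outcomes are first averaged within each cluster and then across clusters, with the primary effect comparing treated units in treated clusters to control clusters and the spillover effect comparing untreated units in treated clusters to control clusters. The data layout, the function names, and the toy numbers are hypothetical, and stratification is ignored.

```python
def mean(values):
    return sum(values) / len(values)

def average_of_averages(clusters, unit_arm=None):
    """Mean over clusters of the within-cluster mean outcome.

    Each cluster is a list of (outcome, unit_treated) pairs. `unit_arm`
    selects treated (True) or untreated (False) units, or all units if None.
    """
    cluster_means = []
    for units in clusters:
        outcomes = [y for y, treated in units
                    if unit_arm is None or treated == unit_arm]
        if outcomes:  # skip clusters with no unit in the requested arm
            cluster_means.append(mean(outcomes))
    return mean(cluster_means)

def primary_and_spillover(treated_clusters, control_clusters):
    """Differences between treated-cluster and control-cluster averages of averages."""
    baseline = average_of_averages(control_clusters)
    primary = average_of_averages(treated_clusters, unit_arm=True) - baseline
    spillover = average_of_averages(treated_clusters, unit_arm=False) - baseline
    return primary, spillover

# Toy data (assumption): two treated clusters of unequal size, two control clusters
treated_clusters = [
    [(3.0, True), (2.0, False)],
    [(5.0, True), (5.0, True), (2.0, False)],
]
control_clusters = [[(1.0, False), (1.0, False)], [(1.0, False)]]
primary, spillover = primary_and_spillover(treated_clusters, control_clusters)
print(primary, spillover)
```

Averaging within clusters first gives each cluster equal weight regardless of its size, which is why the result can differ from a pooled unit-level difference in means.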
Under some regularity conditions, maximum likelihood estimators (MLEs) are asymptotically optimal (in the sense of consistency, efficiency, sufficiency, and unbiasedness). In general, however, MLEs lead to non-robust statistical inference, for example, in pricing models and risk measures. Actuarial claim severity is continuous, right-skewed, and frequently heavy-tailed. The data sets to which such models are usually fitted contain outliers that are difficult to identify and separate from genuine data. Moreover, due to commonly used actuarial "loss control strategies" in the financial and insurance industries, the random variables we observe and wish to model are affected by truncation (due to deductibles), censoring (due to policy limits), scaling (due to coinsurance proportions) and other transformations. To alleviate the lack of robustness of MLE-based inference in risk modeling, in this paper we propose and develop a new method of estimation, the method of truncated moments (MTuM), and generalize it to different loss control mechanisms. Various asymptotic properties of these estimators are established by using central limit theory. New connections between different estimators are found. A comparative study of the newly designed methods with the corresponding MLEs is performed. A detailed investigation is carried out for the single-parameter Pareto loss model, including a simulation study.
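To make the idea of matching truncated moments concrete, the following Python sketch fits the tail index of a single-parameter Pareto model using only observations below an upper threshold. The abstract does not give the MTuM estimating equations, so the specific moment matched here (the conditional mean of $\log(X/x_0)$ given $X \le u$), the threshold `upper`, and the function names are assumptions. The sketch relies on the fact that $\log(X/x_0)$ is exponential with rate $\alpha$ when $X$ is Pareto with scale $x_0$ and tail index $\alpha$.

```python
import math

def truncated_exp_mean(alpha, width):
    """E[Y | Y <= width] for Y ~ Exponential(rate=alpha)."""
    tail = math.exp(-alpha * width)
    # -expm1(-a*w) equals 1 - exp(-a*w) but is accurate for small a*w
    return 1.0 / alpha - width * tail / (-math.expm1(-alpha * width))

def fit_pareto_truncated_moment(losses, x0, upper, lo=1e-6, hi=1e3, tol=1e-10):
    """Estimate the Pareto tail index alpha from losses in (x0, upper] only."""
    width = math.log(upper / x0)
    logs = [math.log(x / x0) for x in losses if x0 < x <= upper]
    target = sum(logs) / len(logs)  # sample truncated moment
    # The population truncated mean is decreasing in alpha, so bisect on alpha
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if truncated_exp_mean(mid, width) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative data (assumption): Pareto(alpha=2, x0=1) quantiles plus gross outliers
n, alpha_true = 2000, 2.0
losses = [(1.0 - (i + 0.5) / n) ** (-1.0 / alpha_true) for i in range(n)]
losses += [1e9] * 20
alpha_hat = fit_pareto_truncated_moment(losses, x0=1.0, upper=10.0)
alpha_mle = len(losses) / sum(math.log(x) for x in losses)
print(alpha_hat, alpha_mle)
```

Because observations above the threshold do not enter the sample moment, the added outliers leave the truncated-moment estimate essentially unchanged, whereas the untruncated MLE is pulled toward a heavier tail.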