Understanding whether and how treatment effects vary across subgroups is crucial for informing clinical practice and recommendations. Accordingly, the assessment of heterogeneous treatment effects (HTE) based on pre-specified potential effect modifiers has become a common goal in modern randomized trials. However, when one or more potential effect modifiers are missing, complete-case analysis may lead to bias and under-coverage. While statistical methods for handling missing data have been proposed and compared for individually randomized trials with missing effect modifier data, few guidelines exist for the cluster-randomized setting, where intracluster correlations in the effect modifiers, outcomes, or even missingness mechanisms may introduce further threats to accurate assessment of HTE. In this article, the performance of several missing data methods is compared through a simulation study of cluster-randomized trials with a continuous outcome and a missing binary effect modifier, and further illustrated using real data from the Work, Family, and Health Study. Our results suggest that multilevel multiple imputation (MMI) and Bayesian MMI perform better than the other available methods, and that Bayesian MMI has lower bias and coverage closer to nominal than standard MMI when there are model specification or compatibility issues.
Most existing neural network-based approaches for solving stochastic optimal control problems via the associated backward dynamic programming principle rely on the ability to simulate the underlying state variables. However, in some problems this simulation is infeasible, leading to a discretization of the state variable space and the need to train one neural network for each data point. This approach becomes computationally inefficient when the state variable space is large. In this paper, we consider a class of such stochastic optimal control problems and introduce an effective solution based on multitask neural networks. To train the multitask neural network, we introduce a novel scheme that dynamically balances learning across tasks. Through numerical experiments on real-world derivatives pricing problems, we show that our method outperforms state-of-the-art approaches.
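The paper's balancing scheme is not reproduced here; the sketch below shows a generic dynamic loss-weighting strategy for a shared-trunk multitask network. The architecture, the exponential-moving-average update, and all names such as MultiTaskNet are illustrative assumptions rather than the method described above.

import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Shared trunk with one small head per task (illustrative architecture)."""
    def __init__(self, in_dim, n_tasks, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_tasks)])

    def forward(self, x):
        h = self.trunk(x)
        return torch.cat([head(h) for head in self.heads], dim=1)  # (batch, n_tasks)

def train_step(model, opt, x, targets, weights, ema=0.9):
    """One step with dynamically balanced task weights.
    weights: tensor of shape (n_tasks,), initialized to torch.ones(n_tasks)."""
    preds = model(x)
    task_losses = ((preds - targets) ** 2).mean(dim=0)   # one MSE per task
    loss = (weights * task_losses).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        # re-balance: tasks with larger recent loss receive larger weight
        target_w = task_losses / task_losses.sum()
        new_w = ema * weights + (1 - ema) * len(weights) * target_w
        weights.copy_(len(weights) * new_w / new_w.sum())
    return loss.item()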
Home-based physical therapies are effective provided that the prescribed exercises are correctly executed and patients adhere to these routines. This is especially important for older adults, who can easily forget the therapists' guidelines. Inertial Measurement Units (IMUs) are commonly used to track exercise execution, providing information about patients' motion. In this work, we propose using Machine Learning techniques to recognize which exercise is being carried out and to assess whether the recognized exercise is properly executed, using data from four IMUs placed on the person's limbs. To the best of our knowledge, these two tasks have never been addressed together as a single complex task before; however, their combination is needed for a complete characterization of the performance of physical therapies. We evaluate six machine learning classifiers in three settings: recognition and evaluation with a single classifier; recognition of correctly executed exercises only, excluding the wrongly performed ones; and a two-stage approach that first recognizes the exercise and then evaluates it. We apply our proposal to a set of 8 upper- and lower-limb exercises designed to maintain the health status of older adults, monitoring the motion of volunteers with 4 IMUs. We obtain accuracies of 88.4 \% and 91.4 \% in the first two settings. In the third, recognition reaches an accuracy of 96.2 \%, whereas the exercise evaluation accuracy ranges between 93.6 \% and 100.0 \%. This work demonstrates the feasibility of IMUs for complete monitoring of physical therapies, providing information on which exercise is being performed and how well it is executed, as a basis for designing virtual coaches.
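As an illustration of the third, two-stage setting (the feature representation, the choice of random forests, and the function names are assumptions, not the paper's pipeline): a first classifier recognizes which exercise is being performed from windowed IMU features, and a per-exercise classifier then judges whether it was executed correctly.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_two_stage(X, y_exercise, y_correct):
    """X: (n_windows, n_features) array of windowed IMU features;
    y_exercise: exercise label per window; y_correct: 1 if correctly executed."""
    recognizer = RandomForestClassifier(n_estimators=200).fit(X, y_exercise)
    evaluators = {e: RandomForestClassifier(n_estimators=200)
                     .fit(X[y_exercise == e], y_correct[y_exercise == e])
                  for e in np.unique(y_exercise)}
    return recognizer, evaluators

def predict_two_stage(recognizer, evaluators, X):
    exercise = recognizer.predict(X)                              # stage 1: which exercise
    correct = np.array([evaluators[e].predict(x[None, :])[0]      # stage 2: executed properly?
                        for e, x in zip(exercise, X)])
    return exercise, correct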
Non-significant randomized controlled trials can hide subgroups of good responders to experimental drugs, thus hindering subsequent development. Identifying such heterogeneous treatment effects is key for precision medicine, and many post-hoc analysis methods have been developed for that purpose. While several benchmarks have been carried out to identify the strengths and weaknesses of these methods, notably for binary and continuous endpoints, a similar systematic empirical evaluation of subgroup analysis for time-to-event endpoints is lacking. This work aims to fill this gap by evaluating several subgroup analysis algorithms in the context of time-to-event outcomes, through three research questions: Is there heterogeneity? Which biomarkers are responsible for such heterogeneity? Who are the good responders to treatment? In this context, we propose a new synthetic and semi-synthetic data generation process that allows one to explore a wide range of heterogeneity scenarios with precise control of the level of heterogeneity. We provide an open-source Python package, available on GitHub, containing our generation process and our comprehensive benchmark framework. We hope this package will be useful to the research community for future investigations of heterogeneity of treatment effects and for benchmarking subgroup analysis methods.
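A minimal sketch in the spirit of such a generator, not the package's actual data generation process (the hazard model, parameter names, and default values are assumptions): exponential event times with a treatment-by-biomarker interaction whose coefficient controls the level of heterogeneity, plus independent exponential censoring.

import numpy as np

def simulate_tte(n=1000, base_hazard=0.1, beta_trt=-0.3, beta_het=-1.0,
                 censor_rate=0.1, seed=0):
    """Exponential event times with a treatment-by-biomarker interaction;
    |beta_het| controls the amount of treatment-effect heterogeneity."""
    rng = np.random.default_rng(seed)
    biomarker = rng.binomial(1, 0.5, n)          # candidate binary effect modifier
    treatment = rng.binomial(1, 0.5, n)          # 1:1 randomization
    log_hazard = np.log(base_hazard) + beta_trt * treatment + beta_het * treatment * biomarker
    event_time = rng.exponential(1.0 / np.exp(log_hazard))
    censor_time = rng.exponential(1.0 / censor_rate, n)
    return dict(time=np.minimum(event_time, censor_time),
                event=(event_time <= censor_time).astype(int),
                treatment=treatment, biomarker=biomarker)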
Cluster randomized trials commonly employ multiple endpoints. When a single summary of treatment effects across endpoints is of primary interest, global hypothesis testing and effect estimation methods represent a common analysis strategy. However, specifying the joint distribution required by these methods is non-trivial, particularly when endpoint properties differ. We develop rank-based interval estimators for a global treatment effect referred to as the "global win probability," that is, the probability that a treatment individual responds better than a control individual on average. Using endpoint-specific ranks among the combined sample and within each arm, each individual-level observation is converted to a "win fraction," which quantifies the proportion of wins experienced over every observation in the comparison arm. An individual's multiple observations are then replaced by a single "global win fraction," constructed by averaging win fractions across endpoints. A linear mixed model is applied directly to the global win fractions to obtain point, variance, and interval estimates of the global win probability adjusted for clustering. Simulations show that our approach performs well with respect to coverage and type I error, and the methods are easily implemented in standard software. A case study using publicly available data is provided with corresponding R and SAS code.
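As an illustration of the construction described above (written from first principles and not taken from the accompanying R/SAS code; a higher outcome is assumed to be better and ties count as half a win), the win fractions and global win fractions can be computed as follows; the clustering adjustment is then obtained by fitting a linear mixed model with a cluster random effect to these quantities.

import numpy as np

def win_fractions(y_arm, y_comparison):
    """For each observation in y_arm, the proportion of comparison-arm
    observations it beats on that endpoint (ties count as half a win)."""
    y_arm = np.asarray(y_arm, float)[:, None]
    y_cmp = np.asarray(y_comparison, float)[None, :]
    return ((y_arm > y_cmp) + 0.5 * (y_arm == y_cmp)).mean(axis=1)

def global_win_fractions(Y_treatment, Y_control):
    """Y_treatment, Y_control: (n_individuals, n_endpoints) outcome matrices.
    Returns one global win fraction per individual in each arm."""
    K = Y_treatment.shape[1]
    g_trt = np.column_stack([win_fractions(Y_treatment[:, k], Y_control[:, k])
                             for k in range(K)]).mean(axis=1)
    g_ctl = np.column_stack([win_fractions(Y_control[:, k], Y_treatment[:, k])
                             for k in range(K)]).mean(axis=1)
    return g_trt, g_ctl
# A linear mixed model with a cluster random effect is then fit to these
# global win fractions to estimate the global win probability.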
Optimization under uncertainty is important in many applications, particularly for informing policy and decision making in areas such as public health. A key source of uncertainty arises from the incorporation of environmental variables as inputs to computational models or simulators. Such variables represent uncontrollable features of the optimization problem, and reliable decision making must account for the uncertainty they propagate to the simulator outputs. Often, multiple competing objectives are defined from these outputs, so that the final optimal decision is a compromise between different goals. Here, we present emulation-based optimization methodology for such problems that extends expected quantile improvement (EQI) to multi-objective optimization. Focusing on the practically important case of two objectives, we use a sequential design strategy to identify the Pareto front of optimal solutions. Uncertainty from the environmental variables is integrated out using Monte Carlo samples from the simulator. Interrogation of the expected simulator output is facilitated by the use of Gaussian process emulators. The methodology is demonstrated on an optimization problem from public health involving the dispersion of anthrax spores across a spatial terrain. The environmental variables include meteorological features that affect the dispersion, and the methodology identifies the Pareto front even under considerable input uncertainty.
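The full EQI-based acquisition is not reproduced here; the sketch below shows only the basic building block it rests on, namely Gaussian process emulation of each objective followed by Monte Carlo integration over the environmental variables (the design-matrix layout, kernel choice, and function names are assumptions).

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def fit_emulators(X_design, Y_objectives):
    """X_design: (n_runs, d_decision + d_environment) simulator inputs;
    Y_objectives: (n_runs, 2) simulator outputs, one column per objective."""
    return [GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
            .fit(X_design, Y_objectives[:, j])
            for j in range(Y_objectives.shape[1])]

def expected_objectives(emulators, x_decision, env_samples):
    """Average each emulated objective over Monte Carlo samples of the
    environmental variables, for one candidate decision x_decision."""
    X = np.hstack([np.tile(x_decision, (len(env_samples), 1)), env_samples])
    return np.array([gp.predict(X).mean() for gp in emulators])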
In this work, some advances in the theory of curvature of two-dimensional probability manifolds corresponding to families of distributions are presented. It is proved that location-scale distributions are hyperbolic in the Information Geometry sense even when the generatrix is non-even or non-smooth. A novel formula is obtained for the computation of curvature in the case of exponential families; this formula implies some new flatness criteria in dimension 2. Finally, it is observed that many two-parameter distributions widely used in applications are locally hyperbolic, which highlights the role of hyperbolic geometry in the study of commonly employed probability manifolds. These results have benefited from the use of explainable computational tools, which can substantially boost scientific productivity.
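A standard illustration of the hyperbolicity phenomenon, stated here for the Gaussian case only (a classical fact, not a result specific to this work): the Fisher information metric of the family $N(\mu,\sigma^2)$ on the upper half-plane $\{(\mu,\sigma):\sigma>0\}$ is
\[
ds^2 \;=\; \frac{d\mu^2 + 2\,d\sigma^2}{\sigma^2},
\]
which, after the rescaling $\mu=\sqrt{2}\,\tilde\mu$, becomes $2\,(d\tilde\mu^2+d\sigma^2)/\sigma^2$, i.e., a rescaled Poincaré half-plane metric of constant curvature $-1/2$.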
We study the complexity (that is, the weight of the multiplication table) of the elliptic normal bases introduced by Couveignes and Lercier. We give an upper bound on the complexity of these elliptic normal bases, and we analyze the weight of some special vectors related to the multiplication table of those bases. This analysis leads us to some perspectives on the search for low complexity normal bases from elliptic periods.
Methods for estimating heterogeneous treatment effects (HTE) from observational data have largely focused on continuous or binary outcomes, with less attention paid to survival outcomes and almost none to settings with competing risks. In this work, we develop censoring unbiased transformations (CUTs) for survival outcomes both with and without competing risks. After converting time-to-event outcomes using these CUTs, direct application of HTE learners for continuous outcomes yields consistent estimates of heterogeneous cumulative incidence effects, total effects, and separable direct effects. Our CUTs enable the application of a much larger set of state-of-the-art HTE learners to censored outcomes than had previously been available, especially in competing risks settings. We provide generic, model-free, learner-specific oracle inequalities bounding the finite-sample excess risk. The oracle efficiency results depend on the oracle selector and on the estimated nuisance functions from all steps involved in the transformation. We demonstrate the empirical performance of the proposed methods in simulation studies.
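As a concrete illustration of the general recipe, the sketch below implements one classical CUT, the inverse-probability-of-censoring-weighted (IPCW) transformation under marginal independent censoring; it is not one of the specific transformations developed in this work, and the function names are ours. The transformed outcomes can then be fed to any continuous-outcome HTE learner.

import numpy as np

def censoring_survival(time, event):
    """Kaplan-Meier estimate of the censoring survival function G(t) = P(C > t),
    obtained by treating censoring (event == 0) as the event of interest."""
    order = np.argsort(time)
    t, cens = time[order], 1 - event[order]
    at_risk = len(t) - np.arange(len(t))
    surv = np.cumprod(1.0 - cens / at_risk)
    steps = np.concatenate(([1.0], surv))
    return lambda s: steps[np.searchsorted(t, s, side="left")]   # left limit G(s-)

def ipcw_transform(time, event, f=np.log):
    """Y* = 1{event} * f(T) / G_hat(T-); E[Y* | X] = E[f(T) | X] under
    (marginal) independent censoring. Feed Y* to a continuous-outcome learner."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    G = censoring_survival(time, event)
    g = np.clip(G(time), 1e-3, None)             # guard against tiny weights
    return np.where(event == 1, f(time) / g, 0.0)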
Comparisons of frequency distributions often invoke the concept of shift to describe directional changes in properties such as the mean. In the present study, we sought to define shift as a property in and of itself. Specifically, we define distributional shift (DS) as the concentration of frequencies away from the discrete class having the greatest value (e.g., the right-most bin of a histogram). We derive a measure of DS using the normalized sum of exponentiated cumulative frequencies. We then define relative distributional shift (RDS) as the difference in DS between two distributions, revealing the magnitude and direction by which one distribution is concentrated to lesser or greater discrete classes relative to another. We find that RDS is highly related to popular measures that, while based on the comparison of frequency distributions, do not explicitly consider shift. While RDS provides a useful complement to other comparative measures, DS allows shift to be quantified as a property of individual distributions, similar in concept to a statistical moment.
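The sketch below encodes one literal reading of the definition above (relative frequencies over ordered classes, cumulative sums, exponentiation, and a rescaling to [0, 1]); the exponent and the exact normalization used by the authors may differ, so treat this as illustrative only.

import numpy as np

def distributional_shift(counts, exponent=2.0):
    """DS of a frequency distribution over ordered classes: 0 when all mass sits
    in the greatest class, 1 when all mass sits in the least class."""
    p = np.asarray(counts, float) / np.sum(counts)
    cum = np.cumsum(p)                        # cumulative relative frequencies
    k = len(p)
    return (np.sum(cum ** exponent) - 1.0) / (k - 1.0)

def relative_distributional_shift(counts_a, counts_b, exponent=2.0):
    """RDS: how much more distribution A is shifted toward lesser classes than B."""
    return distributional_shift(counts_a, exponent) - distributional_shift(counts_b, exponent)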
Graph representation learning for hypergraphs can be used to extract patterns among higher-order interactions that are critically important in many real-world problems. Current approaches designed for hypergraphs, however, are unable to handle different types of hypergraphs and are typically not generic across learning tasks. Indeed, models that can predict variable-sized heterogeneous hyperedges have not been available. Here we develop a new self-attention based graph neural network called Hyper-SAGNN applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We perform extensive evaluations on multiple datasets, including four benchmark network datasets and two single-cell Hi-C datasets in genomics. We demonstrate that Hyper-SAGNN significantly outperforms state-of-the-art methods on traditional tasks while also achieving strong performance on a new task called outsider identification. Hyper-SAGNN will be useful for graph representation learning to uncover complex higher-order interactions in different applications.
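The sketch below is a simplified rendering of the kind of architecture described above, not the reference implementation: the layer sizes, the use of a single attention layer, and the discrepancy-based scoring head are assumptions made for illustration. Each candidate hyperedge of variable size receives a per-node "static" embedding and an attention-based "dynamic" embedding, and the hyperedge score is read off their discrepancy.

import torch
import torch.nn as nn

class HyperedgeScorer(nn.Module):
    """Scores a variable-sized candidate hyperedge from its node features."""
    def __init__(self, in_dim, dim=64, heads=4):
        super().__init__()
        self.static = nn.Sequential(nn.Linear(in_dim, dim), nn.Tanh())    # per-node static embedding
        self.proj = nn.Linear(in_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # dynamic embedding
        self.head = nn.Linear(dim, 1)

    def forward(self, node_feats):
        # node_feats: (edge_size, in_dim) features of the nodes in one candidate hyperedge
        x = node_feats.unsqueeze(0)              # treat the tuple as a length-k sequence
        s = self.static(x)                       # static embeddings, independent of the tuple
        h = self.proj(x)
        d, _ = self.attn(h, h, h)                # dynamic, tuple-aware embeddings via self-attention
        per_node = torch.sigmoid(self.head((d - s) ** 2))
        return per_node.mean()                   # probability that the tuple forms a hyperedge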