In cluster-randomized trials (CRTs), missing data can arise in several ways, including missing outcomes and baseline covariates at the individual or cluster level, or entirely missing information for non-participants. Among these, missing outcomes have attracted the most attention. However, no existing method comprehensively addresses all of the aforementioned types of missing data simultaneously, owing to their complexity. This methodological gap can lead to confusion and potential pitfalls in the analysis of CRTs. In this article, we propose a doubly robust estimator for a variety of estimands that simultaneously handles missing outcomes under a missing-at-random assumption, missing covariates via the missing-indicator method (with no constraint on the missing-covariate distributions), and missing cluster-population sizes via a uniform sampling framework. Furthermore, we provide three approaches to improve precision: choosing optimal weights for the intracluster correlation, leveraging machine learning, and modeling the propensity score for treatment assignment. To evaluate the impact of violations of the missing-data assumptions, we additionally propose a sensitivity analysis that assesses when missing data would alter the conclusions of treatment effect estimation. Simulation studies and data applications both show that the proposed method is valid and outperforms existing methods.
In many temporally ordered data sets, the parameters of the underlying distribution are observed to change abruptly at unknown times. Detecting such changepoints is important for many applications. While this problem has been studied extensively for linear (real-valued) data, comparatively little work has been done for angular data. In this article, we use the intrinsic geometry of the torus to introduce the notion of the `square of an angle' and use it to propose a new measure of variation of an angular random variable, called the `curved variance'. Building on these ideas, we propose new tests for the existence of changepoint(s) in the concentration, the mean direction, or both. The limiting distributions of the test statistics are derived, and their powers are obtained through extensive simulation. The tests are seen to have better power than the corresponding existing tests. The proposed methods are applied to three real-life data sets, revealing interesting insights. In particular, when used to detect simultaneous changes in mean direction and concentration in hourly wind-direction measurements from the cyclonic storm `Amphan', our method identified changepoints that could be associated with important meteorological events.
Weak-supervision searches in principle combine two advantages: they can be trained on experimental data and they can learn distinctive signal properties. However, their practical applicability is limited by the fact that successfully training a neural network via weak supervision can require a large amount of signal. In this work, we seek to create neural networks that can learn from less experimental signal by using transfer and meta-learning. The general idea is to first train a neural network on simulations, so that it either learns reusable concepts or becomes a more efficient learner. The network is then trained on experimental data and should require less signal because of this previous training. We find that transfer and meta-learning can substantially improve the performance of weak-supervision searches.
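As a minimal sketch of the pretrain-then-fine-tune idea described above, assuming a toy fully connected network, randomly generated stand-in data, and a simple supervised loss in both stages; the architecture, data, and training details are illustrative assumptions, not the setup used in the paper.

```python
import torch
import torch.nn as nn

def make_classifier(n_features=4):
    # Small feed-forward classifier; the architecture is an arbitrary illustrative choice.
    return nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                         nn.Linear(64, 64), nn.ReLU(),
                         nn.Linear(64, 1))

def train(model, x, y, epochs=20, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x).squeeze(-1), y)
        loss.backward()
        opt.step()
    return model

# Step 1: pretrain on labeled simulation (signal vs. background); random stand-in data here.
sim_x, sim_y = torch.randn(10000, 4), torch.randint(0, 2, (10000,)).float()
model = train(make_classifier(), sim_x, sim_y)

# Step 2: continue training on experimental data with weak (noisy, sample-level) labels;
# the pretrained weights are reused, which is what should reduce the signal needed.
exp_x, exp_weak_y = torch.randn(2000, 4), torch.randint(0, 2, (2000,)).float()
model = train(model, exp_x, exp_weak_y, epochs=5, lr=1e-4)
```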
The comparison of frequency distributions is a common statistical task with broad applications. However, existing measures do not explicitly quantify the magnitude and direction by which one distribution is shifted relative to another. In the present study, we define distributional shift (DS) as the concentration of frequencies towards the lowest discrete class, e.g., the left-most bin of a histogram. We measure DS via the sum of cumulative frequencies and define relative distributional shift (RDS) as the difference in DS between two distributions. Using simulated random sampling, we show that RDS is strongly related to measures that are widely used to compare frequency distributions. Focusing on specific applications, we show that DS and RDS provide insights into healthcare billing distributions, ecological species-abundance distributions, and economic distributions of wealth. RDS has the unique advantage of being a signed (i.e., directional) measure based on a simple difference in an intuitive property that, in turn, serves as a measure of rarity, poverty, and scarcity.
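To make the definitions concrete, a minimal sketch of DS and RDS is given below, assuming DS is computed on relative frequencies so that distributions with different totals are comparable; that normalization is an assumption made here for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def distributional_shift(freqs):
    """DS: sum of cumulative frequencies, computed here on relative frequencies
    (an assumption) so that distributions with different totals are comparable."""
    rel = np.asarray(freqs, dtype=float)
    rel = rel / rel.sum()
    return np.cumsum(rel).sum()

def relative_distributional_shift(freqs_a, freqs_b):
    """RDS: signed difference in DS between two distributions over the same classes."""
    return distributional_shift(freqs_a) - distributional_shift(freqs_b)

# Two histograms over the same five classes; the first is concentrated toward the lowest class.
a = [50, 25, 15, 7, 3]
b = [10, 15, 20, 25, 30]
print(distributional_shift(a), distributional_shift(b), relative_distributional_shift(a, b))
```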
Safe and reliable disclosure of information from confidential data is a challenging statistical problem. A common approach considers the generation of synthetic data, to be disclosed instead of the original data. Efficient approaches ought to deal with the trade-off between reliability and confidentiality of the released data. Ultimately, the aim is to reproduce as accurately as possible statistical analyses of the original data using the synthetic data. Bayesian networks are a model-based approach that can be used to parsimoniously estimate the underlying distribution of the original data and generate synthetic datasets. These ought not only to approximate the results of analyses of the original data but also to robustly quantify the uncertainty involved in the approximation. This paper proposes a fully Bayesian approach to generating and analyzing synthetic data based on the posterior predictive distribution of statistics of the synthetic data, allowing for efficient uncertainty quantification. The methodology uses probabilistic properties of the model to devise a computationally efficient algorithm for obtaining the target predictive distributions via Monte Carlo. Model parsimony is handled by proposing a general class of penalizing priors for Bayesian network models. Finally, the efficiency and applicability of the proposed methodology are empirically investigated through simulated and real examples.
The deconfounder was proposed as a method for estimating causal parameters in settings with multiple causes and unobserved confounding. It is based on recovering a latent variable from the observed causes. We disentangle the causal interpretation from the statistical estimation problem and show that the deconfounder in general estimates adjusted regression target parameters. It does so via outcome regression adjusted for the recovered latent variable, termed the substitute. We refer to the general algorithm, stripped of causal assumptions, as substitute adjustment. We give theoretical results showing that substitute adjustment estimates adjusted regression parameters when the regressors are conditionally independent given the latent variable. We also introduce a variant of the substitute adjustment algorithm that estimates an assumption-lean target parameter with minimal model assumptions. We then give finite-sample bounds and asymptotic results supporting substitute adjustment estimation when the latent variable takes values in a finite set. A simulation study illustrates the finite-sample properties of substitute adjustment. Our results support that, when the latent variable model of the regressors holds, substitute adjustment is a viable method for adjusted regression.
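A minimal sketch of the two-step substitute adjustment idea (recover a substitute for the latent variable from the observed causes, then run an outcome regression adjusted for it), assuming simulated Gaussian data and a one-dimensional factor-analysis substitute; both choices are illustrative assumptions rather than the paper's prescribed procedure.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Simulated data: a latent variable Z drives p observed regressors X and the outcome Y.
n, p = 2000, 10
Z = rng.normal(size=(n, 1))
X = Z @ rng.normal(size=(1, p)) + rng.normal(scale=0.5, size=(n, p))
beta = np.zeros(p)
beta[0] = 1.0                                        # only X[:, 0] has a direct effect on Y
Y = X @ beta + 2.0 * Z[:, 0] + rng.normal(size=n)

# Step 1: recover a substitute for the latent variable from the observed regressors.
substitute = FactorAnalysis(n_components=1, random_state=0).fit_transform(X)

# Step 2: outcome regression on the regressor of interest, adjusted for the substitute.
design = np.column_stack([X[:, 0], substitute])
fit = LinearRegression().fit(design, Y)
print("adjusted coefficient for X_0:", fit.coef_[0])
```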
This paper presents a method for assessing the thematic agreement of geospatial data products with different semantics and spatial granularities, which may be affected by spatial offsets between test and reference data. The proposed method uses a multi-scale framework that allows a probabilistic evaluation of whether thematic disagreement between datasets is induced by spatial offsets arising from the differing nature of the datasets. We test our method using real-estate-derived settlement locations and remote-sensing-derived building footprint data.
Selecting an evaluation metric is fundamental to model development, but uncertainty remains about when certain metrics are preferable and why. This paper introduces the concept of resolving power to describe the ability of an evaluation metric to distinguish between binary classifiers of similar quality. This ability depends on two attributes: (1) the metric's response to improvements in classifier quality (its signal), and (2) the metric's sampling variability (its noise). The paper defines resolving power generically as a metric's sampling uncertainty scaled by its signal. The primary application of resolving power is to assess threshold-free evaluation metrics, such as the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). A simulation study compares the AUROC and the AUPRC in a variety of contexts. It finds that the AUROC generally has greater resolving power, but that the AUPRC is better when searching among high-quality classifiers applied to low-prevalence outcomes. The paper concludes by proposing an empirical method to estimate resolving power that can be applied to any dataset and any initial classification model.
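To illustrate the signal/noise decomposition, the sketch below contrasts two simulated classifiers of similar quality on a low-prevalence outcome, treats the AUROC gain as the signal and a bootstrap estimate of AUROC variability as the noise, and reports their ratio; the simulation design and the exact scaling are assumptions made for illustration, not the paper's estimator.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def bootstrap_sd(y, scores, n_boot=500):
    """Sampling variability (noise) of the AUROC, estimated by the bootstrap."""
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y[idx])) < 2:      # need both classes in the resample
            continue
        stats.append(roc_auc_score(y[idx], scores[idx]))
    return np.std(stats)

# Simulate two classifiers of similar quality on a low-prevalence outcome.
n, prevalence = 5000, 0.05
y = rng.binomial(1, prevalence, n)
latent = rng.normal(size=n)
scores_a = latent + 1.0 * y + rng.normal(size=n)    # baseline classifier
scores_b = latent + 1.1 * y + rng.normal(size=n)    # slightly better classifier

signal = roc_auc_score(y, scores_b) - roc_auc_score(y, scores_a)  # response to the quality gain
noise = bootstrap_sd(y, scores_a)                                 # sampling uncertainty
print("signal-to-noise ratio:", signal / noise)
```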
LLMs have become increasingly capable of accomplishing a range of specialized tasks and can be used to expand equitable access to medical knowledge. Most medical LLMs have involved extensive fine-tuning, leveraging specialized medical data and significant, and thus costly, amounts of computational power. Many of the top-performing LLMs are proprietary, and access to them is limited to very few research groups. However, open-source (OS) models represent a key area of growth for medical LLMs due to significant improvements in performance and an inherent ability to provide the transparency and compliance required in healthcare. We present OpenMedLM, a prompting platform that delivers state-of-the-art (SOTA) performance for OS LLMs on medical benchmarks. We evaluated a range of OS foundation LLMs (7B-70B) on four medical benchmarks (MedQA, MedMCQA, PubMedQA, and the MMLU medical subset). We employed a series of prompting strategies, including zero-shot, few-shot, chain-of-thought (with random and kNN example selection), and ensemble/self-consistency voting. We found that OpenMedLM delivers OS SOTA results on three common medical LLM benchmarks, surpassing the previous best-performing OS models, which relied on computationally costly extensive fine-tuning. The model achieves 72.6% accuracy on the MedQA benchmark, outperforming the previous SOTA by 2.4%, and 81.7% accuracy on the MMLU medical subset, making it the first OS LLM to surpass 80% accuracy on this benchmark. Our results highlight medical-specific emergent properties in OS LLMs that have not previously been documented, and showcase the benefits of prompt engineering for improving the performance of accessible LLMs in medical applications.
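As an illustration of the ensemble/self-consistency voting strategy mentioned above, a minimal sketch that aggregates the final answers extracted from several sampled completions by majority vote; the function name and answer format are hypothetical, and generating and parsing the completions themselves is outside the scope of this sketch.

```python
from collections import Counter

def self_consistency_vote(sampled_answers):
    """Aggregate multiple sampled completions by majority vote over the final answer.
    `sampled_answers` is a list of answer letters already extracted from each completion."""
    counts = Counter(sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# e.g., five chain-of-thought samples for one MedQA-style multiple-choice question
print(self_consistency_vote(["B", "B", "C", "B", "D"]))  # -> "B"
```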
In large-scale, data-driven applications, parameters are often known only approximately due to noise and limited data samples. In this paper, we focus on high-dimensional optimization problems with linear constraints under uncertain conditions. To find high-quality solutions for which the violation of the true constraints is limited, we develop a linear shrinkage method that blends random matrix theory and robust optimization principles. It aims to minimize the Frobenius distance between the estimated and the true parameter matrix, especially when the numbers of constraints and variables are large and comparable. In simulations, this data-driven method shows superior noise resilience and more stable performance, both in obtaining high-quality solutions and in adhering to the true constraints, compared with traditional robust optimization. Our findings highlight the effectiveness of the method in improving the robustness and reliability of optimization in high-dimensional, data-driven settings.
Artificial neural networks thrive at solving the classification problem for a particular rigid task, acquiring knowledge through generalized learning behaviour from a distinct training phase. The resulting network resembles a static entity of knowledge, and attempts to extend this knowledge without targeting the original task result in catastrophic forgetting. Continual learning shifts this paradigm towards networks that can continually accumulate knowledge over different tasks without the need to retrain from scratch. We focus on task-incremental classification, where tasks arrive sequentially and are delineated by clear boundaries. Our main contributions concern (1) a taxonomy and extensive overview of the state of the art, (2) a novel framework to continually determine the stability-plasticity trade-off of the continual learner, and (3) a comprehensive experimental comparison of 11 state-of-the-art continual learning methods and 4 baselines. We empirically scrutinize method strengths and weaknesses on three benchmarks: Tiny ImageNet, the large-scale unbalanced iNaturalist dataset, and a sequence of recognition datasets. We study the influence of model capacity, weight decay and dropout regularization, and the order in which tasks are presented, and we qualitatively compare methods in terms of required memory, computation time, and storage.