Randomized controlled trials (RCTs) are increasingly prevalent in education research, and are often regarded as a gold standard of causal inference. Two main virtues of randomized experiments are that they (1) do not suffer from confounding, thereby allowing for an unbiased estimate of an intervention's causal impact, and (2) allow for design-based inference, meaning that the physical act of randomization largely justifies the statistical assumptions made. However, RCT sample sizes are often small, leading to low precision; in many cases RCT estimates may be too imprecise to guide policy or inform science. Observational studies, by contrast, have strengths and weaknesses complementary to those of RCTs. Observational studies typically offer much larger sample sizes, but may suffer from confounding. In many contexts, experimental and observational data exist side by side, allowing the possibility of integrating "big observational data" with "small but high-quality experimental data" to get the best of both. Such approaches hold particular promise in the field of education, where RCT sample sizes are often small due to cost constraints, but automatic collection of observational data, such as in computerized educational technology applications, or in state longitudinal data systems (SLDS) with administrative data on hundreds of thousands of students, has made rich, high-dimensional observational data widely available. We outline an approach that allows one to employ machine learning algorithms to learn from the observational data, and use the resulting models to improve precision in randomized experiments. Importantly, there is no requirement that the machine learning models are "correct" in any sense, and the final experimental results are guaranteed to be exactly unbiased. Thus, there is no danger of confounding biases in the observational data leaking into the experiment.
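As a rough illustration of this idea (a sketch with hypothetical data and model choices, not the authors' exact estimator), the snippet below trains an arbitrary machine learning model on observational data only and uses its predictions to residualize the outcomes of a small simulated RCT. Because the predictions are fixed with respect to the randomization, the difference-in-means of residuals remains unbiased even if the model is poor.

```python
# Illustrative sketch (not the authors' exact estimator): covariate adjustment in an RCT
# using predictions from a model trained only on auxiliary observational data. The data,
# model, and effect size below are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# hypothetical data: a large observational set and a small RCT
n_obs, n_rct, p = 5000, 200, 10
X_obs = rng.normal(size=(n_obs, p))
y_obs = X_obs[:, 0] + 0.5 * X_obs[:, 1] ** 2 + rng.normal(size=n_obs)

X_rct = rng.normal(size=(n_rct, p))
z = rng.binomial(1, 0.5, size=n_rct)            # random assignment
tau = 0.3                                        # true treatment effect
y_rct = X_rct[:, 0] + 0.5 * X_rct[:, 1] ** 2 + tau * z + rng.normal(size=n_rct)

# train any ML model on the observational data only
model = GradientBoostingRegressor().fit(X_obs, y_obs)
y_hat = model.predict(X_rct)                     # fixed with respect to randomization

# residualized difference-in-means: unbiased, usually more precise
resid = y_rct - y_hat
tau_hat = resid[z == 1].mean() - resid[z == 0].mean()
se = np.sqrt(resid[z == 1].var(ddof=1) / (z == 1).sum()
             + resid[z == 0].var(ddof=1) / (z == 0).sum())
print(f"adjusted estimate: {tau_hat:.3f} (SE {se:.3f})")
```

Any predictor could be substituted for the gradient boosting model here; a better model only improves precision, never unbiasedness.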
ICESat-2, launched in 2018, carries the ATLAS instrument, a photon-counting spaceborne lidar that provides strip samples over the terrain. Although ICESat-2 was primarily designed for snow and ice monitoring, there has been great interest in using it to predict forest above-ground biomass density (AGBD). As ICESat-2 is in a polar orbit, it provides good spatial coverage of boreal forests. The aim of this study is to evaluate the estimation of mean AGBD from ICESat-2 data using a hierarchical modeling approach combined with rigorous statistical inference. We propose a hierarchical hybrid inference approach for uncertainty quantification of the AGBD estimated from ICESat-2 lidar strips. Our approach models the errors arising from the multiple modeling steps, including the allometric models used for predicting tree-level AGB. To test the procedure, we used data from two adjacent study sites, denoted Valtimo and Nurmes, of which the Valtimo site was used for model training and Nurmes for validation. The ICESat-2-estimated mean AGBD in the Nurmes validation area was 63.2$\pm$1.9 Mg/ha (relative standard error of 2.9%). The local reference hierarchical model-based estimate obtained from wall-to-wall airborne lidar data was 63.9$\pm$0.6 Mg/ha (relative standard error of 1.0%). The reference estimate was within the 95% confidence interval of the ICESat-2 hierarchical hybrid estimate. The small standard errors indicate that the proposed method is useful for AGBD assessment. However, some sources of error were not accounted for in the study, so the true uncertainties are probably slightly larger than those reported.
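The following is a minimal hybrid-inference sketch with hypothetical data and a single regression step; the hierarchical treatment of allometric and other model errors described above is omitted. It shows how a population mean and a model-based standard error can be obtained from lidar-predicted AGBD.

```python
# Minimal hybrid-inference sketch (a simplification of the hierarchical approach above):
# one regression model links plot AGBD to lidar metrics; the population mean is estimated
# from model predictions and its variance from the model-parameter covariance.
import numpy as np

rng = np.random.default_rng(1)

# hypothetical training plots: AGBD (Mg/ha) against two lidar height metrics
n_train, n_pop = 80, 10_000
Z_train = np.column_stack([np.ones(n_train), rng.uniform(5, 30, n_train),
                           rng.uniform(0.2, 0.9, n_train)])
beta_true = np.array([-10.0, 2.5, 20.0])
y_train = Z_train @ beta_true + rng.normal(0, 15, n_train)

# OLS fit and its parameter covariance
beta_hat, *_ = np.linalg.lstsq(Z_train, y_train, rcond=None)
resid = y_train - Z_train @ beta_hat
sigma2 = resid @ resid / (n_train - Z_train.shape[1])
cov_beta = sigma2 * np.linalg.inv(Z_train.T @ Z_train)

# population of lidar segments over the area of interest (hypothetical)
Z_pop = np.column_stack([np.ones(n_pop), rng.uniform(5, 30, n_pop),
                         rng.uniform(0.2, 0.9, n_pop)])
z_bar = Z_pop.mean(axis=0)

mu_hat = z_bar @ beta_hat                  # estimated mean AGBD
var_hat = z_bar @ cov_beta @ z_bar         # hybrid variance (model uncertainty only)
print(f"mean AGBD: {mu_hat:.1f} +/- {1.96 * np.sqrt(var_hat):.1f} Mg/ha")
```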
In some applications of reinforcement learning, a dataset of pre-collected experience is already available, but it is also possible to acquire additional online data to help improve the quality of the policy. While such online data could in principle be collected adaptively, it may be preferable to gather it with a single, non-reactive exploration policy, avoiding the engineering costs associated with switching policies. In this paper we propose an algorithm with provable guarantees that can leverage an offline dataset to design a single non-reactive policy for exploration. We theoretically analyze the algorithm and measure the quality of the final policy as a function of the local coverage of the original dataset and the amount of additional data collected.
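The toy sketch below is not the paper's algorithm; it only illustrates, in a tabular setting with made-up counts, what a single non-reactive exploration policy informed by offline coverage might look like: the policy is fixed in advance and favors state-action pairs the offline data covers poorly.

```python
# Illustrative sketch only (not the proposed algorithm): a fixed, non-reactive exploration
# policy in a small tabular problem, built from offline state-action coverage counts.
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 6, 3

# hypothetical offline dataset of (state, action) pairs
offline = [(rng.integers(n_states), rng.integers(n_actions)) for _ in range(500)]

counts = np.zeros((n_states, n_actions))
for s, a in offline:
    counts[s, a] += 1

# non-reactive policy: decided once, never updated from the new data it collects
bonus = 1.0 / (1.0 + counts)                      # larger where coverage is weaker
explore_policy = bonus / bonus.sum(axis=1, keepdims=True)
print(np.round(explore_policy, 2))
```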
Randomized experiments (REs) are the cornerstone of treatment effect evaluation. However, due to practical considerations, REs may encounter difficulty recruiting sufficient patients. External controls (ECs) can supplement REs to boost estimation efficiency. Yet there may be incomparability between ECs and concurrent controls (CCs), resulting in misleading treatment effect evaluation. We introduce a novel bias function to measure the difference in the outcome mean functions between ECs and CCs. We show that the ANCOVA model augmented by the bias function for ECs yields a consistent estimator of the average treatment effect, regardless of whether or not the ANCOVA model is correct. To accommodate possibly different structures of the ANCOVA model and the bias function, we propose a double penalty integration estimator (DPIE) with different penalization terms for the two functions. With an appropriate choice of penalty parameters, our DPIE ensures consistency, the oracle property, and asymptotic normality even in the presence of model misspecification. DPIE is more efficient than the estimator derived from REs alone, as validated through theoretical and experimental results.
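As a loose illustration, with simulated data and a weighted lasso standing in for the double penalty (no claim to reproduce the DPIE theory), the sketch below fits an ANCOVA working model augmented with a bias-function block for the external controls, applying a heavier penalty to the bias-function coefficients.

```python
# Simplified sketch of the double-penalty idea: an ANCOVA working model plus a separate
# bias-function block for external controls, each with its own L1 penalty level.
# Different penalty levels are emulated here via a weighted lasso (column rescaling).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n_rct, n_ec, p = 300, 600, 5

X = rng.normal(size=(n_rct + n_ec, p))
is_ec = np.r_[np.zeros(n_rct), np.ones(n_ec)]             # external-control indicator
treat = np.r_[rng.binomial(1, 0.5, n_rct), np.zeros(n_ec)]
bias = 0.8 * is_ec * (1 + X[:, 0])                        # ECs differ from CCs
y = 1.5 * treat + X @ np.array([1, .5, 0, 0, 0]) + bias + rng.normal(size=n_rct + n_ec)

# design: [treatment | ANCOVA covariates | bias-function basis for ECs]
B = is_ec[:, None] * np.column_stack([np.ones_like(is_ec), X])
D = np.column_stack([treat, X, B])

# separate penalty weights per block: light on treatment, heavier on the bias block
w = np.r_[0.1, np.full(p, 1.0), np.full(B.shape[1], 3.0)]
fit = Lasso(alpha=0.05, fit_intercept=True, max_iter=50_000).fit(D / w, y)
coef = fit.coef_ / w
print(f"estimated treatment effect: {coef[0]:.2f}")
```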
In many applications in biology, engineering, and economics, identifying similarities and differences between distributions of data from complex processes requires comparing finite categorical samples of discrete counts. Statistical divergences quantify the difference between two distributions. However, their estimation is very difficult, and empirical methods often fail, especially when the samples are small. We develop a Bayesian estimator of the Kullback-Leibler divergence between two probability distributions that makes use of a mixture of Dirichlet priors on the distributions being compared. We study the properties of the estimator on two examples: probabilities drawn from Dirichlet distributions, and random strings of letters drawn from Markov chains. We extend the approach to the squared Hellinger divergence. Both estimators outperform other estimation techniques, with better results for data with a large number of categories and for larger values of the divergence.
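A simplified sketch of the Bayesian approach is given below, using a single symmetric Dirichlet prior on each distribution rather than the mixture of Dirichlet priors proposed above; the counts, prior strength, and sample sizes are hypothetical.

```python
# Simplified sketch: a Bayesian posterior-mean estimate of D_KL(p || q) from two count
# vectors, using a single symmetric Dirichlet prior on each distribution (the mixture of
# Dirichlet priors described above is not reproduced here).
import numpy as np
from scipy.stats import dirichlet

rng = np.random.default_rng(4)

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

def bayes_kl(counts_p, counts_q, prior=0.5, n_draws=5000):
    """Monte Carlo posterior mean of the KL divergence under Dirichlet(prior) priors."""
    draws_p = dirichlet(counts_p + prior).rvs(n_draws, random_state=rng)
    draws_q = dirichlet(counts_q + prior).rvs(n_draws, random_state=rng)
    return np.mean([kl(a, b) for a, b in zip(draws_p, draws_q)])

# hypothetical small samples over 20 categories
true_p = rng.dirichlet(np.ones(20))
true_q = rng.dirichlet(np.ones(20))
counts_p = rng.multinomial(50, true_p)
counts_q = rng.multinomial(50, true_q)

print(f"true KL:        {kl(true_p, true_q):.3f}")
print(f"Bayes estimate: {bayes_kl(counts_p, counts_q):.3f}")
```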
While federated learning (FL) promises to preserve privacy, recent works in the image and text domains have shown that training updates leak private client data. However, most high-stakes applications of FL (e.g., in healthcare and finance) use tabular data, where the risk of data leakage has not yet been explored. A successful attack on tabular data must address two key challenges unique to the domain: (i) obtaining a solution to a high-variance mixed discrete-continuous optimization problem, and (ii) enabling human assessment of the reconstruction, since, unlike for image and text data, direct human inspection is not possible. In this work we address these challenges and propose TabLeak, the first comprehensive reconstruction attack on tabular data. TabLeak is based on two key contributions: (i) a method that leverages a softmax relaxation and pooled ensembling to solve the optimization problem, and (ii) an entropy-based uncertainty quantification scheme to enable human assessment. We evaluate TabLeak on four tabular datasets for both the FedSGD and FedAvg training protocols, and show that it successfully breaks several settings previously deemed safe. For instance, we extract large subsets of private data at >90% accuracy even at the large batch size of 128. Our findings demonstrate that current high-stakes tabular FL is excessively vulnerable to leakage attacks.
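The sketch below illustrates only the second contribution, the entropy-based uncertainty signal over a pooled ensemble of candidate reconstructions, on made-up softmax-relaxed outputs; it is not the TabLeak attack itself.

```python
# Sketch of the entropy-based uncertainty idea over a pooled ensemble of candidate
# reconstructions of one categorical cell (hypothetical attack outputs, not TabLeak).
import numpy as np

rng = np.random.default_rng(5)
n_ensemble, n_categories = 30, 4

# hypothetical softmax-relaxed reconstructions of one cell from 30 attack runs
agreeing = rng.dirichlet(np.array([20.0, 1.0, 1.0, 1.0]), size=n_ensemble)   # consistent
disagreeing = rng.dirichlet(np.ones(n_categories), size=n_ensemble)          # inconsistent

def pooled_entropy(ensemble):
    pooled = ensemble.mean(axis=0)                       # pooled ensembling
    h = -np.sum(pooled * np.log(pooled + 1e-12))
    return pooled.argmax(), h / np.log(len(pooled))      # prediction, normalized entropy

for name, ens in [("agreeing", agreeing), ("disagreeing", disagreeing)]:
    cat, h = pooled_entropy(ens)
    print(f"{name}: predicted category {cat}, normalized entropy {h:.2f}")
```

Low entropy flags reconstructed cells the analyst can trust; high entropy flags cells where the ensemble disagrees.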
A methodology for high-dimensional causal inference in a time series context is introduced. It is assumed that there is a monotonic transformation of the data such that the dynamics of the transformed variables are described by a Gaussian vector autoregressive process. This is tantamount to assuming that the dynamics are captured by a Gaussian copula. No knowledge or estimation of the marginal distributions of the data is required. The procedure consistently identifies the parameters that describe the dynamics of the process and the conditional causal relations among the possibly high-dimensional variables under sparsity conditions. The methodology allows us to identify such causal relations in the form of a directed acyclic graph. As illustrative applications, we consider the impact of supply-side oil shocks on the economy, and the causal relations between aggregated variables constructed from the limit order book for four stock constituents of the S&P 500.
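A minimal sketch of the copula idea follows, on assumed simulated data: each series is mapped to Gaussian scores via its empirical CDF, so no marginal needs to be specified, and a sparse VAR(1) is then fitted by lasso. The causal DAG identification described above goes beyond this sketch.

```python
# Illustrative sketch of the Gaussian-copula VAR idea: rank-based Gaussianization of each
# series followed by a lasso-estimated sparse VAR(1). The DAG identification step is not
# reproduced here; the data-generating process below is hypothetical.
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
T, d = 500, 8

# hypothetical raw data: non-Gaussian margins driven by a latent Gaussian VAR(1)
A = np.zeros((d, d))
A[np.arange(d), np.arange(d)] = 0.5
A[0, 1] = 0.4
Z = np.zeros((T, d))
for t in range(1, T):
    Z[t] = Z[t - 1] @ A.T + rng.normal(size=d)
X = np.exp(Z)                                  # monotone transform: lognormal margins

# nonparanormal transform: empirical CDF followed by the Gaussian quantile function
U = (rankdata(X, axis=0) - 0.5) / T
G = norm.ppf(U)

# sparse VAR(1): lasso regression of each transformed series on all lagged series
B_hat = np.zeros((d, d))
for j in range(d):
    B_hat[j] = Lasso(alpha=0.05).fit(G[:-1], G[1:, j]).coef_
print(np.round(B_hat, 2))
```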
Manual assembly workers face increasing complexity in their work. Human-centered assistance systems could help, but object recognition, as an enabling technology, hinders sophisticated human-centered design of these systems. At the same time, activity recognition based on hand poses suffers from poor pose estimation in complex usage scenarios, such as when workers wear gloves. This paper presents a self-supervised pipeline for adapting hand pose estimation to specific use cases with minimal human interaction. This enables cheap and robust hand pose-based activity recognition. The pipeline consists of a general machine learning model for hand pose estimation trained on a generalized dataset, spatial and temporal filtering to account for anatomical constraints of the hand, and a retraining step to improve the model. Different parameter combinations are evaluated on a publicly available and annotated dataset. The best parameter and model combination is then applied to unlabelled videos from a manual assembly scenario. The effectiveness of the pipeline is demonstrated by training an activity recognition model as a downstream task in the manual assembly scenario.
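The snippet below sketches one possible form of the spatial and temporal filtering step, under assumptions not stated above (a chain-like skeleton and simple bone-length thresholds): frames whose bone lengths deviate strongly from the sequence median are discarded as anatomically implausible, and the remaining poses are smoothed in time.

```python
# Illustrative sketch of spatial/temporal pose filtering; the skeleton, thresholds, and
# data below are hypothetical, not the paper's exact pipeline settings.
import numpy as np

rng = np.random.default_rng(7)
n_frames, n_joints = 200, 21

# hypothetical predicted hand keypoints (x, y, z) per frame, with a few corrupted frames
poses = (np.linspace(0.0, 1.0, n_joints)[None, :, None]
         + rng.normal(scale=0.002, size=(n_frames, n_joints, 3)))
poses[::20] += rng.normal(scale=0.2, size=(n_frames // 20, n_joints, 3))

# assumed parent index per joint, defining the bones of a simple chain-like skeleton
parents = np.arange(n_joints) - 1
parents[0] = 0

def bone_lengths(p):
    return np.linalg.norm(p[:, 1:] - p[:, parents[1:]], axis=-1)    # (frames, bones)

lengths = bone_lengths(poses)
median = np.median(lengths, axis=0)
spatial_ok = np.all(np.abs(lengths - median) < 0.2 * median, axis=1)  # anatomical filter

# temporal filter: moving-average smoothing of the accepted frames
kept = poses[spatial_ok]
kernel = np.ones(5) / 5
smoothed = np.stack([np.convolve(kept[:, j, c], kernel, mode="same")
                     for j in range(n_joints) for c in range(3)], axis=1)
print(f"kept {int(spatial_ok.sum())}/{n_frames} frames; smoothed shape {smoothed.shape}")
```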
Forward simulation-based uncertainty quantification that studies the distribution of quantities of interest (QoI) is a crucial component of computationally robust engineering design and prediction. There is a large body of literature devoted to accurately assessing statistics of QoIs, and in particular, multilevel or multifidelity approaches are known to be effective, leveraging cost-accuracy tradeoffs between a given ensemble of models. However, effective algorithms that can estimate the full distribution of QoIs are still under active development. In this paper, we introduce a general multifidelity framework for estimating the cumulative distribution function (CDF) of a vector-valued QoI associated with a high-fidelity model under a budget constraint. Given a family of appropriate control variates obtained from lower-fidelity surrogates, our framework involves identifying the most cost-effective model subset and then using it to build an approximate control variates estimator for the target CDF. We instantiate the framework by constructing a family of control variates using intermediate linear approximators and rigorously analyze the corresponding algorithm. Our analysis reveals that the resulting CDF estimator is uniformly consistent and asymptotically optimal as the budget tends to infinity, with only mild moment and regularity assumptions on the joint distribution of QoIs. The approach provides a robust multifidelity CDF estimator that is adaptive to the available budget, does not require \textit{a priori} knowledge of cross-model statistics or model hierarchy, and applies in multiple dimensions. We demonstrate the efficiency and robustness of the approach using test examples of parametric PDEs and stochastic differential equations, including both academic instances and more challenging engineering problems.
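A two-model sketch of the basic control-variate mechanism for a single CDF value is shown below, with hypothetical high- and low-fidelity models; the framework described above additionally selects the model subset and allocates a budget, which is not reproduced here.

```python
# Two-model sketch of a control-variate CDF estimate at one threshold t. The indicator of
# the cheap low-fidelity QoI acts as a control variate for the indicator of the expensive
# high-fidelity QoI. Models and sample sizes are hypothetical.
import numpy as np

rng = np.random.default_rng(8)

def high_fidelity(x):   # expensive model (hypothetical)
    return np.sin(x) + 0.1 * x ** 2

def low_fidelity(x):    # cheap, correlated surrogate (hypothetical)
    return np.sin(x) + 0.08 * x ** 2 + 0.05 * rng.normal(size=np.shape(x))

t = 1.0                                          # evaluate the CDF at this threshold
x_paired = rng.normal(size=100)                  # few samples where both models run
x_cheap = rng.normal(size=100_000)               # many samples of the cheap model only

hf = (high_fidelity(x_paired) <= t).astype(float)
lf_paired = (low_fidelity(x_paired) <= t).astype(float)
lf_cheap = (low_fidelity(x_cheap) <= t).astype(float)

# control-variate coefficient estimated from the paired runs
c = np.cov(hf, lf_paired)[0, 1] / np.var(lf_paired, ddof=1)
cdf_mc = hf.mean()                                             # plain Monte Carlo
cdf_cv = hf.mean() + c * (lf_cheap.mean() - lf_paired.mean())  # control-variate estimate
print(f"MC: {cdf_mc:.3f}   CV: {cdf_cv:.3f}")
```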
Many food products involve mixtures of ingredients, where the mixtures can be expressed as combinations of ingredient proportions. In many cases, product quality and consumer preference may also depend on the way in which the mixtures are processed. The processing is generally defined by the settings of one or more process variables. Experimental designs studying the joint impact of the mixture ingredient proportions and the settings of the process variables are called mixture-process variable experiments. In this article, we show how to combine mixture-process variable experiments and discrete choice experiments to quantify and model consumer preferences for food products that can be viewed as processed mixtures. First, we describe the modeling of data from such combined experiments. Next, we describe how to generate D- and I-optimal designs for choice experiments involving mixtures and process variables, and we compare the two kinds of designs using two examples.
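The sketch below evaluates the D-criterion of candidate choice designs for a mixture-process setting under a multinomial logit model at beta = 0, a local-optimality assumption made here only for illustration; the construction of the D- and I-optimal designs described above is not reproduced.

```python
# Sketch: D-criterion of candidate choice designs with Scheffe quadratic mixture terms
# plus one process variable, under a multinomial logit model evaluated at beta = 0.
# The candidate designs below are random, purely for illustration.
import numpy as np

rng = np.random.default_rng(9)

def model_terms(x1, x2, x3, z):
    # Scheffe quadratic mixture model combined with a process variable z
    return np.array([x1, x2, x3, x1 * x2, x1 * x3, x2 * x3, x1 * z, x2 * z, x3 * z])

def random_alternative():
    x = rng.dirichlet(np.ones(3))             # ingredient proportions summing to one
    z = rng.choice([-1.0, 1.0])               # process variable setting
    return model_terms(*x, z)

def d_criterion(design, n_alts=2):
    # design: list of choice sets, each an (n_alts, n_terms) model matrix
    info = np.zeros((design[0].shape[1],) * 2)
    p = np.full(n_alts, 1.0 / n_alts)         # MNL choice probabilities at beta = 0
    W = np.diag(p) - np.outer(p, p)
    for Xs in design:
        info += Xs.T @ W @ Xs
    sign, logdet = np.linalg.slogdet(info)
    return logdet if sign > 0 else -np.inf

# compare two random candidate designs with 12 choice sets of 2 alternatives each
designs = [[np.stack([random_alternative(), random_alternative()]) for _ in range(12)]
           for _ in range(2)]
print([round(d_criterion(d), 2) for d in designs])
```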
Analyzing observational data from multiple sources can be useful for increasing statistical power to detect a treatment effect; however, practical constraints such as privacy considerations may restrict individual-level information sharing across data sets. This paper develops federated methods that only utilize summary-level information from heterogeneous data sets. Our federated methods provide doubly-robust point estimates of treatment effects as well as variance estimates. We derive the asymptotic distributions of our federated estimators, which are shown to be asymptotically equivalent to the corresponding estimators from the combined, individual-level data. We show that to achieve these properties, federated methods should be adjusted based on conditions such as whether models are correctly specified and stable across heterogeneous data sets.
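As a simplified illustration (not the exact federated estimator described above), the sketch below has each hypothetical site compute a doubly robust AIPW estimate and its variance locally and share only those summaries, which a coordinator combines by inverse-variance weighting.

```python
# Simplified sketch: each site fits its own propensity and outcome models, computes an
# AIPW estimate plus a variance, and shares only those two numbers; the coordinator
# combines them by inverse-variance weighting. The data and sites are hypothetical, and
# the adjustments for model instability across sites are not reproduced.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(10)

def site_summary(n, tau=1.0):
    X = rng.normal(size=(n, 3))
    ps = 1 / (1 + np.exp(-X[:, 0]))
    A = rng.binomial(1, ps)
    Y = tau * A + X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

    e = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
    mu1 = LinearRegression().fit(X[A == 1], Y[A == 1]).predict(X)
    mu0 = LinearRegression().fit(X[A == 0], Y[A == 0]).predict(X)

    # AIPW influence-function values; their mean and variance are the shared summaries
    psi = mu1 - mu0 + A * (Y - mu1) / e - (1 - A) * (Y - mu0) / (1 - e)
    return psi.mean(), psi.var(ddof=1) / n

summaries = [site_summary(n) for n in (800, 1200, 500)]      # three hypothetical sites
est = np.array([s[0] for s in summaries])
var = np.array([s[1] for s in summaries])
w = (1 / var) / (1 / var).sum()
print(f"federated ATE: {w @ est:.3f} (SE {np.sqrt(1 / (1 / var).sum()):.3f})")
```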