The paper proposes a supervised machine learning algorithm to uncover treatment effect heterogeneity in classical regression discontinuity (RD) designs. Extending Athey and Imbens (2016), I develop a criterion for building an honest "regression discontinuity tree", where each leaf of the tree contains the RD estimate of a treatment (assigned by a common cutoff rule) conditional on the values of some pre-treatment covariates. It is a priori unknown which covariates are relevant for capturing treatment effect heterogeneity, and it is the task of the algorithm to discover them, without invalidating inference. I study the performance of the method through Monte Carlo simulations and apply it to the data set compiled by Pop-Eleches and Urquiola (2013) to uncover various sources of heterogeneity in the impact of attending a better secondary school in Romania.
The ability to verifiably retrieve transaction or state data stored off-chain is crucial to blockchain scaling techniques such as rollups or sharding. We formalize the problem and design a storage- and communication-efficient protocol using linear erasure-correcting codes and homomorphic vector commitments. Motivated by application requirements for rollups, our solution departs from earlier Verifiable Information Dispersal schemes in that we do not require comprehensive termination properties or retrievability from any but only from some known sufficiently large set of storage nodes. Compared to Data Availability Oracles, under no circumstance do we fall back to returning empty blocks. Distributing a file of 28.8 MB among 900 storage nodes (up to 300 of which may be adversarial) requires in total approx. 95 MB of communication and storage and approx. 30 seconds of cryptographic computation on a single-threaded consumer-grade laptop computer. Our solution requires no modification to on-chain contracts of Validium rollups such as StarkWare's StarkEx. Additionally, it provides privacy of the dispersed data against honest-but-curious storage nodes.
We propose a model-free framework for sensitivity analysis of individual treatment effects (ITEs), building upon ideas from conformal inference. For any unit, our procedure reports the $\Gamma$-value, a number which quantifies the minimum strength of confounding needed to explain away the evidence for ITE. Our approach rests on the reliable predictive inference of counterfactuals and ITEs in situations where the training data is confounded. Under the marginal sensitivity model of Tan (2006), we characterize the shift between the distribution of the observations and that of the counterfactuals. We first develop a general method for predictive inference of test samples from a shifted distribution; we then leverage this to construct covariate-dependent prediction sets for counterfactuals. No matter the value of the shift, these prediction sets (resp. approximately) achieve marginal coverage if the propensity score is known exactly (resp. estimated). We describe a distinct procedure also attaining coverage, however, conditional on the training data. In the latter case, we prove a sharpness result showing that for certain classes of prediction problems, the prediction intervals cannot possibly be tightened. We verify the validity and performance of the new methods via simulation studies and apply them to analyze real datasets.
Satisfiability is considered the canonical NP-complete problem and is used as a starting point for hardness reductions in theory, while in practice heuristic SAT solving algorithms can solve large-scale industrial SAT instances very efficiently. This disparity between theory and practice is believed to be a result of inherent properties of industrial SAT instances that make them tractable. Two characteristic properties seem to be prevalent in the majority of real-world SAT instances, heterogeneous degree distribution and locality. To understand the impact of these two properties on SAT, we study the proof complexity of random k-SAT models that allow to control heterogeneity and locality. Our findings show that heterogeneity alone does not make SAT easy as heterogeneous random k-SAT instances have superpolynomial resolution size. This implies intractability of these instances for modern SAT-solvers. On the other hand, modeling locality with an underlying geometry leads to small unsatisfiable subformulas, which can be found within polynomial time. A key ingredient for the result on geometric random k-SAT can be found in the complexity of higher-order Voronoi diagrams. As an additional technical contribution, we show a linear upper bound on the number of non-empty Voronoi regions, that holds for points with random positions in a very general setting. In particular, it covers arbitrary p-norms, higher dimensions, and weights affecting the area of influence of each point multiplicatively. This is in stark contrast to quadratic lower bounds for the worst case.
Large observational data are increasingly available in disciplines such as health, economic and social sciences, where researchers are interested in causal questions rather than prediction. In this paper, we examine the problem of estimating heterogeneous treatment effects using non-parametric regression-based methods, starting from an empirical study aimed at investigating the effect of participation in school meal programs on health indicators. Firstly, we introduce the setup and the issues related to conducting causal inference with observational or non-fully randomized data, and how these issues can be tackled with the help of statistical learning tools. Then, we review and develop a unifying taxonomy of the existing state-of-the-art frameworks that allow for individual treatment effects estimation via non-parametric regression models. After presenting a brief overview on the problem of model selection, we illustrate the performance of some of the methods on three different simulated studies. We conclude by demonstrating the use of some of the methods on an empirical analysis of the school meal program data.
Randomized controlled trials are considered the gold standard to evaluate the treatment effect (estimand) for efficacy and safety. According to the recent International Council on Harmonisation (ICH)-E9 addendum (R1), intercurrent events (ICEs) need to be considered when defining an estimand, and principal stratum is one of the five strategies to handle ICEs. Qu et al. (2020, Statistics in Biopharmaceutical Research 12:1-18) proposed estimators for the adherer average causal effect (AdACE) for estimating the treatment difference for those who adhere to one or both treatments based on the causal-inference framework, and demonstrated the consistency of those estimators; however, this method requires complex custom programming related to high-dimensional numeric integrations. In this article, we implemented the AdACE estimators using multiple imputation (MI) and constructs CI through bootstrapping. A simulation study showed that the MI-based estimators provided consistent estimators with the nominal coverage probabilities of CIs for the treatment difference for the adherent populations of interest. As an illustrative example, the new method was applied to data from a real clinical trial comparing 2 types of basal insulin for patients with type 1 diabetes.
A space-time Trefftz discontinuous Galerkin method for the Schr\"odinger equation with piecewise-constant potential is proposed and analyzed. Following the spirit of Trefftz methods, trial and test spaces are spanned by non-polynomial complex wave functions that satisfy the Schro\"odinger equation locally on each element of the space-time mesh. This allows for a significant reduction in the number of degrees of freedom in comparison with full polynomial spaces. We prove well-posedness and stability of the method, and, for the one- and two- dimensional cases, optimal, high-order, h-convergence error estimates in a skeleton norm. Some numerical experiments validate the theoretical results presented.
The classical multilevel model fails to capture the proximity effect in epidemiological studies, where subjects are nested within geographical units. Multilevel Conditional Autoregressive models are alternatives to help explain the spatial effect better. They have been developed for cross-sectional studies but not for longitudinal studies so far. This paper has two goals. Firstly, it further develops the multilevel (growth) models for longitudinal data by adding existing area level random effect terms with CAR prior specification, whose structure is changing over time. We name these models MLM tCARs for longitudinal data. We compare the developed MLM tCARs to the classical multilevel growth model via simulation studies in common spatial data situations. The results indicate the better performance of the MLM tCARs, to retrieve the true regression coefficients and with better fit in general. Secondly, this paper provides a comprehensive decision tree for analysing data in epidemiological studies with spatially nested structure: we also consider the Multilevel Conditional Autoregressive models for cross-sectional studies (MLM CARs). We compare three models (for cross-sectional studies) via simulation studies: the classical multilevel model, the multilevel CAR model and the Restricted CAR model that accounts for spatial confounding. The MLM CARs, particularly the Restricted CAR show better results. We apply the models comparatively on the analysis of the association between greenness and depressive symptoms in the longitudinal Heinz Nixdorf Recall Study. The results show negative association between greenness and depression and a decreasing linear individual time trend for all models. We observe very weak spatial variation and moderate temporal autocorrelation.
Heatmap-based methods dominate in the field of human pose estimation by modelling the output distribution through likelihood heatmaps. In contrast, regression-based methods are more efficient but suffer from inferior performance. In this work, we explore maximum likelihood estimation (MLE) to develop an efficient and effective regression-based methods. From the perspective of MLE, adopting different regression losses is making different assumptions about the output density function. A density function closer to the true distribution leads to a better regression performance. In light of this, we propose a novel regression paradigm with Residual Log-likelihood Estimation (RLE) to capture the underlying output distribution. Concretely, RLE learns the change of the distribution instead of the unreferenced underlying distribution to facilitate the training process. With the proposed reparameterization design, our method is compatible with off-the-shelf flow models. The proposed method is effective, efficient and flexible. We show its potential in various human pose estimation tasks with comprehensive experiments. Compared to the conventional regression paradigm, regression with RLE bring 12.4 mAP improvement on MSCOCO without any test-time overhead. Moreover, for the first time, especially on multi-person pose estimation, our regression method is superior to the heatmap-based methods. Our code is available at //github.com/Jeff-sjtu/res-loglikelihood-regression
The availability of large microarray data has led to a growing interest in biclustering methods in the past decade. Several algorithms have been proposed to identify subsets of genes and conditions according to different similarity measures and under varying constraints. In this paper we focus on the exclusive row biclustering problem for gene expression data sets, in which each row can only be a member of a single bicluster while columns can participate in multiple ones. This type of biclustering may be adequate, for example, for clustering groups of cancer patients where each patient (row) is expected to be carrying only a single type of cancer, while each cancer type is associated with multiple (and possibly overlapping) genes (columns). We present a novel method to identify these exclusive row biclusters through a combination of existing biclustering algorithms and combinatorial auction techniques. We devise an approach for tuning the threshold for our algorithm based on comparison to a null model in the spirit of the Gap statistic approach. We demonstrate our approach on both synthetic and real-world gene expression data and show its power in identifying large span non-overlapping rows sub matrices, while considering their unique nature. The Gap statistic approach succeeds in identifying appropriate thresholds in all our examples.
Generative models (GMs) such as Generative Adversary Network (GAN) and Variational Auto-Encoder (VAE) have thrived these years and achieved high quality results in generating new samples. Especially in Computer Vision, GMs have been used in image inpainting, denoising and completion, which can be treated as the inference from observed pixels to corrupted pixels. However, images are hierarchically structured which are quite different from many real-world inference scenarios with non-hierarchical features. These inference scenarios contain heterogeneous stochastic variables and irregular mutual dependences. Traditionally they are modeled by Bayesian Network (BN). However, the learning and inference of BN model are NP-hard thus the number of stochastic variables in BN is highly constrained. In this paper, we adapt typical GMs to enable heterogeneous learning and inference in polynomial time.We also propose an extended autoregressive (EAR) model and an EAR with adversary loss (EARA) model and give theoretical results on their effectiveness. Experiments on several BN datasets show that our proposed EAR model achieves the best performance in most cases compared to other GMs. Except for black box analysis, we've also done a serial of experiments on Markov border inference of GMs for white box analysis and give theoretical results.