A recent cluster trial in Bangladesh randomized 600 villages into 300 treatment/control pairs, to evaluate the impact of an intervention to increase mask-wearing. Data was analyzed in a generalized linear model and significance asserted with parametric tests for the rate of the primary outcome (symptomatic and seropositive for COVID-19) between treatment and control villages. In this short note we re-analyze the data from this trial using standard non-parametric paired statistics tests on treatment/village pairs. With this approach, we find that behavioral outcomes like physical distancing are highly significant, while the primary outcome of the study is not. Importantly, we find that the behavior of unblinded staff when enrolling study participants is one of the most highly significant differences between treatment and control groups, contributing to a significant imbalance in denominators between treatment and control groups. The potential bias leading to this imbalance suggests caution is warranted when evaluating rates rather than counts. More broadly, the significant impacts on staff and participant behavior urge caution in interpreting small differences in the study outcomes that depended on survey response.
One of the most studied models of SAT is random SAT. In this model, instances are composed from clauses chosen uniformly randomly and independently of each other. This model may be unsatisfactory in that it fails to describe various features of SAT instances, arising in real-world applications. Various modifications have been suggested to define models of industrial SAT. Here, we focus mainly on the aspect of community structure. Namely, here the set of variables consists of a number of disjoint communities, and clauses tend to consist of variables from the same community. Thus, we suggest a model of random industrial SAT, in which the central generalization with respect to random SAT is the additional community structure. There has been a lot of work on the satisfiability threshold of random $k$-SAT, starting with the calculation of the threshold of $2$-SAT, up to the recent result that the threshold exists for sufficiently large $k$. In this paper, we endeavor to study the satisfiability threshold for the proposed model of random industrial SAT. Our main result is that the threshold in this model tends to be smaller than its counterpart for random SAT. Moreover, under some conditions, this threshold even vanishes.
Entropic regularization is a method for large-scale linear programming. Geometrically, one traces intersections of the feasible polytope with scaled toric varieties, starting at the Birch point. We compare this to log-barrier methods, with reciprocal linear spaces, starting at the analytic center. We revisit entropic regularization for unbalanced optimal transport, and we develop the use of optimal conic couplings. We compute the degree of the associated toric variety, and we explore algorithms like iterative scaling.
Classical two-sample permutation tests for equality of distributions have exact size in finite samples, but they fail to control size for testing equality of parameters that summarize each distribution. This paper proposes permutation tests for equality of parameters that are estimated at root-$n$ or slower rates. Our general framework applies to both parametric and nonparametric models, with two samples or one sample split into two subsamples. Our tests have correct size asymptotically while preserving exact size in finite samples when distributions are equal. They have no loss in local-asymptotic power compared to tests that use asymptotic critical values. We propose confidence sets with correct coverage in large samples that also have exact coverage in finite samples if distributions are equal up to a transformation. We apply our theory to four commonly-used hypothesis tests of nonparametric functions evaluated at a point. Lastly, simulations show good finite sample properties, and two empirical examples illustrate our tests in practice.
Recently, methods have been proposed that exploit the invariance of prediction models with respect to changing environments to infer subsets of the causal parents of a response variable. If the environments influence only few of the underlying mechanisms, the subset identified by invariant causal prediction, for example, may be small, or even empty. We introduce the concept of minimal invariance and propose invariant ancestry search (IAS). In its population version, IAS outputs a set which contains only ancestors of the response and is a superset of the output of ICP. When applied to data, corresponding guarantees hold asymptotically if the underlying test for invariance has asymptotic level and power. We develop scalable algorithms and perform experiments on simulated and real data.
Generalizability and transportability methods have been proposed to address the external validity bias of randomized clinical trials that results from differences in the distribution of treatment effect modifiers between trial and target populations. However, such studies present many challenges. We review and summarize state-of-the-art methodological considerations. We additionally provide investigators with a step-by-step guide to address these challenges, illustrated through a published case study. When conducted with rigor, such studies may play an integral role in regulatory decisions by providing key real-world evidence.
The focus of disentanglement approaches has been on identifying independent factors of variation in data. However, the causal variables underlying real-world observations are often not statistically independent. In this work, we bridge the gap to real-world scenarios by analyzing the behavior of the most prominent disentanglement approaches on correlated data in a large-scale empirical study (including 4260 models). We show and quantify that systematically induced correlations in the dataset are being learned and reflected in the latent representations, which has implications for downstream applications of disentanglement such as fairness. We also demonstrate how to resolve these latent correlations, either using weak supervision during training or by post-hoc correcting a pre-trained model with a small number of labels.
Contrastive learning (CL) is a popular technique for self-supervised learning (SSL) of visual representations. It uses pairs of augmentations of unlabeled training examples to define a classification task for pretext learning of a deep embedding. Despite extensive works in augmentation procedures, prior works do not address the selection of challenging negative pairs, as images within a sampled batch are treated independently. This paper addresses the problem, by introducing a new family of adversarial examples for constrastive learning and using these examples to define a new adversarial training algorithm for SSL, denoted as CLAE. When compared to standard CL, the use of adversarial examples creates more challenging positive pairs and adversarial training produces harder negative pairs by accounting for all images in a batch during the optimization. CLAE is compatible with many CL methods in the literature. Experiments show that it improves the performance of several existing CL baselines on multiple datasets.
We consider the question: how can you sample good negative examples for contrastive learning? We argue that, as with metric learning, learning contrastive representations benefits from hard negative samples (i.e., points that are difficult to distinguish from an anchor point). The key challenge toward using hard negatives is that contrastive methods must remain unsupervised, making it infeasible to adopt existing negative sampling strategies that use label information. In response, we develop a new class of unsupervised methods for selecting hard negative samples where the user can control the amount of hardness. A limiting case of this sampling results in a representation that tightly clusters each class, and pushes different classes as far apart as possible. The proposed method improves downstream performance across multiple modalities, requires only few additional lines of code to implement, and introduces no computational overhead.
Recently introduced generative adversarial network (GAN) has been shown numerous promising results to generate realistic samples. The essential task of GAN is to control the features of samples generated from a random distribution. While the current GAN structures, such as conditional GAN, successfully generate samples with desired major features, they often fail to produce detailed features that bring specific differences among samples. To overcome this limitation, here we propose a controllable GAN (ControlGAN) structure. By separating a feature classifier from a discriminator, the generator of ControlGAN is designed to learn generating synthetic samples with the specific detailed features. Evaluated with multiple image datasets, ControlGAN shows a power to generate improved samples with well-controlled features. Furthermore, we demonstrate that ControlGAN can generate intermediate features and opposite features for interpolated and extrapolated input labels that are not used in the training process. It implies that ControlGAN can significantly contribute to the variety of generated samples.
Recent years have witnessed significant progresses in deep Reinforcement Learning (RL). Empowered with large scale neural networks, carefully designed architectures, novel training algorithms and massively parallel computing devices, researchers are able to attack many challenging RL problems. However, in machine learning, more training power comes with a potential risk of more overfitting. As deep RL techniques are being applied to critical problems such as healthcare and finance, it is important to understand the generalization behaviors of the trained agents. In this paper, we conduct a systematic study of standard RL agents and find that they could overfit in various ways. Moreover, overfitting could happen "robustly": commonly used techniques in RL that add stochasticity do not necessarily prevent or detect overfitting. In particular, the same agents and learning algorithms could have drastically different test performance, even when all of them achieve optimal rewards during training. The observations call for more principled and careful evaluation protocols in RL. We conclude with a general discussion on overfitting in RL and a study of the generalization behaviors from the perspective of inductive bias.