Treatment effect estimation from observational data is a central problem in causal inference. Methods based on the potential outcomes framework solve this problem by exploiting inductive biases and heuristics from causal inference. Each existing technique addresses a specific aspect of treatment effect estimation, such as controlling the propensity score or enforcing randomization, by designing neural network architectures and regularizers. In this paper, we propose the Neurosymbolic Treatment Effect Estimator (NESTER), an adaptive, generalized method for treatment effect estimation. NESTER brings together all the desiderata for treatment effect estimation into one framework. For this purpose, we design a Domain Specific Language (DSL) for treatment effect estimation based on inductive biases used in the literature. We also theoretically study NESTER's capability for the treatment effect estimation task. Our comprehensive empirical results show that NESTER performs better on benchmark datasets than state-of-the-art methods without compromising run-time requirements.
CLIP, the first foundation model that connects images and text, has enabled many recent breakthroughs in computer vision. However, its associated training cost is prohibitively high, imposing a significant barrier to its widespread exploration. In this paper, we present a surprising finding that there exists an inverse scaling law for CLIP training, whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. Moreover, we showcase that the strategy for reducing image/text token length plays a crucial role in determining the quality of this scaling law. As a result of this finding, we are able to successfully train CLIP even with academic resources. For example, on an eight-GPU A100 server, our CLIP models achieve zero-shot top-1 ImageNet accuracies of 63.2% in ~2 days, 67.8% in ~3 days, and 69.3% in ~4 days. By reducing the computation barrier associated with CLIP, we hope to inspire more research in this field, particularly from academics. Our code is available at //github.com/UCSC-VLAA/CLIPA.
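As an illustration of what "reducing the image token length" can look like in practice, the minimal sketch below randomly subsamples patch tokens before a ViT-style image encoder. This is not code from the paper; the masking strategy, function name, and the use of PyTorch are assumptions made for the example only.

```python
import torch

def subsample_image_tokens(patch_tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Randomly keep a subset of patch tokens, shortening the sequence seen by the encoder.

    patch_tokens: (batch, num_tokens, dim) patch embeddings (CLS token excluded).
    keep_ratio: fraction of tokens to retain; smaller values mean shorter training sequences.
    """
    batch, num_tokens, dim = patch_tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Draw an independent random ordering of token indices for each sample.
    scores = torch.rand(batch, num_tokens, device=patch_tokens.device)
    keep_idx = scores.argsort(dim=1)[:, :num_keep]            # (batch, num_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)     # (batch, num_keep, dim)
    return patch_tokens.gather(dim=1, index=keep_idx)

# Example: 256 patch tokens reduced to 128 before the transformer blocks.
tokens = torch.randn(8, 256, 768)
short_tokens = subsample_image_tokens(tokens, keep_ratio=0.5)
print(short_tokens.shape)  # torch.Size([8, 128, 768])
```

The abstract's point is that how tokens are removed (not just how many) affects the resulting scaling behavior, so this random-subsampling choice is only one of several possible strategies.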
Treatment effect estimation is of high importance for both researchers and practitioners across many scientific and industrial domains. The abundance of observational data has made such data increasingly used by researchers for the estimation of causal effects. However, these data suffer from several weaknesses, including biases, that lead to inaccurate causal effect estimates if not handled properly. Therefore, several machine learning techniques have been proposed, most of them focusing on leveraging the predictive power of neural network models to attain more precise estimation of causal effects. In this work, we propose a new methodology, named Nearest Neighboring Information for Causal Inference (NNCI), for integrating valuable nearest-neighboring information into neural network-based models for estimating treatment effects. The proposed NNCI methodology is applied to some of the most well-established neural network-based models for treatment effect estimation with the use of observational data. Numerical experiments and analysis provide empirical and statistical evidence that the integration of NNCI with state-of-the-art neural network models leads to considerably improved treatment effect estimations on a variety of well-known challenging benchmarks.
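The abstract does not spell out the exact form in which the neighboring information enters the network, so the following is only a minimal sketch of one plausible variant (the function name, feature choice, and use of scikit-learn are assumptions, not the NNCI method itself): each unit's covariates are augmented with the mean observed outcome of its k nearest neighbors from the opposite treatment group, and the augmented matrix is then fed to any neural network-based estimator.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_outcome_features(X, t, y, k=5):
    """Append, for each unit, the mean outcome of its k nearest neighbors
    in the opposite treatment group as an extra input feature.

    X: (n, d) covariates, t: (n,) binary treatment, y: (n,) observed outcomes.
    Returns an (n, d + 1) augmented covariate matrix.
    """
    X, t, y = np.asarray(X), np.asarray(t), np.asarray(y)
    feats = np.zeros(len(X))
    for group in (0, 1):
        donors = np.where(t == group)[0]      # neighbors are drawn from this group
        queries = np.where(t != group)[0]     # units in the other group query it
        nn = NearestNeighbors(n_neighbors=min(k, len(donors))).fit(X[donors])
        _, idx = nn.kneighbors(X[queries])
        feats[queries] = y[donors][idx].mean(axis=1)
    return np.hstack([X, feats[:, None]])

# Example with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = rng.integers(0, 2, size=100)
y = X[:, 0] + t + rng.normal(scale=0.1, size=100)
X_aug = nn_outcome_features(X, t, y, k=5)
print(X_aug.shape)  # (100, 4)
```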
In this paper, we propose two new algorithms for maximum-likelihood estimation (MLE) of high-dimensional sparse covariance matrices. Unlike most state-of-the-art methods, which either use regularization techniques or penalize the likelihood to impose sparsity, we solve the MLE problem based on an estimated covariance graph. More specifically, we propose a two-stage procedure: in the first stage, we determine the sparsity pattern of the target covariance matrix (in other words, the marginal independence structure of the covariance graph under a Gaussian graphical model) using the multiple hypothesis testing method of false discovery rate (FDR), and in the second stage we use either a block coordinate descent approach to estimate the non-zero values or a proximal distance approach that penalizes the distance between the estimated covariance graph and the target covariance matrix. Doing so gives rise to two different methods, each with its own advantage: the coordinate descent approach does not require tuning of any hyper-parameters, whereas the proximal distance approach is computationally fast but requires careful tuning of the penalty parameter. Both methods are effective even in cases where the number of observed samples is less than the dimension of the data. For performance evaluation, we test the proposed methods on both simulated and real-world data and show that they provide more accurate estimates of the sparse covariance matrix than two state-of-the-art methods.
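A minimal sketch of what the first stage could look like, assuming the pairwise null hypotheses sigma_ij = 0 are tested via Fisher z-transformed sample correlations and FDR is controlled with Benjamini-Hochberg (the paper's exact test statistic and procedure may differ; all names here are illustrative):

```python
import numpy as np
from scipy import stats

def fdr_sparsity_pattern(X, alpha=0.05):
    """Stage 1: estimate the support (sparsity pattern) of the covariance matrix
    by testing H0: sigma_ij = 0 for every pair and controlling the FDR with the
    Benjamini-Hochberg procedure on Fisher z-transformed sample correlations."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    iu = np.triu_indices(p, k=1)
    z = np.arctanh(R[iu]) * np.sqrt(n - 3)       # Fisher z statistic, ~N(0,1) under H0
    pvals = 2 * stats.norm.sf(np.abs(z))         # two-sided p-values
    # Benjamini-Hochberg step-up procedure.
    order = np.argsort(pvals)
    m = len(pvals)
    thresh = alpha * np.arange(1, m + 1) / m
    passed = pvals[order] <= thresh
    k = np.max(np.where(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    support = np.eye(p, dtype=bool)              # diagonal (variances) always retained
    support[iu] = reject
    support |= support.T
    return support                               # stage 2 estimates the values on this support

X = np.random.default_rng(1).normal(size=(200, 10))
print(fdr_sparsity_pattern(X).sum())
```

The second stage (block coordinate descent or the proximal distance penalty) would then maximize the Gaussian likelihood while respecting, or being attracted to, this estimated support.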
In many fields of the biomedical sciences, it is common for random variables to be measured repeatedly across different subjects. In such a repeated measurement setting, the between-subject and within-subject dependence structures among the random variables may differ and should be estimated differently. Ignoring this fact may lead to questionable or even erroneous scientific conclusions. In this paper, we study the problem of sparse and positive-definite estimation of between-subject and within-subject covariance matrices for high-dimensional repeated measurements. Our estimators are defined as solutions to convex optimization problems that can be solved efficiently. We establish estimation error rates for our proposed estimators of the two target matrices, and demonstrate their favorable performance through theoretical analysis and comprehensive simulation studies. We further apply our methods to recover two covariance graphs of clinical variables from hemodialysis patients.
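The abstract does not reproduce the authors' convex programs, so the sketch below only illustrates the kind of object being targeted, a covariance estimate that is both sparse and positive definite, via a generic recipe (off-diagonal soft-thresholding followed by an eigenvalue floor). This is not the proposed estimator; the function name and thresholds are assumptions.

```python
import numpy as np

def sparse_pd_covariance(S, lam=0.1, eps=1e-3):
    """Illustrative sparse, positive-definite covariance estimate:
    soft-threshold the off-diagonal entries of the sample covariance S,
    then floor the eigenvalues at eps to restore positive definiteness."""
    off = S - np.diag(np.diag(S))
    thresholded = np.sign(off) * np.maximum(np.abs(off) - lam, 0.0)
    Sigma = thresholded + np.diag(np.diag(S))
    # Eigenvalue floor: keeps the estimate usable in downstream procedures
    # that require invertibility (it may slightly perturb the exact zeros).
    w, V = np.linalg.eigh(Sigma)
    return (V * np.maximum(w, eps)) @ V.T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
S = np.cov(X, rowvar=False)
Sigma_hat = sparse_pd_covariance(S, lam=0.2)
print(np.linalg.eigvalsh(Sigma_hat).min() > 0)  # True
```

In the repeated-measurement setting of the paper, such a procedure would be applied separately to the between-subject and within-subject sample covariance components.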
A contiguous area cartogram is a geographic map in which the area of each region is proportional to numerical data (e.g., population size) while keeping neighboring regions connected. In this study, we investigated whether value-to-area legends (square symbols next to the values represented by the squares' areas) and grid lines aid map readers in making better area judgments. We conducted an experiment to determine the accuracy, speed, and confidence with which readers infer numerical data values for the mapped regions. We found that, when only informed about the total numerical value represented by the whole cartogram without any legend, the distribution of estimates for individual regions was centered near the true value with substantial spread. Legends with grid lines significantly reduced the spread but led to a tendency to underestimate the values. Comparing differences between regions or between cartograms revealed that legends and grid lines slowed the estimation without improving accuracy. However, participants were more likely to complete the tasks when legends and grid lines were present, particularly when the area units represented by these features could be interactively selected. We recommend considering the cartogram's use case and purpose before deciding whether to include grid lines or an interactive legend.
Gun violence is a major problem in contemporary American society. However, relatively little is known about the effects of firearm injuries on survivors and their family members and how these effects vary across subpopulations. To study these questions and, more generally, to address a gap in the causal inference literature, we present a framework for the study of effect modification or heterogeneous treatment effects in difference-in-differences designs. We implement a new matching technique, which combines profile matching and risk set matching, to (i) preserve the time alignment of covariates, exposure, and outcomes, avoiding pitfalls of other common approaches for difference-in-differences, and (ii) explicitly control biases due to imbalances in observed covariates in subgroups discovered from the data. Our case study shows significant and persistent effects of nonfatal firearm injuries on several health outcomes for those injured and on the mental health of their family members. Sensitivity analyses reveal that these results are moderately robust to unmeasured confounding bias. Finally, while the effects for those injured are modified largely by the severity of the injury and its documented intent, for families, effects are strongest for those whose relative's injury is documented as resulting from an assault, self-harm, or law enforcement intervention.
Efficiently and flexibly estimating treatment effect heterogeneity is an important task in a wide variety of settings ranging from medicine to marketing, and there are a considerable number of promising conditional average treatment effect estimators currently available. These, however, typically rely on the assumption that the measured covariates are enough to justify conditional exchangeability. We propose the P-learner, motivated by the R- and DR-learner: a tailored two-stage loss function for learning heterogeneous treatment effects in settings where exchangeability given observed covariates is an implausible assumption and we wish to rely on proxy variables for causal inference. Our proposed estimator can be implemented with off-the-shelf loss-minimizing machine learning methods and, in the case of kernel regression, satisfies an oracle bound on the estimation error as long as the nuisance components are estimated reasonably well.
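The P-learner's proxy-based loss is not given in the abstract; for orientation only, here is a minimal sketch of the two-stage, loss-minimization structure of the R-learner that motivates it, using off-the-shelf scikit-learn learners (cross-fitting is omitted for brevity; the function name and learner choices are assumptions, not the authors' implementation):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge

def r_learner_style_cate(X, t, y):
    """Generic two-stage ("R-learner style") CATE fit with off-the-shelf learners:
    stage 1 estimates the nuisances m(x) = E[Y|X] and e(x) = E[T|X];
    stage 2 regresses outcome residuals on treatment residuals via a weighted loss."""
    m_hat = GradientBoostingRegressor().fit(X, y).predict(X)
    e_hat = np.clip(GradientBoostingRegressor().fit(X, t).predict(X), 0.01, 0.99)
    y_res, t_res = y - m_hat, t - e_hat
    # Minimizing sum_i t_res_i^2 * (y_res_i / t_res_i - tau(x_i))^2 is equivalent
    # to the R-learner objective; here tau(x) is a simple ridge model.
    pseudo = y_res / t_res
    weights = t_res ** 2
    return Ridge().fit(X, pseudo, sample_weight=weights)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
t = rng.binomial(1, 0.5, size=500).astype(float)
tau = 1.0 + X[:, 0]
y = X.sum(axis=1) + tau * t + rng.normal(scale=0.5, size=500)
tau_hat = r_learner_style_cate(X, t, y).predict(X)
print(np.corrcoef(tau_hat, tau)[0, 1])
```

The P-learner replaces the nuisance components above with ones built from proxy variables so that the second-stage loss remains valid when conditional exchangeability given X fails.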
The Work Disability Functional Assessment Battery (WD-FAB) is a multidimensional item response theory (IRT) instrument designed for assessing work-related mental and physical function based on responses to an item bank. In prior iterations it was developed using traditional means -- linear factorization and null hypothesis statistical testing for item partitioning/selection, and finally, posthoc calibration of disjoint unidimensional IRT models. As a result, the WD-FAB, like many other IRT instruments, is a posthoc model. Its item partitioning, based on exploratory factor analysis, is blind to the final nonlinear IRT model and is not performed in a manner consistent with goodness of fit to the final model. In this manuscript, we develop a Bayesian hierarchical model that self-consistently performs the following simultaneous tasks: scale factorization, item selection, parameter identification, and response scoring. This method uses sparsity-based shrinkage to obviate the linear factorization and null hypothesis statistical tests that are usually required for developing multidimensional IRT models, so that item partitioning is consistent with the ultimate nonlinear factor model. We also analogize our multidimensional IRT model to probabilistic autoencoders, specifying an encoder function that amortizes the inference of ability parameters from item responses. The encoder function is equivalent to the "VBE" step in a stochastic variational Bayesian expectation maximization (VBEM) procedure that we use for approximate Bayesian inference on the entire model. We apply the method to a sample of WD-FAB item responses and compare the resulting item discriminations to those obtained using the traditional posthoc method.
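To make the probabilistic-autoencoder analogy concrete, here is a minimal sketch of an encoder that amortizes inference of latent abilities from item responses, paired with a multidimensional 2PL-style decoder. This is only an illustration of the analogy, not the paper's hierarchical model (it omits the sparsity-based shrinkage priors and the full VBEM procedure); the class name, layer sizes, and the use of PyTorch are assumptions.

```python
import torch
import torch.nn as nn

class AmortizedIRTEncoder(nn.Module):
    """Encoder maps a vector of item responses to the mean and log-variance of a
    Gaussian posterior over latent abilities (the "VBE" step); the decoder is a
    multidimensional 2PL-style item response model."""
    def __init__(self, n_items, n_dims, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_items, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * n_dims))
        self.discrimination = nn.Parameter(torch.randn(n_items, n_dims) * 0.1)
        self.difficulty = nn.Parameter(torch.zeros(n_items))

    def forward(self, responses):
        mu, logvar = self.encoder(responses).chunk(2, dim=-1)
        theta = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized ability sample
        logits = theta @ self.discrimination.T - self.difficulty   # 2PL-style response model
        return torch.sigmoid(logits), mu, logvar

model = AmortizedIRTEncoder(n_items=20, n_dims=3)
responses = torch.bernoulli(torch.full((8, 20), 0.5))
probs, mu, logvar = model(responses)
print(probs.shape)  # torch.Size([8, 20])
```

Sparsity-inducing priors on the discrimination matrix are what would drive the item partitioning in the full model, by shrinking each item's loadings toward a single latent dimension.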
We constructed a modular, biomimetic red panda paw with which to experimentally investigate the evolutionary reason for the existence of the false thumbs of red pandas. These thumbs were once believed to have shared a common origin with the similar false thumbs of giant pandas; however, the discovery of a carnivorous fossil ancestor of the red panda that had false thumbs implies that the red panda did not evolve its thumbs to assist in eating bamboo, as the giant panda did, but rather evolved its thumbs for some other purpose. The leading proposal for this purpose is that the thumbs developed to aid arboreal locomotion. To test this hypothesis, we conducted grasp tests on rods 5-15 mm in diameter using a biomimetic paw with interchangeable thumbs 0-16 mm in length. The results of these tests demonstrated an optimal thumb length of 7 mm, which is just above the red panda's true thumb length of 5.5 mm. Given trends in the data suggesting that smaller thumbs are better suited to grasping larger-diameter rods, we conclude that the red panda's thumb being sized below the optimal length indicates an adaptation toward grasping branches rather than relatively thinner food items, supporting the new proposal that the red panda's thumbs are an adaptation primarily for climbing rather than food manipulation.
We present self-supervised geometric perception (SGP), the first general framework to learn a feature descriptor for correspondence matching without any ground-truth geometric model labels (e.g., camera poses, rigid transformations). Our first contribution is to formulate geometric perception as an optimization problem that jointly optimizes the feature descriptor and the geometric models given a large corpus of visual measurements (e.g., images, point clouds). Under this optimization formulation, we show that two important streams of research in vision, namely robust model fitting and deep feature learning, correspond to optimizing one block of the unknown variables while fixing the other block. This analysis naturally leads to our second contribution -- the SGP algorithm that performs alternating minimization to solve the joint optimization. SGP iteratively executes two meta-algorithms: a teacher that performs robust model fitting given learned features to generate geometric pseudo-labels, and a student that performs deep feature learning under noisy supervision of the pseudo-labels. As a third contribution, we apply SGP to two perception problems on large-scale real datasets, namely relative camera pose estimation on MegaDepth and point cloud registration on 3DMatch. We demonstrate that SGP achieves state-of-the-art performance that is on par with or superior to the supervised oracles trained using ground-truth labels.
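The teacher-student alternation described in the abstract can be summarized with the structural skeleton below. It is only a sketch of the control flow: in SGP the teacher would be a robust geometric model-fitting routine (e.g., RANSAC-style pose or registration estimation) and the student a deep descriptor trainer; the toy stand-ins here exist solely so the skeleton runs end to end.

```python
from typing import Callable, Sequence

def sgp_alternating(
    data: Sequence,
    robust_fit: Callable,      # teacher: fits geometric models from current features -> pseudo-labels
    train_features: Callable,  # student: updates the feature descriptor from pseudo-labels
    init_features,
    n_rounds: int = 5,
):
    """Skeleton of the SGP-style alternating minimization: fix the descriptor and
    robustly fit geometric models (teacher), then fix the resulting pseudo-labels
    and retrain the descriptor under that noisy supervision (student)."""
    features = init_features
    for _ in range(n_rounds):
        pseudo_labels = [robust_fit(sample, features) for sample in data]  # teacher step
        features = train_features(data, pseudo_labels, features)           # student step
    return features

# Toy stand-ins so the skeleton executes; real SGP replaces these with robust
# model fitting and deep feature learning, respectively.
data = [1.0, 2.0, 3.0]
robust_fit = lambda sample, feats: sample * feats
train_features = lambda data, labels, feats: feats + 0.1 * (sum(labels) / len(labels) - feats)
print(sgp_alternating(data, robust_fit, train_features, init_features=1.0, n_rounds=3))
```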