We study the problem of causal structure learning with no assumptions on the functional relationships or the noise distributions. We develop DAG-FOCI, a computationally fast algorithm for this setting based on the FOCI variable selection algorithm of \cite{azadkia2019simple}. DAG-FOCI requires no tuning parameters and outputs the parents and the Markov boundary of a response variable of interest. We provide high-dimensional guarantees for our procedure when the underlying graph is a polytree. Furthermore, we demonstrate the applicability of DAG-FOCI on real data from computational biology \cite{sachs2005causal} and illustrate the robustness of our methods to violations of the assumptions.
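As a rough illustration of the selection step that DAG-FOCI builds on, the following is a minimal sketch of FOCI's forward stepwise selection with the rank-based coefficient $T_n$ of \cite{azadkia2019simple}; tie handling is simplified, and the DAG-specific parent and Markov-boundary logic of DAG-FOCI is omitted.

\begin{verbatim}
import numpy as np
from scipy.spatial import cKDTree

def codec_t(y, z):
    # T_n(Y, Z): rank / nearest-neighbour dependence coefficient of
    # Azadkia & Chatterjee (unconditional version, ties simplified).
    n = len(y)
    z = np.atleast_2d(z.T).T                      # ensure shape (n, p)
    _, nn = cKDTree(z).query(z, k=2)
    m = nn[:, 1]                                  # nearest neighbour M(i)
    r = np.array([np.sum(y <= yi) for yi in y])   # R_i
    l = np.array([np.sum(y >= yi) for yi in y])   # L_i
    num = np.sum(n * np.minimum(r, r[m]) - l ** 2)
    den = np.sum(l * (n - l).astype(float))
    return num / den

def foci(y, x):
    # Forward stepwise selection: greedily add the predictor that
    # maximises T_n(Y, X_S + {j}); stop once T_n no longer increases.
    n, p = x.shape
    selected, t_old = [], 0.0
    while len(selected) < p:
        rest = [j for j in range(p) if j not in selected]
        scores = [codec_t(y, x[:, selected + [j]]) for j in rest]
        b = int(np.argmax(scores))
        if scores[b] <= t_old:
            break
        selected.append(rest[b])
        t_old = scores[b]
    return selected

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 10))
y = np.sin(x[:, 0]) + x[:, 1] ** 2 + 0.1 * rng.normal(size=500)
print(foci(y, x))   # typically selects columns 0 and 1 here
\end{verbatim}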
Recently, methods have been proposed that exploit the invariance of prediction models with respect to changing environments to infer subsets of the causal parents of a response variable. If the environments influence only a few of the underlying mechanisms, the subset identified by invariant causal prediction (ICP), for example, may be small or even empty. We introduce the concept of minimal invariance and propose invariant ancestry search (IAS). In its population version, IAS outputs a set that contains only ancestors of the response and is a superset of the output of ICP. When applied to data, the corresponding guarantees hold asymptotically if the underlying test for invariance has asymptotic level and power. We develop scalable algorithms and perform experiments on simulated and real data.
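To make the definitions concrete, here is a brute-force sketch of the population idea behind IAS: collect all invariant sets, keep the minimally invariant ones (those with no invariant proper subset), and return their union. The invariance check below (pooled regression plus a one-way ANOVA on residual means across environments) is a simple stand-in for any test with asymptotic level and power, and the search is exponential in the number of predictors; the paper develops scalable algorithms.

\begin{verbatim}
import itertools
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

def is_invariant(y, x, env, S, alpha=0.05):
    # Crude invariance check: pooled OLS of Y on X_S, then test whether
    # residual means differ across environments (one-way ANOVA).
    if S:
        res = y - LinearRegression().fit(x[:, S], y).predict(x[:, S])
    else:
        res = y - y.mean()
    groups = [res[env == e] for e in np.unique(env)]
    return stats.f_oneway(*groups).pvalue >= alpha

def ias(y, x, env, alpha=0.05):
    # Union of all minimally invariant sets (brute force over subsets).
    p = x.shape[1]
    inv = [S for k in range(p + 1)
           for S in map(set, itertools.combinations(range(p), k))
           if is_invariant(y, x, env, sorted(S), alpha)]
    minimal = [S for S in inv if not any(T < S for T in inv)]
    return set().union(*minimal) if minimal else set()
\end{verbatim}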
We are interested in learning the directed acyclic graph (DAG) when data are generated from a linear structural equation model (SEM) and the causal structure can be characterized by a polytree. Under Gaussian polytree models, we study sufficient conditions on the sample size for the well-known Chow-Liu algorithm to exactly recover both the skeleton and the equivalence class of the polytree, the latter uniquely represented by a CPDAG. Conversely, necessary conditions on the required sample sizes for both skeleton and CPDAG recovery are derived in terms of information-theoretic lower bounds, which match the respective sufficient conditions and thereby give a sharp characterization of the difficulty of these tasks. We also consider extensions to the sub-Gaussian case, and we study the estimation of the inverse correlation matrix under such models. Our theoretical findings are illustrated by comprehensive numerical simulations, and experiments on benchmark data demonstrate the robustness of polytree learning when the true graphical structure can only be approximated by a polytree.
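For reference, the skeleton step of the Chow-Liu algorithm in the Gaussian case reduces to a maximum-weight spanning tree under the pairwise mutual information $I(X_i; X_j) = -\tfrac{1}{2}\log(1-\rho_{ij}^2)$, which is increasing in $|\rho_{ij}|$. A minimal sketch follows; orienting v-structures to obtain the CPDAG is a separate second stage, omitted here.

\begin{verbatim}
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_skeleton(x):
    # Max-weight spanning tree under Gaussian mutual information.
    # Since I(Xi; Xj) is increasing in |rho_ij|, a maximum spanning
    # tree on |rho| (i.e. a minimum one on -|rho|) gives the skeleton.
    r = np.corrcoef(x, rowvar=False)
    np.fill_diagonal(r, 0.0)
    mst = minimum_spanning_tree(-np.abs(r))
    i, j = mst.nonzero()
    return sorted((int(a), int(b))
                  for a, b in zip(np.minimum(i, j), np.maximum(i, j)))

rng = np.random.default_rng(1)
x1 = rng.normal(size=2000)
x2 = 0.8 * x1 + rng.normal(size=2000)
x3 = 0.8 * x1 + rng.normal(size=2000)
print(chow_liu_skeleton(np.column_stack([x1, x2, x3])))  # [(0, 1), (0, 2)]
\end{verbatim}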
We present a Bayesian nonparametric model for conditional distribution estimation using Bayesian additive regression trees (BART). The generative model we use is based on rejection sampling from a base model. As is typical of BART models, our model is flexible, has a default prior specification, and is computationally convenient. To address the distinguished role the response plays in the model we propose, we further introduce an approach to targeted smoothing that may be of independent interest for BART models. We study the proposed model theoretically and provide sufficient conditions for the posterior to concentrate at close to the minimax-optimal rate, adaptively over smoothness classes, in the high-dimensional regime in which many predictors are irrelevant. To fit our model we propose a data-augmentation algorithm that allows existing BART samplers to be extended with minimal effort. We illustrate the performance of our methodology on simulated data and use it to study the relationship between education and body mass index using data from the Medical Expenditure Panel Survey (MEPS).
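The rejection-sampling construction can be conveyed in a few lines: draw $Y$ from a base density and accept with a probability driven by a function of $(x, y)$, which implicitly tilts the base density toward the target conditional. In the paper that function carries a BART (sum-of-trees) prior; the logistic link and the stand-in function below are assumptions chosen for concreteness.

\begin{verbatim}
import numpy as np

def sample_conditional(x, f, rng, base=None):
    # Generative model via rejection sampling: Y ~ base density,
    # accepted with probability sigmoid(f(x, y)), so that
    #   p(y | x)  is proportional to  p_base(y) * sigmoid(f(x, y)).
    # In the paper f has a BART prior; here it is user-supplied.
    base = base or (lambda rng: rng.normal())
    while True:
        y = base(rng)
        if rng.uniform() < 1.0 / (1.0 + np.exp(-f(x, y))):
            return y

rng = np.random.default_rng(0)
f = lambda x, y: 3.0 * x * y          # toy tilt: y shifts with x
draws = [sample_conditional(1.0, f, rng) for _ in range(5)]
\end{verbatim}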
Estimation of causal effects using machine learning methods has become an active research field in econometrics. In this paper, we study the finite-sample performance of meta-learners for the estimation of heterogeneous treatment effects when sample splitting and cross-fitting are used to reduce overfitting bias. In both synthetic and semi-synthetic simulations we find that the finite-sample performance of the meta-learners depends greatly on the estimation procedure. The results imply that, in large samples, sample splitting and cross-fitting are beneficial for bias reduction and for the efficiency of the meta-learners, respectively, whereas full-sample estimation is preferable in small samples. Furthermore, we provide practical recommendations for the application of specific meta-learners in empirical studies, depending on data characteristics such as treatment shares and sample size.
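As an illustration of the estimation procedures being compared, the following sketches one cross-fitted meta-learner (a DR-learner variant): nuisance functions are fit on held-out folds, doubly robust pseudo-outcomes are formed on the complementary folds, and a final model regresses the pseudo-outcomes on covariates. The estimator choices (random forests) and the propensity clipping are illustrative assumptions, not the paper's exact setup.

\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def dr_learner_cate(x, w, y, n_splits=2, seed=0):
    # Cross-fitting: nuisances are fit on the training fold, doubly
    # robust pseudo-outcomes are formed on the held-out fold.
    pseudo = np.empty_like(y, dtype=float)
    for train, test in KFold(n_splits, shuffle=True,
                             random_state=seed).split(x):
        mu0 = RandomForestRegressor(random_state=seed).fit(
            x[train][w[train] == 0], y[train][w[train] == 0])
        mu1 = RandomForestRegressor(random_state=seed).fit(
            x[train][w[train] == 1], y[train][w[train] == 1])
        e = RandomForestClassifier(random_state=seed).fit(x[train], w[train])
        m0, m1 = mu0.predict(x[test]), mu1.predict(x[test])
        ps = np.clip(e.predict_proba(x[test])[:, 1], 0.05, 0.95)
        pseudo[test] = (m1 - m0
                        + w[test] * (y[test] - m1) / ps
                        - (1 - w[test]) * (y[test] - m0) / (1 - ps))
    # Final stage: regress pseudo-outcomes on covariates to get the CATE.
    return RandomForestRegressor(random_state=seed).fit(x, pseudo)
\end{verbatim}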
Second-order optimization methods are among the most widely used approaches for convex optimization problems and have recently been applied to non-convex problems such as deep learning models. Widely used second-order methods such as quasi-Newton methods generally provide curvature information by approximating the Hessian through the secant equation. However, the secant equation becomes ineffective in approximating the Newton step because it relies only on first-order derivative information. In this study, we propose an approximate Newton sketch-based stochastic optimization algorithm for large-scale empirical risk minimization. Specifically, we compute a partial column Hessian of size $d\times m$, with $m\ll d$ randomly selected variables, and then use the \emph{Nystr\"om method} to approximate the full Hessian matrix. To further reduce the computational complexity per iteration, we directly compute the update step $\Delta\boldsymbol{w}$ without computing or storing the full Hessian or its inverse. We then integrate our approximated Hessian with stochastic gradient descent and stochastic variance-reduced gradient methods. Numerical experiments on both convex and non-convex functions show that the proposed approach obtains a better approximation of Newton's method and performs competitively with state-of-the-art first-order and stochastic quasi-Newton methods. Furthermore, we provide a theoretical convergence analysis for convex functions.
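A sketch of the core update may help: with $m$ sampled columns $C = H[:, \mathrm{idx}]$ and $W = C[\mathrm{idx}, :]$, the Nystr\"om approximation is $H \approx C W^{+} C^{\top}$, and a regularized step solving $(C W^{+} C^{\top} + \rho I)\Delta\boldsymbol{w} = -\nabla f$ needs only $m \times m$ solves via the Woodbury identity. Column sampling, the Hessian-column evaluations, and the coupling with SGD/SVRG are problem-specific and omitted from this sketch.

\begin{verbatim}
import numpy as np

def nystrom_newton_step(grad, hess_cols, idx, rho=1e-3):
    # Nystrom Hessian from m sampled columns:
    #   H ~ C W^+ C^T,  C = H[:, idx] (d x m),  W = C[idx, :] (m x m).
    # Solves (H_nys + rho*I) dw = -grad via Woodbury, never forming
    # the full Hessian or its inverse.
    C = hess_cols                      # d x m
    W = 0.5 * (C[idx, :] + C[idx, :].T)  # symmetrize against round-off
    s, U = np.linalg.eigh(W)
    keep = s > 1e-10
    Z = C @ (U[:, keep] / np.sqrt(s[keep]))   # H_nys = Z Z^T
    k = Z.shape[1]
    inner = rho * np.eye(k) + Z.T @ Z
    # (rho I + Z Z^T)^{-1} g = (g - Z (rho I + Z^T Z)^{-1} Z^T g) / rho
    return -(grad - Z @ np.linalg.solve(inner, Z.T @ grad)) / rho
\end{verbatim}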
Directed acyclic graphs (DAGs) provide a powerful framework for modeling causal relationships among variables in multivariate settings; in addition, through do-calculus, they allow for the identification and estimation of causal effects between variables even from purely observational data. In this setting, the process of inferring the DAG structure from the data is referred to as causal structure learning or causal discovery. We introduce BCDAG, an R package for Bayesian causal discovery and causal effect estimation from Gaussian observational data, implementing the Markov chain Monte Carlo (MCMC) scheme proposed by Castelletti & Mascaro (2021). Our implementation scales efficiently with the number of observations and, whenever the DAGs are sufficiently sparse, with the number of variables in the dataset. The package also provides functions for convergence diagnostics and for visualizing and summarizing posterior inference. In this paper, we present the key features of the underlying methodology along with its implementation in BCDAG. We then illustrate the main functions and algorithms on both real and simulated datasets.
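To convey the flavor of score-based MCMC over DAG space (this is not the BCDAG API, which is in R, nor the DAG-Wishart marginal likelihood the package actually uses), here is a toy Metropolis-Hastings structure sampler with a Gaussian BIC stand-in score:

\begin{verbatim}
import numpy as np

def bic_score(data, A):
    # Decomposable stand-in score (Gaussian BIC); BCDAG instead uses
    # marginal likelihoods under a DAG-Wishart prior.
    n, q = data.shape
    total = 0.0
    for j in range(q):
        pa = np.flatnonzero(A[:, j])
        X = np.column_stack([np.ones(n), data[:, pa]])
        beta = np.linalg.lstsq(X, data[:, j], rcond=None)[0]
        rss = np.mean((data[:, j] - X @ beta) ** 2)
        total += -0.5 * n * np.log(rss) - 0.5 * (len(pa) + 1) * np.log(n)
    return total

def is_dag(A):
    # A binary adjacency matrix is acyclic iff it is nilpotent
    # (fine for small graphs; A[i, j] = 1 encodes the edge i -> j).
    return not np.linalg.matrix_power(A, A.shape[0]).any()

def structure_mcmc(data, n_iter=5000, seed=0):
    # Toy Metropolis-Hastings over DAGs: flip one randomly chosen
    # entry (edge insert/delete), reject moves that break acyclicity.
    rng = np.random.default_rng(seed)
    q = data.shape[1]
    A = np.zeros((q, q), dtype=int)
    cur = bic_score(data, A)
    samples = []
    for _ in range(n_iter):
        i, j = rng.integers(q, size=2)
        if i != j:
            B = A.copy()
            B[i, j] ^= 1
            if B[i, j] == 0 or (B[j, i] == 0 and is_dag(B)):
                new = bic_score(data, B)
                if np.log(rng.uniform()) < new - cur:
                    A, cur = B, new
        samples.append(A.copy())
    return samples
\end{verbatim}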
We consider the problem of discovering $K$ related Gaussian directed acyclic graphs (DAGs), where the involved graph structures share a consistent causal order and sparse unions of supports. In the multi-task learning setting, we propose an $\ell_1/\ell_2$-regularized maximum likelihood estimator (MLE) for learning $K$ linear structural equation models. We theoretically show that the joint estimator, by leveraging data across related tasks, achieves a better sample complexity for recovering the causal (topological) order than separate estimation. Moreover, the joint estimator is able to recover non-identifiable DAGs by estimating them together with identifiable DAGs. Our analysis also establishes the consistency of union-support recovery of the structures. To allow practical implementation, we design a continuous optimization problem whose optimizer coincides with the joint estimator and can be approximated efficiently by an iterative algorithm. We validate the theoretical analysis and the effectiveness of the joint estimator in experiments.
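In our notation (a sketch consistent with the description above, not necessarily the paper's exact display), such a joint estimator takes the form
\[
\widehat{B}^{(1)},\dots,\widehat{B}^{(K)} \;=\; \arg\min_{B^{(1)},\dots,B^{(K)}} \; \sum_{k=1}^{K} \ell_k\bigl(B^{(k)}\bigr) \;+\; \lambda \sum_{i \neq j} \bigl\| \bigl(B^{(1)}_{ij}, \dots, B^{(K)}_{ij}\bigr) \bigr\|_2,
\]
where $B^{(k)}$ collects the edge coefficients of the $k$-th linear SEM, $\ell_k$ is the corresponding negative Gaussian log-likelihood, and the $\ell_1/\ell_2$ group penalty couples each candidate edge $(i,j)$ across the $K$ tasks, which is what induces a sparse union of supports.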
Conventional supervised learning methods, especially deep ones, are sensitive to out-of-distribution (OOD) examples, largely because the learned representation mixes the semantic factor with the variation factor due to their domain-specific correlation, while only the semantic factor causes the output. To address this problem, we propose a Causal Semantic Generative model (CSG) based on causal reasoning, so that the two factors are modeled separately, and develop methods for OOD prediction from a single training domain, a common and challenging setting. The methods are based on the causal invariance principle, with a novel design for both efficient learning and easy prediction. Theoretically, we prove that under certain conditions CSG can identify the semantic factor by fitting the training data, and that this semantic identification guarantees the boundedness of the OOD generalization error and the success of adaptation. Empirical studies show improved OOD performance over prevailing baselines.
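In our notation (consistent with the description above), CSG posits the factorization
\[
p(x, y) \;=\; \int p(s, v)\, p(x \mid s, v)\, p(y \mid s)\, \mathrm{d}s\, \mathrm{d}v,
\]
where $s$ is the semantic factor and $v$ the variation factor; the causal invariance principle keeps the mechanisms $p(x \mid s, v)$ and $p(y \mid s)$ fixed across domains, while the prior $p(s, v)$, which encodes the domain-specific correlation between the two factors, is allowed to shift.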
This paper focuses on the expected difference in borrowers' repayment when there is a change in the lender's credit decisions. Classical estimators overlook confounding effects, and hence their estimation error can be substantial. We therefore propose an alternative approach to constructing estimators such that this error is greatly reduced. The proposed estimators are shown to be unbiased, consistent, and robust through a combination of theoretical analysis and numerical testing. Moreover, we compare the power of the classical and the proposed estimators in estimating the causal quantities. The comparison spans a wide range of models, including linear regression models, tree-based models, and neural-network-based models, under simulated datasets that exhibit different levels of causality, degrees of nonlinearity, and distributional properties. Most importantly, we apply our approaches to a large observational dataset provided by a global technology firm that operates in both the e-commerce and the lending business. We find that the relative reduction in estimation error is striking once the causal effects are correctly accounted for.
Clustering and classification critically rely on distance metrics that provide meaningful comparisons between data points. We present mixed-integer optimization approaches to finding optimal distance metrics that generalize the Mahalanobis metric extensively studied in the literature. Additionally, we generalize and improve upon leading methods by removing the reliance on pre-designated "target neighbors," "triplets," and "similarity pairs." Another salient feature of our method is its ability to enable active learning by recommending precise regions to sample after an optimal metric is computed, thereby improving classification performance. This targeted acquisition can significantly reduce the computational burden by ensuring that the training data are complete, representative, and economical. We demonstrate the classification and computational performance of the algorithms through several simple and intuitive examples, followed by results on real image and medical datasets.
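For reference, the Mahalanobis family that these metrics generalize is
\[
d_M(x, x') \;=\; \sqrt{(x - x')^{\top} M \,(x - x')}, \qquad M \succeq 0,
\]
and classical metric learning amounts to choosing the positive semidefinite matrix $M$; the mixed-integer formulations search over richer parameterizations of the metric without relying on pre-designated target neighbors, triplets, or similarity pairs.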