To further develop statistical inference for heterogeneous treatment effects, this paper builds on Breiman's (2001) random forests and Wager and Athey's (2018) causal trees. We parameterize the nonparametric problem by combining the well-understood statistical properties of classical OLS with local linear intervals delimited by covariate quantile points, while preserving the advantages of random forests, namely constructible confidence intervals and asymptotic normality (Athey and Imbens, 2016; Efron, 2014; Wager et al., 2014). We propose a decision tree that splits at quantiles according to fixed rules and fits local polynomial estimates within each cell, which we call the quantile local linear causal tree (QLPRT) and forest (QLPRF).
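As a rough illustration of the core mechanic (fixed quantile splits followed by local OLS fits), here is a minimal one-dimensional sketch. It shows plain regression only; the treatment-effect estimation and forest aggregation of QLPRT/QLPRF are not reproduced, and all names are illustrative:

```python
import numpy as np

def quantile_local_linear_fit(x, y, n_bins=4):
    """Partition a scalar covariate at empirical quantiles and fit an
    OLS line within each bin: a caricature of fixed-rule quantile
    splits combined with local polynomial estimation."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    edges[-1] += 1e-9                       # include the maximum point
    fits = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x >= lo) & (x < hi)
        X = np.column_stack([np.ones(mask.sum()), x[mask]])
        beta, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
        fits.append((lo, hi, beta))         # intercept and slope per bin
    return fits

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 500)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 500)
for lo, hi, beta in quantile_local_linear_fit(x, y):
    print(f"[{lo:.2f}, {hi:.2f}): intercept={beta[0]:+.2f}, slope={beta[1]:+.2f}")
```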
Spectral clustering algorithms are very popular. Starting from a pairwise similarity matrix, spectral clustering gives a partition of the data that approximately minimizes the total similarity scores across clusters. Since there is no need to model how the data are distributed within each cluster, such methods enjoy algorithmic simplicity and robustness in clustering non-Gaussian data, such as data near manifolds. Nevertheless, several important questions remain unaddressed, such as how to estimate the similarity scores and the cluster assignment probabilities, both of which are important uncertainty estimates in clustering. In this article, we propose to solve these problems by discovering a generative modeling counterpart. Our clustering model is based on a spanning forest graph that consists of several disjoint spanning trees, with each tree corresponding to a cluster. Taking a Bayesian approach, we assign proper densities to the root and leaf nodes, and we prove that the posterior mode is almost the same as the spectral clustering estimate. Further, we show that the associated generative process, named the "forest process", is a continuous extension of the classic urn process, hence inheriting many nice properties, such as having unbounded support for the number of clusters and being amenable to an existing partition probability function; at the same time, we carefully characterize the differences. We demonstrate a novel application in the joint clustering of multiple-subject functional magnetic resonance imaging scans of the human brain.
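For reference, the spectral clustering baseline that the generative model mirrors can be sketched in a few lines (normalized graph Laplacian plus k-means on its leading eigenvectors). This is the standard algorithm, not the forest-process model itself:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    """Plain normalized spectral clustering: Gaussian similarity matrix,
    normalized graph Laplacian, k-means on the leading eigenvectors."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))             # pairwise similarities
    np.fill_diagonal(W, 0.0)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(1)))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    # eigenvectors for the k smallest eigenvalues of the normalized Laplacian
    _, V = eigh(L_sym, subset_by_index=[0, k - 1])
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # row-normalize
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)
```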
We study regression adjustments with additional covariates in randomized experiments under covariate-adaptive randomizations (CARs) when subject compliance is imperfect. We develop a regression-adjusted local average treatment effect (LATE) estimator that provably improves efficiency in the estimation of LATEs under CARs. Our adjustments can be parametric in linear and nonlinear forms, nonparametric, or high-dimensional. Even when the adjustments are misspecified, our proposed estimator is still consistent and asymptotically normal, and the associated inference method still achieves exact asymptotic size under the null. When the adjustments are correctly specified, our estimator achieves the minimum asymptotic variance. When the adjustments are parametrically misspecified, we construct a new estimator that is weakly more efficient than the linearly and nonlinearly adjusted estimators, as well as the one without any adjustment. Simulation evidence and an empirical application confirm the efficiency gains achieved by regression adjustments relative to both the estimator without adjustment and the standard two-stage least squares estimator.
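To fix ideas, a simple covariate-adjusted Wald/LATE estimator can be sketched as below: a Lin-style imputation adjustment applied to both the reduced form and the first stage. This ignores the CAR-specific corrections developed in the paper, and the function names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_late(y, d, z, X):
    """Wald-type LATE estimate with a linear regression adjustment:
    within each instrument arm, combine the full-sample mean of the
    fitted values with the arm-specific mean residual, for both the
    outcome (reduced form) and take-up (first stage)."""
    est = {}
    for var, name in [(y, "y"), (d, "d")]:
        mu = np.zeros(2)
        for arm in (0, 1):
            m = z == arm
            fit = LinearRegression().fit(X[m], var[m])
            mu[arm] = fit.predict(X).mean() + (var[m] - fit.predict(X[m])).mean()
        est[name] = mu[1] - mu[0]
    return est["y"] / est["d"]   # ratio of adjusted arm differences
```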
Population adjustment methods such as matching-adjusted indirect comparison (MAIC) are increasingly used to compare marginal treatment effects when there are cross-trial differences in effect modifiers and limited patient-level data. MAIC is based on propensity score weighting, which is sensitive to poor covariate overlap and cannot extrapolate beyond the observed covariate space. Current outcome regression-based alternatives can extrapolate but target a conditional treatment effect that is incompatible in the indirect comparison. When adjusting for covariates, one must integrate or average the conditional estimate over the relevant population to recover a compatible marginal treatment effect. We propose a marginalization method based on parametric G-computation that can be easily applied where the outcome regression is a generalized linear model or a Cox model. The approach views the covariate adjustment regression as a nuisance model and separates its estimation from the evaluation of the marginal treatment effect of interest. The method can accommodate a Bayesian statistical approach, which naturally integrates the analysis into a probabilistic framework. A simulation study provides a proof of principle and benchmarks the method's performance against MAIC and the conventional outcome regression. Parametric G-computation achieves more precise and more accurate estimates than MAIC, particularly when covariate overlap is poor, and yields unbiased marginal treatment effect estimates when no assumptions fail. Furthermore, the marginalized regression-adjusted estimates provide greater precision and accuracy than the conditional estimates produced by the conventional outcome regression, which are systematically biased because the measure of effect is non-collapsible.
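The G-computation step itself is mechanically simple. A minimal sketch for a binary outcome with a logistic outcome model follows; the resampling of target covariates stands in for simulating from a parametric covariate model of the comparator population, and all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def marginal_logit_gcomp(X_ipd, t_ipd, y_ipd, X_target, n_sim=10000, seed=0):
    """Parametric G-computation: fit a conditional logistic outcome
    model on the trial with individual patient data, then average its
    predictions over target-population covariates under t=1 and t=0
    to obtain a marginal log-odds ratio."""
    rng = np.random.default_rng(seed)
    model = LogisticRegression().fit(np.column_stack([t_ipd, X_ipd]), y_ipd)
    Xs = X_target[rng.integers(0, len(X_target), n_sim)]  # covariate draws
    p1 = model.predict_proba(np.column_stack([np.ones(n_sim), Xs]))[:, 1].mean()
    p0 = model.predict_proba(np.column_stack([np.zeros(n_sim), Xs]))[:, 1].mean()
    logit = lambda p: np.log(p / (1 - p))
    return logit(p1) - logit(p0)   # marginal log-odds ratio
```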
Classical methods for quantile regression fail in cases where the quantile of interest is extreme and only a few or no training data points exceed it. Asymptotic results from extreme value theory can be used to extrapolate beyond the range of the data, and several approaches exist that use linear regression, kernel methods or generalized additive models. Most of these methods break down if the predictor space has more than a few dimensions or if the regression function of extreme quantiles is complex. We propose a method for extreme quantile regression that combines the flexibility of random forests with the theory of extrapolation. Our extremal random forest (ERF) estimates the parameters of a generalized Pareto distribution, conditional on the predictor vector, by maximizing a local likelihood with weights extracted from a quantile random forest. Under certain assumptions, we show consistency of the estimated parameters. Furthermore, we penalize the shape parameter in this likelihood to regularize its variability in the predictor space. Simulation studies show that our ERF outperforms both classical quantile regression methods and existing regression approaches from extreme value theory. We apply our methodology to extreme quantile prediction for U.S. wage data.
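The weighted local likelihood step can be sketched compactly. The following assumes the forest weights have already been extracted for one test point, and omits the shape penalty that ERF adds:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import genpareto

def weighted_gpd_fit(exceedances, weights):
    """Fit a generalized Pareto distribution to threshold exceedances
    by maximizing a similarity-weighted log-likelihood, as in local
    likelihood estimation with forest weights."""
    def nll(params):
        shape, log_scale = params
        return -np.sum(weights * genpareto.logpdf(
            exceedances, c=shape, scale=np.exp(log_scale)))
    res = minimize(nll, x0=[0.1, np.log(exceedances.mean())],
                   method="Nelder-Mead")
    shape, log_scale = res.x
    return shape, np.exp(log_scale)
```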
Random forests are considered one of the best out-of-the-box classification and regression algorithms due to their high level of predictive performance with relatively little tuning. Pairwise proximities, which measure the similarity between data points relative to the supervised task, can be computed from a trained random forest. Random forest proximities have been used in many applications, including the identification of variable importance, data imputation, outlier detection, and data visualization. However, existing definitions of random forest proximities do not accurately reflect the data geometry learned by the random forest. In this paper, we introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP). We prove that the proximity-weighted sum (regression) or majority vote (classification) using RF-GAP exactly matches the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest. We empirically show that this improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.
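For context, the traditional proximity definition that RF-GAP improves upon is easy to state in code: the fraction of trees in which two observations land in the same terminal node. RF-GAP refines this by restricting to out-of-bag pairings and weighting by in-bag leaf counts, which this sketch does not replicate:

```python
import numpy as np

def forest_proximities(forest, X):
    """Traditional random forest proximities: the fraction of trees in
    which two observations share a terminal node (works with a fitted
    scikit-learn RandomForestClassifier/Regressor)."""
    leaves = forest.apply(X)                   # (n_samples, n_trees) leaf ids
    n = X.shape[0]
    prox = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        prox += leaves[:, [t]] == leaves[:, t]  # pairs sharing a leaf in tree t
    return prox / leaves.shape[1]
```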
Recently, conditional average treatment effect (CATE) estimation has attracted much attention due to its importance in various fields, such as statistics and the social and biomedical sciences. This study proposes a partially linear nonparametric Bayes model for heterogeneous treatment effect estimation. A partially linear model is a semiparametric model that consists of linear and nonparametric components in an additive form. A nonparametric Bayes model that uses a Gaussian process to model the nonparametric component has already been studied; however, that model cannot handle heterogeneity of the treatment effect. In our proposed model, not only the nonparametric component but also the heterogeneous treatment effect of the treatment variable is modeled by a Gaussian process prior. We derive the analytic form of the posterior distribution of the CATE and prove that the posterior is consistent, that is, it concentrates around the true distribution. We show the effectiveness of the proposed method through numerical experiments on synthetic data.
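Under a model of the form y = f(x) + t * tau(x) + noise with independent GP priors on f and tau, the posterior mean of tau at the training inputs has a closed form by standard Gaussian conditioning. A minimal sketch with fixed RBF hyperparameters, chosen here only for illustration:

```python
import numpy as np

def rbf(A, B, ls=1.0):
    """Squared-exponential kernel between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def cate_posterior_mean(X, t, y, sigma2=0.1):
    """Posterior mean of tau(x) at the training inputs under
    y = f(X) + t * tau(X) + eps, with GP priors f ~ GP(0, Kf) and
    tau ~ GP(0, Kt); then y ~ N(0, Kf + Dt Kt Dt + sigma2 I)."""
    Kf, Kt = rbf(X, X), rbf(X, X, ls=2.0)
    Dt = np.diag(t.astype(float))
    Ky = Kf + Dt @ Kt @ Dt + sigma2 * np.eye(len(y))
    return Kt @ Dt @ np.linalg.solve(Ky, y)   # E[tau | data]
```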
Long-term outcomes of experimental evaluations are necessarily observed after long delays. We develop semiparametric methods for combining the short-term outcomes of an experimental evaluation with observational measurements of the joint distribution of short-term and long-term outcomes to estimate long-term treatment effects. We characterize semiparametric efficiency bounds for estimation of the average effect of a treatment on a long-term outcome in several instances of this problem. These calculations facilitate the construction of semiparametrically efficient estimators. The finite-sample performance of these estimators is analyzed with a simulation calibrated to a randomized evaluation of the long-term effects of a poverty alleviation program.
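One concrete instance of this data combination problem is the surrogate-index construction: learn the conditional mean of the long-term outcome given short-term outcomes in the observational sample, impute it in the experiment, and difference across arms. A minimal sketch under surrogacy and comparability assumptions; the efficient estimators discussed above add influence-function corrections not shown here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def long_term_effect(s_exp, t_exp, s_obs, y_obs):
    """Surrogate-index style estimate of a long-term treatment effect:
    fit E[Y_long | S] on observational data, impute long-term outcomes
    from short-term outcomes S in the experiment, and take a
    difference in means across treatment arms."""
    fit = LinearRegression().fit(s_obs, y_obs)
    y_hat = fit.predict(s_exp)
    return y_hat[t_exp == 1].mean() - y_hat[t_exp == 0].mean()
```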
We revisit the widely used preferential Gaussian processes (PGP) of Chu et al. (2005) and challenge their modelling assumption that imposes rankability of data items via latent utility function values. We propose a generalisation of PGP that can capture more expressive latent preferential structures in the data and can thus be used to model inconsistent preferences, i.e., where transitivity is violated, or to discover clusters of comparable items via spectral decomposition of the learned preference functions. We also consider the properties of the associated covariance kernel functions and their reproducing kernel Hilbert space (RKHS), giving a simple construction that satisfies universality in the space of preference functions. Finally, we provide an extensive set of numerical experiments on simulated and real-world datasets showcasing the competitiveness of our proposed method with the state of the art. Our experimental findings support the conjecture that violations of rankability are ubiquitous in real-world preferential data.
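The rankability assumption being challenged corresponds to the classic "preference judgement" kernel, in which the preference function is forced to be a difference of utilities, g(u, v) = f(u) - f(v), with f drawn from a GP. A sketch of the kernel that this construction induces (the generalisation above relaxes exactly this structure):

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel between two single points."""
    return np.exp(-((a - b) ** 2).sum(-1) / (2 * ls ** 2))

def preference_kernel(pair1, pair2, ls=1.0):
    """Covariance of g(u, v) = f(u) - f(v) for f ~ GP(0, rbf): the
    utility-based construction that makes preferences antisymmetric
    and rankable by design."""
    (u1, v1), (u2, v2) = pair1, pair2
    return (rbf(u1, u2, ls) + rbf(v1, v2, ls)
            - rbf(u1, v2, ls) - rbf(v1, u2, ls))
```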
Stock trend forecasting, which aims at predicting future stock trends, is crucial for investors seeking to maximize profits from the stock market. In recent years, many event-driven methods have used events extracted from news, social media, and discussion boards to forecast stock trends. However, existing event-driven methods have two main shortcomings: 1) overlooking the influence of event information as differentiated by stock-dependent properties; 2) neglecting the effect of event information from other related stocks. In this paper, we propose a relational event-driven stock trend forecasting (REST) framework, which addresses the shortcomings of existing methods. To remedy the first shortcoming, we propose to model the stock context and learn the effect of event information on stocks under different contexts. To address the second shortcoming, we construct a stock graph and design a new propagation layer to propagate the effect of event information from related stocks. Experimental studies on real-world data demonstrate the effectiveness of our REST framework. The results of an investment simulation show that our framework can achieve a higher return on investment than the baselines.
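To make the propagation idea concrete, here is a generic one-step message-passing layer over a stock graph. It is a stand-in for the cross-stock propagation described above, not the REST layer itself, and every name and parameter is illustrative:

```python
import numpy as np

def propagate(event_repr, adj, W, alpha=0.5):
    """One generic graph-propagation step: mix each stock's event
    representation with the mean representation of its neighbours
    (row-normalized adjacency), then apply a learned projection."""
    deg = adj.sum(1, keepdims=True).clip(min=1)   # avoid division by zero
    neighbour_msg = (adj / deg) @ event_repr      # mean over neighbours
    return np.tanh((alpha * event_repr + (1 - alpha) * neighbour_msg) @ W)
```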
Discrete random structures are important tools in Bayesian nonparametrics, and the resulting models have proven effective in density estimation, clustering, topic modeling, and prediction, among other tasks. In this paper, we consider nested processes and study the dependence structures they induce. Dependence ranges between homogeneity, corresponding to full exchangeability, and maximum heterogeneity, corresponding to (unconditional) independence across samples. The popular nested Dirichlet process is shown to degenerate to the fully exchangeable case when there are ties across samples at the observed or latent level. To overcome this drawback, which is inherent to nesting general discrete random measures, we introduce a novel class of latent nested processes. These are obtained by adding common and group-specific completely random measures and then normalising to yield dependent random probability measures. We provide results on the partition distributions induced by latent nested processes and develop a Markov chain Monte Carlo sampler for Bayesian inference. A test for distributional homogeneity across groups is obtained as a by-product. The results and their inferential implications are showcased on synthetic and real data.
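The "add common and group-specific random measures, then normalise" construction can be caricatured with a finite truncation. The sketch below normalizes sums of shared and group-specific gamma weights to produce dependent random probability vectors with shared atoms; it is a toy approximation, not the latent nested process itself:

```python
import numpy as np

def latent_nested_draws(n_groups=2, n_atoms=200, c_common=1.0,
                        c_spec=1.0, seed=1):
    """Finite-dimensional caricature: per group, normalise the sum of a
    shared gamma random measure and a group-specific one, yielding
    dependent random probability vectors over common atoms."""
    rng = np.random.default_rng(seed)
    shared = rng.gamma(c_common / n_atoms, 1.0, n_atoms)  # common measure
    probs = []
    for _ in range(n_groups):
        spec = rng.gamma(c_spec / n_atoms, 1.0, n_atoms)  # group-specific
        w = shared + spec
        probs.append(w / w.sum())                          # normalise
    return np.array(probs)
```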