Conditional independence (CI) tests underlie many approaches to model testing and structure learning in causal inference. Most existing CI tests for categorical and ordinal data stratify the sample by the conditioning variables, perform simple independence tests in each stratum, and combine the results. Unfortunately, the statistical power of this approach degrades rapidly as the number of conditioning variables increases. Here we propose a simple unified CI test for ordinal and categorical data that maintains reasonable calibration and power in high dimensions. We show that our test outperforms existing baselines in model testing and structure learning for dense directed graphical models while being comparable for sparse models. Our approach could be attractive for causal model testing because it is easy to implement, can be used with non-parametric or parametric probability models, has the symmetry property, and has reasonable computational requirements.
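To make the stratification baseline concrete, here is a minimal Python sketch, assuming categorical data in a pandas DataFrame: a chi-square independence test in each stratum of the conditioning variables, combined by summing the per-stratum statistics. Names and data layout are illustrative, and this is the classical baseline whose power degrades with many conditioning variables, not the proposed unified test.

```python
# Illustrative stratify-test-combine baseline (not the proposed unified test).
import numpy as np
import pandas as pd
from scipy import stats

def stratified_ci_test(df, x, y, cond):
    """Test X independent of Y given the categorical variables in `cond`."""
    chi2_stats, dofs = [], []
    for _, stratum in df.groupby(list(cond)):
        table = pd.crosstab(stratum[x], stratum[y])
        if table.shape[0] < 2 or table.shape[1] < 2:
            continue  # degenerate stratum carries no evidence either way
        chi2, _, dof, _ = stats.chi2_contingency(table)
        chi2_stats.append(chi2)
        dofs.append(dof)
    # Under the null, the summed statistic is chi-square with summed dof
    return stats.chi2.sf(np.sum(chi2_stats), df=max(np.sum(dofs), 1))
```

As the number of conditioning variables grows, strata become small or degenerate and are skipped, which is exactly the loss of power the abstract describes.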
Dimensionality reduction techniques aim to represent high-dimensional data in low-dimensional spaces, either to extract hidden and useful information or to facilitate visual understanding and interpretation of the data. However, few of them take into account the cluster information contained implicitly in the high-dimensional data. In this paper, we propose LaptSNE, a new graph-layout nonlinear dimensionality reduction method based on t-SNE, one of the best techniques for visualizing high-dimensional data as 2D scatter plots. Specifically, LaptSNE leverages the eigenvalue information of the graph Laplacian to shrink the potential clusters in the low-dimensional embedding while learning to preserve the local and global structure of the high-dimensional space. Solving the proposed model is nontrivial because the eigenvalues of the normalized symmetric Laplacian are functions of the decision variable. We provide a majorization-minimization algorithm with a convergence guarantee to solve the LaptSNE optimization problem and show how to calculate the gradient analytically, which may be of broad interest for optimization problems with Laplacian-composed objectives. We evaluate our method through a formal comparison with state-of-the-art methods, both visually and via established quantitative measurements. The results demonstrate the superiority of our method over baselines such as t-SNE and UMAP. We also extend our method to spectral clustering, obtaining an accurate and parameter-free clustering algorithm that offers high reliability and convenience in real applications.
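As a rough illustration of the Laplacian-eigenvalue term added to the t-SNE objective, a penalty of the following form could be computed on a candidate embedding; the Gaussian kernel and bandwidth are assumptions, and the full LaptSNE objective and its majorization-minimization solver are not shown.

```python
# Sketch: normalized-Laplacian eigenvalue penalty on a low-dimensional embedding Y.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def laplacian_eigen_penalty(Y, k, sigma=1.0):
    # Similarity graph on the embedding (assumed Gaussian kernel)
    W = np.exp(-squareform(pdist(Y, "sqeuclidean")) / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Normalized symmetric Laplacian: I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(Y)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    # Sum of the k smallest eigenvalues is small when the embedding forms k tight clusters
    return np.linalg.eigvalsh(L)[:k].sum()
```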
We provide improved differentially private algorithms for identity testing of high-dimensional distributions. Specifically, for $d$-dimensional Gaussian distributions with known covariance $\Sigma$, we can test whether the distribution comes from $\mathcal{N}(\mu^*, \Sigma)$ for some fixed $\mu^*$ or from some $\mathcal{N}(\mu, \Sigma)$ with total variation distance at least $\alpha$ from $\mathcal{N}(\mu^*, \Sigma)$ with $(\varepsilon, 0)$-differential privacy, using only \[\tilde{O}\left(\frac{d^{1/2}}{\alpha^2} + \frac{d^{1/3}}{\alpha^{4/3} \cdot \varepsilon^{2/3}} + \frac{1}{\alpha \cdot \varepsilon}\right)\] samples if the algorithm is allowed to be computationally inefficient, and only \[\tilde{O}\left(\frac{d^{1/2}}{\alpha^2} + \frac{d^{1/4}}{\alpha \cdot \varepsilon}\right)\] samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including mean testing of Gaussians with bounded but unknown covariance, uniformity testing of product distributions over $\{-1, 1\}^d$, and tolerant testing. Our results improve over the previous best work of Canonne et al.~\cite{CanonneKMUZ20} for both computationally efficient and inefficient algorithms, and even our computationally efficient algorithm matches the optimal \emph{non-private} sample complexity of $O\left(\frac{\sqrt{d}}{\alpha^2}\right)$ in many standard parameter settings. In addition, our results show that, surprisingly, private identity testing of $d$-dimensional Gaussians can be done with fewer samples than private identity testing of discrete distributions over a domain of size $d$ \cite{AcharyaSZ18}, which refutes a conjectured lower bound of~\cite{CanonneKMUZ20}.
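One way to read the claim about matching the non-private rate, using only the bounds stated above: the privacy-dependent term of the computationally efficient algorithm is dominated by the statistical term whenever
\[
\frac{d^{1/4}}{\alpha \varepsilon} \le \frac{d^{1/2}}{\alpha^{2}}
\quad\Longleftrightarrow\quad
\varepsilon \ge \frac{\alpha}{d^{1/4}},
\]
so for any fixed privacy budget $\varepsilon$ and moderately large $d$, the overall sample complexity is $\tilde{O}\left(\frac{\sqrt{d}}{\alpha^2}\right)$, matching the non-private bound up to logarithmic factors.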
The Receiver Operating Characteristic (ROC) curve is a useful tool that measures the discriminating power of a continuous variable, or the accuracy of a pharmaceutical or medical test, to distinguish between two conditions or classes. In certain situations, the practitioner may be able to measure covariates related to the diagnostic variable that can increase the discriminating power of the ROC curve. To protect against the existence of atypical data among the observations, a procedure to obtain robust estimators of the ROC curve in the presence of covariates is introduced. The proposal follows a semiparametric approach that fits a location-scale regression model to the diagnostic variable and considers empirical estimators of the distributions of the regression residuals. Robust parametric estimators are combined with adaptive weighted empirical distribution estimators to down-weight the influence of outliers. The uniform consistency of the proposal is derived under mild assumptions. A Monte Carlo study is carried out to compare the performance of the proposed robust estimators with the classical ones, both in clean and contaminated samples. A real data set is also analysed.
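For reference, writing the location-scale model as $Y = \mu_D(x) + \sigma_D(x)\,\varepsilon_D$ for the diseased population and $Y = \mu_{\bar D}(x) + \sigma_{\bar D}(x)\,\varepsilon_{\bar D}$ for the healthy one (notation introduced here for illustration), the covariate-specific ROC curve takes the standard induced form
\[
\mathrm{ROC}_x(p) \;=\; 1 - F_D\!\left(\frac{\mu_{\bar D}(x) - \mu_D(x) + \sigma_{\bar D}(x)\, F_{\bar D}^{-1}(1-p)}{\sigma_D(x)}\right),
\qquad p \in (0,1),
\]
where $F_D$ and $F_{\bar D}$ are the distributions of the standardized errors; the robust regression fits and the weighted empirical estimators of $F_D$ and $F_{\bar D}$ would then be plugged into this expression.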
Causal effect estimation from observational data is a challenging problem, especially with high-dimensional data and in the presence of unobserved variables. The available data-driven methods for tackling the problem either provide only an estimate of the bounds of a causal effect (i.e. a nonunique estimate) or have low efficiency. The major hurdle to achieving high efficiency while obtaining a unique and unbiased causal effect estimate is finding a proper adjustment set for confounding control quickly, given the huge covariate space and the presence of unobserved variables. In this paper, we approach the problem as a local search task for finding valid adjustment sets in data. We establish theorems that support the local search for adjustment sets, and we show that unique and unbiased estimation can be achieved from observational data even when unobserved variables exist. We then propose a data-driven algorithm that is fast and consistent under mild assumptions. We also make use of a frequent pattern mining method to further speed up the search for minimal adjustment sets for causal effect estimation. Experiments conducted on extensive synthetic and real-world datasets demonstrate that the proposed algorithm outperforms state-of-the-art criteria and estimators in both accuracy and time efficiency.
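For reference, a valid adjustment set $\mathbf{Z}$ for the effect of $X$ on $Y$ licenses the standard adjustment formula
\[
P(y \mid do(x)) \;=\; \sum_{\mathbf{z}} P(y \mid x, \mathbf{z})\, P(\mathbf{z}),
\]
so once the local search returns a valid $\mathbf{Z}$, the causal effect can be estimated uniquely from the observational data by stratifying on $\mathbf{Z}$.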
We develop methods for reducing the dimensionality of large data sets, which are common in biomedical applications. Genetic data about patients often contain more features than observations, which makes direct supervised learning difficult. One method of reducing the feature space is to use latent Dirichlet allocation to group genetic variants in an unsupervised manner. Latent Dirichlet allocation describes a patient as a mixture of topics corresponding to genetic variants. This can be generalized as a Bayesian tensor decomposition to account for multiple feature variables. Our most significant contributions concern hierarchical topic modeling. We design distinct methods of incorporating hierarchical topic modeling, based on nested Chinese restaurant processes and the Pachinko Allocation Machine, into Bayesian tensor decomposition. We apply these models to examine patients with one of four common types of cancer (breast, lung, prostate, and colorectal) and siblings with and without autism spectrum disorder. We link the genes to their biological pathways and combine this information into a tensor of patients, counts of their genetic variants, and the genes' membership in pathways. We find that our trained models outperform baseline models with respect to coherence by up to 40%.
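As a minimal sketch of the flat (non-hierarchical) starting point, one could treat each patient as a document over genetic-variant counts and fit latent Dirichlet allocation; the synthetic count matrix and topic number below are assumptions, and the hierarchical and tensor extensions are not shown.

```python
# Flat LDA on a patient-by-variant count matrix (synthetic stand-in data).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(200, 1000))       # patients x genetic-variant counts

lda = LatentDirichletAllocation(n_components=25, random_state=0)
theta = lda.fit_transform(X)                 # per-patient topic mixture
topics = lda.components_                     # topic-by-variant weights
```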
Boosting is one of the most significant developments in machine learning. This paper studies the rate of convergence of $L_2$Boosting, which is tailored for regression, in a high-dimensional setting. Moreover, we introduce so-called \textquotedblleft post-Boosting\textquotedblright. This is a post-selection estimator that applies ordinary least squares to the variables selected in the first stage by $L_2$Boosting. Another variant is \textquotedblleft Orthogonal Boosting\textquotedblright, where an orthogonal projection is conducted after each step. We show that both post-$L_2$Boosting and Orthogonal Boosting achieve the same rate of convergence as LASSO in a sparse, high-dimensional setting. We show that the rate of convergence of classical $L_2$Boosting depends on the design matrix, as described by a sparse eigenvalue constant. To obtain the latter results, we derive new approximation results for the pure greedy algorithm, based on analyzing the revisiting behavior of $L_2$Boosting. We also introduce feasible rules for early stopping, which can be easily implemented and used in applied work. Our results also allow a direct comparison between LASSO and boosting, which has been missing from the literature. Finally, we present simulation studies and applications to illustrate the relevance of our theoretical results and to provide insights into the practical aspects of boosting. In these simulation studies, post-$L_2$Boosting clearly outperforms LASSO.
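A compact sketch of componentwise $L_2$Boosting followed by the post-Boosting refit might look as follows; the step size, number of steps, and the assumption of centered or standardized columns of X are simplifications for illustration.

```python
# Componentwise L2Boosting plus post-Boosting (OLS refit on the selected set).
import numpy as np

def l2_boost(X, y, n_steps=100, nu=0.1):
    n, p = X.shape
    resid, coef, selected = y - y.mean(), np.zeros(p), set()
    col_norms = (X**2).sum(axis=0)
    for _ in range(n_steps):
        scores = (X.T @ resid) ** 2 / col_norms     # RSS reduction per covariate
        j = int(np.argmax(scores))
        beta_j = (X[:, j] @ resid) / col_norms[j]   # componentwise least squares fit
        coef[j] += nu * beta_j
        resid -= nu * beta_j * X[:, j]
        selected.add(j)
    return coef, sorted(selected)

def post_l2_boost(X, y, selected):
    # Post-Boosting: ordinary least squares refit on the selected variables only
    Xs = np.column_stack([np.ones(len(y)), X[:, selected]])
    return np.linalg.lstsq(Xs, y, rcond=None)[0]
```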
Inference of the marginal probability distribution, defined as the calculation of the probability distribution of a subset of the variables, is relevant for handling missing data and hidden variables. While inference of the marginal probability distribution is crucial for various problems in machine learning and statistics, its exact computation is generally infeasible for categorical variables in Bayesian networks due to the NP-hardness of this task. We develop a divide-and-conquer approach that uses the graphical properties of Bayesian networks to split the computation of the marginal probability distribution into sub-calculations of lower dimensionality, thus reducing the overall computational complexity. Exploiting this property, we present an efficient and scalable algorithm for calculating the marginal probability distribution of categorical variables. The novel method is compared against state-of-the-art approximate inference methods in a benchmarking study, where it displays superior performance. As an immediate application, we demonstrate how our method can be used to classify incomplete data against Bayesian networks, and we use this approach to identify the cancer subtype of kidney cancer patient samples.
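For contrast, here is the naive exact marginalization on a toy three-variable categorical network $A \to B \to C$ with made-up conditional probability tables; this enumeration is exponential in the number of summed-out variables, which is the cost the divide-and-conquer splitting aims to reduce.

```python
# Brute-force marginal P(C) on a toy chain A -> B -> C (illustrative CPTs).
from itertools import product

p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}   # key: (b, a)
p_c_given_b = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}   # key: (c, b)

def marginal_c(c):
    # P(C=c) = sum_{a,b} P(a) P(b|a) P(c|b)
    return sum(p_a[a] * p_b_given_a[(b, a)] * p_c_given_b[(c, b)]
               for a, b in product((0, 1), repeat=2))

print(marginal_c(0), marginal_c(1))   # the two values sum to 1
```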
In the domain generalization literature, a common objective is to learn representations independent of the domain after conditioning on the class label. We show that this objective is not sufficient: there exist counter-examples where a model fails to generalize to unseen domains even after satisfying class-conditional domain invariance. We formalize this observation through a structural causal model and show the importance of modeling within-class variations for generalization. Specifically, classes contain objects that characterize specific causal features, and domains can be interpreted as interventions on these objects that change non-causal features. We highlight an alternative condition: inputs across domains should have the same representation if they are derived from the same object. Based on this objective, we propose matching-based algorithms when base objects are observed (e.g., through data augmentation) and approximate the objective when objects are not observed (MatchDG). Our simple matching-based algorithms are competitive with prior work on out-of-domain accuracy for rotated MNIST, Fashion-MNIST, PACS, and Chest-Xray datasets. Our method MatchDG also recovers ground-truth object matches: on MNIST and Fashion-MNIST, top-10 matches from MatchDG have over 50% overlap with ground-truth matches.
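The object-matching condition can be written as a simple penalty that pulls together the representations of inputs sharing an underlying object across domains; the encoder, match pairs, and weighting below are placeholders, not the full MatchDG training procedure.

```python
# Sketch of a matching penalty over pairs of inputs believed to share an object.
import numpy as np

def match_penalty(encode, x_pairs):
    """x_pairs: list of (x_domain_i, x_domain_j) known or inferred to share an object."""
    dists = [np.sum((encode(xa) - encode(xb)) ** 2) for xa, xb in x_pairs]
    return np.mean(dists)

# total_loss = classification_loss + lam * match_penalty(encode, x_pairs)
```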
Clustering is one of the most fundamental and widespread techniques in exploratory data analysis. Yet the basic approach to clustering has not really changed: a practitioner hand-picks a task-specific clustering loss to optimize and fits the given data to it to reveal the underlying cluster structure. Some types of losses---such as k-means, its non-linear version kernelized k-means (centroid based), and DBSCAN (density based)---are popular choices due to their good empirical performance on a range of applications. Every so often, however, the clustering output using these standard losses fails to reveal the underlying structure, and the practitioner has to custom-design their own variation. In this work we take an intrinsically different approach to clustering: rather than fitting a dataset to a specific clustering loss, we train a recurrent model that learns how to cluster. The model is trained on pairs of example datasets (as input) and their corresponding cluster identities (as output). By providing multiple types of training datasets as inputs, our model gains the ability to generalize well to unseen datasets (new clustering tasks). Our experiments reveal that by training on simple synthetically generated datasets or on existing real datasets, we can achieve better clustering performance on unseen real-world datasets than standard benchmark clustering techniques. Our meta-clustering model works well even for small datasets, where the usual deep learning models tend to perform poorly.
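The training pairs the model consumes can be illustrated with synthetically generated datasets and their cluster identities; the sizes, cluster counts, and blob generator below are assumptions, and the recurrent model itself is not shown.

```python
# Generating (dataset, cluster identities) training pairs from Gaussian blobs.
import numpy as np
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
training_pairs = []
for _ in range(500):
    k = int(rng.integers(2, 6))                  # number of clusters for this task
    n = int(rng.integers(50, 300))               # dataset size
    X, y = make_blobs(n_samples=n, centers=k, n_features=2,
                      random_state=int(rng.integers(1 << 16)))
    training_pairs.append((X, y))                # input: points; target: cluster identities
```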
Medical image segmentation requires consensus ground truth segmentations to be derived from multiple expert annotations. A novel approach is proposed that obtains consensus segmentations from experts using graph cuts (GC) and semi-supervised learning (SSL). Popular approaches use iterative Expectation Maximization (EM) to estimate the final annotation and quantify each annotator's performance; such techniques risk getting trapped in local minima. We propose a self-consistency (SC) score to quantify annotator consistency using low-level image features. SSL is used to predict missing annotations by considering global features and local image consistency. The SC score also serves as the penalty cost in a second-order Markov random field (MRF) cost function that is optimized using graph cuts to derive the final consensus label. Graph cuts obtain a global optimum without an iterative procedure. Experimental results on synthetic images, real data of Crohn's disease patients, and retinal images show that our final segmentation is accurate and more consistent than those of competing methods.
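In generic notation (introduced here for illustration, not taken from the paper), the second-order MRF cost function has the form
\[
E(L) \;=\; \sum_{p} D_p(L_p) \;+\; \lambda \sum_{(p,q) \in \mathcal{N}} V_{pq}(L_p, L_q),
\]
where the data term $D_p$ aggregates the expert labels at pixel $p$ weighted by their SC scores and the pairwise term $V_{pq}$ penalizes label disagreement between neighboring pixels; for binary labels with a submodular pairwise term, a single graph cut recovers the exact global optimum.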