In problems with large amounts of missing data one must model two distinct data-generating processes: the outcome process, which generates the response, and the missing data mechanism, which determines which data we observe. Under the ignorability condition of Rubin (1976), however, likelihood-based inference for the outcome process does not depend on the missing data mechanism, so that only the former needs to be estimated; partly because of this simplification, ignorability is often used as a baseline assumption. We study the implications of Bayesian ignorability in the presence of high-dimensional nuisance parameters and argue that ignorability is typically incompatible with sensible prior beliefs about the amount of selection bias. We show that, for many problems, ignorability directly implies that the prior on the selection bias is tightly concentrated around zero. This is demonstrated on several models of practical interest, and the effect of ignorability on the posterior distribution is characterized for high-dimensional linear models with a ridge regression prior. We then show how to build high-dimensional models that encode sensible beliefs about the selection bias, and we identify narrow circumstances under which ignorability is less problematic.
Neyman-Scott processes (NSPs) are point process models that generate clusters of points in time or space. They are natural models for a wide range of phenomena, ranging from neural spike trains to document streams. The clustering property is achieved via a doubly stochastic formulation: first, a set of latent events is drawn from a Poisson process; then, each latent event generates a set of observed data points according to another Poisson process. This construction is similar to Bayesian nonparametric mixture models like the Dirichlet process mixture model (DPMM) in that the number of latent events (i.e., clusters) is a random variable, but the point process formulation makes the NSP especially well suited to modeling spatiotemporal data. While many specialized algorithms have been developed for DPMMs, comparatively few works have focused on inference in NSPs. Here, we present novel connections between NSPs and DPMMs, with the key link being a third class of Bayesian mixture models called mixture of finite mixture models (MFMMs). Leveraging this connection, we adapt the standard collapsed Gibbs sampling algorithm for DPMMs to enable scalable Bayesian inference on NSP models. We demonstrate the potential of Neyman-Scott processes on a variety of applications including sequence detection in neural spike trains and event detection in document streams.
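To make the doubly stochastic construction concrete, the following is a minimal simulation sketch of a temporal NSP: latent events are drawn from a homogeneous Poisson process on an interval, and each latent event emits a Poisson number of observed points jittered by a Gaussian kernel. The rate, offspring, and kernel parameters (`lam`, `alpha`, `sigma`) are illustrative choices, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_nsp_1d(T=100.0, lam=0.05, alpha=20.0, sigma=1.0):
    """Simulate a temporal Neyman-Scott process on [0, T] via the doubly
    stochastic construction: latent events come from a homogeneous Poisson
    process with rate lam; each latent event emits Poisson(alpha) observed
    points, jittered around it by a Gaussian kernel of width sigma."""
    n_latent = rng.poisson(lam * T)               # number of latent events
    latents = rng.uniform(0.0, T, size=n_latent)  # latent event times
    points, labels = [], []
    for k, c in enumerate(latents):
        n_obs = rng.poisson(alpha)                # offspring count for cluster k
        points.append(rng.normal(c, sigma, size=n_obs))
        labels.append(np.full(n_obs, k))
    points = np.concatenate(points) if points else np.empty(0)
    labels = np.concatenate(labels) if labels else np.empty(0, dtype=int)
    keep = (points >= 0.0) & (points <= T)        # restrict to the observation window
    return latents, points[keep], labels[keep]

latents, points, labels = simulate_nsp_1d()
print(f"{len(latents)} latent events generated {len(points)} observed points")
```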
In this paper, we investigate the question: Given a small number of datapoints, for example N = 30, how tight can PAC-Bayes and test set bounds be made? For such small datasets, test set bounds adversely affect generalisation performance by withholding data from the training procedure. In this setting, PAC-Bayes bounds are especially attractive, due to their ability to use all the data to simultaneously learn a posterior and bound its generalisation risk. We focus on the case of i.i.d. data with a bounded loss and consider the generic PAC-Bayes theorem of Germain et al. While their theorem is known to recover many existing PAC-Bayes bounds, it is unclear what the tightest bound derivable from their framework is. For a fixed learning algorithm and dataset, we show that the tightest possible bound coincides with a bound considered by Catoni; and, in the more natural case of distributions over datasets, we establish a lower bound on the best bound achievable in expectation. Interestingly, this lower bound recovers the Chernoff test set bound if the posterior is equal to the prior. Moreover, to illustrate how tight these bounds can be, we study synthetic one-dimensional classification tasks in which it is feasible to meta-learn both the prior and the form of the bound to numerically optimise for the tightest bounds possible. We find that in this simple, controlled scenario, PAC-Bayes bounds are competitive with comparable, commonly used Chernoff test set bounds. However, the sharpest test set bounds still lead to better guarantees on the generalisation error than the PAC-Bayes bounds we consider.
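For reference, here is a minimal sketch of a Chernoff/kl-style test set bound of the kind mentioned above: the empirical risk on a held-out set is inverted through the Bernoulli KL divergence by binary search. The exact form and constants of the bounds studied in the paper may differ; `delta`, `errors`, and `n_test` are illustrative.

```python
import numpy as np

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = np.clip(q, eps, 1 - eps), np.clip(p, eps, 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def kl_inverse_upper(emp_risk, rhs, tol=1e-9):
    """Largest p with kl(emp_risk || p) <= rhs, found by binary search."""
    lo, hi = emp_risk, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(emp_risk, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

def test_set_bound(errors, n_test, delta=0.05):
    """With probability >= 1 - delta over the test sample, the true risk is
    at most kl^{-1}(empirical risk, log(1/delta) / n_test)."""
    emp_risk = errors / n_test
    return kl_inverse_upper(emp_risk, np.log(1.0 / delta) / n_test)

print(test_set_bound(errors=3, n_test=30))  # e.g. 3 mistakes on 30 held-out points
```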
The widespread availability of high-dimensional biological data has made the simultaneous screening of many biological characteristics a central problem in computational biology and allied sciences. While the dimensionality of such datasets continues to grow, so too does the complexity of identifying biomarkers from exposure patterns in health studies that measure baseline confounders; moreover, doing so while avoiding model misspecification remains an issue that is only partially addressed. Efficient estimators capable of incorporating flexible, data-adaptive regression techniques in estimating relevant components of the data-generating distribution provide an avenue for avoiding model misspecification; however, in the context of high-dimensional problems that require the simultaneous estimation of numerous parameters, standard variance estimators have proven unstable, resulting in unreliable Type-I error control even under standard multiple testing corrections. We present a general approach for applying empirical Bayes shrinkage to variance estimators of a family of efficient, asymptotically linear estimators of population intervention causal effects. Our generalization of shrinkage-based variance estimators increases inferential stability in high-dimensional settings, facilitating the application of these estimators for deriving nonparametric variable importance measures in high-dimensional biological datasets with modest sample sizes. The result is a data-adaptive approach for robustly uncovering stable causal associations in high-dimensional data in studies with limited samples. Our generalized variance estimator is evaluated against alternative variance estimators in numerical experiments. Identification of biomarkers with the proposed methodology is demonstrated in an analysis of high-dimensional DNA methylation data from an observational study on the epigenetic effects of tobacco smoking.
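As a rough illustration of the kind of variance moderation involved, the sketch below applies limma-style empirical Bayes shrinkage, with a scaled inverse-chi-square prior, to a collection of noisy sample variances. The prior degrees of freedom `d0` and prior scale `s0_sq` are fixed crudely here rather than estimated; the paper's estimator targets the variances of asymptotically linear causal effect estimators and its exact form may differ.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate many noisy sample variances, each based on d degrees of freedom.
m, d = 5000, 10
true_var = rng.gamma(2.0, 0.5, size=m)
s2 = true_var * rng.chisquare(d, size=m) / d   # raw sample variances

# Empirical Bayes shrinkage toward a common prior scale (posterior-mean style
# moderation under a scaled inverse-chi-square prior). d0 and s0_sq are
# assumed here for illustration; in practice they would be estimated.
d0 = 4.0
s0_sq = s2.mean()
s2_shrunk = (d0 * s0_sq + d * s2) / (d0 + d)

print("MSE raw:   ", np.mean((s2 - true_var) ** 2))
print("MSE shrunk:", np.mean((s2_shrunk - true_var) ** 2))
```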
We study regression discontinuity designs in which many covariates, possibly many more than the number of observations, are available. We consider a two-step algorithm which first selects the set of covariates to be used through a localized Lasso-type procedure, and then, in a second step, estimates the treatment effect by including the selected covariates into the usual local linear estimator. We provide an in-depth analysis of the algorithm's theoretical properties, showing that, under an approximate sparsity condition, the resulting estimator is asymptotically normal, with asymptotic bias and variance that are conceptually similar to those obtained in low-dimensional settings. Bandwidth selection and inference can be carried out using standard methods. We also provide simulations and an empirical application.
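A minimal end-to-end sketch of the two-step idea on synthetic data is given below, using `LassoCV` for the selection step and a kernel-weighted least squares fit for the second step. The bandwidth, kernel, and localization details are simplified relative to the procedure analysed in the paper, and the data-generating process is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)

# Toy sharp RD data: running variable x with cutoff 0, outcome y, and
# high-dimensional covariates Z of which only a few matter.
n, p, tau = 500, 200, 1.0
x = rng.uniform(-1, 1, n)
Z = rng.normal(size=(n, p))
y = 0.5 * x + tau * (x >= 0) + Z[:, 0] - 0.5 * Z[:, 1] + rng.normal(scale=0.5, size=n)

h = 0.5                                  # bandwidth (fixed here for simplicity)
w = np.clip(1 - np.abs(x) / h, 0, None)  # triangular kernel weights
local = w > 0

# Step 1: Lasso-type covariate selection on the localized sample (a simplified
# stand-in for the paper's localized selection procedure).
lasso = LassoCV(cv=5).fit(Z[local], y[local])
selected = np.flatnonzero(lasso.coef_ != 0)

# Step 2: kernel-weighted local linear regression with separate intercept and
# slope on each side of the cutoff, plus the selected covariates; the treatment
# effect estimate is the coefficient on the treatment indicator D.
D = (x >= 0).astype(float)
X = np.column_stack([np.ones(n), D, x, D * x, Z[:, selected]])
sw = np.sqrt(w[local])
beta = np.linalg.lstsq(sw[:, None] * X[local], sw * y[local], rcond=None)[0]
print("selected covariates:", selected, "estimated effect:", beta[1])
```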
Causal mediation analysis has historically been limited in two important ways: (i) a focus has traditionally been placed on binary treatments and static interventions, and (ii) direct and indirect effect decompositions have been pursued that are only identifiable in the absence of intermediate confounders affected by treatment. We present a theoretical study of an (in)direct effect decomposition of the population intervention effect, defined by stochastic interventions jointly applied to the treatment and mediators. In contrast to existing proposals, our causal effects can be evaluated regardless of whether a treatment is categorical or continuous and remain well-defined even in the presence of intermediate confounders affected by treatment. Our (in)direct effects are identifiable without a restrictive assumption on cross-world counterfactual independencies, allowing substantive conclusions drawn from them to be validated in randomized controlled trials. Beyond the novel effects introduced, we provide a careful study of nonparametric efficiency theory relevant for the construction of flexible, multiply robust estimators of our (in)direct effects, while avoiding undue restrictions induced by assuming parametric models of nuisance parameter functionals. To complement our nonparametric estimation strategy, we introduce inferential techniques for constructing confidence intervals and hypothesis tests, and discuss open source software implementing the proposed methodology.
Bayesian mixture models are widely used for clustering of high-dimensional data with appropriate uncertainty quantification, but as the dimension increases, posterior inference tends to favor either too many or too few clusters. This article explains this behavior by studying the random partition posterior in a non-standard setting with a fixed sample size and increasing data dimensionality. We show conditions under which the finite sample posterior tends to either assign every observation to a different cluster or all observations to the same cluster as the dimension grows. Interestingly, the conditions do not depend on the choice of clustering prior, as long as all possible partitions of observations into clusters have positive prior probability, and hold irrespective of the true data-generating model. We then propose a class of latent mixtures for Bayesian clustering (Lamb), defined on a set of low-dimensional latent variables that induce a partition of the observed data. The model is amenable to scalable posterior inference, and we show that it avoids the pitfalls of high dimensionality under reasonable and mild assumptions. The proposed approach is shown to have good performance in simulation studies and in an application to inferring cell types from scRNA-seq data.
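The spirit of the approach, clustering a low-dimensional latent representation rather than the raw high-dimensional data, can be illustrated with a simple two-stage surrogate: factor analysis followed by a Dirichlet-process-style Bayesian mixture on the latent scores. This is only a sketch of the idea on synthetic data; Lamb itself is a joint Bayesian model with its own scalable posterior inference.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(2)

# Toy high-dimensional data: n = 100 observations, p = 1000 features,
# generated from 3 clusters living in a low-dimensional latent space.
n, p, d, k_true = 100, 1000, 5, 3
centers = rng.normal(scale=3.0, size=(k_true, d))
z_true = rng.integers(k_true, size=n)
eta = centers[z_true] + rng.normal(size=(n, d))      # latent variables
Lambda = rng.normal(size=(d, p))                     # factor loadings
X = eta @ Lambda + rng.normal(scale=0.5, size=(n, p))

# Stage 1: low-dimensional latent representation (factor analysis here).
eta_hat = FactorAnalysis(n_components=d, random_state=0).fit_transform(X)

# Stage 2: Bayesian mixture on the latent scores; the Dirichlet-process-style
# prior lets the effective number of clusters be inferred from the data.
bgm = BayesianGaussianMixture(n_components=10,
                              weight_concentration_prior_type="dirichlet_process",
                              random_state=0).fit(eta_hat)
labels = bgm.predict(eta_hat)
print("inferred clusters:", np.unique(labels))
```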
Bayesian neural networks attempt to combine the strong predictive performance of neural networks with formal quantification of uncertainty associated with the predictive output in the Bayesian framework. However, it remains unclear how to endow the parameters of the network with a prior distribution that is meaningful when lifted into the output space of the network. A possible solution is proposed that enables the user to posit an appropriate Gaussian process covariance function for the task at hand. Our approach constructs a prior distribution for the parameters of the network, called a ridgelet prior, that approximates the posited Gaussian process in the output space of the network. In contrast to existing work on the connection between neural networks and Gaussian processes, our analysis is non-asymptotic, with finite sample-size error bounds provided. This establishes the universality property that a Bayesian neural network can approximate any Gaussian process whose covariance function is sufficiently regular. Our experimental assessment is limited to a proof-of-concept, where we demonstrate that the ridgelet prior can outperform an unstructured prior on regression problems for which a suitable Gaussian process prior can be provided.
This work focuses on combining nonparametric topic models with Auto-Encoding Variational Bayes (AEVB). Specifically, we first propose iTM-VAE, where the topics are treated as trainable parameters and the document-specific topic proportions are obtained by a stick-breaking construction. Inference in iTM-VAE is parameterized by neural networks so that it can be computed in a simple feed-forward manner. We also describe how to introduce a hyper-prior into iTM-VAE so as to model the uncertainty of the prior parameter. In fact, the hyper-prior technique is quite general, and we show that it can be applied to other AEVB-based models to elegantly alleviate the {\it collapse-to-prior} problem. We also propose HiTM-VAE, where the document-specific topic distributions are generated in a hierarchical manner. HiTM-VAE is even more flexible and can generate topic distributions with greater variability. Experimental results on the 20News and Reuters RCV1-V2 datasets show that the proposed models outperform the state-of-the-art baselines significantly. The advantages of the hyper-prior technique and the hierarchical model construction are also confirmed by experiments.
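The stick-breaking construction used for the document-specific topic proportions can be sketched as follows: fractions in (0, 1), here Beta draws standing in for the output of an encoder network, are mapped onto the probability simplex. The truncation level and the Beta(1, 5) stand-in are illustrative assumptions, not the models' exact parameterization.

```python
import numpy as np

def stick_breaking(v):
    """Map stick-breaking fractions v_1, ..., v_{K-1} in (0, 1) to a point on
    the K-simplex: pi_k = v_k * prod_{j<k} (1 - v_j), with the final weight
    taking whatever stick remains."""
    v = np.asarray(v)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)])  # stick left before each break
    return np.concatenate([v, [1.0]]) * remaining

# Fractions that an encoder might output; here Beta(1, 5) draws mimic a
# GEM/stick-breaking prior with concentration 5.
rng = np.random.default_rng(0)
v = rng.beta(1.0, 5.0, size=19)   # K - 1 = 19 fractions -> 20 topic proportions
pi = stick_breaking(v)
print(pi.shape, pi.sum())         # 20 weights summing to 1
```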
We consider the task of learning the parameters of a {\em single} component of a mixture model, for the case when we are given {\em side information} about that component; we call this the "search problem" in mixture models. We would like to solve this with computational and sample complexity lower than that of solving the original problem, in which one learns the parameters of all components. Our main contributions are the development of a simple but general model for the notion of side information, and a corresponding simple matrix-based algorithm for solving the search problem in this general setting. We then specialize this model and algorithm to four common scenarios: Gaussian mixture models, LDA topic models, subspace clustering, and mixed linear regression. For each of these we show that if (and only if) the side information is informative, we obtain parameter estimates with greater accuracy and lower computational complexity than existing moment-based mixture model algorithms (e.g., tensor methods). We also illustrate several natural ways one can obtain such side information for specific problem instances. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms, showing significant improvements in runtime and accuracy.
Discrete random structures are important tools in Bayesian nonparametrics, and the resulting models have proven effective in density estimation, clustering, topic modeling and prediction, among others. In this paper, we consider nested processes and study the dependence structures they induce. Dependence ranges between homogeneity, corresponding to full exchangeability, and maximum heterogeneity, corresponding to (unconditional) independence across samples. The popular nested Dirichlet process is shown to degenerate to the fully exchangeable case when there are ties across samples at the observed or latent level. To overcome this drawback, which is inherent to nesting general discrete random measures, we introduce a novel class of latent nested processes. These are obtained by adding common and group-specific completely random measures and then normalising to yield dependent random probability measures. We provide results on the partition distributions induced by latent nested processes, and develop a Markov chain Monte Carlo sampler for Bayesian inference. A test for distributional homogeneity across groups is obtained as a by-product. The results and their inferential implications are showcased on synthetic and real data.
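A finite-dimensional sketch of the latent nested construction: a shared gamma completely random measure and a group-specific one are each approximated by finitely many atoms, added together, and normalised, so that the shared atoms induce dependence across the groups' random probability measures. The truncation level and total masses below are illustrative choices, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(3)

def gamma_crm(n_atoms, total_mass, rng):
    """Finite-dimensional approximation of a gamma completely random measure:
    atom locations on [0, 1] with gamma-distributed jump sizes."""
    locs = rng.uniform(0.0, 1.0, size=n_atoms)
    jumps = rng.gamma(total_mass / n_atoms, 1.0, size=n_atoms)
    return locs, jumps

n_atoms, n_groups = 200, 2
common_locs, common_jumps = gamma_crm(n_atoms, total_mass=1.0, rng=rng)

group_measures = []
for j in range(n_groups):
    locs_j, jumps_j = gamma_crm(n_atoms, total_mass=1.0, rng=rng)
    # Add the shared and group-specific CRMs, then normalise: the shared atoms
    # are what make the groups' random probability measures dependent.
    locs = np.concatenate([common_locs, locs_j])
    jumps = np.concatenate([common_jumps, jumps_j])
    group_measures.append((locs, jumps / jumps.sum()))

for j, (locs, probs) in enumerate(group_measures):
    print(f"group {j}: {len(locs)} atoms, total probability {probs.sum():.3f}")
```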