We develop a new Bayesian modelling framework for the class of higher-order, variable-memory Markov chains, and introduce an associated collection of methodological tools for exact inference with discrete time series. We show that a version of the context tree weighting algorithm can compute the prior predictive likelihood exactly (averaged over both models and parameters), and two related algorithms are introduced, which identify the a posteriori most likely models and compute their exact posterior probabilities. All three algorithms are deterministic and have linear-time complexity. A family of variable-dimension Markov chain Monte Carlo samplers is also provided, facilitating further exploration of the posterior. The performance of the proposed methods in model selection, Markov order estimation and prediction is illustrated through simulation experiments and real-world applications with data from finance, genetics, neuroscience, and animal communication. The associated algorithms are implemented in the R package BCT.
Nonlinear metrics, such as the F1-score, Matthews correlation coefficient, and Fowlkes-Mallows index, are often used to evaluate the performance of machine learning models, in particular, when facing imbalanced datasets that contain more samples of one class than the other. Recent optimal decision tree algorithms have shown remarkable progress in producing trees that are optimal with respect to linear criteria, such as accuracy, but unfortunately nonlinear metrics remain a challenge. To address this gap, we propose a novel algorithm based on bi-objective optimisation, which treats misclassifications of each binary class as a separate objective. We show that, for a large class of metrics, the optimal tree lies on the Pareto frontier. Consequently, we obtain the optimal tree by using our method to generate the set of all nondominated trees. To the best of our knowledge, this is the first method to compute provably optimal decision trees for nonlinear metrics. Our approach leads to a trade-off when compared to optimising linear metrics: the resulting trees may be more desirable according to the given nonlinear metric at the expense of higher runtimes. Nevertheless, the experiments illustrate that runtimes are reasonable for majority of the tested datasets.
Implicit Processes (IPs) are flexible priors that can describe models such as Bayesian neural networks, neural samplers and data generators. IPs allow for approximate inference in function-space. This avoids some degenerate problems of parameter-space approximate inference due to the high number of parameters and strong dependencies. For this, an extra IP is often used to approximate the posterior of the prior IP. However, simultaneously adjusting the parameters of the prior IP and the approximate posterior IP is a challenging task. Existing methods that can tune the prior IP result in a Gaussian predictive distribution, which fails to capture important data patterns. By contrast, methods producing flexible predictive distributions by using another IP to approximate the posterior process cannot fit the prior IP to the observed data. We propose here a method that can carry out both tasks. For this, we rely on an inducing-point representation of the prior IP, as often done in the context of sparse Gaussian processes. The result is a scalable method for approximate inference with IPs that can tune the prior IP parameters to the data, and that provides accurate non-Gaussian predictive distributions.
Numerous challenges in science and engineering can be framed as optimization tasks, including the maximization of reaction yields, the optimization of molecular and materials properties, and the fine-tuning of automated hardware protocols. Design of experiment and optimization algorithms are often adopted to solve these tasks efficiently. Increasingly, these experiment planning strategies are coupled with automated hardware to enable autonomous experimental platforms. The vast majority of the strategies used, however, do not consider robustness against the variability of experiment and process conditions. In fact, it is generally assumed that these parameters are exact and reproducible. Yet some experiments may have considerable noise associated with some of their conditions, and process parameters optimized under precise control may be applied in the future under variable operating conditions. In either scenario, the optimal solutions found might not be robust against input variability, affecting the reproducibility of results and returning suboptimal performance in practice. Here, we introduce Golem, an algorithm that is agnostic to the choice of experiment planning strategy and that enables robust experiment and process optimization. Golem identifies optimal solutions that are robust to input uncertainty, thus ensuring the reproducible performance of optimized experimental protocols and processes. It can be used to analyze the robustness of past experiments, or to guide experiment planning algorithms toward robust solutions on the fly. We assess the performance and domain of applicability of Golem through extensive benchmark studies and demonstrate its practical relevance by optimizing an analytical chemistry protocol under the presence of significant noise in its experimental conditions.
We present a meta-algorithm for learning a posterior-inference algorithm for restricted probabilistic programs. Our meta-algorithm takes a training set of probabilistic programs that describe models with observations, and attempts to learn an efficient method for inferring the posterior of a similar program. A key feature of our approach is the use of what we call a white-box inference algorithm that extracts information directly from model descriptions themselves, given as programs. Concretely, our white-box inference algorithm is equipped with multiple neural networks, one for each type of atomic command, and computes an approximate posterior of a given probabilistic program by analysing individual atomic commands in the program using these networks. The parameters of the networks are learnt from a training set by our meta-algorithm. We empirically demonstrate that the learnt inference algorithm generalises well to programs that are new in terms of both parameters and model structures, and report cases where our approach achieves greater test-time efficiency than alternative approaches such as HMC. The overall results show the promise as well as remaining challenges of our approach.
Clustering has become a core technology in machine learning, largely due to its application in the field of unsupervised learning, clustering, classification, and density estimation. A frequentist approach exists to hand clustering based on mixture model which is known as the EM algorithm where the parameters of the mixture model are usually estimated into a maximum likelihood estimation framework. Bayesian approach for finite and infinite Gaussian mixture model generates point estimates for all variables as well as associated uncertainty in the form of the whole estimates' posterior distribution. The sole aim of this survey is to give a self-contained introduction to concepts and mathematical tools in Bayesian inference for finite and infinite Gaussian mixture model in order to seamlessly introduce their applications in subsequent sections. However, we clearly realize our inability to cover all the useful and interesting results concerning this field and given the paucity of scope to present this discussion, e.g., the separated analysis of the generation of Dirichlet samples by stick-breaking and Polya's Urn approaches. We refer the reader to literature in the field of the Dirichlet process mixture model for a much detailed introduction to the related fields. Some excellent examples include (Frigyik et al., 2010; Murphy, 2012; Gelman et al., 2014; Hoff, 2009). This survey is primarily a summary of purpose, significance of important background and techniques for Gaussian mixture model, e.g., Dirichlet prior, Chinese restaurant process, and most importantly the origin and complexity of the methods which shed light on their modern applications. The mathematical prerequisite is a first course in probability. Other than this modest background, the development is self-contained, with rigorous proofs provided throughout.
We propose a general and scalable approximate sampling strategy for probabilistic models with discrete variables. Our approach uses gradients of the likelihood function with respect to its discrete inputs to propose updates in a Metropolis-Hastings sampler. We show empirically that this approach outperforms generic samplers in a number of difficult settings including Ising models, Potts models, restricted Boltzmann machines, and factorial hidden Markov models. We also demonstrate the use of our improved sampler for training deep energy-based models on high dimensional discrete data. This approach outperforms variational auto-encoders and existing energy-based models. Finally, we give bounds showing that our approach is near-optimal in the class of samplers which propose local updates.
The Bayesian paradigm has the potential to solve core issues of deep neural networks such as poor calibration and data inefficiency. Alas, scaling Bayesian inference to large weight spaces often requires restrictive approximations. In this work, we show that it suffices to perform inference over a small subset of model weights in order to obtain accurate predictive posteriors. The other weights are kept as point estimates. This subnetwork inference framework enables us to use expressive, otherwise intractable, posterior approximations over such subsets. In particular, we implement subnetwork linearized Laplace: We first obtain a MAP estimate of all weights and then infer a full-covariance Gaussian posterior over a subnetwork. We propose a subnetwork selection strategy that aims to maximally preserve the model's predictive uncertainty. Empirically, our approach is effective compared to ensembles and less expressive posterior approximations over full networks.
This work focuses on combining nonparametric topic models with Auto-Encoding Variational Bayes (AEVB). Specifically, we first propose iTM-VAE, where the topics are treated as trainable parameters and the document-specific topic proportions are obtained by a stick-breaking construction. The inference of iTM-VAE is modeled by neural networks such that it can be computed in a simple feed-forward manner. We also describe how to introduce a hyper-prior into iTM-VAE so as to model the uncertainty of the prior parameter. Actually, the hyper-prior technique is quite general and we show that it can be applied to other AEVB based models to alleviate the {\it collapse-to-prior} problem elegantly. Moreover, we also propose HiTM-VAE, where the document-specific topic distributions are generated in a hierarchical manner. HiTM-VAE is even more flexible and can generate topic distributions with better variability. Experimental results on 20News and Reuters RCV1-V2 datasets show that the proposed models outperform the state-of-the-art baselines significantly. The advantages of the hyper-prior technique and the hierarchical model construction are also confirmed by experiments.
A fundamental computation for statistical inference and accurate decision-making is to compute the marginal probabilities or most probable states of task-relevant variables. Probabilistic graphical models can efficiently represent the structure of such complex data, but performing these inferences is generally difficult. Message-passing algorithms, such as belief propagation, are a natural way to disseminate evidence amongst correlated variables while exploiting the graph structure, but these algorithms can struggle when the conditional dependency graphs contain loops. Here we use Graph Neural Networks (GNNs) to learn a message-passing algorithm that solves these inference tasks. We first show that the architecture of GNNs is well-matched to inference tasks. We then demonstrate the efficacy of this inference approach by training GNNs on a collection of graphical models and showing that they substantially outperform belief propagation on loopy graphs. Our message-passing algorithms generalize out of the training set to larger graphs and graphs with different structure.
Topic models have been widely explored as probabilistic generative models of documents. Traditional inference methods have sought closed-form derivations for updating the models, however as the expressiveness of these models grows, so does the difficulty of performing fast and accurate inference over their parameters. This paper presents alternative neural approaches to topic modelling by providing parameterisable distributions over topics which permit training by backpropagation in the framework of neural variational inference. In addition, with the help of a stick-breaking construction, we propose a recurrent network that is able to discover a notionally unbounded number of topics, analogous to Bayesian non-parametric topic models. Experimental results on the MXM Song Lyrics, 20NewsGroups and Reuters News datasets demonstrate the effectiveness and efficiency of these neural topic models.