We derive strong mixing conditions for many existing discrete-valued time series models that include exogenous covariates in the dynamics. Our main contribution is to study how a mixing condition on the covariate process transfers to a mixing condition for the response. Using a coupling method, we first derive mixing conditions for some Markov chains in random environments, which gives a first result for some autoregressive categorical processes with strictly exogenous regressors. Our result is then extended to some infinite memory categorical processes. In the second part of the paper, we study autoregressive models for which the covariates are sequentially exogenous. Using a general random mapping approach on finite sets, we get explicit mixing conditions that can be checked for many categorical time series found in the literature, including multinomial autoregressive processes, ordinal time series and dynamic multiple choice models. We also study some autoregressive count time series using a somewhat different contraction argument. Our contribution fills an important gap for such models, presented here in a more general form, since such a strong mixing condition is often assumed in recent work but no general approach is available to check it.
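As a minimal illustration of the kind of model covered by this abstract, the sketch below simulates a binary autoregressive series whose transition probability depends on an exogenous covariate, i.e. a two-state Markov chain in a random environment. The logistic link, the Gaussian AR(1) covariate, and all function names and coefficient values are assumptions made for the example; they are not taken from the paper.

```python
import math
import random


def logistic(z):
    """Standard logistic function, mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_binary_ar(n, intercept, ar_coef, cov_coef, rng):
    """Simulate (X_t, Y_t) with
    P(Y_t = 1 | Y_{t-1}, X_t) = logistic(intercept + ar_coef*Y_{t-1} + cov_coef*X_t).
    """
    x, y = 0.0, 0
    xs, ys = [], []
    for _ in range(n):
        # Covariate: Gaussian AR(1) generated without feedback from Y,
        # so it is strictly exogenous (illustrative choice).
        x = 0.5 * x + rng.gauss(0.0, 1.0)
        p = logistic(intercept + ar_coef * y + cov_coef * x)
        y = 1 if rng.random() < p else 0
        xs.append(x)
        ys.append(y)
    return xs, ys


if __name__ == "__main__":
    covariate, response = simulate_binary_ar(1000, -0.5, 1.0, 0.8, random.Random(0))
    print("share of ones:", sum(response) / len(response))
```

In this toy model the covariate is itself a mixing process, and the question studied in the abstract is under which conditions the response inherits that property.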
We propose some extensions to semi-parametric models based on Bayesian additive regression trees (BART). In the semi-parametric BART paradigm, the response variable is approximated by a linear predictor and a BART model, where the linear component is responsible for estimating the main effects and BART accounts for non-specified interactions and non-linearities. Previous semi-parametric models based on BART have assumed that the sets of covariates in the linear predictor and the BART model are mutually exclusive in an attempt to avoid bias and poor coverage properties. The main novelty in our approach lies in the way we change the tree-generation moves in BART to deal with bias/confounding between the parametric and non-parametric components, even when they have covariates in common. This allows us to model complex interactions involving the covariates of primary interest, both among themselves and with those in the BART component. Through synthetic and real-world examples, we demonstrate that the performance of our novel semi-parametric BART is competitive when compared to regression models, alternative formulations of semi-parametric BART, and other tree-based methods. The implementation of the proposed method is available at //github.com/ebprado/CSP-BART.
Skills or low-level policies in reinforcement learning are temporally extended actions that can speed up learning and enable complex behaviours. Recent work in offline reinforcement learning and imitation learning has proposed several techniques for skill discovery from a set of expert trajectories. While these methods are promising, the number K of skills to discover is always a fixed hyperparameter, which requires either prior knowledge about the environment or an additional parameter search to tune it. We first propose a method for offline learning of options (a particular skill framework) exploiting advances in variational inference and continuous relaxations. We then highlight an unexplored connection between Bayesian nonparametrics and offline skill discovery, and show how to obtain a nonparametric version of our model. This version is tractable thanks to a carefully structured approximate posterior with a dynamically-changing number of options, removing the need to specify K. We also show how our nonparametric extension can be applied in other skill frameworks, and empirically demonstrate that our method can outperform state-of-the-art offline skill learning algorithms across a variety of environments. Our code is available at //github.com/layer6ai-labs/BNPO.
The real-time analysis of infectious disease surveillance data, e.g. time series of reported cases or fatalities, can help to provide situational awareness about the current state of a pandemic. This task is complicated by reporting delays that give rise to occurred-but-not-yet-reported events. Ignoring these events can lead to underestimation of the counts to be reported and, hence, to misconceptions among interpreters, the media or the general public, as was seen, for example, with reported fatalities during the COVID-19 pandemic. Nowcasting methods provide close to real-time estimates of the complete number of events from the incomplete time series of currently reported events, using information about past reporting delays. In this report, we consider nowcasting the number of COVID-19 related fatalities in Sweden. We propose a flexible Bayesian approach that considers temporal changes in the reporting delay distribution and, as an extension to existing nowcasting methods, incorporates a regression component for the (lagged) time series of the number of ICU admissions. This results in a model that considers both the past behavior of the time series of fatalities and additional data streams that are in a time-lagged association with the number of fatalities.
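The basic nowcasting idea described above can be sketched with a simple multiplicative correction: estimate the reporting delay distribution from past days whose counts are complete, then scale up the partial counts of recent days. This is a deliberately simplified, non-Bayesian stand-in for the model proposed in the report; it ignores temporal changes in the delay distribution and the ICU regression component, and all names and numbers are hypothetical.

```python
def delay_cdf(complete_rows):
    """Estimate the cumulative reporting delay distribution.

    complete_rows[t][d] is the number of events that occurred on day t and
    were reported with a delay of d days (only fully observed days).
    """
    max_delay = len(complete_rows[0])
    totals_by_delay = [sum(row[d] for row in complete_rows) for d in range(max_delay)]
    total = sum(totals_by_delay)
    cdf, running = [], 0
    for count in totals_by_delay:
        running += count
        cdf.append(running / total)
    return cdf


def nowcast(reported_so_far, days_since_event, cdf):
    """Scale a partial count by the estimated fraction reported so far."""
    d = min(days_since_event, len(cdf) - 1)
    if cdf[d] == 0.0:
        raise ValueError("no events are ever reported with this delay")
    return reported_so_far / cdf[d]


if __name__ == "__main__":
    history = [[5, 3, 2], [10, 6, 4]]   # made-up reporting triangle
    cdf = delay_cdf(history)
    print(cdf, nowcast(4, 0, cdf))
```

A Bayesian treatment, as in the report, would additionally quantify the uncertainty of the corrected count rather than return a single point estimate.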
We consider stochastic optimization problems where data is drawn from a Markov chain. Existing methods for this setting crucially rely on knowing the mixing time of the chain, which in real-world applications is usually unknown. We propose the first optimization method that does not require knowledge of the mixing time, yet obtains the optimal asymptotic convergence rate when applied to convex problems. We further show that our approach can be extended to: (i) finding stationary points in non-convex optimization with Markovian data, and (ii) obtaining better dependence on the mixing time in temporal difference (TD) learning; in both cases, our method is completely oblivious to the mixing time. Our method relies on a novel combination of multi-level Monte Carlo (MLMC) gradient estimation together with an adaptive learning method.
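To make the MLMC ingredient more concrete, the sketch below implements one common form of a multi-level Monte Carlo gradient estimator on consecutive samples of a Markov chain. The abstract does not spell out the exact construction, so the geometric level distribution, the truncation rule, the toy two-state chain, and every name here are assumptions for illustration only.

```python
import random


def make_sticky_chain(stay_prob, rng):
    """Two-state Markov chain on {0.0, 1.0}; larger stay_prob means slower mixing."""
    state = [0.0]

    def next_sample():
        if rng.random() > stay_prob:
            state[0] = 1.0 - state[0]
        return state[0]

    return next_sample


def quad_grad(x, z):
    """Gradient in x of the toy loss 0.5 * (x - z)**2."""
    return x - z


def mlmc_gradient(x, next_sample, grad, max_samples, rng):
    """MLMC estimate built from averages over 2**J consecutive chain samples."""
    # Draw the level J with P(J = j) = 2**-j for j = 1, 2, ...
    level = 1
    while rng.random() < 0.5:
        level += 1
    n = 2 ** level
    if n > max_samples:
        # Truncation: fall back to a single-sample gradient.
        return grad(x, next_sample())
    grads = [grad(x, next_sample()) for _ in range(n)]
    fine = sum(grads) / n                        # average over 2**J samples
    coarse = sum(grads[: n // 2]) / (n // 2)     # average over 2**(J-1) samples
    return grads[0] + n * (fine - coarse)


if __name__ == "__main__":
    rng = random.Random(0)
    chain = make_sticky_chain(0.9, rng)
    print(mlmc_gradient(0.0, chain, quad_grad, 64, rng))
```

By a telescoping argument, and assuming the expected level averages do not depend on where in the chain the batch starts, this estimator matches in expectation the average over the largest allowed batch, while the expected number of samples per call grows only logarithmically in `max_samples`.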
Detecting anomalous time series is key for scientific, medical and industrial tasks, but is challenging due to its inherent unsupervised nature. In recent years, progress has been made on this task by learning increasingly complex features, often using deep neural networks. In this work, we argue that shallow features suffice when combined with distribution distance measures. Our approach models each time series as a high-dimensional empirical distribution of features, where each time-point constitutes a single sample. Modeling the distance between a test time series and the normal training set therefore requires efficiently measuring the distance between multivariate probability distributions. We show that by parameterizing each time series using cumulative Radon features, we are able to efficiently and effectively model the distribution of normal time series. Our theoretically grounded but simple-to-implement approach is evaluated on multiple datasets and shown to achieve better results than established, classical methods as well as complex, state-of-the-art deep learning methods. Code is provided.
We consider unsupervised classification by means of a latent multinomial variable which categorizes a scalar response into one of L components of a mixture model. This process can be thought of as a hierarchical model: the first level models a scalar response according to a mixture of parametric distributions, and the second level models the mixture probabilities by means of a generalised linear model with functional and scalar covariates. The traditional approach of treating functional covariates as vectors suffers from the curse of dimensionality, since functional covariates can be measured at very small intervals, leading to a highly parametrised model, and it also fails to take into account the nature of the data. We use basis expansion to reduce the dimensionality and a Bayesian approach to estimate the parameters while providing predictions of the latent classification vector. By means of a simulation study, we investigate the behaviour of our approach for a normal mixture model and a zero-inflated mixture of Poisson distributions. We also compare the performance of the classical Gibbs sampling approach with variational Bayes inference.
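The dimension reduction by basis expansion mentioned above can be illustrated as follows: a curve observed on a dense grid is replaced by a handful of basis coefficients, which then enter the generalised linear model in place of the raw measurements. The abstract does not state which basis is used, so the Fourier basis, the uniform grid on [0, 1), and the function names are assumptions made for this sketch.

```python
import math


def fourier_basis(k, t):
    """k-th orthonormal Fourier basis function on [0, 1] (k = 0 is the constant)."""
    if k == 0:
        return 1.0
    freq = (k + 1) // 2
    if k % 2 == 1:
        return math.sqrt(2.0) * math.sin(2.0 * math.pi * freq * t)
    return math.sqrt(2.0) * math.cos(2.0 * math.pi * freq * t)


def basis_coefficients(curve_values, n_basis):
    """Project a curve sampled at t_j = j/m onto the first n_basis functions.

    The integral of curve(t) * basis_k(t) over [0, 1] is approximated by a
    Riemann sum on the uniform grid.
    """
    m = len(curve_values)
    return [
        sum(curve_values[j] * fourier_basis(k, j / m) for j in range(m)) / m
        for k in range(n_basis)
    ]


if __name__ == "__main__":
    m = 200
    curve = [2.0 * math.sin(2.0 * math.pi * j / m) for j in range(m)]
    # 200 raw measurements are summarised by 5 coefficients.
    print([round(c, 6) for c in basis_coefficients(curve, 5)])
```

The number of regression parameters then scales with the number of basis functions rather than with the number of measurement points.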
Probabilistic finite mixture models are widely used for unsupervised clustering. These models can often be improved by adapting them to the topology of the data. For instance, in order to classify spatially adjacent data points similarly, it is common to introduce a Laplacian constraint on the posterior probability that each data point belongs to a class. Alternatively, the mixing probabilities can be treated as free parameters, while assuming Gauss-Markov or more complex priors to regularize those mixing probabilities. However, these approaches are constrained by the shape of the prior and often lead to complicated or intractable inference. Here, we propose a new parametrization of the Dirichlet distribution to flexibly regularize the mixing probabilities of over-parametrized mixture distributions. Using the Expectation-Maximization algorithm, we show that our approach allows us to define any linear update rule for the mixing probabilities, including spatial smoothing regularization as a special case. We then show that this flexible design can be extended to share class information between multiple mixture models. We apply our algorithm to artificial and natural image segmentation tasks, and we provide quantitative and qualitative comparison of the performance of Gaussian and Student-t mixtures on the Berkeley Segmentation Dataset. We also demonstrate how to propagate class information across the layers of deep convolutional neural networks in a probabilistically optimal way, suggesting a new interpretation for feedback signals in biological visual systems. Our flexible approach can be easily generalized to adapt probabilistic mixture models to arbitrary data topologies.
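A linear update rule for the mixing probabilities, with spatial smoothing as the special case, can be sketched for a one-dimensional Gaussian mixture as follows. The paper derives such updates from a reparametrized Dirichlet prior; the sketch below simply hard-codes a three-point moving average of the responsibilities, so it should be read as a heuristic illustration under that assumption and not as the authors' algorithm. All names are hypothetical, and the standard deviations are held fixed for brevity.

```python
import math


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_step(data, mixing, means, stds):
    """One EM-style iteration with per-point mixing probabilities.

    mixing[i][c] is the prior probability that point i belongs to class c.
    """
    n, k = len(data), len(means)
    # E-step: responsibilities under the current per-point mixing probabilities.
    resp = []
    for i in range(n):
        w = [mixing[i][c] * gaussian_pdf(data[i], means[c], stds[c]) for c in range(k)]
        total = sum(w)
        resp.append([v / total for v in w])
    # Mixing update: a linear map of the responsibilities, here the average
    # over each point and its immediate neighbours (spatial smoothing).
    new_mixing = []
    for i in range(n):
        nbrs = range(max(0, i - 1), min(n, i + 2))
        new_mixing.append([sum(resp[j][c] for j in nbrs) / len(nbrs) for c in range(k)])
    # Mean update: the usual responsibility-weighted average.
    new_means = []
    for c in range(k):
        weight = sum(resp[i][c] for i in range(n))
        new_means.append(sum(resp[i][c] * data[i] for i in range(n)) / weight)
    return new_mixing, new_means


if __name__ == "__main__":
    data = [0.1, -0.2, 0.0, 5.1, 4.9, 5.0]
    mixing = [[0.5, 0.5] for _ in data]
    mixing, means = em_step(data, mixing, [1.0, 4.0], [1.0, 1.0])
    print(means)
```

Replacing the moving average by another row-stochastic linear map gives other regularizers in the same spirit.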
Several queries and scores have recently been proposed to explain individual predictions over ML models. Given the need for flexible, reliable, and easy-to-apply interpretability methods for ML models, we foresee the need for developing declarative languages to naturally specify different explainability queries. We do this in a principled way by rooting such a language in a logic, called FOIL, that allows for expressing many simple but important explainability queries, and might serve as a core for more expressive interpretability languages. We study the computational complexity of FOIL queries over two classes of ML models often deemed to be easily interpretable: decision trees and OBDDs. Since the number of possible inputs for an ML model is exponential in its dimension, the tractability of the FOIL evaluation problem is delicate but can be achieved by either restricting the structure of the models or the fragment of FOIL being evaluated. We also present a prototype implementation of FOIL wrapped in a high-level declarative language and perform experiments showing that such a language can be used in practice.
We propose a general and scalable approximate sampling strategy for probabilistic models with discrete variables. Our approach uses gradients of the likelihood function with respect to its discrete inputs to propose updates in a Metropolis-Hastings sampler. We show empirically that this approach outperforms generic samplers in a number of difficult settings including Ising models, Potts models, restricted Boltzmann machines, and factorial hidden Markov models. We also demonstrate the use of our improved sampler for training deep energy-based models on high dimensional discrete data. This approach outperforms variational auto-encoders and existing energy-based models. Finally, we give bounds showing that our approach is near-optimal in the class of samplers which propose local updates.
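One way a gradient-informed proposal of this kind can be built is sketched below for a one-dimensional Ising ring. The gradient of the log-probability with respect to each spin is used to score single-spin flips, a flip index is drawn from a softmax over those scores, and a Metropolis-Hastings correction keeps the target distribution invariant. The model, the softmax temperature of 2, and all function names are assumptions for this example and are not claimed to be the paper's exact algorithm.

```python
import math
import random


def log_prob(x, theta):
    """Unnormalized log-probability of spins x in {-1, +1} on a ring."""
    n = len(x)
    return theta * sum(x[i] * x[(i + 1) % n] for i in range(n))


def flip_deltas(x, theta):
    """Change in log_prob from flipping each spin, computed from the gradient.

    For this model log_prob is linear in each x[i], so the first-order
    estimate -2 * x[i] * d(log_prob)/dx[i] is exact.
    """
    n = len(x)
    grads = [theta * (x[i - 1] + x[(i + 1) % n]) for i in range(n)]
    return [-2.0 * x[i] * grads[i] for i in range(n)]


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mh_step(x, theta, rng):
    """One Metropolis-Hastings step with a gradient-informed flip proposal."""
    q_fwd = softmax([d / 2.0 for d in flip_deltas(x, theta)])
    i = rng.choices(range(len(x)), weights=q_fwd)[0]
    y = list(x)
    y[i] = -y[i]
    q_rev = softmax([d / 2.0 for d in flip_deltas(y, theta)])
    log_accept = (log_prob(y, theta) - log_prob(x, theta)
                  + math.log(q_rev[i]) - math.log(q_fwd[i]))
    if rng.random() < math.exp(min(0.0, log_accept)):
        return y
    return list(x)


def run_chain(x, theta, steps, rng):
    state = list(x)
    for _ in range(steps):
        state = mh_step(state, theta, rng)
    return state


if __name__ == "__main__":
    print(run_chain([1, -1, 1, -1, 1, -1], 0.5, 200, random.Random(0)))
```

Compared with picking the flip index uniformly at random, the softmax proposal concentrates on flips that are likely to be accepted.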
Discrete random structures are important tools in Bayesian nonparametrics and the resulting models have proven effective in density estimation, clustering, topic modeling and prediction, among others. In this paper, we consider nested processes and study the dependence structures they induce. Dependence ranges between homogeneity, corresponding to full exchangeability, and maximum heterogeneity, corresponding to (unconditional) independence across samples. The popular nested Dirichlet process is shown to degenerate to the fully exchangeable case when there are ties across samples at the observed or latent level. To overcome this drawback, inherent to nesting general discrete random measures, we introduce a novel class of latent nested processes. These are obtained by adding common and group-specific completely random measures and then normalising to yield dependent random probability measures. We provide results on the partition distributions induced by latent nested processes, and develop a Markov chain Monte Carlo sampler for Bayesian inference. A test for distributional homogeneity across groups is obtained as a by-product. The results and their inferential implications are showcased on synthetic and real data.