We provide a flexible framework for selecting among a class of additive partial linear models that allows both linear and nonlinear additive components. In practice, it is challenging to determine which additive components should be excluded from the model while simultaneously deciding whether the nonzero additive components should be represented as linear or nonlinear components in the final model. In this paper, we propose a Bayesian model selection method that is facilitated by a carefully specified class of models, including the choice of a prior distribution and the nonparametric model used for the nonlinear additive components. We employ a series of latent variables that determine the effect of each variable among three possibilities (no effect, linear effect, and nonlinear effect) and that simultaneously determine the knots of each spline for a suitable penalization of smooth functions. The use of a pseudo-prior distribution along with a collapsing scheme enables us to deploy well-behaved Markov chain Monte Carlo samplers, both for model selection and for fitting the preferred model. Our method and algorithm are evaluated in a suite of numerical studies and applied to a nutritional epidemiology study. The numerical results show that the proposed methodology outperforms previously available methods in terms of effective sample sizes of the Markov chain samplers and overall misclassification rates.
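To make the three-state selection concrete, here is a minimal Python sketch (not the authors' Bayesian procedure): each covariate is assigned one of the states none, linear, or spline, a truncated-power spline basis stands in for the nonparametric component, and configurations are compared by BIC rather than by the latent-variable MCMC described above. All data and function names are illustrative.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n = 300
x1, x2, x3 = rng.uniform(-1, 1, (3, n))
y = 1.5 * x1 + np.sin(np.pi * x2) + 0.3 * rng.standard_normal(n)  # x3 has no effect

def spline_cols(x, n_knots=6):
    """Truncated-power spline basis columns for one covariate."""
    knots = np.quantile(x, np.linspace(0.1, 0.9, n_knots))
    return np.column_stack([x] + [np.maximum(x - k, 0.0) ** 3 for k in knots])

def bic(states, xs, y):
    """Fit the additive partial linear model implied by `states` and return its BIC."""
    cols = [np.ones(n)]
    for s, x in zip(states, xs):
        if s == "linear":
            cols.append(x)
        elif s == "spline":
            cols.append(spline_cols(x))
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + X.shape[1] * np.log(n)

best = min(product(["none", "linear", "spline"], repeat=3),
           key=lambda states: bic(states, [x1, x2, x3], y))
print("selected states:", best)  # typically ('linear', 'spline', 'none')
```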
Despite their widespread use in practice, the asymptotic properties of Bayesian penalized splines have not been investigated so far. We close this gap and study posterior concentration rates for Bayesian penalized splines in a Gaussian nonparametric regression model. A key feature of the approach is the hyperprior on the smoothing variance, which allows for adaptive smoothing in practice but complicates the theoretical analysis considerably. Our main tool for deriving posterior concentration rates under a general hyperprior on the smoothing variance is a novel spline estimator that projects the observations onto the leading basis functions of a Demmler-Reinsch basis. Our results show that posterior concentration at a near-optimal rate can be achieved if the hyperprior on the smoothing variance strikes a fine balance between oversmoothing and undersmoothing. Another interesting finding is that the order of the roughness penalty must exactly match the regularity of the unknown regression function in order to achieve posterior concentration at a near-optimal rate. Overall, our results are the first posterior concentration results for Bayesian penalized splines and can be generalized in many directions.
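The following sketch illustrates the Demmler-Reinsch construction and a projection-type estimator on toy data, using a truncated-power spline basis and a simple penalty in place of the exact setup analyzed in the paper; the transformed design `Z` has orthonormal columns and diagonalizes the penalty, and the estimator projects onto its least-penalized columns. All numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_knots = 200, 20
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

# Spline design: intercept, linear term, and truncated-power knot terms.
knots = np.quantile(x, np.linspace(0.05, 0.95, n_knots))
B = np.column_stack([np.ones(n), x] + [np.maximum(x - k, 0.0) for k in knots])

# Roughness penalty acting only on the knot coefficients.
K = np.diag([0.0, 0.0] + [1.0] * n_knots)

# Demmler-Reinsch construction: R^T R = B^T B, then diagonalize R^{-T} K R^{-1}.
R = np.linalg.cholesky(B.T @ B + 1e-10 * np.eye(B.shape[1])).T  # upper-triangular factor
Rinv = np.linalg.inv(R)
S, U = np.linalg.eigh(Rinv.T @ K @ Rinv)  # eigenvalues ascending: smoothest directions first
Z = B @ Rinv @ U                          # columns of Z are (numerically) orthonormal

# Projection-type estimator: keep only the first N (least-penalized) basis functions.
N = 8
fitted = Z[:, :N] @ (Z[:, :N].T @ y)
print("residual std:", np.std(y - fitted))
```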
We consider the problem of time series forecasting in an adaptive setting. We focus on the inference of state-space models under unknown and potentially time-varying noise variances. We introduce an augmented model in which the variances are represented as auxiliary Gaussian latent variables that are tracked over time. Because variances are nonnegative, a transformation is chosen and applied to these latent variables. Inference relies on the online variational Bayesian methodology, which consists of minimizing a Kullback-Leibler divergence at each time step. We observe that minimizing this divergence yields an extension of the Kalman filter that takes the variance uncertainty into account. We design a novel algorithm, named Viking, built on these optimal recursive updates. For the auxiliary latent variables, we use second-order bounds whose optima admit closed-form solutions. Experiments on synthetic data show that Viking behaves well and is robust to misspecification.
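As a point of reference for the variance-augmented state-space setup, the sketch below runs a plain scalar Kalman filter in which the noise variances are fixed exp-transformed values; Viking instead places Gaussian latent variables on these transformed variances and updates them variationally at each step, which is not shown here. The model and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200
a_q, a_r = np.log(0.05), np.log(0.5)   # latent log-variances (fixed here, tracked by Viking)
q, r = np.exp(a_q), np.exp(a_r)        # the exp transform keeps both variances nonnegative

x_true = np.cumsum(rng.normal(0, np.sqrt(q), T))   # random-walk state
y = x_true + rng.normal(0, np.sqrt(r), T)          # noisy observations

m, P, means = 0.0, 1.0, []             # filtering mean and variance
for t in range(T):
    P = P + q                          # predict
    K = P / (P + r)                    # Kalman gain
    m = m + K * (y[t] - m)             # update
    P = (1 - K) * P
    means.append(m)

print("filtering RMSE:", np.sqrt(np.mean((np.array(means) - x_true) ** 2)))
```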
Bayesian networks are probabilistic graphical models widely employed to understand dependencies in high-dimensional data, and even to facilitate causal discovery. Learning the underlying network structure, which is encoded as a directed acyclic graph (DAG), is highly challenging, mainly because of the vast number of possible networks in combination with the acyclicity constraint. Efforts have focussed on two fronts: constraint-based methods that perform conditional independence tests to exclude edges, and score-and-search approaches that explore the DAG space with greedy or MCMC schemes. Here we synthesise these two fields in a novel hybrid method which reduces the complexity of MCMC approaches to that of a constraint-based method. Individual steps in the MCMC scheme only require simple table lookups, so very long chains can be obtained efficiently. Furthermore, the scheme includes an iterative procedure to correct for errors from the conditional independence tests. The algorithm offers markedly superior performance to alternatives, particularly because DAGs can also be sampled from the posterior distribution, enabling full Bayesian model averaging for much larger Bayesian networks.
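The constraint-based ingredient can be illustrated with a standard Fisher-z partial-correlation test used to rule out edges before any search; this is a generic pruning step, not the paper's hybrid MCMC scheme, and the toy chain X0 -> X1 -> X2 is purely illustrative.

```python
import numpy as np
from scipy.stats import norm

def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given the columns in `cond`."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(np.corrcoef(data[:, idx], rowvar=False))
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

def independent(data, i, j, cond, alpha=0.05):
    """Fisher-z test: True if independence of X_i and X_j given X_cond is not rejected."""
    n = data.shape[0]
    r = np.clip(partial_corr(data, i, j, cond), -0.999999, 0.999999)
    z = 0.5 * np.log((1 + r) / (1 - r))
    stat = np.sqrt(n - len(cond) - 3) * abs(z)
    return 2 * (1 - norm.cdf(stat)) > alpha

# Toy chain X0 -> X1 -> X2, so X0 and X2 are independent given X1.
rng = np.random.default_rng(3)
x0 = rng.standard_normal(2000)
x1 = 0.8 * x0 + rng.standard_normal(2000)
x2 = 0.8 * x1 + rng.standard_normal(2000)
data = np.column_stack([x0, x1, x2])

print(independent(data, 0, 2, []))   # False: marginally dependent, keep the edge for now
print(independent(data, 0, 2, [1]))  # True: edge between X0 and X2 can be excluded
```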
Automated hyperparameter optimization (HPO) can help practitioners obtain peak performance from machine learning models. However, it often yields little insight into the effects of different hyperparameters on the final model performance. This lack of explainability makes it difficult to trust and understand the automated HPO process and its results. We suggest using interpretable machine learning (IML) to gain insights from the experimental data obtained during HPO with Bayesian optimization (BO). BO tends to focus on promising regions with potentially high-performing configurations and thus induces a sampling bias. Hence, many IML techniques, such as the partial dependence plot (PDP), carry the risk of generating biased interpretations. By leveraging the posterior uncertainty of the BO surrogate model, we introduce a variant of the PDP with estimated confidence bands. We propose to partition the hyperparameter space to obtain more confident and reliable PDPs in relevant sub-regions. In an experimental study, we provide quantitative evidence for the increased quality of the PDPs within sub-regions.
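A minimal sketch of a PDP computed from a GP surrogate, with a crude uncertainty band obtained by averaging the posterior predictive standard deviation; the paper's confidence bands and the sub-region partitioning are more refined than this, and the data here are synthetic stand-ins for BO evaluations.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(4)
# Stand-ins for (hyperparameter configurations, validation losses) collected during BO.
X = rng.uniform(0, 1, (60, 2))
y = (X[:, 0] - 0.3) ** 2 + 0.5 * np.sin(4 * X[:, 1]) + 0.05 * rng.standard_normal(60)

surrogate = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)

grid = np.linspace(0, 1, 25)             # values of hyperparameter 0
pdp_mean, pdp_band = [], []
for g in grid:
    Xg = X.copy()
    Xg[:, 0] = g                         # marginalize over the observed values of the others
    mu, sd = surrogate.predict(Xg, return_std=True)
    pdp_mean.append(mu.mean())           # PDP estimate at g
    pdp_band.append(sd.mean())           # crude band from the surrogate's posterior uncertainty

print(np.round(pdp_mean[:5], 3), np.round(pdp_band[:5], 3))
```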
Nonlinear state-space models are powerful tools to describe dynamical structures in complex time series. In a streaming setting where data are processed one sample at a time, simultaneous inference of the state and its nonlinear dynamics has posed significant challenges in practice. We develop a novel online learning framework, leveraging variational inference and sequential Monte Carlo, which enables flexible and accurate Bayesian joint filtering. Our method provides an approximation of the filtering posterior which can be made arbitrarily close to the true filtering distribution for a wide class of dynamics models and observation models. Specifically, the proposed framework can efficiently approximate a posterior over the dynamics using sparse Gaussian processes, allowing for an interpretable model of the latent dynamics. Constant time complexity per sample makes our approach amenable to online learning scenarios and suitable for real-time applications.
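The sequential Monte Carlo ingredient can be illustrated with a vanilla bootstrap particle filter on a scalar nonlinear state-space model; unlike the proposed framework, the dynamics function `f` is assumed known here rather than learned online with sparse Gaussian processes, and all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
T, N = 100, 500
f = lambda x: 0.9 * x + 2.0 * np.sin(x)      # nonlinear dynamics (assumed known here)

# Simulate a scalar nonlinear state-space model.
x, xs, ys = 0.0, [], []
for t in range(T):
    x = f(x) + rng.normal(0, 0.3)
    xs.append(x)
    ys.append(x + rng.normal(0, 0.5))

# Bootstrap particle filter.
particles = rng.normal(0, 1, N)
est = []
for t in range(T):
    particles = f(particles) + rng.normal(0, 0.3, N)   # propagate through the dynamics
    logw = -0.5 * ((ys[t] - particles) / 0.5) ** 2     # Gaussian observation log-likelihood
    w = np.exp(logw - logw.max())
    w /= w.sum()
    est.append(np.sum(w * particles))                  # filtering mean estimate
    particles = particles[rng.choice(N, N, p=w)]       # multinomial resampling

print("filtering RMSE:", np.sqrt(np.mean((np.array(est) - np.array(xs)) ** 2)))
```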
Variable selection in Gaussian processes (GPs) is typically undertaken by thresholding the inverse lengthscales of `automatic relevance determination' kernels, but in high-dimensional datasets this approach can be unreliable. A more probabilistically principled alternative is to use spike-and-slab priors and infer a posterior probability of variable inclusion. However, existing implementations for GPs are extremely costly to run on both high-dimensional and large-$n$ datasets, or are intractable for most kernels. We therefore develop a fast and scalable variational inference algorithm for the spike-and-slab GP that is tractable with arbitrary differentiable kernels. We improve our algorithm's ability to adapt to the sparsity of relevant variables by Bayesian model averaging over hyperparameters, and achieve substantial speed-ups using zero-temperature posterior restrictions, dropout pruning and nearest-neighbour minibatching. In experiments our method consistently outperforms vanilla and sparse variational GPs whilst retaining similar runtimes (even when $n=10^6$) and performs competitively with a spike-and-slab GP using MCMC but runs up to $1000$ times faster.
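For intuition, the toy sketch below computes spike-and-slab style inclusion probabilities by brute-force enumeration of inclusion patterns, scoring each with the GP marginal likelihood from scikit-learn; this is exponential in the number of inputs and only illustrates what the paper's scalable variational algorithm approximates. Data and dimensions are illustrative.

```python
import numpy as np
from itertools import product
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(6)
n, d = 120, 4
X = rng.uniform(-1, 1, (n, d))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)  # only x0, x1 matter

# Score every nonempty inclusion pattern by its GP log marginal likelihood.
log_evidence = {}
for gamma in product([0, 1], repeat=d):
    if sum(gamma) == 0:
        continue
    cols = [j for j in range(d) if gamma[j]]
    gp = GaussianProcessRegressor(RBF(length_scale=np.ones(len(cols))) + WhiteKernel(),
                                  normalize_y=True).fit(X[:, cols], y)
    log_evidence[gamma] = gp.log_marginal_likelihood_value_

# Posterior inclusion probabilities under a uniform prior over inclusion patterns.
lev = np.array(list(log_evidence.values()))
post = np.exp(lev - lev.max())
post /= post.sum()
incl = sum(p * np.array(g) for g, p in zip(log_evidence, post))
print("inclusion probabilities:", np.round(incl, 2))  # typically high for x0, x1 only
```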
This study concerns estimating the probability distribution of the sample maximum. The traditional approach is parametric fitting of the limiting distribution, the generalized extreme value distribution; in finite samples, however, this model is misspecified to a certain extent. We propose a plug-in type of nonparametric estimator that does not require model specification. We prove that the asymptotic convergence rates of both estimators depend on the tail index and the second-order parameter. As the tail becomes lighter, the degree of misspecification of the parametric fit grows, which means its convergence rate becomes slower. In the Weibull cases, which can be seen as the limit of tail lightness, only the nonparametric distribution estimator retains its consistency. Finally, we report the results of numerical experiments.
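A toy comparison of the two estimators of $P(\max \le x)$ for light-tailed (Gaussian) data: the nonparametric plug-in $\hat F(x)^m$ with $\hat F$ the empirical distribution function, versus a GEV fit to observed block maxima; the values of $n$, $m$ and the evaluation grid are arbitrary.

```python
import numpy as np
from scipy.stats import genextreme, norm

rng = np.random.default_rng(7)
n, m = 5000, 50                          # n observations; maxima taken over blocks of size m
sample = rng.standard_normal(n)          # light (Gaussian) tail
x_grid = np.linspace(2.0, 4.0, 5)

# Nonparametric plug-in: P(max of m draws <= x) = F(x)^m with F the empirical CDF.
ecdf = np.searchsorted(np.sort(sample), x_grid, side="right") / n
plug_in = ecdf ** m

# Parametric alternative: fit the GEV distribution to the observed block maxima.
maxima = sample[: (n // m) * m].reshape(-1, m).max(axis=1)
c, loc, scale = genextreme.fit(maxima)
gev = genextreme.cdf(x_grid, c, loc=loc, scale=scale)

truth = norm.cdf(x_grid) ** m            # exact distribution of the maximum of m N(0,1) draws
print("plug-in:", np.round(plug_in, 3))
print("GEV fit:", np.round(gev, 3))
print("truth:  ", np.round(truth, 3))
```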
We propose a doubly robust approach to characterizing treatment effect heterogeneity in observational studies. We utilize posterior distributions for both the propensity score and outcome regression models to provide valid inference on the conditional average treatment effect even when high-dimensional or nonparametric models are used. We show that our approach leads to conservative inference in finite samples or under model misspecification, and provides a consistent variance estimator when both models are correctly specified. In simulations, we illustrate the utility of these results in difficult settings such as high-dimensional covariate spaces or highly flexible models for the propensity score and outcome regression. Lastly, we analyze environmental exposure data from NHANES to identify how the effects of these exposures vary by subject-level characteristics.
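A frequentist sketch of the doubly robust construction: AIPW pseudo-outcomes built from fitted propensity and outcome models are regressed on covariates to estimate the conditional average treatment effect; the paper instead propagates posterior distributions for both nuisance models, which is not reproduced here, and the simulated data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(8)
n, d = 2000, 5
X = rng.standard_normal((n, d))
e = 1 / (1 + np.exp(-0.8 * X[:, 0]))                 # true propensity score
A = rng.binomial(1, e)                               # treatment assignment
tau = 1.0 + X[:, 1]                                  # heterogeneous treatment effect
Y = X[:, 0] + A * tau + rng.standard_normal(n)

e_hat = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
mu1 = LinearRegression().fit(X[A == 1], Y[A == 1]).predict(X)
mu0 = LinearRegression().fit(X[A == 0], Y[A == 0]).predict(X)

# Doubly robust (AIPW) pseudo-outcome: consistent if either nuisance model is correct.
psi = mu1 - mu0 + A * (Y - mu1) / e_hat - (1 - A) * (Y - mu0) / (1 - e_hat)
cate = LinearRegression().fit(X, psi)                # project onto covariates for the CATE
print("coefficient on X1 (true value 1.0):", np.round(cate.coef_[1], 2))
```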
We introduce a framework for obtaining tight mixing times for Markov chains based on what we call restricted modified log-Sobolev inequalities. Modified log-Sobolev inequalities (MLSI) quantify the rate of relative entropy contraction for the Markov operator, and are notoriously difficult to establish. However, infinitesimally close to stationarity, entropy contraction becomes equivalent to variance contraction, a.k.a. a Poincaré inequality, which is significantly easier to establish through, e.g., spectral analysis. Motivated by this observation, we study restricted modified log-Sobolev inequalities that guarantee entropy contraction not for all starting distributions, but for those in a large neighborhood of the stationary distribution. We show how to sample from the hardcore and Ising models on $n$-node graphs that have a constant $\delta$ relative gap to the tree-uniqueness threshold, in nearly-linear time $\widetilde O_{\delta}(n)$. Notably, our bound does not depend on the maximum degree $\Delta$, and is therefore optimal even for high-degree graphs. This improves on prior mixing time bounds of $\widetilde O_{\delta, \Delta}(n)$ and $\widetilde O_{\delta}(n^2)$, established via (non-restricted) modified log-Sobolev and Poincaré inequalities, respectively. We further show that optimal concentration inequalities can still be achieved from the restricted form of modified log-Sobolev inequalities. To establish restricted entropy contraction, we extend the entropic independence framework of Anari, Jain, Koehler, Pham, and Vuong to the setting of distributions that are spectrally independent under a restricted set of external fields. We also develop an orthogonal trick that might be of independent interest: utilizing Bernoulli factories we show how to implement Glauber dynamics updates on high-degree graphs in $O(1)$ time, assuming a standard adjacency array representation of the graph.
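For reference, a standard single-site Glauber (heat-bath) update for the Ising model on a cycle; the paper's contributions concern the mixing-time analysis and the Bernoulli-factory trick for $O(1)$ updates on high-degree graphs, neither of which appears in this toy sketch. Graph size and temperature are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(9)
n, beta = 200, 0.1                                   # n-cycle graph, inverse temperature beta
neighbors = [((i - 1) % n, (i + 1) % n) for i in range(n)]
sigma = rng.choice([-1, 1], size=n)

def glauber_step(sigma):
    """One heat-bath update: resample a uniformly random spin from its conditional."""
    i = rng.integers(n)
    local_field = beta * (sigma[neighbors[i][0]] + sigma[neighbors[i][1]])
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * local_field))  # P(sigma_i = +1 | rest)
    sigma[i] = 1 if rng.random() < p_plus else -1

for _ in range(50 * n):                               # plenty of updates for this toy chain
    glauber_step(sigma)
print("magnetization:", sigma.mean())
```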
We consider the task of learning the parameters of a {\em single} component of a mixture model when we are given {\em side information} about that component; we call this the "search problem" in mixture models. We would like to solve it with lower computational and sample complexity than solving the original problem, in which one learns the parameters of all components. Our main contributions are a simple but general model for the notion of side information and a corresponding simple matrix-based algorithm for solving the search problem in this general setting. We then specialize this model and algorithm to four common scenarios: Gaussian mixture models, LDA topic models, subspace clustering, and mixed linear regression. For each of these we show that if (and only if) the side information is informative, we obtain parameter estimates with greater accuracy and improved computational complexity compared with existing moment-based mixture model algorithms (e.g., tensor methods). We also illustrate several natural ways to obtain such side information for specific problem instances. Our experiments on real data sets (NY Times, Yelp, BSDS500) further demonstrate the practicality of our algorithms, showing significant improvements in runtime and accuracy.
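A toy illustration of how side information (a few samples known to come from the target component) can seed recovery of a single Gaussian component against a background density via EM-style reweighting; this is not the matrix-based algorithm of the paper, and all quantities and names are synthetic.

```python
import numpy as np

rng = np.random.default_rng(10)
# Mixture of three 2-D Gaussians; we only want the parameters of the first component.
means = np.array([[3.0, 0.0], [-2.0, 2.0], [0.0, -3.0]])
labels = rng.integers(0, 3, 3000)
X = means[labels] + rng.standard_normal((3000, 2))

side = X[labels == 0][:10]            # side information: a few samples from the target component
mu = side.mean(axis=0)                # initial estimate of the target mean

# Background model: a single Gaussian fit to all of the data.
bg_mean, bg_cov = X.mean(axis=0), np.cov(X, rowvar=False)
bg_prec, bg_logdet = np.linalg.inv(bg_cov), np.log(np.linalg.det(bg_cov))

def log_gauss(X, m, prec, logdet):
    d = X - m
    return -0.5 * (np.sum(d @ prec * d, axis=1) + logdet)

for _ in range(25):                   # EM-style: target component vs. background
    lt = log_gauss(X, mu, np.eye(2), 0.0)
    lb = log_gauss(X, bg_mean, bg_prec, bg_logdet)
    r = 1.0 / (1.0 + np.exp(lb - lt))                 # responsibility for the target component
    mu = (r[:, None] * X).sum(axis=0) / r.sum()

print("estimated component mean:", np.round(mu, 2), "(true [3, 0])")
```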