Survey data are often collected under multistage sampling designs where units are binned to clusters that are sampled in a first stage. The unit-indexed population variables of interest are typically dependent within cluster. We propose a Fully Bayesian method that constructs an exact likelihood for the observed sample to incorporate unit-level marginal sampling weights for performing unbiased inference for population parameters while simultaneously accounting for the dependence induced by sampling clusters of units to produce correct uncertainty quantification. Our approach parameterizes cluster-indexed random effects in both a marginal model for the response and a conditional model for published, unit-level sampling weights. We compare our method to plug-in Bayesian and frequentist alternatives in a simulation study and demonstrate that our method most closely achieves correct uncertainty quantification for model parameters, including the generating variances for cluster-indexed random effects. We demonstrate our method in an application with NHANES data.
When studying treatment effects in multilevel studies, investigators commonly use (semi-)parametric estimators, which make strong parametric assumptions about the outcome, the treatment, and/or the correlation between individuals. We propose two nonparametric, doubly robust, asymptotically Normal estimators of treatment effects that do not make such assumptions. The first estimator is an extension of the cross-fitting estimator applied to clustered settings. The second estimator is a new estimator that uses conditional propensity scores and an outcome covariance model to improve efficiency. We apply our estimators in simulation and empirical studies and find that they consistently obtain the smallest standard errors.
We address the problem of causal effect estimation in the presence of unobserved confounding, but where proxies for the latent confounder(s) are observed. We propose two kernel-based methods for nonlinear causal effect estimation in this setting: (a) a two-stage regression approach, and (b) a maximum moment restriction approach. We focus on the proximal causal learning setting, but our methods can be used to solve a wider class of inverse problems characterised by a Fredholm integral equation. In particular, we provide a unifying view of two-stage and moment restriction approaches for solving this problem in a nonlinear setting. We provide consistency guarantees for each algorithm, and we demonstrate these approaches achieve competitive results on synthetic data and data simulating a real-world task. In particular, our approach outperforms earlier methods that are not suited to leveraging proxy variables.
Learning graphs from sets of nodal observations represents a prominent problem formally known as graph topology inference. However, current approaches are limited by typically focusing on inferring single networks, and they assume that observations from all nodes are available. First, many contemporary setups involve multiple related networks, and second, it is often the case that only a subset of nodes is observed while the rest remain hidden. Motivated by these facts, we introduce a joint graph topology inference method that models the influence of the hidden variables. Under the assumptions that the observed signals are stationary on the sought graphs and the graphs are closely related, the joint estimation of multiple networks allows us to exploit such relationships to improve the quality of the learned graphs. Moreover, we confront the challenging problem of modeling the influence of the hidden nodes to minimize their detrimental effect. To obtain an amenable approach, we take advantage of the particular structure of the setup at hand and leverage the similarity between the different graphs, which affects both the observed and the hidden nodes. To test the proposed method, numerical simulations over synthetic and real-world graphs are provided.
Active inference, a corollary of the free energy principle, is a formal way of describing the behavior of certain kinds of random dynamical systems that have the appearance of sentience. In this chapter, we describe how active inference combines Bayesian decision theory and optimal Bayesian design principles under a single imperative to minimize expected free energy. It is this aspect of active inference that allows for the natural emergence of information-seeking behavior. When prior outcome preferences are removed from expected free energy, active inference reduces to optimal Bayesian design, i.e., information gain maximization. Conversely, active inference reduces to Bayesian decision theory in the absence of ambiguity and relative risk, i.e., expected utility maximization. Using these limiting cases, we illustrate how behaviors differ when agents select actions that optimize expected utility, expected information gain, and expected free energy. Our T-maze simulations show that optimizing expected free energy produces goal-directed information-seeking behavior, while optimizing expected utility induces purely exploitative behavior and maximizing information gain engenders intrinsically motivated behavior.
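As a hedged illustration of the two limiting cases described above, expected free energy is often written (in some treatments only as an approximation) as the sum of an information-gain term and a preference term. The notation below ($G$, $\pi$, $Q$, $P(o \mid C)$) is assumed for this sketch and is not taken from the abstract.

```latex
\begin{equation}
G(\pi) \;=\;
  -\,\underbrace{\mathbb{E}_{Q(o \mid \pi)}\Big[ D_{\mathrm{KL}}\big[\, Q(s \mid o, \pi) \,\big\|\, Q(s \mid \pi) \,\big] \Big]}_{\text{expected information gain}}
  \;-\;
  \underbrace{\mathbb{E}_{Q(o \mid \pi)}\big[ \ln P(o \mid C) \big]}_{\text{expected utility}}
\end{equation}
```

Under this reading, dropping the preference term leaves information gain maximization, and dropping the information-gain term leaves expected utility maximization.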
Causal effect estimation for dynamic treatment regimes (DTRs) contributes to sequential decision making. However, censoring and time-dependent confounding under DTRs are challenging because the amount of observational data declines over time as the sample size shrinks, while the feature dimension increases over time. Long-term follow-up compounds these challenges. Another challenge is the highly complex relationships between confounders, treatments, and outcomes, which cause traditional, commonly used linear methods to fail. We combine outcome regression models with treatment models for high-dimensional features using the small sample of uncensored subjects, and we fit deep Bayesian models for outcome regression to reveal the complex relationships between confounders, treatments, and outcomes. The deep Bayesian models also model uncertainty and output the prediction variance, which is essential for safety-aware applications such as self-driving cars and medical treatment design. The experimental results on medical simulations of HIV treatment show the ability of the proposed method to obtain stable and accurate dynamic causal effect estimation from observational data, especially with long-term follow-up. Our technique provides practical guidance for sequential decision making and policy-making.
Clustering has become a core technology in machine learning, largely due to its applications in unsupervised learning, classification, and density estimation. A frequentist approach to clustering based on mixture models uses the EM algorithm, in which the parameters of the mixture model are usually estimated within a maximum likelihood framework. The Bayesian approach for finite and infinite Gaussian mixture models generates point estimates for all variables as well as the associated uncertainty in the form of the full posterior distribution of the estimates. The sole aim of this survey is to give a self-contained introduction to concepts and mathematical tools in Bayesian inference for finite and infinite Gaussian mixture models in order to seamlessly introduce their applications in subsequent sections. However, we clearly realize our inability to cover all the useful and interesting results concerning this field, given the limited scope of this discussion, e.g., the separate analysis of the generation of Dirichlet samples by stick-breaking and Polya's urn approaches. We refer the reader to the literature on the Dirichlet process mixture model for a much more detailed introduction to the related fields. Some excellent examples include (Frigyik et al., 2010; Murphy, 2012; Gelman et al., 2014; Hoff, 2009). This survey is primarily a summary of the purpose and significance of important background and techniques for Gaussian mixture models, e.g., the Dirichlet prior and the Chinese restaurant process, and most importantly the origin and complexity of the methods, which shed light on their modern applications. The mathematical prerequisite is a first course in probability. Other than this modest background, the development is self-contained, with rigorous proofs provided throughout.
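The Chinese restaurant process mentioned above can be sketched in a few lines. This is a minimal illustration, not the survey's own code: the function name `chinese_restaurant_process`, the concentration parameter `alpha`, and the `seed` argument are all assumptions introduced here.

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample table assignments under a CRP with concentration alpha > 0.

    Hypothetical helper for illustration only.
    Returns (assignments, table_sizes), where assignments[i] is the table
    index of customer i and table_sizes[k] is the occupancy of table k.
    """
    rng = random.Random(seed)
    assignments = []
    table_sizes = []
    for _ in range(n_customers):
        # An existing table k is chosen with weight table_sizes[k];
        # a new table is opened with weight alpha.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if table == len(table_sizes):
            table_sizes.append(1)  # open a new table
        else:
            table_sizes[table] += 1
        assignments.append(table)
    return assignments, table_sizes
```

Larger values of `alpha` make new tables (clusters) more likely, which is how the number of mixture components can grow with the data in the infinite mixture setting.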
The Bayesian paradigm has the potential to solve core issues of deep neural networks such as poor calibration and data inefficiency. Alas, scaling Bayesian inference to large weight spaces often requires restrictive approximations. In this work, we show that it suffices to perform inference over a small subset of model weights in order to obtain accurate predictive posteriors. The other weights are kept as point estimates. This subnetwork inference framework enables us to use expressive, otherwise intractable, posterior approximations over such subsets. In particular, we implement subnetwork linearized Laplace: We first obtain a MAP estimate of all weights and then infer a full-covariance Gaussian posterior over a subnetwork. We propose a subnetwork selection strategy that aims to maximally preserve the model's predictive uncertainty. Empirically, our approach is effective compared to ensembles and less expressive posterior approximations over full networks.
We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft-Q-learning, and maximum entropy policy gradient, and is closely related to optimism and count based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
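The Boltzmann exploration policy referred to above can be illustrated with a generic softmax over action values. This is a sketch under assumptions: `boltzmann_policy` and its arguments are hypothetical names, the input `values` merely stands in for the K-values at one state, and computing the K-values themselves is not shown.

```python
import math


def boltzmann_policy(values, temperature):
    """Return action probabilities proportional to exp(value / temperature).

    Hypothetical helper: `values` is a list of per-action values at a single
    state and `temperature` is a positive scalar.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtracting the maximum improves numerical stability and does not
    # change the resulting probabilities.
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

A small temperature concentrates probability on the highest-valued action, while a large temperature pushes the policy toward uniform exploration.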
We propose a new method of estimation in topic models that is not a variation on the existing simplex-finding algorithms and that estimates the number of topics K from the observed data. We derive new finite sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any number of documents (n), individual document length (N_i), dictionary size (p) and number of topics (K), and both p and K are allowed to increase with n, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the current ones, although we start out with a computational and theoretical disadvantage of not knowing the correct number of topics K, while we provide the competing methods with the correct value in our simulations.
Many problems in signal processing reduce to nonparametric function estimation. We propose a new methodology, piecewise convex fitting (PCF), and give a two-stage adaptive estimate. In the first stage, the number and location of the change points are estimated using strong smoothing. In the second stage, a constrained smoothing spline fit is performed with the smoothing level chosen to minimize the MSE. The imposed constraint is that a single change point occurs in a region about each empirical change point of the first-stage estimate. This constraint is equivalent to requiring that the third derivative of the second-stage estimate has a single sign in a small neighborhood about each first-stage change point. We sketch how PCF may be applied to signal recovery, instantaneous frequency estimation, surface reconstruction, image segmentation, spectral estimation and multivariate adaptive regression.
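To make the notion of a convexity change point concrete, the sketch below counts sign changes in the second differences of an already smoothed, evenly spaced sequence. This is a deliberately simple stand-in and not the PCF procedure itself: it performs no smoothing and no constrained spline fit, and the function name `convexity_change_points` is an assumption.

```python
def convexity_change_points(y):
    """Return indices into y where the second difference changes sign.

    Hypothetical helper. Each reported index is the first point whose
    second difference carries the new sign; zero second differences are
    skipped rather than treated as a sign.
    """
    second = [y[i - 1] - 2 * y[i] + y[i + 1] for i in range(1, len(y) - 1)]
    changes = []
    prev_sign = 0
    for j, d in enumerate(second):
        sign = (d > 0) - (d < 0)
        if sign == 0:
            continue
        if prev_sign != 0 and sign != prev_sign:
            changes.append(j + 1)  # second[j] is centred on y[j + 1]
        prev_sign = sign
    return changes
```

On noisy data this count is unstable, which is one reason a first stage with strong smoothing, as described above, is needed before change points are located.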