In inverse problems, the parameters of a model are estimated based on observations of the model response. The Bayesian approach is powerful for solving such problems; one formulates a prior distribution for the parameter state that is updated with the observations to compute the posterior parameter distribution. Solving for the posterior distribution can be challenging when, e.g., prior and posterior significantly differ from one another and/or the parameter space is high-dimensional. We use a sequence of importance sampling measures that arise by tempering the likelihood to approach inverse problems exhibiting a significant distance between prior and posterior. Each importance sampling measure is identified by cross-entropy minimization as proposed in the context of Bayesian inverse problems in Engel et al. (2021). To efficiently address problems with high-dimensional parameter spaces we set up the minimization procedure in a low-dimensional subspace of the original parameter space. The principal idea is to analyse the spectrum of the second-moment matrix of the gradient of the log-likelihood function to identify a suitable subspace. Following Zahm et al. (2021), an upper bound on the Kullback-Leibler-divergence between full-dimensional and subspace posterior is provided, which can be utilized to determine the effective dimension of the inverse problem corresponding to a prescribed approximation error bound. We suggest heuristic criteria for optimally selecting the number of model and model gradient evaluations in each iteration of the importance sampling sequence. We investigate the performance of this approach using examples from engineering mechanics set in various parameter space dimensions.
Changing conditions or environments can cause system dynamics to vary over time. To ensure optimal control performance, controllers should adapt to these changes. When the underlying cause and time of change is unknown, we need to rely on online data for this adaptation. In this paper, we will use time-varying Bayesian optimization (TVBO) to tune controllers online in changing environments using appropriate prior knowledge on the control objective and its changes. Two properties are characteristic of many online controller tuning problems: First, they exhibit incremental and lasting changes in the objective due to changes to the system dynamics, e.g., through wear and tear. Second, the optimization problem is convex in the tuning parameters. Current TVBO methods do not explicitly account for these properties, resulting in poor tuning performance and many unstable controllers through over-exploration of the parameter space. We propose a novel TVBO forgetting strategy using Uncertainty-Injection (UI), which incorporates the assumption of incremental and lasting changes. The control objective is modeled as a spatio-temporal Gaussian process (GP) with UI through a Wiener process in the temporal domain. Further, we explicitly model the convexity assumptions in the spatial dimension through GP models with linear inequality constraints. In numerical experiments, we show that our model outperforms the state-of-the-art method in TVBO, exhibiting reduced regret and fewer unstable parameter configurations.
Implicit Processes (IPs) represent a flexible framework that can be used to describe a wide variety of models, from Bayesian neural networks, neural samplers and data generators to many others. IPs also allow for approximate inference in function-space. This change of formulation solves intrinsic degenerate problems of parameter-space approximate inference concerning the high number of parameters and their strong dependencies in large models. For this, previous works in the literature have attempted to employ IPs both to set up the prior and to approximate the resulting posterior. However, this has proven to be a challenging task. Existing methods that can tune the prior IP result in a Gaussian predictive distribution, which fails to capture important data patterns. By contrast, methods producing flexible predictive distributions by using another IP to approximate the posterior process cannot tune the prior IP to the observed data. We propose here the first method that can accomplish both goals. For this, we rely on an inducing-point representation of the prior IP, as often done in the context of sparse Gaussian processes. The result is a scalable method for approximate inference with IPs that can tune the prior IP parameters to the data, and that provides accurate non-Gaussian predictive distributions.
We propose a data-driven way to reduce the noise of covariance matrices of nonstationary systems. In the case of stationary systems, asymptotic approaches were proved to converge to the optimal solutions. Such methods produce eigenvalues that are highly dependent on the inputs, as common sense would suggest. Our approach proposes instead to use a set of eigenvalues totally independent from the inputs and that encode the long-term averaging of the influence of the future on present eigenvalues. Such an influence can be the predominant factor in nonstationary systems. Using real and synthetic data, we show that our data-driven method outperforms optimal methods designed for stationary systems for the filtering of both covariance matrix and its inverse, as illustrated by financial portfolio variance minimization, which makes out method generically relevant to many problems of multivariate inference.
Spatial data can exhibit dependence structures more complicated than can be represented using models that rely on the traditional assumptions of stationarity and isotropy. Several statistical methods have been developed to relax these assumptions. One in particular, the "spatial deformation approach" defines a transformation from the geographic space in which data are observed, to a latent space in which stationarity and isotropy are assumed to hold. Taking inspiration from this class of models, we develop a new model for spatially dependent data observed on graphs. Our method implies an embedding of the graph into Euclidean space wherein the covariance can be modeled using traditional covariance functions such as those from the Mat\'{e}rn family. This is done via a class of graph metrics compatible with such covariance functions. By estimating the edge weights which underlie these metrics, we can recover the "intrinsic distance" between nodes of a graph. We compare our model to existing methods for spatially dependent graph data, primarily conditional autoregressive (CAR) models and their variants and illustrate the advantages our approach has over traditional methods. We fit our model and competitors to bird abundance data for several species in North Carolina. We find that our model fits the data best, and provides insight into the interaction between species-specific spatial distributions and geography.
Boosting is one of the most significant developments in machine learning. This paper studies the rate of convergence of $L_2$Boosting, which is tailored for regression, in a high-dimensional setting. Moreover, we introduce so-called \textquotedblleft post-Boosting\textquotedblright. This is a post-selection estimator which applies ordinary least squares to the variables selected in the first stage by $L_2$Boosting. Another variant is \textquotedblleft Orthogonal Boosting\textquotedblright\ where after each step an orthogonal projection is conducted. We show that both post-$L_2$Boosting and the orthogonal boosting achieve the same rate of convergence as LASSO in a sparse, high-dimensional setting. We show that the rate of convergence of the classical $L_2$Boosting depends on the design matrix described by a sparse eigenvalue constant. To show the latter results, we derive new approximation results for the pure greedy algorithm, based on analyzing the revisiting behavior of $L_2$Boosting. We also introduce feasible rules for early stopping, which can be easily implemented and used in applied work. Our results also allow a direct comparison between LASSO and boosting which has been missing from the literature. Finally, we present simulation studies and applications to illustrate the relevance of our theoretical results and to provide insights into the practical aspects of boosting. In these simulation studies, post-$L_2$Boosting clearly outperforms LASSO.
We showcase a variety of functions and classes that implement sampling procedures with improved exploration of the parameter space assisted by machine learning. Special attention is paid to setting sane defaults with the objective that adjustments required by different problems remain minimal. This collection of routines can be employed for different types of analysis, from finding bounds on the parameter space to accumulating samples in areas of interest. In particular, we discuss two methods assisted by incorporating different machine learning models: regression and classification. We show that a machine learning classifier can provide higher efficiency for exploring the parameter space. Also, we introduce a boosting technique to improve the slow convergence at the start of the process. The use of these routines is better explained with the help of a few examples that illustrate the type of results one can obtain. We also include examples of the code used to obtain the examples as well as descriptions of the adjustments that can be made to adapt the calculation to other problems. We finalize by showing the impact of these techniques when exploring the parameter space of the two Higgs doublet model that matches the measured Higgs Boson signal strength. The code used for this paper and instructions on how to use it are available on the web.
Importance sampling (IS) is valuable in reducing the variance of Monte Carlo sampling for many areas, including finance, rare event simulation, and Bayesian inference. It is natural and obvious to combine quasi-Monte Carlo (QMC) methods with IS to achieve a faster rate of convergence. However, a naive replacement of Monte Carlo with QMC may not work well. This paper investigates the convergence rates of randomized QMC-based IS for estimating integrals with respect to a Gaussian measure, in which the IS measure is a Gaussian or $t$ distribution. We prove that if the target function satisfies the so-called boundary growth condition and the covariance matrix of the IS density has eigenvalues no smaller than 1, then randomized QMC with the Gaussian proposal has a root mean squared error of $O(N^{-1+\epsilon})$ for arbitrarily small $\epsilon>0$. Similar results of $t$ distribution as the proposal are also established. These sufficient conditions help to assess the effectiveness of IS in QMC. For some particular applications, we find that the Laplace IS, a very general approach to approximate the target function by a quadratic Taylor approximation around its mode, has eigenvalues smaller than 1, making the resulting integrand less favorable for QMC. From this point of view, when using Gaussian distributions as the IS proposal, a change of measure via Laplace IS may transform a favorable integrand into unfavorable one for QMC although the variance of Monte Carlo sampling is reduced. We also give some examples to verify our propositions and warn against naive replacement of MC with QMC under IS proposals. Numerical results suggest that using Laplace IS with $t$ distributions is more robust than that with Gaussian distributions.
Recent studies have utilized sparse classifications to predict categorical variables from high-dimensional brain activity signals to expose human's intentions and mental states, selecting the relevant features automatically in the model training process. However, existing sparse classification models will likely be prone to the performance degradation which is caused by noise inherent in the brain recordings. To address this issue, we aim to propose a new robust and sparse classification algorithm in this study. To this end, we introduce the correntropy learning framework into the automatic relevance determination based sparse classification model, proposing a new correntropy-based robust sparse logistic regression algorithm. To demonstrate the superior brain activity decoding performance of the proposed algorithm, we evaluate it on a synthetic dataset, an electroencephalogram (EEG) dataset, and a functional magnetic resonance imaging (fMRI) dataset. The extensive experimental results confirm that not only the proposed method can achieve higher classification accuracy in a noisy and high-dimensional classification task, but also it would select those more informative features for the decoding scenarios. Integrating the correntropy learning approach with the automatic relevance determination technique will significantly improve the robustness with respect to the noise, leading to more adequate robust sparse brain decoding algorithm. It provides a more powerful approach in the real-world brain activity decoding and the brain-computer interfaces.
The fundamental challenge of drawing causal inference is that counterfactual outcomes are not fully observed for any unit. Furthermore, in observational studies, treatment assignment is likely to be confounded. Many statistical methods have emerged for causal inference under unconfoundedness conditions given pre-treatment covariates, including propensity score-based methods, prognostic score-based methods, and doubly robust methods. Unfortunately for applied researchers, there is no `one-size-fits-all' causal method that can perform optimally universally. In practice, causal methods are primarily evaluated quantitatively on handcrafted simulated data. Such data-generative procedures can be of limited value because they are typically stylized models of reality. They are simplified for tractability and lack the complexities of real-world data. For applied researchers, it is critical to understand how well a method performs for the data at hand. Our work introduces a deep generative model-based framework, Credence, to validate causal inference methods. The framework's novelty stems from its ability to generate synthetic data anchored at the empirical distribution for the observed sample, and therefore virtually indistinguishable from the latter. The approach allows the user to specify ground truth for the form and magnitude of causal effects and confounding bias as functions of covariates. Thus simulated data sets are used to evaluate the potential performance of various causal estimation methods when applied to data similar to the observed sample. We demonstrate Credence's ability to accurately assess the relative performance of causal estimation techniques in an extensive simulation study and two real-world data applications from Lalonde and Project STAR studies.
The dominating NLP paradigm of training a strong neural predictor to perform one task on a specific dataset has led to state-of-the-art performance in a variety of applications (eg. sentiment classification, span-prediction based question answering or machine translation). However, it builds upon the assumption that the data distribution is stationary, ie. that the data is sampled from a fixed distribution both at training and test time. This way of training is inconsistent with how we as humans are able to learn from and operate within a constantly changing stream of information. Moreover, it is ill-adapted to real-world use cases where the data distribution is expected to shift over the course of a model's lifetime. The first goal of this thesis is to characterize the different forms this shift can take in the context of natural language processing, and propose benchmarks and evaluation metrics to measure its effect on current deep learning architectures. We then proceed to take steps to mitigate the effect of distributional shift on NLP models. To this end, we develop methods based on parametric reformulations of the distributionally robust optimization framework. Empirically, we demonstrate that these approaches yield more robust models as demonstrated on a selection of realistic problems. In the third and final part of this thesis, we explore ways of efficiently adapting existing models to new domains or tasks. Our contribution to this topic takes inspiration from information geometry to derive a new gradient update rule which alleviate catastrophic forgetting issues during adaptation.