When factorized approximations are used for variational inference (VI), they tend to underestimate the uncertainty -- as measured in various ways -- of the distributions they are meant to approximate. We consider two popular ways to measure the uncertainty deficit of VI: (i) the degree to which it underestimates the componentwise variance, and (ii) the degree to which it underestimates the entropy. To better understand these effects, and the relationship between them, we examine an informative setting where they can be explicitly (and elegantly) analyzed: the approximation of a Gaussian,~$p$, with a dense covariance matrix, by a Gaussian,~$q$, with a diagonal covariance matrix. We prove that $q$ always underestimates both the componentwise variance and the entropy of $p$, \textit{though not necessarily to the same degree}. Moreover we demonstrate that the entropy of $q$ is determined by the trade-off of two competing forces: it is decreased by the shrinkage of its componentwise variances (our first measure of uncertainty) but it is increased by the factorized approximation which delinks the nodes in the graphical model of $p$. We study various manifestations of this trade-off, notably one where, as the dimension of the problem grows, the per-component entropy gap between $p$ and $q$ becomes vanishingly small even though $q$ underestimates every componentwise variance by a constant multiplicative factor. We also use the shrinkage-delinkage trade-off to bound the entropy gap in terms of the problem dimension and the condition number of the correlation matrix of $p$. Finally we present empirical results on both Gaussian and non-Gaussian targets, the former to validate our analysis and the latter to explore its limitations.
We introduce a physics-driven deep latent variable model (PDDLVM) to learn simultaneously parameter-to-solution (forward) and solution-to-parameter (inverse) maps of parametric partial differential equations (PDEs). Our formulation leverages conventional PDE discretization techniques, deep neural networks, probabilistic modelling, and variational inference to assemble a fully probabilistic coherent framework. In the posited probabilistic model, both the forward and inverse maps are approximated as Gaussian distributions with a mean and covariance parameterized by deep neural networks. The PDE residual is assumed to be an observed random vector of value zero, hence we model it as a random vector with a zero mean and a user-prescribed covariance. The model is trained by maximizing the probability, that is the evidence or marginal likelihood, of observing a residual of zero by maximizing the evidence lower bound (ELBO). Consequently, the proposed methodology does not require any independent PDE solves and is physics-informed at training time, allowing the real-time solution of PDE forward and inverse problems after training. The proposed framework can be easily extended to seamlessly integrate observed data to solve inverse problems and to build generative models. We demonstrate the efficiency and robustness of our method on finite element discretized parametric PDE problems such as linear and nonlinear Poisson problems, elastic shells with complex 3D geometries, and time-dependent nonlinear and inhomogeneous PDEs using a physics-informed neural network (PINN) discretization. We achieve up to three orders of magnitude speed-up after training compared to traditional finite element method (FEM), while outputting coherent uncertainty estimates.
We introduce a new quantum algorithm for computing the Betti numbers of a simplicial complex. In contrast to previous quantum algorithms that work by estimating the eigenvalues of the combinatorial Laplacian, our algorithm is an instance of the generic Incremental Algorithm for computing Betti numbers that incrementally adds simplices to the simplicial complex and tests whether or not they create a cycle. In contrast to existing quantum algorithms for computing Betti numbers that work best when the complex has close to the maximal number of simplices, our algorithm works best for sparse complexes. To test whether a simplex creates a cycle, we introduce a quantum span-program algorithm. We show that the query complexity of our span program is parameterized by quantities called the effective resistance and effective capacitance of the boundary of the simplex. Unfortunately, we also prove upper and lower bounds on the effective resistance and capacitance, showing both quantities can be exponentially large with respect to the size of the complex, implying that our algorithm would have to run for exponential time to exactly compute Betti numbers. However, as a corollary to these bounds, we show that the spectral gap of the combinatorial Laplacian can be exponentially small. As the runtime of all previous quantum algorithms for computing Betti numbers are parameterized by the inverse of the spectral gap, our bounds show that all quantum algorithms for computing Betti numbers must run for exponentially long to exactly compute Betti numbers. Finally, we prove some novel formulas for effective resistance and effective capacitance to give intuition for these quantities.
Mixtures of factor analysers (MFA) models represent a popular tool for finding structure in data, particularly high-dimensional data. While in most applications the number of clusters, and especially the number of latent factors within clusters, is mostly fixed in advance, in the recent literature models with automatic inference on both the number of clusters and latent factors have been introduced. The automatic inference is usually done by assigning a nonparametric prior and allowing the number of clusters and factors to potentially go to infinity. The MCMC estimation is performed via an adaptive algorithm, in which the parameters associated with the redundant factors are discarded as the chain moves. While this approach has clear advantages, it also bears some significant drawbacks. Running a separate factor-analytical model for each cluster involves matrices of changing dimensions, which can make the model and programming somewhat cumbersome. In addition, discarding the parameters associated with the redundant factors could lead to a bias in estimating cluster covariance matrices. At last, identification remains problematic for infinite factor models. The current work contributes to the MFA literature by providing for the automatic inference on the number of clusters and the number of cluster-specific factors while keeping both cluster and factor dimensions finite. This allows us to avoid many of the aforementioned drawbacks of the infinite models. For the automatic inference on the cluster structure, we employ the dynamic mixture of finite mixtures (MFM) model. Automatic inference on cluster-specific factors is performed by assigning an exchangeable shrinkage process (ESP) prior to the columns of the factor loading matrices. The performance of the model is demonstrated on several benchmark data sets as well as real data applications.
Analysis of high-dimensional data, where the number of covariates is larger than the sample size, is a topic of current interest. In such settings, an important goal is to estimate the signal level $\tau^2$ and noise level $\sigma^2$, i.e., to quantify how much variation in the response variable can be explained by the covariates, versus how much of the variation is left unexplained. This thesis considers the estimation of these quantities in a semi-supervised setting, where for many observations only the vector of covariates $X$ is given with no responses $Y$. Our main research question is: how can one use the unlabeled data to better estimate $\tau^2$ and $\sigma^2$? We consider two frameworks: a linear regression model and a linear projection model in which linearity is not assumed. In the first framework, while linear regression is used, no sparsity assumptions on the coefficients are made. In the second framework, the linearity assumption is also relaxed and we aim to estimate the signal and noise levels defined by the linear projection. We first propose a naive estimator which is unbiased and consistent, under some assumptions, in both frameworks. We then show how the naive estimator can be improved by using zero-estimators, where a zero-estimator is a statistic arising from the unlabeled data, whose expected value is zero. In the first framework, we calculate the optimal zero-estimator improvement and discuss ways to approximate the optimal improvement. In the second framework, such optimality does no longer hold and we suggest two zero-estimators that improve the naive estimator although not necessarily optimally. Furthermore, we show that our approach reduces the variance for general initial estimators and we present an algorithm that potentially improves any initial estimator. Lastly, we consider four datasets and study the performance of our suggested methods.
Recently Chen and Poor initiated the study of learning mixtures of linear dynamical systems. While linear dynamical systems already have wide-ranging applications in modeling time-series data, using mixture models can lead to a better fit or even a richer understanding of underlying subpopulations represented in the data. In this work we give a new approach to learning mixtures of linear dynamical systems that is based on tensor decompositions. As a result, our algorithm succeeds without strong separation conditions on the components, and can be used to compete with the Bayes optimal clustering of the trajectories. Moreover our algorithm works in the challenging partially-observed setting. Our starting point is the simple but powerful observation that the classic Ho-Kalman algorithm is a close relative of modern tensor decomposition methods for learning latent variable models. This gives us a playbook for how to extend it to work with more complicated generative models.
For predictive modeling relying on Bayesian inversion, fully independent, or ``mean-field'', Gaussian distributions are often used as approximate probability density functions in variational inference since the number of variational parameters is twice the number of unknown model parameters. The resulting diagonal covariance structure coupled with unimodal behavior can be too restrictive when dealing with highly non-Gaussian behavior, including multimodality. High-fidelity surrogate posteriors in the form of Gaussian mixtures can capture any distribution to an arbitrary degree of accuracy while maintaining some analytical tractability. Variational inference with Gaussian mixtures with full-covariance structures suffers from a quadratic growth in variational parameters with the number of model parameters. Coupled with the existence of multiple local minima due to nonconvex trends in the loss functions often associated with variational inference, these challenges motivate the need for robust initialization procedures to improve the performance and scalability of variational inference with mixture models. In this work, we propose a method for constructing an initial Gaussian mixture model approximation that can be used to warm-start the iterative solvers for variational inference. The procedure begins with an optimization stage in model parameter space in which local gradient-based optimization, globalized through multistart, is used to determine a set of local maxima, which we take to approximate the mixture component centers. Around each mode, a local Gaussian approximation is constructed via the Laplace method. Finally, the mixture weights are determined through constrained least squares regression. Robustness and scalability are demonstrated using synthetic tests. The methodology is applied to an inversion problem in structural dynamics involving unknown viscous damping coefficients.
The role of cryptocurrencies within the financial systems has been expanding rapidly in recent years among investors and institutions. It is therefore crucial to investigate the phenomena and develop statistical methods able to capture their interrelationships, the links with other global systems, and, at the same time, the serial heterogeneity. For these reasons, this paper introduces hidden Markov regression models for jointly estimating quantiles and expectiles of cryptocurrency returns using regime-switching copulas. The proposed approach allows us to focus on extreme returns and describe their temporal evolution by introducing time-dependent coefficients evolving according to a latent Markov chain. Moreover to model their time-varying dependence structure, we consider elliptical copula functions defined by state-specific parameters. Maximum likelihood estimates are obtained via an Expectation-Maximization algorithm. The empirical analysis investigates the relationship between daily returns of five cryptocurrencies and major world market indices.
Traditional coverage grey-box fuzzers perform a breadth-first search of the state space of Program Under Test (PUT). This aimlessness wastes a lot of computing resources. Directed grey-box fuzzing focuses on the target of PUT and becomes one of the most popular topics of software testing. The early termination of unreachable test cases is a method to improve directed grey-box fuzzing. However, existing solutions have two problems: firstly, reachability analysis needs to introduce extra technologies (e.g., static analysis); secondly, the performance of reachability analysis and auxiliary technologies lack versatility. We propose FGo, a probabilistic exponential cut-the-loss directed grey-box fuzzer. FGo terminates unreachable test cases early with exponentially increasing probability. Compared to other technologies, FGo makes full use of the unreachable information contained in iCFG and doesn't generate any additional overhead caused by reachability analysis. Moreover, it is easy to generalize to all PUT. This strategy based on probability is perfectly adapted to the randomness of fuzzing. The experiment results show that FGo is 106% faster than AFLGo in reproducing crashes. We compare multiple parameters of probabilistic exponential cut-the-loss algorithm and analyze them in detail. In addition, for enhancing the inerpretability of FGo, this paper discusses the difference between the theoretical performance and the practical performance of probabilistic exponential cut-the-loss algorithm.
Normalization is known to help the optimization of deep neural networks. Curiously, different architectures require specialized normalization methods. In this paper, we study what normalization is effective for Graph Neural Networks (GNNs). First, we adapt and evaluate the existing methods from other domains to GNNs. Faster convergence is achieved with InstanceNorm compared to BatchNorm and LayerNorm. We provide an explanation by showing that InstanceNorm serves as a preconditioner for GNNs, but such preconditioning effect is weaker with BatchNorm due to the heavy batch noise in graph datasets. Second, we show that the shift operation in InstanceNorm results in an expressiveness degradation of GNNs for highly regular graphs. We address this issue by proposing GraphNorm with a learnable shift. Empirically, GNNs with GraphNorm converge faster compared to GNNs using other normalization. GraphNorm also improves the generalization of GNNs, achieving better performance on graph classification benchmarks.
Sampling methods (e.g., node-wise, layer-wise, or subgraph) has become an indispensable strategy to speed up training large-scale Graph Neural Networks (GNNs). However, existing sampling methods are mostly based on the graph structural information and ignore the dynamicity of optimization, which leads to high variance in estimating the stochastic gradients. The high variance issue can be very pronounced in extremely large graphs, where it results in slow convergence and poor generalization. In this paper, we theoretically analyze the variance of sampling methods and show that, due to the composite structure of empirical risk, the variance of any sampling method can be decomposed into \textit{embedding approximation variance} in the forward stage and \textit{stochastic gradient variance} in the backward stage that necessities mitigating both types of variance to obtain faster convergence rate. We propose a decoupled variance reduction strategy that employs (approximate) gradient information to adaptively sample nodes with minimal variance, and explicitly reduces the variance introduced by embedding approximation. We show theoretically and empirically that the proposed method, even with smaller mini-batch sizes, enjoys a faster convergence rate and entails a better generalization compared to the existing methods.