Predictive coding (PC) accounts of perception now form one of the dominant computational theories of the brain, where they prescribe a general algorithm for inference and learning over hierarchical latent probabilistic models. Despite this, they have enjoyed little export to the broader field of machine learning, where comparable generative modelling techniques have flourished. In part, this has been due to the poor performance of models trained with PC when evaluated by both sample quality and marginal likelihood. By adopting the perspective of PC as a variational Bayes algorithm under the Laplace approximation, we trace these deficits to the exclusion of an associated Hessian term in the PC objective function, which would otherwise regularise the sharpness of the probability landscape and prevent over-certainty in the approximate posterior. To remedy this, we make three primary contributions: we begin by suggesting a simple Monte Carlo estimated evidence lower bound which relies on sampling from the Hessian-parameterised variational posterior. We then derive a novel block-diagonal approximation to the full Hessian matrix that has lower memory requirements and favourable mathematical properties. Lastly, we present an algorithm that combines our method with standard PC to reduce memory complexity further. We evaluate models trained with our approach against the standard PC framework on image benchmark datasets. Our approach produces higher log-likelihoods and qualitatively better samples that more closely capture the diversity of the data-generating distribution.
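The core idea of the first contribution above, a Monte Carlo ELBO with samples drawn from a Hessian-parameterised Gaussian posterior, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `mc_elbo` and the use of a dense Cholesky factor (rather than the paper's block-diagonal approximation) are my own simplifying assumptions.

```python
import numpy as np

def mc_elbo(log_joint, mu, hess, n_samples=64, rng=None):
    """Monte Carlo ELBO for a Laplace-style Gaussian posterior q(z) = N(mu, H^{-1}).

    log_joint : callable returning log p(x, z) for a latent vector z
    mu        : posterior mode, shape (d,)
    hess      : positive-definite Hessian H of -log p(x, z) at mu, shape (d, d)
    """
    rng = np.random.default_rng(rng)
    d = mu.shape[0]
    L = np.linalg.cholesky(hess)                 # H = L L^T
    eps = rng.standard_normal((n_samples, d))
    # z = mu + L^{-T} eps gives z ~ N(mu, H^{-1})
    z = mu + np.linalg.solve(L.T, eps.T).T
    energy = np.mean([log_joint(zi) for zi in z])
    # Gaussian entropy: 0.5 * (d * log(2*pi*e) - log det H)
    entropy = 0.5 * (d * np.log(2 * np.pi * np.e) - 2.0 * np.sum(np.log(np.diag(L))))
    return energy + entropy
```

When the model is exactly Gaussian the bound is tight, which gives a quick sanity check: with a standard-normal joint and q matching it, the estimate should be close to zero (the log evidence).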
We present ISAAC (Input-baSed ApproximAte Curvature), a novel method that conditions the gradient using selected second-order information and has an asymptotically vanishing computational overhead, assuming a batch size smaller than the number of neurons. We show that it is possible to compute a good conditioner based on only the input to a respective layer without a substantial computational overhead. The proposed method allows effective training even in small-batch stochastic regimes, which makes it competitive with both first-order and second-order methods.
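The "small batch size relative to layer width" assumption above is what makes input-based conditioning cheap: a Woodbury-style identity reduces the inverse of an (n_in, n_in) damped input second moment to a (b, b) solve. The sketch below illustrates this generic mechanism; it is not ISAAC's exact conditioner, and the function name and damping scheme are assumptions of mine.

```python
import numpy as np

def input_conditioned_grad(grad_w, x, lam=0.1):
    """Precondition a layer's weight gradient with its damped input second moment.

    grad_w : (n_out, n_in) gradient of the loss w.r.t. the layer weights
    x      : (b, n_in) batch of layer inputs, with b typically << n_in
    lam    : Tikhonov damping

    Computes grad_w @ (x^T x / b + lam I)^{-1} via the Woodbury identity,
    so only a (b, b) linear system is solved instead of an (n_in, n_in) one.
    """
    b = x.shape[0]
    # Woodbury: (lam I + x^T x / b)^{-1} = (1/lam) (I - x^T (b lam I + x x^T)^{-1} x)
    small = b * lam * np.eye(b) + x @ x.T          # (b, b) system
    correction = x.T @ np.linalg.solve(small, x)   # implicit (n_in, n_in) correction
    return (grad_w - grad_w @ correction) / lam
```

For small widths one can check the identity directly against the explicit inverse, which is how the asymptotically vanishing overhead claim becomes plausible: the cost of the solve scales with the batch size, not the layer width.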
Approximate Bayesian Computation (ABC) is a widely applicable and popular approach to estimating unknown parameters of mechanistic models. As ABC analyses are computationally expensive, parallelization on high-performance infrastructure is often necessary. However, the existing parallelization strategies leave resources unused at times and thus do not yet leverage them optimally. We present look-ahead scheduling, a wall-time minimizing parallelization strategy for ABC Sequential Monte Carlo algorithms, which utilizes all available resources at practically all times by proactively sampling for prospective tasks. Our strategy can be integrated with, e.g., adaptive distance function and summary statistic selection schemes, which is essential in practice. Evaluation of the strategy on different problems and numbers of parallel cores reveals speed-ups of typically 10-20% and up to 50% compared to the best established approach. Thus, the proposed strategy substantially improves the cost and run-time efficiency of ABC methods on high-performance infrastructure.
We propose a posterior for Bayesian Likelihood-Free Inference (LFI) based on generalized Bayesian inference. To define the posterior, we use Scoring Rules (SRs), which evaluate probabilistic models given an observation. In LFI, we can sample from the model but not evaluate the likelihood; hence, we employ SRs which admit unbiased empirical estimates. We use the Energy and Kernel SRs, for which our posterior enjoys consistency in a well-specified setting and outlier robustness. We perform inference with pseudo-marginal (PM) Markov Chain Monte Carlo (MCMC) or stochastic-gradient (SG) MCMC. While PM-MCMC works satisfactorily for simple setups, it mixes poorly for concentrated targets. Conversely, SG-MCMC requires differentiating the simulator model but improves performance over PM-MCMC when both work and scales to higher-dimensional setups as it is rejection-free. Although both techniques target the SR posterior approximately, the error diminishes as the number of model simulations at each MCMC step increases. In our simulations, we employ automatic differentiation to effortlessly differentiate the simulator model. We compare our posterior with related approaches on standard benchmarks and a chaotic dynamical system from meteorology, for which SG-MCMC allows inferring the parameters of a neural network used to parametrize a part of the update equations of the dynamical system.
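The Energy Score mentioned above is one concrete example of a scoring rule that admits an unbiased empirical estimate from simulator draws alone, which is what makes it usable in the likelihood-free setting. The following sketch implements the standard all-pairs estimator of $\mathrm{ES}(P, y) = 2\,\mathbb{E}\|X - y\| - \mathbb{E}\|X - X'\|$; the function name is mine, and the paper's posterior additionally involves a weight on the score that is omitted here.

```python
import numpy as np

def energy_score(samples, y):
    """Unbiased empirical Energy Score estimate from i.i.d. simulator draws.

    samples : (m, d) i.i.d. draws from the model
    y       : (d,) observation

    Estimates ES(P, y) = 2 E||X - y|| - E||X - X'|| using all ordered pairs,
    excluding the zero diagonal so the second term is unbiased.
    """
    m = samples.shape[0]
    term1 = 2.0 * np.mean(np.linalg.norm(samples - y, axis=1))
    diffs = samples[:, None, :] - samples[None, :, :]
    pair = np.linalg.norm(diffs, axis=-1)
    term2 = pair.sum() / (m * (m - 1))
    return term1 - term2
```

Because only draws from the model enter the estimate, it plugs directly into a generalized posterior of the form $\pi(\theta \mid y) \propto \pi(\theta)\exp\{-w\,\mathrm{ES}(P_\theta, y)\}$ without ever evaluating the likelihood.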
In this work, we use Deep Gaussian Processes (DGPs) as statistical surrogates for stochastic processes with complex distributions. Conventional inferential methods for DGP models can suffer from high computational complexity, as they require large-scale operations with kernel matrices for training and inference. In this work, we propose an efficient scheme for accurate inference and efficient training based on a class of Gaussian Processes called Tensor Markov Gaussian Processes (TMGPs). We construct an induced approximation of TMGPs referred to as the hierarchical expansion. Next, we develop a deep TMGP (DTMGP) model as the composition of multiple hierarchical expansions of TMGPs. The proposed DTMGP model has the following properties: (1) the outputs of each activation function are deterministic, while the weights are drawn independently from a standard Gaussian distribution; (2) in training or prediction, only polylog(M) (out of M) activation functions have non-zero outputs, which significantly boosts computational efficiency. Our numerical experiments on synthetic models and real datasets show the superior computational efficiency of DTMGP over existing DGP models.
In this article on variational regularization for ill-posed nonlinear problems, we are once again discussing the consequences of an oversmoothing penalty term. This means, in our model, that the sought solution of the considered nonlinear operator equation does not belong to the domain of definition of the penalty functional. In the past years, such variational regularization has been investigated comprehensively in Hilbert scales, but rarely in a Banach space setting. Our present results aim to establish a theoretical justification of oversmoothing regularization in Banach scales. This new study includes convergence rates results for a priori and a posteriori choices of the regularization parameter, both for H\"older-type smoothness and low-order-type smoothness. An illustrative example is intended to indicate the specificity of the occurring non-reflexive Banach spaces.
We study numerical integration over bounded regions in $\mathbb{R}^s, s\ge1$ with respect to some probability measure. We replace random sampling with quasi-Monte Carlo methods, where the underlying point set is derived from deterministic constructions that aim to fill the space more evenly than random points. Such quasi-Monte Carlo point sets are ordinarily designed for the uniform measure, and the theory only works for product measures when a coordinate-wise transformation is applied. Going beyond this setting, we first consider the case where the target density is a mixture distribution in which each term of the mixture comes from a product distribution. Next we consider target densities which can be approximated with such mixture distributions. We require the approximation to be a sum of coordinate-wise products that is positive everywhere (so that it can be re-scaled to a probability density function). We use tensor product hat function approximations for this purpose here, since a hat function approximation of a positive function is itself positive. We also study more complex algorithms, where we first approximate the target density with a general Gaussian mixture distribution and approximate the mixtures with an adaptive hat function approximation on rotated intervals. The Gaussian mixture approximation allows us to locate the essential parts of the target density, whereas the adaptive hat function approximation allows us to approximate the finer structure of the target density. We prove convergence rates for each of the integration techniques based on quasi-Monte Carlo sampling for integrands with bounded partial mixed derivatives. The employed algorithms are based on digital $(t,s)$-sequences over the finite field $\mathbb{F}_2$ and an inversion method. Numerical examples illustrate the performance of the algorithms for some target densities and integrands.
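The basic pipeline above, a deterministic low-discrepancy point set pushed through a coordinate-wise inverse CDF, can be illustrated in a few lines. This is a toy sketch under my own assumptions: it uses a 2D Halton point set (whose base-2 coordinate is a digital sequence over $\mathbb{F}_2$) rather than the paper's $(t,s)$-sequences, and a product of Exp(1) marginals as a stand-in target.

```python
import numpy as np

def radical_inverse(n, base):
    """Van der Corput radical inverse of the integer n in the given base."""
    inv, f = 0.0, 1.0 / base
    while n > 0:
        inv += (n % base) * f
        n //= base
        f /= base
    return inv

def qmc_expectation(f, n=4096):
    """Estimate E[f(X, Y)] for independent Exp(1) coordinates.

    Uses a 2D Halton point set (bases 2 and 3) in [0,1)^2 and the
    coordinate-wise inversion method u -> -log(1 - u).
    """
    u = np.array([[radical_inverse(i, 2), radical_inverse(i, 3)]
                  for i in range(1, n + 1)])
    x = -np.log1p(-u)          # inverse CDF of the Exp(1) distribution
    return np.mean(f(x))
```

The mixture-based constructions in the abstract generalize exactly this step: once the target is (approximately) a sum of coordinate-wise products, each mixture component admits such a coordinate-wise inversion.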
For $h$-FEM discretisations of the Helmholtz equation with wavenumber $k$, we obtain $k$-explicit analogues of the classic local FEM error bounds of [Nitsche, Schatz 1974], [Wahlbin 1991], [Demlow, Guzm\'an, Schatz 2011], showing that these bounds hold with constants independent of $k$, provided one works in Sobolev norms weighted with $k$ in the natural way. We prove two main results: (i) a bound on the local $H^1$ error by the best approximation error plus the $L^2$ error, both on a slightly larger set, and (ii) the bound in (i) but now with the $L^2$ error replaced by the error in a negative Sobolev norm. The result (i) is valid for shape-regular triangulations, and is the $k$-explicit analogue of the main result of [Demlow, Guzm\'an, Schatz, 2011]. The result (ii) is valid when the mesh is locally quasi-uniform on the scale of the wavelength (i.e., on the scale of $k^{-1}$) and is the $k$-explicit analogue of the results of [Nitsche, Schatz 1974], [Wahlbin 1991]. Since our Sobolev spaces are weighted with $k$ in the natural way, the result (ii) indicates that the Helmholtz FEM solution is locally quasi-optimal modulo low frequencies (i.e., frequencies $\lesssim k$). Numerical experiments confirm this property, and also highlight interesting propagation phenomena in the Helmholtz FEM error.
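For concreteness, the $k$-weighted Sobolev norms referred to above are, in the notation standard in the Helmholtz literature (on a domain $\Omega$),
$$
\|v\|^2_{H^1_k(\Omega)} := \|\nabla v\|^2_{L^2(\Omega)} + k^2 \|v\|^2_{L^2(\Omega)},
$$
so that the two terms balance for oscillatory functions with frequency $\sim k$; this is the sense in which the bounds hold with constants independent of $k$.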
Likelihood-free inference for simulator-based statistical models has developed rapidly from its infancy to a useful tool for practitioners. However, models with more than a handful of parameters still generally remain a challenge for Approximate Bayesian Computation (ABC) based inference. To advance the possibilities for performing likelihood-free inference in higher dimensional parameter spaces, we introduce an extension of the popular Bayesian optimisation based approach to approximate discrepancy functions in a probabilistic manner which lends itself to an efficient exploration of the parameter space. Our approach achieves computational scalability for higher dimensional parameter spaces by using separate acquisition functions and discrepancies for each parameter. The efficient additive acquisition structure is combined with an exponentiated-loss likelihood to provide a misspecification-robust characterisation of the marginal posterior distribution for all model parameters. The method successfully performs computationally efficient inference in a 100-dimensional space on canonical examples and compares favourably to existing modularised ABC methods. We further illustrate the potential of this approach by fitting a bacterial transmission dynamics model to a real data set, which provides biologically coherent results on strain competition in a 30-dimensional parameter space.
We introduce a new computational framework for estimating parameters in generalized generalized linear models (GGLM), a class of models that extends the popular generalized linear models (GLM) to account for dependencies among observations in spatio-temporal data. The proposed approach uses a monotone operator-based variational inequality method to overcome non-convexity in parameter estimation and provide guarantees for parameter recovery. The results can be applied to GLM and GGLM, focusing on spatio-temporal models. We also present online instance-based bounds using martingale concentration inequalities. Finally, we demonstrate the performance of the algorithm using numerical simulations and a real data example for wildfire incidents.
Recently, graph neural networks have been gaining attention for simulating dynamical systems due to their inductive nature, which leads to zero-shot generalizability. Similarly, physics-informed inductive biases in deep-learning frameworks have been shown to give superior performance in learning the dynamics of physical systems. There is a growing volume of literature that attempts to combine these two approaches. Here, we evaluate the performance of thirteen different graph neural networks, namely, Hamiltonian and Lagrangian graph neural networks, graph neural ODEs, and their variants with explicit constraints and different architectures. We briefly explain the theoretical formulation, highlighting the similarities and differences in the inductive biases and graph architectures of these systems. We evaluate these models on spring, pendulum, gravitational, and 3D deformable solid systems to compare the performance in terms of rollout error, conserved quantities such as energy and momentum, and generalizability to unseen system sizes. Our study demonstrates that GNNs with additional inductive biases, such as explicit constraints and decoupling of kinetic and potential energies, exhibit significantly enhanced performance. Further, all the physics-informed GNNs exhibit zero-shot generalizability to system sizes an order of magnitude larger than the training system, thus providing a promising route to simulate large-scale realistic systems.