An invertible function is bi-Lipschitz if both the function and its inverse have bounded Lipschitz constants. Nowadays, most Normalizing Flows are bi-Lipschitz by design or by training to limit numerical errors (among other things). In this paper, we discuss the expressivity of bi-Lipschitz Normalizing Flows and identify several target distributions that are difficult to approximate using such models. Then, we characterize the expressivity of bi-Lipschitz Normalizing Flows by giving several lower bounds on the Total Variation distance between these particularly unfavorable distributions and their best possible approximation. Finally, we discuss potential remedies which include using more complex latent distributions.
Let $\sigma$ be a first-order signature and let $\mathbf{W}_n$ be the set of all $\sigma$-structures with domain $[n] = \{1, \ldots, n\}$. We can think of each structure in $\mathbf{W}_n$ as representing a "possible (state of the) world". By an inference framework we mean a class $\mathbf{F}$ of pairs $(\mathbb{P}, L)$, where $\mathbb{P} = (\mathbb{P}_n : n = 1, 2, 3, \ldots)$ and each $\mathbb{P}_n$ is a probability distribution on $\mathbb{W}_n$, and $L$ is a logic with truth values in the unit interval $[0, 1]$. From the point of view of probabilistic and logical expressivity one may consider an inference framework as optimal if it allows any pair $(\mathbb{P}, L)$ where $\mathbb{P} = (\mathbb{P}_n : n = 1, 2, 3, \ldots)$ is a sequence of probability distributions on $\mathbb{W}_n$ and $L$ is a logic. But from the point of view of using a pair $(\mathbb{P}, L)$ from such an inference framework for making inferences on $\mathbb{W}_n$ when $n$ is large we face the problem of computational complexity. This motivates looking for an "optimal" trade-off (in a given context) between expressivity and computational efficiency. We define a notion that an inference framework is "asymptotically at least as expressive" as another inference framework. This relation is a preorder and we describe a (strict) partial order on the equivalence classes of some inference frameworks that in our opinion are natural in the context of machine learning and artificial intelligence. The results have bearing on issues concerning efficient learning and probabilistic inference, but are also new instances of results in finite model theory about "almost sure elimination" of extra syntactic features (e.g quantifiers) beyond the connectives. Often such a result has a logical convergence law as a corollary.
In this work, we introduce a novel approach to formulating an artificial viscosity for shock capturing in nonlinear hyperbolic systems by utilizing the property that the solutions of hyperbolic conservation laws are not reversible in time in the vicinity of shocks. The proposed approach does not require any additional governing equations or a priori knowledge of the hyperbolic system in question, is independent of the mesh and approximation order, and requires the use of only one tunable parameter. The primary novelty is that the resulting artificial viscosity is unique for each component of the conservation law which is advantageous for systems in which some components exhibit discontinuities while others do not. The efficacy of the method is shown in numerical experiments of multi-dimensional hyperbolic conservation laws such as nonlinear transport, Euler equations, and ideal magnetohydrodynamics using a high-order discontinuous spectral element method on unstructured grids.
Conductivity imaging represents one of the most important tasks in medical imaging. In this work we develop a neural network based reconstruction technique for imaging the conductivity from the magnitude of the internal current density. It is achieved by formulating the problem as a relaxed weighted least-gradient problem, and then approximating its minimizer by standard fully connected feedforward neural networks. We derive bounds on two components of the generalization error, i.e., approximation error and statistical error, explicitly in terms of properties of the neural networks (e.g., depth, total number of parameters, and the bound of the network parameters). We illustrate the performance and distinct features of the approach on several numerical experiments. Numerically, it is observed that the approach enjoys remarkable robustness with respect to the presence of data noise.
We study the problem of testing whether a function $f: \mathbb{R}^n \to \mathbb{R}$ is a polynomial of degree at most $d$ in the \emph{distribution-free} testing model. Here, the distance between functions is measured with respect to an unknown distribution $\mathcal{D}$ over $\mathbb{R}^n$ from which we can draw samples. In contrast to previous work, we do not assume that $\mathcal{D}$ has finite support. We design a tester that given query access to $f$, and sample access to $\mathcal{D}$, makes $(d/\varepsilon)^{O(1)}$ many queries to $f$, accepts with probability $1$ if $f$ is a polynomial of degree $d$, and rejects with probability at least $2/3$ if every degree-$d$ polynomial $P$ disagrees with $f$ on a set of mass at least $\varepsilon$ with respect to $\mathcal{D}$. Our result also holds under mild assumptions when we receive only a polynomial number of bits of precision for each query to $f$, or when $f$ can only be queried on rational points representable using a logarithmic number of bits. Along the way, we prove a new stability theorem for multivariate polynomials that may be of independent interest.
This paper makes the first attempt to apply newly developed upwind GFDM for the meshless solution of two-phase porous flow equations. In the presented method, node cloud is used to flexibly discretize the computational domain, instead of complicated mesh generation. Combining with moving least square approximation and local Taylor expansion, spatial derivatives of oil-phase pressure at a node are approximated by generalized difference operators in the local influence domain of the node. By introducing the first-order upwind scheme of phase relative permeability, and combining the discrete boundary conditions, fully-implicit GFDM-based nonlinear discrete equations of the immiscible two-phase porous flow are obtained and solved by the nonlinear solver based on the Newton iteration method with the automatic differentiation, to avoid the additional computational cost and possible computational instability caused by sequentially coupled scheme. Two numerical examples are implemented to test the computational performances of the presented method. Detailed error analysis finds the two sources of the calculation error, roughly studies the convergence order thus find that the low-order error of GFDM makes the convergence order of GFDM lower than that of FDM when node spacing is small, and points out the significant effect of the symmetry or uniformity of the node collocation in the node influence domain on the accuracy of generalized difference operators, and the radius of the node influence domain should be small to achieve high calculation accuracy, which is a significant difference between the studied hyperbolic two-phase porous flow problem and the elliptic problems when GFDM is applied.
Let $X^{(n)}$ be an observation sampled from a distribution $P_{\theta}^{(n)}$ with an unknown parameter $\theta,$ $\theta$ being a vector in a Banach space $E$ (most often, a high-dimensional space of dimension $d$). We study the problem of estimation of $f(\theta)$ for a functional $f:E\mapsto {\mathbb R}$ of some smoothness $s>0$ based on an observation $X^{(n)}\sim P_{\theta}^{(n)}.$ Assuming that there exists an estimator $\hat \theta_n=\hat \theta_n(X^{(n)})$ of parameter $\theta$ such that $\sqrt{n}(\hat \theta_n-\theta)$ is sufficiently close in distribution to a mean zero Gaussian random vector in $E,$ we construct a functional $g:E\mapsto {\mathbb R}$ such that $g(\hat \theta_n)$ is an asymptotically normal estimator of $f(\theta)$ with $\sqrt{n}$ rate provided that $s>\frac{1}{1-\alpha}$ and $d\leq n^{\alpha}$ for some $\alpha\in (0,1).$ We also derive general upper bounds on Orlicz norm error rates for estimator $g(\hat \theta)$ depending on smoothness $s,$ dimension $d,$ sample size $n$ and the accuracy of normal approximation of $\sqrt{n}(\hat \theta_n-\theta).$ In particular, this approach yields asymptotically efficient estimators in some high-dimensional exponential models.
The stochastic gradient Langevin Dynamics is one of the most fundamental algorithms to solve sampling problems and non-convex optimization appearing in several machine learning applications. Especially, its variance reduced versions have nowadays gained particular attention. In this paper, we study two variants of this kind, namely, the Stochastic Variance Reduced Gradient Langevin Dynamics and the Stochastic Recursive Gradient Langevin Dynamics. We prove their convergence to the objective distribution in terms of KL-divergence under the sole assumptions of smoothness and Log-Sobolev inequality which are weaker conditions than those used in prior works for these algorithms. With the batch size and the inner loop length set to $\sqrt{n}$, the gradient complexity to achieve an $\epsilon$-precision is $\tilde{O}((n+dn^{1/2}\epsilon^{-1})\gamma^2 L^2\alpha^{-2})$, which is an improvement from any previous analyses. We also show some essential applications of our result to non-convex optimization.
One of the most important problems in system identification and statistics is how to estimate the unknown parameters of a given model. Optimization methods and specialized procedures, such as Empirical Minimization (EM) can be used in case the likelihood function can be computed. For situations where one can only simulate from a parametric model, but the likelihood is difficult or impossible to evaluate, a technique known as the Two-Stage (TS) Approach can be applied to obtain reliable parametric estimates. Unfortunately, there is currently a lack of theoretical justification for TS. In this paper, we propose a statistical decision-theoretical derivation of TS, which leads to Bayesian and Minimax estimators. We also show how to apply the TS approach on models for independent and identically distributed samples, by computing quantiles of the data as a first step, and using a linear function as the second stage. The proposed method is illustrated via numerical simulations.
We recall some of the history of the information-theoretic approach to deriving core results in probability theory and indicate parts of the recent resurgence of interest in this area with current progress along several interesting directions. Then we give a new information-theoretic proof of a finite version of de Finetti's classical representation theorem for finite-valued random variables. We derive an upper bound on the relative entropy between the distribution of the first $k$ in a sequence of $n$ exchangeable random variables, and an appropriate mixture over product distributions. The mixing measure is characterised as the law of the empirical measure of the original sequence, and de Finetti's result is recovered as a corollary. The proof is nicely motivated by the Gibbs conditioning principle in connection with statistical mechanics, and it follows along an appealing sequence of steps. The technical estimates required for these steps are obtained via the use of a collection of combinatorial tools known within information theory as `the method of types.'
The conjoining of dynamical systems and deep learning has become a topic of great interest. In particular, neural differential equations (NDEs) demonstrate that neural networks and differential equation are two sides of the same coin. Traditional parameterised differential equations are a special case. Many popular neural network architectures, such as residual networks and recurrent networks, are discretisations. NDEs are suitable for tackling generative problems, dynamical systems, and time series (particularly in physics, finance, ...) and are thus of interest to both modern machine learning and traditional mathematical modelling. NDEs offer high-capacity function approximation, strong priors on model space, the ability to handle irregular data, memory efficiency, and a wealth of available theory on both sides. This doctoral thesis provides an in-depth survey of the field. Topics include: neural ordinary differential equations (e.g. for hybrid neural/mechanistic modelling of physical systems); neural controlled differential equations (e.g. for learning functions of irregular time series); and neural stochastic differential equations (e.g. to produce generative models capable of representing complex stochastic dynamics, or sampling from complex high-dimensional distributions). Further topics include: numerical methods for NDEs (e.g. reversible differential equations solvers, backpropagation through differential equations, Brownian reconstruction); symbolic regression for dynamical systems (e.g. via regularised evolution); and deep implicit models (e.g. deep equilibrium models, differentiable optimisation). We anticipate this thesis will be of interest to anyone interested in the marriage of deep learning with dynamical systems, and hope it will provide a useful reference for the current state of the art.