Several widely-used first-order saddle point optimization methods yield an identical continuous-time ordinary differential equation (ODE) to that of the Gradient Descent Ascent (GDA) method when derived naively. However, their convergence properties are very different even on simple bilinear games. We use a technique from fluid dynamics called High-Resolution Differential Equations (HRDEs) to design ODEs of several saddle point optimization methods. On bilinear games, the convergence properties of the derived HRDEs correspond to that of the starting discrete methods. Using these techniques, we show that the HRDE of Optimistic Gradient Descent Ascent (OGDA) has last-iterate convergence for general monotone variational inequalities. To our knowledge, this is the first continuous-time dynamics shown to converge for such a general setting. Moreover, we provide the rates for the best-iterate convergence of the OGDA method, relying solely on the first-order smoothness of the monotone operator.
The monotone variational inequality is a central problem in mathematical programming that unifies and generalizes many important settings such as smooth convex optimization, two-player zero-sum games, convex-concave saddle point problems, etc. The extragradient method by Korpelevich [1976] is one of the most popular methods for solving monotone variational inequalities. Despite its long history and intensive attention from the optimization and machine learning community, the following major problem remains open. What is the last-iterate convergence rate of the extragradient method for monotone and Lipschitz variational inequalities with constraints? We resolve this open problem by showing a tight $O\left(\frac{1}{\sqrt{T}}\right)$ last-iterate convergence rate for arbitrary convex feasible sets, which matches the lower bound by Golowich et al. [2020]. Our rate is measured in terms of the standard gap function. The technical core of our result is the monotonicity of a new performance measure -- the tangent residual, which can be viewed as an adaptation of the norm of the operator that takes the local constraints into account. To establish the monotonicity, we develop a new approach that combines the power of the sum-of-squares programming with the low dimensionality of the update rule of the extragradient method. We believe our approach has many additional applications in the analysis of iterative methods.
Stochastic partial differential equations (SPDEs) are the mathematical tool of choice for modelling spatiotemporal PDE-dynamics under the influence of randomness. Based on the notion of mild solution of an SPDE, we introduce a novel neural architecture to learn solution operators of PDEs with (possibly stochastic) forcing from partially observed data. The proposed Neural SPDE model provides an extension to two popular classes of physics-inspired architectures. On the one hand, it extends Neural CDEs and variants -- continuous-time analogues of RNNs -- in that it is capable of processing incoming sequential information arriving irregularly in time and observed at arbitrary spatial resolutions. On the other hand, it extends Neural Operators -- generalizations of neural networks to model mappings between spaces of functions -- in that it can parameterize solution operators of SPDEs depending simultaneously on the initial condition and a realization of the driving noise. By performing operations in the spectral domain, we show how a Neural SPDE can be evaluated in two ways, either by calling an ODE solver (emulating a spectral Galerkin scheme), or by solving a fixed point problem. Experiments on various semilinear SPDEs, including the stochastic Navier-Stokes equations, demonstrate how the Neural SPDE model is capable of learning complex spatiotemporal dynamics in a resolution-invariant way, with better accuracy and lighter training data requirements compared to alternative models, and up to 3 orders of magnitude faster than traditional solvers.
We consider the question of adaptive data analysis within the framework of convex optimization. We ask how many samples are needed in order to compute $\epsilon$-accurate estimates of $O(1/\epsilon^2)$ gradients queried by gradient descent, and we provide two intermediate answers to this question. First, we show that for a general analyst (not necessarily gradient descent) $\Omega(1/\epsilon^3)$ samples are required. This rules out the possibility of a foolproof mechanism. Our construction builds upon a new lower bound (that may be of interest of its own right) for an analyst that may ask several non adaptive questions in a batch of fixed and known $T$ rounds of adaptivity and requires a fraction of true discoveries. We show that for such an analyst $\Omega (\sqrt{T}/\epsilon^2)$ samples are necessary. Second, we show that, under certain assumptions on the oracle, in an interaction with gradient descent $\tilde \Omega(1/\epsilon^{2.5})$ samples are necessary. Our assumptions are that the oracle has only \emph{first order access} and is \emph{post-hoc generalizing}. First order access means that it can only compute the gradients of the sampled function at points queried by the algorithm. Our assumption of \emph{post-hoc generalization} follows from existing lower bounds for statistical queries. More generally then, we provide a generic reduction from the standard setting of statistical queries to the problem of estimating gradients queried by gradient descent. These results are in contrast with classical bounds that show that with $O(1/\epsilon^2)$ samples one can optimize the population risk to accuracy of $O(\epsilon)$ but, as it turns out, with spurious gradients.
Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed (non-vanishing) learning rate - in the special case of homogeneous linear classifiers with smooth monotone loss functions, optimized on linearly separable data. Previous works assumed either a vanishing learning rate, iterate averaging, or loss assumptions that do not hold for monotone loss functions used for classification, such as the logistic loss. We prove our result on a fixed dataset, both for sampling with or without replacement. Furthermore, for logistic loss (and similar exponentially-tailed losses), we prove that with SGD the weight vector converges in direction to the $L_2$ max margin vector as $O(1/\log(t))$ for almost all separable datasets, and the loss converges as $O(1/t)$ - similarly to gradient descent. Lastly, we examine the case of a fixed learning rate proportional to the minibatch size. We prove that in this case, the asymptotic convergence rate of SGD (with replacement) does not depend on the minibatch size in terms of epochs, if the support vectors span the data. These results may suggest an explanation to similar behaviors observed in deep networks, when trained with SGD.
This paper makes the first attempt to apply newly developed upwind GFDM for the meshless solution of two-phase porous flow equations. In the presented method, node cloud is used to flexibly discretize the computational domain, instead of complicated mesh generation. Combining with moving least square approximation and local Taylor expansion, spatial derivatives of oil-phase pressure at a node are approximated by generalized difference operators in the local influence domain of the node. By introducing the first-order upwind scheme of phase relative permeability, and combining the discrete boundary conditions, fully-implicit GFDM-based nonlinear discrete equations of the immiscible two-phase porous flow are obtained and solved by the nonlinear solver based on the Newton iteration method with the automatic differentiation, to avoid the additional computational cost and possible computational instability caused by sequentially coupled scheme. Two numerical examples are implemented to test the computational performances of the presented method. Detailed error analysis finds the two sources of the calculation error, roughly studies the convergence order thus find that the low-order error of GFDM makes the convergence order of GFDM lower than that of FDM when node spacing is small, and points out the significant effect of the symmetry or uniformity of the node collocation in the node influence domain on the accuracy of generalized difference operators, and the radius of the node influence domain should be small to achieve high calculation accuracy, which is a significant difference between the studied hyperbolic two-phase porous flow problem and the elliptic problems when GFDM is applied.
Multigrid is a powerful solver for large-scale linear systems arising from discretized partial differential equations. The convergence theory of multigrid methods for symmetric positive definite problems has been well developed over the past decades, while, for nonsymmetric problems, such theory is still not mature. As a foundation for multigrid analysis, two-grid convergence theory plays an important role in motivating multigrid algorithms. Regarding two-grid methods for nonsymmetric problems, most previous works focus on the spectral radius of iteration matrix or rely on convergence measures that are typically difficult to compute in practice. Moreover, the existing results are confined to two-grid methods with exact solution of the coarse-grid system. In this paper, we analyze the convergence of a two-grid method for nonsymmetric positive definite problems (e.g., linear systems arising from the discretizations of convection-diffusion equations). In the case of exact coarse solver, we establish an elegant identity for characterizing two-grid convergence factor, which is measured by a smoother-induced norm. The identity can be conveniently used to derive a class of optimal restriction operators and analyze how the convergence factor is influenced by restriction. More generally, we present some convergence estimates for an inexact variant of the two-grid method, in which both linear and nonlinear coarse solvers are considered.
We introduce a novel methodology for particle filtering in dynamical systems where the evolution of the signal of interest is described by a SDE and observations are collected instantaneously at prescribed time instants. The new approach includes the discretisation of the SDE and the design of efficient particle filters for the resulting discrete-time state-space model. The discretisation scheme converges with weak order 1 and it is devised to create a sequential dependence structure along the coordinates of the discrete-time state vector. We introduce a class of space-sequential particle filters that exploits this structure to improve performance when the system dimension is large. This is numerically illustrated by a set of computer simulations for a stochastic Lorenz 96 system with additive noise. The new space-sequential particle filters attain approximately constant estimation errors as the dimension of the Lorenz 96 system is increased, with a computational cost that increases polynomially, rather than exponentially, with the system dimension. Besides the new numerical scheme and particle filters, we provide in this paper a general framework for discrete-time filtering in continuous-time dynamical systems described by a SDE and instantaneous observations. Provided that the SDE is discretised using a weakly-convergent scheme, we prove that the marginal posterior laws of the resulting discrete-time state-space model converge to the posterior marginal posterior laws of the original continuous-time state-space model under a suitably defined metric. This result is general and not restricted to the numerical scheme or particle filters specifically studied in this manuscript.
The minimum energy path (MEP) describes the mechanism of reaction, and the energy barrier along the path can be used to calculate the reaction rate in thermal systems. The nudged elastic band (NEB) method is one of the most commonly used schemes to compute MEPs numerically. It approximates an MEP by a discrete set of configuration images, where the discretization size determines both computational cost and accuracy of the simulations. In this paper, we consider a discrete MEP to be a stationary state of the NEB method and prove an optimal convergence rate of the discrete MEP with respect to the number of images. Numerical simulations for the transitions of some several proto-typical model systems are performed to support the theory.
We prove linear convergence of gradient descent to a global minimum for the training of deep residual networks with constant layer width and smooth activation function. We further show that the trained weights, as a function of the layer index, admits a scaling limit which is H\"older continuous as the depth of the network tends to infinity. The proofs are based on non-asymptotic estimates of the loss function and of norms of the network weights along the gradient descent path. We illustrate the relevance of our theoretical results to practical settings using detailed numerical experiments on supervised learning problems.
The conjoining of dynamical systems and deep learning has become a topic of great interest. In particular, neural differential equations (NDEs) demonstrate that neural networks and differential equation are two sides of the same coin. Traditional parameterised differential equations are a special case. Many popular neural network architectures, such as residual networks and recurrent networks, are discretisations. NDEs are suitable for tackling generative problems, dynamical systems, and time series (particularly in physics, finance, ...) and are thus of interest to both modern machine learning and traditional mathematical modelling. NDEs offer high-capacity function approximation, strong priors on model space, the ability to handle irregular data, memory efficiency, and a wealth of available theory on both sides. This doctoral thesis provides an in-depth survey of the field. Topics include: neural ordinary differential equations (e.g. for hybrid neural/mechanistic modelling of physical systems); neural controlled differential equations (e.g. for learning functions of irregular time series); and neural stochastic differential equations (e.g. to produce generative models capable of representing complex stochastic dynamics, or sampling from complex high-dimensional distributions). Further topics include: numerical methods for NDEs (e.g. reversible differential equations solvers, backpropagation through differential equations, Brownian reconstruction); symbolic regression for dynamical systems (e.g. via regularised evolution); and deep implicit models (e.g. deep equilibrium models, differentiable optimisation). We anticipate this thesis will be of interest to anyone interested in the marriage of deep learning with dynamical systems, and hope it will provide a useful reference for the current state of the art.