In this work, we adapt the {\em micro-macro} methodology to stochastic differential equations for the purpose of numerically solving oscillatory evolution equations. The models we consider cover a wide spectrum of regimes in which oscillations may be slow or fast. We show that, through an ad hoc transformation (the micro-macro decomposition), it is possible to retain the usual orders of convergence of the Euler--Maruyama method, that is to say, uniform weak order one and uniform strong order one half.
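For context, the baseline Euler--Maruyama step whose convergence orders are retained reads as follows in a scalar sketch; the micro-macro decomposition itself is not reproduced here, and the drift and diffusion below are illustrative placeholders.

```python
import numpy as np

def euler_maruyama(f, g, x0, T, N, rng=None):
    """Euler-Maruyama for dX = f(X) dt + g(X) dW on [0, T] with N steps."""
    rng = rng or np.random.default_rng()
    h = T / N
    x = np.empty(N + 1)
    x[0] = x0
    for n in range(N):
        dW = rng.normal(0.0, np.sqrt(h))  # Brownian increment over one step
        x[n + 1] = x[n] + f(x[n]) * h + g(x[n]) * dW
    return x

# Illustrative oscillatory drift; weak order 1 and strong order 1/2 in h.
path = euler_maruyama(f=lambda x: np.cos(x), g=lambda x: 0.5, x0=1.0, T=1.0, N=1000)
```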
We propose and study a temporal and a spatio-temporal discretisation of the 2D stochastic Navier--Stokes equations in bounded domains supplemented with no-slip boundary conditions. Considering additive noise, we base the construction on the related nonlinear random PDE, which is solved by a transform of the solution of the stochastic Navier--Stokes equations. We show a strong rate of (up to) $1$ in probability for the corresponding discretisations in time and in space-time. Convergence of order (up to) $1$ in time was previously known only for linear SPDEs.
A new explicit stochastic scheme of order $1$ is proposed for solving commutative stochastic differential equations (SDEs) with non-globally Lipschitz continuous coefficients. The proposed method is a semi-tamed version of the Milstein scheme for SDEs whose drift coefficient consists of a non-globally Lipschitz continuous term and a globally Lipschitz continuous term. It is easily implementable and achieves strong convergence order one. A stability criterion for this method is derived, which shows that the stability condition of the numerical method coincides with that of the underlying equation. Compared with some widely used numerical schemes, the proposed method better inherits the mean-square stability of the exact solution of the SDEs. Numerical experiments are given to illustrate the obtained convergence and stability properties.
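One common semi-taming variant (the paper's exact scheme may differ in detail) tames only the non-globally Lipschitz drift part while leaving the Lipschitz part and the Milstein correction untouched; a scalar sketch with an illustrative Ginzburg--Landau-type equation:

```python
import numpy as np

def semi_tamed_milstein(f1, f2, g, dg, x0, T, N, rng=None):
    """Semi-tamed Milstein for dX = (f1 + f2)(X) dt + g(X) dW: only the
    non-globally Lipschitz drift part f1 is tamed; scalar noise, so the
    commutativity condition holds trivially."""
    rng = rng or np.random.default_rng()
    h = T / N
    x = np.empty(N + 1)
    x[0] = x0
    for n in range(N):
        dW = rng.normal(0.0, np.sqrt(h))
        tamed = f1(x[n]) * h / (1.0 + h * abs(f1(x[n])))  # keeps the cubic term bounded
        x[n + 1] = (x[n] + tamed + f2(x[n]) * h + g(x[n]) * dW
                    + 0.5 * g(x[n]) * dg(x[n]) * (dW**2 - h))  # Milstein correction
    return x

# Stochastic Ginzburg-Landau: dX = (X - X^3) dt + 0.5 X dW.
path = semi_tamed_milstein(f1=lambda x: -x**3, f2=lambda x: x,
                           g=lambda x: 0.5 * x, dg=lambda x: 0.5,
                           x0=1.0, T=1.0, N=1000)
```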
We consider a fully discrete scheme for nonlinear stochastic partial differential equations with non-globally Lipschitz coefficients driven by multiplicative noise in a multi-dimensional setting. Our method uses a polynomial-based spectral method in space, so it does not require that the elliptic operator $A$ and the covariance operator $Q$ of the noise commute, and thus successfully alleviates a restriction of the Fourier spectral method for SPDEs pointed out by Jentzen, Kloeden and Winkel. The discretization in time is a tamed semi-implicit scheme which treats the nonlinear term explicitly while being unconditionally stable. Under regularity assumptions of the kind usually made for SPDEs with additive noise, we establish optimal strong convergence rates in both space and time for our fully discrete scheme. We also present numerical experiments which are consistent with our theoretical results.
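A compressed sketch of the tamed semi-implicit time stepping on a 1D toy problem is given below. Note the simplifications versus the paper: a sine (Fourier-type) basis instead of the polynomial basis, additive instead of multiplicative noise, and a 1D domain; the equation and all parameters are illustrative.

```python
import numpy as np

# Toy: 1D stochastic Allen-Cahn du = (u_xx + u - u^3) dt + dW, zero Dirichlet BC.
M, Nt, T, sigma = 64, 1000, 1.0, 0.1
h = T / Nt
x = np.linspace(0.0, 1.0, M + 2)[1:-1]             # interior grid points
k = np.arange(1, M + 1)
S = np.sqrt(2.0) * np.sin(np.pi * np.outer(x, k))  # sine basis on the grid
lam = (np.pi * k) ** 2                             # eigenvalues of A = -d^2/dx^2
rng = np.random.default_rng(0)

u = np.sin(np.pi * x)                              # initial condition
for n in range(Nt):
    F = u - u ** 3                                 # nonlinear drift, explicit...
    F = F / (1.0 + h * np.linalg.norm(F))          # ...but tamed for stability
    dW = sigma * np.sqrt(h) * (S @ (rng.normal(size=M) / k))  # trace-class noise
    rhs = u + h * F + dW
    coeff = (S.T @ rhs) / (M + 1)                  # forward sine transform
    u = S @ (coeff / (1.0 + h * lam))              # implicit linear solve, diagonal
```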
Understanding the training dynamics of deep learning models is perhaps a necessary step toward demystifying the effectiveness of these models. In particular, how do data from different classes gradually become separable in their feature spaces when training neural networks using stochastic gradient descent? In this study, we model the evolution of features during deep learning training using a set of stochastic differential equations (SDEs), each of which corresponds to a training sample. As a crucial ingredient in our modeling strategy, each SDE contains a drift term that reflects the impact of backpropagation at an input on the features of all samples. Our main finding uncovers a sharp phase transition phenomenon regarding the intra-class impact: if the SDEs are locally elastic in the sense that the impact is more significant on samples from the same class as the input, the features of the training data become linearly separable, meaning vanishing training loss; otherwise, the features are not separable, regardless of how long the training time is. Moreover, in the presence of local elasticity, an analysis of our SDEs shows the emergence of a simple geometric structure known as neural collapse of the features. Taken together, our results shed light on the decisive role of local elasticity in the training dynamics of neural networks. We corroborate our theoretical analysis with experiments on a synthesized dataset of geometric shapes and CIFAR-10.
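For concreteness, the following toy simulation (a sketch in the spirit of, but not identical to, the paper's SDEs; all parameters are illustrative) reproduces the qualitative phase transition: intra-class impact strength alpha above the inter-class strength beta yields separable features, and alpha <= beta does not.

```python
import numpy as np

# Toy: n features in R^2, two balanced classes. An "update" at sample j moves
# every feature h_i along j's class direction y_j * v, with strength alpha for
# same-class pairs and beta for cross-class pairs (local elasticity: alpha > beta).
rng = np.random.default_rng(0)
n, d, T, Nt, sigma = 100, 2, 5.0, 500, 0.05
alpha, beta = 1.0, 0.3
y = np.repeat([1.0, -1.0], n // 2)               # class labels
v = np.array([1.0, 0.0])                         # shared signal direction
h = rng.normal(scale=0.1, size=(n, d))           # initial features
dt = T / Nt

E = np.where(np.equal.outer(y, y), alpha, beta)  # pairwise impact matrix
for _ in range(Nt):
    drift = (E @ (y[:, None] * v[None, :])) / n  # mean impact of all samples
    h += drift * dt + sigma * np.sqrt(dt) * rng.normal(size=(n, d))

# With alpha > beta the projections h @ v separate by class; with alpha <= beta
# they stay mixed, mirroring the phase transition described above.
```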
Bayesian learning via Stochastic Gradient Langevin Dynamics (SGLD) has been suggested for differentially private learning. While previous research provides differential privacy bounds for SGLD when close to convergence or at the initial steps of the algorithm, the question of what differential privacy guarantees can be made in between remains unanswered. This interim region is essential, especially for Bayesian neural networks, as it is hard to guarantee convergence to the posterior. This paper shows that using SGLD might result in unbounded privacy loss for this interim region, even when sampling from the posterior is as differentially private as desired.
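For reference, a minimal SGLD update in the style of Welling and Teh (2011) is sketched below; the model, data, and stepsize are illustrative placeholders, not the paper's setup. Each iterate both injects Gaussian noise and touches a minibatch of data, which is why per-step privacy loss accumulates across the interim region.

```python
import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik, batch, N, eps, rng):
    """One SGLD update: half-step along a minibatch estimate of the
    log-posterior gradient, plus Gaussian noise of variance eps."""
    g = grad_log_prior(theta) + (N / len(batch)) * sum(grad_log_lik(theta, x) for x in batch)
    noise = rng.normal(0.0, np.sqrt(eps), size=theta.shape)
    return theta + 0.5 * eps * g + noise

# Illustrative target: prior N(0, 1), likelihood N(theta, 1).
rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=1000)
theta = np.zeros(1)
for t in range(2000):
    batch = rng.choice(data, size=32)
    theta = sgld_step(theta,
                      grad_log_prior=lambda th: -th,
                      grad_log_lik=lambda th, x: (x - th),
                      batch=batch, N=len(data), eps=1e-3, rng=rng)
```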
This paper revisits the temporal difference (TD) learning algorithm for policy evaluation tasks in reinforcement learning. Typically, the performance of TD(0) and TD($\lambda$) is very sensitive to the choice of stepsizes. Oftentimes, TD(0) suffers from slow convergence. Motivated by the tight link between the TD(0) learning algorithm and stochastic gradient methods, we develop a provably convergent adaptive projected variant of the TD(0) learning algorithm with linear function approximation that we term AdaTD(0). In contrast to TD(0), AdaTD(0) is robust, or less sensitive, to the choice of stepsizes. Analytically, we establish that to reach an $\epsilon$ accuracy, the number of iterations needed is $\tilde{O}(\epsilon^{-2}\ln^4\frac{1}{\epsilon}/\ln^4\frac{1}{\rho})$ in the general case, where $\rho$ represents the speed at which the underlying Markov chain converges to the stationary distribution. This implies that the iteration complexity of AdaTD(0) is no worse than that of TD(0) in the worst case. When the stochastic semi-gradients are sparse, we provide theoretical acceleration of AdaTD(0). Going beyond TD(0), we develop an adaptive variant of TD($\lambda$), which is referred to as AdaTD($\lambda$). Empirically, we evaluate the performance of AdaTD(0) and AdaTD($\lambda$) on several standard reinforcement learning tasks, and the results demonstrate the effectiveness of our new approaches.
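A sketch of the flavor of update involved is given below, assuming an AdaGrad-style per-coordinate rule and an $\ell_2$-ball projection; the paper's exact AdaTD(0) update may differ, and the tiny two-state environment is purely illustrative (states are drawn i.i.d. for simplicity rather than along one trajectory).

```python
import numpy as np

def ada_td0(env_step, phi, theta_dim, gamma=0.99, alpha=0.1,
            radius=10.0, steps=10000, eps=1e-8, seed=0):
    """AdaGrad-style projected TD(0) with linear function approximation.
    env_step(rng) -> (s, r, s_next)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(theta_dim)
    G = np.zeros(theta_dim)                       # accumulated squared semi-gradients
    for _ in range(steps):
        s, r, s_next = env_step(rng)
        delta = r + gamma * phi(s_next) @ theta - phi(s) @ theta  # TD error
        g = delta * phi(s)                        # stochastic semi-gradient
        G += g ** 2
        theta += alpha * g / np.sqrt(G + eps)     # per-coordinate adaptive step
        norm = np.linalg.norm(theta)
        if norm > radius:                         # projection onto the l2 ball
            theta *= radius / norm
    return theta

# Hypothetical 2-state chain with one-hot features phi(s):
def env_step(rng):
    s = rng.integers(2)
    s_next = 1 - s if rng.random() < 0.7 else s
    return s, float(s == 1), s_next

theta = ada_td0(env_step, phi=lambda s: np.eye(2)[s], theta_dim=2)
```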
In this article, I introduce differential equation models and review their frequentist and Bayesian computational methods. A numerical example based on the FitzHugh--Nagumo model is given.
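For instance, the FitzHugh--Nagumo model can be integrated numerically as follows (one common parameterization; conventions for the parameters vary):

```python
import numpy as np
from scipy.integrate import solve_ivp

# FitzHugh-Nagumo in one common parameterization:
#   dV/dt = c * (V - V^3/3 + R),   dR/dt = -(1/c) * (V - a + b*R)
def fhn(t, y, a=0.2, b=0.2, c=3.0):
    V, R = y
    return [c * (V - V**3 / 3 + R), -(V - a + b * R) / c]

sol = solve_ivp(fhn, t_span=(0, 40), y0=[-1.0, 1.0], max_step=0.05)
V, R = sol.y  # membrane potential and recovery variable over the grid sol.t
```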
In this paper, we consider possibly misspecified stochastic differential equation models driven by L\'{e}vy processes. Regardless of whether the driving noise is Gaussian, the Gaussian quasi-likelihood estimator can estimate unknown parameters in the drift and scale coefficients. In the misspecified case, however, the asymptotic distribution of the estimator changes because of the correction of the misspecification bias, and consistent estimators of the asymptotic variance proposed for the correctly specified case may lose their theoretical validity. As a solution, we propose a bootstrap method for approximating the asymptotic distribution. We show that our bootstrap method is theoretically valid in both the correctly specified case and the misspecified case, without assuming a precise distribution for the driving noise.
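The Gaussian quasi-likelihood idea can be illustrated in a scalar toy model; the sketch below is not the paper's Lévy setting or its bootstrap, and the heavy-tailed Student-$t$ increments are only a stand-in for non-Gaussian driving noise. The contrast matches the first two conditional moments only, which is why Gaussianity of the noise is never used.

```python
import numpy as np
from scipy.optimize import minimize

# Toy model: dX = -theta * X dt + sigma dZ, observed on a grid of mesh h.
rng = np.random.default_rng(0)
h, n, theta_true, sigma_true = 0.01, 5000, 1.0, 0.5
X = np.empty(n + 1); X[0] = 1.0
for i in range(n):  # Euler simulation with heavy-tailed (non-Gaussian) increments
    X[i + 1] = X[i] - theta_true * X[i] * h + sigma_true * np.sqrt(h) * rng.standard_t(5)

def gql_contrast(params):
    theta, log_sigma = params
    sigma2 = np.exp(2 * log_sigma)
    resid = np.diff(X) + theta * X[:-1] * h           # increment minus drift
    return np.sum(resid**2 / (sigma2 * h) + np.log(sigma2))

est = minimize(gql_contrast, x0=[0.5, 0.0]).x
# A bootstrap, as in the paper, would resample or re-simulate to approximate the
# estimator's sampling distribution instead of plugging in a variance formula.
```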
We consider a generic and explicit tamed Euler--Maruyama scheme for multidimensional time-inhomogeneous stochastic differential equations with multiplicative Brownian noise. The diffusion coefficient is uniformly elliptic, H\"older continuous and weakly differentiable in the spatial variables, while the drift satisfies the Ladyzhenskaya--Prodi--Serrin condition, as considered by Krylov and R\"ockner (2005). In the discrete scheme, the drift is tamed by replacing it with an approximation. A strong rate of convergence of the scheme is provided in terms of the approximation error of the drift in a suitable, possibly very weak, topology. A few examples of approximating drifts are discussed in detail. The parameters of the approximating drifts can vary and be fine-tuned to achieve the standard $1/2$ strong convergence rate with a logarithmic factor.
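A minimal sketch of this "taming by approximation" idea follows, with an illustrative discontinuous drift and a cutoff approximation (not one of the paper's worked examples); the convergence rate is then governed by how well the approximation tracks the true drift.

```python
import numpy as np

def tamed_em_approx_drift(b_approx, sigma, x0, T, N, rng=None):
    """Euler-Maruyama where the (possibly singular) drift b is replaced by a
    chosen approximation b_approx before discretization."""
    rng = rng or np.random.default_rng()
    h = T / N
    x = np.empty(N + 1); x[0] = x0
    for n in range(N):
        t = n * h
        dW = rng.normal(0.0, np.sqrt(h))
        x[n + 1] = x[n] + b_approx(t, x[n]) * h + sigma(t, x[n]) * dW
    return x

# Illustrative drift b(x) = x / |x| (discontinuous at 0), replaced by the
# cutoff approximation b_R(x) = x / max(|x|, 1/R); larger R means b_R -> b.
R = 100.0
path = tamed_em_approx_drift(
    b_approx=lambda t, x: x / max(abs(x), 1.0 / R),
    sigma=lambda t, x: 1.0, x0=0.0, T=1.0, N=1000)
```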
Asynchronous momentum stochastic gradient descent (Async-MSGD) is one of the most popular algorithms in distributed machine learning. However, its convergence properties for complicated nonconvex problems are still largely unknown because of current technical limitations. Therefore, in this paper, we propose to analyze the algorithm through a simpler but nontrivial nonconvex problem, streaming PCA, which helps us understand Async-MSGD better, even for more general problems. Specifically, we establish the asymptotic rate of convergence of Async-MSGD for streaming PCA by diffusion approximation. Our results indicate a fundamental tradeoff between asynchrony and momentum: to ensure convergence and acceleration through asynchrony, we have to reduce the momentum (compared with Sync-MSGD). To the best of our knowledge, this is the first theoretical attempt at understanding Async-MSGD for distributed nonconvex stochastic optimization. Numerical experiments on both streaming PCA and training deep neural networks are provided to support our findings for Async-MSGD.
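As a toy illustration of the tradeoff, the following sketch runs Oja-style momentum SGD for streaming PCA with an artificial fixed gradient delay standing in for asynchrony; all parameters are illustrative, and the delay model is a simplification of real asynchronous execution.

```python
import numpy as np
from collections import deque

# Streaming PCA: track the top eigenvector of Sigma from i.i.d. samples using
# Oja-style MSGD; gradients are evaluated at a tau-step-stale iterate.
rng = np.random.default_rng(0)
d, steps, eta, mu, tau = 20, 20000, 0.01, 0.5, 4
Sigma = np.diag(np.linspace(1.0, 2.0, d))        # true covariance, top eigvec = e_d
w = rng.normal(size=d); w /= np.linalg.norm(w)
m = np.zeros(d)
stale = deque([w.copy()] * tau, maxlen=tau)      # buffer of delayed iterates

for _ in range(steps):
    x = rng.multivariate_normal(np.zeros(d), Sigma)
    w_stale = stale[0]                           # gradient computed at a stale iterate
    g = x * (x @ w_stale)                        # stochastic Oja direction
    m = mu * m + g                               # momentum accumulation
    w = w + eta * m
    w /= np.linalg.norm(w)                       # retract to the unit sphere
    stale.append(w.copy())

# Alignment with the top eigenvector; increasing mu or tau tends to degrade it,
# reflecting the asynchrony-momentum tradeoff discussed above.
alignment = abs(w[-1])
```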