In this paper, we establish the almost sure convergence of two-timescale stochastic gradient descent algorithms in continuous time under general noise and stability conditions, extending well-known results in discrete time. We analyse algorithms with additive noise and those with non-additive noise. In the non-additive case, our analysis is carried out under the assumption that the noise is a continuous-time Markov process, controlled by the algorithm states. The algorithms we consider can be applied to a broad class of bilevel optimisation problems. We study one such problem in detail, namely, the problem of joint online parameter estimation and optimal sensor placement for a partially observed diffusion process. We demonstrate how this can be formulated as a bilevel optimisation problem, and propose a solution in the form of a continuous-time, two-timescale, stochastic gradient descent algorithm. Furthermore, under suitable conditions on the latent signal, the filter, and the filter derivatives, we establish almost sure convergence of the online parameter estimates and optimal sensor placements to the stationary points of the asymptotic log-likelihood and asymptotic filter covariance, respectively. We also provide numerical examples, illustrating the application of the proposed methodology to a partially observed Bene\v{s} equation, and a partially observed stochastic advection-diffusion equation.
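As a rough illustration of the two-timescale structure (here in an Euler-discretized, additive-noise form; the coupling functions, step-size schedules, and noise level below are all hypothetical, not the paper's algorithm), the fast iterate tracks the minimizer of an inner objective while the slow iterate descends the outer objective using the fast iterate as a stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy coupling: the fast variable y tracks the slow variable
# theta (inner objective g(theta, y) = 0.5*(y - theta)^2), while the slow
# variable descends F(theta) = 0.5*(theta - 1)^2, using y as a stand-in
# for theta. Additive Gaussian noise models the stochastic gradients.
theta, y = 5.0, 0.0
for k in range(1, 100_000):
    beta = 1.0 / k ** (2 / 3)            # fast step size
    alpha = 1.0 / k                      # slow step size; alpha/beta -> 0
    y -= beta * ((y - theta) + 0.1 * rng.standard_normal())
    theta -= alpha * ((y - 1.0) + 0.1 * rng.standard_normal())
print(theta, y)                          # both settle near 1.0
```

The essential ingredient is the step-size separation $\alpha_k / \beta_k \to 0$, under which the slow iterate effectively sees an equilibrated fast iterate.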
The Bregman proximal point algorithm (BPPA), one of the centerpieces of the optimization toolbox, has seen emerging applications. Despite its simple and easy-to-implement update rule, and several compelling intuitions for its empirical success, rigorous justifications remain largely unexplored. We study the computational properties of BPPA through classification tasks with separable data, and demonstrate provable algorithmic regularization effects associated with BPPA. We show that BPPA attains a non-trivial margin, which depends closely on the condition number of the distance-generating function inducing the Bregman divergence. We further demonstrate that the dependence on the condition number is tight for a class of problems, showing the importance of the divergence in determining the quality of the obtained solutions. In addition, we extend our findings to mirror descent (MD), for which we establish similar connections between the margin and the Bregman divergence. We demonstrate through a concrete example that BPPA/MD converges in direction to the maximal-margin solution with respect to the Mahalanobis distance. Our theoretical findings are among the first to demonstrate the benign learning properties of BPPA/MD, and also provide corroboration for a careful choice of divergence in algorithmic design.
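For intuition, here is a minimal mirror descent sketch with a quadratic distance-generating function $h(w) = \frac{1}{2} w^\top Q w$, so the induced Bregman divergence is a Mahalanobis distance; the data, the matrix Q, and the step size are invented for illustration and this is not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Separable toy data: two Gaussian blobs with labels y in {-1, +1}.
X = np.vstack([rng.standard_normal((100, 2)) + [2.0, 0.0],
               rng.standard_normal((100, 2)) - [2.0, 0.0]])
y = np.concatenate([np.ones(100), -np.ones(100)])

Q = np.array([[2.0, 0.3], [0.3, 1.0]])   # distance-generating h(w) = 0.5 w^T Q w
Qinv = np.linalg.inv(Q)

def grad_logistic(w):
    m = np.clip(y * (X @ w), -30.0, 30.0)          # margins, clipped for stability
    return -(X * (y / (1.0 + np.exp(m)))[:, None]).mean(axis=0)

w = np.zeros(2)
for _ in range(5000):
    # Mirror descent: grad h(w_next) = grad h(w) - eta * grad f(w);
    # for quadratic h this reduces to w_next = w - eta * Q^{-1} grad f(w).
    w -= 0.5 * Qinv @ grad_logistic(w)
print(w / np.linalg.norm(w))   # direction aligns with a Q-dependent max margin
```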
We study policy gradient (PG) for reinforcement learning (RL) in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). We represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integral of an auxiliary running reward function that can be evaluated using samples and the current value function. This effectively turns PG into a policy evaluation (PE) problem, enabling us to apply the martingale approach recently developed by Jia and Zhou (2021) for PE to solve our PG problem. Based on this analysis, we propose two types of actor-critic algorithms for RL, where we learn and update value functions and policies simultaneously and in an alternating manner. The first type is based directly on the aforementioned representation, which involves future trajectories and hence is offline. The second type, designed for online learning, employs the first-order condition of the policy gradient and turns it into martingale orthogonality conditions. These conditions are then incorporated via stochastic approximation when updating policies. Finally, we demonstrate the algorithms with simulations in two concrete examples.
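The online scheme can be caricatured in a discretized linear-quadratic toy, where a TD-style residual plays the role of the martingale increment and both the critic and the actor are updated by stochastic approximation. Everything below (the dynamics, reward, parameterizations, discount rate, and step sizes) is an invented sketch, not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)
dt, sigma, beta = 0.01, 1.0, 2.0           # step, noise, discount rate (assumed)

phi = np.array([-1.0, 0.0])                # critic: V(x) = phi[0]*x^2 + phi[1]
theta = -0.5                               # actor: u ~ N(theta*x, 0.3^2)
x = 1.0
for k in range(1, 200_000):
    u = theta * x + 0.3 * rng.standard_normal()
    r = -(x * x + u * u)                                   # running reward
    x_new = x + u * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    V = lambda z: phi[0] * z * z + phi[1]
    delta = V(x_new) - V(x) + (r - beta * V(x)) * dt       # TD-style residual
    lr = 0.5 / k ** 0.7
    phi += lr * delta * np.array([x * x, 1.0])             # critic update
    theta += lr * delta * (u - theta * x) * x              # actor (score function)
    theta = min(theta, -0.05)              # crude stabilizer for this sketch
    x = x_new
print(theta, phi)
```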
We study Benamou's domain decomposition algorithm for optimal transport in the entropy regularized setting. The key observation is that the regularized variant converges to the globally optimal solution under very mild assumptions. We prove linear convergence of the algorithm with respect to the Kullback--Leibler divergence and illustrate the (potentially very slow) rates with numerical examples. On problems with sufficient geometric structure (such as Wasserstein distances between images) we expect much faster convergence. We then discuss important aspects of a computationally efficient implementation, such as adaptive sparsity, a coarse-to-fine scheme and parallelization, paving the way to numerically solving large-scale optimal transport problems. We demonstrate efficient numerical performance for computing the Wasserstein-2 distance between 2D images and observe that, even without parallelization, domain decomposition compares favorably to applying a single efficient implementation of the Sinkhorn algorithm in terms of runtime, memory and solution quality.
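A bare-bones version of the entropic domain decomposition iteration on a 1D grid (the partition sizes, regularization strength, and data are arbitrary choices here, and no adaptive sparsity or parallelization is attempted): the coupling is updated block by block with Sinkhorn solves on each cell, alternating between two staggered partitions of the rows:

```python
import numpy as np

def sinkhorn(mu, nu, C, eps, iters=200):
    # Entropic OT between (unnormalized, equal-mass) marginals mu and nu.
    K = np.exp(-C / eps)
    v = np.ones_like(nu)
    for _ in range(iters):
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(3)
n, eps = 64, 0.05
x = np.linspace(0.0, 1.0, n)
C = (x[:, None] - x[None, :]) ** 2                 # quadratic cost on a 1D grid
mu = np.ones(n) / n
nu = rng.random(n); nu /= nu.sum()

pi = mu[:, None] * nu[None, :]                     # feasible initial coupling
part_A = [list(range(i, min(i + 8, n))) for i in range(0, n, 8)]
part_B = [list(range(max(i, 0), min(i + 8, n))) for i in range(-4, n, 8)]

for sweep in range(20):
    for partition in (part_A, part_B):             # alternate staggered partitions
        for idx in partition:
            nu_cell = pi[idx].sum(axis=0)          # Y-mass currently in this cell
            pi[idx] = sinkhorn(mu[idx], nu_cell, C[idx], eps)
print(abs(pi.sum(axis=0) - nu).max())              # marginal nu is preserved
```

Each cell solve keeps the row marginals fixed and redistributes only the mass already inside the cell, which is why the iteration stays feasible throughout.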
This paper deals with the Darcy-Forchheimer problem with two kinds of boundary conditions. We discretize the system using finite element methods and propose two iterative schemes to solve the discrete problems. The well-posedness and convergence of the corresponding iterative problems are then proven. Finally, several numerical experiments are performed to validate the proposed numerical schemes.
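Iterative schemes for this problem typically linearize the nonlinear Forchheimer term. A pointwise caricature of such a Picard-type linearization (the scalar model a*u + b*|u|*u = g below is hypothetical and stands in for the full finite element system) freezes |u| at the previous iterate so that each step is a linear solve:

```python
import numpy as np

# Scalar caricature of the Forchheimer law: a*u + b*|u|*u = g.
# Freezing |u| at the previous iterate makes each step a linear solve,
# mirroring a Picard-type linearization of the full nonlinear FEM system.
a, b = 1.0, 0.5
g = np.array([2.0, -3.0, 0.7])            # right-hand side at a few "nodes"
u = np.zeros_like(g)
for k in range(100):
    u_new = g / (a + b * np.abs(u))       # linear problem with |u| frozen
    if np.max(np.abs(u_new - u)) < 1e-12:
        break
    u = u_new
print(u, a * u + b * np.abs(u) * u - g)   # residual is ~0 at convergence
```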
In sparse estimation, such as fused lasso and convex clustering, we apply either the proximal gradient method or the alternating direction method of multipliers (ADMM) to solve the problem. The latter requires time-consuming matrix inversion, whereas efficient methods such as FISTA (fast iterative shrinkage-thresholding algorithm) have been developed for the former. This paper proposes a general method for converting the ADMM solution to the proximal gradient method, assuming that the derivative of the objective function is Lipschitz continuous. We then apply it to sparse estimation problems, such as sparse convex clustering and trend filtering, and show through numerical experiments that a significant improvement in efficiency can be obtained.
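As a reminder of the machinery such a conversion unlocks, here is generic FISTA applied to a lasso instance (the problem data are invented; the paper's conversion targets problems such as fused lasso and convex clustering, whose proximal operators differ):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 100, 20, 0.1
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
L = np.linalg.norm(A, 2) ** 2             # Lipschitz constant of the smooth part

def soft(z, t):                           # proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

x = np.zeros(p); z = x.copy(); t = 1.0
for _ in range(500):
    x_new = soft(z - A.T @ (A @ z - b) / L, lam / L)   # proximal gradient step
    t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    z = x_new + (t - 1.0) / t_new * (x_new - x)        # Nesterov momentum
    x, t = x_new, t_new
print(np.count_nonzero(x), x[:5])         # sparse solution
```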
We consider the computation of free-energy-like quantities for diffusions in high dimension, when resorting to Monte Carlo simulation is necessary. Such stochastic computations typically suffer from high variance, in particular in a low-noise regime, because the expectation is dominated by rare trajectories for which the observable reaches large values. Although importance sampling, or tilting of trajectories, is now a standard technique for reducing the variance of such estimators, quantitative criteria for proving that a given control reduces variance are scarce, and often do not apply to practical situations. The goal of this work is to provide a quantitative criterion for assessing whether a given bias reduces variance, and at which order. For this, we rely on a recently introduced notion of stochastic solution for Hamilton-Jacobi-Bellman (HJB) equations. Based on this tool, we introduce the notion of a $k$-stochastic viscosity approximation of an HJB equation. We next prove that such approximate solutions are associated with estimators having a relative variance of order $k-1$ at log-scale. Finally, to show that our definition is relevant, we provide examples of stochastic viscosity approximations of orders one and two, with a numerical illustration confirming our theoretical findings.
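The variance phenomenon is easy to see in a toy Girsanov-tilted estimator for an Ornstein-Uhlenbeck process (the observable, drift, and constant tilt below are invented for illustration and are unrelated to the paper's criterion): simulating under a tilted drift and reweighting by the likelihood ratio leaves the mean unchanged but can sharply reduce the sample variance:

```python
import numpy as np

rng = np.random.default_rng(5)
dt, T, lam, M = 0.01, 1.0, 3.0, 20_000
steps = int(T / dt)

def estimate(u):
    """MC estimate of E[exp(lam * X_T)] for dX = -X dt + dW, simulated
    under the tilted drift (-X + u) and reweighted via Girsanov."""
    X = np.zeros(M)
    logw = np.zeros(M)
    for _ in range(steps):
        xi = rng.standard_normal(M)
        X += (-X + u) * dt + np.sqrt(dt) * xi
        logw += -u * np.sqrt(dt) * xi - 0.5 * u * u * dt
    vals = np.exp(lam * X + logw)
    return vals.mean(), vals.std()

for u in (0.0, lam / 2.0):                # naive vs a simple constant tilt
    mean, std = estimate(u)
    print(f"u = {u:.2f}: estimate = {mean:.3f}, sample std = {std:.3f}")
```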
Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples, due to their generalization performance and intrinsic computational advantage. However, the fact that the stochastic gradient is a biased estimator of the full gradient when the samples are correlated has led to a lack of theoretical understanding of how SGD behaves in correlated settings and has hindered its use in such cases. In this paper, we focus on hyperparameter estimation for the Gaussian process (GP) and take a step towards breaking this barrier by proving that minibatch SGD converges to a critical point of the full log-likelihood loss function, and recovers model hyperparameters with rate $O(\frac{1}{K})$ for $K$ iterations, up to a statistical error term depending on the minibatch size. Our theoretical guarantees hold provided that the kernel functions exhibit exponential or polynomial eigendecay, which is satisfied by a wide range of kernels commonly used in GPs. Numerical studies on both simulated and real datasets demonstrate that minibatch SGD has better generalization over state-of-the-art GP methods while reducing the computational burden and opening a new, previously unexplored, data-size regime for GPs.
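A minimal sketch of the setting (an RBF kernel with a single log-lengthscale, invented data and step sizes; the gradient formula is standard GP algebra, not the paper's code): each iteration draws a minibatch, forms the minibatch negative log-likelihood gradient, and takes an SGD step:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 2000, 64                               # data size, minibatch size
X = rng.uniform(-3.0, 3.0, n)
true_ls, noise = 0.8, 0.1
K_true = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / true_ls**2)
L = np.linalg.cholesky(K_true + noise**2 * np.eye(n))
y = L @ rng.standard_normal(n)                # sample from the GP prior

log_ls = np.log(0.3)                          # optimize the log-lengthscale
for k in range(1, 2001):
    idx = rng.choice(n, m, replace=False)
    D2 = (X[idx, None] - X[None, idx]) ** 2
    ls = np.exp(log_ls)
    Kb = np.exp(-0.5 * D2 / ls**2) + noise**2 * np.eye(m)
    dK = np.exp(-0.5 * D2 / ls**2) * (D2 / ls**2)     # dK / d(log ls)
    alpha = np.linalg.solve(Kb, y[idx])
    # gradient of the minibatch negative log-likelihood
    grad = 0.5 * np.trace((np.linalg.inv(Kb) - np.outer(alpha, alpha)) @ dK)
    log_ls -= 0.05 / np.sqrt(k) * grad
print(np.exp(log_ls))                         # should approach true_ls = 0.8
```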
Solutions of certain partial differential equations (PDEs) are often represented by the steepest descent curves of corresponding functionals. The minimizing movement scheme was developed to study such curves in metric spaces. In particular, Jordan, Kinderlehrer, and Otto studied the Fokker-Planck equation in this way with respect to the Wasserstein metric. In this paper, we propose a deep learning-based minimizing movement scheme for approximating the solutions of PDEs. The proposed method is highly scalable for high-dimensional problems, as it is free of mesh generation. We demonstrate through various numerical examples that the proposed method accurately approximates the solutions of PDEs by finding the steepest descent direction of a functional, even in high dimensions.
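For orientation, here is the minimizing movement scheme in its simplest Euclidean form, where each proximal step $X_{k+1} = \arg\min_x E(x) + |x - X_k|^2 / (2\tau)$ is solved by an inner gradient loop; the JKO-type scheme of the paper replaces the squared Euclidean distance by the Wasserstein distance and the inner loop by a neural network (the energy and step sizes below are invented):

```python
import numpy as np

# Double-well energy whose steepest descent curve we follow (invented).
def gradE(x):
    return x**3 - x                       # E(x) = x^4/4 - x^2/2, coordinatewise

tau = 0.1
x = np.array([2.0, -0.3])
for k in range(100):                      # outer minimizing-movement steps
    xk = x.copy()
    for _ in range(300):                  # inner loop solves the proximal problem
        x = x - 0.02 * (gradE(x) + (x - xk) / tau)
print(x)                                  # settles at a minimizer (+1, -1)
```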
In this paper, we treat estimation and prediction problems where negative multinomial variables are observed, with particular attention to unbalanced settings. First, we treat the problem of estimating multiple negative multinomial parameter vectors under the standardized squared error loss and derive a new empirical Bayes estimator that dominates the UMVU estimator under suitable conditions. Second, we consider estimation of the joint predictive density of several multinomial tables under the Kullback-Leibler divergence and obtain a sufficient condition under which the Bayesian predictive density with respect to a hierarchical shrinkage prior dominates the Bayesian predictive density with respect to the Jeffreys prior. Third, we show in simulations that the proposed Bayesian estimator and predictive density yield risk improvements. Finally, the problem of estimating the joint predictive density of negative multinomial variables is discussed.
We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that, for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on the optimization of deep learning, and pave the way to studying the optimization dynamics of training modern deep neural networks.
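The "small perturbation region" phenomenon is easy to observe empirically. The sketch below trains a wide two-layer ReLU network (fixed random output layer, manual gradients; the width, data, and learning rate are arbitrary choices, and the paper's analysis concerns deeper networks) from Gaussian initialization and checks how little the weights move in relative terms:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, m = 50, 10, 4096                 # samples, input dim, (large) width
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.sign(X[:, 0] + 0.1)             # a simple, roughly separable labeling

W0 = rng.standard_normal((m, d))       # Gaussian initialization
a = rng.choice([-1.0, 1.0], m)         # fixed random output layer
W = W0.copy()

for step in range(500):
    Z = X @ W.T                        # pre-activations, n x m
    out = np.maximum(Z, 0) @ a / np.sqrt(m)
    s = -y / (1.0 + np.exp(y * out))   # logistic loss derivative w.r.t. output
    G = ((s[:, None] * (Z > 0)) * a).T @ X / (np.sqrt(m) * n)
    W -= 1.0 * G                       # gradient descent on the hidden layer
print(np.linalg.norm(W - W0) / np.linalg.norm(W0))  # stays small: lazy regime
```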