Nesterov's accelerated gradient method (NAG) is widely used in problems with machine learning background including deep learning, and is corresponding to a continuous-time differential equation. From this connection, the property of the differential equation and its numerical approximation can be investigated to improve the accelerated gradient method. In this work we present a new improvement of NAG in terms of stability inspired by numerical analysis. We give the precise order of NAG as a numerical approximation of its continuous-time limit and then present a new method with higher order. We show theoretically that our new method is more stable than NAG for large step size. Experiments of matrix completion and handwriting digit recognition demonstrate that the stability of our new method is better. Furthermore, better stability leads to higher computational speed in experiments.
The present paper continues our investigation of an implementation of a least-squares collocation method for higher-index differential-algebraic equations. In earlier papers, we were able to substantiate the choice of basis functions and collocation points for a robust implementation as well as algorithms for the solution of the discrete system. The present paper is devoted to an analytic estimation of condition numbers for different components of an implementation. We present error estimations, which show the sources for the different errors.
This paper studies the inverse problem of determination the history for a stochastic diffusion process, by means of the value at the final time $T$. By establishing a new Carleman estimate, the conditional stability of the problem is proven. Based on the idea of Tikhonov method, a regularized solution is proposed. The analysis of the existence and uniqueness of the regularized solution, and proof for error estimate under an a-proior assumption are present. Numerical verification of the regularization, including numerical algorithm and examples are also illustrated.
Verifying quantum systems has attracted a lot of interest in the last decades. In this paper, we study the quantitative model-checking of quantum continuous-time Markov chains (quantum CTMCs). The branching-time properties of quantum CTMCs are specified by continuous stochastic logic (CSL), which is famous for verifying real-time systems, including classical CTMCs. The core of checking the CSL formulas lies in tackling multiphase until formulas. We develop an algebraic method using proper projection, matrix exponentiation, and definite integration to symbolically calculate the probability measures of path formulas. Thus the decidability of CSL is established. To be efficient, numerical methods are incorporated to guarantee that the time complexity is polynomial in the encoding size of the input model and linear in the size of the input formula. A running example of Apollonian networks is further provided to demonstrate our method.
In wet-lab experiments, the slime mold Physarum polycephalum has demonstrated its ability to solve shortest path problems and to design efficient networks. For the shortest path problem, a mathematical model for the evolution of the slime is available and it has been shown in computer experiments and through mathematical analysis that the dynamics solves the shortest path problem. In this paper, we introduce a dynamics for the network design problem. We formulate network design as the problem of constructing a network that efficiently supports a multi-commodity flow problem. We investigate the dynamics in computer simulations and analytically. The simulations show that the dynamics is able to construct efficient and elegant networks. In the theoretical part we show that the dynamics minimizes an objective combining the cost of the network and the cost of routing the demands through the network. We also give alternative characterization of the optimum solution.
We present a novel hybrid strategy based on machine learning to improve curvature estimation in the level-set method. The proposed inference system couples enhanced neural networks with standard numerical schemes to compute curvature more accurately. The core of our hybrid framework is a switching mechanism that relies on well established numerical techniques to gauge curvature. If the curvature magnitude is larger than a resolution-dependent threshold, it uses a neural network to yield a better approximation. Our networks are multilayer perceptrons fitted to synthetic data sets composed of sinusoidal- and circular-interface samples at various configurations. To reduce data set size and training complexity, we leverage the problem's characteristic symmetry and build our models on just half of the curvature spectrum. These savings lead to a powerful inference system able to outperform any of its numerical or neural component alone. Experiments with stationary, smooth interfaces show that our hybrid solver is notably superior to conventional numerical methods in coarse grids and along steep interface regions. Compared to prior research, we have observed outstanding gains in precision after training the regression model with data pairs from more than a single interface type and transforming data with specialized input preprocessing. In particular, our findings confirm that machine learning is a promising venue for reducing or removing mass loss in the level-set method.
The training of deep neural networks (DNNs) is currently predominantly done using first-order methods. Some of these methods (e.g., Adam, AdaGrad, and RMSprop, and their variants) incorporate a small amount of curvature information by using a diagonal matrix to precondition the stochastic gradient. Recently, effective second-order methods, such as KFAC, K-BFGS, Shampoo, and TNT, have been developed for training DNNs, by preconditioning the stochastic gradient by layer-wise block-diagonal matrices. Here we propose and analyze the convergence of an approximate natural gradient method, mini-block Fisher (MBF), that lies in between these two classes of methods. Specifically, our method uses a block-diagonal approximation to the Fisher matrix, where for each layer in the DNN, whether it is convolutional or feed-forward and fully connected, the associated diagonal block is also block-diagonal and is composed of a large number of mini-blocks of modest size. Our novel approach utilizes the parallelism of GPUs to efficiently perform computations on the large number of matrices in each layer. Consequently, MBF's per-iteration computational cost is only slightly higher than it is for first-order methods. Finally, the performance of our proposed method is compared to that of several baseline methods, on both Auto-encoder and CNN problems, to validate its effectiveness both in terms of time efficiency and generalization power.
Interpretation of Deep Neural Networks (DNNs) training as an optimal control problem with nonlinear dynamical systems has received considerable attention recently, yet the algorithmic development remains relatively limited. In this work, we make an attempt along this line by reformulating the training procedure from the trajectory optimization perspective. We first show that most widely-used algorithms for training DNNs can be linked to the Differential Dynamic Programming (DDP), a celebrated second-order trajectory optimization algorithm rooted in the Approximate Dynamic Programming. In this vein, we propose a new variant of DDP that can accept batch optimization for training feedforward networks, while integrating naturally with the recent progress in curvature approximation. The resulting algorithm features layer-wise feedback policies which improve convergence rate and reduce sensitivity to hyper-parameter over existing methods. We show that the algorithm is competitive against state-ofthe-art first and second order methods. Our work opens up new avenues for principled algorithmic design built upon the optimal control theory.
Asynchronous momentum stochastic gradient descent algorithms (Async-MSGD) is one of the most popular algorithms in distributed machine learning. However, its convergence properties for these complicated nonconvex problems is still largely unknown, because of the current technical limit. Therefore, in this paper, we propose to analyze the algorithm through a simpler but nontrivial nonconvex problem - streaming PCA, which helps us to understand Aync-MSGD better even for more general problems. Specifically, we establish the asymptotic rate of convergence of Async-MSGD for streaming PCA by diffusion approximation. Our results indicate a fundamental tradeoff between asynchrony and momentum: To ensure convergence and acceleration through asynchrony, we have to reduce the momentum (compared with Sync-MSGD). To the best of our knowledge, this is the first theoretical attempt on understanding Async-MSGD for distributed nonconvex stochastic optimization. Numerical experiments on both streaming PCA and training deep neural networks are provided to support our findings for Async-MSGD.
We propose accelerated randomized coordinate descent algorithms for stochastic optimization and online learning. Our algorithms have significantly less per-iteration complexity than the known accelerated gradient algorithms. The proposed algorithms for online learning have better regret performance than the known randomized online coordinate descent algorithms. Furthermore, the proposed algorithms for stochastic optimization exhibit as good convergence rates as the best known randomized coordinate descent algorithms. We also show simulation results to demonstrate performance of the proposed algorithms.
Policy gradient methods are widely used in reinforcement learning algorithms to search for better policies in the parameterized policy space. They do gradient search in the policy space and are known to converge very slowly. Nesterov developed an accelerated gradient search algorithm for convex optimization problems. This has been recently extended for non-convex and also stochastic optimization. We use Nesterov's acceleration for policy gradient search in the well-known actor-critic algorithm and show the convergence using ODE method. We tested this algorithm on a scheduling problem. Here an incoming job is scheduled into one of the four queues based on the queue lengths. We see from experimental results that algorithm using Nesterov's acceleration has significantly better performance compared to algorithm which do not use acceleration. To the best of our knowledge this is the first time Nesterov's acceleration has been used with actor-critic algorithm.