Accelerated gradient methods are a cornerstone of the large-scale, data-driven optimization problems that arise naturally in machine learning and other data-analysis fields. We introduce a gradient-based optimization framework for achieving acceleration, based on the recently introduced notion of fixed-time stability of dynamical systems. The method is a generalization of simple gradient-based methods, suitably scaled to achieve convergence to the optimizer in fixed time, independent of the initialization. We achieve this by first leveraging a continuous-time framework for designing fixed-time stable dynamical systems, and then providing a consistent discretization strategy such that the equivalent discrete-time algorithm tracks the optimizer in a practically fixed number of iterations. We also provide a theoretical analysis of the convergence behavior of the proposed gradient flows, and of their robustness to additive disturbances, for a range of functions that are strongly convex, strictly convex, or possibly nonconvex but satisfy the Polyak-{\L}ojasiewicz inequality. We also show that the regret bound on the convergence rate is constant by virtue of the fixed-time convergence. The hyperparameters have intuitive interpretations and can be tuned to meet the requirements on the desired convergence rates. We validate the accelerated convergence properties of the proposed schemes on a range of numerical examples against state-of-the-art optimization algorithms. Our work provides insights into developing novel optimization algorithms via discretization of continuous-time flows.
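For illustration, here is a minimal sketch of such a scheme, assuming a fixed-time gradient flow of the form $\dot{x} = -c_1 \nabla f(x)/\|\nabla f(x)\|^{1-\alpha} - c_2 \nabla f(x)\,\|\nabla f(x)\|^{\beta-1}$ with $0<\alpha<1<\beta$, discretized by forward Euler; the gains, exponents, and test function are illustrative choices, not the paper's exact scheme.

```python
import numpy as np

def fxts_gradient_descent(grad_f, x0, c1=1.0, c2=1.0, alpha=0.5, beta=1.5,
                          step=1e-2, n_iter=500, tol=1e-12):
    """Forward-Euler discretization of an (assumed) fixed-time-stable flow
    x' = -c1*g/||g||^(1-alpha) - c2*g*||g||^(beta-1), 0 < alpha < 1 < beta."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad_f(x)
        gn = np.linalg.norm(g)
        if gn < tol:  # numerically stationary
            break
        # Two power-rescaled gradient terms: the exponent-beta term dominates
        # far from the optimizer, the exponent-alpha term near it, which is
        # what bounds the convergence time independently of initialization.
        x = x - step * (c1 * g / gn**(1 - alpha) + c2 * g * gn**(beta - 1))
    return x

# Example: strongly convex quadratic f(x) = 0.5*||x||^2, so grad f(x) = x.
x_star = fxts_gradient_descent(lambda x: x, x0=np.array([5.0, -3.0]))
print(x_star)  # close to the minimizer at the origin
```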
Computations of incompressible flows with velocity boundary conditions require solution of a Poisson equation for pressure with all-Neumann boundary conditions. Discretization of such a Poisson equation results in a rank-deficient matrix of coefficients. When a non-conservative discretization method such as finite difference, finite element, or spectral scheme is used, such a matrix also generates an inconsistency, which causes the residuals in the iterative solution to saturate at a threshold level that depends on the spatial resolution and order of the discretization scheme. In this paper, we examine this inconsistency for a high-order meshless discretization scheme suitable for solving the equations on a complex domain. The high-order meshless method uses polyharmonic spline radial basis functions (PHS-RBF) with appended polynomials to interpolate scattered data and constructs the discrete equations by collocation. The PHS-RBF provides the flexibility to vary the order of discretization by increasing the degree of the appended polynomial. In this study, we examine the convergence of the inconsistency for different spatial resolutions and for different degrees of the appended polynomials by solving the Poisson equation for a manufactured solution as well as the Navier-Stokes equations for several fluid flows. We observe that the inconsistency decreases faster than the error in the final solution, and eventually becomes vanishingly small at sufficient spatial resolution. The rate of convergence of the inconsistency is observed to be similar to, or better than, the rate of convergence of the discretization errors. This beneficial observation makes it unnecessary to regularize the Poisson equation by fixing either the mean pressure or the pressure at an arbitrary point. A simple point solver such as SOR is seen to converge well, although it can be further accelerated using multilevel methods.
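As a concrete sketch of the interpolation building block, the following constructs a PHS-RBF interpolant $\phi(r) = r^3$ with an appended polynomial in 2D; the node sets, kernel exponent, and polynomial degrees here are illustrative, and the paper's collocation setup for the PDE operators is not reproduced.

```python
import numpy as np

def phs_rbf_interpolant(centers, values, degree=1, m=3):
    """PHS-RBF interpolant phi(r) = r^m (m odd) with an appended polynomial
    of total degree `degree`, built from scattered 2D nodes (sketch only)."""
    n = len(centers)
    # Pairwise distances and the PHS kernel block.
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    A = d**m
    # Appended polynomial block (monomials up to `degree` in x, y).
    P = _poly_matrix(centers, degree)
    k = P.shape[1]
    # Saddle-point system: interpolation + polynomial-reproduction constraints.
    M = np.block([[A, P], [P.T, np.zeros((k, k))]])
    coeffs = np.linalg.solve(M, np.concatenate([values, np.zeros(k)]))
    lam, gam = coeffs[:n], coeffs[n:]

    def interp(x):
        r = np.linalg.norm(centers - x, axis=1)
        return r**m @ lam + _poly_matrix(x[None, :], degree)[0] @ gam
    return interp

def _poly_matrix(pts, degree):
    # Monomial basis {x^i * y^j : i + j <= degree} evaluated at pts.
    cols = [pts[:, 0]**i * pts[:, 1]**j
            for i in range(degree + 1) for j in range(degree + 1 - i)]
    return np.stack(cols, axis=1)

# Example: interpolate f(x, y) = sin(x) + y on scattered nodes.
rng = np.random.default_rng(0)
nodes = rng.uniform(0, 1, size=(50, 2))
f = np.sin(nodes[:, 0]) + nodes[:, 1]
u = phs_rbf_interpolant(nodes, f, degree=2)
print(u(np.array([0.5, 0.5])), np.sin(0.5) + 0.5)  # should be close
```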
We develop a one-Newton-step-per-horizon, online, lag-$L$, model predictive control (MPC) algorithm for solving discrete-time, equality-constrained, nonlinear dynamic programs. Based on recent sensitivity analysis results for the target problem class, we prove that the approach exhibits a behavior that we call superconvergence; that is, the tracking error with respect to the full-horizon solution is not only stable under successive horizon shifts, but also decreases with increasing shift order to a minimum value that decays exponentially in the length of the receding horizon. The key analytical step is the decomposition of the one-step error recursion of our algorithm into an algorithmic error and a perturbation error. We show that the perturbation error decays exponentially with the lag between two consecutive receding horizons, while the algorithmic error, determined by Newton's method, exhibits quadratic convergence. Overall, this decomposition yields our local exponential convergence result in terms of the receding-horizon length for suitable values of $L$. Numerical experiments validate our theoretical findings.
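A minimal, dense sketch of the per-horizon step follows: warm-start from the previous (shifted) primal-dual solution and take a single Newton step on the KKT conditions of the window's equality-constrained problem. The structured, horizon-stacked formulation and the lag-$L$ shift of the paper are abstracted away here; a plain copy stands in for the warm start.

```python
import numpy as np

def one_newton_step(z, lam, grad_F, hess_L, c, jac_c):
    """One Newton step on the KKT conditions of  min_z F(z)  s.t.  c(z) = 0.
    Generic dense sketch; a real MPC solver would exploit the stage-wise
    sparsity of the horizon-stacked dynamic program."""
    g, J, cz = grad_F(z), jac_c(z), c(z)
    H = hess_L(z, lam)  # Hessian of the Lagrangian
    K = np.block([[H, J.T], [J, np.zeros((J.shape[0], J.shape[0]))]])
    step = np.linalg.solve(K, -np.concatenate([g + J.T @ lam, cz]))
    return z + step[:z.size], lam + step[z.size:]

# Toy window: quadratic objective with linear dynamics-style constraints.
rng = np.random.default_rng(1)
n, m = 6, 3
Q, b = np.eye(n), rng.normal(size=n)
A, d = rng.normal(size=(m, n)), rng.normal(size=m)

z, lam = np.zeros(n), np.zeros(m)  # warm start (shifted previous solution)
z, lam = one_newton_step(
    z, lam,
    grad_F=lambda z: Q @ z - b,
    hess_L=lambda z, lam: Q,
    c=lambda z: A @ z - d,
    jac_c=lambda z: A,
)
print(np.linalg.norm(A @ z - d))  # constraints satisfied after one step
```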
In this article, we consider a nonlinear, nonlocal, time-dependent fourth-order equation modeling the deformation of a thin and narrow rectangular plate. We propose a $C^1$-conforming virtual element method (VEM) of arbitrary order $k\ge2$ to approximate the model problem numerically. We employ the VEM to discretize the space variable and a fully implicit scheme for the temporal variable. Well-posedness of the fully discrete scheme is proved under certain conditions on the physical parameters, and we derive optimal orders of convergence in both the space and time variables. Finally, numerical experiments are presented to illustrate the behaviour of the proposed numerical scheme.
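To fix ideas on the temporal part, the sketch below applies a fully implicit (backward Euler) step with an inner Newton solve to a generic semi-discrete system $M\dot{u} + N(u) = 0$; the VEM spatial operator and the plate model's nonlocal nonlinearity are abstracted into $N$, and the cubic test nonlinearity is purely illustrative.

```python
import numpy as np

def backward_euler(M, residual, jacobian, u0, dt, n_steps, newton_iters=10):
    """Fully implicit stepping for  M u'(t) + N(u(t)) = 0.
    Each step solves  M (u_new - u_old)/dt + N(u_new) = 0  by Newton's method."""
    u = u0.copy()
    for _ in range(n_steps):
        u_old = u.copy()
        for _ in range(newton_iters):
            F = M @ (u - u_old) / dt + residual(u)
            J = M / dt + jacobian(u)  # Jacobian of the implicit residual
            du = np.linalg.solve(J, -F)
            u += du
            if np.linalg.norm(du) < 1e-12:
                break
    return u

# Example: stiff cubic system, N(u) = u**3 (a stand-in for the discretized
# nonlinear fourth-order operator).
n = 4
u_final = backward_euler(np.eye(n),
                         residual=lambda u: u**3,
                         jacobian=lambda u: np.diag(3 * u**2),
                         u0=np.ones(n), dt=0.1, n_steps=50)
print(u_final)  # decays toward zero, matching u' = -u^3
```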
In this paper we first present a proof of the perceptron convergence algorithm for complex multivalued neural networks (CMVNNs). Our primary goal is to formulate and prove the perceptron convergence algorithm for bicomplex multivalued neural networks (BMVNNs), together with other important results in the theory of neural networks based on bicomplex algebra.
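For context, here is a sketch of the standard error-correction rule for a single complex multivalued neuron, whose outputs are the $k$-th roots of unity; the paper's bicomplex variant replaces the complex arithmetic below with bicomplex algebra, and the learning rate and stopping rule here are generic choices.

```python
import numpy as np

def mvn_activation(z, k):
    """Map a complex weighted sum to the nearest k-th root of unity
    (the discrete multivalued-neuron activation)."""
    sector = int(np.floor(k * np.angle(z) / (2 * np.pi))) % k
    return np.exp(2j * np.pi * sector / k)

def train_mvn(X, targets, k, lr=1.0, epochs=100):
    """Error-correction learning for one complex multivalued neuron.
    X: complex inputs (n_samples, n_features); targets: k-th roots of unity."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features + 1, dtype=complex)   # weights + bias
    Xb = np.hstack([np.ones((n_samples, 1)), X])  # prepend bias input
    for _ in range(epochs):
        errors = 0
        for x, t in zip(Xb, targets):
            y = mvn_activation(w @ x, k)
            if not np.isclose(y, t):
                # Rotate the weighted sum toward the desired root of unity.
                w += lr * (t - y) * np.conj(x) / (n_features + 1)
                errors += 1
        if errors == 0:  # converged: all samples classified correctly
            break
    return w
```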
We study the convergence of a random iterative sequence of a family of operators on infinite-dimensional Hilbert spaces, inspired by the stochastic gradient descent (SGD) algorithm in the case of noiseless regression, as studied in [1]. We demonstrate that its polynomial convergence rate depends on the initial state, while the randomness plays a role only in the choice of the best constant factor; we also close the gap between the upper and lower bounds.
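The setting can be mimicked in finite dimension: each noiseless SGD step applies the random rank-one contraction $(I - \gamma\, \varphi\varphi^{\top})$ to the error, so the iterates form a random product of operators. The sketch below uses a polynomially decaying coordinate scaling as a crude stand-in for the Hilbert-space spectrum; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_iter, gamma = 50, 2000, 0.3

# Noiseless linear regression: labels y = <x_star, phi> with random features
# whose coordinate scales decay polynomially (mimicking an operator spectrum).
scales = np.arange(1, d + 1, dtype=float) ** -1.0
x_star = rng.normal(size=d)
x = np.zeros(d)
errors = []
for _ in range(n_iter):
    phi = scales * rng.normal(size=d)   # random data functional
    y = x_star @ phi                    # noiseless label
    x -= gamma * (x @ phi - y) * phi    # SGD step: contracts the error
    errors.append(np.linalg.norm(x - x_star))
print(errors[::500])  # decay whose rate depends on the initial error profile
```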
We address the problem of finding the optimal policy of a constrained Markov decision process (CMDP) using a gradient-descent-based algorithm. Previous results have shown that a primal-dual approach can achieve an $\mathcal{O}(1/\sqrt{T})$ global convergence rate for both the optimality gap and the constraint violation. We propose a new algorithm, called the policy mirror descent-primal dual (PMD-PD) algorithm, that provably achieves a faster $\mathcal{O}(\log(T)/T)$ convergence rate for both the optimality gap and the constraint violation. For the primal (policy) update, the PMD-PD algorithm utilizes a modified value function and performs natural policy gradient steps, which is equivalent to a mirror descent step with appropriate regularization. For the dual update, the PMD-PD algorithm uses modified Lagrange multipliers to ensure a faster convergence rate. We also present two extensions of this approach, to settings with zero constraint violation and to sample-based estimation. Experimental results demonstrate the faster convergence rate and the better performance of the PMD-PD algorithm compared with existing policy-gradient-based algorithms.
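A schematic of the primal-dual loop is given below for a tabular CMDP: the primal step is an exponentiated (NPG-style) mirror descent update on a Lagrangian Q-function, and the dual step is projected ascent on the multiplier. The paper's PMD-PD modifications of the value function and multipliers, which sharpen the rate, are omitted; step sizes and the uniform initial-state averaging are illustrative choices.

```python
import numpy as np

def q_values(P, R, pi, gamma):
    """Exact Q^pi for a tabular MDP with transitions P[s,a,s'], rewards R[s,a]."""
    r_pi = (pi * R).sum(axis=1)
    P_pi = np.einsum('sa,sap->sp', pi, P)
    V = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
    return R + gamma * P @ V

def pmd_pd_sketch(P, R, C, b, gamma=0.9, eta_pi=1.0, eta_lam=1.0, iters=200):
    """Mirror-descent primal-dual loop for  max_pi V_R(pi)  s.t.  V_C(pi) <= b."""
    S, A = R.shape
    pi, lam = np.full((S, A), 1.0 / A), 0.0
    for _ in range(iters):
        Q_L = q_values(P, R - lam * C, pi, gamma)   # Lagrangian Q-function
        pi = pi * np.exp(eta_pi * Q_L)              # exponentiated (MD) update
        pi /= pi.sum(axis=1, keepdims=True)
        Q_C = q_values(P, C, pi, gamma)             # constraint values
        V_C = (pi * Q_C).sum(axis=1).mean()         # uniform initial states
        lam = max(0.0, lam + eta_lam * (V_C - b))   # projected dual ascent
    return pi, lam

# Toy CMDP: 3 states, 2 actions, random dynamics and reward/cost tables.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
pi, lam = pmd_pd_sketch(P, R=rng.random((3, 2)), C=rng.random((3, 2)), b=5.0)
```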
Interpretation of Deep Neural Network (DNN) training as an optimal control problem with nonlinear dynamical systems has received considerable attention recently, yet the algorithmic development remains relatively limited. In this work, we take a step in this direction by reformulating the training procedure from the trajectory optimization perspective. We first show that most widely used algorithms for training DNNs can be linked to Differential Dynamic Programming (DDP), a celebrated second-order trajectory optimization algorithm rooted in Approximate Dynamic Programming. In this vein, we propose a new variant of DDP that accepts batch optimization for training feedforward networks, while integrating naturally with recent progress in curvature approximation. The resulting algorithm features layer-wise feedback policies that improve the convergence rate and reduce sensitivity to hyperparameters relative to existing methods. We show that the algorithm is competitive against state-of-the-art first- and second-order methods. Our work opens up new avenues for principled algorithmic design built upon optimal control theory.
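The core of DDP is its backward pass, sketched below in its standard (iLQR-style) form: each stage builds a local quadratic Q-expansion and an affine feedback policy, then propagates the value function backward. In the DNN-training view, stages correspond to layers and controls to layer parameters; this is the textbook recursion, not the paper's exact batch variant.

```python
import numpy as np

def ddp_backward_stage(fx, fu, lx, lu, lxx, luu, lux, Vx, Vxx):
    """One stage of the DDP backward pass.
    fx, fu: dynamics Jacobians; l*: cost derivatives; Vx, Vxx: value
    expansion from the next stage. Returns the affine policy
    u = u_ref + k + K (x - x_ref) and the propagated value expansion."""
    Qx = lx + fx.T @ Vx
    Qu = lu + fu.T @ Vx
    Qxx = lxx + fx.T @ Vxx @ fx
    Quu = luu + fu.T @ Vxx @ fu
    Qux = lux + fu.T @ Vxx @ fx
    Quu_inv = np.linalg.inv(Quu)   # assumed regularized / positive definite
    k = -Quu_inv @ Qu              # feedforward correction
    K = -Quu_inv @ Qux             # layer-wise feedback gain
    # Value-function recursion for the preceding stage/layer.
    Vx_new = Qx + K.T @ Quu @ k + K.T @ Qu + Qux.T @ k
    Vxx_new = Qxx + K.T @ Quu @ K + K.T @ Qux + Qux.T @ K
    return k, K, Vx_new, Vxx_new
```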
When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods and distributed methods, and theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, lottery ticket hypothesis and infinite-width analysis.
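For reference, the two families of update rules contrasted in such reviews are, in textbook form (generic symbols, not specific to this survey): plain SGD with momentum, and an adaptive method such as Adam with per-coordinate step sizes.

```python
import numpy as np

def sgd_momentum_step(w, g, buf, lr=0.1, momentum=0.9):
    """Heavy-ball SGD: accumulate a velocity buffer, step along it."""
    buf = momentum * buf + g
    return w - lr * buf, buf

def adam_step(w, g, state, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: bias-corrected running moments give per-coordinate step sizes."""
    m, v = state
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)   # bias correction (t starts at 1)
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), (m, v)
```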
To make deliberate progress towards more intelligent and more human-like artificial systems, we need to follow an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks, such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to "buy" arbitrary levels of skill for a system, in a way that masks the system's own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.
In this paper, we study the optimal convergence rate for distributed convex optimization problems in networks. We model the communication restrictions imposed by the network as a set of affine constraints and provide optimal complexity bounds for four different setups, namely: the function $F(\mathbf{x}) \triangleq \sum_{i=1}^{m} f_i(\mathbf{x})$ is (i) strongly convex and smooth, (ii) strongly convex, (iii) smooth, or (iv) just convex. Our results show that Nesterov's accelerated gradient descent on the dual problem can be executed in a distributed manner and obtains the same optimal rates as in the centralized version of the problem (up to constant or logarithmic factors), with an additional cost related to the spectral gap of the interaction matrix. Finally, we discuss some extensions of the proposed setup, such as proximal-friendly functions, time-varying graphs, and improvement of the condition numbers.
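A schematic of the distributed execution pattern is sketched below: each row of $X$ is one node's copy of the decision variable, each gradient evaluation is local, and the only communication per iteration is one multiplication by a gossip matrix $W$. This is a simplified stand-in for the paper's dual formulation with affine network constraints; the gossip weights, step size, and momentum are illustrative.

```python
import numpy as np

def distributed_nesterov(grad_local, W, X0, lr, momentum, n_iter):
    """Accelerated consensus-optimization sketch: Nesterov extrapolation,
    local gradient steps, and one gossip round (W @ Y) per iteration."""
    X, X_prev = X0.copy(), X0.copy()
    for _ in range(n_iter):
        Y = X + momentum * (X - X_prev)               # Nesterov extrapolation
        G = np.stack([grad_local(i, Y[i]) for i in range(len(Y))])
        X_prev = X
        X = W @ Y - lr * G                            # mix with neighbors, descend
    return X

# Example: 4 nodes on a ring, local quadratics f_i(x) = 0.5*(x - a_i)^2.
a = np.array([1.0, 2.0, 3.0, 4.0])
W = 0.5 * np.eye(4) + 0.25 * (np.roll(np.eye(4), 1, 0) + np.roll(np.eye(4), -1, 0))
X = distributed_nesterov(lambda i, x: x - a[i], W,
                         X0=np.zeros((4, 1)), lr=0.1, momentum=0.8, n_iter=300)
print(X.ravel())  # all nodes near the global minimizer, mean(a) = 2.5
```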