In this paper, we propose new structured second-order methods and structured adaptive-gradient methods obtained by performing natural-gradient descent on structured parameter spaces. Natural-gradient descent is an attractive approach for designing new algorithms in many settings, such as gradient-free, adaptive-gradient, and second-order methods. Our structured methods not only enjoy structural invariance but also admit a simple expression. Finally, we test the efficiency of our proposed methods on both deterministic non-convex problems and deep learning problems.
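For context, the natural-gradient update that such methods build on preconditions the ordinary gradient with the inverse Fisher information matrix. In generic notation (ours, not necessarily the paper's), with parameters $\lambda$, objective $\mathcal{L}$, Fisher matrix $F(\lambda)$, and step size $\beta$, one step reads
\[
\lambda_{t+1} \;=\; \lambda_t \;-\; \beta\, F(\lambda_t)^{-1}\, \nabla_{\lambda}\mathcal{L}(\lambda_t).
\]
The paper's contribution, as summarized above, is to carry out this update on structured parameter spaces so that it remains tractable.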
Adaptive gradient methods have shown excellent performance for solving many machine learning problems. Although multiple adaptive methods have been studied recently, they mainly focus on either empirical or theoretical aspects and work only for specific problems with specific adaptive learning rates. It is therefore desirable to design a universal framework for practical adaptive-gradient algorithms with theoretical guarantees for general problems. To fill this gap, we propose a faster and universal framework of adaptive gradients (SUPER-ADAM) by introducing a universal adaptive matrix that includes most existing adaptive gradient forms. Moreover, our framework can flexibly integrate momentum and variance-reduction techniques. In particular, our framework provides convergence analysis for adaptive gradient methods in the nonconvex setting. In our theoretical analysis, we prove that the SUPER-ADAM algorithm achieves the best-known gradient (i.e., stochastic first-order oracle (SFO)) complexity of $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point of a nonconvex problem, which matches the lower bound for stochastic smooth nonconvex optimization. In numerical experiments, we employ various deep learning tasks to validate that our algorithm consistently outperforms existing adaptive algorithms. Code is available at //github.com/LIJUNYI95/SuperAdam
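As a rough illustration of the kind of update such a framework generalizes, the sketch below performs one Adam-style step in which a diagonal adaptive matrix preconditions a momentum estimate of the gradient. It is a generic sketch with made-up names, not the SUPER-ADAM algorithm or the authors' code, and it omits the variance-reduction component:

```python
import numpy as np

def adaptive_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One generic adaptive-gradient step: momentum on the gradient plus a
    diagonal adaptive matrix H = diag(sqrt(v) + eps), as in Adam-type methods.
    Illustrative sketch only, not the SUPER-ADAM algorithm itself."""
    m = beta1 * state.get("m", np.zeros_like(w)) + (1 - beta1) * grad      # momentum estimate
    v = beta2 * state.get("v", np.zeros_like(w)) + (1 - beta2) * grad**2   # second-moment estimate
    H = np.sqrt(v) + eps                     # diagonal adaptive matrix
    w_new = w - lr * m / H                   # preconditioned update
    state.update(m=m, v=v)
    return w_new, state
```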
We study a class of deterministic flows in ${\mathbb R}^{d\times k}$, parametrized by a random matrix ${\boldsymbol X}\in {\mathbb R}^{n\times d}$ with i.i.d. centered subgaussian entries. We characterize the asymptotic behavior of these flows over bounded time horizons, in the high-dimensional limit in which $n,d\to\infty$ with $k$ fixed and converging aspect ratios $n/d\to\delta$. The asymptotic characterization we prove is in terms of a nonlinear stochastic process in $k$ dimensions, whose parameters are determined by a fixed-point condition. This type of characterization is known in physics as dynamical mean field theory. Rigorous results of this type have been obtained in the past for a few spin glass models. Our proof is based on time discretization and a reduction to certain iterative schemes known as approximate message passing (AMP) algorithms, as opposed to earlier work that was based on large deviations theory and stochastic process theory. The new approach allows for a more elementary proof and implies that the high-dimensional behavior of the flow is universal with respect to the distribution of the entries of ${\boldsymbol X}$. As specific applications, we obtain high-dimensional characterizations of gradient flow in some classical models from statistics and machine learning, under a random design assumption.
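A concrete instance of the flows in question (written in our own generic notation, not the paper's) is the gradient flow for an empirical risk under random design, where the data matrix $\boldsymbol X$ enters only through the products $\boldsymbol{x}_i^{\mathsf T}\boldsymbol{\Theta}$:
\[
\frac{\mathrm{d}}{\mathrm{d}t}\boldsymbol{\Theta}(t)
\;=\; -\,\nabla_{\boldsymbol{\Theta}}\,\widehat{R}\big(\boldsymbol{\Theta}(t)\big),
\qquad
\widehat{R}(\boldsymbol{\Theta}) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(\boldsymbol{x}_i^{\mathsf T}\boldsymbol{\Theta};\, y_i\big),
\qquad \boldsymbol{\Theta}\in\mathbb{R}^{d\times k},
\]
where $\boldsymbol{x}_i^{\mathsf T}$ denotes the $i$-th row of $\boldsymbol X$ and $\ell$ is a smooth loss.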
Despite the many advances in the use of weakly-compressible smoothed particle hydrodynamics (SPH) for the simulation of incompressible fluid flow, it is still challenging to obtain second-order convergence numerically. In this paper we perform a systematic numerical study of the convergence and accuracy of kernel-based approximation, discretization operators, and weakly-compressible SPH (WCSPH) schemes. We explore the origins of the errors and the issues preventing second-order convergence. Based on this study, we propose several new variations of the basic WCSPH scheme that are all second-order accurate. Additionally, we investigate the linear and angular momentum conservation properties of the WCSPH schemes. Our results show that one may construct accurate WCSPH schemes that demonstrate second-order convergence through a judicious choice of kernel, smoothing length, and discretization operators for the governing equations.
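For reference, the kernel-based approximation at the core of such SPH schemes is the summation interpolant. A minimal one-dimensional sketch with a Gaussian kernel (our illustrative choice; the paper studies several kernels and smoothing lengths) might look like this:

```python
import numpy as np

def sph_interpolate(x_eval, x_j, f_j, m_j, rho_j, h):
    """Standard SPH summation interpolant in 1D:
    f(x) ≈ sum_j (m_j / rho_j) * f_j * W(x - x_j, h), with a Gaussian kernel.
    Illustrative only; not tied to the specific schemes studied in the paper."""
    r = x_eval - x_j
    W = np.exp(-(r / h) ** 2) / (h * np.sqrt(np.pi))   # normalized 1D Gaussian kernel
    return np.sum((m_j / rho_j) * f_j * W)
```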
Numerical evaluations have definitively shown that, for deep learning optimizers such as stochastic gradient descent, momentum, and adaptive methods, the number of steps needed to train a deep neural network halves for each doubling of the batch size, and that there is a region of diminishing returns beyond the critical batch size. In this paper, we determine the actual critical batch size by using the global minimizer of the stochastic first-order oracle (SFO) complexity of the optimizer. To prove the existence of the actual critical batch size, we derive lower and upper bounds on the SFO complexity and prove that there exist critical batch sizes in the sense of minimizing these bounds. This implies that, if the SFO complexity fits the lower and upper bounds, then the existence of these critical batch sizes demonstrates the existence of the actual critical batch size. We also discuss the conditions needed for the SFO complexity to fit the lower and upper bounds and provide numerical results that support our theoretical findings.
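In generic notation (ours), if $K(b)$ denotes the number of steps the optimizer needs to train to a given accuracy with batch size $b$, then the SFO complexity counts stochastic gradient evaluations and the critical batch size is its minimizer:
\[
N(b) \;=\; b\,K(b),
\qquad
b^{\star} \;\in\; \operatorname*{arg\,min}_{b}\, N(b).
\]
The halving of $K(b)$ with each doubling of $b$, followed by diminishing returns, is what makes $N(b)$ attain a minimum at a finite $b^{\star}$.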
We present VPVnet, a deep neural network method for the Stokes' equations under reduced regularity. Different from recently proposed deep learning methods [40,51], which are based on the original form of the PDEs, VPVnet uses the least-squares functional of the first-order velocity-pressure-vorticity (VPV) formulation ([30]) as its loss function. As such, only first-order derivatives are required in the loss function, and hence the method is applicable to a much larger class of problems, e.g., problems with non-smooth solutions. Several methods have been proposed recently to reduce the regularity requirement by transforming the original problem into a corresponding variational form; for the Stokes' equations, however, the choice of approximating spaces for the velocity and the pressure must additionally satisfy the LBB condition. Here, by making use of the VPV formulation, the lower regularity requirement is achieved without the need to consider the LBB condition. Convergence and error estimates have been established for the proposed method. It is worth emphasizing that the VPVnet method is divergence-free and pressure-robust, while classical inf-sup stable mixed finite elements for the Stokes' equations are not pressure-robust. Various numerical experiments, including 2D and 3D lid-driven cavity test cases, are conducted to demonstrate its efficiency and accuracy.
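For reference, the standard first-order VPV reformulation of the Stokes' equations introduces the vorticity $\boldsymbol{\omega}=\nabla\times\boldsymbol{u}$ and replaces the second-order momentum equation by first-order equations (this is the textbook form; the precise weighting of the residuals in VPVnet's least-squares loss may differ):
\[
\nu\,\nabla\times\boldsymbol{\omega} + \nabla p = \boldsymbol{f},
\qquad
\nabla\times\boldsymbol{u} - \boldsymbol{\omega} = \boldsymbol{0},
\qquad
\nabla\cdot\boldsymbol{u} = 0 .
\]
The loss function is then a sum of the squared residuals of these equations (and of the boundary conditions) over collocation points, which requires only first-order derivatives of the network outputs.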
In this paper, we address the new problem of reversing the effect of an image filter, which can be linear or nonlinear. The assumption is that the algorithm of the filter is unknown and the filter is available only as a black box. We formulate this inverse problem as minimizing a local patch-based cost function and use the total derivative to approximate the gradient, which is then used in gradient descent to solve the problem. We analyze factors affecting the convergence and the quality of the output in the Fourier domain. We also study the application of accelerated gradient descent algorithms to three gradient-free reverse filters, including the one proposed in this paper. We present results from extensive experiments to evaluate the complexity and effectiveness of the proposed algorithm. The results demonstrate that the proposed algorithm outperforms the state of the art in that (1) its complexity is at the same level as that of the fastest reverse filter, yet it can reverse a larger number of filters, and (2) it can reverse the same list of filters as the much more complex reverse filter, but at a much lower complexity.
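To make the setting concrete, a bare-bones black-box reverse-filtering loop of this family updates the estimate with the residual against the observed filtered image; this corresponds to gradient descent on $\tfrac12\|F(x)-y\|^2$ when the filter's Jacobian is approximated by the identity. It is a generic sketch, not the specific total-derivative algorithm proposed in the paper:

```python
import numpy as np

def reverse_filter(y, filt, n_iters=50, step=1.0):
    """Illustrative black-box reverse filtering: starting from the filtered
    image y, repeatedly add the residual y - filt(x). This is gradient descent
    on 0.5 * ||filt(x) - y||^2 with the filter's Jacobian approximated by the
    identity; it is a generic sketch, not the paper's algorithm."""
    x = y.copy()
    for _ in range(n_iters):
        x = x + step * (y - filt(x))   # residual-driven update
    return x
```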
After substantial progress over the last 15 years, the "algebraic CSP-dichotomy conjecture" reduces to the following: every local constraint satisfaction problem (CSP) associated with a finite idempotent algebra is tractable if and only if the algebra has a Taylor term operation. Despite the tremendous achievements in this area (including recently announced proofs of the general conjecture), there remain examples of small algebras with just a single binary operation whose CSP resists direct classification as either tractable or NP-complete using known methods. In this paper we present some new methods for approaching such problems, with particular focus on techniques that help us attack the class of finite algebras known as "commutative idempotent binars" (CIBs). We demonstrate the utility of these methods by using them to prove that every CIB of cardinality at most 4 yields a tractable CSP.
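For readers outside universal algebra: a commutative idempotent binar is simply a set equipped with one binary operation $\cdot$ satisfying
\[
x \cdot y = y \cdot x, \qquad x \cdot x = x .
\]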
Many well-established anomaly detection methods use the distance of a sample to those in its local neighbourhood: so-called `local outlier methods', such as LOF and DBSCAN. They are popular for their simple principles and strong performance on unstructured, feature-based data that is commonplace in many practical applications. However, they cannot learn to adapt to a particular dataset because they lack trainable parameters. In this paper, we begin by unifying local outlier methods, showing that they are particular cases of the more general message-passing framework used in graph neural networks. This allows us to introduce learnability into local outlier methods, in the form of a neural network, for greater flexibility and expressivity: specifically, we propose LUNAR, a novel graph neural network-based anomaly detection method. LUNAR learns to use information from the nearest neighbours of each node in a trainable way to find anomalies. We show that our method performs significantly better than existing local outlier methods, as well as state-of-the-art deep baselines. We also show that the performance of our method is much more robust to different settings of the local neighbourhood size.
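As a concrete baseline of the kind LUNAR generalizes, a plain $k$-nearest-neighbour distance score aggregates each sample's neighbour distances with a fixed function (here, the mean); LUNAR's point, per the abstract, is to replace this fixed aggregation with a trainable message-passing network. The sketch below (using scikit-learn) is only that non-learnable baseline, not LUNAR itself:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_distance_scores(X, k=10):
    """Classic local-outlier-style score: each sample's anomaly score is the
    mean of its distances to its k nearest neighbours. Non-learnable baseline
    for illustration; LUNAR replaces this fixed aggregation with a trainable
    graph neural network."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)          # first column is each point itself (distance 0)
    return dists[:, 1:].mean(axis=1)     # higher score = more anomalous
```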
We introduce a collection of benchmark problems in 2D and 3D (geometry description and boundary conditions), including simple cases with known analytic solutions, classical experimental setups, and complex geometries with fabricated solutions, for the evaluation of numerical schemes for the incompressible Navier-Stokes equations in the laminar flow regime. We compare the performance of a representative selection of the most broadly used algorithms for the Navier-Stokes equations on this set of problems. Where applicable, we compare the most common spatial discretization choices (unstructured triangle/tetrahedral meshes and structured or semi-structured quadrilateral/hexahedral meshes). The study shows that while the type of spatial discretization used has a minor impact on the accuracy of the solutions, the choices of time integration method, spatial discretization order, and whether to solve the coupled equations directly or to reduce them to simpler subproblems lead to very different behavior. Methods that directly solve the original equations tend to be more accurate than splitting approaches for the same number of degrees of freedom, but numerical and computational difficulties arise when they are scaled to larger problem sizes. Low-order splitting methods are less accurate but scale more easily to large problems, while higher-order splitting methods are accurate but require dense time discretizations to be stable. We release the description of the experiments and an implementation of our benchmark, which we believe will enable statistically significant comparisons with the state of the art as new approaches for solving the incompressible Navier-Stokes equations are introduced.
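For completeness, the equations being benchmarked are the incompressible Navier-Stokes equations, which in standard form read
\[
\frac{\partial \boldsymbol{u}}{\partial t} + (\boldsymbol{u}\cdot\nabla)\boldsymbol{u}
\;=\; -\frac{1}{\rho}\nabla p + \nu\,\Delta\boldsymbol{u} + \boldsymbol{f},
\qquad
\nabla\cdot\boldsymbol{u} = 0 ,
\]
where $\boldsymbol{u}$ is the velocity, $p$ the pressure, $\rho$ the density, $\nu$ the kinematic viscosity, and $\boldsymbol{f}$ a body force.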
Escaping saddle points is a central research topic in nonconvex optimization. In this paper, we propose a simple gradient-based algorithm such that for a smooth function $f\colon\mathbb{R}^n\to\mathbb{R}$, it outputs an $\epsilon$-approximate second-order stationary point in $\tilde{O}(\log n/\epsilon^{1.75})$ iterations. Compared to the previous state-of-the-art algorithms by Jin et al. with $\tilde{O}((\log n)^{4}/\epsilon^{2})$ or $\tilde{O}((\log n)^{6}/\epsilon^{1.75})$ iterations, our algorithm is polynomially better in terms of $\log n$ and matches their complexities in terms of $1/\epsilon$. For the stochastic setting, our algorithm outputs an $\epsilon$-approximate second-order stationary point in $\tilde{O}((\log n)^{2}/\epsilon^{4})$ iterations. Technically, our main contribution is the idea of implementing a robust Hessian power method using only gradients, which can find negative curvature near saddle points and achieves a polynomial speedup in $\log n$ compared to perturbed gradient descent methods. Finally, we also perform numerical experiments that support our results.
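To illustrate the key ingredient, the sketch below estimates a negative-curvature direction using only gradient calls: Hessian-vector products are formed by finite differences of gradients and fed into a shifted power iteration. This is a textbook version under our own assumptions (finite-difference step $r$, shift chosen above the largest Hessian eigenvalue), not the paper's robust variant:

```python
import numpy as np

def hvp(grad_f, x, v, r=1e-5):
    """Gradient-only Hessian-vector product: H v ≈ (grad_f(x + r v) - grad_f(x)) / r."""
    return (grad_f(x + r * v) - grad_f(x)) / r

def negative_curvature_direction(grad_f, x, dim, shift, n_iters=100, seed=0):
    """Power iteration on (shift * I - H) using only gradient evaluations.
    If shift upper-bounds the largest eigenvalue of H, the iterate converges to
    the eigenvector of H with the smallest eigenvalue, i.e. a direction of most
    negative curvature near a saddle point. Textbook sketch, not the paper's
    robust Hessian power method."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        v = shift * v - hvp(grad_f, x, v)   # apply (shift * I - H) to v
        v /= np.linalg.norm(v)
    return v
```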