In this paper we present a method for the solution of $\ell_1$-regularized convex quadratic optimization problems. It is derived by suitably combining a proximal method of multipliers strategy with a semi-smooth Newton method. The resulting linear systems are solved using a Krylov-subspace method, accelerated by appropriate general-purpose preconditioners, which are shown to be optimal with respect to the proximal parameters. Practical efficiency is further improved by warm-starting the algorithm using a proximal alternating direction method of multipliers. We show that the method achieves global convergence under feasibility assumptions. Furthermore, under additional standard assumptions, the method can achieve global linear and local superlinear convergence. The effectiveness of the approach is numerically demonstrated on $L^1$-regularized PDE-constrained optimization problems.
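For concreteness, a representative instance of the problem class addressed above (our notation; the exact formulation used in the paper may differ in details) is
\[
\min_{x \in \mathbb{R}^n} \ \tfrac{1}{2}\, x^\top Q x + c^\top x + \tau \|x\|_1 \quad \text{subject to} \quad A x = b,
\]
with $Q$ positive semidefinite and $\tau > 0$: roughly speaking, the proximal method of multipliers yields regularized subproblems whose optimality conditions, nonsmooth because of the $\ell_1$ term, are solved by the semi-smooth Newton method.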
In this paper we present an algebraic, dimension-oblivious two-level domain decomposition solver for discretizations of elliptic partial differential equations. The proposed parallel solver is based on a space-filling curve partitioning approach that is applicable to any discretization, i.e., it operates directly on the assembled matrix equations. Moreover, it allows for the effective use of arbitrary numbers of processors, independent of the dimension of the underlying partial differential equation, while maintaining optimal convergence behavior. This is the core property required to obtain a sparse grid based combination method with extreme scalability that can utilize exascale parallel systems efficiently. This approach also provides a basis for the development of a fault-tolerant solver for the numerical treatment of high-dimensional problems. To achieve the required data redundancy, we therefore employ large overlaps in our domain decomposition, which we construct via space-filling curves. In this paper, we propose our space-filling curve based domain decomposition solver and present its convergence properties and scaling behavior. The results of numerical experiments clearly show that our approach provides optimal convergence and scaling behavior in arbitrary dimensions using arbitrary numbers of processors.
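As an illustration of the partitioning idea only (not the authors' implementation; the function names, the two-dimensional setting, and the simple overlap rule are our own), the following sketch orders the unknowns of an assembled problem along a Morton (Z-order) space-filling curve and cuts the ordering into contiguous, overlapping blocks, one per processor:
\begin{verbatim}
import numpy as np

def morton_key(ix, iy, bits=16):
    """Interleave the bits of the integer grid indices (ix, iy) into a Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b) | ((iy >> b) & 1) << (2 * b + 1)
    return key

def sfc_partition(coords, n_procs, overlap=0):
    """Split point indices into n_procs contiguous blocks along the Z-curve,
    extending each block by `overlap` points on both sides for data redundancy."""
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    grid = np.floor((coords - lo) / (hi - lo + 1e-15) * (2**16 - 1)).astype(np.int64)
    order = np.argsort([morton_key(ix, iy) for ix, iy in grid])
    bounds = np.linspace(0, len(order), n_procs + 1).astype(int)
    return [order[max(0, s - overlap):min(len(order), e + overlap)]
            for s, e in zip(bounds[:-1], bounds[1:])]
\end{verbatim}
Because the curve preserves locality, consecutive blocks correspond to geometrically compact, overlapping subdomains, which is the property the two-level method above exploits.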
We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method. At each iteration, the algorithm is given an oracle access to a stochastic estimate of the Hessian matrix. The oracle model includes popular algorithms such as the Subsampled Newton and Newton Sketch, which can efficiently construct stochastic Hessian estimates for many tasks. Despite using second-order information, these existing methods do not exhibit superlinear convergence, unless the stochastic noise is gradually reduced to zero during the iteration, which would lead to a computational blow-up in the per-iteration cost. We address this limitation with Hessian averaging: instead of using the most recent Hessian estimate, our algorithm maintains an average of all past estimates. This reduces the stochastic noise while avoiding the computational blow-up. We show that this scheme enjoys local $Q$-superlinear convergence with a non-asymptotic rate of $(\Upsilon\sqrt{\log (t)/t}\,)^{t}$, where $\Upsilon$ is proportional to the level of stochastic noise in the Hessian oracle. A potential drawback of this (uniform averaging) approach is that the averaged estimates contain Hessian information from the global phase of the iteration, i.e., before the iterates converge to a local neighborhood. This leads to a distortion that may substantially delay the superlinear convergence until long after the local neighborhood is reached. To address this drawback, we study a number of weighted averaging schemes that assign larger weights to recent Hessians, so that the superlinear convergence arises sooner, albeit with a slightly slower rate. Remarkably, we show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage, and still enjoys a superlinear convergence~rate nearly (up to a logarithmic factor) matching that of uniform Hessian averaging.
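A minimal sketch of the uniform-averaging scheme described above (our own illustrative code; hessian_oracle and gradient stand for the user-supplied stochastic Hessian oracle and exact gradient):
\begin{verbatim}
import numpy as np

def averaged_newton(x0, gradient, hessian_oracle, n_iters, step=1.0):
    """Stochastic Newton with uniform Hessian averaging: the Newton step uses
    the running average of all past stochastic Hessian estimates, which damps
    the oracle noise without increasing the per-iteration sample size."""
    x = x0.copy()
    H_avg = np.zeros((x.size, x.size))
    for t in range(1, n_iters + 1):
        H_hat = hessian_oracle(x)             # noisy Hessian estimate at x
        H_avg += (H_hat - H_avg) / t          # uniform running average
        x = x - step * np.linalg.solve(H_avg, gradient(x))
    return x
\end{verbatim}
The weighted schemes mentioned above simply replace the $1/t$ weight by weights that favor more recent estimates.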
The monotone variational inequality is a central problem in mathematical programming that unifies and generalizes many important settings such as smooth convex optimization, two-player zero-sum games, and convex-concave saddle point problems. The extragradient method by Korpelevich [1976] is one of the most popular methods for solving monotone variational inequalities. Despite its long history and intensive attention from the optimization and machine learning community, the following major question remains open: what is the last-iterate convergence rate of the extragradient method for monotone and Lipschitz variational inequalities with constraints? We resolve this open problem by showing a tight $O\left(\frac{1}{\sqrt{T}}\right)$ last-iterate convergence rate for arbitrary convex feasible sets, which matches the lower bound by Golowich et al. [2020]. Our rate is measured in terms of the standard gap function. The technical core of our result is the monotonicity of a new performance measure, the tangent residual, which can be viewed as an adaptation of the norm of the operator that takes the local constraints into account. To establish this monotonicity, we develop a new approach that combines the power of sum-of-squares programming with the low dimensionality of the update rule of the extragradient method. We believe our approach has many additional applications in the analysis of iterative methods.
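For reference, with a monotone operator $F$, a convex feasible set $\mathcal{Z}$, Euclidean projection $\Pi_{\mathcal{Z}}$, and step size $\eta$, the extragradient update analyzed above reads
\[
z_{t+1/2} = \Pi_{\mathcal{Z}}\bigl(z_t - \eta F(z_t)\bigr), \qquad
z_{t+1} = \Pi_{\mathcal{Z}}\bigl(z_t - \eta F(z_{t+1/2})\bigr),
\]
and the $O\left(\frac{1}{\sqrt{T}}\right)$ guarantee concerns the gap function evaluated at the last iterate $z_T$.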
In this work we describe an Adaptive Regularization using Cubics (ARC) method for large-scale nonconvex unconstrained optimization using Limited-memory Quasi-Newton (LQN) matrices. ARC methods are a relatively new family of optimization strategies that utilize a cubic-regularization (CR) term in place of trust regions and line searches. LQN methods offer a large-scale alternative to using explicit second-order information by taking the same inputs as popular first-order methods such as stochastic gradient descent (SGD). Solving the CR subproblem exactly requires Newton's method, yet by exploiting the internal structure of LQN matrices we are able to find exact solutions to the CR subproblem in a matrix-free manner, providing large speedups and allowing the method to scale to modern problem sizes. Additionally, we expand upon previous ARC work and explicitly incorporate first-order updates into our algorithm. We provide experimental results using the SR1 update, which show substantial speedups and competitive performance compared to Adam and other second-order optimizers on deep neural networks (DNNs). We find that our new approach, ARCLQN, is competitive with modern optimizers while requiring minimal tuning, a common pain point for second-order methods.
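In standard ARC notation (ours, not necessarily the paper's), the CR subproblem solved at iteration $k$ is
\[
\min_{s \in \mathbb{R}^n} \; g_k^\top s + \tfrac{1}{2}\, s^\top B_k s + \tfrac{\sigma_k}{3} \|s\|^3,
\]
where $g_k$ is the gradient, $B_k$ the LQN (e.g., SR1) approximation of the Hessian, and $\sigma_k > 0$ the adaptive regularization weight; it is the compact structure of $B_k$ that allows this subproblem to be solved exactly in a matrix-free way.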
We employ kernel-based approaches that use samples from a probability distribution to approximate a Kolmogorov operator on a manifold. The self-tuning variable-bandwidth kernel method [Berry & Harlim, Appl. Comput. Harmon. Anal., 40(1):68--96, 2016] computes a large, sparse matrix that approximates the differential operator. Here, we use the eigendecomposition of this discretization to (i) invert the operator, thereby solving a differential equation, and (ii) represent gradient vector fields on the manifold. These methods require only samples from the underlying distribution and can therefore be applied in high dimensions or on geometrically complex manifolds when spatial discretizations are not available. We also employ an efficient $k$-$d$ tree algorithm to assemble the sparse kernel matrix, whose construction is otherwise a computational bottleneck.
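A simplified, fixed-bandwidth stand-in for this construction (the paper uses a self-tuning variable bandwidth; the sketch below is purely illustrative and assumes a recent SciPy) builds the sparse kernel matrix with a $k$-$d$ tree by keeping only pairs of samples within a cutoff radius:
\begin{verbatim}
import numpy as np
from scipy.sparse import coo_matrix, identity
from scipy.spatial import cKDTree

def sparse_gaussian_kernel(X, eps, cutoff=3.0):
    """Sparse Gaussian kernel K_ij = exp(-|x_i - x_j|^2 / eps), assembled only
    for pairs within cutoff*sqrt(eps); all other entries are treated as zero."""
    n = X.shape[0]
    tree = cKDTree(X)
    pairs = tree.query_pairs(cutoff * np.sqrt(eps), output_type="ndarray")  # i < j
    i, j = pairs[:, 0], pairs[:, 1]
    vals = np.exp(-np.sum((X[i] - X[j]) ** 2, axis=1) / eps)
    K = coo_matrix((vals, (i, j)), shape=(n, n))
    return (K + K.T + identity(n)).tocsr()   # symmetrize; zero distance gives kernel 1
\end{verbatim}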
Stein variational gradient descent (SVGD) is a general-purpose optimization-based sampling algorithm that has recently exploded in popularity, but it is limited by two issues: it is known to produce biased samples, and it can be slow to converge on complicated distributions. A recently proposed stochastic variant of SVGD (sSVGD) addresses the first issue, producing unbiased samples by incorporating a special noise into the SVGD dynamics such that asymptotic convergence is guaranteed. Meanwhile, Stein variational Newton (SVN), a Newton-like extension of SVGD, dramatically accelerates the convergence of SVGD by incorporating Hessian information into the dynamics, but it also produces biased samples. In this paper we derive, and provide a practical implementation of, a stochastic variant of SVN (sSVN) that is both asymptotically correct and rapidly convergent. We demonstrate the effectiveness of our algorithm on a difficult class of test problems -- the Hybrid Rosenbrock density -- and show that sSVN converges using three orders of magnitude fewer gradient evaluations of the log-likelihood than its stochastic SVGD counterpart. Our results show that sSVN is a promising approach to accelerating high-precision Bayesian inference tasks in modest dimensions, $d\sim\mathcal{O}(10)$.
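For background, a single (deterministic, hence biased) SVGD update with an RBF kernel is sketched below; sSVGD adds noise to this dynamics, while SVN and sSVN additionally precondition it with Hessian information. The code is our own illustration, not the implementation accompanying the paper.
\begin{verbatim}
import numpy as np

def svgd_step(X, grad_log_p, step, h=1.0):
    """One SVGD update for n particles X (n x d): each particle follows the
    kernel-averaged gradient of log p plus a repulsive kernel-gradient term."""
    n = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]                   # x_i - x_j
    K = np.exp(-np.sum(diffs**2, axis=-1) / (2 * h**2))     # RBF kernel values
    grads = grad_log_p(X)                                   # n x d
    phi = (K @ grads + np.sum(K[:, :, None] * diffs, axis=1) / h**2) / n
    return X + step * phi
\end{verbatim}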
This paper proposes a numerical method based on the Adomian decomposition approach for the time discretization, applied to the Euler equations. A recursive property is demonstrated that allows the method to be formulated in an appropriate and efficient way. To obtain a fully discrete scheme, the space discretization is carried out using classical discontinuous Galerkin (DG) techniques. The efficiency of the resulting scheme is demonstrated through numerical tests, by comparison with exact solutions and with results of the popular Runge-Kutta DG method.
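In generic form (stated here for orientation, not specifically for the Euler system), the Adomian decomposition expands the solution and the nonlinear term as
\[
u = \sum_{n=0}^{\infty} u_n, \qquad N(u) = \sum_{n=0}^{\infty} A_n, \qquad
A_n = \frac{1}{n!}\,\frac{d^n}{d\lambda^n}\, N\!\Bigl(\sum_{k=0}^{\infty} \lambda^k u_k\Bigr)\Big|_{\lambda=0},
\]
with $u_0$ determined by the initial data and each subsequent term obtained by inverting the linear (time-derivative) part applied to the previous Adomian polynomial.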
We propose a stochastic conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms. Existing CGM variants for this template either suffer from slow convergence rates or require carefully increasing the batch size over the course of the algorithm's execution, which leads to computing full gradients. In contrast, the proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance-reduction techniques. In our applications, we put special emphasis on problems with a large number of separable constraints. Such problems are prevalent among semidefinite programming (SDP) formulations arising in machine learning and theoretical computer science. We provide numerical experiments on matrix completion, unsupervised clustering, and sparsest-cut SDPs.
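A minimal sketch of the one-sample-per-iteration mechanism (illustrative only: grad_i, lmo, and the classical $2/(t+2)$ step size are placeholders, and the non-smooth term and separable constraints of the full template are omitted):
\begin{verbatim}
import numpy as np

def sag_cgm(x0, n_samples, grad_i, lmo, n_iters):
    """Conditional gradient with a SAG estimator: keep the last gradient seen
    for each sample; each iteration refreshes one entry and uses the table
    average as the gradient estimate for the Frank-Wolfe step."""
    x = x0.copy()
    table = np.zeros((n_samples,) + x0.shape)     # stored per-sample gradients
    g_avg = np.zeros_like(x0)                     # running average of the table
    for t in range(1, n_iters + 1):
        i = np.random.randint(n_samples)          # one sample per iteration
        g_new = grad_i(i, x)
        g_avg += (g_new - table[i]) / n_samples   # update the SAG estimate
        table[i] = g_new
        s = lmo(g_avg)                            # linear minimization oracle
        gamma = 2.0 / (t + 2)
        x = (1 - gamma) * x + gamma * s
    return x
\end{verbatim}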
Multigrid is a powerful solver for large-scale linear systems arising from discretized partial differential equations. The convergence theory of multigrid methods for symmetric positive definite problems has been well developed over the past decades, while for nonsymmetric problems such theory is still not mature. As a foundation for multigrid analysis, two-grid convergence theory plays an important role in motivating multigrid algorithms. Regarding two-grid methods for nonsymmetric problems, most previous works focus on the spectral radius of the iteration matrix or rely on convergence measures that are typically difficult to compute in practice. Moreover, the existing results are confined to two-grid methods with exact solution of the coarse-grid system. In this paper, we analyze the convergence of a two-grid method for nonsymmetric positive definite problems (e.g., linear systems arising from discretizations of convection-diffusion equations). In the case of an exact coarse solver, we establish an elegant identity characterizing the two-grid convergence factor, which is measured in a smoother-induced norm. The identity can conveniently be used to derive a class of optimal restriction operators and to analyze how the convergence factor is influenced by the choice of restriction. More generally, we present convergence estimates for an inexact variant of the two-grid method, in which both linear and nonlinear coarse solvers are considered.
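For orientation, a generic two-grid iteration for $Ax = b$ with restriction $R$, prolongation $P$, and a damped-Jacobi smoother is sketched below (our illustrative code; the paper's analysis concerns the choice of $R$ and the norm in which convergence is measured):
\begin{verbatim}
import numpy as np

def two_grid(A, b, x, R, P, n_smooth=2, omega=0.7):
    """One two-grid cycle: pre-smooth, coarse-grid correction with an exact
    coarse solve, post-smooth."""
    D_inv = 1.0 / np.diag(A)
    for _ in range(n_smooth):                            # pre-smoothing (damped Jacobi)
        x = x + omega * D_inv * (b - A @ x)
    A_c = R @ A @ P                                      # coarse-grid operator
    x = x + P @ np.linalg.solve(A_c, R @ (b - A @ x))    # coarse-grid correction
    for _ in range(n_smooth):                            # post-smoothing
        x = x + omega * D_inv * (b - A @ x)
    return x
\end{verbatim}
The inexact variant discussed above replaces the exact coarse solve by an inner linear or nonlinear solver.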
We propose a First-Order System Least Squares (FOSLS) method based on deep learning for numerically solving second-order elliptic PDEs. The proposed method can handle both variational and non-variational problems, and, owing to its meshless nature, it can also deal with problems posed on high-dimensional domains. We prove $\Gamma$-convergence of the neural network approximation to the solution of the continuous problem and extend the convergence proof to some well-known related methods. Finally, we present several numerical examples illustrating the performance of our discretization.
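As a representative example (Poisson problem; the method above covers more general second-order elliptic operators and includes suitable boundary terms), the FOSLS reformulation introduces the flux $\boldsymbol{\sigma} = \nabla u$ and minimizes the least-squares functional
\[
\mathcal{J}(u,\boldsymbol{\sigma}) \;=\; \|\boldsymbol{\sigma} - \nabla u\|_{L^2(\Omega)}^2 \;+\; \|\nabla\!\cdot\boldsymbol{\sigma} + f\|_{L^2(\Omega)}^2
\]
over a neural-network parametrization of $(u,\boldsymbol{\sigma})$, which requires no mesh and therefore remains tractable on high-dimensional domains.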