In this paper, we propose an accelerated quasi-Newton proximal extragradient (A-QPNE) method for solving unconstrained smooth convex optimization problems. With access only to the gradients of the objective, we prove that our method can achieve a convergence rate of ${O}\bigl(\min\{\frac{1}{k^2}, \frac{\sqrt{d\log k}}{k^{2.5}}\}\bigr)$, where $d$ is the problem dimension and $k$ is the number of iterations. In particular, in the regime where $k = {O}(d)$, our method matches the optimal rate of ${O}(\frac{1}{k^2})$ achieved by Nesterov's accelerated gradient (NAG). Moreover, in the regime where $k = \Omega(d \log d)$, it outperforms NAG and converges at a faster rate of ${O}\bigl(\frac{\sqrt{d\log k}}{k^{2.5}}\bigr)$. To the best of our knowledge, this result is the first to demonstrate a provable gain of a quasi-Newton-type method over NAG in the convex setting. To achieve such results, we build our method on a recent variant of the Monteiro-Svaiter acceleration framework and adopt an online learning perspective to update the Hessian approximation matrices, in which we relate the convergence rate of our method to the dynamic regret of a specific online convex optimization problem in the space of matrices.
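To make the two regimes concrete, the following short worked comparison shows when the second term inside the minimum is the smaller one. It is a sketch that ignores the constants hidden in the ${O}(\cdot)$ notation:

\[
\frac{\sqrt{d\log k}}{k^{2.5}} \le \frac{1}{k^{2}}
\iff \sqrt{\frac{d\log k}{k}} \le 1
\iff d\log k \le k .
\]

The condition $d\log k \le k$ holds once $k \ge C\, d\log d$ for a sufficiently large constant $C$, which corresponds to the regime $k = \Omega(d\log d)$. Conversely, when $k = {O}(d)$ the inequality fails up to constants, so the $\frac{1}{k^2}$ term determines the rate.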
Stochastic gradient descent (SGD) is the simplest deep learning optimizer with which to train deep neural networks. While SGD can use various learning rates, such as constant or diminishing rates, previous numerical results have shown that SGD performs better than other deep learning optimizers when it uses learning rates given by line search methods. In this paper, we perform a convergence analysis on SGD with a learning rate given by an Armijo line search for nonconvex optimization. The analysis indicates that the upper bound of the expectation of the squared norm of the full gradient becomes small when the number of steps and the batch size are large. Next, we show that, for SGD with the Armijo-line-search learning rate, the number of steps needed for nonconvex optimization is a monotone decreasing convex function of the batch size; that is, the number of steps needed for nonconvex optimization decreases as the batch size increases. Furthermore, we show that the stochastic first-order oracle (SFO) complexity, which is the stochastic gradient computation cost, is a convex function of the batch size; that is, there exists a critical batch size that minimizes the SFO complexity. Finally, we provide numerical results that support our theoretical results. The numerical results indicate that the number of steps needed for training deep neural networks decreases as the batch size increases and that there exist critical batch sizes that can be estimated from the theoretical results.
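As a rough illustration of what an Armijo line search does inside SGD, the sketch below backtracks on the mini-batch loss until the sufficient-decrease condition holds. It is not the authors' implementation: the function names, the constants `c` and `beta`, and the toy least-squares objective are all assumptions chosen for illustration.

```python
import numpy as np

def armijo_step(loss, grad, x, lr0=1.0, c=1e-4, beta=0.5, max_backtracks=50):
    """Backtrack from lr0 until the Armijo sufficient-decrease condition
    loss(x - lr*g) <= loss(x) - c * lr * ||g||^2 holds (or the budget runs out)."""
    g = grad(x)
    f0 = loss(x)
    g_sq = float(g @ g)
    lr = lr0
    for _ in range(max_backtracks):
        if loss(x - lr * g) <= f0 - c * lr * g_sq:
            break
        lr *= beta  # shrink the step and try again
    return lr, g

def sgd_armijo(A, b, x0, batch_size, steps, seed=0):
    """SGD on the toy objective 0.5 * mean((A x - b)^2), with the learning
    rate of each step chosen by Armijo backtracking on the sampled mini-batch."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    n = A.shape[0]
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        Ab, bb = A[idx], b[idx]
        loss = lambda z: 0.5 * np.mean((Ab @ z - bb) ** 2)
        grad = lambda z: Ab.T @ (Ab @ z - bb) / batch_size
        lr, g = armijo_step(loss, grad, x)
        x = x - lr * g
    return x
```

In this sketch the line search only sees the sampled mini-batch, so a larger `batch_size` makes each accepted step a better proxy for a step on the full objective, at the cost of more gradient computation per step.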
Quasi-Newton algorithms are among the most popular iterative methods for solving unconstrained minimization problems, largely due to their favorable superlinear convergence property. However, existing results for these algorithms are limited as they provide either (i) a global convergence guarantee with an asymptotic superlinear convergence rate, or (ii) a local non-asymptotic superlinear rate for the case that the initial point and the initial Hessian approximation are chosen properly. In particular, no current analysis for quasi-Newton methods guarantees global convergence with an explicit superlinear convergence rate. In this paper, we close this gap and present the first globally convergent quasi-Newton method with an explicit non-asymptotic superlinear convergence rate. Unlike classical quasi-Newton methods, we build our algorithm upon the hybrid proximal extragradient method and propose a novel online learning framework for updating the Hessian approximation matrices. Specifically, guided by the convergence analysis, we formulate the Hessian approximation update as an online convex optimization problem in the space of matrices, and we relate the bounded regret of the online problem to the superlinear convergence of our method.
We study the open question of how players learn to play a social optimum pure-strategy Nash equilibrium (PSNE) through repeated interactions in general-sum coordination games. A social optimum of a game is the stable Pareto-optimal state that provides a maximum return in the sum of all players' payoffs (social welfare) and always exists. We consider finite repeated games where each player only has access to its own utility (or payoff) function but is able to exchange information with other players. We develop a novel regret matching (RM) based algorithm for computing an efficient PSNE solution that could approach a desired Pareto-optimal outcome yielding the highest social welfare among all the attainable equilibria in the long run. Our proposed learning procedure follows the regret minimization framework but extends it in three major ways: (1) agents use global, instead of local, utility for calculating regrets, (2) each agent maintains a small and diminishing exploration probability in order to explore various PSNEs, and (3) agents stay with the actions that achieve the best global utility thus far, regardless of regrets. We prove that these three extensions enable the algorithm to select the stable social optimum equilibrium instead of converging to an arbitrary or cyclic equilibrium as in the conventional RM approach. We demonstrate the effectiveness of our approach through a set of applications in multi-agent distributed control, including a large-scale resource allocation game and a hard combinatorial task assignment problem for which no efficient (polynomial) solution exists.
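For orientation, the sketch below shows the conventional (unconditional) regret matching baseline that the extensions above modify: each player uses only its own payoff and plays actions with probability proportional to positive cumulative regret. The function names and the two-player setup are assumptions for illustration, and none of the three extensions is implemented here.

```python
import numpy as np

def regret_matching_policy(cum_regret):
    """Play each action with probability proportional to its positive
    cumulative regret; fall back to uniform when no regret is positive."""
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    if total > 0:
        return pos / total
    return np.full(len(cum_regret), 1.0 / len(cum_regret))

def update_regret(cum_regret, utilities, played):
    """utilities[a] is the payoff action a would have earned against the
    opponents' realized actions; regret is measured against the action played."""
    return cum_regret + (utilities - utilities[played])

def self_play(payoffs, rounds, seed=0):
    """Two-player self-play; payoffs[i][a0, a1] is player i's own payoff."""
    rng = np.random.default_rng(seed)
    n0, n1 = payoffs[0].shape
    regrets = [np.zeros(n0), np.zeros(n1)]
    history = []
    for _ in range(rounds):
        a0 = rng.choice(n0, p=regret_matching_policy(regrets[0]))
        a1 = rng.choice(n1, p=regret_matching_policy(regrets[1]))
        # Counterfactual payoffs with the other player's action held fixed.
        regrets[0] = update_regret(regrets[0], payoffs[0][:, a1], a0)
        regrets[1] = update_regret(regrets[1], payoffs[1][a0, :], a1)
        history.append((int(a0), int(a1)))
    return history
```

Replacing each player's own payoff matrix with the sum of all players' payoffs would correspond to the "global utility" idea in extension (1); the exploration and best-so-far rules in (2) and (3) would need additional state that this sketch does not model.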
We revisit the well-studied problem of learning a linear combination of $k$ ReLU activations given labeled examples drawn from the standard $d$-dimensional Gaussian measure. Chen et al. [CDG+23] recently gave the first algorithm for this problem to run in $\text{poly}(d,1/\varepsilon)$ time when $k = O(1)$, where $\varepsilon$ is the target error. More precisely, their algorithm runs in time $(d/\varepsilon)^{\mathrm{quasipoly}(k)}$ and learns over multiple stages. Here we show that a much simpler one-stage version of their algorithm suffices, and moreover its runtime is only $(d/\varepsilon)^{O(k^2)}$.
The Independent Cutset problem asks whether there is a set of vertices in a given graph that is both independent and a cutset. Such a problem is $\textsf{NP}$-complete even when the input graph is planar and has maximum degree five. In this paper, we first present an $\mathcal{O}^*(1.4423^{n})$-time algorithm for the problem. We also show how to compute a minimum independent cutset (if any) in the same running time. Since the property of having an independent cutset is MSO$_1$-expressible, our main results are concerned with structural parameterizations for the problem considering parameters that are not bounded by a function of the clique-width of the input. We present $\textsf{FPT}$-time algorithms for the problem considering the following parameters: the dual of the maximum degree, the dual of the solution size, the size of a dominating set (where a dominating set is given as an additional input), the size of an odd cycle transversal, the distance to chordal graphs, and the distance to $P_5$-free graphs. We close by introducing the notion of $\alpha$-domination, which allows us to identify more fixed-parameter tractable and polynomial-time solvable cases.
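To make the problem definition concrete, the following sketch checks the two defining properties directly and searches all vertex subsets by increasing size. It is a naive exponential brute force for small connected graphs, not the $\mathcal{O}^*(1.4423^{n})$-time algorithm of the paper; the adjacency-dictionary representation and the function names are assumptions for illustration.

```python
from itertools import combinations

def is_independent(adj, S):
    """True if no two vertices of S are adjacent."""
    return all(v not in adj[u] for u, v in combinations(S, 2))

def is_cutset(adj, S):
    """True if removing S leaves at least two vertices and a disconnected graph."""
    rest = set(adj) - set(S)
    if len(rest) < 2:
        return False
    start = next(iter(rest))
    seen, stack = {start}, [start]
    while stack:  # depth-first search restricted to the remaining vertices
        u = stack.pop()
        for w in adj[u]:
            if w in rest and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen != rest

def find_independent_cutset(adj):
    """Return a minimum-size independent cutset of a connected graph, or None.
    adj maps each vertex to the set of its neighbours."""
    vertices = sorted(adj)
    for size in range(1, len(vertices) - 1):
        for S in combinations(vertices, size):
            if is_independent(adj, S) and is_cutset(adj, S):
                return set(S)
    return None
```

Because subsets are tried in order of increasing size, the first hit is a minimum independent cutset. On the 4-cycle, for example, no single vertex disconnects the graph, so the search only succeeds with a pair of opposite vertices.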
We study the convex hulls of reachable sets of nonlinear systems with bounded disturbances. Reachable sets play a critical role in control, but remain notoriously challenging to compute, and existing over-approximation tools tend to be conservative or computationally expensive. In this work, we exactly characterize the convex hulls of reachable sets as the convex hulls of solutions of an ordinary differential equation from all possible initial values of the disturbances. This finite-dimensional characterization unlocks a fast sampling-based method to accurately over-approximate reachable sets. We give applications to neural feedback loop analysis and robust model predictive control.
A new algorithm for regret minimization in online convex optimization is described. The regret of the algorithm after $T$ time periods is $O(\sqrt{T \log T})$, which is the minimum possible up to a logarithmic term. In addition, the new algorithm is adaptive, in the sense that the regret bounds hold not only for the time periods $1,\ldots,T$ but also for every sub-interval $s,s+1,\ldots,t$. The running time of the algorithm matches that of newly introduced interior point algorithms for regret minimization: in $n$-dimensional space, during each iteration the new algorithm essentially solves a system of linear equations of order $n$, rather than solving some constrained convex optimization problem in $n$ dimensions and possibly many constraints.
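The abstract does not spell out the algorithm, so the sketch below illustrates only the online convex optimization protocol and the notion of regret, using standard projected online gradient descent as a stand-in. It is a different and simpler method than the adaptive algorithm described above; the function name and its arguments are assumptions for illustration.

```python
import numpy as np

def online_gradient_descent(grads, project, x0, step):
    """Projected online gradient descent.

    grads   : sequence of callables; grads[t-1](x) is the gradient of the
              round-t loss, revealed only after the point x_t is played
    project : Euclidean projection onto the feasible convex set
    x0      : initial feasible point
    step    : callable t -> step size used in round t
    Returns the list of points played, one per round."""
    x = np.asarray(x0, dtype=float)
    plays = []
    for t, g in enumerate(grads, start=1):
        plays.append(x.copy())               # commit to x_t before seeing f_t
        x = project(x - step(t) * g(x))      # gradient step, then project back
    return plays
```

Regret is then the total loss of the played points minus the total loss of the best fixed point in hindsight. With step sizes proportional to $1/\sqrt{t}$, this baseline is known to have $O(\sqrt{T})$ regret over rounds $1,\ldots,T$, but it carries no guarantee on every sub-interval $s,\ldots,t$, which is the adaptivity property the abstract emphasizes.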
Convergence rate analyses of random walk Metropolis-Hastings Markov chains on general state spaces have largely focused on establishing sufficient conditions for geometric ergodicity or on analysis of mixing times. Geometric ergodicity is a key sufficient condition for the Markov chain Central Limit Theorem and allows rigorous approaches to assessing Monte Carlo error. The sufficient conditions for geometric ergodicity of the random walk Metropolis-Hastings Markov chain are refined and extended, which allows the analysis of previously inaccessible settings such as Bayesian Poisson regression. The key technical innovation is the development of explicit drift and minorization conditions for random walk Metropolis-Hastings, which allows explicit upper and lower bounds on the geometric rate of convergence. Further, lower bounds on the geometric rate of convergence are also developed using spectral theory. The existing sufficient conditions for geometric ergodicity, to date, have not provided explicit constraints on the geometric rate of convergence because the method used only implies the existence of drift and minorization conditions. The theoretical results are applied to random walk Metropolis-Hastings algorithms for a class of exponential families and generalized linear models that address Bayesian regression problems.
High-dimensional data arises in numerous applications, and the rapidly developing field of geometric deep learning seeks to develop neural network architectures to analyze such data in non-Euclidean domains, such as graphs and manifolds. Recent work by Z. Wang, L. Ruiz, and A. Ribeiro has introduced a method for constructing manifold neural networks using the spectral decomposition of the Laplace-Beltrami operator. Moreover, in this work, the authors provide a numerical scheme for implementing such neural networks when the manifold is unknown and one only has access to finitely many sample points. The authors show that this scheme, which relies upon building a data-driven graph, converges to the continuum limit as the number of sample points tends to infinity. Here, we build upon this result by establishing a rate of convergence that depends on the intrinsic dimension of the manifold but is independent of the ambient dimension. We also discuss how the rate of convergence depends on the depth of the network and the number of filters used in each layer.
This manuscript portrays optimization as a process. In many practical applications the environment is so complex that it is infeasible to lay out a comprehensive theoretical model and use classical algorithmic theory and mathematical optimization. It is necessary as well as beneficial to take a robust approach, by applying an optimization method that learns as one goes along, learning from experience as more aspects of the problem are observed. This view of optimization as a process has become prominent in varied fields and has led to some spectacular success in modeling and systems that are now part of our daily lives.