We consider the problem of minimizing a convex function that is evolving according to unknown and possibly stochastic dynamics, which may depend jointly on time and on the decision variable itself. Such problems abound in the machine learning and signal processing literature, under the names of concept drift, stochastic tracking, and performative prediction. We provide novel non-asymptotic convergence guarantees for stochastic algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability. The efficiency estimates we obtain clearly decouple the contributions of optimization error, gradient noise, and time drift. Notably, we show that the tracking efficiency of the proximal stochastic gradient method, when equipped with a step-decay schedule, depends only logarithmically on the initialization quality. Numerical experiments illustrate our results.
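As a hedged illustration of the ingredients named above (proximal steps, iterate averaging, a step-decay schedule, and a drifting objective), the following Python sketch tracks a slowly moving $\ell_1$-regularized quadratic with noisy gradients. The schedule, step sizes, drift model, and the `prox_l1`/`grad_oracle` helpers are illustrative assumptions, not the paper's exact algorithm or constants.

```python
import numpy as np

def prox_l1(v, lam):
    """Proximal operator of lam * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_sgd_step_decay(grad_oracle, x0, gamma0=0.1, lam=0.01,
                        epochs=8, iters_per_epoch=200):
    """Proximal stochastic gradient with step decay and per-epoch averaging.

    Illustrative sketch: the step size is halved after every epoch and the
    returned point is the average of the final epoch's iterates.
    """
    x, gamma = x0.copy(), gamma0
    for epoch in range(epochs):
        running_sum = np.zeros_like(x)
        for _ in range(iters_per_epoch):
            g = grad_oracle(x)                       # noisy (sub)gradient
            x = prox_l1(x - gamma * g, gamma * lam)  # proximal gradient step
            running_sum += x
        x_bar = running_sum / iters_per_epoch        # averaged iterate
        gamma *= 0.5                                 # step-decay schedule
    return x_bar

# Toy drifting objective: 0.5*||x - b_t||^2 with gradient noise and slow drift
rng = np.random.default_rng(0)
d, step = 20, {"t": 0}
def grad_oracle(x):
    step["t"] += 1
    b_t = 1e-3 * step["t"] * np.ones(d)              # time drift of the minimizer
    return (x - b_t) + 0.1 * rng.standard_normal(d)  # gradient noise

x_hat = prox_sgd_step_decay(grad_oracle, x0=5.0 * np.ones(d))
```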
We introduce an online convex optimization algorithm based on projected subgradient descent with optimal adaptive learning rates and sequential, efficient first-order updates. Our method provides a subgradient-adaptive, minimax-optimal dynamic regret guarantee for a sequence of general convex functions with no known additional properties such as strong convexity, smoothness, exp-concavity, or even Lipschitz continuity. The guarantee holds against any comparator decision sequence with bounded "complexity", defined as the cumulative distance traveled through changes between successive decisions. We establish optimality by deriving a lower bound on the worst-case second-order dynamic regret that incorporates the actual subgradient norms and matches our guarantees up to a constant factor. We also derive an extension that learns independently in each decision coordinate. Additionally, we show how to preserve our guarantees when the bound on the total successive changes of the dynamic comparator sequence grows over time, or when feedback about this bound arrives only partially over time, both in a truly online manner. Then, as a major contribution, we address the scenario in which we receive no information about the successive changes: by repurposing the expert-mixture framework with novel additions, we eliminate the need for such information, again in a truly online manner. Moreover, we show how to compete against all dynamic comparator sequences simultaneously (universally) with minimax optimality, where the guarantee depends on the "complexity" of each comparator separately. We also discuss modifications of our approach that further reduce time, computation, and memory complexity, and we strengthen the universal competitiveness with guarantees that account for how a comparator sequence concentrates in the decision set.
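A minimal sketch of the kind of update the abstract describes: projected online subgradient descent whose learning rate adapts to the accumulated subgradient norms (in the spirit of AdaGrad-norm style steps), so no Lipschitz constant is assumed in advance. The step-size rule and the ball-shaped decision set are assumptions for illustration, not the paper's exact minimax-optimal scheme.

```python
import numpy as np

def project_ball(x, radius):
    """Euclidean projection onto the origin-centered ball of given radius."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def adaptive_ogd(subgrad_fn, x0, T, radius=1.0):
    """Projected online subgradient descent with an adaptive learning rate.

    Illustrative sketch: eta_t scales with the inverse root of the accumulated
    squared subgradient norms observed so far, so no Lipschitz constant or
    other curvature information is required in advance.
    """
    x, acc = x0.copy(), 0.0
    decisions = []
    for t in range(T):
        decisions.append(x.copy())
        g = subgrad_fn(t, x)                     # subgradient of f_t at the played point
        acc += float(np.dot(g, g))
        eta = radius / np.sqrt(acc) if acc > 0 else 0.0
        x = project_ball(x - eta * g, radius)
    return decisions
```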
Empirical risk minimization (ERM) is known in practice to be non-robust to distributional shift, where the training and test distributions differ. A suite of approaches, such as importance weighting and variants of distributionally robust optimization (DRO), have been proposed to address this problem. However, a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of Generalized Reweighting (GRW) algorithms as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to those obtained by ERM. We also show that adding a small amount of regularization that does not greatly affect empirical training accuracy does not help. Together, our results show that the broad category of GRW approaches cannot achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to the class of GRW approaches.
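To make the GRW template concrete, here is a small assumed instance for weighted logistic regression: per-sample weights are updated multiplicatively from the current losses, and the model follows the weighted gradient. The multiplicative rule and the hyperparameters are illustrative; the GRW class studied in the paper is broader.

```python
import numpy as np

def grw_logistic(X, y, steps=500, lr=0.1, eta_w=0.5):
    """An assumed GRW instance: multiplicatively reweighted logistic regression.

    Sample weights q are increased on high-loss examples and the parameters
    follow the q-weighted gradient; many reweighting rules fit this template.
    """
    n, d = X.shape
    w = np.zeros(d)
    q = np.full(n, 1.0 / n)                              # uniform initial weights
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))               # predicted probabilities
        losses = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        q = q * np.exp(eta_w * (losses - losses.max()))  # up-weight hard samples
        q /= q.sum()
        w -= lr * (X.T @ (q * (p - y)))                  # weighted gradient step
    return w, q
```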
We consider a subclass of $n$-player stochastic games, in which players have their own internal state/action spaces while they are coupled through their payoff functions. It is assumed that players' internal chains are driven by independent transition probabilities. Moreover, players can only receive realizations of their payoffs, not the actual payoff functions, nor can they observe each other's states/actions. Under some assumptions on the structure of the payoff functions, we develop efficient learning algorithms based on Dual Averaging and Dual Mirror Descent, which provably converge almost surely or in expectation to the set of $\epsilon$-Nash equilibrium policies. In particular, we derive upper bounds on the number of iterations, scaling polynomially in the game parameters, required to achieve an $\epsilon$-Nash equilibrium policy. Besides Markov potential games and linear-quadratic stochastic games, this work thus identifies another interesting subclass of $n$-player stochastic games that, under some assumptions, provably admits polynomial-time learning algorithms for finding $\epsilon$-Nash equilibrium policies.
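The following sketch shows one player's dual-averaging update over its action simplex from bandit payoff realizations, which is the basic ingredient the abstract builds on. It collapses the internal Markovian state to a single state and uses an importance-weighted payoff estimate; both are simplifying assumptions rather than the paper's full algorithm.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dual_averaging_player(payoff_oracle, n_actions, T, eta=0.1, seed=0):
    """One player's dual-averaging loop over its action simplex (toy version).

    The player accumulates importance-weighted payoff estimates and plays the
    entropic (softmax) regularized response to the running score; the paper's
    setting additionally involves internal Markovian states and rival players.
    """
    rng = np.random.default_rng(seed)
    score = np.zeros(n_actions)                  # accumulated payoff estimates
    for _ in range(T):
        policy = softmax(eta * score)            # mirror map for the entropy regularizer
        a = rng.choice(n_actions, p=policy)
        r = payoff_oracle(a)                     # noisy payoff realization only
        score[a] += r / max(policy[a], 1e-12)    # importance-weighted estimate
    return softmax(eta * score)
```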
Iterative distributed optimization algorithms involve multiple agents that communicate with each other, over time, in order to minimize/maximize a global objective. In the presence of unreliable communication networks, the Age-of-Information (AoI), which measures the freshness of data received, may be large and hence hinder algorithmic convergence. In this paper, we study the convergence of general distributed gradient-based optimization algorithms in the presence of communication that neither happens periodically nor at stochastically independent points in time. We show that convergence is guaranteed provided the random variables associated with the AoI processes are stochastically dominated by a random variable with finite first moment. This improves on previous requirements of boundedness of more than the first moment. We then introduce stochastically strongly connected (SSC) networks, a new stochastic form of strong connectedness for time-varying networks. We show that if, for some $p \ge 0$, the processes that describe the success of communication between agents in an SSC network are $\alpha$-mixing with $n^{p-1}\alpha(n)$ summable, then the associated AoI processes are stochastically dominated by a random variable with finite $p$-th moment. In combination with our first contribution, this implies that distributed stochastic gradient descent converges in the presence of AoI if $\alpha(n)$ is summable.
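A toy sketch of the object under study: gradient descent that, at each step, acts on information whose age is a random AoI draw. The geometric AoI law and the step size are purely illustrative; the paper's results concern which moments of the AoI process suffice for convergence, not this specific simulation.

```python
import numpy as np

def gd_with_aoi(grad, x0, T=2000, lr=0.01, seed=0):
    """Gradient descent driven by aged information (toy simulation).

    At every step the update uses the iterate the agent last "received",
    whose age is a random Age-of-Information draw; the paper's conditions
    concern moments of such AoI processes rather than any particular law.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    history = [x.copy()]
    for t in range(T):
        age = min(rng.geometric(0.2) - 1, t)     # random AoI (assumed geometric law)
        stale = history[-1 - age]                # the freshest data is `age` steps old
        x = x - lr * grad(stale)
        history.append(x.copy())
    return x
```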
In this work, we study stochastic non-cooperative games, where only noisy black-box function evaluations are available to estimate the cost function of each player. Since each player's cost function depends on both its own decision variables and its rivals' decision variables, most existing work on seeking the Nash equilibrium requires local information to be exchanged through a center or network. We propose a new stochastic distributed learning algorithm that does not require communication among players. The proposed algorithm uses a simultaneous perturbation method to estimate the gradient of each cost function, and uses mirror descent to search for the Nash equilibrium. We provide an asymptotic analysis of the bias and variance of the gradient estimates, and show that the proposed algorithm converges to the Nash equilibrium in mean square for the class of strictly monotone games, at a rate faster than existing algorithms. The effectiveness of the proposed method is demonstrated in a numerical experiment.
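A hedged sketch of the two ingredients named above: a two-point simultaneous-perturbation gradient estimate combined with a (Euclidean) mirror-descent update for a single player. The two-point estimator, box constraint, and step/perturbation schedules are assumptions for illustration only; the paper's algorithm and analysis may differ in these choices.

```python
import numpy as np

def spsa_gradient(cost, x, delta, rng):
    """Two-point simultaneous-perturbation estimate of grad cost(x)."""
    direction = rng.choice([-1.0, 1.0], size=x.shape)
    c_plus = cost(x + delta * direction)          # noisy cost evaluations
    c_minus = cost(x - delta * direction)
    return (c_plus - c_minus) / (2.0 * delta) * direction

def zeroth_order_mirror_descent(cost, x0, T=1000, lr=0.05, delta=0.05, seed=0):
    """One player's zeroth-order mirror-descent loop (Euclidean mirror map).

    Gradients are estimated by simultaneous perturbation and decisions are
    updated by a projected mirror-descent step onto a box-shaped feasible set.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for t in range(1, T + 1):
        g = spsa_gradient(cost, x, delta, rng)
        x = np.clip(x - (lr / np.sqrt(t)) * g, -1.0, 1.0)   # step + projection
    return x
```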
Functional constrained optimization is becoming increasingly important in machine learning and operations research. Such problems have potential applications in risk-averse machine learning, semi-supervised learning, and robust optimization, among others. In this paper, we first present a novel Constraint Extrapolation (ConEx) method for solving convex functional constrained problems, which utilizes linear approximations of the constraint functions to define the extrapolation (or acceleration) step. We show that this method is a unified algorithm that achieves the best-known rate of convergence for solving different classes of functional constrained convex composite problems, including convex or strongly convex, and smooth or nonsmooth problems with a stochastic objective and/or stochastic constraints. Many of these rates of convergence were in fact obtained for the first time in the literature. In addition, ConEx is a single-loop algorithm that does not involve any penalty subproblems. Unlike existing primal-dual methods, it does not require projection of the Lagrange multipliers onto a (possibly unknown) bounded set. Second, for nonconvex functional constrained problems, we introduce a new proximal point method that transforms the initial nonconvex problem into a sequence of convex problems by adding quadratic terms to both the objective and the constraints. Under a certain MFCQ-type assumption, we establish the convergence and rate of convergence of this method to KKT points when the convex subproblems are solved exactly or inexactly. For large-scale and stochastic problems, we present a more practical proximal point method in which the approximate solutions of the subproblems are computed by the aforementioned ConEx method. To the best of our knowledge, most of these convergence and complexity results for the proximal point method on nonconvex problems are also new in the literature.
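A schematic single-loop primal-dual iteration in the spirit of constraint extrapolation, for min f(x) subject to a single constraint g(x) <= 0: the dual step uses an extrapolated constraint value built from the last two primal iterates, and no bound on the multiplier is imposed. Step sizes, the extrapolation weight, and the scalar-constraint setting are illustrative assumptions, not the paper's prescriptions.

```python
import numpy as np

def conex_sketch(grad_f, g, grad_g, x0, T=500, eta=0.05, tau=0.05, theta=1.0):
    """Schematic single-loop primal-dual loop with constraint extrapolation.

    For min f(x) s.t. g(x) <= 0: the dual step uses an extrapolated constraint
    value built from the last two primal iterates, and the multiplier is never
    projected onto a bounded set.  Step sizes are illustrative only.
    """
    x_prev, x = x0.copy(), x0.copy()
    y = 0.0                                        # Lagrange multiplier
    for _ in range(T):
        ell = g(x) + theta * (g(x) - g(x_prev))    # extrapolated constraint value
        y = max(0.0, y + tau * ell)                # dual ascent, kept nonnegative
        grad_lag = grad_f(x) + y * grad_g(x)       # gradient of the Lagrangian in x
        x_prev, x = x, x - eta * grad_lag          # primal descent step
    return x, y
```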
We study the computational complexity of the zigzag sampling algorithm for strongly log-concave distributions. The zigzag process has the advantages that its implementation requires no time discretization and that each proposed bouncing event requires only one evaluation of a partial derivative of the potential, while its convergence rate is dimension independent. Using these properties, we prove that the zigzag sampling algorithm achieves $\varepsilon$ error in chi-square divergence with a computational cost equivalent to $O\bigl(\kappa^2 d^\frac{1}{2}(\log\frac{1}{\varepsilon})^{\frac{3}{2}}\bigr)$ gradient evaluations in the regime $\kappa \ll \frac{d}{\log d}$ under a warm start assumption, where $\kappa$ is the condition number and $d$ is the dimension.
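As a concrete, hedged illustration of "no time discretization", the sketch below runs a one-dimensional zigzag process targeting $N(0, \sigma^2)$: bounce times are obtained by inverting the integrated switching rate exactly, and samples are read off the piecewise-linear trajectory. The one-dimensional Gaussian target and the skeleton spacing are assumptions for illustration only.

```python
import numpy as np

def zigzag_gaussian(sigma=1.0, T=1000.0, x0=3.0, seed=0):
    """One-dimensional zigzag sampler for N(0, sigma^2).

    The potential is U(x) = x^2 / (2 sigma^2), so the switching rate in
    direction theta is lambda(x, theta) = max(0, theta * x / sigma^2).
    Event times are sampled exactly by inverting the integrated rate,
    so no time discretization is needed.
    """
    rng = np.random.default_rng(seed)
    x, theta, t = x0, 1.0, 0.0
    events = [(t, x)]
    b = 1.0 / sigma**2
    while t < T:
        a = theta * x / sigma**2            # switching rate at the current point
        e = rng.exponential(1.0)            # Exp(1) clock
        if a >= 0:
            tau = (-a + np.sqrt(a**2 + 2*b*e)) / b
        else:
            tau = -a / b + np.sqrt(2*e/b)   # rate is zero until time -a/b
        x, t = x + theta * tau, t + tau     # move along the current direction
        theta = -theta                      # bounce
        events.append((t, x))
    return events

def skeleton_samples(events, dt=0.5):
    """Positions of the piecewise-linear trajectory on a regular time grid."""
    ts = np.array([e[0] for e in events])
    xs = np.array([e[1] for e in events])
    grid = np.arange(0.0, ts[-1], dt)
    return np.interp(grid, ts, xs)

samples = skeleton_samples(zigzag_gaussian())
print(samples.mean(), samples.var())        # should be close to 0 and 1
```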
We propose a general and scalable approximate sampling strategy for probabilistic models with discrete variables. Our approach uses gradients of the likelihood function with respect to its discrete inputs to propose updates in a Metropolis-Hastings sampler. We show empirically that this approach outperforms generic samplers in a number of difficult settings including Ising models, Potts models, restricted Boltzmann machines, and factorial hidden Markov models. We also demonstrate the use of our improved sampler for training deep energy-based models on high dimensional discrete data. This approach outperforms variational auto-encoders and existing energy-based models. Finally, we give bounds showing that our approach is near-optimal in the class of samplers which propose local updates.
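A minimal sketch of the proposal mechanism for an Ising model with symmetric couplings J and zero diagonal: the gradient of the (continuously relaxed) log-likelihood scores how much each single-site flip would change log p, the scores define the proposal distribution over sites, and a standard Metropolis-Hastings correction keeps the chain exact. The temperature of 1/2 in the proposal and the single-flip move set are assumptions following common practice.

```python
import numpy as np

def ising_logp(x, J, b):
    """Unnormalized log-probability of spins x in {-1,+1}^d."""
    return 0.5 * x @ J @ x + b @ x

def gradient_mh_step(x, J, b, rng):
    """One gradient-informed Metropolis-Hastings single-site flip."""
    grad = J @ x + b                                  # gradient of the relaxed log p
    scores = -2.0 * x * grad                          # estimated log p change per flip
    q = np.exp(0.5 * (scores - scores.max())); q /= q.sum()
    i = rng.choice(len(x), p=q)                       # propose flipping site i
    x_new = x.copy(); x_new[i] = -x_new[i]
    grad_new = J @ x_new + b
    scores_new = -2.0 * x_new * grad_new
    q_new = np.exp(0.5 * (scores_new - scores_new.max())); q_new /= q_new.sum()
    log_accept = (ising_logp(x_new, J, b) - ising_logp(x, J, b)
                  + np.log(q_new[i]) - np.log(q[i]))  # MH correction
    return x_new if np.log(rng.uniform()) < log_accept else x
```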
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
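The local-smoothing ingredient of DRS, sketched in isolation from the communication scheme: the non-smooth objective is replaced by its Gaussian smoothing $f_\gamma(x) = \mathbb{E}[f(x + \gamma Z)]$, whose gradient is estimated by averaging subgradients at randomly perturbed points. The sample count and smoothing radius below are illustrative parameters.

```python
import numpy as np

def smoothed_gradient(subgrad, x, gamma=0.1, n_samples=10, rng=None):
    """Stochastic gradient of the Gaussian smoothing f_gamma(x) = E[f(x + gamma Z)].

    The gradient of the smoothed objective is estimated by averaging
    subgradients of the original non-smooth f at perturbed points; this is
    the local smoothing step, shown without the distributed network part.
    """
    rng = rng or np.random.default_rng(0)
    grads = [subgrad(x + gamma * rng.standard_normal(x.shape))
             for _ in range(n_samples)]
    return np.mean(grads, axis=0)
```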
In this paper, we study the optimal convergence rate for distributed convex optimization problems over networks. We model the communication restrictions imposed by the network as a set of affine constraints and provide optimal complexity bounds for four different setups, namely when the function $F(\mathbf{x}) \triangleq \sum_{i=1}^{m}f_i(\mathbf{x})$ is (i) strongly convex and smooth, (ii) strongly convex only, (iii) smooth only, or (iv) just convex. Our results show that Nesterov's accelerated gradient descent on the dual problem can be executed in a distributed manner and attains the same optimal rates as in the centralized version of the problem (up to constant or logarithmic factors), with an additional cost related to the spectral gap of the interaction matrix. Finally, we discuss some extensions of the proposed setup, such as proximal-friendly functions, time-varying graphs, and improved condition numbers.
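A hedged toy instance of the dual viewpoint: for quadratic local objectives and a consensus constraint encoded by a Laplacian-like matrix W, Nesterov's accelerated gradient on the dual only multiplies by W, so each iteration needs only neighbor-to-neighbor communication. The quadratic objectives, the path graph, and the step-size choice are assumptions for this sketch, not the paper's general setting.

```python
import numpy as np

def accelerated_dual_consensus(c, W, T=300):
    """Nesterov-accelerated gradient ascent on the dual of a consensus problem.

    Toy problem: min sum_i 0.5*(x_i - c_i)^2 subject to consensus, with the
    network encoded by the Laplacian-like matrix W.  Every dual gradient is a
    multiplication by W, i.e. one round of neighbor-to-neighbor communication.
    """
    m = len(c)
    lam, lam_prev = np.zeros(m), np.zeros(m)       # dual variables, one per agent
    L_dual = np.linalg.norm(W, 2) ** 2             # Lipschitz constant of the dual gradient
    for t in range(1, T + 1):
        z = lam + (t - 1) / (t + 2) * (lam - lam_prev)   # momentum extrapolation
        x = c - W @ z                              # local minimizers given dual z
        lam_prev, lam = lam, z + (W @ x) / L_dual  # dual ascent step
    return c - W @ lam                             # primal estimate

# Path graph on 4 agents; the entries should approach the average 2.5
W = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])
print(accelerated_dual_consensus(np.array([1., 2., 3., 4.]), W))
```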