This work is devoted to solving the composite optimization problem with the mixture oracle: for the smooth part of the problem, we have access to the gradient, and for the non-smooth part, only to the one-point zero-order oracle. For such a setup, we present a new method based on the sliding algorithm. Our method allows to separate the oracle complexities and compute the gradient for one of the function as rarely as possible. The paper also present the applicability of our new method to the problems of distributed optimization and federated learning. Experimental results confirm the theory.
We study a new two-time-scale stochastic gradient method for solving optimization problems, where the gradients are computed with the aid of an auxiliary variable under samples generated by time-varying Markov random processes parameterized by the underlying optimization variable. These time-varying samples make gradient directions in our update biased and dependent, which can potentially lead to the divergence of the iterates. In our two-time-scale approach, one scale is to estimate the true gradient from these samples, which is then used to update the estimate of the optimal solution. While these two iterates are implemented simultaneously, the former is updated "faster" (using bigger step sizes) than the latter (using smaller step sizes). Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale stochastic gradient method. In particular, we provide explicit formulas for the convergence rates of this method under different structural assumptions, namely, strong convexity, convexity, the Polyak-Lojasiewicz condition, and general non-convexity. We apply our framework to two problems in control and reinforcement learning. First, we look at the standard online actor-critic algorithm over finite state and action spaces and derive a convergence rate of O(k^(-2/5)), which recovers the best known rate derived specifically for this problem. Second, we study an online actor-critic algorithm for the linear-quadratic regulator and show that a convergence rate of O(k^(-2/3)) is achieved. This is the first time such a result is known in the literature. Finally, we support our theoretical analysis with numerical simulations where the convergence rates are visualized.
We present AUQ-ADMM, an adaptive uncertainty-weighted consensus ADMM method for solving large-scale convex optimization problems in a distributed manner. Our key contribution is a novel adaptive weighting scheme that empirically increases the progress made by consensus ADMM scheme and is attractive when using a large number of subproblems. The weights are related to the uncertainty associated with the solutions of each subproblem, and are efficiently computed using low-rank approximations. We show AUQ-ADMM provably converges and demonstrate its effectiveness on a series of machine learning applications, including elastic net regression, multinomial logistic regression, and support vector machines. We provide an implementation based on the PyTorch package.
We study the distributed minimum spanning tree (MST) problem, a fundamental problem in distributed computing. It is well-known that distributed MST can be solved in $\tilde{O}(D+\sqrt{n})$ rounds in the standard CONGEST model (where $n$ is the network size and $D$ is the network diameter) and this is essentially the best possible round complexity (up to logarithmic factors). However, in resource-constrained networks such as ad hoc wireless and sensor networks, nodes spending so much time can lead to significant spending of resources such as energy. Motivated by the above consideration, we study distributed algorithms for MST under the \emph{sleeping model} [Chatterjee et al., PODC 2020], a model for design and analysis of resource-efficient distributed algorithms. In the sleeping model, a node can be in one of two modes in any round -- \emph{sleeping} or \emph{awake} (unlike the traditional model where nodes are always awake). Only the rounds in which a node is \emph{awake} are counted, while \emph{sleeping} rounds are ignored. A node spends resources only in the awake rounds and hence the main goal is to minimize the \emph{awake complexity} of a distributed algorithm, the worst-case number of rounds any node is awake. We present deterministic and randomized distributed MST algorithms that have an \emph{optimal} awake complexity of $O(\log n)$ time with a matching lower bound. We also show that our randomized awake-optimal algorithm has essentially the best possible round complexity by presenting a lower bound of $\tilde{\Omega}(n)$ on the product of the awake and round complexity of any distributed algorithm (including randomized) that outputs an MST, where $\tilde{\Omega}$ hides a $1/(\text{polylog } n)$ factor.
Stochastic optimization algorithms implemented on distributed computing architectures are increasingly used to tackle large-scale machine learning applications. A key bottleneck in such distributed systems is the communication overhead for exchanging information such as stochastic gradients between different workers. Sparse communication with memory and the adaptive aggregation methodology are two successful frameworks among the various techniques proposed to address this issue. In this paper, we exploit the advantages of Sparse communication and Adaptive aggregated Stochastic Gradients to design a communication-efficient distributed algorithm named SASG. Specifically, we determine the workers who need to communicate with the parameter server based on the adaptive aggregation rule and then sparsify the transmitted information. Therefore, our algorithm reduces both the overhead of communication rounds and the number of communication bits in the distributed system. We define an auxiliary sequence and provide convergence results of the algorithm with the help of Lyapunov function analysis. Experiments on training deep neural networks show that our algorithm can significantly reduce the communication overhead compared to the previous methods, with little impact on training and testing accuracy.
Federated Learning has promised a new approach to resolve the challenges in machine learning by bringing computation to the data. The popularity of the approach has led to rapid progress in the algorithmic aspects and the emergence of systems capable of simulating Federated Learning. State of art systems in Federated Learning support a single node aggregator that is insufficient to train a large corpus of devices or train larger-sized models. As the model size or the number of devices increase the single node aggregator incurs memory and computation burden while performing fusion tasks. It also faces communication bottlenecks when a large number of model updates are sent to a single node. We classify the workload for the aggregator into categories and propose a new aggregation service for handling each load. Our aggregation service is based on a holistic approach that chooses the best solution depending on the model update size and the number of clients. Our system provides a fault-tolerant, robust and efficient aggregation solution utilizing existing parallel and distributed frameworks. Through evaluation, we show the shortcomings of the state of art approaches and how a single solution is not suitable for all aggregation requirements. We also provide a comparison of current frameworks with our system through extensive experiments.
The stochastic gradient Langevin Dynamics is one of the most fundamental algorithms to solve sampling problems and non-convex optimization appearing in several machine learning applications. Especially, its variance reduced versions have nowadays gained particular attention. In this paper, we study two variants of this kind, namely, the Stochastic Variance Reduced Gradient Langevin Dynamics and the Stochastic Recursive Gradient Langevin Dynamics. We prove their convergence to the objective distribution in terms of KL-divergence under the sole assumptions of smoothness and Log-Sobolev inequality which are weaker conditions than those used in prior works for these algorithms. With the batch size and the inner loop length set to $\sqrt{n}$, the gradient complexity to achieve an $\epsilon$-precision is $\tilde{O}((n+dn^{1/2}\epsilon^{-1})\gamma^2 L^2\alpha^{-2})$, which is an improvement from any previous analyses. We also show some essential applications of our result to non-convex optimization.
With the increasing penetration of distributed energy resources, distributed optimization algorithms have attracted significant attention for power systems applications due to their potential for superior scalability, privacy, and robustness to a single point-of-failure. The Alternating Direction Method of Multipliers (ADMM) is a popular distributed optimization algorithm; however, its convergence performance is highly dependent on the selection of penalty parameters, which are usually chosen heuristically. In this work, we use reinforcement learning (RL) to develop an adaptive penalty parameter selection policy for the AC optimal power flow (ACOPF) problem solved via ADMM with the goal of minimizing the number of iterations until convergence. We train our RL policy using deep Q-learning, and show that this policy can result in significantly accelerated convergence (up to a 59% reduction in the number of iterations compared to existing, curvature-informed penalty parameter selection methods). Furthermore, we show that our RL policy demonstrates promise for generalizability, performing well under unseen loading schemes as well as under unseen losses of lines and generators (up to a 50% reduction in iterations). This work thus provides a proof-of-concept for using RL for parameter selection in ADMM for power systems applications.
The aim of this work is to develop a fully-distributed algorithmic framework for training graph convolutional networks (GCNs). The proposed method is able to exploit the meaningful relational structure of the input data, which are collected by a set of agents that communicate over a sparse network topology. After formulating the centralized GCN training problem, we first show how to make inference in a distributed scenario where the underlying data graph is split among different agents. Then, we propose a distributed gradient descent procedure to solve the GCN training problem. The resulting model distributes computation along three lines: during inference, during back-propagation, and during optimization. Convergence to stationary solutions of the GCN training problem is also established under mild conditions. Finally, we propose an optimization criterion to design the communication topology between agents in order to match with the graph describing data relationships. A wide set of numerical results validate our proposal. To the best of our knowledge, this is the first work combining graph convolutional neural networks with distributed optimization.
Since deep neural networks were developed, they have made huge contributions to everyday lives. Machine learning provides more rational advice than humans are capable of in almost every aspect of daily life. However, despite this achievement, the design and training of neural networks are still challenging and unpredictable procedures. To lower the technical thresholds for common users, automated hyper-parameter optimization (HPO) has become a popular topic in both academic and industrial areas. This paper provides a review of the most essential topics on HPO. The first section introduces the key hyper-parameters related to model training and structure, and discusses their importance and methods to define the value range. Then, the research focuses on major optimization algorithms and their applicability, covering their efficiency and accuracy especially for deep learning networks. This study next reviews major services and toolkits for HPO, comparing their support for state-of-the-art searching algorithms, feasibility with major deep learning frameworks, and extensibility for new modules designed by users. The paper concludes with problems that exist when HPO is applied to deep learning, a comparison between optimization algorithms, and prominent approaches for model evaluation with limited computational resources.
When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and normalization methods. Second, we review generic optimization methods used in training neural networks, such as SGD, adaptive gradient methods and distributed methods, and theoretical results for these algorithms. Third, we review existing research on the global issues of neural network training, including results on bad local minima, mode connectivity, lottery ticket hypothesis and infinite-width analysis.