This paper studies a distributed multi-agent convex optimization problem. The system comprises multiple agents, each with a set of local data points and an associated local cost function. The agents are connected to a server, and there is no inter-agent communication. The agents' goal is to learn a parameter vector that optimizes the aggregate of their local costs without revealing their local data points. In principle, the agents can solve this problem by collaborating with the server using the traditional distributed gradient-descent method. However, when the aggregate cost is ill-conditioned, the gradient-descent method (i) requires a large number of iterations to converge, and (ii) is highly unstable against process noise. We propose an iterative pre-conditioning technique to mitigate the deleterious effects of the cost function's conditioning on the convergence rate of distributed gradient descent. Unlike conventional pre-conditioning techniques, the pre-conditioner matrix in our technique is updated iteratively, which facilitates implementation on the distributed network. In the distributed setting, we prove that the proposed algorithm converges linearly, with a better rate of convergence than the traditional and adaptive gradient-descent methods. Additionally, for the special case when the minimizer of the aggregate cost is unique, our algorithm converges superlinearly. We demonstrate our algorithm's superior performance compared to prominent distributed algorithms on real-world logistic regression problems and on emulated neural network training via a noisy quadratic model, indicating the proposed algorithm's efficiency for distributed non-convex optimization. Moreover, we empirically show that the proposed algorithm yields faster training without compromising generalization performance.
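As a rough illustration of the idea, the following minimal single-machine sketch runs iteratively pre-conditioned gradient descent on a quadratic cost; the specific update rule for the pre-conditioner K and the step sizes are assumptions made for exposition, not the paper's exact distributed protocol.

```python
import numpy as np

# Sketch: iteratively pre-conditioned gradient descent on a quadratic cost
# f(x) = 0.5 x^T A x - b^T x. K is refined toward A^{-1} one step at a time
# (assumed update rule), so the parameter update becomes increasingly
# Newton-like as iterations proceed.
rng = np.random.default_rng(0)
n = 20
M = rng.standard_normal((n, n))
A = M.T @ M + 1e-3 * np.eye(n)       # ill-conditioned Hessian of the aggregate cost
b = rng.standard_normal(n)

alpha = 1.0 / np.linalg.norm(A, 2)   # step size for the pre-conditioner update
x = np.zeros(n)
K = alpha * np.eye(n)                # pre-conditioner, refined every iteration

for t in range(1000):
    K = K - alpha * (A @ K - np.eye(n))   # K converges to A^{-1}
    x = x - K @ (A @ x - b)               # pre-conditioned gradient step

print("residual:", np.linalg.norm(A @ x - b))
```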
We develop two distributed downlink resource allocation algorithms for user-centric, cell-free, spatially distributed, multiple-input multiple-output (MIMO) networks. In such networks, each user is served by a subset of nearby transmitters that we call distributed units (DUs). The operation of the DUs in a region is controlled by a central unit (CU). Our first scheme is implemented at the DUs, while the second is implemented at the CUs controlling these DUs. We define a hybrid quality-of-service metric that enables distributed optimization of system resources in a proportional-fair manner. Specifically, each of our algorithms performs user scheduling, beamforming, and power control while accounting for channel estimation errors. Importantly, our algorithms require no information exchange amongst DUs (for the DU-distributed system) or CUs (for the CU-distributed system), while converging smoothly. Our results show that the CU-distributed system provides 1.3- to 1.8-fold higher network throughput than the DU-distributed system, with minor increases in complexity and front-haul load, and substantial gains over benchmark schemes such as local zero-forcing. We also analyze the trade-offs of the CU-distributed system, highlighting the significance of deploying multiple CUs in user-centric cell-free networks.
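To illustrate the proportional-fair ingredient of such a hybrid metric, the toy sketch below schedules, at each slot, the user maximizing instantaneous rate over long-term average throughput. The beamforming, power control, and channel-estimation-error handling of the actual algorithms are not modeled, and the channel statistics are invented for illustration.

```python
import numpy as np

# Sketch: proportional-fair (PF) scheduling at a single DU. Users with
# persistently weak channels still get served because the PF metric divides
# the instantaneous rate by the user's running average throughput.
rng = np.random.default_rng(8)
n_users, slots, ewma = 8, 2000, 0.01
avg = np.full(n_users, 1e-3)             # long-term average throughput per user

for t in range(slots):
    # Illustrative asymmetric channels: user k's mean rate scales with k+1.
    rates = rng.exponential(1.0, n_users) * (1 + np.arange(n_users))
    u = int(np.argmax(rates / avg))       # PF metric: rate / average throughput
    served = np.zeros(n_users)
    served[u] = rates[u]
    avg = (1 - ewma) * avg + ewma * served

print("avg throughput per user:", np.round(avg, 2))
```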
For practical deep neural network design on mobile devices, it is essential to consider the constraints imposed by computational resources and by the inference latency of various applications. Among approaches to deep network acceleration, pruning is a widely adopted practice for balancing computational resource consumption and accuracy: unimportant connections can be removed either channel-wise or randomly, with minimal impact on model accuracy. Channel pruning directly yields a significant latency reduction, while random weight pruning is more flexible for balancing latency and accuracy. In this paper, we present a unified framework with Joint Channel pruning and Weight pruning (JCW), which achieves a better Pareto frontier between latency and accuracy than previous model compression approaches. To fully optimize this trade-off, we develop a tailored multi-objective evolutionary algorithm in the JCW framework, which enables a single search to obtain optimal candidate architectures for various deployment requirements. Extensive experiments demonstrate that JCW achieves a better latency-accuracy trade-off than various state-of-the-art pruning methods on the ImageNet classification dataset. Our code is available at //github.com/jcw-anonymous/JCW.
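The core bookkeeping of any such multi-objective search is maintaining the Pareto frontier over (latency, error) pairs. The hedged sketch below shows only that filter; JCW's actual search space (channel widths plus weight sparsity) and its latency model are not reproduced, and the candidate pool is made up for illustration.

```python
# Sketch: Pareto-frontier filter over (latency, error) pairs, lower is
# better in both objectives. A candidate survives if no other candidate is
# at least as good in both objectives and strictly better in one.
def pareto_front(candidates):
    def dominates(o, c):
        return (o["latency"] <= c["latency"] and o["error"] <= c["error"]
                and (o["latency"] < c["latency"] or o["error"] < c["error"]))
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates)]

# Hypothetical candidate architectures (names and numbers are illustrative).
pool = [
    {"name": "a", "latency": 10.0, "error": 0.30},
    {"name": "b", "latency": 12.0, "error": 0.25},
    {"name": "c", "latency": 15.0, "error": 0.26},  # dominated by "b"
    {"name": "d", "latency": 9.0,  "error": 0.35},
]
print([c["name"] for c in pareto_front(pool)])  # -> ['a', 'b', 'd']
```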
Distributed Constraint Optimization Problems (DCOPs) are a widely used framework in which a set of independent agents choose values from their respective discrete domains to maximize their aggregate utility. Although this formulation is often appropriate, many real-world applications involve continuous-valued decision variables and constraints represented in functional form. To address this, Continuous Distributed Constraint Optimization Problems (C-DCOPs), an extension of the DCOP paradigm, have recently attracted growing interest in the multi-agent systems field. To date, population-based algorithms have proven the most effective approach for solving C-DCOPs. Motivated by this, we propose a new C-DCOP solver inspired by the well-known population-based algorithm Artificial Bee Colony (ABC). Additionally, we provide a new exploration method that further improves the algorithm's solution quality. Finally, we theoretically prove that our approach is an anytime algorithm and empirically show that it produces significantly better results than state-of-the-art C-DCOP algorithms.
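For readers unfamiliar with ABC, the following generic sketch shows its employed/onlooker/scout structure on a toy continuous objective. The C-DCOP-specific machinery (distributed agents, the constraint factor graph, anytime bookkeeping) and the paper's new exploration method are omitted.

```python
import numpy as np

# Generic Artificial Bee Colony (ABC) sketch on the sphere function.
rng = np.random.default_rng(1)
dim, n_food, limit, iters = 5, 10, 20, 200
lo, hi = -5.0, 5.0

def cost(x):
    return float(np.sum(x ** 2))

foods = rng.uniform(lo, hi, (n_food, dim))    # candidate solutions ("food sources")
costs = np.array([cost(f) for f in foods])
trials = np.zeros(n_food, dtype=int)          # stagnation counters

def try_neighbor(i):
    """Perturb one coordinate of source i toward a random partner; keep if better."""
    k = int(rng.integers(n_food - 1))
    k = k + (k >= i)                          # partner index distinct from i
    j = rng.integers(dim)
    cand = foods[i].copy()
    cand[j] += rng.uniform(-1, 1) * (foods[i, j] - foods[k, j])
    cand = np.clip(cand, lo, hi)
    c = cost(cand)
    if c < costs[i]:
        foods[i], costs[i], trials[i] = cand, c, 0
    else:
        trials[i] += 1

for _ in range(iters):
    for i in range(n_food):                   # employed bee phase
        try_neighbor(i)
    fitness = 1.0 / (1.0 + costs)             # onlooker selection weights
    probs = fitness / fitness.sum()
    for _ in range(n_food):                   # onlooker bee phase
        try_neighbor(int(rng.choice(n_food, p=probs)))
    for i in range(n_food):                   # scout phase: replace stale sources
        if trials[i] > limit:
            foods[i] = rng.uniform(lo, hi, dim)
            costs[i], trials[i] = cost(foods[i]), 0

print("best cost:", costs.min())
```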
Much recent interest has focused on designing optimization algorithms by discretizing an associated optimization flow, i.e., a system of ordinary differential equations (ODEs) whose trajectories solve an associated optimization problem. Such a design approach poses an important question: how to find a principled methodology for designing and discretizing appropriate ODEs. This paper provides a solution through the use of contraction theory. We first introduce general mathematical results that explain how contraction theory guarantees the stability of the implicit and explicit Euler integration methods. Then, we propose a novel system of ODEs, namely the Accelerated-Contracting-Nesterov flow, and use contraction theory to establish that it is an optimization flow with an exponential convergence rate, from which the linear convergence rate of its associated optimization algorithm follows immediately. Remarkably, a simple explicit Euler discretization of this flow corresponds to the Nesterov acceleration method. Finally, we show how our approach leads to performance guarantees in the design of optimization algorithms for time-varying optimization problems.
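The stability claim can be illustrated on the simplest optimization flow, the gradient flow $\dot{x} = -\nabla f(x)$ of a strongly convex quadratic: explicit Euler is stable for sufficiently small steps (and recovers gradient descent), while implicit Euler is stable for any step size. The sketch below uses this classical flow, not the paper's Accelerated-Contracting-Nesterov flow, whose exact form we do not assume.

```python
import numpy as np

# Sketch: explicit vs. implicit Euler discretization of the gradient flow
# x' = -A x for f(x) = 0.5 x^T A x. Strong convexity makes the flow
# contracting; explicit Euler inherits stability for h < 2/L, while
# implicit (backward) Euler is stable for any h > 0.
rng = np.random.default_rng(2)
n = 10
M = rng.standard_normal((n, n))
A = M.T @ M + 0.1 * np.eye(n)           # mu*I <= A <= L*I
L = np.linalg.norm(A, 2)

x_exp = rng.standard_normal(n)
x_imp = x_exp.copy()
h_exp = 1.0 / L                         # safe explicit step
h_imp = 10.0 / L                        # large step, still stable implicitly

I = np.eye(n)
for k in range(200):
    x_exp = x_exp - h_exp * (A @ x_exp)            # explicit Euler = gradient descent
    x_imp = np.linalg.solve(I + h_imp * A, x_imp)  # implicit Euler step

print("explicit:", np.linalg.norm(x_exp), " implicit:", np.linalg.norm(x_imp))
```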
Federated learning (FL) is an emerging privacy-preserving distributed learning scheme. Due to the large model size and frequent model aggregation, FL suffers from a critical communication bottleneck. Many techniques have been proposed to reduce the communication volume, including model compression and quantization. Existing adaptive quantization schemes use ascending-trend quantization, where the quantization level increases with the training stages. In this paper, we formulate the problem as optimizing the training convergence rate for a given communication volume. The result shows that the optimal quantization level can be expressed in terms of two factors, the training loss and the range of model updates, and that it is preferable to decrease the quantization level over training rather than increase it. We then propose two descending quantization schemes based on the training loss and the model-update range. Experimental results show that the proposed schemes not only reduce the communication volume but also help FL converge faster, compared with existing ascending-trend quantization schemes.
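A hedged sketch of the descending idea: unbiased stochastic uniform quantization of a model update, with a bit-width chosen from the current training loss so that precision is spent early and reduced as the loss falls. The concrete schedule `bits_for` below is an illustrative assumption, not the paper's derived rule.

```python
import numpy as np

rng = np.random.default_rng(3)

def quantize(update, bits):
    """Map each entry to one of 2**bits levels spanning the update's range,
    rounding stochastically so the quantizer is unbiased."""
    lo, hi = update.min(), update.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    t = (update - lo) / scale                         # position in level units
    low = np.floor(t)
    q = low + (rng.random(update.shape) < (t - low))  # stochastic rounding
    return lo + q * scale

def bits_for(loss, b_max=8, b_min=2):
    """Descending schedule (assumed): more bits while the loss is high,
    fewer as training converges."""
    frac = min(loss, 1.0)                             # loss assumed normalized to [0, 1]
    return int(round(b_min + frac * (b_max - b_min)))

update = rng.standard_normal(1000) * 0.01
for loss in (1.0, 0.5, 0.1):
    b = bits_for(loss)
    err = np.abs(quantize(update, b) - update).mean()
    print(f"loss={loss:.1f} -> {b} bits, mean abs error {err:.2e}")
```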
The aim of this work is to develop a fully distributed algorithmic framework for training graph convolutional networks (GCNs). The proposed method exploits the meaningful relational structure of the input data, which are collected by a set of agents communicating over a sparse network topology. After formulating the centralized GCN training problem, we first show how to perform inference in a distributed scenario where the underlying data graph is split among different agents. Then, we propose a distributed gradient descent procedure to solve the GCN training problem. The resulting model distributes computation along three lines: during inference, during back-propagation, and during optimization. Convergence to stationary solutions of the GCN training problem is also established under mild conditions. Finally, we propose an optimization criterion for designing the communication topology between agents so as to match the graph describing the data relationships. A wide set of numerical results validates our proposal. To the best of our knowledge, this is the first work combining graph convolutional neural networks with distributed optimization.
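For reference, the centralized propagation rule being distributed is the standard GCN layer $H' = \mathrm{ReLU}(\hat{A} H W)$, with $\hat{A}$ the symmetrically normalized adjacency matrix. The sketch below computes one such layer centrally; the inter-agent message exchange of the distributed procedure is omitted.

```python
import numpy as np

# Sketch: one standard GCN layer, H' = ReLU(A_hat @ H @ W), where
# A_hat = D^{-1/2} (A + I) D^{-1/2} is the normalized adjacency with
# self-loops. Graph, features, and weights are random placeholders.
rng = np.random.default_rng(7)
n_nodes, in_dim, out_dim = 6, 4, 3

A = (rng.random((n_nodes, n_nodes)) < 0.4).astype(float)
np.fill_diagonal(A, 0.0)
A = np.maximum(A, A.T)                        # undirected graph
A_hat = A + np.eye(n_nodes)                   # add self-loops
deg = A_hat.sum(1)
A_hat = A_hat / np.sqrt(np.outer(deg, deg))   # symmetric normalization

H = rng.standard_normal((n_nodes, in_dim))    # node features
W = rng.standard_normal((in_dim, out_dim))    # layer weights

H_next = np.maximum(A_hat @ H @ W, 0.0)       # one GCN layer with ReLU
print(H_next.shape)
```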
Most Deep Reinforcement Learning (Deep RL) algorithms require a prohibitively large number of training samples for learning complex tasks. Many recent works on speeding up Deep RL have focused on distributed training and simulation. While distributed training is often done on the GPU, simulation is not. In this work, we propose using GPU-accelerated RL simulations as an alternative to CPU-based ones. Using NVIDIA Flex, a GPU-based physics engine, we show promising speed-ups when learning various continuous-control locomotion tasks. With one GPU and one CPU core, we are able to train the Humanoid running task in less than 20 minutes, using 10-1000x fewer CPU cores than previous works. We also demonstrate the scalability of our simulator to multi-GPU settings when training more challenging locomotion tasks.
We propose accelerated randomized coordinate descent algorithms for stochastic optimization and online learning. Our algorithms have significantly lower per-iteration complexity than known accelerated gradient algorithms. The proposed algorithms for online learning achieve better regret than known randomized online coordinate descent algorithms, while the proposed algorithms for stochastic optimization match the convergence rates of the best known randomized coordinate descent algorithms. We also present simulation results demonstrating the performance of the proposed algorithms.
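The per-iteration saving comes from touching a single coordinate at a time. The sketch below shows plain (non-accelerated) randomized coordinate descent on a quadratic, with coordinate-wise Lipschitz step sizes; the paper's accelerated variants add momentum sequences that are not reproduced here.

```python
import numpy as np

# Sketch: randomized coordinate descent on f(x) = 0.5 x^T A x - b^T x.
# Each iteration updates one uniformly sampled coordinate using its
# coordinate-wise Lipschitz constant A[i, i], which amounts to exact
# minimization along that coordinate.
rng = np.random.default_rng(4)
n = 50
M = rng.standard_normal((n, n))
A = M.T @ M + 0.1 * np.eye(n)
b = rng.standard_normal(n)
Li = np.diag(A).copy()                  # coordinate-wise Lipschitz constants

x = np.zeros(n)
for t in range(20000):
    i = rng.integers(n)
    g_i = A[i] @ x - b[i]               # partial derivative along coordinate i
    x[i] -= g_i / Li[i]                 # exact coordinate minimization

print("residual:", np.linalg.norm(A @ x - b))
```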
In this work, we consider the distributed optimization of non-smooth convex functions using a network of computing units. We investigate this problem under two regularity assumptions: (1) the Lipschitz continuity of the global objective function, and (2) the Lipschitz continuity of local individual functions. Under the local regularity assumption, we provide the first optimal first-order decentralized algorithm called multi-step primal-dual (MSPD) and its corresponding optimal convergence rate. A notable aspect of this result is that, for non-smooth functions, while the dominant term of the error is in $O(1/\sqrt{t})$, the structure of the communication network only impacts a second-order term in $O(1/t)$, where $t$ is time. In other words, the error due to limits in communication resources decreases at a fast rate even in the case of non-strongly-convex objective functions. Under the global regularity assumption, we provide a simple yet efficient algorithm called distributed randomized smoothing (DRS) based on a local smoothing of the objective function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the optimal convergence rate, where $d$ is the underlying dimension.
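The smoothing underlying DRS can be illustrated in a centralized toy example: replace the non-smooth $f$ by $f_\gamma(x) = \mathbb{E}[f(x + \gamma Z)]$ with Gaussian $Z$ and descend along a Monte Carlo estimate of its gradient. The estimator and step sizes below are standard choices, not necessarily the paper's, and the network and communication layer of DRS is omitted.

```python
import numpy as np

# Sketch: Gaussian randomized smoothing of a non-smooth convex function,
# here f(x) = ||x||_1 + 0.5 ||x - 1||^2, minimized via a zeroth-order
# estimate of grad f_gamma with O(1/sqrt(t)) step sizes.
rng = np.random.default_rng(5)
d = 20

def f(x):
    return np.abs(x).sum() + 0.5 * np.sum((x - 1.0) ** 2)

def smoothed_grad(x, gamma=0.1, samples=64):
    """Unbiased Monte Carlo estimate of grad f_gamma(x):
    E[(f(x + gamma*Z) - f(x)) * Z] / gamma, with Z ~ N(0, I);
    subtracting f(x) reduces variance without introducing bias."""
    Z = rng.standard_normal((samples, d))
    vals = np.array([f(x + gamma * z) for z in Z])
    return ((vals - f(x))[:, None] * Z).mean(axis=0) / gamma

x = rng.standard_normal(d)
for t in range(1, 501):
    x -= (0.5 / np.sqrt(t)) * smoothed_grad(x)

print("f(x) =", f(x))
```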
Policy gradient methods are widely used in reinforcement learning to search for better policies in a parameterized policy space, but they are known to converge slowly. Nesterov developed an accelerated gradient search algorithm for convex optimization problems, which has recently been extended to non-convex and stochastic optimization. We apply Nesterov's acceleration to the policy gradient search in the well-known actor-critic algorithm and show convergence using the ODE method. We test this algorithm on a scheduling problem in which an incoming job is scheduled into one of four queues based on the queue lengths. Experimental results show that the algorithm using Nesterov's acceleration performs significantly better than the one without acceleration. To the best of our knowledge, this is the first time Nesterov's acceleration has been used with an actor-critic algorithm.
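The acceleration amounts to evaluating the policy gradient at a momentum look-ahead point. The toy sketch below applies a Nesterov-style update to REINFORCE on a 4-armed bandit, with a running-average baseline standing in for the critic; the paper's queue-scheduling setting and its ODE-based analysis are not reproduced.

```python
import numpy as np

# Sketch: Nesterov-accelerated policy gradient on a 4-armed bandit with a
# softmax policy. The gradient of log pi(a) is e_a - probs, scaled by the
# advantage (reward minus baseline); the look-ahead point theta + beta*v is
# the Nesterov ingredient.
rng = np.random.default_rng(6)
true_reward = np.array([0.1, 0.4, 0.2, 0.8])  # illustrative arm means
theta = np.zeros(4)                            # softmax policy parameters
v = np.zeros(4)                                # momentum buffer
lr, beta = 0.1, 0.9
baseline = 0.0

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for t in range(2000):
    look = theta + beta * v                    # Nesterov look-ahead point
    probs = softmax(look)
    a = rng.choice(4, p=probs)
    r = true_reward[a] + 0.1 * rng.standard_normal()
    baseline += 0.05 * (r - baseline)          # running baseline ("critic" stand-in)
    grad = -(r - baseline) * (np.eye(4)[a] - probs)  # descent direction
    v = beta * v - lr * grad
    theta = theta + v

print("learned policy:", np.round(softmax(theta), 3))
```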