We consider a multi-agent reinforcement learning problem where each agent seeks to maximize a shared reward while interacting with other agents, and the agents may or may not be able to communicate. Typically the agents do not have access to other agents' policies, so each agent is situated in a non-stationary and partially observable environment. To obtain agents that act in a decentralized manner, we introduce a novel algorithm under the popular framework of centralized training with decentralized execution. This training framework first obtains solutions to a multi-agent problem with a single centralized joint-space learner, which is then used to guide imitation learning for independent decentralized agents. The framework has the flexibility to use any reinforcement learning algorithm to obtain the expert and any imitation learning algorithm to obtain the decentralized agents, in contrast to other multi-agent learning algorithms that can require more specific structural assumptions. We present theoretical bounds for our method, and we show that one can obtain decentralized solutions to a multi-agent problem through imitation learning.
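To make the two-phase structure concrete, here is a minimal Python skeleton of the framework as described: train a centralized joint-space expert with any RL algorithm, then distill it into independent per-agent policies with any imitation learner. All callables and the gym-style environment interface below are hypothetical placeholders, not the paper's actual implementation.

```python
# Skeleton of centralized training with decentralized execution via imitation.
# `train_joint_rl`, `make_agent_policy`, and the gym-style `env` are hypothetical.
def centralized_training_decentralized_execution(env, n_agents, train_joint_rl,
                                                 make_agent_policy, n_demos=1000):
    expert = train_joint_rl(env)                 # phase 1: joint-space expert
    demos = []
    for _ in range(n_demos):
        obs, done = env.reset(), False
        while not done:
            joint_action = expert.act(obs)       # expert sees the joint observation
            demos.append((obs, joint_action))
            obs, _, done, _ = env.step(joint_action)
    # Phase 2: imitation learning; agent i sees only its local observation
    # obs[i] and imitates its own component of the expert's joint action.
    agents = [make_agent_policy() for _ in range(n_agents)]
    for i, agent in enumerate(agents):
        agent.fit([(obs[i], act[i]) for obs, act in demos])  # e.g. behavioral cloning
    return agents
```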
This paper considers decentralized stochastic optimization over a network of $n$ nodes, where each node possesses a smooth non-convex local cost function and the goal of the networked nodes is to find an $\epsilon$-accurate first-order stationary point of the sum of the local costs. We focus on an online setting, where each node accesses its local cost only by means of a stochastic first-order oracle that returns a noisy version of the exact gradient. In this context, we propose GT-HSGD, a novel single-loop decentralized hybrid variance-reduced stochastic gradient method that outperforms existing approaches in terms of both oracle complexity and practical implementation. The GT-HSGD algorithm implements specialized local hybrid stochastic gradient estimators that are fused over the network to track the global gradient. Remarkably, GT-HSGD achieves a network-topology-independent oracle complexity of $O(n^{-1}\epsilon^{-3})$ when the required error tolerance $\epsilon$ is small enough, leading to a linear speedup over centralized optimal online variance-reduced approaches that operate on a single node. Numerical experiments are provided to illustrate our main technical results.
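As a rough illustration of the two ingredients named above, the following runnable toy combines a local hybrid (STORM-type) variance-reduced estimator with gradient tracking over a ring network; the problem data, step sizes, and topology are illustrative choices, not the paper's exact GT-HSGD specification.

```python
# Toy sketch: hybrid variance-reduced local estimators fused by gradient tracking.
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 8, 5, 2000                     # nodes, dimension, iterations
alpha, beta, sigma = 0.02, 0.1, 0.1      # step size, estimator weight, oracle noise

A = rng.standard_normal((n, d, d)) / np.sqrt(d)   # local f_i(x) = ||A_i x - b_i||^2 / 2
b = rng.standard_normal((n, d))

W = np.zeros((n, n))                     # doubly stochastic ring mixing matrix
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

def grad(i, x):                          # exact local gradient
    return A[i].T @ (A[i] @ x - b[i])

x = np.zeros((n, d))
v = np.array([grad(i, x[i]) + sigma * rng.standard_normal(d) for i in range(n)])
y = v.copy()                             # gradient trackers

for t in range(T):
    x_new = W @ x - alpha * y            # consensus mixing + descent step
    v_new = np.empty_like(v)
    for i in range(n):
        noise = sigma * rng.standard_normal(d)        # one oracle sample xi_t ...
        g_new = grad(i, x_new[i]) + noise             # ... queried at the new point
        g_old = grad(i, x[i]) + noise                 # ... and at the old point
        v_new[i] = g_new + (1 - beta) * (v[i] - g_old)  # hybrid/STORM correction
    y = W @ y + v_new - v                # track the average of local estimators
    x, v = x_new, v_new

xbar = x.mean(axis=0)
print("consensus error:", np.linalg.norm(x - xbar))
print("grad norm at mean:", np.linalg.norm(sum(grad(i, xbar) for i in range(n)) / n))
```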
Personalized Federated Learning (PFL) has recently seen tremendous progress, enabling the design of novel machine learning applications that preserve the privacy of the training data. Existing theoretical results in this field mainly focus on distributed optimization for minimization problems. This paper is the first to study PFL for saddle point problems, which cover a broader class of optimization tasks and are thus of greater relevance for applications than minimization alone. In this work, we consider a recently proposed PFL setting with a mixing objective function, an approach that combines learning a global model with local distributed learners. Unlike most previous papers, which considered only the centralized setting, we work in a more general, decentralized setup. This allows us to design and analyze more practical and federated ways to connect devices to the network. We present two new algorithms for this problem and provide a theoretical analysis for smooth (strongly-)convex-(strongly-)concave saddle point problems. We also demonstrate the effectiveness of our problem formulation and the proposed algorithms in experiments with neural networks with adversarial noise.
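For reference, a standard mixing-type PFL objective for minimization couples each device's local model $x_i$ to the network average, and the saddle point setting studied here can be thought of as an analogue of this kind of formulation (the display below is only the familiar minimization version, not the paper's exact objective):

$$\min_{x_1,\dots,x_n} \; \frac{1}{n}\sum_{i=1}^{n} f_i(x_i) \;+\; \frac{\lambda}{2n}\sum_{i=1}^{n} \big\| x_i - \bar{x} \big\|^2, \qquad \bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i,$$

where $\lambda \ge 0$ controls how strongly local models are pulled toward the shared global model.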
Multiple autonomous agents interact over a random communication network to maximize their individual utility functions, which depend on the actions of other agents. We consider decentralized best-response-with-inertia algorithms in which agents form beliefs about the future actions of other players based on local information and either take an action that maximizes their expected utility computed with respect to these beliefs or continue to take their previous action. We show that these algorithms converge to a Nash equilibrium in weakly acyclic games under the condition that the belief update and information exchange protocols successfully learn the actions of other players with positive probability in finite time in a static environment, i.e., when other agents' actions do not change. We design a decentralized fictitious play algorithm with voluntary and limited communication (DFP-VL) protocols that satisfies this condition. In the voluntary communication protocol, each agent decides whom to exchange information with by assessing the novelty of its information and the potential effect of that information on others' assessments of their utility functions. The limited communication protocol entails agents sending only their most frequent action to the agents they decide to communicate with. Numerical experiments on a target assignment game demonstrate that the voluntary and limited communication protocols can more than halve the number of communication attempts while retaining the same convergence rate as standard decentralized fictitious play (DFP), in which agents constantly attempt to communicate.
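A minimal sketch of the two pieces described above follows: the best-response-with-inertia step and the content of a limited-communication message. The callable `expected_utility` and the default `inertia` value are hypothetical placeholders, not the paper's parameters.

```python
# Sketch of one agent's best-response-with-inertia step and of the limited
# communication message (most frequent past action). Hypothetical interface.
import numpy as np

def best_response_with_inertia(expected_utility, n_actions, beliefs,
                               prev_action, inertia=0.3, rng=None):
    rng = rng or np.random.default_rng()
    if rng.random() < inertia:
        return prev_action                         # inertia: repeat last action
    return max(range(n_actions),                   # otherwise best respond to beliefs
               key=lambda a: expected_utility(a, beliefs))

def most_frequent_action_message(action_counts):
    # Limited communication: transmit only the empirically most frequent action.
    return int(np.argmax(action_counts))
```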
Decentralization is a promising method for scaling up parallel machine learning systems. In this paper, we provide a tight lower bound on the iteration complexity of such methods in a stochastic non-convex setting. Our lower bound reveals a theoretical gap in the known convergence rates of many existing decentralized training algorithms, such as D-PSGD. We prove by construction that this lower bound is tight and achievable. Motivated by these insights, we further propose DeTAG, a practical gossip-style decentralized algorithm that achieves the lower bound up to only a logarithmic gap. Empirically, we compare DeTAG with other decentralized algorithms on image classification tasks and show that DeTAG enjoys faster convergence than the baselines, especially on unshuffled data and in sparse networks.
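For context, the D-PSGD baseline mentioned above admits a very short implementation; the runnable toy below shows its gossip-then-local-SGD update (DeTAG layers additional mechanisms on top of this baseline that are not reproduced here). Data, topology, and step size are illustrative.

```python
# Toy D-PSGD: each node mixes iterates with its ring neighbors, then takes a
# local stochastic gradient step on f_i(x) = ||x - targets[i]||^2 / 2.
import numpy as np

rng = np.random.default_rng(1)
n, d, alpha = 6, 4, 0.05
targets = rng.standard_normal((n, d))

W = np.zeros((n, n))                       # doubly stochastic ring gossip matrix
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

x = np.zeros((n, d))
for t in range(500):
    g = (x - targets) + 0.1 * rng.standard_normal((n, d))  # stochastic gradients
    x = W @ x - alpha * g                  # gossip average, then local SGD step

print("distance to optimum:", np.linalg.norm(x.mean(axis=0) - targets.mean(axis=0)))
```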
In this paper, we introduce a game that allows one to assess the potential loss of efficiency induced by decentralized control or local management of a global epidemic. Each player typically represents a region or country, which is assumed to choose its control action to implement a tradeoff between socioeconomic and health aspects. We conduct a Nash equilibrium analysis of this game. Since the analysis is not trivial in general, sufficient conditions for existence and uniqueness are provided. We then quantify through numerical results the loss induced by decentralization, measured in terms of the price of anarchy (PoA) and the price of connectedness (PoC). These results allow one to clearly identify scenarios where decentralization is or is not acceptable with respect to the retained global efficiency measures.
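For reference, the price of anarchy of a game with social cost $C$ over action profiles $a \in \mathcal{A}$ and set of Nash equilibria $\mathcal{NE}$ is standardly defined as

$$\mathrm{PoA} \;=\; \frac{\max_{a \in \mathcal{NE}} C(a)}{\min_{a \in \mathcal{A}} C(a)} \;\ge\; 1,$$

so $\mathrm{PoA} = 1$ means decentralization incurs no worst-case efficiency loss; the price of connectedness is a related ratio specific to this paper and is not reproduced here.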
It has been a challenge to learn skills for an agent from long-horizon unannotated demonstrations. Existing approaches such as Hierarchical Imitation Learning (HIL) are prone to compounding errors or suboptimal solutions. In this paper, we propose Option-GAIL, a novel method for learning skills over long horizons. The key idea of Option-GAIL is to model the task hierarchy with options and train the policy via generative adversarial optimization. In particular, we propose an Expectation-Maximization (EM)-style algorithm: an E-step that samples the options of the expert conditioned on the currently learned policy, and an M-step that updates the low- and high-level policies of the agent simultaneously to minimize the newly proposed option-occupancy measurement between the expert and the agent. We theoretically prove the convergence of the proposed algorithm. Experiments show that Option-GAIL consistently outperforms its counterparts across a variety of tasks.
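The EM-style loop described above can be summarized in a few lines; the callables below are hypothetical placeholders for the paper's actual components (option inference, discriminator update, and the joint low-/high-level policy update), not its API.

```python
def option_gail_em(expert_trajs, infer_options, update_discriminator,
                   update_low_and_high_policies, n_iters=200):
    """EM-style skeleton of the loop described in the abstract (sketch only)."""
    for _ in range(n_iters):
        # E-step: sample/infer the expert's latent options conditioned on the
        # currently learned low- and high-level policies.
        expert_options = [infer_options(traj) for traj in expert_trajs]
        # M-step: adversarial update pushing the agent's option-occupancy
        # measurement toward the expert's.
        update_discriminator(expert_trajs, expert_options)
        update_low_and_high_policies()
```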
Min-max saddle point games have recently been intensely studied due to their wide range of applications, including the training of Generative Adversarial Networks (GANs). However, most of the recent efforts to solve them are limited to special regimes such as convex-concave games. Further, it is customarily assumed that the underlying optimization problem is solved either by a single machine or by multiple machines connected in a centralized fashion, wherein each one communicates with a central node. The latter approach becomes challenging when the underlying communication network has low bandwidth. In addition, privacy considerations may dictate that certain nodes can communicate only with a subset of other nodes. Hence, it is of interest to develop methods that solve min-max games in a decentralized manner. To that end, we develop a decentralized adaptive momentum (ADAM)-type algorithm for solving min-max optimization problems under the condition that the objective function satisfies a Minty Variational Inequality condition, which generalizes the convex-concave case. The proposed method overcomes shortcomings of recent non-adaptive gradient-based decentralized algorithms for min-max optimization problems, which do not perform well in practice and require careful tuning. We obtain non-asymptotic rates of convergence of the proposed algorithm (coined DADAM$^3$) for finding a (stochastic) first-order Nash equilibrium point and subsequently evaluate its performance on training GANs. An extensive empirical evaluation shows that DADAM$^3$ outperforms recently developed methods, including decentralized optimistic stochastic gradient, for solving such min-max problems.
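To give a flavor of such methods, the runnable toy below performs decentralized Adam-style descent-ascent on a simple strongly-convex-strongly-concave problem: each node keeps adaptive moment estimates for its min and max variables and gossips iterates with its ring neighbors. This illustrates the general pattern only, not the paper's exact DADAM$^3$ update rules or guarantees (bias correction, among other details, is omitted).

```python
# Decentralized Adam-style descent-ascent toy. Node i holds
# f_i(x, y) = a_i x^2 / 2 + c_i x y - a_i y^2 / 2; the sum has its saddle at (0, 0).
import numpy as np

rng = np.random.default_rng(2)
n, T = 6, 3000
lr, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8
a = rng.uniform(0.5, 1.5, n)
c = rng.uniform(0.5, 1.5, n)

W = np.zeros((n, n))                      # ring gossip matrix
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

x, y = np.ones(n), np.ones(n)
mx, vx, my, vy = (np.zeros(n) for _ in range(4))

for t in range(T):
    gx = a * x + c * y                    # local grad in x (descent direction)
    gy = c * x - a * y                    # local grad in y (ascent direction)
    mx, my = b1 * mx + (1 - b1) * gx, b1 * my + (1 - b1) * gy
    vx, vy = b2 * vx + (1 - b2) * gx**2, b2 * vy + (1 - b2) * gy**2
    x = W @ x - lr * mx / (np.sqrt(vx) + eps)   # gossip, then adaptive descent
    y = W @ y + lr * my / (np.sqrt(vy) + eps)   # gossip, then adaptive ascent

print("distance to saddle:", np.linalg.norm(x), np.linalg.norm(y))
```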
This paper considers the decentralized composite optimization problem. We propose a novel decentralized variance-reduced proximal-gradient algorithmic framework, called PMGT-VR, which combines several techniques, including multi-consensus, gradient tracking, and variance reduction. The proposed framework relies on imitating centralized algorithms, and we demonstrate that algorithms under this framework achieve convergence rates similar to those of their centralized counterparts. We also describe and analyze two representative algorithms, PMGT-SAGA and PMGT-LSVRG, and compare them to existing state-of-the-art proximal algorithms. To the best of our knowledge, PMGT-VR is the first linearly convergent decentralized stochastic framework that can solve decentralized composite optimization problems. Numerical experiments are provided to demonstrate the effectiveness of the proposed algorithms.
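The basic step such a framework builds on can be sketched concisely: consensus mixing, a tracked gradient, and a proximal step for the non-smooth part (here $\ell_1$, via soft-thresholding). Variance reduction and multi-consensus rounds are omitted, so the runnable toy below is a plain proximal gradient-tracking scheme rather than PMGT-SAGA or PMGT-LSVRG.

```python
# Toy proximal gradient tracking for
# min (1/n) sum_i ||A_i x - b_i||^2 / 2 + lam * ||x||_1.
import numpy as np

def soft_threshold(z, tau):               # prox of tau * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

rng = np.random.default_rng(3)
n, d, alpha, lam, T = 6, 5, 0.05, 0.05, 1500
A = rng.standard_normal((n, d, d)) / np.sqrt(d)
b = rng.standard_normal((n, d))

W = np.zeros((n, n))                      # ring mixing matrix
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

grad = lambda i, x: A[i].T @ (A[i] @ x - b[i])
x = np.zeros((n, d))
g = np.array([grad(i, x[i]) for i in range(n)])
y = g.copy()                              # gradient trackers

for t in range(T):
    x_new = soft_threshold(W @ x - alpha * y, alpha * lam)   # mix, step, prox
    g_new = np.array([grad(i, x_new[i]) for i in range(n)])
    y = W @ y + g_new - g                 # track the global gradient
    x, g = x_new, g_new

print("consensus error:", np.linalg.norm(x - x.mean(axis=0)))
```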
We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning (RL) setting. We consider the scenario where: (i) we have a dataset collected under a known baseline policy, and (ii) multiple reward signals are received from the environment, inducing as many objectives to optimize. We present an SPI formulation for this RL setting that takes into account the user's preferences for handling the trade-offs between different reward signals, while ensuring that the new policy performs at least as well as the baseline policy along each individual objective. We build on traditional SPI algorithms and propose a novel method based on Safe Policy Improvement with Baseline Bootstrapping (SPIBB; Laroche et al., 2019) that provides high-probability guarantees on the performance of the agent in the true environment. We show the effectiveness of our method on a synthetic grid-world safety task as well as in a real-world critical care context, learning a policy for the administration of IV fluids and vasopressors to treat sepsis.
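For background, the core SPIBB mechanism constrains the improved policy to copy the baseline wherever the dataset provides too little evidence; below is a minimal sketch of this projection step for the basic single-objective case (not the paper's multi-objective extension). The threshold `n_wedge` stands in for SPIBB's minimum-count parameter.

```python
# Sketch of a Pi_b-SPIBB-style greedy improvement step: bootstrap rarely
# observed state-action pairs to the baseline, act greedily elsewhere.
import numpy as np

def spibb_policy_improvement(Q, pi_b, counts, n_wedge):
    n_states, _ = Q.shape
    pi = np.zeros_like(pi_b)
    for s in range(n_states):
        uncertain = counts[s] < n_wedge            # too few samples: keep baseline
        pi[s, uncertain] = pi_b[s, uncertain]
        free_mass = pi_b[s, ~uncertain].sum()      # probability mass we may move
        if free_mass > 0:
            certain = np.flatnonzero(~uncertain)
            best = certain[np.argmax(Q[s, certain])]
            pi[s, best] += free_mass               # greedy on well-estimated actions
    return pi
```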
Recent successes of value-based multi-agent deep reinforcement learning employ optimism in the value function by carefully controlling the learning rate (Omidshafiei et al., 2017) or reducing the update probability (Palmer et al., 2018). We introduce a decentralized quantile estimator, the Responsible Implicit Quantile Network (RIQN), which, while remaining robust to teammate-environment interactions, is able to reduce the amount of imposed optimism. Benchmarking against the related Hysteretic-DQN (HDQN) and Lenient-DQN (LDQN), we find RIQN agents to be more stable, more sample efficient, and more likely to converge to the optimal policy.
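For background, networks in the IQN family are trained with the standard quantile Huber loss, sketched below in NumPy; RIQN's responsibility and decentralization mechanisms are separate additions not shown here.

```python
# Standard quantile Huber loss used by IQN-style quantile estimators.
import numpy as np

def quantile_huber_loss(td_errors, taus, kappa=1.0):
    abs_err = np.abs(td_errors)
    huber = np.where(abs_err <= kappa,                 # smooth near zero ...
                     0.5 * td_errors ** 2,
                     kappa * (abs_err - 0.5 * kappa))  # ... linear in the tails
    # Asymmetric weighting: quantile tau penalizes under- vs over-estimates differently.
    weight = np.abs(taus - (td_errors < 0).astype(float))
    return (weight * huber / kappa).mean()
```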