Self-stabilization is an excellent approach for adding fault tolerance to a distributed multi-agent system. However, two properties of self-stabilization theory, convergence and closure, may not be satisfied if agents are selfish. To guarantee convergence, we formulate the problem as a stochastic Bayesian game and introduce probabilistic self-stabilization, which adjusts the probabilities of rules with behavior strategies. This aligns the rules with agents' self-interest so that no agent deviates from them. To guarantee closure in the presence of selfish agents, we propose fault-containment as a method to constrain legitimate configurations of the self-stabilizing system to be Nash equilibria. We also allow selfish agents to perform unauthorized actions at any time, which threatens both properties, and present a stepwise solution to handle such actions. As a case study, we consider the problem of distributed clustering and propose five self-stabilizing algorithms for forming clusters. Simulation results show that our algorithms react correctly to rule deviations and outperform comparable schemes in terms of fairness and stabilization time.
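To make the notion of rule probabilities concrete, the following is a minimal sketch of a probabilistic self-stabilizing rule for a toy clustering task on a ring; the guard/move pair, the firing probability p, and the ring topology are hypothetical placeholders, not the five algorithms proposed in the paper.
\begin{verbatim}
import random

def neighbors(i, n):
    return [(i - 1) % n, (i + 1) % n]

def guard(state, i, n):
    # Locally illegitimate: agent i sees no cluster head in its closed neighborhood.
    return not any(state[j] == "head" for j in [i] + neighbors(i, n))

def move(state, i):
    state[i] = "head"  # corrective action: become a cluster head

def stabilize(n=10, p=0.7, max_steps=1000, seed=0):
    random.seed(seed)
    state = ["member"] * n  # arbitrary (possibly faulty) initial configuration
    for _ in range(max_steps):
        enabled = [i for i in range(n) if guard(state, i, n)]
        if not enabled:
            return state    # no rule enabled: a legitimate configuration (closure)
        for i in enabled:
            if random.random() < p:  # behavior-strategy probability of firing
                move(state, i)
    return state

print(stabilize())
\end{verbatim}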
A challenge in reinforcement learning (RL) is minimizing the cost of sampling associated with exploration. Distributed exploration reduces sampling complexity in multi-agent RL (MARL). We investigate the benefits to performance in MARL when exploration is fully decentralized. Specifically, we consider a class of online, episodic, tabular $Q$-learning problems under time-varying reward and transition dynamics, in which agents can communicate in a decentralized manner. We show that group performance, as measured by the bound on regret, can be significantly improved through communication when each agent uses a decentralized message-passing protocol, even when limited to sending information to agents up to $\gamma$ hops away. We prove regret and sample-complexity bounds that depend on the number of agents, the communication network structure, and $\gamma$. We show that incorporating more agents and more information sharing into the group learning scheme speeds up convergence to the optimal policy. Numerical simulations illustrate our results and validate our theoretical claims.
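As a toy illustration of decentralized information sharing (not the protocol or the analysis of the paper; the environment, learning rates, and the averaging rule are made up), the sketch below has agents run independent tabular $Q$-learning and periodically average their tables with all agents within $\gamma$ hops on a fixed communication ring.
\begin{verbatim}
import numpy as np

n_agents, n_states, n_actions, gamma_hops = 4, 3, 2, 1
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a]: next-state distribution
R = rng.random((n_states, n_actions))                             # hypothetical rewards
Q = np.zeros((n_agents, n_states, n_actions))

def hop_neighbors(i):
    # All agents within gamma_hops of agent i on the communication ring (incl. i).
    return {(i + d) % n_agents for d in range(-gamma_hops, gamma_hops + 1)}

for episode in range(200):
    for i in range(n_agents):
        s = rng.integers(n_states)
        for _ in range(10):  # short episode
            a = rng.integers(n_actions) if rng.random() < 0.2 else int(Q[i, s].argmax())
            s2 = rng.choice(n_states, p=P[s, a])
            Q[i, s, a] += 0.1 * (R[s, a] + 0.9 * Q[i, s2].max() - Q[i, s, a])
            s = s2
    # Decentralized message passing: average Q-tables over gamma-hop neighborhoods.
    Q = np.stack([np.mean([Q[j] for j in hop_neighbors(i)], axis=0)
                  for i in range(n_agents)])

print(np.round(Q[0], 2))
\end{verbatim}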
We consider the problem of maximizing the minimum (weighted) value of all components of a vector over a polymatroid. This is a special case of the lexicographically optimal base problem introduced and solved by Fujishige. We give an alternative formulation of the problem as a zero-sum game between a maximizing player whose mixed strategy set is the base of the polymatroid and a minimizing player whose mixed strategy set is a simplex. We show that this game and three variations of it unify several problems in search, sequential testing and queuing. We give a new, short derivation of optimal strategies for both players and an expression for the value of the game. Furthermore, we give a characterization of the set of optimal strategies for the minimizing player and we consider special cases for which optimal strategies can be found particularly easily.
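Schematically (our rendering, with weights $w_i > 0$ on a ground set $E$; the paper's exact normalization may differ), the max-min problem and its zero-sum game reformulation rest on the elementary identity
\[
\max_{x \in B(f)} \ \min_{i \in E} \ \frac{x_i}{w_i}
\;=\;
\max_{x \in B(f)} \ \min_{p \in \Delta(E)} \ \sum_{i \in E} p_i \,\frac{x_i}{w_i},
\]
where $B(f)$ is the base of the polymatroid with rank function $f$ and $\Delta(E)$ is the probability simplex over $E$; the right-hand side is the expected payoff of the game in which the maximizing player picks $x$ and the minimizing player picks $p$.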
Much recent interest has focused on the design of optimization algorithms obtained by discretizing an associated optimization flow, i.e., a system of ordinary differential equations (ODEs) whose trajectories solve an associated optimization problem. Such a design approach poses an important question: how to find a principled methodology for designing and discretizing appropriate ODEs. This paper aims to answer this question through the use of contraction theory. We first present general mathematical results that explain how contraction theory guarantees the stability of the implicit and explicit Euler integration methods. We then propose a novel system of ODEs, namely the Accelerated-Contracting-Nesterov flow, and use contraction theory to establish that it is an optimization flow with an exponential convergence rate, from which the linear convergence rate of the associated optimization algorithm follows immediately. Remarkably, a simple explicit Euler discretization of this flow corresponds to the Nesterov acceleration method. Finally, we show how our approach leads to performance guarantees in the design of optimization algorithms for time-varying optimization problems.
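For reference, the two integration schemes in question are the standard ones (textbook definitions, not anything specific to the Accelerated-Contracting-Nesterov flow): for a flow $\dot{x} = F(x)$ and step size $h > 0$,
\[
x_{k+1} = x_k + h\,F(x_k) \quad \text{(explicit Euler)},
\qquad
x_{k+1} = x_k + h\,F(x_{k+1}) \quad \text{(implicit Euler)}.
\]
Contractivity of the flow is what allows the exponential convergence of the continuous trajectories to be transferred to these discrete iterates.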
In this paper, we consider a mobility system of travelers and providers, and propose a ``mobility game'' to study under what conditions a traveler is matched to a provider. Each traveler seeks to travel using the services of only one provider, who manages one specific mode of transportation (car, bus, train, bike). The services of each provider are capacitated and can serve up to a fixed number of travelers at any instant of time. Thus, our problem falls under the category of many-to-one assignment problems, where the goal is to find conditions that guarantee the stability of assignments. We formulate a linear program that maximizes the social welfare of travelers and providers, show that it is equivalent to the original problem, and relate its solutions to stable assignments. We also investigate our results under informational asymmetry and provide a ``mechanism'' that elicits the private information of travelers and providers. Finally, we illustrate and validate the advantages of our method with a numerical simulation example.
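A toy instance of a welfare-maximizing assignment LP can be written down directly; the utilities, capacities, and scipy-based formulation below are ours, for illustration only, and are not the paper's model.
\begin{verbatim}
import numpy as np
from scipy.optimize import linprog

n_travelers, n_providers = 4, 2
rng = np.random.default_rng(1)
value = rng.random((n_travelers, n_providers))  # joint traveler+provider welfare (made up)
cap = np.array([3, 2])                          # provider capacities (made up)

# Decision variable x[t, p], flattened row-major; maximize total welfare.
c = -value.ravel()

# Each traveler is matched to at most one provider.
A_trav = np.kron(np.eye(n_travelers), np.ones(n_providers))
# Each provider serves at most cap[p] travelers.
A_prov = np.kron(np.ones(n_travelers), np.eye(n_providers))
A_ub = np.vstack([A_trav, A_prov])
b_ub = np.concatenate([np.ones(n_travelers), cap])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1), method="highs")
x = res.x.reshape(n_travelers, n_providers)
print(np.round(x, 2), "welfare =", round(-res.fun, 3))
\end{verbatim}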
Modern reinforcement learning (RL) commonly engages practical problems with large state spaces, where function approximation must be deployed to approximate either the value function or the policy. While recent progress in RL theory addresses a rich set of RL problems with general function approximation, such successes are mostly restricted to the single-agent setting. It remains elusive how to extend these results to multi-agent RL, especially given the new challenges arising from its game-theoretic nature. This paper considers two-player zero-sum Markov Games (MGs). We propose a new algorithm that provably finds the Nash equilibrium policy using a polynomial number of samples, for any MG with low multi-agent Bellman-Eluder dimension -- a new complexity measure adapted from its single-agent version (Jin et al., 2021). A key component of our new algorithm is the exploiter, which facilitates the learning of the main player by deliberately exploiting her weaknesses. Our theoretical framework is generic and applies to a wide range of models, including but not limited to tabular MGs, MGs with linear or kernel function approximation, and MGs with rich observations.
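The exploiter idea can be caricatured in a plain matrix game; the sketch below is ours and ignores the Markov and function-approximation structure entirely. The main player runs a no-regret (multiplicative-weights) update while an exploiter best-responds to her current mixed strategy, and her time-averaged strategy approaches a Nash strategy of the zero-sum game.
\begin{verbatim}
import numpy as np

A = np.array([[ 0., -1.,  1.],   # payoff to the main (row) player:
              [ 1.,  0., -1.],   # rock-paper-scissors as a toy zero-sum game
              [-1.,  1.,  0.]])
eta, T = 0.1, 2000
w = np.ones(3)
avg = np.zeros(3)

for t in range(T):
    pi = w / w.sum()              # main player's mixed strategy
    br = (A.T @ pi).argmin()      # exploiter: best response that minimizes her payoff
    w *= np.exp(eta * A[:, br])   # multiplicative-weights update against the exploiter
    w /= w.max()                  # keep weights bounded
    avg += pi

print("approx. Nash strategy:", np.round(avg / T, 3))
\end{verbatim}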
The graph invariant twin-width was recently introduced by Bonnet, Kim, Thomass\'e, and Watrigant. Problems expressible in first-order logic, which include many prominent NP-hard problems, are tractable on graphs of bounded twin-width if a certificate for the twin-width bound is provided as an input. Computing such a certificate, however, is an intricate problem for which no nontrivial algorithm is known. In this paper, we propose the first practical approach for computing the twin-width of graphs together with the corresponding certificate. We propose efficient SAT encodings that rely on a characterization of twin-width based on elimination sequences. This allows us to determine the twin-width of many famous graphs whose twin-width was previously unknown. We also utilize our encodings to identify the smallest graphs for a given twin-width bound $d \in \{1,\dots,4\}$.
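To make the contraction-sequence view of twin-width concrete, here is a brute-force (exponential-time) computation for tiny graphs; it only illustrates the definition that the SAT encoding captures, and the example graph is ours.
\begin{verbatim}
from itertools import combinations

def contract(black, red, u, v):
    """Merge v into u in the trigraph (black, red); return the new edge sets."""
    verts = {x for e in black | red for x in e} | {u, v}
    nb = {e for e in black if u not in e and v not in e}
    nr = {e for e in red if u not in e and v not in e}
    for x in verts - {u, v}:
        ub, ur = frozenset((u, x)) in black, frozenset((u, x)) in red
        vb, vr = frozenset((v, x)) in black, frozenset((v, x)) in red
        if ub and vb:
            nb.add(frozenset((u, x)))       # both black: stays black
        elif ub or vb or ur or vr:
            nr.add(frozenset((u, x)))       # any disagreement or red: becomes red
    return nb, nr

def red_degree(red, verts):
    return max((sum(1 for e in red if x in e) for x in verts), default=0)

def twin_width(verts, black, red=frozenset()):
    if len(verts) <= 1:
        return 0
    best = float("inf")
    for u, v in combinations(sorted(verts), 2):
        nb, nr = contract(black, red, u, v)
        nverts = verts - {v}
        width = max(red_degree(nr, nverts), twin_width(nverts, nb, nr))
        best = min(best, width)
    return best

# The path on four vertices is not a cograph, so its twin-width is 1.
P4_edges = {frozenset(e) for e in [(1, 2), (2, 3), (3, 4)]}
print(twin_width({1, 2, 3, 4}, P4_edges))   # -> 1
\end{verbatim}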
Many real-world applications of multi-agent reinforcement learning (RL), such as multi-robot navigation and decentralized control of cyber-physical systems, involve the cooperation of agents as a team with aligned objectives. We study multi-agent RL in the most basic cooperative setting -- Markov teams -- a class of Markov games in which the cooperating agents share a common reward. We propose an algorithm in which each agent independently runs stage-based V-learning (a Q-learning-style algorithm) to efficiently explore the unknown environment, while using a stochastic gradient descent (SGD) subroutine for policy updates. We show that the agents can learn an $\epsilon$-approximate Nash equilibrium policy in at most $\widetilde{O}(1/\epsilon^4)$ episodes. Our results advocate the use of a novel \emph{stage-based} V-learning approach to create a stage-wise stationary environment. We also show that, under certain smoothness assumptions on the team, our algorithm can achieve a nearly \emph{team-optimal} Nash equilibrium. Simulation results corroborate our theoretical findings. A key feature of our algorithm is that it is \emph{decentralized}, in the sense that each agent has access only to the state and its local actions, and is even \emph{oblivious} to the presence of the other agents. Neither communication among teammates nor coordination by a central controller is required during learning. Hence, our algorithm readily generalizes to an arbitrary number of agents without suffering from an exponential dependence on the number of agents.
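As a drastically simplified, single-state caricature of decentralized team learning (not the paper's stage-based V-learning; the team game, step sizes, and the Exp3-style update are our choices), each agent below runs an independent bandit algorithm on its own actions, observes only the common team reward, and is oblivious to its teammate.
\begin{verbatim}
import numpy as np

R = np.array([[1.0, 0.0],        # common team reward (hypothetical 2x2 game)
              [0.0, 0.8]])       # two coordination equilibria; (0, 0) is better
rng = np.random.default_rng(0)
gamma_mix, T = 0.1, 5000
w = [np.ones(2), np.ones(2)]     # one weight vector per agent

for t in range(T):
    p = [(1 - gamma_mix) * wi / wi.sum() + gamma_mix / 2 for wi in w]
    a = [int(rng.choice(2, p=pi)) for pi in p]
    r = R[a[0], a[1]]            # only the common reward is observed
    for i in range(2):
        est = np.zeros(2)
        est[a[i]] = r / p[i][a[i]]        # importance-weighted reward estimate
        w[i] = w[i] * np.exp(gamma_mix / 2 * est)
        w[i] /= w[i].max()                # rescale to avoid numerical overflow

print("learned policies:", [np.round(wi / wi.sum(), 3) for wi in w])
\end{verbatim}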
We study constrained reinforcement learning (CRL) from a novel perspective by setting constraints directly on state density functions rather than on the value functions considered by previous works. State density has a clear physical and mathematical interpretation and can express a wide variety of constraints, such as resource limits and safety requirements. Density constraints also avoid the time-consuming process of designing and tuning the cost functions that value-function-based constraints require to encode system specifications. We leverage the duality between density functions and Q functions to develop an effective algorithm that solves the density-constrained RL problem optimally while guaranteeing that the constraints are satisfied. We prove that the proposed algorithm converges to a near-optimal solution with bounded error even when the policy update is imperfect. We use a set of comprehensive experiments to demonstrate the advantages of our approach over state-of-the-art CRL methods, on a wide range of density-constrained tasks as well as standard CRL benchmarks such as Safety-Gym.
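Schematically, in the discounted tabular case (our rendering for illustration; the paper's exact formulation may differ), a density-constrained problem can be written as a linear program over the occupancy measure $\rho$:
\[
\max_{\rho \ge 0} \ \sum_{s,a} \rho(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{a} \rho(s,a) = \mu_0(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, \rho(s',a'),
\qquad
\sum_{a} \rho(s,a) \le c(s) \quad \forall s,
\]
where $\mu_0$ is the initial state distribution and $c$ encodes the density constraint. The Lagrange multipliers of the flow constraints play the role of value/Q functions, which is one way to read the density/Q duality leveraged above.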
Exploration-exploitation is a powerful and practical tool in multi-agent learning (MAL); however, its effects are far from understood. To make progress in this direction, we study a smooth analogue of Q-learning. We start by showing that our learning model has strong theoretical justification as an optimal model for studying exploration-exploitation. Specifically, we prove that smooth Q-learning has bounded regret in arbitrary games for a cost model that explicitly captures the balance between game and exploration costs, and that it always converges to the set of quantal-response equilibria (QRE), the standard solution concept for games under bounded rationality, in weighted potential games with heterogeneous learning agents. We then turn to our main task: measuring the effect of exploration on collective system performance. We characterize the geometry of the QRE surface in low-dimensional MAL systems and link our findings to catastrophe (bifurcation) theory. In particular, as the exploration hyperparameter evolves over time, the system undergoes phase transitions in which the number and stability of equilibria can change radically under an infinitesimal change of the exploration parameter. Building on this, we provide a formal theoretical treatment of how tuning the exploration parameter can provably lead to equilibrium selection, with both positive and negative (and potentially unbounded) effects on system performance.
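As a small illustration of how the rest points of smoothed (softmax) best responses depend on the exploration parameter, the sketch below iterates logit responses to a fixed point in a 2x2 coordination game; the game and the temperatures are made up, and this is not the paper's learning dynamic.
\begin{verbatim}
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 1.0]])        # symmetric coordination game, same payoff to both players

def softmax(u, temp):
    z = np.exp((u - u.max()) / temp)
    return z / z.sum()

def qre(temp, iters=5000):
    x = y = np.array([0.5, 0.5])
    for _ in range(iters):
        x = softmax(A @ y, temp)    # player 1's smoothed best response
        y = softmax(A.T @ x, temp)  # player 2's smoothed best response
    return x, y

for temp in [0.1, 0.5, 2.0]:        # exploration parameter (temperature)
    x, y = qre(temp)
    print(f"temp={temp}: x={np.round(x, 3)}, y={np.round(y, 3)}")
\end{verbatim}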
Many recommendation algorithms rely on user data to generate recommendations. However, these recommendations also affect the data obtained from future users. This work aims to understand the effects of this dynamic interaction. We propose a simple model in which users with heterogeneous preferences arrive over time. Based on this model, we prove that naive estimators, i.e., those that ignore this feedback loop, are not consistent. We show that consistent estimators are efficient in the presence of myopic agents. Our results are validated through extensive simulations.
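A toy simulation (our own construction, not the paper's model) of the inconsistency mechanism: when users only consume and rate the recommended item after a self-selected acceptance decision, the naive per-item average is computed on a selected sample and does not converge to the true mean.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
true_q = np.array([0.2, 0.4])          # hypothetical mean item values
ratings = [[], []]

for t in range(20000):
    # Naive estimates: per-item averages of the ratings collected so far.
    est = np.array([np.mean(r) if r else 0.0 for r in ratings])
    item = rng.integers(2) if t < 100 else est.argmax()   # recommendation rule
    value = true_q[item] + rng.normal()                   # user's realized value
    if value > 0:                                         # user accepts and rates
        ratings[item].append(value)

naive = [round(np.mean(r), 3) if r else None for r in ratings]
print("true means:", true_q.tolist(), " naive estimates:", naive)
\end{verbatim}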