One of the main challenges in model-based reinforcement learning (RL) is to decide which aspects of the environment should be modeled. The value-equivalence (VE) principle proposes a simple answer to this question: a model should capture the aspects of the environment that are relevant for value-based planning. Technically, VE distinguishes models based on a set of policies and a set of functions: a model is said to be VE to the environment if the Bellman operators it induces for the policies yield the correct result when applied to the functions. As the number of policies and functions increases, the set of VE models shrinks, eventually collapsing to a single point corresponding to a perfect model. A fundamental question underlying the VE principle is thus how to select the smallest sets of policies and functions that are sufficient for planning. In this paper we take an important step towards answering this question. We start by generalizing the concept of VE to order-$k$ counterparts defined with respect to $k$ applications of the Bellman operator. This leads to a family of VE classes that grow in size as $k \rightarrow \infty$. In the limit, all functions become value functions, and we have a special instantiation of VE which we call proper VE or simply PVE. Unlike VE, the PVE class may contain multiple models even in the limit when all value functions are used. Crucially, all these models are sufficient for planning, meaning that they will yield an optimal policy despite the fact that they may ignore many aspects of the environment. We construct a loss function for learning PVE models and argue that popular algorithms such as MuZero can be understood as minimizing an upper bound for this loss. We leverage this connection to propose a modification to MuZero and show that it can lead to improved performance in practice.
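For concreteness, the following minimal tabular sketch checks order-one value equivalence, assuming a finite MDP described by a reward array `r[s, a]` and a transition tensor `P[s, a, s']`; the names, shapes, and the candidate model are illustrative assumptions, not code from the paper. An order-$k$ check would compare $k$-fold compositions of the two Bellman operators instead of a single application.

```python
import numpy as np

def bellman_operator(r, P, pi, v, gamma=0.99):
    """Apply the policy Bellman operator: (T_pi v)(s) = sum_a pi(a|s) [r(s,a) + gamma * E_{s'|s,a} v(s')]."""
    q = r + gamma * P @ v                       # shape (S, A)
    return np.einsum("sa,sa->s", pi, q)         # average over actions under pi

def is_value_equivalent(env, model, policies, functions, gamma=0.99, tol=1e-8):
    """A model (r_m, P_m) is VE to the environment (r, P) w.r.t. the given policies
    and functions if the two induced Bellman operators agree on every (pi, v) pair."""
    (r, P), (r_m, P_m) = env, model
    return all(
        np.allclose(bellman_operator(r, P, pi, v, gamma),
                    bellman_operator(r_m, P_m, pi, v, gamma), atol=tol)
        for pi in policies for v in functions
    )
```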
We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state spaces, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without assuming uniform coverage of the dataset. We also prove an information-theoretic lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
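As a rough illustration of the pessimistic step, the sketch below forms lower and upper confidence Q-estimates for the max and min player at a single state and solves the induced zero-sum matrix games by linear programming. The estimate `Q_hat`, the `bonus`, and the function names are placeholders standing in for quantities the paper derives from the offline dataset and the function class; this is not the full PMVI algorithm.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q):
    """NE strategy and value for the max player of the zero-sum matrix game with payoff Q (A x B)."""
    A, B = Q.shape
    c = np.zeros(A + 1); c[-1] = -1.0                        # maximize v  <=>  minimize -v
    A_ub = np.hstack([-Q.T, np.ones((B, 1))])                # v <= (Q^T x)_b for every column b
    A_eq = np.hstack([np.ones((1, A)), np.zeros((1, 1))])    # x is a probability distribution
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(B), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * A + [(None, None)])
    return res.x[:A], res.x[-1]

def pessimistic_step(Q_hat, bonus):
    """Each player plans against its own pessimistic payoff matrix at this state."""
    Q_lower, Q_upper = Q_hat - bonus, Q_hat + bonus
    pi_max, v_lower = matrix_game_value(Q_lower)             # max player: lower confidence bound
    nu_min, neg_v = matrix_game_value(-Q_upper.T)            # min player: solved as the max player of -Q^T
    return (pi_max, v_lower), (nu_min, -neg_v)
```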
Let $N$ be the number of triangles in an Erd\H{o}s-R\'enyi graph $\mathcal{G}(n,p)$ on $n$ vertices with edge density $p=d/n,$ where $d>0$ is a fixed constant. It is well known that $N$ converges weakly to the Poisson distribution with mean ${d^3}/{6}$ as $n\rightarrow \infty$. We address the upper tail problem for $N$: namely, we investigate how fast $k$ must grow so that the probability of $\{N\ge k\}$ is no longer well approximated by the tail of the corresponding Poisson variable. We prove that the tail exhibits a sharp phase transition: essentially, the upper tail is governed by Poisson behavior only when $k^{1/3} \log k< (\frac{3}{\sqrt{2}})^{2/3} \log n$ (subcritical regime), and we pin down the tail behavior when $k^{1/3} \log k> (\frac{3}{\sqrt{2}})^{2/3} \log n$ (supercritical regime). We further prove a structure theorem, showing that the subcritical upper tail behavior is dictated by the appearance of almost $k$ vertex-disjoint triangles, whereas in the supercritical regime the excess triangles arise from a clique-like structure of size approximately $(6k)^{1/3}$. This settles the long-standing upper-tail problem in this case, answering a question of Aldous and complementing a long sequence of works, spanning multiple decades, that culminated in (Harel, Mousset, Samotij '19), which analyzed the problem only in the regime $p\gg \frac{1}{n}$. The proofs rely on several novel graph-theoretic results which could have other applications.
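For context, the Poisson mean $d^3/6$ comes from the standard first-moment computation (each of the $\binom{n}{3}$ potential triangles is present with probability $p^3$):
\[
\mathbb{E}[N] \;=\; \binom{n}{3}\, p^{3} \;=\; \frac{n(n-1)(n-2)}{6}\cdot\frac{d^{3}}{n^{3}} \;\longrightarrow\; \frac{d^{3}}{6} \qquad (n\to\infty).
\]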
We study the problem of reinforcement learning (RL) with low (policy) switching cost - a problem well motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$. This is an exponential improvement over the best-known switching cost $O(H^2SA\log T)$ among existing methods with $\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$ regret. Here, $S$ and $A$ denote the numbers of states and actions in an $H$-horizon episodic Markov Decision Process model with unknown transitions, and $T$ is the number of steps. We also prove an information-theoretic lower bound showing that a switching cost of $\Omega(HSA)$ is required for any no-regret algorithm. As a byproduct, our new algorithmic techniques allow us to derive a \emph{reward-free} exploration algorithm with an optimal switching cost of $O(HSA)$.
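For intuition only, the sketch below shows a standard doubly-exponential stage schedule of the kind that yields $O(\log\log T)$ policy switches when the policy is updated only at stage boundaries; the function name and the exact endpoints are illustrative assumptions, and the paper's algorithm additionally performs adaptive policy elimination within each stage.

```python
import math

def stage_schedule(T):
    """Stage endpoints t_k ~ T^(1 - 2^-k); M = O(log log T) stages, hence O(log log T) switches."""
    M = max(1, math.ceil(math.log2(max(2.0, math.log2(T)))) + 1)
    ends = [min(T, math.ceil(T ** (1.0 - 2.0 ** (-k)))) for k in range(1, M + 1)]
    ends[-1] = T                     # the final stage always runs to the horizon
    return ends

print(stage_schedule(10**6))         # 6 endpoints for T = 10^6; the policy changes only at these boundaries
```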
Determining whether saddle points exist or are approximable for nonconvex-nonconcave problems is usually intractable. We take a step towards understanding certain nonconvex-nonconcave minimax problems that do remain tractable. Specifically, we study minimax problems cast in geodesic metric spaces, which provide a vast generalization of the usual convex-concave saddle-point problems. The first main result of the paper is a geodesic metric space version of Sion's minimax theorem; we believe our proof is novel and transparent, as it relies only on Helly's theorem. In our second main result, we specialize to geodesically complete Riemannian manifolds: we devise and analyze the complexity of first-order methods for smooth minimax problems.
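For reference, the classical form of Sion's minimax theorem, which the paper generalizes, asserts that
\[
\inf_{x\in X}\,\sup_{y\in Y} f(x,y) \;=\; \sup_{y\in Y}\,\inf_{x\in X} f(x,y)
\]
whenever $X$ and $Y$ are convex, at least one of them is compact, $f(\cdot,y)$ is quasi-convex and lower semicontinuous for every $y$, and $f(x,\cdot)$ is quasi-concave and upper semicontinuous for every $x$; roughly speaking, the geodesic version replaces these convexity notions by their geodesic counterparts.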
Triangular decompositions with various properties have been used for many kinds of problem solving, e.g., geometry theorem proving and real solution isolation of zero-dimensional polynomial systems. In this paper, the concepts of strong chains and square-free strong triangular decomposition (SFSTD) of zero-dimensional polynomial systems are defined. Because of its good properties, SFSTD may be a key tool for many problems related to zero-dimensional polynomial systems, such as real solution isolation and computing radicals of zero-dimensional ideals. Inspired by the work of Wang and of Dong and Mou, we propose an algorithm for computing SFSTD based on Gr\"obner bases computation. The novelty of the algorithm is that we make use of saturated ideals and separants to ensure, respectively, that the zero sets of any two strong chains are disjoint and that every strong chain is square-free. On the one hand, we prove that the arithmetic complexity of the new algorithm can be single exponential in the square of the number of variables, which appears to be among the few complexity results for triangular-decomposition methods. On the other hand, we show experimentally that, on a large number of examples from the literature, the new algorithm is far more efficient than a popular triangular-decomposition method based on pseudo-division. Furthermore, we also show that, on those examples, the methods based on SFSTD for real solution isolation and for computing radicals of zero-dimensional ideals are very efficient.
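To make the basic building block concrete, here is a small SymPy illustration on a classical zero-dimensional system; this is not the SFSTD algorithm itself, only the kind of Gr\"obner-basis and square-freeness computation it builds on, and the example system is an assumption chosen for illustration.

```python
import sympy as sp

x, y, z = sp.symbols("x y z")
# A classical zero-dimensional system with finitely many solutions.
system = [x**2 + y + z - 1, x + y**2 + z - 1, x + y + z**2 - 1]

# A lexicographic Groebner basis; for a zero-dimensional ideal it contains a
# univariate polynomial in the smallest variable (here z).
G = sp.groebner(system, x, y, z, order="lex")
f = next(g for g in G.exprs if g.free_symbols == {z})

# Square-freeness test via gcd(f, f'): a non-constant gcd means f has a repeated
# factor, exactly the kind of degeneracy the square-free requirement in SFSTD removes.
print(sp.gcd(f, sp.diff(f, z)))
```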
We introduce the model of growing graphs, a model of dynamic networks in which nodes can generate new nodes, thus expanding the network. This motivates the algorithmic problem of constructing a target graph $G$, starting from a single node. To model this properly, we assume that every node $u$ can generate at most one node $v$ in every round (or time slot). Every newly generated node $v$ can activate edges with other nodes only at the time of its birth, provided that these nodes are within a small distance $d$ of $v$. We show that the most interesting case is $d=2$. As we prove, in order to construct a target graph $G$ in a small number of time slots, we might need to pay for auxiliary edges (the "excess edges"), which will eventually be removed. This creates a trade-off between the number of time slots and the number of excess edges required to construct a target graph. In this paper, we deal with the following algorithmic question: Given a target graph $G$ on $n$ nodes, can $G$ be constructed in at most $k$ time slots and with at most $\ell$ excess edges? On the positive side, we provide polynomial-time algorithms that efficiently construct fundamental graph families, such as lines, stars, trees, and planar graphs. In particular, we show that trees can be constructed in a poly-logarithmic number of slots with linearly many excess edges, while planar graphs can be constructed in a logarithmic number of slots with $O(n\log n)$ excess edges. We also give a polynomial-time algorithm for deciding whether a graph can be constructed in $\log n$ slots with $\ell = 0$ excess edges. On the negative side, we prove that the problem of determining the minimum number of slots required to construct a graph with zero excess edges (i) is NP-complete and (ii) for any $\varepsilon>0$, cannot be approximated within a factor of $n^{1-\varepsilon}$, unless $\mathrm{P}=\mathrm{NP}$.
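As a small concrete consequence of the model as stated (each node generates at most one new node per round, starting from a single node), the population can at most double per slot, which already gives a $\lceil\log_2 n\rceil$ lower bound on the number of slots for any $n$-node target; a minimal sketch, with function names chosen for illustration:

```python
import math

def min_slots_lower_bound(n: int) -> int:
    """Any n-node target needs at least ceil(log2 n) slots, since the population at most doubles per round."""
    return max(0, math.ceil(math.log2(n)))

def max_population(k: int) -> int:
    """Maximum number of nodes after k rounds, starting from a single node."""
    nodes = 1
    for _ in range(k):
        nodes *= 2                   # every existing node generates one new node
    return nodes                     # = 2**k

assert max_population(min_slots_lower_bound(10)) >= 10
```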
Over the last few years, the Shapley value, a solution concept from cooperative game theory, has found numerous applications in machine learning. In this paper, we first discuss fundamental concepts of cooperative game theory and axiomatic properties of the Shapley value. Then we give an overview of the most important applications of the Shapley value in machine learning: feature selection, explainability, multi-agent reinforcement learning, ensemble pruning, and data valuation. We examine the most crucial limitations of the Shapley value and point out directions for future research.
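As a concrete reference point, the sketch below computes exact Shapley values by enumerating player orderings (exponential in the number of players, so purely illustrative); the characteristic-function interface and the glove-game example are standard textbook material, not code from the survey.

```python
from itertools import permutations

def shapley_values(players, v):
    """phi_i = average over all orderings of player i's marginal contribution v(S u {i}) - v(S)."""
    phi = {i: 0.0 for i in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for i in order:
            phi[i] += v(coalition | {i}) - v(coalition)
            coalition = coalition | {i}
    return {i: phi[i] / len(orderings) for i in players}

# Glove game: player 1 holds a left glove, players 2 and 3 hold right gloves;
# a coalition is worth 1 iff it can form a matching pair.
v = lambda S: 1.0 if 1 in S and (2 in S or 3 in S) else 0.0
print(shapley_values([1, 2, 3], v))   # approximately {1: 0.667, 2: 0.167, 3: 0.167}
```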
Recently, deep multiagent reinforcement learning (MARL) has become a highly active research area, as many real-world problems can be inherently viewed as multiagent systems. A particularly interesting and widely applicable class of problems is the partially observable cooperative multiagent setting, in which a team of agents learns to coordinate its behavior conditioned on private observations and a commonly shared global reward signal. One natural solution is to resort to the centralized training and decentralized execution paradigm. During centralized training, one key challenge is multiagent credit assignment: how to allocate the global reward to individual agent policies for better coordination towards maximizing the system-level benefit. In this paper, we propose a new method called Q-value Path Decomposition (QPD) to decompose the system's global Q-values into individual agents' Q-values. Unlike previous works, which restrict the representational relation between the individual Q-values and the global one, we bring the integrated gradients attribution technique into deep MARL to directly decompose global Q-values along trajectory paths and assign credit to agents. We evaluate QPD on the challenging StarCraft II micromanagement tasks and show that QPD achieves state-of-the-art performance in both homogeneous and heterogeneous multiagent scenarios compared with existing cooperative MARL algorithms.
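The following is a generic integrated-gradients sketch of the kind of attribution QPD relies on, applied to a toy "global Q" over the concatenated features of two agents; the network, feature layout, baseline, and step count are illustrative assumptions rather than the paper's architecture.

```python
import torch

def integrated_gradients(q_fn, x, baseline=None, steps=64):
    """IG_i(x) ~ (x_i - x'_i) * average over the straight path from baseline x' to x of dQ/dx_i."""
    baseline = torch.zeros_like(x) if baseline is None else baseline
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    points = (baseline + alphas * (x - baseline)).requires_grad_(True)   # (steps, *x.shape)
    total = q_fn(points).sum()                    # summing decouples gradients across path points
    (grads,) = torch.autograd.grad(total, points)
    return (x - baseline) * grads.mean(dim=0)     # Riemann approximation of the path integral

# Toy usage: a scalar "global Q" over the concatenated features of two agents (2 features each).
net = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
x = torch.randn(4)
credits = integrated_gradients(lambda inp: net(inp).squeeze(-1), x)
print(credits[:2].sum().item(), credits[2:].sum().item())   # per-agent credit from the attribution
```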
Reinforcement learning (RL) algorithms have been around for decades and have been employed to solve various sequential decision-making problems. These algorithms, however, have faced great challenges when dealing with high-dimensional environments. The recent development of deep learning has enabled RL methods to derive optimal policies for sophisticated and capable agents, which can perform efficiently in these challenging environments. This paper addresses an important aspect of deep RL related to situations that require multiple agents to communicate and cooperate to solve complex tasks. A survey of different approaches to problems related to multi-agent deep RL (MADRL) is presented, including non-stationarity, partial observability, continuous state and action spaces, multi-agent training schemes, and multi-agent transfer learning. The merits and demerits of the reviewed methods are analyzed and discussed, and their corresponding applications are explored. It is envisaged that this review provides insights into various MADRL methods and can guide the future development of more robust and highly useful multi-agent learning methods for solving real-world problems.
We consider the exploration-exploitation trade-off in reinforcement learning, and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space, and it is similar to other well-known methods in the literature, including Q-learning, soft Q-learning, and maximum entropy policy gradient; it is also closely related to optimism and count-based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action pair and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
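A tabular sketch of the recipe described above, with a placeholder exploration bonus and a discounted (rather than finite-horizon) soft Bellman backup for brevity; the paper derives the exact bonus and the schedule for the risk-seeking/temperature parameter `tau`, so the details below are illustrative assumptions.

```python
import numpy as np

def k_learning_backup(r, P, bonus, tau, gamma=0.99, iters=1000):
    """Iterate K(s,a) <- r(s,a) + bonus(s,a) + gamma * E_{s'}[ tau * log sum_a' exp(K(s',a')/tau) ]."""
    K = np.zeros_like(r)                                   # r, bonus: (S, A); P: (S, A, S')
    for _ in range(iters):
        v = tau * np.logaddexp.reduce(K / tau, axis=1)     # soft (risk-seeking) state values
        K = r + bonus + gamma * P @ v
    return K

def boltzmann_policy(K, tau):
    """pi(a|s) proportional to exp(K(s,a)/tau); the temperature equals the risk-seeking parameter."""
    w = np.exp((K - K.max(axis=1, keepdims=True)) / tau)
    return w / w.sum(axis=1, keepdims=True)
```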