We consider, in discrete time, a general class of sequential stochastic dynamic games with asymmetric information that have the following features. The underlying system has Markovian dynamics controlled by the agents' joint actions. Each agent's instantaneous utility depends on the current system state and the agents' joint actions. At each time instant, each agent makes a private noisy observation of the current system state and of the agents' actions at the previous time instant. In addition, at each time instant all agents have a common noisy observation of the current system state and of their actions at the previous time instant. Each agent's actions are part of his private information. The objective is to determine Bayesian Nash Equilibrium (BNE) strategy profiles that are based on a compressed version of the agents' information and can be sequentially computed; such BNE strategy profiles may not always exist. We present an approach/methodology that achieves this objective, along with an instance of a game in which BNE strategy profiles with these characteristics exist. We show that the methodology also works in the case where the agents have no common observations.
Information design in an incomplete-information game involves a designer whose goal is to influence players' actions through signals generated from a designed probability distribution so that its objective function is optimized. If the players have quadratic payoffs that depend on the players' actions and an unknown payoff-relevant state, and they receive signals about the state that are Gaussian conditional on the state realization, then the information design problem under quadratic design objectives is a semidefinite program (SDP). We consider a setting in which the designer has only partial knowledge of the agents' utilities. We address the uncertainty about players' preferences by formulating a robust information design problem. Specifically, we consider ellipsoid perturbations of the payoff matrices in linear-quadratic-Gaussian (LQG) games and show that this leads to a tractable robust SDP formulation. Using the robust SDP formulation, we obtain analytical conditions for the optimality of no information and of full information disclosure. The robust convex program is also extended to interval and general convex-cone uncertainty sets on the payoff matrices. Numerical studies are carried out to identify the relation between the perturbation levels and the optimal information structures.
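As a rough numerical illustration of how an information design objective with a positive-semidefinite design variable can be posed and robustified, the sketch below sets up a toy problem in cvxpy. The matrices `Sigma` and `V0`, the ellipsoid radius `rho`, and the scenario-sampling robustification are all hypothetical placeholders; the paper's exact tractable robust SDP reformulation is not reproduced here.

```python
# Hedged sketch: a toy "robust" information-design SDP in cvxpy.
# All matrices are placeholders, and robustness is handled by a crude
# scenario approximation (worst case over sampled perturbations of the
# objective matrix), not the paper's exact ellipsoid reformulation.
import numpy as np
import cvxpy as cp

n = 3                                    # dimension of the payoff-relevant state
rng = np.random.default_rng(0)

Sigma = np.eye(n)                        # prior covariance of the state (placeholder)
V0 = rng.standard_normal((n, n))
V0 = (V0 + V0.T) / 2                     # nominal quadratic design objective (placeholder)

# Sampled symmetric perturbations inside a Frobenius-norm ball of radius rho.
rho, n_scen = 0.2, 50
perturbations = []
for _ in range(n_scen):
    D = rng.standard_normal((n, n))
    D = (D + D.T) / 2
    perturbations.append(rho * D / np.linalg.norm(D, "fro"))

X = cp.Variable((n, n), symmetric=True)  # covariance of the induced Gaussian signal
t = cp.Variable()                        # epigraph variable for the worst-case objective

constraints = [X >> 0, Sigma - X >> 0]   # disclose at most the full state information
constraints += [t <= cp.trace((V0 + D) @ X) for D in perturbations]

prob = cp.Problem(cp.Maximize(t), constraints)
prob.solve(solver=cp.SCS)
print("worst-case (sampled) objective:", t.value)
```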
For general-sum, n-player strategic games with transferable utility, the Harsanyi-Shapley (HS) value provides a computable method both to 1) quantify the strategic value of a player and 2) make cooperation rational through side payments. We give a simple formula for computing the HS value in normal-form games. Next, we provide two methods for generalizing the HS value to stochastic (or Markov) games, and show that one of them can be computed using generalized Q-learning algorithms. Finally, an empirical validation is performed on stochastic grid games with three or more players. Source code is provided to compute HS values in both the normal-form and the stochastic-game setting.
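The paper's HS formula is not reproduced here, but for orientation, the cooperative half of a Harsanyi-Shapley-style computation reduces to a standard Shapley value over some coalition value function. A minimal sketch, with a made-up three-player value function `worth`:

```python
# Hedged sketch: textbook Shapley value for a given coalition value function v.
# This only illustrates the cooperative-game half of an HS-style computation;
# the threat-point / side-payment construction of the paper is not reproduced,
# and the value function below is an invented example.
from itertools import permutations

def shapley_values(players, v):
    """Average marginal contribution of each player over all join orders."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            phi[p] += v(coalition | {p}) - v(coalition)
            coalition = coalition | {p}
    return {p: phi[p] / len(orders) for p in players}

# Toy 3-player coalition values (superadditive, purely illustrative).
worth = {frozenset(): 0, frozenset({1}): 1, frozenset({2}): 1, frozenset({3}): 1,
         frozenset({1, 2}): 3, frozenset({1, 3}): 3, frozenset({2, 3}): 2,
         frozenset({1, 2, 3}): 6}
print(shapley_values([1, 2, 3], lambda S: worth[frozenset(S)]))
```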
This paper considers an infinitely repeated three-player Bayesian game with lack of information on two sides, in which an informed player plays two zero-sum games simultaneously at each stage against two uninformed players. This is a generalization of the Aumann et al. [1] two-player zero-sum one-sided incomplete-information model. Under a correlated prior, the informed player faces the problem of how to optimally disclose information to the two uninformed players in order to maximize his long-term average payoffs. Our objective is to understand the adverse effects of "information spillover" from one game to the other on the equilibrium payoff set of the informed player. We provide conditions under which the informed player can fully overcome such adverse effects, and we characterize the equilibrium payoffs. In a second result, we show that the effects of information spillover on the equilibrium payoff set of the informed player can be severe.
We study a system in which two-state Markov sources send status updates to a common receiver over a slotted ALOHA random access channel. We characterize the performance of the system in terms of state estimation entropy (SEE), which measures the uncertainty at the receiver about the sources' state. Two channel access strategies are considered: a reactive policy that depends on the source behavior, and a random one that is independent of it. We prove that the considered policies can be studied using two different hidden Markov models (HMMs) and show through density evolution (DE) analysis that the reactive strategy outperforms the random one in terms of SEE, while the opposite is true for age of information (AoI). Furthermore, we characterize the probability of error in the state estimation at the receiver, considering a maximum a posteriori (MAP) estimator and a low-complexity (decode-and-hold) estimator. Our study provides useful insights into the design trade-offs that emerge when different performance metrics, such as SEE, AoI, or state estimation error probability, are adopted. Moreover, we show that the source statistics significantly impact the system performance.
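As a toy analogue of the setting, the following sketch tracks a single two-state Markov source through a channel that occasionally delivers the current state, and computes the time-averaged belief entropy (a stand-in for SEE) together with the MAP error rate. The transition matrix, delivery probability, and horizon are hypothetical placeholders, not the paper's slotted-ALOHA or density-evolution model.

```python
# Hedged sketch: belief entropy and MAP error for one two-state Markov source
# observed through a channel that either delivers the current state or nothing.
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.95, 0.05],     # P[i, j] = Pr(next state j | current state i)
              [0.10, 0.90]])
p_obs = 0.3                     # probability a slot delivers the source's state
T = 100_000

def entropy(b):
    b = np.clip(b, 1e-12, 1.0)
    return -(b * np.log2(b)).sum()

state, belief = 0, np.array([0.5, 0.5])
see, errors = 0.0, 0
for _ in range(T):
    state = rng.choice(2, p=P[state])          # source evolves
    belief = belief @ P                        # filter: predict
    if rng.random() < p_obs:                   # filter: update on a delivered slot
        belief = np.eye(2)[state]
    see += entropy(belief)
    errors += int(belief.argmax() != state)    # MAP estimate vs. true state

print("SEE  (bits/slot):", see / T)
print("MAP error prob. :", errors / T)
```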
Tasks in which the set of possible actions depends discontinuously on the state pose a significant challenge for current reinforcement learning algorithms. For example, a locked door must first be unlocked, and then the handle turned, before the door can be opened. The sequential nature of these tasks makes obtaining final rewards difficult, and transferring information between task variants using continuous learned values such as weights, rather than discrete symbols, can be inefficient. Our key insight is that agents that act and think symbolically are often more effective in dealing with these tasks. We propose a memory-based learning approach that leverages the symbolic nature of constraints and the temporal ordering of actions in these tasks to quickly acquire and transfer high-level information. We evaluate the performance of memory-based learning on both real and simulated tasks with approximately discontinuous constraints between states and actions, and show that our method learns to solve these tasks an order of magnitude faster than both model-based and model-free deep reinforcement learning methods.
Causal representation learning is the task of identifying the underlying causal variables and their relations from high-dimensional observations, such as images. Recent work has shown that one can reconstruct the causal variables from temporal sequences of observations under the assumption that there are no instantaneous causal relations between them. In practical applications, however, our measurement or frame rate might be slower than many of the causal effects. This effectively creates "instantaneous" effects and invalidates previous identifiability results. To address this issue, we propose iCITRIS, a causal representation learning method that allows for instantaneous effects in intervened temporal sequences when intervention targets can be observed, e.g., as actions of an agent. iCITRIS identifies the potentially multidimensional causal variables from temporal observations, while simultaneously using a differentiable causal discovery method to learn their causal graph. In experiments on three datasets of interactive systems, iCITRIS accurately identifies the causal variables and their causal graph.
Interacting particle or agent systems that display a rich variety of swarming behaviours are ubiquitous in science and engineering. A fundamental and challenging goal is to understand the link between individual interaction rules and swarming. In this paper, we study the data-driven discovery of a second-order particle swarming model that describes the evolution of $N$ particles in $\mathbb{R}^d$ under radial interactions. We propose a learning approach that models the latent radial interaction function as a Gaussian process, which simultaneously fulfills two inference goals: nonparametric inference of the interaction function with pointwise uncertainty quantification, and inference of the unknown scalar parameters in the non-collective friction forces of the system. We formulate the learning problem as a statistical inverse problem and provide a detailed analysis of recoverability conditions, establishing that a coercivity condition is sufficient for recoverability. Given data collected from $M$ i.i.d. trajectories with independent Gaussian observational noise, we provide a finite-sample analysis, showing that our posterior mean estimator converges in a reproducing kernel Hilbert space norm at a rate in $M$ that matches the optimal rate of classical one-dimensional kernel ridge regression. As a byproduct, we show that we can obtain a parametric learning rate in $M$ for the posterior marginal variance in the $L^{\infty}$ norm, and that the rate can also involve $N$ and $L$ (the number of observation time instances for each trajectory), depending on the condition number of the inverse problem. Numerical results on systems that exhibit different swarming behaviours demonstrate that our approach learns efficiently from scarce, noisy trajectory data.
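As a simplified stand-in for the first inference goal, the sketch below fits a GP to noisy pointwise evaluations of a radial interaction function and returns a posterior mean with pointwise uncertainty. The ground-truth `phi_true`, the sampling of $r$, and the noise level are invented for illustration; the paper's trajectory-based inverse problem and coercivity analysis are not reproduced.

```python
# Hedged sketch: nonparametric GP inference of a radial interaction function
# phi(r) from noisy pointwise data, with posterior uncertainty.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(2)
phi_true = lambda r: np.exp(-r) - 0.5 * np.exp(-2.0 * r)   # toy interaction kernel

r_train = rng.uniform(0.1, 4.0, size=(60, 1))
y_train = phi_true(r_train).ravel() + 0.05 * rng.standard_normal(60)

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.05**2)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(r_train, y_train)

r_test = np.linspace(0.1, 4.0, 200).reshape(-1, 1)
mean, std = gp.predict(r_test, return_std=True)            # pointwise mean and uncertainty
print("max |posterior mean - true phi|:", np.abs(mean - phi_true(r_test).ravel()).max())
print("max pointwise posterior std    :", std.max())
```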
Representation learning lies at the heart of the empirical success of deep learning in dealing with the curse of dimensionality. However, the power of representation learning has not yet been fully exploited in reinforcement learning (RL), due to i) the trade-off between expressiveness and tractability, and ii) the coupling between exploration and representation learning. In this paper, we first show that, under a certain noise assumption in the stochastic control model, we can obtain the linear spectral feature of the corresponding Markov transition operator in closed form, for free. Based on this observation, we propose Spectral Dynamics Embedding (SPEDE), which breaks the trade-off and completes optimistic exploration for representation learning by exploiting the structure of the noise. We provide a rigorous theoretical analysis of SPEDE and demonstrate its superior practical performance over existing state-of-the-art empirical algorithms on several benchmarks.
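One way to see why such a closed-form linear feature can exist (a sketch under a hypothetical isotropic Gaussian-noise assumption, not necessarily SPEDE's exact construction): if the dynamics are
\[
s_{t+1} = f(s_t, a_t) + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2 I),
\]
then the transition density is a Gaussian kernel,
\[
P(s' \mid s, a) = (2\pi\sigma^2)^{-d/2} \exp\!\Big(-\tfrac{\|s' - f(s,a)\|^2}{2\sigma^2}\Big),
\]
and Bochner's theorem (random Fourier features) factorizes the Gaussian kernel as
\[
\exp\!\Big(-\tfrac{\|x - y\|^2}{2\sigma^2}\Big) = \mathbb{E}_{\omega \sim \mathcal{N}(0,\sigma^{-2} I),\, b \sim U[0,2\pi]}\big[\,2\cos(\omega^\top x + b)\cos(\omega^\top y + b)\,\big],
\]
so the transition operator becomes linear in the random features $\cos(\omega^\top f(s,a) + b)$ of the state-action pair.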
Future NASA lander missions to icy moons will require completely automated, accurate, and data-efficient calibration methods for the robot manipulator arms that sample icy terrains in the lander's vicinity. To support this need, this paper presents a Gaussian Process (GP) approach to the classical manipulator kinematic calibration process. Instead of identifying a corrected set of Denavit-Hartenberg kinematic parameters, a set of GPs models the residual kinematic error of the arm over the workspace. More importantly, this modeling framework allows a Gaussian Process Upper Confidence Bound (GP-UCB) algorithm to efficiently and adaptively select the calibration's measurement points so as to minimize the number of experiments, and therefore the time needed for recalibration. The method is demonstrated in simulation on a simple 2-DOF arm, a 6-DOF arm whose geometry is a candidate for a future NASA mission, and a 7-DOF Barrett WAM arm.
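A minimal sketch of GP-UCB point selection in one dimension, as a stand-in for choosing where to measure next over the workspace: the residual-error function, candidate grid, and beta schedule below are all placeholders, not the paper's arm models.

```python
# Hedged sketch: GP-UCB selection of the next calibration measurement point.
# A 1-D toy stand-in for "residual kinematic error over the workspace".
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(3)
error = lambda q: 0.3 * np.sin(3.0 * q) + 0.1 * q          # unknown residual error (toy)

candidates = np.linspace(0.0, 2.0, 200).reshape(-1, 1)     # candidate measurement poses
X = [np.array([0.5])]                                      # one initial measurement
y = [error(0.5) + 0.01 * rng.standard_normal()]

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-4)
for t in range(1, 16):
    gp.fit(np.vstack(X), np.array(y))
    mean, std = gp.predict(candidates, return_std=True)
    beta = 2.0 * np.log(len(candidates) * t ** 2)           # placeholder beta schedule
    x_next = candidates[np.argmax(mean + np.sqrt(beta) * std)]  # UCB acquisition
    X.append(x_next)
    y.append(error(x_next[0]) + 0.01 * rng.standard_normal())

gp.fit(np.vstack(X), np.array(y))
_, std = gp.predict(candidates, return_std=True)
print("max remaining posterior std over candidates:", std.max())
```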
This paper proposes new, end-to-end deep reinforcement learning algorithms for learning two-player zero-sum Markov games. Different from prior efforts on training agents to beat a fixed set of opponents, our objective is to find Nash equilibrium policies that are free from exploitation even by adversarial opponents. We propose (a) the Nash-DQN algorithm, which integrates the deep learning techniques of single-agent DQN into the classic Nash Q-learning algorithm for solving tabular Markov games; and (b) the Nash-DQN-Exploiter algorithm, which additionally adopts an exploiter to guide the exploration of the main agent. We conduct experimental evaluations on tabular examples as well as various two-player Atari games. Our empirical results demonstrate that (i) the policies found by many existing methods, including Neural Fictitious Self-Play and Policy Space Response Oracles, can be prone to exploitation by adversarial opponents; and (ii) the output policies of our algorithms are robust to exploitation and thus outperform existing methods.
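For orientation, the stage-game solve that a Nash-Q / Nash-DQN style update needs at each state reduces, in the two-player zero-sum case, to solving a matrix game, which can be done with a small linear program. A hedged sketch (the Q-matrix is a made-up example and the deep-network components are omitted):

```python
# Hedged sketch: the zero-sum stage-game solve behind a Nash-Q style update,
# done with a linear program over a Q-matrix of stage-game values.
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(Q):
    """Maximin mixed strategy and value for the row (maximizing) player of Q."""
    Q = np.asarray(Q, dtype=float)
    m, n = Q.shape
    # Variables: [p_1, ..., p_m, v]; maximize v  <=>  minimize -v.
    c = np.zeros(m + 1); c[-1] = -1.0
    # For every column j:  v - sum_i p_i Q[i, j] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # probabilities sum to 1
    b_eq = np.ones(1)
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Toy stage-game Q-values (rock-paper-scissors-like); the value should be ~0.
policy, value = solve_matrix_game([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
print("row player's Nash policy:", np.round(policy, 3), " game value:", round(value, 3))
```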