We consider the challenge of policy simplification and verification in the context of policies learned through reinforcement learning (RL) in continuous environments. In well-behaved settings, RL algorithms have convergence guarantees in the limit. While these guarantees are valuable, they are insufficient for safety-critical applications. Furthermore, they are lost when applying advanced techniques such as deep-RL. To recover guarantees when applying advanced RL algorithms to more complex environments with (i) reachability, (ii) safety-constrained reachability, or (iii) discounted-reward objectives, we build upon the DeepMDP framework introduced by Gelada et al. to derive new bisimulation bounds between the unknown environment and a learned discrete latent model of it. Our bisimulation bounds enable the application of formal methods for Markov decision processes. Finally, we show how one can use a policy obtained via state-of-the-art RL to efficiently train a variational autoencoder that yields a discrete latent model with provably approximately correct bisimulation guarantees. Additionally, we obtain a distilled version of the policy for the latent model.
In reinforcement learning, we encode the potential behaviors of an agent interacting with an environment into an infinite set of policies, the policy space, typically represented by a family of parametric functions. Dealing with such a policy space is a hefty challenge, which often causes sample and computation inefficiencies. However, we argue that a limited number of policies are actually relevant when we also account for the structure of the environment and of the policy parameterization, as many of them would induce very similar interactions, i.e., state-action distributions. In this paper, we seek for a reward-free compression of the policy space into a finite set of representative policies, such that, given any policy $\pi$, the minimum R\'enyi divergence between the state-action distributions of the representative policies and the state-action distribution of $\pi$ is bounded. We show that this compression of the policy space can be formulated as a set cover problem, and it is inherently NP-hard. Nonetheless, we propose a game-theoretic reformulation for which a locally optimal solution can be efficiently found by iteratively stretching the compressed space to cover an adversarial policy. Finally, we provide an empirical evaluation to illustrate the compression procedure in simple domains, and its ripple effects in reinforcement learning.
Evolutionary strategies have recently been shown to achieve competing levels of performance for complex optimization problems in reinforcement learning. In such problems, one often needs to optimize an objective function subject to a set of constraints, including for instance constraints on the entropy of a policy or to restrict the possible set of actions or states accessible to an agent. Convergence guarantees for evolutionary strategies to optimize stochastic constrained problems are however lacking in the literature. In this work, we address this problem by designing a novel optimization algorithm with a sufficient decrease mechanism that ensures convergence and that is based only on estimates of the functions. We demonstrate the applicability of this algorithm on two types of experiments: i) a control task for maximizing rewards and ii) maximizing rewards subject to a non-relaxable set of constraints.
We introduce the hemicubic codes, a family of quantum codes obtained by associating qubits with the $p$-faces of the $n$-cube (for $n>p$) and stabilizer constraints with faces of dimension $(p\pm1)$. The quantum code obtained by identifying antipodal faces of the resulting complex encodes one logical qubit into $N = 2^{n-p-1} \tbinom{n}{p}$ physical qubits and displays local testability with a soundness of $\Omega(1/\log(N))$ beating the current state-of-the-art of $1/\log^{2}(N)$ due to Hastings. We exploit this local testability to devise an efficient decoding algorithm that corrects arbitrary errors of size less than the minimum distance, up to polylog factors. We then extend this code family by considering the quotient of the $n$-cube by arbitrary linear classical codes of length $n$. We establish the parameters of these generalized hemicubic codes. Interestingly, if the soundness of the hemicubic code could be shown to be constant, similarly to the ordinary $n$-cube, then the generalized hemicubic codes could yield quantum locally testable codes of length not exceeding an exponential or even polynomial function of the code dimension.
The scope of this paper is the analysis and approximation of an optimal control problem related to the Allen-Cahn equation. A tracking functional is minimized subject to the Allen-Cahn equation using distributed controls that satisfy point-wise control constraints. First and second order necessary and sufficient conditions are proved. The lowest order discontinuous Galerkin - in time - scheme is considered for the approximation of the control to state and adjoint state mappings. Under a suitable restriction on maximum size of the temporal and spatial discretization parameters $k$, $h$ respectively in terms of the parameter $\epsilon$ that describes the thickness of the interface layer, a-priori estimates are proved with constants depending polynomially upon $1/ \epsilon$. Unlike to previous works for the uncontrolled Allen-Cahn problem our approach does not rely on a construction of an approximation of the spectral estimate, and as a consequence our estimates are valid under low regularity assumptions imposed by the optimal control setting. These estimates are also valid in cases where the solution and its discrete approximation do not satisfy uniform space-time bounds independent of $\epsilon$. These estimates and a suitable localization technique, via the second order condition (see \cite{Arada-Casas-Troltzsch_2002,Casas-Mateos-Troltzsch_2005,Casas-Raymond_2006,Casas-Mateos-Raymond_2007}), allows to prove error estimates for the difference between local optimal controls and their discrete approximation as well as between the associated state and adjoint state variables and their discrete approximations
We propose a novel approach to disentangle the generative factors of variation underlying a given set of observations. Our method builds upon the idea that the (unknown) low-dimensional manifold underlying the data space can be explicitly modeled as a product of submanifolds. This gives rise to a new definition of disentanglement, and to a novel weakly-supervised algorithm for recovering the unknown explanatory factors behind the data. At training time, our algorithm only requires pairs of non i.i.d. data samples whose elements share at least one, possibly multidimensional, generative factor of variation. We require no knowledge on the nature of these transformations, and do not make any limiting assumption on the properties of each subspace. Our approach is easy to implement, and can be successfully applied to different kinds of data (from images to 3D surfaces) undergoing arbitrary transformations. In addition to standard synthetic benchmarks, we showcase our method in challenging real-world applications, where we compare favorably with the state of the art.
Reinforcement learning (RL) algorithms have been around for decades and been employed to solve various sequential decision-making problems. These algorithms however have faced great challenges when dealing with high-dimensional environments. The recent development of deep learning has enabled RL methods to drive optimal policies for sophisticated and capable agents, which can perform efficiently in these challenging environments. This paper addresses an important aspect of deep RL related to situations that demand multiple agents to communicate and cooperate to solve complex tasks. A survey of different approaches to problems related to multi-agent deep RL (MADRL) is presented, including non-stationarity, partial observability, continuous state and action spaces, multi-agent training schemes, multi-agent transfer learning. The merits and demerits of the reviewed methods will be analyzed and discussed, with their corresponding applications explored. It is envisaged that this review provides insights about various MADRL methods and can lead to future development of more robust and highly useful multi-agent learning methods for solving real-world problems.
Learning robot objective functions from human input has become increasingly important, but state-of-the-art techniques assume that the human's desired objective lies within the robot's hypothesis space. When this is not true, even methods that keep track of uncertainty over the objective fail because they reason about which hypothesis might be correct, and not whether any of the hypotheses are correct. We focus specifically on learning from physical human corrections during the robot's task execution, where not having a rich enough hypothesis space leads to the robot updating its objective in ways that the person did not actually intend. We observe that such corrections appear irrelevant to the robot, because they are not the best way of achieving any of the candidate objectives. Instead of naively trusting and learning from every human interaction, we propose robots learn conservatively by reasoning in real time about how relevant the human's correction is for the robot's hypothesis space. We test our inference method in an experiment with human interaction data, and demonstrate that this alleviates unintended learning in an in-person user study with a 7DoF robot manipulator.
We propose a new method of estimation in topic models, that is not a variation on the existing simplex finding algorithms, and that estimates the number of topics K from the observed data. We derive new finite sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any number of documents (n), individual document length (N_i), dictionary size (p) and number of topics (K), and both p and K are allowed to increase with n, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the current ones, although we start out with a computational and theoretical disadvantage of not knowing the correct number of topics K, while we provide the competing methods with the correct value in our simulations.
We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.
In this paper, we develop the continuous time dynamic topic model (cDTM). The cDTM is a dynamic topic model that uses Brownian motion to model the latent topics through a sequential collection of documents, where a "topic" is a pattern of word use that we expect to evolve over the course of the collection. We derive an efficient variational approximate inference algorithm that takes advantage of the sparsity of observations in text, a property that lets us easily handle many time points. In contrast to the cDTM, the original discrete-time dynamic topic model (dDTM) requires that time be discretized. Moreover, the complexity of variational inference for the dDTM grows quickly as time granularity increases, a drawback which limits fine-grained discretization. We demonstrate the cDTM on two news corpora, reporting both predictive perplexity and the novel task of time stamp prediction.