In value-based deep reinforcement learning methods, the approximation of value functions induces an overestimation bias and leads to suboptimal policies. We show that in deep actor-critic methods that aim to overcome the overestimation bias, if the reinforcement signals received by the agent have high variance, a significant underestimation bias arises. To minimize the underestimation, we introduce a novel, parameter-free deep Q-learning variant. Our Q-value update rule combines the ideas behind Clipped Double Q-learning and Maxmin Q-learning by computing the critic objective through a nested combination of maximum and minimum operators that bounds the approximate value estimates. We evaluate our modification on a suite of OpenAI Gym continuous control tasks, improving on the state of the art in every environment tested.
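A minimal PyTorch sketch of how such a nested target could look, assuming an ensemble of critics grouped into pairs; the pairing and the order of the operators below are illustrative assumptions, not necessarily the exact update rule of the paper.

```python
import torch

def nested_minmax_target(critics, reward, not_done, next_state, next_action,
                         discount=0.99):
    """TD target built from a nested combination of min and max operators over an
    ensemble of critics. The pairing of critics and the max-over-pairwise-minima
    ordering are illustrative assumptions, not necessarily the paper's exact rule."""
    # Q-values of (s', a') under every critic: shape (num_critics, batch)
    q_next = torch.stack([q(next_state, next_action).squeeze(-1) for q in critics])

    # Clipped Double Q-learning step: element-wise minimum within each critic pair.
    pair_min = torch.minimum(q_next[0::2], q_next[1::2])    # (num_pairs, batch)

    # Maxmin-style step: maximum across the pairwise minima, intended to
    # counteract the underestimation introduced by the minimum.
    q_bound = pair_min.max(dim=0).values                    # (batch,)

    return reward + not_done * discount * q_bound
```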
We obtain convergence rates for the collocation approximation by deep ReLU neural networks of the solution $u$ to elliptic PDEs with lognormal inputs, parametrized by $\boldsymbol{y}$ from the non-compact set $\mathbb{R}^\infty$. The approximation error is measured in the norm of the Bochner space $L_2(\mathbb{R}^\infty, V, \gamma)$, where $\gamma$ is the infinite tensor product standard Gaussian probability measure on $\mathbb{R}^\infty$ and $V$ is the energy space. Under a certain $\ell_q$-summability assumption on the lognormal inputs ($0<q<2$), we prove that, for any sufficiently small $\delta >0$ and every integer $n > 1$, one can construct a compactly supported deep ReLU neural network $\boldsymbol{\phi}_n:= \big(\phi_j\big)_{j=1}^m$ of size at most $n$ on $\mathbb{R}^m$ with $m =\mathcal{O}(n^{1 - \delta})$, and a sequence of points $\big(\boldsymbol{y}^j\big)_{j=1}^m \subset \mathbb{R}^m$ (independent of $u$), such that the collocation approximation of $u$ by $\Phi_n u:= \sum_{j=1}^m u\big(\boldsymbol{y}^j\big) \Phi_j$, which is based on the $m$ solvers $\big( u\big(\boldsymbol{y}^j\big)\big)_{j=1}^m$ and the deep ReLU network $\boldsymbol{\phi}_n$, satisfies the twofold error bound $\|u- \Phi_n u \|_{L_2(\mathbb{R}^\infty, V, \gamma)} = \mathcal{O}\big(m^{- (1/q - 1/2)}\big) =\mathcal{O}\big(n^{- (1-\delta)(1/q - 1/2)}\big)$, where $\Phi_j$ are the extensions of $\phi_j$ to the whole of $\mathbb{R}^\infty$. We also obtain similar results for the case when the lognormal inputs are parametrized on $\mathbb{R}^M$ with very large dimension $M$ and the approximation error is measured in the $\sqrt{g_M}$-weighted uniform norm of the Bochner space $L_\infty^{\sqrt{g_M}}(\mathbb{R}^M, V)$, where $g_M$ is the density function of the standard Gaussian probability measure on $\mathbb{R}^M$.
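For readability, the collocation operator and the convergence rate stated above can be displayed as
\[
\Phi_n u := \sum_{j=1}^{m} u\big(\boldsymbol{y}^j\big)\,\Phi_j,
\qquad
\|u - \Phi_n u\|_{L_2(\mathbb{R}^\infty, V, \gamma)}
= \mathcal{O}\big(m^{-(1/q - 1/2)}\big)
= \mathcal{O}\big(n^{-(1-\delta)(1/q - 1/2)}\big),
\]
with $m = \mathcal{O}(n^{1-\delta})$, $0 < q < 2$, and $\Phi_j$ the extensions of $\phi_j$ to $\mathbb{R}^\infty$, exactly as in the statement above.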
There are many provably efficient algorithms for episodic reinforcement learning. However, these algorithms are built under the assumption that the sequences of states, actions and rewards associated with each episode arrive immediately, allowing policy updates after every interaction with the environment. This assumption is often unrealistic in practice, particularly in areas such as healthcare and online recommendation. In this paper, we study the impact of delayed feedback on several provably efficient algorithms for regret minimisation in episodic reinforcement learning. Firstly, we consider updating the policy as soon as new feedback becomes available. Using this updating scheme, we show that the regret increases by an additive term involving the number of states, the number of actions, the episode length, and the expected delay. This additive term changes depending on the optimistic algorithm of choice. We also show that updating the policy less frequently can lead to an improved dependence of the regret on the delays.
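A minimal sketch of the first updating scheme described above (update as soon as delayed feedback arrives), written as an abstract episodic loop; `run_episode`, `sample_delay`, and `update_policy` are hypothetical placeholders for the optimistic learner and the delay model, not objects from the paper.

```python
import heapq

def run_with_delayed_feedback(num_episodes, run_episode, sample_delay, update_policy):
    """Episodic loop in which the trajectory of episode k only becomes visible to
    the learner `sample_delay(k)` episodes later; the policy is updated with each
    trajectory as soon as its feedback arrives."""
    pending = []  # min-heap of (arrival_episode, episode_index, trajectory)
    for k in range(num_episodes):
        trajectory = run_episode(k)                    # states, actions, rewards
        heapq.heappush(pending, (k + sample_delay(k), k, trajectory))

        # Apply every trajectory whose delayed feedback has now arrived.
        while pending and pending[0][0] <= k:
            _, _, delayed_trajectory = heapq.heappop(pending)
            update_policy(delayed_trajectory)
    return pending  # feedback that never arrived within the horizon
```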
We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders or focus on settings where both the observation and the state spaces are tabular. As such, these methods suffer from either a large bias in the presence of unmeasured confounders, or a large variance in settings with continuous or large observation/state spaces. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution. In fully observable MDPs, these bridge functions reduce to the familiar value functions and marginal density ratios between the evaluation and the behavior policies. We next propose minimax estimation methods for learning these bridge functions. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. Finally, we construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Their nonasymptotic and asymptotic properties are investigated in detail.
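For context, in the fully observable MDP case mentioned above, where the bridge functions reduce to value functions and marginal density ratios, one standard form of the doubly-robust OPE estimator (written in our notation for the discounted setting, not necessarily matching the paper's exact formulation) is
\[
\hat{v}_{\mathrm{DR}}(\pi)
= \mathbb{E}_n\big[(1-\gamma)\,V^{\pi}(S_0)\big]
+ \mathbb{E}_n\Big[w^{\pi}(S,A)\,\big(R + \gamma\,V^{\pi}(S') - Q^{\pi}(S,A)\big)\Big],
\]
where $w^{\pi}$ is the marginal density ratio between the discounted occupancy of the evaluation policy and the data distribution and $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot\mid s)}\,Q^{\pi}(s,a)$; roughly speaking, the POMDP estimators replace $Q^{\pi}$ and $w^{\pi}$ with the corresponding bridge functions.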
The experience replay mechanism allows agents to reuse their experiences multiple times. In prior works, the sampling probability of a transition was adjusted according to its importance. Since reassigning sampling probabilities to every transition in the replay buffer after each iteration is highly inefficient, experience replay prioritization algorithms recalculate the significance of a transition only when it is sampled, to gain computational efficiency. However, the importance of transitions changes dynamically as the agent's policy and value function are updated. In addition, experience replay stores transitions generated by previous policies of the agent, which may deviate significantly from its most recent policy. A larger deviation from the most recent policy leads to more off-policy updates, which is detrimental to the agent. In this paper, we develop a novel algorithm, Batch Prioritizing Experience Replay via KL Divergence (KLPER), which prioritizes batches of transitions rather than individual transitions. Moreover, to reduce the off-policyness of the updates, our algorithm selects one batch among a certain number of candidate batches and forces the agent to learn from the batch that is most likely to have been generated by its most recent policy. We combine our algorithm with Deep Deterministic Policy Gradient and Twin Delayed Deep Deterministic Policy Gradient and evaluate it on various continuous control tasks. KLPER provides promising improvements for deep deterministic continuous control algorithms in terms of sample efficiency, final performance, and stability of the policy during training.
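A minimal sketch of the batch-selection step, assuming a deterministic policy and a replay buffer with a `sample` method; scoring closeness via a fixed-scale Gaussian likelihood proxy is an illustrative assumption rather than the paper's exact KL computation.

```python
import torch

def select_on_policy_batch(replay_buffer, policy, num_candidates=10,
                           batch_size=256, sigma=0.2):
    """Sample several candidate batches and keep the one whose stored actions are
    closest to the actions the current deterministic policy would take at the
    same states (closeness measured by a Gaussian KL proxy with scale `sigma`)."""
    best_batch, best_score = None, float("inf")
    for _ in range(num_candidates):
        states, actions, *rest = replay_buffer.sample(batch_size)
        with torch.no_grad():
            current_actions = policy(states)
        # KL between two isotropic Gaussians with equal covariance reduces to a
        # scaled squared distance between their means.
        score = ((actions - current_actions) ** 2).sum(dim=-1).mean().item() / (2 * sigma ** 2)
        if score < best_score:
            best_batch, best_score = (states, actions, *rest), score
    return best_batch
```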
We study constrained reinforcement learning (CRL) from a novel perspective by setting constraints directly on state density functions, rather than on the value functions considered by previous works. State density has a clear physical and mathematical interpretation and is able to express a wide variety of constraints, such as resource limits and safety requirements. Density constraints can also avoid the time-consuming process of designing and tuning the cost functions required by value function-based constraints to encode system specifications. We leverage the duality between density functions and Q functions to develop an effective algorithm that solves the density-constrained RL problem optimally while guaranteeing that the constraints are satisfied. We prove that the proposed algorithm converges to a near-optimal solution with a bounded error even when the policy update is imperfect. We use a set of comprehensive experiments to demonstrate the advantages of our approach over state-of-the-art CRL methods, on a wide range of density-constrained tasks as well as standard CRL benchmarks such as Safety Gym.
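Schematically, the duality referred to above is that of the standard linear program over state-action occupancy (density) measures, shown here with an added density constraint (our notation; the paper's algorithm builds on this duality rather than solving the LP directly):
\[
\max_{\rho \ge 0}\; \sum_{s,a} \rho(s,a)\, r(s,a)
\quad \text{s.t.}\quad
\sum_{a} \rho(s,a) = \mu_0(s) + \gamma \sum_{s',a'} P(s \mid s',a')\,\rho(s',a') \;\;\forall s,
\qquad
\sum_{a} \rho(s,a) \le c(s)\;\;\forall s,
\]
where $\mu_0$ is the initial state distribution and $c$ encodes the density constraint; the dual variables of the flow-conservation constraints play the role of value/Q functions.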
The Q-learning algorithm is known to be affected by maximization bias, i.e., the systematic overestimation of action values, an important issue that has recently received renewed attention. Double Q-learning has been proposed as an efficient algorithm to mitigate this bias. However, this comes at the price of an underestimation of action values, in addition to increased memory requirements and slower convergence. In this paper, we introduce a new way to address the maximization bias in the form of a "self-correcting algorithm" for approximating the maximum of an expected value. Our method balances the overestimation of the single estimator used in conventional Q-learning against the underestimation of the double estimator used in Double Q-learning. Applying this strategy to Q-learning results in Self-correcting Q-learning. We show theoretically that this new algorithm enjoys the same convergence guarantees as Q-learning while being more accurate. Empirically, it performs better than Double Q-learning in domains with rewards of high variance, and it even attains faster convergence than Q-learning in domains with rewards of zero or low variance. These advantages transfer to a Deep Q-Network implementation that we call Self-correcting DQN, which outperforms regular DQN and Double DQN on several tasks in the Atari 2600 domain.
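A small numerical illustration of the two biases that the self-correcting estimator balances (the action values, noise scale, and trial count below are arbitrary choices for illustration; the self-correcting rule itself is defined in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = np.zeros(10)
true_values[0] = 0.5            # best action has value 0.5, the rest have value 0
noise, num_trials = 1.0, 100_000

single_bias, double_bias = [], []
for _ in range(num_trials):
    # Two independent noisy estimates of the same action values.
    q_a = true_values + rng.normal(0.0, noise, true_values.size)
    q_b = true_values + rng.normal(0.0, noise, true_values.size)
    single_bias.append(q_a.max() - true_values.max())          # single estimator
    double_bias.append(q_b[q_a.argmax()] - true_values.max())  # double estimator

print(f"single-estimator bias: {np.mean(single_bias):+.3f}")  # positive (overestimation)
print(f"double-estimator bias: {np.mean(double_bias):+.3f}")  # negative (underestimation)
```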
Proximal Policy Optimization (PPO) is a highly popular model-free reinforcement learning (RL) approach. However, with continuous state and action spaces and a Gaussian policy -- common in computer animation and robotics -- PPO is prone to getting stuck in local optima. In this paper, we observe a tendency of PPO to prematurely shrink the exploration variance, which naturally leads to slow progress. Motivated by this, we borrow ideas from CMA-ES, a black-box optimization method designed for intelligent adaptive Gaussian exploration, to derive PPO-CMA, a novel proximal policy optimization approach that can expand the exploration variance on objective function slopes and shrink the variance when close to the optimum. This is implemented by using separate neural networks for the policy mean and variance and training the mean and variance in separate passes. Our experiments demonstrate a clear improvement over vanilla PPO in many difficult OpenAI Gym MuJoCo tasks.
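A minimal PyTorch sketch of the architectural idea above: independent networks for the policy mean and the state-dependent variance, so the two can be trained in separate passes (layer sizes, the log-variance clamp, and the optimizer split are illustrative choices, not the paper's exact setup).

```python
import torch
import torch.nn as nn

class SeparateGaussianPolicy(nn.Module):
    """Gaussian policy whose mean and log-variance come from separate networks,
    allowing the variance to be updated in a pass of its own, as in PPO-CMA."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, act_dim))
        self.logvar_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                        nn.Linear(hidden, act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        std = torch.exp(0.5 * self.logvar_net(obs).clamp(-10.0, 2.0))
        return torch.distributions.Normal(mean, std)

# Separate training passes would use two optimizers, e.g.
#   mean_opt = torch.optim.Adam(policy.mean_net.parameters(), lr=3e-4)
#   var_opt  = torch.optim.Adam(policy.logvar_net.parameters(), lr=3e-4)
# and alternate between updating the variance and the mean on each minibatch.
```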
We present Residual Policy Learning (RPL): a simple method for improving nondifferentiable policies using model-free deep reinforcement learning. RPL thrives in complex robotic manipulation tasks where good but imperfect controllers are available. In these tasks, reinforcement learning from scratch remains data-inefficient or intractable, but learning a residual on top of the initial controller can yield substantial improvement. We study RPL in five challenging MuJoCo tasks involving partial observability, sensor noise, model misspecification, and controller miscalibration. By combining learning with control algorithms, RPL can perform long-horizon, sparse-reward tasks for which reinforcement learning alone fails. Moreover, we find that RPL consistently and substantially improves on the initial controllers. We argue that RPL is a promising approach for combining the complementary strengths of deep reinforcement learning and robotic control, pushing the boundaries of what either can achieve independently.
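The core mechanism is a one-line composition: the executed action is the fixed controller's action plus a learned residual. A minimal sketch as a Gym-style wrapper, assuming the older 4-tuple `step` API; `controller` stands for any hand-designed policy mapping observations to actions.

```python
import gym
import numpy as np

class ResidualWrapper(gym.Wrapper):
    """Residual policy learning: the RL agent outputs a residual action that is
    added to the action of a fixed, possibly imperfect controller."""
    def __init__(self, env, controller):
        super().__init__(env)
        self.controller = controller
        self._last_obs = None

    def reset(self, **kwargs):
        self._last_obs = self.env.reset(**kwargs)
        return self._last_obs

    def step(self, residual):
        base_action = self.controller(self._last_obs)
        action = np.clip(base_action + residual,
                         self.action_space.low, self.action_space.high)
        self._last_obs, reward, done, info = self.env.step(action)
        return self._last_obs, reward, done, info
```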
Eligibility traces are an effective technique to accelerate reinforcement learning by smoothly assigning credit to recently visited states. However, their online implementation is incompatible with modern deep reinforcement learning algorithms, which rely heavily on i.i.d. training data and offline learning. We utilize an efficient, recursive method for computing $\lambda$-returns offline that can provide the benefits of eligibility traces to any value-estimation or actor-critic method. We demonstrate how our method can be combined with DQN, DRQN, and A3C to greatly enhance the learning speed of these algorithms when playing Atari 2600 games, even under partial observability. Our results indicate several-fold improvements to sample efficiency on Seaquest and Q*bert. We expect similar results for other algorithms and domains not considered here, including those with continuous actions.
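For reference, the $\lambda$-returns can be computed offline with a single backward pass over a stored trajectory via the standard recursion $G_t^{\lambda} = r_t + \gamma\big[(1-\lambda)\,V(s_{t+1}) + \lambda\, G_{t+1}^{\lambda}\big]$; a minimal sketch (not necessarily the authors' exact implementation):

```python
import numpy as np

def lambda_returns(rewards, next_values, dones, gamma=0.99, lam=0.95):
    """Offline backward recursion for lambda-returns over one stored trajectory.
    `next_values[t]` estimates V(s_{t+1}); `dones[t]` marks terminal transitions."""
    returns = np.zeros_like(rewards, dtype=np.float64)
    next_return = next_values[-1]      # bootstrap for the step after the last one
    for t in reversed(range(len(rewards))):
        bootstrap = (1.0 - lam) * next_values[t] + lam * next_return
        next_return = rewards[t] + gamma * (1.0 - dones[t]) * bootstrap
        returns[t] = next_return
    return returns
```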
We consider the exploration-exploitation trade-off in reinforcement learning and we show that an agent imbued with a risk-seeking utility function is able to explore efficiently, as measured by regret. The parameter that controls how risk-seeking the agent is can be optimized exactly, or annealed according to a schedule. We call the resulting algorithm K-learning and show that the corresponding K-values are optimistic for the expected Q-values at each state-action pair. The K-values induce a natural Boltzmann exploration policy for which the `temperature' parameter is equal to the risk-seeking parameter. This policy achieves an expected regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed time-steps. This bound is only a factor of $L$ larger than the established lower bound. K-learning can be interpreted as mirror descent in the policy space. It is similar to other well-known methods in the literature, including Q-learning, soft Q-learning, and maximum entropy policy gradient, and it is closely related to optimism- and count-based exploration methods. K-learning is simple to implement, as it only requires adding a bonus to the reward at each state-action pair and then solving a Bellman equation. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
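Schematically, the construction described above amounts to a soft (log-sum-exp) Bellman backup on a bonus-augmented reward, with the resulting Boltzmann policy sharing the same temperature (our notation; see the paper for the exact bonus and the optimization of $\tau$):
\[
K_t(s,a) = r(s,a) + b_t(s,a)
+ \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[\tau \log \sum_{a'} \exp\big(K_{t+1}(s',a')/\tau\big)\Big],
\qquad
\pi_t(a \mid s) \propto \exp\big(K_t(s,a)/\tau\big),
\]
where $\tau$ is the risk-seeking (temperature) parameter and $b_t$ is the added bonus.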