In this paper, we focus on solving high-dimensional stochastic Hamiltonian systems with boundary conditions and propose a novel method from the viewpoint of stochastic control. To obtain an approximate solution of the Hamiltonian system, we first introduce a corresponding stochastic optimal control problem whose Hamiltonian system is exactly the one we need to solve, then develop two different algorithms suitable for different cases of the control problem and approximate the stochastic control via deep neural networks. The numerical results show that, compared with the Deep FBSDE method, which was previously developed from the viewpoint of solving FBSDEs, the novel algorithms converge faster, i.e., they require fewer training steps, and exhibit more stable convergence across different Hamiltonian systems.
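To make the control-based viewpoint concrete, the following is a minimal sketch of the generic deep-stochastic-control parametrization it builds on: a feedback control $u_\theta(t, X_t)$ is represented by a small network and trained by minimizing a Monte Carlo estimate of the control cost along simulated trajectories. The quadratic cost, unit diffusion, Euler-Maruyama discretization, and the `ControlNet` name are illustrative assumptions, not the paper's two algorithms.

```python
# Hypothetical sketch: approximate a feedback control u_theta(t, X_t) with a neural
# network and minimize a Monte Carlo estimate of the control cost by SGD.
import torch
import torch.nn as nn

dim, N, dt, batch = 10, 50, 0.02, 256

class ControlNet(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(),
                                 nn.Linear(64, 64), nn.Tanh(),
                                 nn.Linear(64, dim))
    def forward(self, t, x):
        return self.net(torch.cat([t, x], dim=1))

control = ControlNet(dim)
opt = torch.optim.Adam(control.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.zeros(batch, dim)          # initial state X_0
    cost = torch.zeros(batch)
    for n in range(N):
        t = torch.full((batch, 1), n * dt)
        u = control(t, x)
        # running cost plus a controlled Euler-Maruyama step (drift = u, unit diffusion)
        cost = cost + 0.5 * (x.pow(2).sum(1) + u.pow(2).sum(1)) * dt
        x = x + u * dt + (dt ** 0.5) * torch.randn(batch, dim)
    loss = (cost + 0.5 * x.pow(2).sum(1)).mean()   # terminal cost g(X_T) = |X_T|^2 / 2
    opt.zero_grad(); loss.backward(); opt.step()
```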
Control certificates based on barrier functions have been a powerful tool for generating provably safe control policies for dynamical systems. However, existing methods based on barrier certificates are normally designed for white-box systems with differentiable dynamics, which makes them inapplicable to many practical applications where the system is a black box and cannot be accurately modeled. On the other hand, model-free reinforcement learning (RL) methods for black-box systems suffer from a lack of safety guarantees and low sampling efficiency. In this paper, we propose a novel method that can learn safe control policies and barrier certificates for black-box dynamical systems without requiring an accurate system model. Our method re-designs the loss function to back-propagate gradients to the control policy even when the black-box dynamical system is non-differentiable, and we show that the safety certificates hold on the black-box system. Empirical results in simulation show that our method can significantly improve the performance of the learned policies, achieving nearly 100% safety and goal-reaching rates using far fewer training samples than state-of-the-art black-box safe control methods. Our learned agents can also generalize to unseen scenarios while retaining their original performance. The source code can be found at //github.com/Zengyi-Qin/bcbf.
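For intuition, here is a generic neural barrier-certificate training loss of the kind such methods build on, written as a sketch: hinge penalties push the barrier $B$ to be non-positive on initial states, positive on unsafe states, and non-increasing along sampled transitions from the simulator. This is not the paper's redesigned loss; the sign conventions, margins, and decay factor are assumptions.

```python
# Hypothetical sketch of a generic neural barrier-certificate loss. The transitions
# (x, x_next) are assumed to come from rolling out the policy in the black-box simulator.
import torch

def barrier_loss(B, x_init, x_unsafe, x, x_next, eps=0.1, alpha=0.9):
    l_init   = torch.relu(B(x_init)).mean()                 # safe start:  B <= 0
    l_unsafe = torch.relu(eps - B(x_unsafe)).mean()         # unsafe set:  B >= eps
    l_decay  = torch.relu(B(x_next) - alpha * B(x)).mean()  # decrease along transitions
    return l_init + l_unsafe + l_decay
```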
One of the fundamental assumptions in stochastic control of continuous-time processes is that the dynamics of the underlying (diffusion) process are known. This assumption is, however, usually not fulfilled in practice. On the other hand, over the last decades a rich theory for nonparametric estimation of the drift (and volatility) of continuous-time processes has been developed. The aim of this paper is to bring together techniques from stochastic control and methods from statistics for stochastic processes, so as to simultaneously learn the dynamics of the underlying process and control it in a reasonable way. More precisely, we study a long-term average impulse control problem, a stochastic version of the classical Faustmann timber harvesting problem. One of the problems that immediately arises is an exploration-exploitation dilemma, as is well known for problems in machine learning. We propose a way to deal with this issue by combining exploration and exploitation periods in a suitable way. Our main finding is that this construction can be based on the rates of convergence of estimators for the invariant density. Using this, we obtain that the average cumulated regret is of uniform order $O(T^{-1/3})$.
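As a rough heuristic (not the paper's argument) for where a rate of this order can arise: suppose a fraction $\varepsilon$ of the horizon $T$ is spent exploring at a per-unit-time regret of order one, and the estimation error accumulated from $\varepsilon T$ units of exploration data makes the exploitation periods suboptimal by an amount of order $(\varepsilon T)^{-1/2}$ per unit time; both the constants and the exact exponent are assumptions made only for illustration. Then
$$\frac{R_T}{T} \;\lesssim\; \varepsilon + (\varepsilon T)^{-1/2},$$
and balancing the two terms gives $\varepsilon \asymp T^{-1/3}$ and hence an average regret of order $T^{-1/3}$.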
Controlling antenna tilts in cellular networks is imperative to reach an efficient trade-off between network coverage and capacity. In this paper, we devise algorithms that learn optimal tilt control policies from existing data (the so-called passive learning setting) or from data actively generated by the algorithms (the active learning setting). We formalize the design of such algorithms as a Best Policy Identification (BPI) problem in Contextual Linear Multi-Armed Bandits (CL-MAB). An arm represents an antenna tilt update; the context captures current network conditions; the reward corresponds to an improvement in performance, mixing coverage and capacity; and the objective is to identify, with a given level of confidence, an approximately optimal policy (a function mapping the context to an arm with maximal reward). For CL-MAB in both the active and passive learning settings, we derive information-theoretic lower bounds on the number of samples required by any algorithm returning an approximately optimal policy with a given level of certainty, and devise algorithms achieving these fundamental limits. We apply our algorithms to the Remote Electrical Tilt (RET) optimization problem in cellular networks and show that they can produce an optimal tilt update policy using far fewer data samples than naive or existing rule-based learning algorithms.
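As a point of reference, the passive-learning setting with a linear reward model can be sketched as follows: fit the reward parameter by ridge regression on logged (context, arm, reward) data, then return the greedy policy over a finite set of tilt-update arms. The feature map, arm set, and regularization are illustrative assumptions; the paper's sample-efficient BPI algorithms and confidence guarantees are not reproduced here.

```python
# Hypothetical sketch of offline best-policy identification with a linear reward model.
import numpy as np

def fit_theta(Phi, rewards, lam=1.0):
    # Phi: (n, d) features phi(context, arm) from logged data; rewards: (n,)
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ rewards)

def greedy_policy(theta, phi, arms):
    # phi(context, arm) -> feature vector; return the arm with the highest predicted reward
    return lambda context: max(arms, key=lambda a: phi(context, a) @ theta)
```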
We propose a novel numerical method for high-dimensional Hamilton--Jacobi--Bellman (HJB) type elliptic partial differential equations (PDEs). The HJB PDEs, reformulated as optimal control problems, are tackled by an actor-critic framework inspired by reinforcement learning, based on neural network parametrizations of the value and control functions. Within the actor-critic framework, we employ a policy gradient approach to improve the control, while for the value function we derive a variance-reduced least-squares temporal difference method using stochastic calculus. To numerically discretize the stochastic control problem, we employ an adaptive step size scheme to improve the accuracy near the domain boundary. Numerical examples in up to $20$ spatial dimensions, including linear quadratic regulators, stochastic Van der Pol oscillators, diffusive Eikonal equations, and fully nonlinear elliptic PDEs derived from a regulator problem, are presented to validate the effectiveness of our proposed method.
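The basic actor-critic split can be illustrated with a minimal sketch for an exit-time control problem on the unit ball: the critic $V$ is fit by a plain least-squares temporal-difference loss and the actor $u$ lowers the one-step cost-to-go through the simulated transition. The running cost $1 + \tfrac12|u|^2$, zero exit cost, unit-ball domain, fixed step size, and the absence of variance reduction are all illustrative assumptions, not the paper's scheme.

```python
# Hypothetical actor-critic sketch for an exit-time control problem on the unit ball.
import torch
import torch.nn as nn

dim, dt, batch = 5, 0.01, 512
V = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, 1))
u = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))
opt_V = torch.optim.Adam(V.parameters(), lr=1e-3)
opt_u = torch.optim.Adam(u.parameters(), lr=1e-3)

def sample_ball(n, d):
    # draw n points uniformly from the unit ball in R^d
    x = torch.randn(n, d)
    return x / x.norm(dim=1, keepdim=True) * torch.rand(n, 1).pow(1.0 / d)

for it in range(5000):
    x = sample_ball(batch, dim)
    act = u(x)
    x_next = x + act * dt + (2 * dt) ** 0.5 * torch.randn_like(x)
    inside = (x_next.norm(dim=1, keepdim=True) < 1.0).float()   # exit cost g = 0 outside
    stage = (1.0 + 0.5 * act.pow(2).sum(1, keepdim=True)) * dt  # running cost f dt
    v_next = V(x_next)
    # actor step: lower the one-step cost-to-go under the current critic
    opt_u.zero_grad()
    (stage + inside * v_next).mean().backward()
    opt_u.step()
    # critic step: least-squares temporal-difference fit against a detached target
    opt_V.zero_grad()
    target = (stage + inside * v_next).detach()
    ((V(x) - target) ** 2).mean().backward()
    opt_V.step()
```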
The ability to recover from an unexpected external perturbation is a fundamental motor skill in bipedal locomotion. An effective response includes not just the ability to recover balance and maintain stability but also the ability to fall in a safe manner when balance recovery is physically infeasible. For robots associated with bipedal locomotion, such as humanoid robots and assistive robotic devices that aid humans in walking, designing controllers which can provide this stability and safety can prevent damage to the robots and reduce injury-related medical costs. This is a challenging task because it involves generating highly dynamic motion for a high-dimensional, non-linear, and under-actuated system with contacts. Despite prior advances in model-based and optimization methods, challenges such as the need for extensive domain knowledge, relatively long computation times, and limited robustness to changes in dynamics still make this an open problem. In this thesis, to address these issues, we develop learning-based algorithms capable of synthesizing push recovery control policies for two different kinds of robots: humanoid robots and assistive robotic devices for bipedal locomotion. Our work falls into two closely related directions: 1) learning safe falling and fall prevention strategies for humanoid robots, and 2) learning fall prevention strategies for humans using robotic assistive devices. To achieve this, we introduce a set of Deep Reinforcement Learning (DRL) algorithms to learn control policies that improve safety while using these robots.
In this paper we analyze the Schwarz alternating method for unconstrained elliptic optimal control problems. We first discuss the convergence properties of the method in the continuous case and then apply the arguments to the finite difference discretization. In both cases, we prove that the Schwarz alternating method is convergent whenever its counterpart for an elliptic equation is convergent. Moreover, the convergence rate of the method for the elliptic equation in the maximum norm gives a uniform upper bound (with respect to the regularization parameter $\alpha$) on the convergence rate of the method for the optimal control problem, measured in the maximum norm of suitable error merit functions in the continuous case or error vectors in the discrete case. Our numerical results corroborate the theoretical results and show that as $\alpha$ decreases to zero, the method converges faster. We also provide an explanation of this phenomenon.
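For readers unfamiliar with the underlying iteration, the classical Schwarz alternating method can be illustrated on a 1D Poisson problem $-u'' = f$ on $(0,1)$ with two overlapping subdomains; this sketch covers only the elliptic-equation counterpart, not the coupled state-adjoint system of the optimal control problem. The grid, overlap, and right-hand side are illustrative assumptions.

```python
# Hypothetical illustration of the Schwarz alternating method for -u'' = f, u(0)=u(1)=0.
import numpy as np

n = 101                          # grid points on [0, 1]
h = 1.0 / (n - 1)
x = np.linspace(0.0, 1.0, n)
f = np.ones(n)                   # right-hand side
u = np.zeros(n)                  # initial guess satisfying the boundary conditions
i1, i2 = 60, 40                  # subdomain 1 = [0, x[i1]], subdomain 2 = [x[i2], 1]

def solve_dirichlet(f_loc, h, left, right):
    # solve -u'' = f on an interval with Dirichlet data `left`, `right` (2nd-order FD)
    m = len(f_loc)               # number of interior points
    A = (np.diag(2.0 * np.ones(m)) - np.diag(np.ones(m - 1), 1)
         - np.diag(np.ones(m - 1), -1)) / h**2
    rhs = f_loc.copy()
    rhs[0] += left / h**2
    rhs[-1] += right / h**2
    return np.linalg.solve(A, rhs)

for sweep in range(20):
    # subdomain 1: use the current value of u at the interface x[i1] as boundary data
    u[1:i1] = solve_dirichlet(f[1:i1], h, 0.0, u[i1])
    # subdomain 2: use the freshly updated value of u at x[i2] as boundary data
    u[i2 + 1:n - 1] = solve_dirichlet(f[i2 + 1:n - 1], h, u[i2], 0.0)

u_exact = 0.5 * x * (1.0 - x)    # exact solution for f = 1, for a quick check
print("max error after 20 sweeps:", np.max(np.abs(u - u_exact)))
```

With a nonempty overlap the iteration contracts geometrically, which is the behavior the paper's analysis transfers from the elliptic equation to the control problem.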
This paper proposes a method for modeling human driver interactions that relies on multi-output Gaussian processes. The proposed method is developed as a refinement of the game-theoretic hierarchical reasoning approach called "level-k reasoning," which conventionally assigns discrete levels of behavior to agents. Although level-k reasoning has been shown to be an effective modeling tool, it may impose undesired constraints on predicting human decision making due to the limited number (usually 2 or 3) of driver policies it extracts. The proposed approach fills this gap in the literature by introducing a continuous-domain framework that enables an infinite policy space. Using the approach presented in this paper, more accurate driver models can be obtained, which can then be employed to create high-fidelity simulation platforms for validating autonomous vehicle control algorithms. The proposed method is validated on a real traffic dataset and compared with the conventional level-k approach to demonstrate its contributions and implications.
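A minimal sketch of the Gaussian-process ingredient, under assumptions of our own: observed traffic features are mapped to a continuous action vector by GP regression. The feature and action definitions below are made up for illustration, and scikit-learn fits one GP per output with a shared kernel rather than a fully coupled multi-output GP as used in the paper.

```python
# Hypothetical sketch: GP regression from traffic features to a 2D driver action.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 4))       # observed features (relative speed, gap, ...)
Y = np.column_stack([np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200),
                     X[:, 1] * X[:, 2] + 0.1 * rng.standard_normal(200)])  # 2 outputs

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(1e-2),
                              normalize_y=True)
gp.fit(X, Y)
mean = gp.predict(rng.uniform(-1, 1, size=(5, 4)))
print(mean.shape)                            # (5, 2): predicted action for each query
```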
This paper addresses the problem of formally verifying desirable properties of neural networks, i.e., obtaining provable guarantees that neural networks satisfy specifications relating their inputs and outputs (robustness to bounded-norm adversarial perturbations, for example). Most previous work on this topic was limited in its applicability by the size of the network, the network architecture, and the complexity of the properties to be verified. In contrast, our framework applies to a general class of activation functions and specifications on neural network inputs and outputs. We formulate verification as an optimization problem (seeking to find the largest violation of the specification) and solve a Lagrangian relaxation of this optimization problem to obtain an upper bound on the worst-case violation of the specification being verified. Our approach is anytime, i.e., it can be stopped at any time and a valid bound on the maximum violation can be obtained. We develop specialized verification algorithms with provable tightness guarantees under special assumptions and demonstrate the practical significance of our general verification approach on a variety of verification tasks.
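The core idea of a Lagrangian-relaxation bound can be sketched for a one-hidden-layer ReLU network: for any fixed multiplier, decoupling the constraint $z = \mathrm{relu}(W_1 x + b_1)$ and maximizing each part over simple interval bounds yields a valid upper bound on the worst-case value of $c^\top \mathrm{relu}(W_1 x + b_1) + c_0$ over a box of inputs. The single-layer setting and shapes below are illustrative assumptions, not the paper's general framework.

```python
# Hypothetical sketch of a Lagrangian-relaxation upper bound for one ReLU layer.
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def interval_affine(W, b, l, u):
    # elementwise bounds of W x + b for x in the box [l, u]
    center, radius = (l + u) / 2.0, (u - l) / 2.0
    mid = W @ center + b
    rad = np.abs(W) @ radius
    return mid - rad, mid + rad

def dual_bound(W1, b1, c, c0, l, u, lam):
    a_lo, a_hi = interval_affine(W1, b1, l, u)      # pre-activation bounds
    z_lo, z_hi = relu(a_lo), relu(a_hi)             # post-activation bounds
    # term 1: max_z (c - lam)^T z over the box [z_lo, z_hi]
    coef = c - lam
    t1 = np.sum(np.where(coef >= 0, coef * z_hi, coef * z_lo))
    # term 2: sum_i max_{a_i in [a_lo_i, a_hi_i]} lam_i * relu(a_i)
    t2 = np.sum(np.maximum(lam * relu(a_lo), lam * relu(a_hi)))
    return t1 + t2 + c0
```

Minimizing this bound over `lam` (for instance by gradient descent) tightens it, and any intermediate `lam` still produces a valid bound, which is what makes this style of verification anytime.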
Deep reinforcement learning (RL) methods generally engage in exploratory behavior through noise injection in the action space. An alternative is to add noise directly to the agent's parameters, which can lead to more consistent exploration and a richer set of behaviors. Methods such as evolutionary strategies use parameter perturbations, but discard all temporal structure in the process and require significantly more samples. Combining parameter noise with traditional RL methods allows us to combine the best of both worlds. We demonstrate that both off- and on-policy methods benefit from this approach through experimental comparisons of DQN, DDPG, and TRPO on high-dimensional discrete-action environments as well as continuous control tasks. Our results show that RL with parameter noise learns more efficiently than either traditional RL with action-space noise or evolutionary strategies.
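The mechanism itself is simple enough to sketch: perturb a copy of the policy's weights with Gaussian noise, act with the perturbed copy when collecting data, and keep the unperturbed policy for learning. The network shapes and the fixed noise scale are illustrative; the adaptive noise scaling used in the paper is omitted.

```python
# Hypothetical sketch of parameter-space exploration with a fixed noise scale.
import copy
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))

def perturbed_copy(policy, sigma=0.05):
    noisy = copy.deepcopy(policy)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))   # Gaussian noise on every weight
    return noisy

# collect an episode with the perturbed policy, then train the clean one as usual
behavior = perturbed_copy(policy)
obs = torch.randn(1, 8)
action = behavior(obs)
```

Keeping the same perturbation for a whole episode is what yields the temporally consistent exploration that i.i.d. action-space noise lacks.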
We introduce a new neural architecture to learn the conditional probability of an output sequence whose elements are discrete tokens corresponding to positions in an input sequence. Such problems cannot be trivially addressed by existing approaches such as sequence-to-sequence models and Neural Turing Machines, because the number of target classes at each step of the output depends on the length of the input, which is variable. Problems such as sorting variable-sized sequences and various combinatorial optimization problems belong to this class. Our model solves the problem of variable-size output dictionaries using a recently proposed mechanism of neural attention. It differs from previous attention-based approaches in that, instead of using attention to blend the hidden units of an encoder into a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output. We call this architecture a Pointer Net (Ptr-Net). We show that Ptr-Nets can be used to learn approximate solutions to three challenging geometric problems -- finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem -- purely from training examples. Ptr-Nets not only improve over sequence-to-sequence models with input attention, but also allow us to generalize to variable-size output dictionaries. We show that the learned models generalize beyond the maximum lengths they were trained on. We hope our results on these tasks will encourage a broader exploration of neural learning for discrete problems.
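The pointer step can be sketched as follows: at each decoder step, attention scores over the encoder states are normalized directly into a distribution over input positions rather than being used to form a context vector. The additive scoring form and the dimensions below are illustrative assumptions.

```python
# Hypothetical sketch of attention used as a pointer over input positions.
import torch
import torch.nn as nn

class PointerAttention(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.W_enc = nn.Linear(hidden, hidden, bias=False)
        self.W_dec = nn.Linear(hidden, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, seq_len, hidden); dec_state: (batch, hidden)
        scores = self.v(torch.tanh(self.W_enc(enc_states)
                                   + self.W_dec(dec_state).unsqueeze(1))).squeeze(-1)
        return torch.log_softmax(scores, dim=-1)   # log-probabilities over input positions

attn = PointerAttention(hidden=128)
log_p = attn(torch.randn(4, 10, 128), torch.randn(4, 128))   # shape (4, 10)
```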