Multimodal demonstrations provide robots with an abundance of information to make sense of the world. However, such abundance may not always lead to good performance when it comes to learning sensorimotor control policies from human demonstrations. Extraneous data modalities can lead to state over-specification, where the state contains modalities that are not only useless for decision-making but also can change data distribution across environments. State over-specification leads to issues such as the learned policy not generalizing outside of the training data distribution. In this work, we propose Masked Imitation Learning (MIL) to address state over-specification by selectively using informative modalities. Specifically, we design a masked policy network with a binary mask to block certain modalities. We develop a bi-level optimization algorithm that learns this mask to accurately filter over-specified modalities. We demonstrate empirically that MIL outperforms baseline algorithms in simulated domains including MuJoCo and a robot arm environment using the Robomimic dataset, and effectively recovers the environment-invariant modalities on a multimodal dataset collected on a real robot. Our project website presents supplemental details and videos of our results at: //tinyurl.com/masked-il
Demonstrations provide insight into relevant state or action space regions, bearing great potential to boost the efficiency and practicality of reinforcement learning agents. In this work, we propose to leverage demonstration datasets by combining skill learning and sequence modeling. Starting with a learned joint latent space, we separately train a generative model of demonstration sequences and an accompanying low-level policy. The sequence model forms a latent space prior over plausible demonstration behaviors to accelerate learning of high-level policies. We show how to acquire such priors from state-only motion capture demonstrations and explore several methods for integrating them into policy learning on transfer tasks. Our experimental results confirm that latent space priors provide significant gains in learning speed and final performance in a set of challenging sparse-reward environments with a complex, simulated humanoid. Videos, source code and pre-trained models are available at the corresponding project website at //facebookresearch.github.io/latent-space-priors .
Recent work in incremental learning has introduced diverse approaches to tackle catastrophic forgetting from data augmentation to optimized training regimes. However, most of them focus on very few training steps. We propose a method for robust incremental learning over dozens of fine-tuning steps using data from a variety of languages. We show that a combination of data-augmentation and an optimized training regime allows us to continue improving the model even for as many as fifty training steps. Crucially, our augmentation strategy does not require retaining access to previous training data and is suitable in scenarios with privacy constraints.
Causal discovery (CD) from time-varying data is important in neuroscience, medicine, and machine learning. Techniques for CD include randomized experiments which are generally unbiased but expensive. It also includes algorithms like regression, matching, and Granger causality, which are only correct under strong assumptions made by human designers. However, as we found in other areas of machine learning, humans are usually not quite right and human expertise is usually outperformed by data-driven approaches. Here we test if we can improve causal discovery in a data-driven way. We take a perturbable system with a large number of causal components (transistors), the MOS 6502 processor, acquire the causal ground truth, and learn the causal discovery procedure represented as a neural network. We find that this procedure far outperforms human-designed causal discovery procedures, such as Mutual Information, LiNGAM, and Granger Causality both on MOS 6502 processor and the NetSim dataset which simulates functional magnetic resonance imaging (fMRI) results. We argue that the causality field should consider, where possible, a supervised approach, where CD procedures are learned from large datasets with known causal relations instead of being designed by a human specialist. Our findings promise a new approach toward improving CD in neural and medical data and for the broader machine learning community.
Sim-to-real is a mainstream method to cope with the large number of trials needed by typical deep reinforcement learning. However, transferring a policy trained in simulation to actual hardware remains challenging due to the reality gap. In particular, the characteristics of actuators in legged robots have a considerable influence on sim-to-real transfer. High reduction ratio gears are widely used in actuators, and the reality gap issue becomes especially pronounced when even the utilization of backdrivability is considered to control joints compliantly. We propose a new simulation model of gears to address this gap. Additionally, the difficulty in achieving stable bipedal locomotion causes typical methods to fail to tune physical parameters in simulation with the behavior of transferred policy. Thus, we propose a method for system identification that can utilize failed attempts. The method's effectiveness is verified using a biped robot, the ROBOTIS-OP3, and the sim-to-real transferred policy can stabilize the robot under severe disturbances and walk on uneven surfaces without force and torque sensors.
Discovering new intents is of great significance to establishing Bootstrapped Task-Oriented Dialogue System. Most existing methods either lack the ability to transfer prior knowledge in the known intent data or fall into the dilemma of forgetting prior knowledge in the follow-up. More importantly, these methods do not deeply explore the intrinsic structure of unlabeled data, so they can not seek out the characteristics that make an intent in general. In this paper, starting from the intuition that discovering intents could be beneficial to the identification of the known intents, we propose a probabilistic framework for discovering intents where intent assignments are treated as latent variables. We adopt Expectation Maximization framework for optimization. Specifically, In E-step, we conduct discovering intents and explore the intrinsic structure of unlabeled data by the posterior of intent assignments. In M-step, we alleviate the forgetting of prior knowledge transferred from known intents by optimizing the discrimination of labeled data. Extensive experiments conducted in three challenging real-world datasets demonstrate our method can achieve substantial improvements.
Motion planning and control are crucial components of robotics applications. Here, spatio-temporal hard constraints like system dynamics and safety boundaries (e.g., obstacles in automated driving) restrict the robot's motions. Direct methods from optimal control solve a constrained optimization problem. However, in many applications finding a proper cost function is inherently difficult because of the weighting of partially conflicting objectives. On the other hand, Imitation Learning (IL) methods such as Behavior Cloning (BC) provide a intuitive framework for learning decision-making from offline demonstrations and constitute a promising avenue for planning and control in complex robot applications. Prior work primarily relied on soft-constraint approaches, which use additional auxiliary loss terms describing the constraints. However, catastrophic safety-critical failures might occur in out-of-distribution (OOD) scenarios. This work integrates the flexibility of IL with hard constraint handling in optimal control. Our approach constitutes a general framework for constraint robotic motion planning and control using offline IL. Hard constraints are integrated into the learning problem in a differentiable manner, via explicit completion and gradient-based correction. Simulated experiments of mobile robot navigation and automated driving provide evidence for the performance of the proposed method.
Providing densely shaped reward functions for RL algorithms is often exceedingly challenging, motivating the development of RL algorithms that can learn from easier-to-specify sparse reward functions. This sparsity poses new exploration challenges. One common way to address this problem is using demonstrations to provide initial signal about regions of the state space with high rewards. However, prior RL from demonstrations algorithms introduce significant complexity and many hyperparameters, making them hard to implement and tune. We introduce Monte Carlo Augmented Actor Critic (MCAC), a parameter free modification to standard actor-critic algorithms which initializes the replay buffer with demonstrations and computes a modified $Q$-value by taking the maximum of the standard temporal distance (TD) target and a Monte Carlo estimate of the reward-to-go. This encourages exploration in the neighborhood of high-performing trajectories by encouraging high $Q$-values in corresponding regions of the state space. Experiments across $5$ continuous control domains suggest that MCAC can be used to significantly increase learning efficiency across $6$ commonly used RL and RL-from-demonstrations algorithms. See //sites.google.com/view/mcac-rl for code and supplementary material.
Autonomous robotic systems operating in human environments must understand their surroundings to make accurate and safe decisions. In crowded human scenes with close-up human-robot interaction and robot navigation, a deep understanding requires reasoning about human motion and body dynamics over time with human body pose estimation and tracking. However, existing datasets either do not provide pose annotations or include scene types unrelated to robotic applications. Many datasets also lack the diversity of poses and occlusions found in crowded human scenes. To address this limitation we introduce JRDB-Pose, a large-scale dataset and benchmark for multi-person pose estimation and tracking using videos captured from a social navigation robot. The dataset contains challenge scenes with crowded indoor and outdoor locations and a diverse range of scales and occlusion types. JRDB-Pose provides human pose annotations with per-keypoint occlusion labels and track IDs consistent across the scene. A public evaluation server is made available for fair evaluation on a held-out test set. JRDB-Pose is available at //jrdb.erc.monash.edu/ .
Dynamic attention mechanism and global modeling ability make Transformer show strong feature learning ability. In recent years, Transformer has become comparable to CNNs methods in computer vision. This review mainly investigates the current research progress of Transformer in image and video applications, which makes a comprehensive overview of Transformer in visual learning understanding. First, the attention mechanism is reviewed, which plays an essential part in Transformer. And then, the visual Transformer model and the principle of each module are introduced. Thirdly, the existing Transformer-based models are investigated, and their performance is compared in visual learning understanding applications. Three image tasks and two video tasks of computer vision are investigated. The former mainly includes image classification, object detection, and image segmentation. The latter contains object tracking and video classification. It is significant for comparing different models' performance in various tasks on several public benchmark data sets. Finally, ten general problems are summarized, and the developing prospects of the visual Transformer are given in this review.
We propose a new method for event extraction (EE) task based on an imitation learning framework, specifically, inverse reinforcement learning (IRL) via generative adversarial network (GAN). The GAN estimates proper rewards according to the difference between the actions committed by the expert (or ground truth) and the agent among complicated states in the environment. EE task benefits from these dynamic rewards because instances and labels yield to various extents of difficulty and the gains are expected to be diverse -- e.g., an ambiguous but correctly detected trigger or argument should receive high gains -- while the traditional RL models usually neglect such differences and pay equal attention on all instances. Moreover, our experiments also demonstrate that the proposed framework outperforms state-of-the-art methods, without explicit feature engineering.