The real-time dynamic environment perception has become vital for autonomous robots in crowded spaces. Although the popular voxel-based mapping methods can efficiently represent 3D obstacles with arbitrarily complex shapes, they can hardly distinguish between static and dynamic obstacles, leading to the limited performance of obstacle avoidance. While plenty of sophisticated learning-based dynamic obstacle detection algorithms exist in autonomous driving, the quadcopter's limited computation resources cannot achieve real-time performance using those approaches. To address these issues, we propose a real-time dynamic obstacle tracking and mapping system for quadcopter obstacle avoidance using an RGB-D camera. The proposed system first utilizes a depth image with an occupancy voxel map to generate potential dynamic obstacle regions as proposals. With the obstacle region proposals, the Kalman filter and our continuity filter are applied to track each dynamic obstacle. Finally, the environment-aware trajectory prediction method is proposed based on the Markov chain using the states of tracked dynamic obstacles. We implemented the proposed system with our custom quadcopter and navigation planner. The simulation and physical experiments show that our methods can successfully track and represent obstacles in dynamic environments in real-time and safely avoid obstacles.
The social compatibility (SC) is one of the most important parameters for service robots. It characterises the interaction quality between a robot and a human. In this paper, we first introduce an open-source software-hardware integration scheme for socially-compliant robot navigation and then propose a human-centered benchmarking framework. For the former, we integrate one 3D lidar, one 2D lidar, and four RGB-D cameras for robot exterior perception. The software system is entirely based on the Robot Operating System (ROS) with high modularity and fully deployed to the embedded hardware-based edge while running at a rate that exceeds the release frequency of sensor data. For the latter, we propose a new human-centered performance evaluation metric that can be used to measure SC quickly and efficiently. The values of this metric correlate with the results of the Godspeed questionnaire, which is believed to be a golden standard approach for SC measurements. Together with other commonly used metrics, we benchmark two open-source socially-compliant robot navigation methods, in an end-to-end manner. We clarify all aspects of the benchmarking to ensure the reproducibility of the experiments. We also show that the proposed new metric can provide further justification for the selection of numerical metrics (objective) from a human perspective (subjective).
We propose performing imbalanced classification by regrouping majority classes into small classes so that we turn the problem into balanced multiclass classification. This new idea is dramatically different from popular loss reweighting and class resampling methods. Our preliminary result on imbalanced medical image classification shows that this natural idea can substantially boost the classification performance as measured by average precision (approximately area-under-the-precision-recall-curve, or AUPRC), which is more appropriate for evaluating imbalanced classification than other metrics such as balanced accuracy.
Detecting objects of interest, such as human survivors, safety equipment, and structure access points, is critical to any search-and-rescue operation. Robots deployed for such time-sensitive efforts rely on their onboard sensors to perform their designated tasks. However, as disaster response operations are predominantly conducted under perceptually degraded conditions, commonly utilized sensors such as visual cameras and LiDARs suffer in terms of performance degradation. In response, this work presents a method that utilizes the complementary nature of vision and depth sensors to leverage multi-modal information to aid object detection at longer distances. In particular, depth and intensity values from sparse LiDAR returns are used to generate proposals for objects present in the environment. These proposals are then utilized by a Pan-Tilt-Zoom (PTZ) camera system to perform a directed search by adjusting its pose and zoom level for performing object detection and classification in difficult environments. The proposed work has been thoroughly verified using an ANYmal quadruped robot in underground settings and on datasets collected during the DARPA Subterranean Challenge finals.
We present Visual Navigation and Locomotion over obstacles (ViNL), which enables a quadrupedal robot to navigate unseen apartments while stepping over small obstacles that lie in its path (e.g., shoes, toys, cables), similar to how humans and pets lift their feet over objects as they walk. ViNL consists of: (1) a visual navigation policy that outputs linear and angular velocity commands that guides the robot to a goal coordinate in unfamiliar indoor environments; and (2) a visual locomotion policy that controls the robot's joints to avoid stepping on obstacles while following provided velocity commands. Both the policies are entirely "model-free", i.e. sensors-to-actions neural networks trained end-to-end. The two are trained independently in two entirely different simulators and then seamlessly co-deployed by feeding the velocity commands from the navigator to the locomotor, entirely "zero-shot" (without any co-training). While prior works have developed learning methods for visual navigation or visual locomotion, to the best of our knowledge, this is the first fully learned approach that leverages vision to accomplish both (1) intelligent navigation in new environments, and (2) intelligent visual locomotion that aims to traverse cluttered environments without disrupting obstacles. On the task of navigation to distant goals in unknown environments, ViNL using just egocentric vision significantly outperforms prior work on robust locomotion using privileged terrain maps (+32.8% success and -4.42 collisions per meter). Additionally, we ablate our locomotion policy to show that each aspect of our approach helps reduce obstacle collisions. Videos and code at //www.joannetruong.com/projects/vinl.html
LiDAR sensors are an integral part of modern autonomous vehicles as they provide an accurate, high-resolution 3D representation of the vehicle's surroundings. However, it is computationally difficult to make use of the ever-increasing amounts of data from multiple high-resolution LiDAR sensors. As frame-rates, point cloud sizes and sensor resolutions increase, real-time processing of these point clouds must still extract semantics from this increasingly precise picture of the vehicle's environment. One deciding factor of the run-time performance and accuracy of deep neural networks operating on these point clouds is the underlying data representation and the way it is computed. In this work, we examine the relationship between the computational representations used in neural networks and their performance characteristics. To this end, we propose a novel computational taxonomy of LiDAR point cloud representations used in modern deep neural networks for 3D point cloud processing. Using this taxonomy, we perform a structured analysis of different families of approaches. Thereby, we uncover common advantages and limitations in terms of computational efficiency, memory requirements, and representational capacity as measured by semantic segmentation performance. Finally, we provide some insights and guidance for future developments in neural point cloud processing methods.
Recent data- and learning-based sound source localization (SSL) methods have shown strong performance in challenging acoustic scenarios. However, little work has been done on adapting such methods to track consistently multiple sources appearing and disappearing, as would occur in reality. In this paper, we present a new training strategy for deep learning SSL models with a straightforward implementation based on the mean squared error of the optimal association between estimated and reference positions in the preceding time frames. It optimizes the desired properties of a tracking system: handling a time-varying number of sources and ordering localization estimates according to their trajectories, minimizing identity switches (IDSs). Evaluation on simulated data of multiple reverberant moving sources and on two model architectures proves its effectiveness on reducing identity switches without compromising frame-wise localization accuracy.
Autonomous driving is complex, requiring sophisticated 3D scene understanding, localization, mapping, and control. Rather than explicitly modelling and fusing each of these components, we instead consider an end-to-end approach via reinforcement learning (RL). However, collecting exploration driving data in the real world is impractical and dangerous. While training in simulation and deploying visual sim-to-real techniques has worked well for robot manipulation, deploying beyond controlled workspace viewpoints remains a challenge. In this paper, we address this challenge by presenting Sim2Seg, a re-imagining of RCAN that crosses the visual reality gap for off-road autonomous driving, without using any real-world data. This is done by learning to translate randomized simulation images into simulated segmentation and depth maps, subsequently enabling real-world images to also be translated. This allows us to train an end-to-end RL policy in simulation, and directly deploy in the real-world. Our approach, which can be trained in 48 hours on 1 GPU, can perform equally as well as a classical perception and control stack that took thousands of engineering hours over several months to build. We hope this work motivates future end-to-end autonomous driving research.
Triangle mesh-based maps have proven to be a powerful 3D representation of the environment, allowing robots to navigate using universal methods, indoors as well as in challenging outdoor environments with tunnels, hills and varying slopes. However, any robot that navigates autonomously necessarily requires stable, accurate, and continuous localization in such a mesh map where it plans its paths and missions. We present MICP-L, a novel and very fast \textit{Mesh ICP Localization} method that can register one or more range sensors directly on a triangle mesh map to continuously localize a robot, determining its 6D pose in the map. Correspondences between a range sensor and the mesh are found through simulations accelerated with the latest RTX hardware. With MICP-L, a correction can be performed quickly and in parallel even with combined data from different range sensor models. With this work, we aim to significantly advance the development in the field of mesh-based environment representation for autonomous robotic applications. MICP-L is open source and fully integrated with ROS and tf.
The authors present a novel face tracking approach where optical flow information is incorporated into a modified version of the Viola Jones detection algorithm. In the original algorithm, detection is static, as information from previous frames is not considered. In addition, candidate windows have to pass all stages of the classification cascade, otherwise they are discarded as containing no face. In contrast, the proposed tracker preserves information about the number of classification stages passed by each window. Such information is used to build a likelihood map, which represents the probability of having a face located at that position. Tracking capabilities are provided by extrapolating the position of the likelihood map to the next frame by optical flow computation. The proposed algorithm works in real time on a standard laptop. The system is verified on the Boston Head Tracking Database, showing that the proposed algorithm outperforms the standard Viola Jones detector in terms of detection rate and stability of the output bounding box, as well as including the capability to deal with occlusions. The authors also evaluate two recently published face detectors based on convolutional networks and deformable part models with their algorithm showing a comparable accuracy at a fraction of the computation time.
Deep robot vision models are widely used for recognizing objects from camera images, but shows poor performance when detecting objects at untrained positions. Although such problem can be alleviated by training with large datasets, the dataset collection cost cannot be ignored. Existing visual attention models tackled the problem by employing a data efficient structure which learns to extract task relevant image areas. However, since the models cannot modify attention targets after training, it is difficult to apply to dynamically changing tasks. This paper proposed a novel Key-Query-Value formulated visual attention model. This model is capable of switching attention targets by externally modifying the Query representations, namely top-down attention. The proposed model is experimented on a simulator and a real-world environment. The model was compared to existing end-to-end robot vision models in the simulator experiments, showing higher performance and data efficiency. In the real-world robot experiments, the model showed high precision along with its scalability and extendibility.