Forecasting vehicular motions in autonomous driving requires a deep understanding of agent interactions and the preservation of motion equivariance under Euclidean geometric transformations. Traditional models often lack the sophistication needed to handle the intricate dynamics inherent to autonomous vehicles and the interaction relationships among agents in the scene. As a result, these models have a lower model capacity, which then leads to higher prediction errors and lower training efficiency. In our research, we employ EqMotion, a leading equivariant particle, and human prediction model that also accounts for invariant agent interactions, for the task of multi-agent vehicle motion forecasting. In addition, we use a multi-modal prediction mechanism to account for multiple possible future paths in a probabilistic manner. By leveraging EqMotion, our model achieves state-of-the-art (SOTA) performance with fewer parameters (1.2 million) and a significantly reduced training time (less than 2 hours).
As a class of fruitful approaches, diffusion probabilistic models (DPMs) have shown excellent advantages in high-resolution image reconstruction. On the other hand, masked autoencoders (MAEs), as popular self-supervised vision learners, have demonstrated simpler and more effective image reconstruction and transfer capabilities on downstream tasks. However, they all require extremely high training costs, either due to inherent high temporal-dependence (i.e., excessively long diffusion steps) or due to artificially low spatial-dependence (i.e., human-formulated high mask ratio, such as 0.75). To the end, this paper presents LMD, a faster image reconstruction framework with latent masking diffusion. First, we propose to project and reconstruct images in latent space through a pre-trained variational autoencoder, which is theoretically more efficient than in the pixel-based space. Then, we combine the advantages of MAEs and DPMs to design a progressive masking diffusion model, which gradually increases the masking proportion by three different schedulers and reconstructs the latent features from simple to difficult, without sequentially performing denoising diffusion as in DPMs or using fixed high masking ratio as in MAEs, so as to alleviate the high training time-consumption predicament. Our approach allows for learning high-capacity models and accelerate their training (by 3x or more) and barely reduces the original accuracy. Inference speed in downstream tasks also significantly outperforms the previous approaches.
Perception serves as a critical component in the functionality of autonomous agents. However, the intricate relationship between perception metrics and robotic metrics remains unclear, leading to ambiguity in the development and fine-tuning of perception algorithms. In this paper, we introduce a methodology for quantifying this relationship, taking into account factors such as detection rate, detection quality, and latency. Furthermore, we introduce two novel metrics for Human-Robot Collaboration safety predicated upon perception metrics: Critical Collision Probability (CCP) and Average Collision Probability (ACP). To validate the utility of these metrics in facilitating algorithm development and tuning, we develop an attentive processing strategy that focuses exclusively on key input features. This approach significantly reduces computational time while preserving a similar level of accuracy. Experimental results indicate that the implementation of this strategy in an object detector leads to a maximum reduction of 30.091% in inference time and 26.534% in total time per frame. Additionally, the strategy lowers the CCP and ACP in a baseline model by 11.252% and 13.501%, respectively. The source code will be made publicly available in the final proof version of the manuscript.
Despite significant recent progress in the field of autonomous driving, modern methods still struggle and can incur serious accidents when encountering long-tail unforeseen events and challenging urban scenarios. On the one hand, large language models (LLM) have shown impressive reasoning capabilities that approach "Artificial General Intelligence". On the other hand, previous autonomous driving methods tend to rely on limited-format inputs (e.g. sensor data and navigation waypoints), restricting the vehicle's ability to understand language information and interact with humans. To this end, this paper introduces LMDrive, a novel language-guided, end-to-end, closed-loop autonomous driving framework. LMDrive uniquely processes and integrates multi-modal sensor data with natural language instructions, enabling interaction with humans and navigation software in realistic instructional settings. To facilitate further research in language-based closed-loop autonomous driving, we also publicly release the corresponding dataset which includes approximately 64K instruction-following data clips, and the LangAuto benchmark that tests the system's ability to handle complex instructions and challenging driving scenarios. Extensive closed-loop experiments are conducted to demonstrate LMDrive's effectiveness. To the best of our knowledge, we're the very first work to leverage LLMs for closed-loop end-to-end autonomous driving. Codes can be found at //github.com/opendilab/LMDrive
Tool tracking in surgical videos is vital in computer-assisted intervention for tasks like surgeon skill assessment, safety zone estimation, and human-machine collaboration during minimally invasive procedures. The lack of large-scale datasets hampers Artificial Intelligence implementation in this domain. Current datasets exhibit overly generic tracking formalization, often lacking surgical context: a deficiency that becomes evident when tools move out of the camera's scope, resulting in rigid trajectories that hinder realistic surgical representation. This paper addresses the need for a more precise and adaptable tracking formalization tailored to the intricacies of endoscopic procedures by introducing CholecTrack20, an extensive dataset meticulously annotated for multi-class multi-tool tracking across three perspectives representing the various ways of considering the temporal duration of a tool trajectory: (1) intraoperative, (2) intracorporeal, and (3) visibility within the camera's scope. The dataset comprises 20 laparoscopic videos with over 35,000 frames and 65,000 annotated tool instances with details on spatial location, category, identity, operator, phase, and surgical visual conditions. This detailed dataset caters to the evolving assistive requirements within a procedure.
Safety is the primary priority of autonomous driving. Nevertheless, no published dataset currently supports the direct and explainable safety evaluation for autonomous driving. In this work, we propose DeepAccident, a large-scale dataset generated via a realistic simulator containing diverse accident scenarios that frequently occur in real-world driving. The proposed DeepAccident dataset includes 57K annotated frames and 285K annotated samples, approximately 7 times more than the large-scale nuScenes dataset with 40k annotated samples. In addition, we propose a new task, end-to-end motion and accident prediction, which can be used to directly evaluate the accident prediction ability for different autonomous driving algorithms. Furthermore, for each scenario, we set four vehicles along with one infrastructure to record data, thus providing diverse viewpoints for accident scenarios and enabling V2X (vehicle-to-everything) research on perception and prediction tasks. Finally, we present a baseline V2X model named V2XFormer that demonstrates superior performance for motion and accident prediction and 3D object detection compared to the single-vehicle model.
This study presents the outcomes of empirical investigations pertaining to human-vehicle interactions involving an autonomous vehicle equipped with both internal and external Human Machine Interfaces (HMIs) within a crosswalk scenario. The internal and external HMIs were integrated with implicit communication techniques, incorporating a combination of gentle and aggressive braking maneuvers within the crosswalk. Data were collected through a combination of questionnaires and quantifiable metrics, including pedestrian decision to cross related to the vehicle distance and speed. The questionnaire responses reveal that pedestrians experience enhanced safety perceptions when the external HMI and gentle braking maneuvers are used in tandem. In contrast, the measured variables demonstrate that the external HMI proves effective when complemented by the gentle braking maneuver. Furthermore, the questionnaire results highlight that the internal HMI enhances passenger confidence only when paired with the aggressive braking maneuver.
Recently, several methods have been proposed to estimate 3D human pose from multi-view images and achieved impressive performance on public datasets collected in relatively easy scenarios. However, there are limited approaches for extracting 3D human skeletons from multimodal inputs (e.g., RGB and pointcloud) that can enhance the accuracy of predicting 3D poses in challenging situations. We fill this gap by introducing a pipeline called PointVoxel that fuses multi-view RGB and pointcloud inputs to obtain 3D human poses. We demonstrate that volumetric representation is an effective architecture for integrating these different modalities. Moreover, in order to overcome the challenges of annotating 3D human pose labels in difficult scenarios, we develop a synthetic dataset generator for pretraining and design an unsupervised domain adaptation strategy so that we can obtain a well-trained 3D human pose estimator without using any manual annotations. We evaluate our approach on four datasets (two public datasets, one synthetic dataset, and one challenging dataset named BasketBall collected by ourselves), showing promising results. The code and dataset will be released soon.
The ability to accurately predict the trajectory of surrounding vehicles is a critical hurdle to overcome on the journey to fully autonomous vehicles. To address this challenge, we pioneer a novel behavior-aware trajectory prediction model (BAT) that incorporates insights and findings from traffic psychology, human behavior, and decision-making. Our model consists of behavior-aware, interaction-aware, priority-aware, and position-aware modules that perceive and understand the underlying interactions and account for uncertainty and variability in prediction, enabling higher-level learning and flexibility without rigid categorization of driving behavior. Importantly, this approach eliminates the need for manual labeling in the training process and addresses the challenges of non-continuous behavior labeling and the selection of appropriate time windows. We evaluate BAT's performance across the Next Generation Simulation (NGSIM), Highway Drone (HighD), Roundabout Drone (RounD), and Macao Connected Autonomous Driving (MoCAD) datasets, showcasing its superiority over prevailing state-of-the-art (SOTA) benchmarks in terms of prediction accuracy and efficiency. Remarkably, even when trained on reduced portions of the training data (25%), our model outperforms most of the baselines, demonstrating its robustness and efficiency in predicting vehicle trajectories, and the potential to reduce the amount of data required to train autonomous vehicles, especially in corner cases. In conclusion, the behavior-aware model represents a significant advancement in the development of autonomous vehicles capable of predicting trajectories with the same level of proficiency as human drivers. The project page is available at //github.com/Petrichor625/BATraj-Behavior-aware-Model.
The research study of detecting multiple intents and filling slots is becoming more popular because of its relevance to complicated real-world situations. Recent advanced approaches, which are joint models based on graphs, might still face two potential issues: (i) the uncertainty introduced by constructing graphs based on preliminary intents and slots, which may transfer intent-slot correlation information to incorrect label node destinations, and (ii) direct incorporation of multiple intent labels for each token w.r.t. token-level intent voting might potentially lead to incorrect slot predictions, thereby hurting the overall performance. To address these two issues, we propose a joint model named MISCA. Our MISCA introduces an intent-slot co-attention mechanism and an underlying layer of label attention mechanism. These mechanisms enable MISCA to effectively capture correlations between intents and slot labels, eliminating the need for graph construction. They also facilitate the transfer of correlation information in both directions: from intents to slots and from slots to intents, through multiple levels of label-specific representations, without relying on token-level intent information. Experimental results show that MISCA outperforms previous models, achieving new state-of-the-art overall accuracy performances on two benchmark datasets MixATIS and MixSNIPS. This highlights the effectiveness of our attention mechanisms.
The development of autonomous agents which can interact with other agents to accomplish a given task is a core area of research in artificial intelligence and machine learning. Towards this goal, the Autonomous Agents Research Group develops novel machine learning algorithms for autonomous systems control, with a specific focus on deep reinforcement learning and multi-agent reinforcement learning. Research problems include scalable learning of coordinated agent policies and inter-agent communication; reasoning about the behaviours, goals, and composition of other agents from limited observations; and sample-efficient learning based on intrinsic motivation, curriculum learning, causal inference, and representation learning. This article provides a broad overview of the ongoing research portfolio of the group and discusses open problems for future directions.