We present DeepIPCv2, an autonomous driving model that perceives the environment using a LiDAR sensor for more robust drivability, especially when driving under poor illumination conditions where everything is not clearly visible. DeepIPCv2 takes a set of LiDAR point clouds as the main perception input. Since point clouds are not affected by illumination changes, they can provide a clear observation of the surroundings no matter what the condition is. This results in a better scene understanding and stable features provided by the perception module to support the controller module in estimating navigational control properly. To evaluate its performance, we conduct several tests by deploying the model to predict a set of driving records and perform real automated driving under three different conditions. We also conduct ablation and comparative studies with some recent models to justify its performance. Based on the experimental results, DeepIPCv2 shows a robust performance by achieving the best drivability in all driving scenarios. Furthermore, to support future research, we will upload the codes and data to //github.com/oskarnatan/DeepIPCv2.
Trajectory prediction in autonomous driving relies on accurate representation of all relevant contexts of the driving scene, including traffic participants, road topology, traffic signs, as well as their semantic relations to each other. Despite increased attention to this issue, most approaches in trajectory prediction do not consider all of these factors sufficiently. We present SemanticFormer, an approach for predicting multimodal trajectories by reasoning over a semantic traffic scene graph using a hybrid approach. It utilizes high-level information in the form of meta-paths, i.e. trajectories on which an agent is allowed to drive from a knowledge graph which is then processed by a novel pipeline based on multiple attention mechanisms to predict accurate trajectories. SemanticFormer comprises a hierarchical heterogeneous graph encoder to capture spatio-temporal and relational information across agents as well as between agents and road elements. Further, it includes a predictor to fuse different encodings and decode trajectories with probabilities. Finally, a refinement module assesses permitted meta-paths of trajectories and speed profiles to obtain final predicted trajectories. Evaluation of the nuScenes benchmark demonstrates improved performance compared to several SOTA methods. In addition, we demonstrate that our knowledge graph can be easily added to two graph-based existing SOTA methods, namely VectorNet and Laformer, replacing their original homogeneous graphs. The evaluation results suggest that by adding our knowledge graph the performance of the original methods is enhanced by 5% and 4%, respectively.
Automated driving systems are an integral part of the automotive industry. Tools such as Robot Operating System and simulators support their development. However, in the end, the developers must test their algorithms on a real vehicle. To better observe the difference between reality and simulation--the reality gap--digital twin technology offers real-time communication between the real vehicle and its model. We present low fidelity digital twin generator and describe situations where automatic generation is preferable to high fidelity simulation. We validated our approach of generating a virtual environment with a vehicle model by replaying the data recorded from the real vehicle.
We present a novel synthetically generated multi-modal dataset, SCaRL, to enable the training and validation of autonomous driving solutions. Multi-modal datasets are essential to attain the robustness and high accuracy required by autonomous systems in applications such as autonomous driving. As deep learning-based solutions are becoming more prevalent for object detection, classification, and tracking tasks, there is great demand for datasets combining camera, lidar, and radar sensors. Existing real/synthetic datasets for autonomous driving lack synchronized data collection from a complete sensor suite. SCaRL provides synchronized Synthetic data from RGB, semantic/instance, and depth Cameras; Range-Doppler-Azimuth/Elevation maps and raw data from Radar; and 3D point clouds/2D maps of semantic, depth and Doppler data from coherent Lidar. SCaRL is a large dataset based on the CARLA Simulator, which provides data for diverse, dynamic scenarios and traffic conditions. SCaRL is the first dataset to include synthetic synchronized data from coherent Lidar and MIMO radar sensors. The dataset can be accessed here: //fhr-ihs-sva.pages.fraunhofer.de/asp/scarl/
Zero-shot information extraction (IE) aims to build IE systems from the unannotated text. It is challenging due to involving little human intervention. Challenging but worthwhile, zero-shot IE reduces the time and effort that data labeling takes. Recent efforts on large language models (LLMs, e.g., GPT-3, ChatGPT) show promising performance on zero-shot settings, thus inspiring us to explore prompt-based methods. In this work, we ask whether strong IE models can be constructed by directly prompting LLMs. Specifically, we transform the zero-shot IE task into a multi-turn question-answering problem with a two-stage framework (ChatIE). With the power of ChatGPT, we extensively evaluate our framework on three IE tasks: entity-relation triple extract, named entity recognition, and event extraction. Empirical results on six datasets across two languages show that ChatIE achieves impressive performance and even surpasses some full-shot models on several datasets (e.g., NYT11-HRL). We believe that our work could shed light on building IE models with limited resources.
We introduce RedMotion, a transformer model for motion prediction in self-driving vehicles that learns environment representations via redundancy reduction. Our first type of redundancy reduction is induced by an internal transformer decoder and reduces a variable-sized set of local road environment tokens, representing road graphs and agent data, to a fixed-sized global embedding. The second type of redundancy reduction is obtained by self-supervised learning and applies the redundancy reduction principle to embeddings generated from augmented views of road environments. Our experiments reveal that our representation learning approach outperforms PreTraM, Traj-MAE, and GraphDINO in a semi-supervised setting. Moreover, RedMotion achieves competitive results compared to HPTR or MTR++ in the Waymo Motion Prediction Challenge. Our open-source implementation is available at: //github.com/kit-mrt/future-motion
We present GSDeformer, a method that achieves free-form deformation on 3D Gaussian Splatting(3DGS) without requiring any architectural changes. Our method extends cage-based deformation, a traditional mesh deformation method, to 3DGS. This is done by converting 3DGS into a novel proxy point cloud representation, where its deformation can be used to infer the transformations to apply on the 3D gaussians making up 3DGS. We also propose an automatic cage construction algorithm for 3DGS to minimize manual work. Our method does not modify the underlying architecture of 3DGS. Therefore, any existing trained vanilla 3DGS can be easily edited by our method. We compare the deformation capability of our method against other existing methods, demonstrating the ease of use and comparable quality of our method, despite being more direct and thus easier to integrate with other concurrent developments on 3DGS.
We present a novel generative 3D modeling system, coined CraftsMan, which can generate high-fidelity 3D geometries with highly varied shapes, regular mesh topologies, and detailed surfaces, and, notably, allows for refining the geometry in an interactive manner. Despite the significant advancements in 3D generation, existing methods still struggle with lengthy optimization processes, irregular mesh topologies, noisy surfaces, and difficulties in accommodating user edits, consequently impeding their widespread adoption and implementation in 3D modeling software. Our work is inspired by the craftsman, who usually roughs out the holistic figure of the work first and elaborates the surface details subsequently. Specifically, we employ a 3D native diffusion model, which operates on latent space learned from latent set-based 3D representations, to generate coarse geometries with regular mesh topology in seconds. In particular, this process takes as input a text prompt or a reference image and leverages a powerful multi-view (MV) diffusion model to generate multiple views of the coarse geometry, which are fed into our MV-conditioned 3D diffusion model for generating the 3D geometry, significantly improving robustness and generalizability. Following that, a normal-based geometry refiner is used to significantly enhance the surface details. This refinement can be performed automatically, or interactively with user-supplied edits. Extensive experiments demonstrate that our method achieves high efficacy in producing superior-quality 3D assets compared to existing methods. HomePage: //craftsman3d.github.io/, Code: //github.com/wyysf-98/CraftsMan
Advances in latent diffusion models (LDMs) have revolutionized high-resolution image generation, but the design space of the autoencoder that is central to these systems remains underexplored. In this paper, we introduce LiteVAE, a family of autoencoders for LDMs that leverage the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality. We also investigate the training methodologies and the decoder architecture of LiteVAE and propose several enhancements that improve the training dynamics and reconstruction quality. Our base LiteVAE model matches the quality of the established VAEs in current LDMs with a six-fold reduction in encoder parameters, leading to faster training and lower GPU memory requirements, while our larger model outperforms VAEs of comparable complexity across all evaluated metrics (rFID, LPIPS, PSNR, and SSIM).
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks using a novel training procedure, referred to as a pseudo-masked language model (PMLM). Given an input text with masked tokens, we rely on conventional masks to learn inter-relations between corrupted tokens and context via autoencoding, and pseudo masks to learn intra-relations between masked spans via partially autoregressive modeling. With well-designed position embeddings and self-attention masks, the context encodings are reused to avoid redundant computation. Moreover, conventional masks used for autoencoding provide global masking information, so that all the position embeddings are accessible in partially autoregressive language modeling. In addition, the two tasks pre-train a unified language model as a bidirectional encoder and a sequence-to-sequence decoder, respectively. Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks across several widely used benchmarks.
We present Emu, a system that semantically enhances multilingual sentence embeddings. Our framework fine-tunes pre-trained multilingual sentence embeddings using two main components: a semantic classifier and a language discriminator. The semantic classifier improves the semantic similarity of related sentences, whereas the language discriminator enhances the multilinguality of the embeddings via multilingual adversarial training. Our experimental results based on several language pairs show that our specialized embeddings outperform the state-of-the-art multilingual sentence embedding model on the task of cross-lingual intent classification using only monolingual labeled data.