Learning-based approaches have achieved remarkable performance in the domain of autonomous driving. Leveraging the representational power of neural networks and large amounts of human driving data, complex patterns and rules of driving behavior can be encoded into a model that benefits the autonomous driving system. An increasing number of data-driven methods have also been studied for the decision-making and motion-planning modules. However, the reliability and stability of neural networks remain a source of uncertainty. In this paper, we introduce a hierarchical planning architecture consisting of a high-level grid-based behavior planner and a low-level trajectory planner, which is highly interpretable and controllable. The high-level planner is responsible for finding a consistent route, while the low-level planner generates a feasible trajectory along it. We evaluate our method both in closed-loop simulation and in real-world driving, and demonstrate that the neural-network planner performs strongly in complex urban autonomous driving scenarios.
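A minimal sketch of the two-level interface described above, assuming a grid-based route search on top of a trajectory generator; all names and data structures here are illustrative, not the authors' implementation:

```python
# Illustrative sketch of a hierarchical planner: a high-level, grid-based
# behavior planner picks a consistent route, and a low-level planner refines
# it into a feasible trajectory. Names and types are assumptions.
from dataclasses import dataclass
from typing import Any, List, Tuple

Cell = Tuple[int, int]               # (row, col) in the behavior grid
State = Tuple[float, float, float]   # (x, y, heading) of the ego vehicle

@dataclass
class HierarchicalPlanner:
    behavior_planner: Any    # high-level, grid-based route search
    trajectory_planner: Any  # low-level, kinematically feasible trajectory generation

    def plan(self, grid: Any, start: Cell, goal: Cell, ego_state: State) -> List[State]:
        # High level: search the behavior grid for a consistent route.
        route: List[Cell] = self.behavior_planner.search(grid, start, goal)
        # Low level: turn the coarse route into a smooth trajectory to track.
        return self.trajectory_planner.generate(route, ego_state)
```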
Recent advances in Neural Radiance Fields (NeRFs) have made it possible to reconstruct and reanimate dynamic portrait scenes with control over head pose, facial expression, and viewing direction. However, training such models assumes photometric consistency over the deformed region, e.g., the face must be evenly lit as it deforms with changing head pose and facial expression. Such photometric consistency across the frames of a video is hard to maintain, even in studio environments, making the resulting reanimatable neural portraits prone to artifacts during reanimation. In this work, we propose CoDyNeRF, a system that enables the creation of fully controllable 3D portraits under real-world capture conditions. CoDyNeRF learns to approximate illumination-dependent effects via a dynamic appearance model in the canonical space, conditioned on predicted surface normals and on the facial-expression and head-pose deformations. The surface-normal prediction is guided by 3DMM normals, which act as a coarse prior for the normals of the human head; direct prediction of normals is hard due to the rigid and non-rigid deformations induced by head-pose and facial-expression changes. Using only a short smartphone-captured video of a subject for training, we demonstrate the effectiveness of our method on free-view synthesis of a portrait scene with explicit head-pose and expression controls and realistic lighting effects. The project page can be found here: //shahrukhathar.github.io/2023/08/22/CoDyNeRF.html
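As a rough sketch of the conditioning idea (not the authors' architecture), a canonical-space appearance network could take per-point features together with predicted normals, an expression code, and a head-pose code; the dimensions and layer sizes below are assumptions:

```python
# Illustrative sketch: an appearance MLP in canonical space conditioned on
# predicted surface normals, expression, and head-pose codes. All sizes and
# the conditioning-by-concatenation scheme are assumptions.
import torch
import torch.nn as nn

class DynamicAppearanceMLP(nn.Module):
    def __init__(self, feat_dim=64, normal_dim=3, expr_dim=50, pose_dim=6, hidden=128):
        super().__init__()
        in_dim = feat_dim + normal_dim + expr_dim + pose_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, canon_feat, pred_normals, expr_code, pose_code):
        # Concatenating the conditioning signals lets the predicted colour vary
        # with illumination-dependent effects driven by surface orientation
        # and by expression/head-pose deformations.
        x = torch.cat([canon_feat, pred_normals, expr_code, pose_code], dim=-1)
        return self.mlp(x)
```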
Unmanned aerial vehicles (UAVs) are recognized as a promising technology for area coverage due to their flexibility and adaptability. However, the capability of a single UAV is limited, whereas in large-scale three-dimensional (3D) scenarios, UAV swarms can establish seamless wireless communication services. Hence, in this work, we consider a UAV swarm deployment and trajectory design scenario that provides 3D coverage while accounting for the effects of obstacles. In detail, we propose a hierarchical swarm framework to efficiently serve users over a large area. The problem is then formulated to minimize the total trajectory loss of the UAV swarm. However, the problem is intractable due to its non-convexity, so we decompose it into the smaller subproblems of user clustering, UAV swarm hovering-point selection, and swarm trajectory determination. Moreover, we design a Q-learning-based algorithm to improve solution efficiency. Finally, we conduct extensive simulations to verify the proposed mechanisms, and the designed algorithm outperforms the reference methods.
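For reference, a minimal sketch of the tabular Q-learning update that such an algorithm builds on; the environment interface, reward, and hyperparameters below are placeholders, not the paper's exact design:

```python
# Sketch of tabular Q-learning as it might drive hovering-point / trajectory
# decisions. env, its state/action encoding, and the reward are assumptions.
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            actions = env.actions(s)
            if random.random() < eps:                        # explore
                a = random.choice(actions)
            else:                                            # exploit
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(s, a)                 # r: e.g. negative trajectory loss
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions(s_next))
            # Standard temporal-difference update toward the bootstrapped target.
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```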
Information compression techniques are widely employed to reduce communication cost over peer-to-peer links. In this paper, we investigate distributed Nash equilibrium (NE) seeking problems in a class of non-cooperative games over directed graphs with information compression. To improve communication efficiency, a compressed distributed NE seeking (C-DNES) algorithm is proposed to obtain an NE of such games, where the differences between decision vectors and their estimates are compressed. The proposed algorithm is compatible with a general class of compression operators, including both unbiased and biased compressors. Moreover, our approach only requires the adjacency matrix of the directed graph to be row-stochastic, in contrast to past works that relied on balancedness or specific global network parameters. It is shown that C-DNES not only inherits the advantages of conventional distributed NE seeking algorithms, achieving a linear convergence rate for games with restricted strongly monotone mappings, but also saves communication cost in terms of transmitted bits. Finally, numerical simulations illustrate that C-DNES reduces communication cost by an order of magnitude under different compressors.
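To make the two compressor families concrete, here is a minimal sketch of a biased top-k sparsifier and an unbiased randomized sparsifier; these are standard examples, not necessarily the operators used in the paper:

```python
# Examples of biased and unbiased compression operators applied to a
# (1-D) difference vector. Illustrative only.
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Biased compressor: keep only the k largest-magnitude entries."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def rand_k(x: np.ndarray, k: int, rng=np.random.default_rng()) -> np.ndarray:
    """Unbiased compressor: keep k random entries, rescaled so E[C(x)] = x."""
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx] * (x.size / k)
    return out
```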
Large Language Models (LLMs) have demonstrated remarkable performance in code completion. However, due to their lack of domain-specific knowledge, they may be suboptimal when completing code that requires intensive domain knowledge, for example, completing library names. Although several works have confirmed the effectiveness of fine-tuning for adapting language models to code completion in specific domains, they are limited by the need to repeatedly fine-tune the model as the project keeps iterating. To address this limitation, we propose $k$NM-LM, a retrieval-augmented language model (R-LM) that integrates domain knowledge into language models without fine-tuning. Unlike previous techniques, our approach automatically adapts to different language models and domains. Specifically, it uses in-domain code to build a retrieval database decoupled from the LM, and then combines the two through Bayesian inference to complete the code. Extensive experiments on intra-project and intra-scenario completion confirm that $k$NM-LM brings appreciable improvements over CodeGPT and UnixCoder. A deeper analysis of our tool, covering response speed, storage usage, completion of specific code types, and API invocation completion, confirms that $k$NM-LM provides satisfactory performance, rendering it highly appropriate for domain-adaptive code completion. Furthermore, our approach operates without requiring direct access to the language model's parameters. As a result, it can seamlessly integrate with black-box code completion models, making it easy to use our approach as a plugin to further enhance their performance.
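The abstract does not spell out the exact Bayesian combination rule, so the sketch below shows a $k$NN-LM-style mixture of a retrieval-derived next-token distribution with the base LM's distribution as a stand-in; the mixing weight, datastore format, and function names are assumptions:

```python
# Sketch of combining a retrieval distribution with an LM's next-token
# distribution, in the spirit of kNN-LM. Illustrative, not the paper's rule.
import numpy as np

def p_from_neighbors(neighbor_tokens, distances, vocab_size, temperature=1.0):
    """Turn retrieved (token, distance) pairs into a softmax-weighted distribution."""
    p = np.zeros(vocab_size)
    weights = np.exp(-np.asarray(distances) / temperature)
    for tok, w in zip(neighbor_tokens, weights):
        p[tok] += w
    return p / p.sum()

def combine(p_lm: np.ndarray, p_retrieval: np.ndarray, lam: float = 0.3) -> np.ndarray:
    """Mix LM and retrieval distributions over the vocabulary and renormalize."""
    p = (1.0 - lam) * p_lm + lam * p_retrieval
    return p / p.sum()
```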
Anticipating the motion of other road users is crucial for automated driving systems (ADS), as it enables safe and informed downstream decision-making and motion planning. Unfortunately, contemporary learning-based approaches to motion prediction exhibit significant performance degradation as the prediction horizon increases or the observation window decreases. This paper proposes a novel technique for trajectory prediction that combines a data-driven learning-based method with a velocity vector field (VVF) generated from a nature-inspired concept, namely fluid flow dynamics. In this work, the vector field is incorporated as an additional input to a convolutional-recurrent deep neural network to help predict the most likely future trajectories given a sequence of bird's-eye-view scene representations. The performance of the proposed model is compared with state-of-the-art methods on the HighD dataset, demonstrating that including the VVF improves prediction accuracy for both short- and long-term (5 sec) horizons. It is also shown that the accuracy remains consistent as the observation window shrinks, which alleviates the need for a long history of past observations for accurate trajectory prediction. Source code is available at: //github.com/Amir-Samadi/VVF-TP.
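One plausible way to wire the VVF into a convolutional-recurrent predictor is to stack its (vx, vy) components as extra channels on each bird's-eye-view frame, as sketched below; the backbone, shapes, and horizon are assumptions rather than the published architecture:

```python
# Sketch: BEV frames and a velocity vector field enter a conv-recurrent
# predictor as stacked channels. Architecture details are assumptions.
import torch
import torch.nn as nn

class VVFTrajectoryPredictor(nn.Module):
    def __init__(self, bev_channels=3, hidden=64, horizon=25):
        super().__init__()
        in_ch = bev_channels + 2  # +2 for the (vx, vy) vector-field channels
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.temporal = nn.GRU(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon * 2)  # (x, y) offset per future step

    def forward(self, bev_seq, vvf_seq):
        # bev_seq: (B, T, C, H, W); vvf_seq: (B, T, 2, H, W)
        x = torch.cat([bev_seq, vvf_seq], dim=2)
        B, T = x.shape[:2]
        feats = self.encoder(x.flatten(0, 1)).view(B, T, -1)
        _, h = self.temporal(feats)          # h: (1, B, hidden)
        return self.head(h[-1]).view(B, -1, 2)
```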
Robotic surgical subtask automation has the potential to reduce the per-patient workload of human surgeons. A variety of surgical subtasks require geometric information about subsurface anatomy, such as the location of tumors, which necessitates accurate and efficient surgical sensing. In this work, we propose an automated sensing method that maps 3D subsurface anatomy to provide such geometric knowledge. We model the anatomy via a Bayesian Hilbert map-based probabilistic 3D occupancy map. Using this occupancy map, we plan sensing paths on the surface of the anatomy via a graph search algorithm, $A^*$ search, with a cost function that lets the generated trajectories balance exploration of unsensed regions against refinement of the existing probabilistic understanding. We demonstrate the performance of our proposed method by comparing it against three other methods in several anatomical environments, including a real-life CT scan dataset. The experimental results show that our method efficiently detects relevant subsurface anatomy with shorter trajectories than the comparison methods, and that the resulting occupancy map achieves high accuracy.
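One way (an assumption, not necessarily the authors' exact formulation) to encode the exploration-versus-refinement trade-off is an $A^*$ edge cost that charges for distance travelled but discounts cells whose occupancy belief is still uncertain:

```python
# Sketch of an A* edge cost balancing exploration and refinement: distance
# minus a reward proportional to the occupancy belief's entropy. Illustrative.
import numpy as np

def edge_cost(p_occupied: float, step_length: float, w_explore: float = 1.0) -> float:
    # Shannon entropy of the cell's occupancy belief peaks at p = 0.5,
    # so unsensed or ambiguous cells become cheaper to visit.
    eps = 1e-9
    entropy = -(p_occupied * np.log2(p_occupied + eps)
                + (1.0 - p_occupied) * np.log2(1.0 - p_occupied + eps))
    return step_length - w_explore * entropy
```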
Clustering techniques have been key drivers of data mining, machine learning, and pattern recognition for decades. One of the most popular clustering algorithms is DBSCAN, owing to its high accuracy and noise tolerance. However, many powerful algorithms such as DBSCAN have input parameters that are hard to estimate, so finding those parameters is a time-consuming process. In this paper, we propose a novel clustering algorithm, Bacteria-Farm, which balances performance with the ease of finding optimal clustering parameters. The Bacteria-Farm algorithm is inspired by the growth of bacteria in closed experimental farms - their ability to consume food and grow - which closely mirrors the ideal cluster growth desired in clustering algorithms. In addition, the algorithm features a modular design that allows versions of the algorithm to be created for specific tasks or data distributions. In contrast with other clustering algorithms, ours also provides a way to specify the amount of noise to be excluded during clustering.
Automatic speech recognition (ASR) systems have been shown to have large quality disparities between the language varieties they are intended or expected to recognize. One way to mitigate this is to train or fine-tune models with more representative datasets. But this approach can be hindered by limited in-domain data for training and evaluation. We propose a new way to improve the robustness of a US English short-form speech recognizer using a small amount of out-of-domain (long-form) African American English (AAE) data. We use CORAAL, YouTube and Mozilla Common Voice to train an audio classifier to approximately output whether an utterance is AAE or some other variety including Mainstream American English (MAE). By combining the classifier output with coarse geographic information, we can select a subset of utterances from a large corpus of untranscribed short-form queries for semi-supervised learning at scale. Fine-tuning on this data results in a 38.5% relative word error rate disparity reduction between AAE and MAE without reducing MAE quality.
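A minimal sketch of the selection step described above, combining the dialect classifier's score with coarse geographic metadata to pick untranscribed queries for semi-supervised fine-tuning; the field names, classifier API, and threshold are assumptions:

```python
# Illustrative data-selection filter: keep untranscribed utterances that the
# classifier scores as likely AAE and that come from target regions.
def select_utterances(utterances, classifier, target_regions, threshold=0.8):
    selected = []
    for utt in utterances:
        p_aae = classifier.predict_proba(utt["audio"])   # P(variety = AAE)
        if p_aae >= threshold and utt["region"] in target_regions:
            selected.append(utt)
    return selected
```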
Graph Neural Networks (GNNs) have shown promising results on a broad spectrum of applications. Most empirical studies of GNNs directly take the observed graph as input, assuming the observed structure perfectly depicts the accurate and complete relations between nodes. However, graphs in the real world are inevitably noisy or incomplete, which can severely degrade the quality of graph representations. In this work, we propose a novel Variational Information Bottleneck guided Graph Structure Learning framework, namely VIB-GSL, from the perspective of information theory. VIB-GSL advances the Information Bottleneck (IB) principle for graph structure learning, providing a more elegant and universal framework for mining underlying task-relevant relations. VIB-GSL learns an informative and compressive graph structure to distill the actionable information for specific downstream tasks. VIB-GSL deduces a variational approximation for irregular graph data to form a tractable IB objective function, which facilitates training stability. Extensive experimental results demonstrate the superior effectiveness and robustness of VIB-GSL.
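For reference, the generic Information Bottleneck objective that such a framework builds on can be written as follows, where $G$ is the observed graph, $Z$ the learned graph structure/representation, $Y$ the downstream target, and $\beta$ a trade-off weight; the paper's exact variational formulation and notation may differ:

$$\min_{Z} \; -I(Z; Y) + \beta \, I(Z; G)$$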
Recent contrastive representation learning methods rely on estimating mutual information (MI) between multiple views of an underlying context. E.g., we can derive multiple views of a given image by applying data augmentation, or we can split a sequence into views comprising the past and future of some step in the sequence. Contrastive lower bounds on MI are easy to optimize, but have a strong underestimation bias when estimating large amounts of MI. We propose decomposing the full MI estimation problem into a sum of smaller estimation problems by splitting one of the views into progressively more informed subviews and by applying the chain rule on MI between the decomposed views. This expression contains a sum of unconditional and conditional MI terms, each measuring modest chunks of the total MI, which facilitates approximation via contrastive bounds. To maximize the sum, we formulate a contrastive lower bound on the conditional MI which can be approximated efficiently. We refer to our general approach as Decomposed Estimation of Mutual Information (DEMI). We show that DEMI can capture a larger amount of MI than standard non-decomposed contrastive bounds in a synthetic setting, and learns better representations in a vision domain and for dialogue generation.
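The decomposition rests on the chain rule of mutual information: splitting a view $y$ into subviews $(y_1, y_2)$ gives

$$I(x; y) = I(x; y_1, y_2) = I(x; y_1) + I(x; y_2 \mid y_1),$$

so each smaller term can be approximated with its own contrastive lower bound (the two-subview split here is illustrative; the same identity extends to more subviews).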