The increasing demand for autonomous vehicles has created a need for robust navigation systems that can also operate effectively in adverse weather conditions. Visual odometry is a technique used in these navigation systems, enabling the estimation of vehicle position and motion using input from onboard cameras. However, visual odometry accuracy can be significantly impacted in challenging weather conditions, such as heavy rain, snow, or fog. In this paper, we evaluate a range of visual odometry methods, including our DROIDSLAM based heuristic approach. Specifically, these algorithms are tested on both clear and rainy weather urban driving data to evaluate their robustness. We compiled a dataset comprising of a range of rainy weather conditions from different cities. This includes, the Oxford Robotcar dataset from Oxford, the 4Seasons dataset from Munich and an internal dataset collected in Singapore. We evaluated different visual odometry algorithms for both monocular and stereo camera setups using the Absolute Trajectory Error (ATE). Our evaluation suggests that the Depth and Flow for Visual Odometry (DF-VO) algorithm with monocular setup worked well for short range distances (< 500m) and our proposed DROID-SLAM based heuristic approach for the stereo setup performed relatively well for long-term localization. Both algorithms performed consistently well across all rain conditions.
The ability to learn and refine behavior after deployment has become ever more important for robots as we design them to operate in unstructured environments like households. In this work, we design a new learning system based on large language model (LLM), OLAF, that allows everyday users to teach a robot using verbal corrections when the robot makes mistakes, e.g., by saying "Stop what you're doing. You should move closer to the cup." A key feature of OLAF is its ability to update the robot's visuomotor neural policy based on the verbal feedback to avoid repeating mistakes in the future. This is in contrast to existing LLM-based robotic systems, which only follow verbal commands or corrections but not learn from them. We demonstrate the efficacy of our design in experiments where a user teaches a robot to perform long-horizon manipulation tasks both in simulation and on physical hardware, achieving on average 20.0% improvement in policy success rate. Videos and more results are at //ut-austin-rpl.github.io/olaf/
Spectral clustering has been widely used for community detection in network sciences. While its empirical successes are well-documented, a clear theoretical understanding, particularly for sparse networks where degrees are much smaller than $\log n$, remains unclear. In this paper, we address this significant gap by demonstrating that spectral clustering offers exponentially small error rates when applied to sparse networks under Stochastic Block Models. Our analysis provides sharp characterizations of its performance, backed by matching upper and lower bounds possessing an identical exponent with the same leading constant. The key to our results is a novel truncated $\ell_2$ perturbation analysis for eigenvectors, coupled with a new analysis idea of eigenvectors truncation.
The potential for deploying autonomous systems can be significantly increased by improving the perception and interpretation of the environment. However, the development of deep learning-based techniques for autonomous systems in unstructured outdoor environments poses challenges due to limited data availability for training and testing. To address this gap, we present the German Outdoor and Offroad Dataset (GOOSE), a comprehensive dataset specifically designed for unstructured outdoor environments. The GOOSE dataset incorporates 10 000 labeled pairs of images and point clouds, which are utilized to train a range of state-of-the-art segmentation models on both image and point cloud data. We open source the dataset, along with an ontology for unstructured terrain, as well as dataset standards and guidelines. This initiative aims to establish a common framework, enabling the seamless inclusion of existing datasets and a fast way to enhance the perception capabilities of various robots operating in unstructured environments. The dataset, pre-trained models for offroad perception, and additional documentation can be found at //goose-dataset.de/.
Robots performing human-scale manipulation tasks require an extensive amount of knowledge about their surroundings in order to perform their actions competently and human-like. In this work, we investigate the use of virtual reality technology as an implementation for robot environment modeling, and present a technique for translating scene graphs into knowledge bases. To this end, we take advantage of the Universal Scene Description (USD) format which is an emerging standard for the authoring, visualization and simulation of complex environments. We investigate the conversion of USD-based environment models into Knowledge Graph (KG) representations that facilitate semantic querying and integration with additional knowledge sources.
Data-driven models for nonlinear dynamical systems based on approximating the underlying Koopman operator or generator have proven to be successful tools for forecasting, feature learning, state estimation, and control. It has become well known that the Koopman generators for control-affine systems also have affine dependence on the input, leading to convenient finite-dimensional bilinear approximations of the dynamics. Yet there are still two main obstacles that limit the scope of current approaches for approximating the Koopman generators of systems with actuation. First, the performance of existing methods depends heavily on the choice of basis functions over which the Koopman generator is to be approximated; and there is currently no universal way to choose them for systems that are not measure preserving. Secondly, if we do not observe the full state, then it becomes necessary to account for the dependence of the output time series on the sequence of supplied inputs when constructing observables to approximate Koopman operators. To address these issues, we write the dynamics of observables governed by the Koopman generator as a bilinear hidden Markov model, and determine the model parameters using the expectation-maximization (EM) algorithm. The E-step involves a standard Kalman filter and smoother, while the M-step resembles control-affine dynamic mode decomposition for the generator. We demonstrate the performance of this method on three examples, including recovery of a finite-dimensional Koopman-invariant subspace for an actuated system with a slow manifold; estimation of Koopman eigenfunctions for the unforced Duffing equation; and model-predictive control of a fluidic pinball system based only on noisy observations of lift and drag.
Collaborative perception among multiple connected and autonomous vehicles can greatly enhance perceptive capabilities by allowing vehicles to exchange supplementary information via communications. Despite advances in previous approaches, challenges still remain due to channel variations and data heterogeneity among collaborative vehicles. To address these issues, we propose ACC-DA, a channel-aware collaborative perception framework to dynamically adjust the communication graph and minimize the average transmission delay while mitigating the side effects from the data heterogeneity. Our novelties lie in three aspects. We first design a transmission delay minimization method, which can construct the communication graph and minimize the transmission delay according to different channel information state. We then propose an adaptive data reconstruction mechanism, which can dynamically adjust the rate-distortion trade-off to enhance perception efficiency. Moreover, it minimizes the temporal redundancy during data transmissions. Finally, we conceive a domain alignment scheme to align the data distribution from different vehicles, which can mitigate the domain gap between different vehicles and improve the performance of the target task. Comprehensive experiments demonstrate the effectiveness of our method in comparison to the existing state-of-the-art works.
The task of estimating 3D occupancy from surrounding-view images is an exciting development in the field of autonomous driving, following the success of Bird's Eye View (BEV) perception. This task provides crucial 3D attributes of the driving environment, enhancing the overall understanding and perception of the surrounding space. In this work, we present a simple framework for 3D occupancy estimation, which is a CNN-based framework designed to reveal several key factors for 3D occupancy estimation, such as network design, optimization, and evaluation. In addition, we explore the relationship between 3D occupancy estimation and other related tasks, such as monocular depth estimation and 3D reconstruction, which could advance the study of 3D perception in autonomous driving. For evaluation, we propose a simple sampling strategy to define the metric for occupancy evaluation, which is flexible for current public datasets. Moreover, we establish the benchmark in terms of the depth estimation metric, where we compare our proposed method with monocular depth estimation methods on the DDAD and Nuscenes datasets and achieve competitive performance. The relevant code will be updated in //github.com/GANWANSHUI/SimpleOccupancy.
This manuscript portrays optimization as a process. In many practical applications the environment is so complex that it is infeasible to lay out a comprehensive theoretical model and use classical algorithmic theory and mathematical optimization. It is necessary as well as beneficial to take a robust approach, by applying an optimization method that learns as one goes along, learning from experience as more aspects of the problem are observed. This view of optimization as a process has become prominent in varied fields and has led to some spectacular success in modeling and systems that are now part of our daily lives.
It is important to detect anomalous inputs when deploying machine learning systems. The use of larger and more complex inputs in deep learning magnifies the difficulty of distinguishing between anomalous and in-distribution examples. At the same time, diverse image and text data are available in enormous quantities. We propose leveraging these data to improve deep anomaly detection by training anomaly detectors against an auxiliary dataset of outliers, an approach we call Outlier Exposure (OE). This enables anomaly detectors to generalize and detect unseen anomalies. In extensive experiments on natural language processing and small- and large-scale vision tasks, we find that Outlier Exposure significantly improves detection performance. We also observe that cutting-edge generative models trained on CIFAR-10 may assign higher likelihoods to SVHN images than to CIFAR-10 images; we use OE to mitigate this issue. We also analyze the flexibility and robustness of Outlier Exposure, and identify characteristics of the auxiliary dataset that improve performance.
Detecting carried objects is one of the requirements for developing systems to reason about activities involving people and objects. We present an approach to detect carried objects from a single video frame with a novel method that incorporates features from multiple scales. Initially, a foreground mask in a video frame is segmented into multi-scale superpixels. Then the human-like regions in the segmented area are identified by matching a set of extracted features from superpixels against learned features in a codebook. A carried object probability map is generated using the complement of the matching probabilities of superpixels to human-like regions and background information. A group of superpixels with high carried object probability and strong edge support is then merged to obtain the shape of the carried object. We applied our method to two challenging datasets, and results show that our method is competitive with or better than the state-of-the-art.