Most classical SLAM systems rely on the static scene assumption, which limits their applicability in real world scenarios. Recent SLAM frameworks have been proposed to simultaneously track the camera and moving objects. However they are often unable to estimate the canonical pose of the objects and exhibit a low object tracking accuracy. To solve this problem we propose TwistSLAM++, a semantic, dynamic, SLAM system that fuses stereo images and LiDAR information. Using semantic information, we track potentially moving objects and associate them to 3D object detections in LiDAR scans to obtain their pose and size. Then, we perform registration on consecutive object scans to refine object pose estimation. Finally, object scans are used to estimate the shape of the object and constrain map points to lie on the estimated surface within the BA. We show on classical benchmarks that this fusion approach based on multimodal information improves the accuracy of object tracking.
As an essential processing step before the fusing of infrared and visible images, the performance of image registration determines whether the two images can be fused at correct spatial position. In the actual scenario, the varied imaging devices may lead to a change in perspective or time gap between shots, making significant non-rigid spatial relationship in infrared and visible images. Even if a large number of feature points are matched, the registration accuracy may still be inadequate, affecting the result of image fusion and other vision tasks. To alleviate this problem, we propose a Semantic-Aware on-Demand registration network (SA-DNet), which mainly purpose is to confine the feature matching process to the semantic region of interest (sROI) by designing semantic-aware module (SAM) and HOL-Deep hybrid matching module (HDM). After utilizing TPS to transform infrared and visible images based on the corresponding feature points in sROI, the registered images are fused using image fusion module (IFM) to achieve a fully functional registration and fusion network. Moreover, we point out that for different demands, this type of approach allows us to select semantic objects for feature matching as needed and accomplishes task-specific registration based on specific requirements. To demonstrate the robustness of SA-DNet for non-rigid distortions, we conduct extensive experiments by comparing SA-DNet with five state-of-the-art infrared and visible image feature matching methods, and the experimental results show that our method adapts better to the presence of non-rigid distortions in the images and provides semantically well-registered images.
In a complex disease such as tuberculosis, the evidence for the disease and its evolution may be present in multiple modalities such as clinical, genomic, or imaging data. Effective patient-tailored outcome prediction and therapeutic guidance will require fusing evidence from these modalities. Such multimodal fusion is difficult since the evidence for the disease may not be uniform across all modalities, not all modality features may be relevant, or not all modalities may be present for all patients. All these nuances make simple methods of early, late, or intermediate fusion of features inadequate for outcome prediction. In this paper, we present a novel fusion framework using multiplexed graphs and derive a new graph neural network for learning from such graphs. Specifically, the framework allows modalities to be represented through their targeted encodings, and models their relationship explicitly via multiplexed graphs derived from salient features in a combined latent space. We present results that show that our proposed method outperforms state-of-the-art methods of fusing modalities for multi-outcome prediction on a large Tuberculosis (TB) dataset.
The exploration of large-scale unknown environments can benefit from the deployment of multiple robots for collaborative mapping. Each robot explores a section of the environment and communicates onboard pose estimates and maps to a central server to build an optimized global multi-robot map. Naturally, inconsistencies can arise between onboard and server estimates due to onboard odometry drift, failures, or degeneracies. The mapping server can correct and overcome such failure cases using computationally expensive operations such as inter-robot loop closure detection and multi-modal mapping. However, the individual robots do not benefit from the collaborative map if the mapping server provides no feedback. Although server updates from the multi-robot map can greatly alleviate the robotic mission strategically, most existing work lacks them, due to their associated computational and bandwidth-related costs. Motivated by this challenge, this paper proposes a novel collaborative mapping framework that enables global mapping consistency among robots and the mapping server. In particular, we propose graph spectral analysis, at different spatial scales, to detect structural differences between robot and server graphs, and to generate necessary constraints for the individual robot pose graphs. Our approach specifically finds the nodes that correspond to the drift's origin rather than the nodes where the error becomes too large. We thoroughly analyze and validate our proposed framework using several real-world multi-robot field deployments where we show improvements of the onboard system up to 90\% and can recover the onboard estimation from localization failures and even from the degeneracies within its estimation.
With the advanced request to employ a team of robots to perform a task collaboratively, the research community has become increasingly interested in collaborative simultaneous localization and mapping. Unfortunately, existing datasets are limited in the scale and variation of the collaborative trajectories they capture, even though generalization between inter-trajectories among different agents is crucial to the overall viability of collaborative tasks. To help align the research community's contributions with real-world multiagent ordinated SLAM problems, we introduce S3E, a novel large-scale multimodal dataset captured by a fleet of unmanned ground vehicles along four designed collaborative trajectory paradigms. S3E consists of 7 outdoor and 5 indoor scenes that each exceed 200 seconds, consisting of well synchronized and calibrated high-quality stereo camera, LiDAR, and high-frequency IMU data. Crucially, our effort exceeds previous attempts regarding dataset size, scene variability, and complexity. It has 4x as much average recording time as the pioneering EuRoC dataset. We also provide careful dataset analysis as well as baselines for collaborative SLAM and single counterparts. Find data, code, and more up-to-date information at //github.com/PengYu-Team/S3E.
Since point clouds obtained from lidar are spatially discrete and non-repetitive, directly using point clouds to achieve object data association and robust state estimation is not a simple task. Further, tracking and analyzing the object states facilitates determining how they are involved in localization and mapping works. In this paper, we propose a least-squares estimator incorporating semantic 3D bounding boxes and geometric point clouds to achieve accurate and robust tracking of multiple objects. Then, the proposed tracker is integrated into a multi-object lidar odometry (MLO) system using only point clouds as input. By analyzing object motion states, the mapping module uses static objects and environmental features to eliminate accumulated errors. Meanwhile, the MLO system provides continuous object trajectories in map coordinate. Finally, we evaluate the effectiveness of the proposed semantic geometric fusion multi-object tracking (SGF-MOT) module and the localization accuracy of the MLO system under the public KITTI dataset.
State estimation in complex illumination environments based on conventional visual-inertial odometry is a challenging task due to the severe visual degradation of the visual camera. The thermal infrared camera is capable of all-day time and is less affected by illumination variation. However, most existing visual data association algorithms are incompatible because the thermal infrared data contains large noise and low contrast. Motivated by the phenomenon that thermal radiation varies most significantly at the edges of objects, the study proposes an ETIO, which is the first edge-based monocular thermal-inertial odometry for robust localization in visually degraded environments. Instead of the raw image, we utilize the binarized image from edge extraction for pose estimation to overcome the poor thermal infrared image quality. Then, an adaptive feature tracking strategy ADT-KLT is developed for robust data association based on limited edge information and its distance distribution. Finally, a pose graph optimization performs real-time estimation over a sliding window of recent states by combining IMU pre-integration with reprojection error of all edge feature observations. We evaluated the performance of the proposed system on public datasets and real-world experiments and compared it against state-of-the-art methods. The proposed ETIO was verified with the ability to enable accurate and robust localization all-day time.
Utilizing the sensor characteristics of the audio, visible camera, and thermal camera, the robustness of person recognition can be enhanced. Existing multimodal person recognition frameworks are primarily formulated assuming that multimodal data is always available. In this paper, we propose a novel trimodal sensor fusion framework using the audio, visible, and thermal camera, which addresses the missing modality problem. In the framework, a novel deep latent embedding framework, termed the AVTNet, is proposed to learn multiple latent embeddings. Also, a novel loss function, termed missing modality loss, accounts for possible missing modalities based on the triplet loss calculation while learning the individual latent embeddings. Additionally, a joint latent embedding utilizing the trimodal data is learnt using the multi-head attention transformer, which assigns attention weights to the different modalities. The different latent embeddings are subsequently used to train a deep neural network. The proposed framework is validated on the Speaking Faces dataset. A comparative analysis with baseline algorithms shows that the proposed framework significantly increases the person recognition accuracy while accounting for missing modalities.
3D LiDAR sensors are indispensable for the robust vision of autonomous mobile robots. However, deploying LiDAR-based perception algorithms often fails due to a domain gap from the training environment, such as inconsistent angular resolution and missing properties. Existing studies have tackled the issue by learning inter-domain mapping, while the transferability is constrained by the training configuration and the training is susceptible to peculiar lossy noises called ray-drop. To address the issue, this paper proposes a generative model of LiDAR range images applicable to the data-level domain transfer. Motivated by the fact that LiDAR measurement is based on point-by-point range imaging, we train an implicit image representation-based generative adversarial networks along with a differentiable ray-drop effect. We demonstrate the fidelity and diversity of our model in comparison with the point-based and image-based state-of-the-art generative models. We also showcase upsampling and restoration applications. Furthermore, we introduce a Sim2Real application for LiDAR semantic segmentation. We demonstrate that our method is effective as a realistic ray-drop simulator and outperforms state-of-the-art methods.
In this work, we have implemented a SLAM-assisted navigation module for a real autonomous vehicle with unknown dynamics. The navigation objective is to reach a desired goal configuration along a collision-free trajectory while adhering to the dynamics of the system. Specifically, we use LiDAR-based Hector SLAM for building the map of the environment, detecting obstacles, and for tracking vehicle's conformance to the trajectory as it passes through various states. For motion planning, we use rapidly exploring random trees (RRTs) on a set of generated motion primitives to search for dynamically feasible trajectory sequences and collision-free path to the goal. We demonstrate complex maneuvers such as parallel parking, perpendicular parking, and reversing motion by the real vehicle in a constrained environment using the presented approach.
Autonomous robotic systems operating in human environments must understand their surroundings to make accurate and safe decisions. In crowded human scenes with close-up human-robot interaction and robot navigation, a deep understanding requires reasoning about human motion and body dynamics over time with human body pose estimation and tracking. However, existing datasets either do not provide pose annotations or include scene types unrelated to robotic applications. Many datasets also lack the diversity of poses and occlusions found in crowded human scenes. To address this limitation we introduce JRDB-Pose, a large-scale dataset and benchmark for multi-person pose estimation and tracking using videos captured from a social navigation robot. The dataset contains challenge scenes with crowded indoor and outdoor locations and a diverse range of scales and occlusion types. JRDB-Pose provides human pose annotations with per-keypoint occlusion labels and track IDs consistent across the scene. A public evaluation server is made available for fair evaluation on a held-out test set. JRDB-Pose is available at //jrdb.erc.monash.edu/ .