Both, robot and hand-eye calibration haven been object to research for decades. While current approaches manage to precisely and robustly identify the parameters of a robot's kinematic model, they still rely on external devices, such as calibration objects, markers and/or external sensors. Instead of trying to fit the recorded measurements to a model of a known object, this paper treats robot calibration as an offline SLAM problem, where scanning poses are linked to a fixed point in space by a moving kinematic chain. As such, the presented framework allows robot calibration using nothing but an arbitrary eye-in-hand depth sensor, thus enabling fully autonomous self-calibration without any external tools. My new approach is utilizes a modified version of the Iterative Closest Point algorithm to run bundle adjustment on multiple 3D recordings estimating the optimal parameters of the kinematic model. A detailed evaluation of the system is shown on a real robot with various attached 3D sensors. The presented results show that the system reaches precision comparable to a dedicated external tracking system at a fraction of its cost.
Human perception reliably identifies movable and immovable parts of 3D scenes, and completes the 3D structure of objects and background from incomplete observations. We learn this skill not via labeled examples, but simply by observing objects move. In this work, we propose an approach that observes unlabeled multi-view videos at training time and learns to map a single image observation of a complex scene, such as a street with cars, to a 3D neural scene representation that is disentangled into movable and immovable parts while plausibly completing its 3D structure. We separately parameterize movable and immovable scene parts via 2D neural ground plans. These ground plans are 2D grids of features aligned with the ground plane that can be locally decoded into 3D neural radiance fields. Our model is trained self-supervised via neural rendering. We demonstrate that the structure inherent to our disentangled 3D representation enables a variety of downstream tasks in street-scale 3D scenes using simple heuristics, such as extraction of object-centric 3D representations, novel view synthesis, instance segmentation, and 3D bounding box prediction, highlighting its value as a backbone for data-efficient 3D scene understanding models. This disentanglement further enables scene editing via object manipulation such as deletion, insertion, and rigid-body motion.
We present a generic framework for scale-aware direct monocular odometry based on depth prediction from a deep neural network. In contrast with previous methods where depth information is only partially exploited, we formulate a novel depth prediction residual which allows us to incorporate multi-view depth information. In addition, we propose to use a truncated robust cost function which prevents considering inconsistent depth estimations. The photometric and depth-prediction measurements are integrated into a tightly-coupled optimization leading to a scale-aware monocular system which does not accumulate scale drift. Our proposal does not particularize for a concrete neural network, being able to work along with the vast majority of the existing depth prediction solutions. We demonstrate the validity and generality of our proposal evaluating it on the KITTI odometry dataset, using two publicly available neural networks and comparing it with similar approaches and the state-of-the-art for monocular and stereo SLAM. Experiments show that our proposal largely outperforms classic monocular SLAM, being 5 to 9 times more precise, beating similar approaches and having an accuracy which is closer to that of stereo systems.
While dense visual SLAM methods are capable of estimating dense reconstructions of the environment, they suffer from a lack of robustness in their tracking step, especially when the optimisation is poorly initialised. Sparse visual SLAM systems have attained high levels of accuracy and robustness through the inclusion of inertial measurements in a tightly-coupled fusion. Inspired by this performance, we propose the first tightly-coupled dense RGB-D-inertial SLAM system. Our system has real-time capability while running on a GPU. It jointly optimises for the camera pose, velocity, IMU biases and gravity direction while building up a globally consistent, fully dense surfel-based 3D reconstruction of the environment. Through a series of experiments on both synthetic and real world datasets, we show that our dense visual-inertial SLAM system is more robust to fast motions and periods of low texture and low geometric variation than a related RGB-D-only SLAM system.
Space robotics applications, such as Active Space Debris Removal (ASDR), require representative testing before launch. A commonly used approach to emulate the microgravity environment in space is air-bearing based platforms on flat-floors, such as the European Space Agency's Orbital Robotics and GNC Lab (ORGL). This work proposes a control architecture for a floating platform at the ORGL, equipped with eight solenoid-valve-based thrusters and one reaction wheel. The control architecture consists of two main components: a trajectory planner that finds optimal trajectories connecting two states and a trajectory follower that follows any physically feasible trajectory. The controller is first evaluated within an introduced simulation, achieving a 100 % success rate at finding and following trajectories to the origin within a Monte-Carlo test. Individual trajectories are also successfully followed by the physical system. In this work, we showcase the ability of the controller to reject disturbances and follow a straight-line trajectory within tens of centimeters.
Event cameras are bio-inspired sensors that offer advantages over traditional cameras. They work asynchronously, sampling the scene with microsecond resolution and producing a stream of brightness changes. This unconventional output has sparked novel computer vision methods to unlock the camera's potential. We tackle the problem of event-based stereo 3D reconstruction for SLAM. Most event-based stereo methods try to exploit the camera's high temporal resolution and event simultaneity across cameras to establish matches and estimate depth. By contrast, we investigate how to estimate depth without explicit data association by fusing Disparity Space Images (DSIs) originated in efficient monocular methods. We develop fusion theory and apply it to design multi-camera 3D reconstruction algorithms that produce state-of-the-art results, as we confirm by comparing against four baseline methods and testing on a variety of available datasets.
This paper introduces an Online Localisation and Colored Mesh Reconstruction (OLCMR) ROS perception architecture for ground exploration robots aiming to perform robust Simultaneous Localisation And Mapping (SLAM) in challenging unknown environments and provide an associated colored 3D mesh representation in real time. It is intended to be used by a remote human operator to easily visualise the mapped environment during or after the mission or as a development base for further researches in the field of exploration robotics. The architecture is mainly composed of carefully-selected open-source ROS implementations of a LiDAR-based SLAM algorithm alongside a colored surface reconstruction procedure using a point cloud and RGB camera images projected into the 3D space. The overall performances are evaluated on the Newer College handheld LiDAR-Vision reference dataset and on two experimental trajectories gathered on board of representative wheeled robots in respectively urban and countryside outdoor environments. Index Terms: Field Robots, Mapping, SLAM, Colored Surface Reconstruction
Weakly Supervised Object Localization (WSOL), which aims to localize objects by only using image-level labels, has attracted much attention because of its low annotation cost in real applications. Recent studies leverage the advantage of self-attention in visual Transformer for long-range dependency to re-active semantic regions, aiming to avoid partial activation in traditional class activation mapping (CAM). However, the long-range modeling in Transformer neglects the inherent spatial coherence of the object, and it usually diffuses the semantic-aware regions far from the object boundary, making localization results significantly larger or far smaller. To address such an issue, we introduce a simple yet effective Spatial Calibration Module (SCM) for accurate WSOL, incorporating semantic similarities of patch tokens and their spatial relationships into a unified diffusion model. Specifically, we introduce a learnable parameter to dynamically adjust the semantic correlations and spatial context intensities for effective information propagation. In practice, SCM is designed as an external module of Transformer, and can be removed during inference to reduce the computation cost. The object-sensitive localization ability is implicitly embedded into the Transformer encoder through optimization in the training phase. It enables the generated attention maps to capture the sharper object boundaries and filter the object-irrelevant background area. Extensive experimental results demonstrate the effectiveness of the proposed method, which significantly outperforms its counterpart TS-CAM on both CUB-200 and ImageNet-1K benchmarks. The code is available at //github.com/164140757/SCM.
In this work, we introduce an adaptive control framework for human-robot collaborative transportation of objects with unknown deformation behaviour. The proposed framework takes as input the haptic information transmitted through the object, and the kinematic information of the human body obtained from a motion capture system to create reactive whole-body motions on a mobile collaborative robot. In order to validate our framework experimentally, we compared its performance with an admittance controller during a co-transportation task of a partially deformable object. We additionally demonstrate the potential of the framework while co-transporting rigid (aluminum rod) and highly deformable (rope) objects. A mobile manipulator which consists of an Omni-directional mobile base, a collaborative robotic arm, and a robotic hand is used as the robotic partner in the experiments. Quantitative and qualitative results of a 12-subjects experiment show that the proposed framework can effectively deal with objects of unknown deformability and provides intuitive assistance to human partners.
Applying artificial intelligence techniques in medical imaging is one of the most promising areas in medicine. However, most of the recent success in this area highly relies on large amounts of carefully annotated data, whereas annotating medical images is a costly process. In this paper, we propose a novel method, called FocalMix, which, to the best of our knowledge, is the first to leverage recent advances in semi-supervised learning (SSL) for 3D medical image detection. We conducted extensive experiments on two widely used datasets for lung nodule detection, LUNA16 and NLST. Results show that our proposed SSL methods can achieve a substantial improvement of up to 17.3% over state-of-the-art supervised learning approaches with 400 unlabeled CT scans.
This work addresses a novel and challenging problem of estimating the full 3D hand shape and pose from a single RGB image. Most current methods in 3D hand analysis from monocular RGB images only focus on estimating the 3D locations of hand keypoints, which cannot fully express the 3D shape of hand. In contrast, we propose a Graph Convolutional Neural Network (Graph CNN) based method to reconstruct a full 3D mesh of hand surface that contains richer information of both 3D hand shape and pose. To train networks with full supervision, we create a large-scale synthetic dataset containing both ground truth 3D meshes and 3D poses. When fine-tuning the networks on real-world datasets without 3D ground truth, we propose a weakly-supervised approach by leveraging the depth map as a weak supervision in training. Through extensive evaluations on our proposed new datasets and two public datasets, we show that our proposed method can produce accurate and reasonable 3D hand mesh, and can achieve superior 3D hand pose estimation accuracy when compared with state-of-the-art methods.