Vision-based Simultaneous Localization And Mapping (VSLAM) is a mature problem in Robotics. Most VSLAM systems are feature based methods, which are robust and present high accuracy, but yield sparse maps with limited application for further navigation tasks. Most recently, direct methods which operate directly on image intensity have been introduced, capable of reconstructing richer maps at the cost of higher processing power. In this work, an edge-based monocular SLAM system (SE-SLAM) is proposed as a middle point: edges present good localization as point features, while enabling a structural semidense map reconstruction. However, edges are not easy to associate, track and optimize over time, as they lack descriptors and biunivocal correspondence, unlike point features. To tackle these issues, this paper presents a method to match edges between frames in a consistent manner; a feasible strategy to solve the optimization problem, since its size rapidly increases when working with edges; and the use of non-linear optimization techniques. The resulting system achieves comparable precision to state of the art feature-based and dense/semi-dense systems, while inherently building a structural semi-dense reconstruction of the environment, providing relevant structure data for further navigation algorithms. To achieve such accuracy, state of the art non-linear optimization is needed, over a continuous feed of 10000 edgepoints per frame, to optimize the full semi-dense output. Despite its heavy processing requirements, the system achieves near to real-time operation, thanks to a custom built solver and parallelization of its key stages. In order to encourage further development of edge-based SLAM systems, SE-SLAM source code will be released as open source.
This work focuses on mitigating two limitations in the joint learning of local feature detectors and descriptors. First, the ability to estimate the local shape (scale, orientation, etc.) of feature points is often neglected during dense feature extraction, while the shape-awareness is crucial to acquire stronger geometric invariance. Second, the localization accuracy of detected keypoints is not sufficient to reliably recover camera geometry, which has become the bottleneck in tasks such as 3D reconstruction. In this paper, we present ASLFeat, with three light-weight yet effective modifications to mitigate above issues. First, we resort to deformable convolutional networks to densely estimate and apply local transformation. Second, we take advantage of the inherent feature hierarchy to restore spatial resolution and low-level details for accurate keypoint localization. Finally, we use a peakiness measurement to relate feature responses and derive more indicative detection scores. The effect of each modification is thoroughly studied, and the evaluation is extensively conducted across a variety of practical scenarios. State-of-the-art results are reported that demonstrate the superiority of our methods.
In this paper, we propose a novel dense surfel mapping system that scales well in different environments with only CPU computation. Using a sparse SLAM system to estimate camera poses, the proposed mapping system can fuse intensity images and depth images into a globally consistent model. The system is carefully designed so that it can build from room-scale environments to urban-scale environments using depth images from RGB-D cameras, stereo cameras or even a monocular camera. First, superpixels extracted from both intensity and depth images are used to model surfels in the system. superpixel-based surfels make our method both run-time efficient and memory efficient. Second, surfels are further organized according to the pose graph of the SLAM system to achieve $O(1)$ fusion time regardless of the scale of reconstructed models. Third, a fast map deformation using the optimized pose graph enables the map to achieve global consistency in real-time. The proposed surfel mapping system is compared with other state-of-the-art methods on synthetic datasets. The performances of urban-scale and room-scale reconstruction are demonstrated using the KITTI dataset and autonomous aggressive flights, respectively. The code is available for the benefit of the community.
In order to facilitate the accesses of general users to knowledge graphs, an increasing effort is being exerted to construct graph-structured queries of given natural language questions. At the core of the construction is to deduce the structure of the target query and determine the vertices/edges which constitute the query. Existing query construction methods rely on question understanding and conventional graph-based algorithms which lead to inefficient and degraded performances facing complex natural language questions over knowledge graphs with large scales. In this paper, we focus on this problem and propose a novel framework standing on recent knowledge graph embedding techniques. Our framework first encodes the underlying knowledge graph into a low-dimensional embedding space by leveraging generalized local knowledge graphs. Given a natural language question, the learned embedding representations of the knowledge graph are utilized to compute the query structure and assemble vertices/edges into the target query. Extensive experiments were conducted on the benchmark dataset, and the results demonstrate that our framework outperforms state-of-the-art baseline models regarding effectiveness and efficiency.
In this paper, we proposed a new deep learning based dense monocular SLAM method. Compared to existing methods, the proposed framework constructs a dense 3D model via a sparse to dense mapping using learned surface normals. With single view learned depth estimation as prior for monocular visual odometry, we obtain both accurate positioning and high quality depth reconstruction. The depth and normal are predicted by a single network trained in a tightly coupled manner.Experimental results show that our method significantly improves the performance of visual tracking and depth prediction in comparison to the state-of-the-art in deep monocular dense SLAM.
Single-image piece-wise planar 3D reconstruction aims to simultaneously segment plane instances and recover 3D plane parameters from an image. Most recent approaches leverage convolutional neural networks (CNNs) and achieve promising results. However, these methods are limited to detecting a fixed number of planes with certain learned order. To tackle this problem, we propose a novel two-stage method based on associative embedding, inspired by its recent success in instance segmentation. In the first stage, we train a CNN to map each pixel to an embedding space where pixels from the same plane instance have similar embeddings. Then, the plane instances are obtained by grouping the embedding vectors in planar regions via an efficient mean shift clustering algorithm. In the second stage, we estimate the parameter for each plane instance by considering both pixel-level and instance-level consistencies. With the proposed method, we are able to detect an arbitrary number of planes. Extensive experiments on public datasets validate the effectiveness and efficiency of our method. Furthermore, our method runs at 30 fps at the testing time, thus could facilitate many real-time applications such as visual SLAM and human-robot interaction. Code is available at //github.com/svip-lab/PlanarReconstruction.
We propose a new multi-instance dynamic RGB-D SLAM system using an object-level octree-based volumetric representation. It can provide robust camera tracking in dynamic environments and at the same time, continuously estimate geometric, semantic, and motion properties for arbitrary objects in the scene. For each incoming frame, we perform instance segmentation to detect objects and refine mask boundaries using geometric and motion information. Meanwhile, we estimate the pose of each existing moving object using an object-oriented tracking method and robustly track the camera pose against the static scene. Based on the estimated camera pose and object poses, we associate segmented masks with existing models and incrementally fuse corresponding colour, depth, semantic, and foreground object probabilities into each object model. In contrast to existing approaches, our system is the first system to generate an object-level dynamic volumetric map from a single RGB-D camera, which can be used directly for robotic tasks. Our method can run at 2-3 Hz on a CPU, excluding the instance segmentation part. We demonstrate its effectiveness by quantitatively and qualitatively testing it on both synthetic and real-world sequences.
We present a unified framework tackling two problems: class-specific 3D reconstruction from a single image, and generation of new 3D shape samples. These tasks have received considerable attention recently; however, existing approaches rely on 3D supervision, annotation of 2D images with keypoints or poses, and/or training with multiple views of each object instance. Our framework is very general: it can be trained in similar settings to these existing approaches, while also supporting weaker supervision scenarios. Importantly, it can be trained purely from 2D images, without ground-truth pose annotations, and with a single view per instance. We employ meshes as an output representation, instead of voxels used in most prior work. This allows us to exploit shading information during training, which previous 2D-supervised methods cannot. Thus, our method can learn to generate and reconstruct concave object classes. We evaluate our approach on synthetic data in various settings, showing that (i) it learns to disentangle shape from pose; (ii) using shading in the loss improves performance; (iii) our model is comparable or superior to state-of-the-art voxel-based approaches on quantitative metrics, while producing results that are visually more pleasing; (iv) it still performs well when given supervision weaker than in prior works.
We present a monocular Simultaneous Localization and Mapping (SLAM) using high level object and plane landmarks, in addition to points. The resulting map is denser, more compact and meaningful compared to point only SLAM. We first propose a high order graphical model to jointly infer the 3D object and layout planes from single image considering occlusions and semantic constraints. The extracted cuboid object and layout planes are further optimized in a unified SLAM framework. Objects and planes can provide more semantic constraints such as Manhattan and object supporting relationships compared to points. Experiments on various public and collected datasets including ICL NUIM and TUM mono show that our algorithm can improve camera localization accuracy compared to state-of-the-art SLAM and also generate dense maps in many structured environments.
Simultaneous Localization And Mapping (SLAM) is a fundamental problem in mobile robotics. While point-based SLAM methods provide accurate camera localization, the generated maps lack semantic information. On the other hand, state of the art object detection methods provide rich information about entities present in the scene from a single image. This work marries the two and proposes a method for representing generic objects as quadrics which allows object detections to be seamlessly integrated in a SLAM framework. For scene coverage, additional dominant planar structures are modeled as infinite planes. Experiments show that the proposed points-planes-quadrics representation can easily incorporate Manhattan and object affordance constraints, greatly improving camera localization and leading to semantically meaningful maps. The performance of our SLAM system is demonstrated in //youtu.be/dR-rB9keF8M .
Cloud Robotics is one of the emerging area of robotics. It has created a lot of attention due to its direct practical implications on Robotics. In Cloud Robotics, the concept of cloud computing is used to offload computational extensive jobs of the robots to the cloud. Apart from this, additional functionalities can also be offered on run to the robots on demand. Simultaneous Localization and Mapping (SLAM) is one of the computational intensive algorithm in robotics used by robots for navigation and map building in an unknown environment. Several Cloud based frameworks are proposed specifically to address the problem of SLAM, DAvinCi, Rapyuta and C2TAM are some of those framework. In this paper, we presented a detailed review of all these framework implementation for SLAM problem.