Object-oriented maps are important for scene understanding, since they jointly capture geometry and semantics and allow individual instantiation and meaningful reasoning about objects. We introduce FroDO, a method for accurate 3D reconstruction of object instances from RGB video that infers object location, pose and shape in a coarse-to-fine manner. Key to FroDO is embedding object shapes in a novel learned space that allows seamless switching between a sparse point cloud and dense DeepSDF decoding. Given an input sequence of localized RGB frames, FroDO first aggregates 2D detections to instantiate a category-aware 3D bounding box per object instance. A shape code is regressed with an encoder network before shape and pose are further optimized using both the sparse and dense shape representations. The optimization uses multi-view geometric, photometric and silhouette losses. We evaluate on real-world datasets, including Pix3D, Redwood-OS and ScanNet, for single-view, multi-view and multi-object reconstruction.
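As a rough illustration of the shape-code refinement step described above, the sketch below optimizes a latent code so that a DeepSDF-style decoder evaluates to (near) zero at triangulated surface points. The decoder architecture, loss weights and the `refine_shape_code` helper are illustrative assumptions only; FroDO's actual pipeline additionally refines pose and adds dense photometric and silhouette terms.

```python
import torch

class DeepSDFDecoder(torch.nn.Module):
    """Toy stand-in for a pretrained DeepSDF-style decoder: maps a latent shape
    code plus a 3D query point to a signed distance value. The architecture is
    illustrative, not FroDO's actual network."""
    def __init__(self, code_dim=64, hidden=256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(code_dim + 3, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1))

    def forward(self, code, points):
        # code: (D,), points: (N, 3) in the object frame -> SDF values (N,)
        c = code.unsqueeze(0).expand(points.shape[0], -1)
        return self.net(torch.cat([c, points], dim=-1)).squeeze(-1)

def refine_shape_code(decoder, init_code, surface_points, steps=200, lr=1e-2):
    """Sparse geometric refinement: push the decoded SDF toward zero at
    triangulated surface points, a simplified analogue of the sparse term
    in the abstract (pose refinement and dense losses are omitted)."""
    code = init_code.clone().requires_grad_(True)
    opt = torch.optim.Adam([code], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        sdf = decoder(code, surface_points)
        loss = sdf.abs().mean() + 1e-3 * code.norm()  # data term + code prior
        loss.backward()
        opt.step()
    return code.detach()
```

For example, `refine_shape_code(DeepSDFDecoder(), torch.zeros(64), pts)` would refine an initial (here zero) code against a set of triangulated surface points `pts`.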
End-to-End Object Detection with Transformers
Code:
This paper has been submitted to ECCV 2020; author team: Facebook AI Research. FAIR presents DETR: end-to-end object detection with Transformers, with no NMS post-processing step and truly no anchors, matching and even surpassing Faster R-CNN. The code has just been open-sourced!
Note: within 24 hours of release, the repository has already reached 700+ stars!
Introduction
This paper presents a new method that views object detection as a direct set prediction problem. The approach streamlines the detection pipeline, effectively removing the need for many hand-designed components, such as non-maximum suppression (NMS) or anchor generation, that explicitly encode prior knowledge about the task.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel.
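The architecture described above can be sketched with standard PyTorch components. The sketch below (a `MiniDETR` with a ResNet-50 backbone, simplified learned 2D positional encodings and 100 object queries) follows the spirit of the minimal demo released with DETR, but the class name, hyper-parameters and positional-encoding scheme here are illustrative assumptions, not the exact released implementation.

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MiniDETR(nn.Module):
    """Minimal DETR-style detector: CNN backbone -> transformer encoder-decoder
    with learned object queries -> per-query class and box heads."""
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        self.backbone = nn.Sequential(*list(resnet50().children())[:-2])  # drop avgpool/fc
        self.proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)            # channel reduction
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no object"
        self.bbox_head = nn.Linear(hidden_dim, 4)                 # normalized (cx, cy, w, h)
        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, images):
        feats = self.proj(self.backbone(images))           # (B, hidden_dim, H, W)
        B, _, H, W = feats.shape
        pos = torch.cat([                                   # learned 2D positional encoding
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)               # (H*W, 1, hidden_dim)
        src = pos + feats.flatten(2).permute(2, 0, 1)       # (H*W, B, hidden_dim)
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1) # (num_queries, B, hidden_dim)
        hs = self.transformer(src, tgt)                     # decoded query embeddings
        return self.class_head(hs), self.bbox_head(hs).sigmoid()
```

During training, each of the `num_queries` output slots is either matched to a ground-truth object or assigned the extra "no object" class, which is where the bipartite matching loss below comes in.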
Unlike many other modern detectors, the new model is conceptually simple and does not require a specialized library. DETR demonstrates accuracy and run-time performance on par with the well-established and highly optimized Faster R-CNN baseline on the challenging COCO object detection dataset. Moreover, DETR can easily be generalized to other tasks such as panoptic segmentation.
The Detection Transformer of this paper (DETR, see Figure 1) predicts all objects at once and is trained end-to-end with a set loss function that performs bipartite matching between predicted and ground-truth objects. DETR simplifies the detection pipeline by dropping multiple hand-designed post-processing steps and components that encode prior knowledge, such as NMS. Unlike most existing detection methods, DETR does not require any customized layers and can therefore be reproduced easily in any framework that provides standard CNN and transformer classes.
The main characteristic of DETR is the combination of a bipartite matching loss with transformers using (non-autoregressive) parallel decoding.
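The bipartite matching itself can be computed with the Hungarian algorithm from SciPy. The `hungarian_match` helper below is a simplified stand-in for DETR's full matching cost: it uses only a negative class probability plus an L1 box distance, with illustrative weights and no GIoU term.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, tgt_labels, tgt_boxes,
                    cls_weight=1.0, box_weight=5.0):
    """Bipartite matching between N predictions and M ground-truth objects
    for a single image. Shapes: pred_logits (N, C+1), pred_boxes (N, 4),
    tgt_labels (M,), tgt_boxes (M, 4); boxes in normalized (cx, cy, w, h).
    The weights and the simplified cost are illustrative choices."""
    prob = pred_logits.softmax(-1)                       # (N, C+1)
    cost_class = -prob[:, tgt_labels]                    # (N, M): higher prob -> lower cost
    cost_bbox = torch.cdist(pred_boxes, tgt_boxes, p=1)  # (N, M): L1 box distance
    cost = cls_weight * cost_class + box_weight * cost_bbox
    pred_idx, tgt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(tgt_idx)
```

The returned index pairs define which prediction is responsible for which ground-truth object; unmatched predictions are supervised toward the "no object" class.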
References:
The task of detecting 3D objects in point clouds has a pivotal role in many real-world applications. However, 3D object detection performance lags behind that of 2D object detection due to the lack of powerful 3D feature extraction methods. To address this issue, we propose to build a 3D backbone network that learns rich 3D feature maps using sparse 3D CNN operations for 3D object detection in point clouds. The 3D backbone network can inherently learn 3D features from nearly raw data, without compressing the point cloud into multiple 2D images, and generates rich feature maps for object detection. The sparse 3D CNN takes full advantage of the sparsity of the 3D point cloud to accelerate computation and save memory, which makes the 3D backbone network feasible. Empirical experiments on the KITTI benchmark show that the proposed method achieves state-of-the-art performance for 3D object detection.
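The sketch below illustrates the general idea of learning 3D features directly from a voxelized point cloud. For brevity it uses dense `nn.Conv3d` layers as a stand-in for the sparse 3D convolutions the paper relies on, and the grid resolution and range are rough, KITTI-like assumptions rather than the paper's settings.

```python
import torch
from torch import nn

def voxelize(points, grid=(128, 128, 16),
             pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0)):
    """Scatter an (N, 3) point cloud into a binary occupancy grid.
    A sparse representation would keep only occupied voxels instead of a
    dense tensor; resolution and range here are illustrative."""
    lo, hi = torch.tensor(pc_range[:3]), torch.tensor(pc_range[3:])
    gs = torch.tensor(grid, dtype=torch.float)
    idx = ((points - lo) / (hi - lo) * gs).long()
    keep = ((idx >= 0) & (idx < torch.tensor(grid))).all(dim=1)
    idx = idx[keep]
    vol = torch.zeros(grid)
    vol[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return vol.unsqueeze(0).unsqueeze(0)            # (1, 1, X, Y, Z)

class Voxel3DBackbone(nn.Module):
    """Tiny dense stand-in for a sparse 3D backbone: a few strided 3D convs,
    then the vertical (Z) axis is folded into channels to produce a
    bird's-eye-view feature map for a detection head."""
    def __init__(self, out_channels=64):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, out_channels, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, voxels):                       # (B, 1, X, Y, Z)
        f = self.convs(voxels)                       # (B, C, X/8, Y/8, Z/8)
        B, C, X, Y, Z = f.shape
        return f.permute(0, 1, 4, 2, 3).reshape(B, C * Z, X, Y)  # BEV feature map
```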
This paper aims at developing a faster and more accurate solution to the amodal 3D object detection problem for indoor scenes. It is achieved through a novel neural network that takes a pair of RGB-D images as input and delivers oriented 3D bounding boxes as output. The network, named 3D-SSD, is composed of two parts: hierarchical feature fusion and multi-layer prediction. The hierarchical feature fusion combines appearance and geometric features from the RGB-D images, while the multi-layer prediction utilizes multi-scale features for object detection. As a result, the network exploits 2.5D representations in a synergetic way to improve accuracy and efficiency. The issue of varying object sizes is addressed by attaching a set of 3D anchor boxes of different sizes to every location of the prediction layers. In the final stage, category scores are produced for the 3D anchor boxes together with adjusted positions, sizes and orientations, and the final detections are obtained with non-maximum suppression. In the training phase, positive samples are identified with the aid of 2D ground truth to avoid noisy depth estimates from raw data, which guides the model toward better convergence. Experiments on the challenging SUN RGB-D dataset show that our algorithm outperforms the state-of-the-art Deep Sliding Shape by 10.2% in mAP while running 88x faster. Further experiments suggest our approach achieves comparable accuracy and is 386x faster than the state-of-the-art method on the NYUv2 dataset, even with a smaller input image size.
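To illustrate the anchor mechanism described above, the sketch below attaches a small set of 3D anchor boxes of different sizes to every spatial location of a prediction layer. The anchor sizes, the fixed depth placement and the (x, y, z, w, l, h, yaw) parameterization are assumptions for illustration, not the paper's exact configuration; in practice centers would be placed using depth and camera intrinsics rather than a constant depth.

```python
import torch

def make_3d_anchors(feat_h, feat_w, stride,
                    sizes=((0.6, 0.6, 1.2), (1.0, 1.0, 0.8), (2.0, 0.9, 0.9)),
                    depth=2.5):
    """Build one 3D anchor per (location, size): centers are laid out on the
    prediction layer's grid (scaled by its stride) at an assumed fixed depth,
    with yaw 0. Returns a (feat_h * feat_w * len(sizes), 7) tensor of
    (x, y, z, w, l, h, yaw) boxes."""
    ys, xs = torch.meshgrid(torch.arange(feat_h, dtype=torch.float),
                            torch.arange(feat_w, dtype=torch.float),
                            indexing="ij")
    centers = torch.stack([xs * stride, ys * stride,
                           torch.full_like(xs, depth)], dim=-1)    # (H, W, 3)
    centers = centers.reshape(-1, 1, 3).repeat(1, len(sizes), 1)   # (H*W, A, 3)
    whl = torch.tensor(sizes).unsqueeze(0).repeat(centers.shape[0], 1, 1)
    yaw = torch.zeros(centers.shape[0], len(sizes), 1)
    return torch.cat([centers, whl, yaw], dim=-1).reshape(-1, 7)

# e.g. anchors for a 38x38 prediction layer with stride 8:
# anchors = make_3d_anchors(38, 38, stride=8)   # -> (38*38*3, 7)
```

The prediction layers then output, for each anchor, category scores plus offsets that adjust its position, size and orientation, after which non-maximum suppression yields the final detections.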