
Safe autonomous driving technology depends heavily on accurate 3D object detection, since it provides input to safety-critical downstream tasks such as prediction and navigation. Recent advances in this field are made by developing the refinement stage of voxel-based region proposal networks to better strike a balance between accuracy and efficiency. A popular approach among state-of-the-art frameworks is to divide proposals, or Regions of Interest (ROIs), into grids and extract a feature for each grid location before synthesizing them into the ROI feature. While achieving impressive performance, such an approach involves a number of hand-crafted components (e.g., grid sampling, set abstraction) that require expert knowledge to be tuned correctly. This paper takes a more data-driven approach to ROI feature extraction using the attention mechanism. Specifically, points inside an ROI are positionally encoded to incorporate the ROI's geometry. The resulting position encodings and point features are transformed into the ROI feature via vector attention. Unlike the original multi-head attention, vector attention assigns different weights to different channels within a point feature, and can therefore capture a more sophisticated relation between pooled points and the ROI. Experiments on the KITTI \textit{validation} set show that our method achieves a competitive 84.84 AP for the Car class at Moderate difficulty while having the fewest parameters among closely related methods and attaining quasi-real-time inference at 15 FPS on an NVIDIA V100 GPU. The code will be released.
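
The abstract does not include an implementation, but the core idea, per-channel attention weights computed from relative position encodings, can be sketched as follows. This is a minimal PyTorch sketch with hypothetical module names and tensor shapes; it is not the paper's code.

```python
import torch
import torch.nn as nn

class VectorAttentionPool(nn.Module):
    """Sketch of vector attention for ROI feature extraction.

    Unlike multi-head attention, which shares one scalar weight per head
    across all channels, vector attention predicts a separate weight for
    every channel of every pooled point.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim)
        # MLP mapping (query - key + positional encoding) to
        # per-channel attention logits.
        self.weight_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, roi_token, point_feats, pos_enc):
        # roi_token:   (B, 1, C) learned query summarizing the ROI
        # point_feats: (B, N, C) features of points pooled inside the ROI
        # pos_enc:     (B, N, C) encoding of point coordinates relative
        #                        to the ROI's center, size, and heading
        q, _, _ = self.to_qkv(roi_token).chunk(3, dim=-1)
        _, k, v = self.to_qkv(point_feats).chunk(3, dim=-1)
        # Per-channel relation between the ROI query and each point.
        logits = self.weight_mlp(q - k + pos_enc)   # (B, N, C)
        weights = torch.softmax(logits, dim=1)      # normalize over points
        # Weighted sum over points yields one ROI feature vector.
        return (weights * (v + pos_enc)).sum(dim=1)  # (B, C)
```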

Related Content

The attention mechanism was first proposed in the visual image domain, but it really took off with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which applied an attention mechanism on top of an RNN model for image classification. Subsequently, Bahdanau et al., in "Neural Machine Translation by Jointly Learning to Align and Translate" [1], used an attention-like mechanism to perform translation and alignment jointly in machine translation; theirs is generally considered the first work to apply the attention mechanism to NLP. Similar attention-based RNN extensions were then applied to a variety of NLP tasks. More recently, how to use attention mechanisms within CNNs has also become a popular research topic. [Figure omitted: overall trend of attention research progress.]

In this paper, we propose M$^2$BEV, a unified framework that jointly performs 3D object detection and map segmentation in the Bird's-Eye View~(BEV) space with multi-camera image inputs. Unlike the majority of previous works, which process detection and segmentation separately, M$^2$BEV infers both tasks with a unified model and improves efficiency. M$^2$BEV efficiently transforms multi-view 2D image features into a 3D BEV feature in ego-car coordinates. Such a BEV representation is important because it enables different tasks to share a single encoder. Our framework further contains four important designs that benefit both accuracy and efficiency: (1) an efficient BEV encoder design that reduces the spatial dimension of the voxel feature map; (2) a dynamic box assignment strategy that uses learning-to-match to assign ground-truth 3D boxes to anchors; (3) a BEV centerness re-weighting that assigns larger weights to more distant predictions; and (4) large-scale 2D detection pre-training and auxiliary supervision. We show that these designs significantly benefit ill-posed camera-based 3D perception tasks, where depth information is missing. M$^2$BEV is memory efficient, allowing significantly higher resolution images as input with faster inference speed. Experiments on nuScenes show that M$^2$BEV achieves state-of-the-art results in both 3D object detection and BEV segmentation, with the best single model achieving 42.5 mAP and 57.0 mIoU on these two tasks, respectively.
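
As a rough illustration of the 2D-to-BEV transform described above, the sketch below projects voxel centers into each camera and gathers image features. All function names, shapes, and the nearest-neighbor sampling are illustrative assumptions, not the M$^2$BEV implementation.

```python
import numpy as np

def project_to_bev(img_feats, intrinsics, extrinsics, voxel_centers):
    """Sketch of a multi-view 2D-to-BEV lift (assumed interface).

    img_feats:     list of (C, H, W) feature maps, one per camera
    intrinsics:    list of (3, 3) camera matrices
    extrinsics:    list of (4, 4) ego-to-camera transforms
    voxel_centers: (V, 3) xyz coordinates of voxel centers in ego frame

    Returns a (V, C) array; pooling voxels sharing the same (x, y)
    column would produce the final BEV feature map.
    """
    V = voxel_centers.shape[0]
    C = img_feats[0].shape[0]
    bev = np.zeros((V, C), dtype=np.float32)
    hits = np.zeros(V, dtype=np.float32)

    homo = np.concatenate([voxel_centers, np.ones((V, 1))], axis=1)  # (V, 4)
    for feats, K, T in zip(img_feats, intrinsics, extrinsics):
        cam = (T @ homo.T).T[:, :3]                 # ego -> camera frame
        in_front = cam[:, 2] > 0.1                  # keep points ahead of camera
        pix = (K @ cam.T).T
        pix = pix[:, :2] / np.clip(pix[:, 2:3], 1e-6, None)  # perspective divide
        u, v = pix[:, 0].astype(int), pix[:, 1].astype(int)
        _, H, W = feats.shape
        valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        # Nearest-neighbor sampling; bilinear would be used in practice.
        bev[valid] += feats[:, v[valid], u[valid]].T
        hits[valid] += 1
    return bev / np.clip(hits[:, None], 1, None)    # average over cameras
```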

In this work we present point-level region contrast, a self-supervised pre-training approach for the task of object detection. This approach is motivated by the two key factors in detection: localization and recognition. While accurate localization favors models that operate at the pixel- or point-level, correct recognition typically relies on a more holistic, region-level view of objects. Incorporating this perspective in pre-training, our approach performs contrastive learning by directly sampling individual point pairs from different regions. Compared to an aggregated representation per region, our approach is more robust to the change in input region quality, and further enables us to implicitly improve initial region assignments via online knowledge distillation during training. Both advantages are important when dealing with imperfect regions encountered in the unsupervised setting. Experiments show point-level region contrast improves on state-of-the-art pre-training methods for object detection and segmentation across multiple tasks and datasets, and we provide extensive ablation studies and visualizations to aid understanding. Code will be made available.
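
A minimal sketch of what a point-level region contrastive loss could look like, assuming pre-extracted, L2-normalized point features and region assignments (our own interface, not the paper's):

```python
import torch
import torch.nn.functional as F

def point_region_contrast(feats_q, feats_k, region_ids, tau=0.2):
    """Sketch of a point-level region contrastive loss.

    feats_q, feats_k: (N, D) point features from two augmented views,
                      already L2-normalized, at matching locations
    region_ids:       (N,) region assignment of each point; points from
                      the same region are treated as positives

    Instead of contrasting one pooled vector per region, every point
    pair is contrasted directly, which is more robust to noisy regions.
    """
    sim = feats_q @ feats_k.t() / tau                  # (N, N) pairwise logits
    pos = region_ids[:, None] == region_ids[None, :]   # positive-pair mask
    log_prob = F.log_softmax(sim, dim=1)
    # Average log-likelihood over all positives of each anchor point.
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```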

There are two mainstream paradigms for object detection: top-down and bottom-up. The state-of-the-art approaches mostly belong to the first category. In this paper, we demonstrate that bottom-up approaches are as competitive as top-down ones and enjoy higher recall. Our approach, named CenterNet, detects each object as a triplet of keypoints (top-left corner, bottom-right corner, and center keypoint). We first group the corners by some designed cues and further confirm the objects by the center keypoints. The corner keypoints equip the approach with the ability to detect objects of various scales and shapes, and the center keypoint avoids the confusion brought by a large number of false-positive proposals. Our approach is a kind of anchor-free detector because it does not need to define explicit anchor boxes. We adapt our approach to backbones with different structures, i.e., 'hourglass'-like networks and 'pyramid'-like networks, which detect objects on a single-resolution feature map and multi-resolution feature maps, respectively. On the MS-COCO dataset, CenterNet with Res2Net-101 and Swin-Transformer achieves APs of 53.7% and 57.1%, respectively, outperforming all existing bottom-up detectors and achieving state-of-the-art performance. We also design a real-time CenterNet, which achieves a good trade-off between accuracy and speed, with an AP of 43.6% at 30.5 FPS. Code is available at https://github.com/Duankaiwen/PyCenterNet.
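
The center-keypoint confirmation step can be illustrated with a short sketch; the central-region fraction `n` and the exact grouping cues are simplified assumptions here:

```python
def confirm_by_center(tl, br, center_kps, n=3):
    """Sketch of CenterNet-style proposal confirmation.

    tl, br:     (x, y) of a grouped top-left / bottom-right corner pair
    center_kps: list of detected center keypoints as (x, y, score)
    n:          the central region covers 1/n of the box per axis

    A corner pair is kept only if some center keypoint falls inside the
    central region of the box it spans.
    """
    w, h = br[0] - tl[0], br[1] - tl[1]
    cx, cy = tl[0] + w / 2, tl[1] + h / 2
    # Central region: a box of size (w/n, h/n) around the box center.
    x0, x1 = cx - w / (2 * n), cx + w / (2 * n)
    y0, y1 = cy - h / (2 * n), cy + h / (2 * n)
    return any(x0 <= x <= x1 and y0 <= y <= y1 for x, y, _ in center_kps)
```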

With the field of rigid-body robotics having matured over the last fifty years, routing, planning, and manipulation of deformable objects have recently emerged as a less explored research area in fields ranging from surgical robotics to industrial assembly and construction. Routing approaches for deformable objects that rely on learned implicit spatial representations (e.g., Learning-from-Demonstration methods) are vulnerable to changes in the environment and the specific setup. On the other hand, algorithms that entirely separate the spatial representation of the deformable object from routing and manipulation, often using a representation approach independent of planning, result in slow planning in high-dimensional spaces. This paper proposes a novel approach to routing deformable one-dimensional objects (e.g., wires, cables, ropes, sutures, threads). The approach utilizes a compact representation of the object, allowing efficient and fast online routing. The spatial representation is based on a geometric decomposition of the space into convex subspaces, resulting in a discrete coding of the deformable object's configuration as a sequence. With such a configuration, the routing problem can be solved using a fast dynamic programming sequence matching method that calculates the next routing move. The proposed method couples routing with an efficient configuration representation for improved planning time. Our simulation and real-world experiments show the method correctly computing the next manipulation action in sub-millisecond time and accomplishing various routing and manipulation tasks.
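
As an illustration of the dynamic programming sequence matching idea, the sketch below runs an edit-distance DP between the current and goal cell sequences and names the first edit as the next move; the action encoding is hypothetical:

```python
def next_routing_move(current, goal):
    """Sketch of DP sequence matching for deformable-object routing.

    The object's configuration is coded as a sequence of convex-subspace
    labels. An edit-distance DP between the current and goal sequences
    yields the cheapest first edit, which maps to the next manipulation.
    """
    m, n = len(current), len(goal)
    # dp[i][j] = cost of transforming current[i:] into goal[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m:
                dp[i][j] = n - j            # insert remaining goal cells
            elif j == n:
                dp[i][j] = m - i            # remove remaining current cells
            elif current[i] == goal[j]:
                dp[i][j] = dp[i + 1][j + 1]
            else:
                dp[i][j] = 1 + min(dp[i + 1][j],      # remove from current
                                   dp[i][j + 1],      # insert goal cell
                                   dp[i + 1][j + 1])  # substitute
    # Walk one step of the optimal alignment to name the next action.
    i = j = 0
    while i < m and j < n and current[i] == goal[j]:
        i += 1
        j += 1
    if i == m and j == n:
        return "done"
    if j < n and dp[i][j] == 1 + dp[i][j + 1]:
        return f"route segment into cell {goal[j]}"
    if i < m and dp[i][j] == 1 + dp[i + 1][j]:
        return f"pull segment out of cell {current[i]}"
    return f"move segment from cell {current[i]} to cell {goal[j]}"
```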

Accurate extraction of the Region of Interest is critical for successful ocular region-based biometrics. In this direction, we propose a new context-based segmentation approach, entitled Ocular Region Context Network (ORCNet), introducing a specific loss function, i.e., the Punish Context Loss (PC-Loss). The PC-Loss punishes the segmentation losses of a network using a percentage-difference value between the ground-truth and segmented masks. We obtain the percentage difference by taking into account Biederman's semantic relationship concepts, in which three contexts (semantic, spatial, and scale) are used to evaluate the relationships of the objects in an image. Our proposal achieved promising results in the evaluated scenarios: iris, sclera, and ALL (iris + sclera) segmentation, outperforming the baseline techniques from the literature. ORCNet with ResNet-152 outperforms the best baseline (EncNet with ResNet-152) on average by 2.27%, 28.26%, and 6.43% in terms of F-Score, Error Rate, and Intersection over Union, respectively. We also provide (for research purposes) 3,191 manually labeled masks for the MICHE-I database as another contribution of our work.
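
A hedged sketch of how a percentage-difference penalty could scale a segmentation loss; the context terms and weighting scheme here are illustrative guesses, not the exact PC-Loss formulation:

```python
import torch

def punish_context_loss(base_loss, pred_mask, gt_mask, contexts):
    """Sketch of the Punish Context Loss idea (hypothetical weighting).

    base_loss:  scalar segmentation loss (e.g., cross-entropy)
    pred_mask:  (H, W) binarized predicted mask
    gt_mask:    (H, W) ground-truth mask
    contexts:   dict of percentage differences per context, e.g.
                {"semantic": 0.05, "spatial": 0.12}, each measuring how
                far the segmented mask deviates from the ground truth
                under that relationship

    The base loss is punished (scaled up) in proportion to the average
    contextual disagreement between prediction and ground truth.
    """
    # Example of one simple context term: relative area (scale) mismatch.
    pred_area = pred_mask.float().sum()
    gt_area = gt_mask.float().sum().clamp(min=1)
    contexts = dict(contexts, scale=(pred_area - gt_area).abs() / gt_area)

    penalty = sum(contexts.values()) / len(contexts)
    return base_loss * (1.0 + penalty)
```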

Object detection with transformers (DETR) reaches competitive performance with Faster R-CNN via a transformer encoder-decoder architecture. Inspired by the great success of pre-training transformers in natural language processing, we propose a pretext task named random query patch detection to unsupervisedly pre-train DETR (UP-DETR) for object detection. Specifically, we randomly crop patches from the given image and then feed them as queries to the decoder. The model is pre-trained to detect these query patches in the original image. During pre-training, we address two critical issues: multi-task learning and multi-query localization. (1) To trade off the multi-task learning of classification and localization in the pretext task, we freeze the CNN backbone and propose a patch feature reconstruction branch that is jointly optimized with patch detection. (2) To perform multi-query localization, we first pre-train UP-DETR with single-query patches and then extend it to multi-query patches with object query shuffle and an attention mask. In our experiments, UP-DETR significantly boosts the performance of DETR with faster convergence and higher precision on the PASCAL VOC and COCO datasets. The code will be available soon.
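
The pretext task itself is easy to sketch: crop random patches, embed them with the frozen backbone, and ask the decoder to localize them. The helper below assumes hypothetical shapes and an abstract `patch_embed` callable:

```python
import torch

def make_patch_queries(image, patch_embed, num_queries=10, patch_size=64):
    """Sketch of the random-query-patch pretext task (assumed shapes).

    image:       (3, H, W) input image tensor
    patch_embed: frozen CNN + pooling that maps a crop to a (D,) vector
    Returns query embeddings and the boxes the decoder must recover.
    """
    _, H, W = image.shape
    queries, boxes = [], []
    for _ in range(num_queries):
        # Randomly crop a patch; its feature becomes a decoder query and
        # its normalized center-size box becomes the regression target.
        x = torch.randint(0, W - patch_size, (1,)).item()
        y = torch.randint(0, H - patch_size, (1,)).item()
        crop = image[:, y:y + patch_size, x:x + patch_size]
        queries.append(patch_embed(crop))
        boxes.append(torch.tensor([(x + patch_size / 2) / W,
                                   (y + patch_size / 2) / H,
                                   patch_size / W, patch_size / H]))
    return torch.stack(queries), torch.stack(boxes)  # (Q, D), (Q, 4)
```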

In this paper, we study few-shot multi-label classification for user intent detection. For multi-label intent detection, state-of-the-art work estimates label-instance relevance scores and uses a threshold to select multiple associated intent labels. To determine appropriate thresholds with only a few examples, we first learn universal thresholding experience on data-rich domains, and then adapt the thresholds to specific few-shot domains with a calibration based on nonparametric learning. For better calculation of the label-instance relevance scores, we introduce label name embeddings as anchor points in the representation space, which refine the representations of different classes to be well separated from each other. Experiments on two datasets show that the proposed model significantly outperforms strong baselines in both one-shot and five-shot settings.
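
One plausible reading of the nonparametric threshold calibration is kernel regression over support-set thresholds, sketched below with assumed inputs (this is our simplification, not the paper's exact estimator):

```python
import numpy as np

def calibrate_threshold(query_emb, support_embs, support_thresholds, k=3):
    """Sketch of nonparametric threshold calibration (assumed interface).

    query_emb:          (D,) representation of the query utterance
    support_embs:       (M, D) representations from the few-shot domain
    support_thresholds: (M,) thresholds separating relevant from
                        irrelevant labels on those support examples

    A universal threshold learned on data-rich domains could then be
    blended with this kernel-regressed, domain-specific estimate.
    """
    # Cosine similarity between the query and every support example.
    sims = support_embs @ query_emb
    sims /= np.linalg.norm(support_embs, axis=1) * np.linalg.norm(query_emb)
    top = np.argsort(sims)[-k:]                    # k nearest supports
    weights = np.exp(sims[top])
    # Kernel regression: similarity-weighted average of their thresholds.
    return float(np.sum(weights * support_thresholds[top]) / np.sum(weights))
```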

Applying artificial intelligence techniques to medical imaging is one of the most promising areas in medicine. However, most of the recent success in this area relies heavily on large amounts of carefully annotated data, whereas annotating medical images is a costly process. In this paper, we propose a novel method, called FocalMix, which, to the best of our knowledge, is the first to leverage recent advances in semi-supervised learning (SSL) for 3D medical image detection. We conducted extensive experiments on two widely used datasets for lung nodule detection, LUNA16 and NLST. Results show that our proposed SSL method can achieve a substantial improvement of up to 17.3% over state-of-the-art supervised learning approaches with 400 unlabeled CT scans.
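
SSL via MixUp-style augmentation requires a loss that accepts soft targets; a minimal sketch of such a soft-target focal loss, an assumed formulation consistent with the description above, is:

```python
import torch

def soft_focal_loss(logits, soft_targets, gamma=2.0):
    """Sketch of a focal loss generalized to MixUp-style soft targets.

    MixUp-based SSL mixes two samples and their (pseudo-)labels:
        x = lam * x1 + (1 - lam) * x2
        t = lam * t1 + (1 - lam) * t2
    which makes the targets soft, so the standard focal loss must be
    rewritten to accept targets in [0, 1] (assumed formulation).
    """
    p = torch.sigmoid(logits)
    # Focal weight: how far the prediction is from the soft target.
    focal_weight = (soft_targets - p).abs() ** gamma
    bce = -(soft_targets * torch.log(p.clamp(min=1e-6)) +
            (1 - soft_targets) * torch.log((1 - p).clamp(min=1e-6)))
    return (focal_weight * bce).mean()
```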

Benefiting from the rapid development of deep learning techniques, salient object detection has achieved remarkable progress recently. However, two major challenges still hinder its application on embedded devices: low-resolution output and heavy model weight. To this end, this paper presents an accurate yet compact deep network for efficient salient object detection. More specifically, given a coarse saliency prediction in the deepest layer, we first employ residual learning to learn side-output residual features for saliency refinement, which can be achieved with very limited convolutional parameters while keeping accuracy. Secondly, we further propose reverse attention to guide such side-output residual learning in a top-down manner. By erasing the currently predicted salient regions from side-output features, the network can eventually explore the missing object parts and details, resulting in high resolution and accuracy. Experiments on six benchmark datasets demonstrate that the proposed approach compares favorably against state-of-the-art methods, with advantages in terms of simplicity, efficiency (45 FPS), and model size (81 MB).
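
The reverse attention step admits a very small sketch: upsample the deeper prediction, invert it, and gate the side-output features. Shapes here are assumed:

```python
import torch
import torch.nn.functional as F

def reverse_attention(side_feat, deeper_pred):
    """Sketch of reverse attention guiding side-output residual learning.

    side_feat:   (B, C, H, W) side-output features at the current scale
    deeper_pred: (B, 1, h, w) saliency prediction from the deeper layer

    Erasing already-predicted salient regions forces the residual branch
    to focus on missing object parts and fine details.
    """
    pred = F.interpolate(deeper_pred, size=side_feat.shape[2:],
                         mode='bilinear', align_corners=False)
    # Reverse attention: 1 - sigmoid(prediction) highlights what is
    # NOT yet predicted as salient.
    attn = 1.0 - torch.sigmoid(pred)
    return side_feat * attn          # broadcast over channels
```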

Retrieving object instances among cluttered scenes efficiently requires compact yet comprehensive regional image representations. Intuitively, object semantics can help build the index that focuses on the most relevant regions. However, due to the lack of bounding-box datasets for objects of interest among retrieval benchmarks, most recent work on regional representations has focused on either uniform or class-agnostic region selection. In this paper, we first fill the void by providing a new dataset of landmark bounding boxes, based on the Google Landmarks dataset, that includes $94k$ images with manually curated boxes from $15k$ unique landmarks. Then, we demonstrate how a trained landmark detector, using our new dataset, can be leveraged to index image regions and improve retrieval accuracy while being much more efficient than existing regional methods. In addition, we further introduce a novel regional aggregated selective match kernel (R-ASMK) to effectively combine information from detected regions into an improved holistic image representation. R-ASMK boosts image retrieval accuracy substantially at no additional memory cost, while even outperforming systems that index image regions independently. Our complete image retrieval system improves upon the previous state-of-the-art by significant margins on the Revisited Oxford and Paris datasets. Code and data will be released.
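
At a very high level, selective aggregation can be sketched as summing region descriptors and comparing them through a selectivity function; this heavily simplified sketch omits the visual-word quantization of the real ASMK family:

```python
import numpy as np

def aggregate_regions(region_descs, alpha=3.0, tau=0.0):
    """Sketch of ASMK-style selective aggregation over detected regions.

    region_descs: (R, D) L2-normalized descriptors of detected regions
    Returns one holistic descriptor plus a selectivity function
    sign(x) * |x|^alpha (with threshold tau) that down-weights weak,
    bursty matches when two such descriptors are compared.
    """
    agg = region_descs.sum(axis=0)
    agg /= np.linalg.norm(agg) + 1e-12          # re-normalize the sum

    def selective_similarity(a, b):
        x = float(a @ b)
        return np.sign(x) * (abs(x) ** alpha) if abs(x) > tau else 0.0

    return agg, selective_similarity
```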
