欧美综合一本热第九页_久久国产乱子伦精品噜噜_国产精品亚洲午夜一区二区_国产激情一区二区三区HD_亚洲综合欧洲国产精品区_中国农村黄色网站AV_一区二区视频在线观看免费的

The recently proposed MaskFormer gives a refreshed perspective on the task of semantic segmentation: it shifts from the popular pixel-level classification paradigm to a mask-level classification method. In essence, it generates paired probabilities and masks corresponding to category segments and combines them during inference for the segmentation maps. In our study, we find that per-mask classification decoder on top of a single-scale feature is not effective enough to extract reliable probability or mask. To mine for rich semantic information across the feature pyramid, we propose a transformer-based Pyramid Fusion Transformer (PFT) for per-mask approach semantic segmentation with multi-scale features. The proposed transformer decoder performs cross-attention between the learnable queries and each spatial feature from the feature pyramid in parallel and uses cross-scale inter-query attention to exchange complimentary information. We achieve competitive performance on three widely used semantic segmentation datasets. In particular, on ADE20K validation set, our result with Swin-B backbone surpasses that of MaskFormer's with a much larger Swin-L backbone in both single-scale and multi-scale inference, achieving 54.1 mIoU and 55.7 mIoU respectively. Using a Swin-L backbone, we achieve single-scale 56.1 mIoU and multi-scale 57.4 mIoU, obtaining state-of-the-art performance on the dataset. Extensive experiments on three widely used semantic segmentation datasets verify the effectiveness of our proposed method.

相關內容

Pyramid

關注 20

Pyramid is a small, fast, down-to-earth Python web application development framework.

多重集 · Performer · 圖像分割 · 模型評估 · Learning ·

2023 年 7 月 19 日

Two Approaches to Supervised Image Segmentation

Alexandre Benatti,Luciano da F. Costa

from arxiv, 37 pages, 18 figures

Though performed almost effortlessly by humans, segmenting 2D gray-scale or color images in terms of their constituent regions of interest (e.g.~background, objects or portions of objects) constitutes one of the greatest challenges in science and technology as a consequence of the involved dimensionality reduction(3D to 2D), noise, reflections, shades, and occlusions, among many other possible effects. While a large number of interesting approaches have been respectively suggested along the last decades, it was mainly with the more recent development of deep learning that more effective and general solutions have been obtained, currently constituting the basic comparison reference for this type of operation. Also developed recently, a multiset-based methodology has been described that is capable of encouraging performance that combines spatial accuracy, stability, and robustness while requiring minimal computational resources (hardware and/or training and recognition time). The interesting features of the latter methodology mostly follow from the enhanced selectivity and sensitivity, as well as good robustness to data perturbations and outliers, allowed by the coincidence similarity index on which the multiset approach to supervised image segmentation is based. After describing the deep learning and multiset approaches, the present work develops two comparison experiments between them which are primarily aimed at illustrating their respective main interesting features when applied to the adopted specific type of data and parameter configurations. While the deep learning approach confirmed its potential for performing image segmentation, the alternative multiset methodology allowed for encouraging accuracy while requiring little computational resources.

prototype · 優化器 · 端到端 · Extensibility · state-of-the-art ·

2023 年 7 月 19 日

Boundary-Refined Prototype Generation: A General End-to-End Paradigm for Semi-Supervised Semantic Segmentation

Junhao Dong,Zhu Meng,Delong Liu,Zhicheng Zhao,Fei Su

from arxiv, 53 pages, 7 figures

Prototype-based classification is a classical method in machine learning, and recently it has achieved remarkable success in semi-supervised semantic segmentation. However, the current approach isolates the prototype initialization process from the main training framework, which appears to be unnecessary. Furthermore, while the direct use of K-Means algorithm for prototype generation has considered rich intra-class variance, it may not be the optimal solution for the classification task. To tackle these problems, we propose a novel boundary-refined prototype generation (BRPG) method, which is incorporated into the whole training framework. Specifically, our approach samples and clusters high- and low-confidence features separately based on a confidence threshold, aiming to generate prototypes closer to the class boundaries. Moreover, an adaptive prototype optimization strategy is introduced to make prototype augmentation for categories with scattered feature distributions. Extensive experiments on the PASCAL VOC 2012 and Cityscapes datasets demonstrate the superiority and scalability of the proposed method, outperforming the current state-of-the-art approaches. The code is available at xxxxxxxxxxxxxx.

對數幾率 · 監督 · contrastive · prototype · 未標記 ·

2023 年 7 月 19 日

Space Engage: Collaborative Space Supervision for Contrastive-based Semi-Supervised Semantic Segmentation

Changqi Wang,Haoyu Xie,Yuhui Yuan,Chong Fu,Xiangyu Yue

from arxiv, Accepted to ICCV 2023

Semi-Supervised Semantic Segmentation (S4) aims to train a segmentation model with limited labeled images and a substantial volume of unlabeled images. To improve the robustness of representations, powerful methods introduce a pixel-wise contrastive learning approach in latent space (i.e., representation space) that aggregates the representations to their prototypes in a fully supervised manner. However, previous contrastive-based S4 methods merely rely on the supervision from the model's output (logits) in logit space during unlabeled training. In contrast, we utilize the outputs in both logit space and representation space to obtain supervision in a collaborative way. The supervision from two spaces plays two roles: 1) reduces the risk of over-fitting to incorrect semantic information in logits with the help of representations; 2) enhances the knowledge exchange between the two spaces. Furthermore, unlike previous approaches, we use the similarity between representations and prototypes as a new indicator to tilt training those under-performing representations and achieve a more efficient contrastive learning process. Results on two public benchmarks demonstrate the competitive performance of our method compared with state-of-the-art methods.

MoDELS · Performer · 點云 · 3D · Learning ·

2023 年 7 月 18 日

MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds

Jiahui Liu,Chirui Chang,Jianhui Liu,Xiaoyang Wu,Lan Ma,Xiaojuan Qi

3D semantic segmentation on multi-scan large-scale point clouds plays an important role in autonomous systems. Unlike the single-scan-based semantic segmentation task, this task requires distinguishing the motion states of points in addition to their semantic categories. However, methods designed for single-scan-based segmentation tasks perform poorly on the multi-scan task due to the lacking of an effective way to integrate temporal information. We propose MarS3D, a plug-and-play motion-aware module for semantic segmentation on multi-scan 3D point clouds. This module can be flexibly combined with single-scan models to allow them to have multi-scan perception abilities. The model encompasses two key designs: the Cross-Frame Feature Embedding module for enriching representation learning and the Motion-Aware Feature Learning module for enhancing motion awareness. Extensive experiments show that MarS3D can improve the performance of the baseline model by a large margin. The code is available at //github.com/CVMI-Lab/MarS3D.

3D · 知識 (knowledge) · 語義相似度 · 相似度 · 蒸餾 ·

2023 年 7 月 18 日

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding

Zehan Wang,Haifeng Huang,Yang Zhao,Linjun Li,Xize Cheng,Yichen Zhu,Aoxiong Yin,Zhou Zhao

from arxiv, ICCV2023

3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query. Although many approaches have been proposed and achieved impressive performance, they all require dense object-sentence pair annotations in 3D point clouds, which are both time-consuming and expensive. To address the problem that fine-grained annotated data is difficult to obtain, we propose to leverage weakly supervised annotations to learn the 3D visual grounding model, i.e., only coarse scene-sentence correspondences are used to learn object-sentence links. To accomplish this, we design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner. Specifically, we first extract object proposals and coarsely select the top-K candidates based on feature and class similarity matrices. Next, we reconstruct the masked keywords of the sentence using each candidate one by one, and the reconstructed accuracy finely reflects the semantic similarity of each candidate to the query. Additionally, we distill the coarse-to-fine semantic matching knowledge into a typical two-stage 3D visual grounding model, which reduces inference costs and improves performance by taking full advantage of the well-studied structure of the existing architectures. We conduct extensive experiments on ScanRefer, Nr3D, and Sr3D, which demonstrate the effectiveness of our proposed method.

GROUP · UNet · Hadamard積 · INFORMS · Performer ·

2023 年 7 月 17 日

EGE-UNet: an Efficient Group Enhanced UNet for skin lesion segmentation

Jiacheng Ruan,Mingye Xie,Jingsheng Gao,Ting Liu,Yuzhuo Fu

from arxiv, 10 pages, 4 figures, 2 tables. This paper has been early accepted by MICCAI 2023 and has received the MICCAI Student-Author Registration (STAR) Award

Transformer and its variants have been widely used for medical image segmentation. However, the large number of parameter and computational load of these models make them unsuitable for mobile health applications. To address this issue, we propose a more efficient approach, the Efficient Group Enhanced UNet (EGE-UNet). We incorporate a Group multi-axis Hadamard Product Attention module (GHPA) and a Group Aggregation Bridge module (GAB) in a lightweight manner. The GHPA groups input features and performs Hadamard Product Attention mechanism (HPA) on different axes to extract pathological information from diverse perspectives. The GAB effectively fuses multi-scale information by grouping low-level features, high-level features, and a mask generated by the decoder at each stage. Comprehensive experiments on the ISIC2017 and ISIC2018 datasets demonstrate that EGE-UNet outperforms existing state-of-the-art methods. In short, compared to the TransFuse, our model achieves superior segmentation performance while reducing parameter and computation costs by 494x and 160x, respectively. Moreover, to our best knowledge, this is the first model with a parameter count limited to just 50KB. Our code is available at //github.com/JCruan519/EGE-UNet.

contrastive · 學成 · 對比學習 · 目標檢測 · 優化器 ·

2021 年 4 月 4 日

Dense Contrastive Learning for Self-Supervised Visual Pre-Training

Xinlong Wang,Rufeng Zhang,Chunhua Shen,Tao Kong,Lei Li

from arxiv, 11 pages. Accepted to IEEE/CVF Conf. Comp. Vision Pattern Recognition (CVPR) 2021; Oral paper

To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only <1% slower), but demonstrates consistently superior performance when transferring to downstream dense prediction tasks including object detection, semantic segmentation and instance segmentation; and outperforms the state-of-the-art methods by a large margin. Specifically, over the strong MoCo-v2 baseline, our method achieves significant improvements of 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation and 1.8% mIoU on Cityscapes semantic segmentation. Code is available at: //git.io/AdelaiDet

示例 · 端到端 · 變換 · MoDELS · 可理解性 ·

2021 年 3 月 24 日

End-to-End Video Instance Segmentation with Transformers

Yuqing Wang,Zhaoliang Xu,Xinlong Wang,Chunhua Shen,Baoshan Cheng,Hao Shen,Huaxia Xia

from arxiv, CVPR2021 Oral

Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video. Recent methods typically develop sophisticated pipelines to tackle this task. Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem. Given a video clip consisting of multiple image frames as input, VisTR outputs the sequence of masks for each instance in the video in order directly. At the core is a new, effective instance sequence matching and segmentation strategy, which supervises and segments instances at the sequence level as a whole. VisTR frames the instance segmentation and tracking in the same perspective of similarity learning, thus considerably simplifying the overall pipeline and is significantly different from existing approaches. Without bells and whistles, VisTR achieves the highest speed among all existing VIS models, and achieves the best result among methods using single model on the YouTube-VIS dataset. For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy. We hope that VisTR can motivate future research for more video understanding tasks.

3D · 圖像分割 · Neural Networks · Networking · state-of-the-art ·

2018 年 8 月 2 日

A 3D Coarse-to-Fine Framework for Volumetric Medical Image Segmentation

Zhuotun Zhu,Yingda Xia,Wei Shen,Elliot K. Fishman,Alan L. Yuille

from arxiv, 9 pages, 4 figures, Accepted to 3DV

In this paper, we adopt 3D Convolutional Neural Networks to segment volumetric medical images. Although deep neural networks have been proven to be very effective on many 2D vision tasks, it is still challenging to apply them to 3D tasks due to the limited amount of annotated 3D data and limited computational resources. We propose a novel 3D-based coarse-to-fine framework to effectively and efficiently tackle these challenges. The proposed 3D-based framework outperforms the 2D counterpart to a large margin since it can leverage the rich spatial infor- mation along all three axes. We conduct experiments on two datasets which include healthy and pathological pancreases respectively, and achieve the current state-of-the-art in terms of Dice-S{\o}rensen Coefficient (DSC). On the NIH pancreas segmentation dataset, we outperform the previous best by an average of over 2%, and the worst case is improved by 7% to reach almost 70%, which indicates the reliability of our framework in clinical applications.

FCN · 全卷積網絡 · 3D · 級聯 · MoDELS ·

2018 年 3 月 20 日

An application of cascaded 3D fully convolutional networks for medical image segmentation

Holger R. Roth,Hirohisa Oda,Xiangrong Zhou,Natsuki Shimizu,Ying Yang,Yuichiro Hayashi,Masahiro Oda,Michitaka Fujiwara,Kazunari Misawa,Kensaku Mori

from arxiv, Preprint accepted for publication in Computerized Medical Imaging and Graphics. Substantial extension of arXiv:1704.06382; Corrected references to figure numbers in this version

Recent advances in 3D fully convolutional networks (FCN) have made it feasible to produce dense voxel-wise predictions of volumetric images. In this work, we show that a multi-class 3D FCN trained on manually labeled CT scans of several anatomical structures (ranging from the large organs to thin vessels) can achieve competitive segmentation results, while avoiding the need for handcrafting features or training class-specific models. To this end, we propose a two-stage, coarse-to-fine approach that will first use a 3D FCN to roughly define a candidate region, which will then be used as input to a second 3D FCN. This reduces the number of voxels the second FCN has to classify to ~10% and allows it to focus on more detailed segmentation of the organs and vessels. We utilize training and validation sets consisting of 331 clinical CT images and test our models on a completely unseen data collection acquired at a different hospital that includes 150 CT scans, targeting three anatomical organs (liver, spleen, and pancreas). In challenging organs such as the pancreas, our cascaded approach improves the mean Dice score from 68.5 to 82.2%, achieving the highest reported average score on this dataset. We compare with a 2D FCN method on a separate dataset of 240 CT scans with 18 classes and achieve a significantly higher performance in small organs and vessels. Furthermore, we explore fine-tuning our models to different datasets. Our experiments illustrate the promise and robustness of current 3D FCN based semantic segmentation of medical images, achieving state-of-the-art results. Our code and trained models are available for download: //github.com/holgerroth/3Dunet_abdomen_cascade.