
Recently, vision transformers have performed well in various computer vision tasks, including voxel 3D reconstruction. However, the windows of the vision transformer are fixed at a single scale and exchange no information with one another, which limits the accuracy of voxel 3D reconstruction. We therefore propose a shifted-window attention network for voxel 3D reconstruction. To the best of our knowledge, this is the first work to apply shifted window attention to voxel 3D reconstruction. Experimental results on ShapeNet verify that our method achieves state-of-the-art (SOTA) accuracy in single-view reconstruction.
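The abstract does not spell out the attention mechanism, so the sketch below only illustrates the generic Swin-style idea it refers to: partition the feature map into local windows, attend within each window, and cyclically shift the windows between blocks so that neighbouring windows exchange information. The tensor layout, window size, and head count are illustrative assumptions, and the attention mask normally used with shifted windows is omitted for brevity.

```python
# Minimal sketch of shifted-window self-attention on a 2D feature map
# (Swin-style); the paper's actual 3D-reconstruction network is not shown.
import torch
import torch.nn as nn

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows * B, ws * ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    # inverse of window_partition
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class ShiftedWindowAttention(nn.Module):
    def __init__(self, dim, ws=7, heads=4, shift=0):
        super().__init__()
        self.ws, self.shift = ws, shift
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, H, W, C)
        B, H, W, C = x.shape
        if self.shift:                          # cyclic shift links adjacent windows
            x = torch.roll(x, (-self.shift, -self.shift), dims=(1, 2))
        win = window_partition(x, self.ws)      # attend within each local window
        out, _ = self.attn(win, win, win)
        x = window_reverse(out, self.ws, H, W)
        if self.shift:                          # undo the shift
            x = torch.roll(x, (self.shift, self.shift), dims=(1, 2))
        return x

blk = ShiftedWindowAttention(dim=32, ws=7, heads=4, shift=3)
print(blk(torch.randn(1, 14, 14, 32)).shape)    # (1, 14, 14, 32)
```

In practice, blocks with shift=0 and shift=ws//2 are alternated so that every location eventually attends across window boundaries.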

Related Content

In computer vision, 3D reconstruction refers to the process of recovering 3D information from single-view or multi-view images. Because the information in a single view is incomplete, single-view 3D reconstruction has to draw on prior knowledge. Multi-view 3D reconstruction (analogous to human binocular vision) is comparatively easier: the cameras are first calibrated, i.e., the relationship between each camera's image coordinate system and the world coordinate system is computed, and 3D information is then reconstructed from the information in multiple 2D images. Object 3D reconstruction is a shared scientific problem and core technology in computer-aided geometric design (CAGD), computer graphics (CG), computer animation, computer vision, medical image processing, scientific computing, virtual reality, digital media creation, and related fields. There are two main ways to produce a 3D representation of an object in a computer. One is to build a human-controlled 3D geometric model interactively with geometric modeling software; the other is to capture the geometric shape of a real object by some acquisition method. The former is technologically mature and supported by a range of software such as 3DMAX, Maya, AutoCAD, and UG, which generally represent geometry with curves and surfaces that have explicit mathematical expressions. The latter is usually called the 3D reconstruction process: the mathematical procedures and computing techniques for recovering an object's 3D information (shape, etc.) from 2D projections, including steps such as data acquisition, preprocessing, point-cloud registration, and feature analysis.
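As a concrete illustration of the calibrated multi-view case described above, the sketch below triangulates a single 3D point from two views with the standard direct linear transform. The intrinsics, camera poses, and pixel coordinates are made-up example values, not the output of any real calibration.

```python
# Minimal sketch: triangulate one 3D point from two calibrated views (DLT).
# The projection matrices and pixel coordinates are illustrative placeholders.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """P1, P2: 3x4 camera projection matrices; x1, x2: (u, v) pixel coordinates."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)          # least-squares solution of A X = 0
    X = Vt[-1]
    return X[:3] / X[3]                  # dehomogenize

K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])    # intrinsics
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                   # first camera at the origin
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.], [0.]])])     # second camera, shifted baseline

X_true = np.array([0.1, -0.05, 2.0, 1.0])
x1 = P1 @ X_true; x1 = x1[:2] / x1[2]    # project the point into both views
x2 = P2 @ X_true; x2 = x2[:2] / x2[2]
print(triangulate(P1, P2, x1, x2))       # recovers approximately (0.1, -0.05, 2.0)
```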

As with many machine learning problems, progress in image generation hinges on good evaluation metrics. One of the most popular is the Fréchet Inception Distance (FID). FID estimates the distance between a distribution of Inception-v3 features of real images and that of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, does not reflect the gradual improvement of iterative text-to-image models, does not capture distortion levels, and produces inconsistent results when the sample size varies. We also propose an alternative metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with a Gaussian RBF kernel. It is an unbiased estimator that makes no assumptions about the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality.
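The abstract does not give CMMD's exact definition, so the sketch below shows only the general construction it is described as building on: an unbiased estimate of squared MMD with a Gaussian RBF kernel between two sets of embeddings. The bandwidth value and the random stand-in features (in place of actual CLIP embeddings) are assumptions.

```python
# Sketch of an unbiased squared-MMD estimate with a Gaussian RBF kernel,
# the general construction CMMD is described as applying to CLIP embeddings.
import torch

def rbf_kernel(a, b, bandwidth=10.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth=10.0):
    m, n = x.shape[0], y.shape[0]
    kxx = rbf_kernel(x, x, bandwidth)
    kyy = rbf_kernel(y, y, bandwidth)
    kxy = rbf_kernel(x, y, bandwidth)
    # drop the diagonal terms so the estimator is unbiased
    term_xx = (kxx.sum() - kxx.diag().sum()) / (m * (m - 1))
    term_yy = (kyy.sum() - kyy.diag().sum()) / (n * (n - 1))
    return term_xx + term_yy - 2 * kxy.mean()

real_emb = torch.randn(512, 768)   # stand-ins for embeddings of real images
fake_emb = torch.randn(512, 768)   # stand-ins for embeddings of generated images
print(mmd2_unbiased(real_emb, fake_emb).item())
```

Unlike FID, this estimate requires no Gaussian fit to the embeddings, which is what the sample-efficiency and distribution-free claims above refer to.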

Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences of varying lengths. We also employ a GPT-4-assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs' sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of co-occurring behaviors, and the compounding impact of behavioral hallucinations. Our dataset is available at //github.com/umd-huang-lab/Mementos.

Existing learning-based solutions to medical image segmentation have two important shortcomings. First, for most new segmentation tasks, a new model has to be trained or fine-tuned. This requires extensive resources and machine learning expertise and is therefore often infeasible for medical researchers and clinicians. Second, most existing segmentation methods produce a single deterministic segmentation mask for a given image. In practice, however, there is often considerable uncertainty about what constitutes the correct segmentation, and different expert annotators will often segment the same image differently. We tackle both of these problems with Tyche, a model that uses a context set to generate stochastic predictions for previously unseen tasks without the need to retrain. Tyche differs from other in-context segmentation methods in two important ways. (1) We introduce a novel convolution block architecture that enables interactions among predictions. (2) We introduce in-context test-time augmentation, a new mechanism to provide prediction stochasticity. When combined with appropriate model design and loss functions, Tyche can predict a set of plausible, diverse segmentation candidates for new or unseen medical images and segmentation tasks without the need to retrain.
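Tyche's in-context mechanism is not detailed in the abstract; the sketch below shows only the generic idea the second contribution builds on: run a segmentation model on several randomly perturbed copies of an input and collect the aligned outputs as a set of candidate masks. The flip-based augmentations and the `model` callable are placeholders, not Tyche's actual design.

```python
# Generic test-time-augmentation sketch producing a set of candidate masks.
import torch

def tta_candidates(model, image, n_candidates=8):
    """image: (1, C, H, W) tensor; returns a list of candidate mask tensors."""
    candidates = []
    for _ in range(n_candidates):
        flip_h = bool(torch.rand(()) < 0.5)
        flip_w = bool(torch.rand(()) < 0.5)
        aug = image
        if flip_h:
            aug = torch.flip(aug, dims=[2])
        if flip_w:
            aug = torch.flip(aug, dims=[3])
        with torch.no_grad():
            mask = model(aug)
        # undo the augmentation so all candidates are aligned with the input
        if flip_w:
            mask = torch.flip(mask, dims=[3])
        if flip_h:
            mask = torch.flip(mask, dims=[2])
        candidates.append(mask)
    return candidates
```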

Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, when it comes to producing images with complex scenes, properly describing both global image structure and object details remains a challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion model that performs a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector-quantized features, followed by coarse-to-fine gating to produce the image output. During this multi-scale representation learning stage, additional input conditions such as text, scene graphs, or image layouts can be further exploited, so Frido can also be applied to conditional or cross-modality image synthesis. We conduct extensive experiments on various unconditional and conditional image generation tasks, ranging from text-to-image synthesis and layout-to-image to scene-graph-to-image and label-to-image generation. More specifically, we achieve state-of-the-art FID scores on five benchmarks, namely layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at //github.com/davidhalladay/Frido.

Recently, Self-Supervised Representation Learning (SSRL) has attracted much attention in the fields of computer vision, speech, and natural language processing (NLP), and more recently with other modalities, including time series from sensors. The popularity of self-supervised learning is driven by the fact that traditional models typically require a huge amount of well-annotated data for training, and acquiring annotated data can be a difficult and costly process. Self-supervised methods have been introduced to improve the efficiency of training data through discriminative pre-training of models using supervisory signals obtained freely from the raw data. Unlike existing reviews of SSRL that have predominantly focused on methods in CV or NLP for a single modality, we aim to provide the first comprehensive review of multimodal self-supervised learning methods for temporal data. To this end, we 1) provide a comprehensive categorization of existing SSRL methods, 2) introduce a generic pipeline by defining the key components of an SSRL framework, 3) compare existing models in terms of their objective functions, network architectures, and potential applications, and 4) review existing multimodal techniques in each category and across modalities. Finally, we present existing weaknesses and future opportunities. We believe our work develops a perspective on the requirements of SSRL in domains that utilise multimodal and/or temporal data.
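As a minimal, method-agnostic example of a supervisory signal obtained for free from raw data, the sketch below implements a standard InfoNCE-style contrastive loss between two augmented views of the same batch; it is a generic illustration and not tied to any particular method surveyed here.

```python
# Minimal InfoNCE-style contrastive loss: two augmented views of the same
# samples supervise each other, with no human annotation required.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two augmented views of the same samples."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # similarity of every cross-view pair
    targets = torch.arange(z1.shape[0])     # the matching view is the positive
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)   # stand-in encoder outputs
print(info_nce(z1, z2).item())
```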

Images can convey rich semantics and induce various emotions in viewers. Recently, with the rapid advancement of emotional intelligence and the explosive growth of visual data, extensive research efforts have been dedicated to affective image content analysis (AICA). In this survey, we comprehensively review the development of AICA over the last two decades, focusing especially on state-of-the-art methods with respect to three main challenges: the affective gap, perception subjectivity, and label noise and absence. We begin with an introduction to the key emotion representation models that have been widely employed in AICA and a description of the available datasets for evaluation, together with a quantitative comparison of label noise and dataset bias. We then summarize and compare the representative approaches to (1) emotion feature extraction, including both handcrafted and deep features, (2) learning methods for dominant emotion recognition, personalized emotion prediction, emotion distribution learning, and learning from noisy data or few labels, and (3) AICA-based applications. Finally, we discuss some challenges and promising research directions for the future, such as image content and context understanding, group emotion clustering, and viewer-image interaction.

Conventionally, spatiotemporal modeling networks and their complexity are the two most studied topics in video action recognition. Existing state-of-the-art methods achieve excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to achieve both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H x W x T video frames as a space-time signal (viewed from the Height-Width spatial plane), we propose to also model video from the other two planes, Height-Time and Width-Time, to capture the dynamics of video thoroughly. Secondly, our model is designed on top of 2D CNN backbones, and model complexity is kept well in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module that exploits video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework that can be specialized into existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet achieves state-of-the-art performance with 2D CNN complexity.
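The sketch below illustrates the general idea of modelling a video feature tensor along the H-W, H-T, and W-T planes with depthwise (separable) 3D convolutions. The kernel sizes, the residual sum, and the depthwise choice are illustrative assumptions; MVFNet's actual fusion and channel weighting are not reproduced here.

```python
# Sketch: convolve a (B, C, T, H, W) feature tensor along three viewing planes
# with depthwise 3D convolutions and merge the results with a residual sum.
import torch
import torch.nn as nn

class MultiViewConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        dw = dict(groups=channels, bias=False)          # depthwise for efficiency
        self.hw = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1), **dw)
        self.ht = nn.Conv3d(channels, channels, (3, 3, 1), padding=(1, 1, 0), **dw)
        self.wt = nn.Conv3d(channels, channels, (3, 1, 3), padding=(1, 0, 1), **dw)

    def forward(self, x):                               # x: (B, C, T, H, W)
        return x + self.hw(x) + self.ht(x) + self.wt(x)

clip = torch.randn(2, 64, 8, 56, 56)                    # batch of feature clips
print(MultiViewConv(64)(clip).shape)                    # shape is preserved
```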

Many tasks in natural language processing can be viewed as multi-label classification problems. However, most existing models are trained with the standard cross-entropy loss function and use a fixed prediction policy (e.g., a threshold of 0.5) for all labels, which completely ignores the complexity and dependencies among different labels. In this paper, we propose a meta-learning method to capture these complex label dependencies. More specifically, our method utilizes a meta-learner to jointly learn the training policies and prediction policies for different labels. The training policies are then used to train the classifier with the cross-entropy loss function, and the prediction policies are applied at inference time. Experimental results on fine-grained entity typing and text classification demonstrate that our proposed method obtains more accurate multi-label classification results.
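To make the fixed-threshold issue concrete, the sketch below replaces the global 0.5 cutoff with one threshold per label, held as learnable parameters. This is only a minimal stand-in: the paper's meta-learner, which jointly produces training and prediction policies, is not shown.

```python
# Per-label prediction thresholds instead of a single fixed 0.5 cutoff.
import torch
import torch.nn as nn

class PerLabelThreshold(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        # one logit-space threshold per label, initialised to probability 0.5
        self.thresholds = nn.Parameter(torch.zeros(num_labels))

    def forward(self, logits):
        probs = torch.sigmoid(logits)                          # (batch, num_labels)
        # hard cutoff shown for clarity; training these thresholds would need
        # a differentiable surrogate or a policy-learning procedure
        return (probs > torch.sigmoid(self.thresholds)).float()

head = PerLabelThreshold(num_labels=5)
print(head(torch.randn(2, 5)))     # binary multi-label predictions
```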

With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition has inevitably been influenced by this wave of revolution, consequently entering the era of deep learning. In recent years, the community has witnessed substantial advancements in mindset, approach, and performance. This survey aims to summarize and analyze the major changes and significant progress in scene text detection and recognition in the deep learning era. Through this article, we aim to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead to future trends. Specifically, we emphasize the dramatic differences brought by deep learning and the grand challenges that still remain. We expect this review to serve as a reference for researchers in this field. Related resources are also collected and compiled in our GitHub repository: //github.com/Jyouhou/SceneTextPapers.

Current state-of-the-art semantic role labeling (SRL) uses a deep neural network with no explicit linguistic features. However, prior work has shown that gold syntax trees can dramatically improve SRL decoding, suggesting the possibility of increased accuracy from explicit modeling of syntax. In this work, we present linguistically-informed self-attention (LISA): a neural network model that combines multi-head self-attention with multi-task learning across dependency parsing, part-of-speech tagging, predicate detection, and SRL. Unlike previous models, which require significant pre-processing to prepare linguistic features, LISA can incorporate syntax using merely raw tokens as input, encoding the sequence only once to simultaneously perform parsing, predicate detection, and role labeling for all predicates. Syntax is incorporated by training one attention head to attend to the syntactic parent of each token. Moreover, if a high-quality syntactic parse is already available, it can be beneficially injected at test time without re-training our SRL model. In experiments on CoNLL-2005 SRL, LISA achieves new state-of-the-art performance for a model using predicted predicates and standard word embeddings, attaining 2.5 F1 absolute above the previous state of the art on newswire and more than 3.5 F1 on out-of-domain data, a nearly 10% reduction in error. On CoNLL-2012 English SRL we also show an improvement of more than 2.5 F1. LISA also outperforms the state of the art with contextually-encoded (ELMo) word representations, by nearly 1.0 F1 on news and more than 2.0 F1 on out-of-domain text.
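The core mechanism, training one attention head to attend to each token's syntactic parent, can be sketched as below: a single-head attention score matrix is supervised with a cross-entropy loss against gold head indices. The dimensions, toy batch, and random gold heads are illustrative assumptions, not LISA's actual configuration.

```python
# Supervising one attention head to point at each token's syntactic parent.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyntaxHead(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, seq, dim)
        scores = self.q(x) @ self.k(x).transpose(1, 2) / x.shape[-1] ** 0.5
        return scores                           # (batch, seq, seq) attention logits

tokens = torch.randn(2, 6, 64)                  # toy batch of encoded sentences
gold_heads = torch.randint(0, 6, (2, 6))        # index of each token's parent
head = SyntaxHead(64)
logits = head(tokens)
# each token's attention row is trained to select its syntactic parent
loss = F.cross_entropy(logits.reshape(-1, 6), gold_heads.reshape(-1))
print(loss.item())
```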
