
Recently, vision transformers have performed well in various computer vision tasks, including voxel 3D reconstruction. However, the windows of the vision transformer are fixed at a single scale and exchange no information with one another, which limits the accuracy of voxel 3D reconstruction. We therefore propose a shifted-window attention network for voxel 3D reconstruction. To the best of our knowledge, this is the first work to apply shifted window attention to voxel 3D reconstruction. Experimental results on ShapeNet verify that our method achieves state-of-the-art (SOTA) accuracy in single-view reconstruction.
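The abstract does not spell out the attention mechanism, so the sketch below only illustrates the generic Swin-style idea it refers to: partition the feature map into local windows, attend within each window, and cyclically shift the windows between blocks so that neighbouring windows exchange information. The tensor layout, window size, and head count are illustrative assumptions, and the attention mask normally used with shifted windows is omitted for brevity.

```python
# Minimal sketch of shifted-window self-attention on a 2D feature map
# (Swin-style); the paper's actual 3D-reconstruction network is not shown.
import torch
import torch.nn as nn

def window_partition(x, ws):
    # x: (B, H, W, C) -> (num_windows * B, ws * ws, C)
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    # inverse of window_partition
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.view(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class ShiftedWindowAttention(nn.Module):
    def __init__(self, dim, ws=7, heads=4, shift=0):
        super().__init__()
        self.ws, self.shift = ws, shift
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, H, W, C)
        B, H, W, C = x.shape
        if self.shift:                          # cyclic shift links adjacent windows
            x = torch.roll(x, (-self.shift, -self.shift), dims=(1, 2))
        win = window_partition(x, self.ws)      # attend within each local window
        out, _ = self.attn(win, win, win)
        x = window_reverse(out, self.ws, H, W)
        if self.shift:                          # undo the shift
            x = torch.roll(x, (self.shift, self.shift), dims=(1, 2))
        return x

blk = ShiftedWindowAttention(dim=32, ws=7, heads=4, shift=3)
print(blk(torch.randn(1, 14, 14, 32)).shape)    # (1, 14, 14, 32)
```

In practice, blocks with shift=0 and shift=ws//2 are alternated so that every location eventually attends across window boundaries.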

Related Content

In computer vision, 3D reconstruction refers to the process of recovering 3D information from single-view or multi-view images. Because the information in a single view is incomplete, single-view 3D reconstruction has to draw on prior knowledge. Multi-view 3D reconstruction (analogous to human binocular vision) is comparatively easier: the cameras are first calibrated, i.e., the relationship between each camera's image coordinate system and the world coordinate system is computed, and 3D information is then reconstructed from the information in multiple 2D images. Object 3D reconstruction is a shared scientific problem and core technology in computer-aided geometric design (CAGD), computer graphics (CG), computer animation, computer vision, medical image processing, scientific computing, virtual reality, digital media creation, and related fields. There are two main ways to produce a 3D representation of an object in a computer. One is to build a human-controlled 3D geometric model interactively with geometric modeling software; the other is to capture the geometric shape of a real object by some acquisition method. The former is technologically mature and supported by a range of software such as 3DMAX, Maya, AutoCAD, and UG, which generally represent geometry with curves and surfaces that have explicit mathematical expressions. The latter is usually called the 3D reconstruction process: the mathematical procedures and computing techniques for recovering an object's 3D information (shape, etc.) from 2D projections, including steps such as data acquisition, preprocessing, point-cloud registration, and feature analysis.
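As a concrete illustration of the calibrated multi-view case described above, the sketch below triangulates a single 3D point from two views with the standard direct linear transform. The intrinsics, camera poses, and pixel coordinates are made-up example values, not the output of any real calibration.

```python
# Minimal sketch: triangulate one 3D point from two calibrated views (DLT).
# The projection matrices and pixel coordinates are illustrative placeholders.
import numpy as np

def triangulate(P1, P2, x1, x2):
    """P1, P2: 3x4 camera projection matrices; x1, x2: (u, v) pixel coordinates."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)          # least-squares solution of A X = 0
    X = Vt[-1]
    return X[:3] / X[3]                  # dehomogenize

K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])    # intrinsics
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])                   # first camera at the origin
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.], [0.]])])     # second camera, shifted baseline

X_true = np.array([0.1, -0.05, 2.0, 1.0])
x1 = P1 @ X_true; x1 = x1[:2] / x1[2]    # project the point into both views
x2 = P2 @ X_true; x2 = x2[:2] / x2[2]
print(triangulate(P1, P2, x1, x2))       # recovers approximately (0.1, -0.05, 2.0)
```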

As with many machine learning problems, progress in image generation hinges on good evaluation metrics. One of the most popular is the Fréchet Inception Distance (FID). FID estimates the distance between a distribution of Inception-v3 features of real images and that of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, does not reflect the gradual improvement of iterative text-to-image models, does not capture distortion levels, and produces inconsistent results when the sample size varies. We also propose an alternative metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with a Gaussian RBF kernel. It is an unbiased estimator that makes no assumptions about the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality.
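The abstract does not give CMMD's exact definition, so the sketch below shows only the general construction it is described as building on: an unbiased estimate of squared MMD with a Gaussian RBF kernel between two sets of embeddings. The bandwidth value and the random stand-in features (in place of actual CLIP embeddings) are assumptions.

```python
# Sketch of an unbiased squared-MMD estimate with a Gaussian RBF kernel,
# the general construction CMMD is described as applying to CLIP embeddings.
import torch

def rbf_kernel(a, b, bandwidth=10.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth=10.0):
    m, n = x.shape[0], y.shape[0]
    kxx = rbf_kernel(x, x, bandwidth)
    kyy = rbf_kernel(y, y, bandwidth)
    kxy = rbf_kernel(x, y, bandwidth)
    # drop the diagonal terms so the estimator is unbiased
    term_xx = (kxx.sum() - kxx.diag().sum()) / (m * (m - 1))
    term_yy = (kyy.sum() - kyy.diag().sum()) / (n * (n - 1))
    return term_xx + term_yy - 2 * kxy.mean()

real_emb = torch.randn(512, 768)   # stand-ins for embeddings of real images
fake_emb = torch.randn(512, 768)   # stand-ins for embeddings of generated images
print(mmd2_unbiased(real_emb, fake_emb).item())
```

Unlike FID, this estimate requires no Gaussian fit to the embeddings, which is what the sample-efficiency and distribution-free claims above refer to.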

Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences of varying lengths. We also employ a GPT-4-assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs' sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of co-occurring behaviors, and the compounding impact of behavioral hallucinations. Our dataset is available at //github.com/umd-huang-lab/Mementos.

Existing learning-based solutions to medical image segmentation have two important shortcomings. First, for most new segmentation tasks, a new model has to be trained or fine-tuned. This requires extensive resources and machine learning expertise and is therefore often infeasible for medical researchers and clinicians. Second, most existing segmentation methods produce a single deterministic segmentation mask for a given image. In practice, however, there is often considerable uncertainty about what constitutes the correct segmentation, and different expert annotators will often segment the same image differently. We tackle both of these problems with Tyche, a model that uses a context set to generate stochastic predictions for previously unseen tasks without the need to retrain. Tyche differs from other in-context segmentation methods in two important ways. (1) We introduce a novel convolution block architecture that enables interactions among predictions. (2) We introduce in-context test-time augmentation, a new mechanism to provide prediction stochasticity. When combined with appropriate model design and loss functions, Tyche can predict a set of plausible, diverse segmentation candidates for new or unseen medical images and segmentation tasks without the need to retrain.
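Tyche's in-context mechanism is not detailed in the abstract; the sketch below shows only the generic idea the second contribution builds on: run a segmentation model on several randomly perturbed copies of an input and collect the aligned outputs as a set of candidate masks. The flip-based augmentations and the `model` callable are placeholders, not Tyche's actual design.

```python
# Generic test-time-augmentation sketch producing a set of candidate masks.
import torch

def tta_candidates(model, image, n_candidates=8):
    """image: (1, C, H, W) tensor; returns a list of candidate mask tensors."""
    candidates = []
    for _ in range(n_candidates):
        flip_h = bool(torch.rand(()) < 0.5)
        flip_w = bool(torch.rand(()) < 0.5)
        aug = image
        if flip_h:
            aug = torch.flip(aug, dims=[2])
        if flip_w:
            aug = torch.flip(aug, dims=[3])
        with torch.no_grad():
            mask = model(aug)
        # undo the augmentation so all candidates are aligned with the input
        if flip_w:
            mask = torch.flip(mask, dims=[3])
        if flip_h:
            mask = torch.flip(mask, dims=[2])
        candidates.append(mask)
    return candidates
```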

Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, when it comes to producing images with complex scenes, properly describing both global image structure and object details remains a challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion model that performs a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector-quantized features, followed by coarse-to-fine gating to produce the image output. During this multi-scale representation learning stage, additional input conditions such as text, scene graphs, or image layouts can be further exploited, so Frido can also be applied to conditional or cross-modality image synthesis. We conduct extensive experiments on various unconditional and conditional image generation tasks, ranging from text-to-image synthesis and layout-to-image to scene-graph-to-image and label-to-image generation. More specifically, we achieve state-of-the-art FID scores on five benchmarks, namely layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at //github.com/davidhalladay/Frido.

Recently, Self-Supervised Representation Learning (SSRL) has attracted much attention in the fields of computer vision, speech, and natural language processing (NLP), and more recently with other modalities, including time series from sensors. The popularity of self-supervised learning is driven by the fact that traditional models typically require a huge amount of well-annotated data for training, and acquiring annotated data can be a difficult and costly process. Self-supervised methods have been introduced to improve the efficiency of training data through discriminative pre-training of models using supervisory signals obtained freely from the raw data. Unlike existing reviews of SSRL that have predominantly focused on methods in CV or NLP for a single modality, we aim to provide the first comprehensive review of multimodal self-supervised learning methods for temporal data. To this end, we 1) provide a comprehensive categorization of existing SSRL methods, 2) introduce a generic pipeline by defining the key components of an SSRL framework, 3) compare existing models in terms of their objective functions, network architectures, and potential applications, and 4) review existing multimodal techniques in each category and across modalities. Finally, we present existing weaknesses and future opportunities. We believe our work develops a perspective on the requirements of SSRL in domains that utilise multimodal and/or temporal data.
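As a minimal, method-agnostic example of a supervisory signal obtained for free from raw data, the sketch below implements a standard InfoNCE-style contrastive loss between two augmented views of the same batch; it is a generic illustration and not tied to any particular method surveyed here.

```python
# Minimal InfoNCE-style contrastive loss: two augmented views of the same
# samples supervise each other, with no human annotation required.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two augmented views of the same samples."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # similarity of every cross-view pair
    targets = torch.arange(z1.shape[0])     # the matching view is the positive
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)   # stand-in encoder outputs
print(info_nce(z1, z2).item())
```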

Images can convey rich semantics and induce various emotions in viewers. Recently, with the rapid advancement of emotional intelligence and the explosive growth of visual data, extensive research efforts have been dedicated to affective image content analysis (AICA). In this survey, we comprehensively review the development of AICA over the last two decades, focusing especially on state-of-the-art methods with respect to three main challenges: the affective gap, perception subjectivity, and label noise and absence. We begin with an introduction to the key emotion representation models that have been widely employed in AICA and a description of the available datasets for evaluation, together with a quantitative comparison of label noise and dataset bias. We then summarize and compare the representative approaches to (1) emotion feature extraction, including both handcrafted and deep features, (2) learning methods for dominant emotion recognition, personalized emotion prediction, emotion distribution learning, and learning from noisy data or few labels, and (3) AICA-based applications. Finally, we discuss some challenges and promising research directions for the future, such as image content and context understanding, group emotion clustering, and viewer-image interaction.

Conventionally, spatiotemporal modeling networks and their complexity are the two most studied topics in video action recognition. Existing state-of-the-art methods achieve excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to achieve both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H x W x T video frames as a space-time signal (viewed from the Height-Width spatial plane), we propose to also model video from the other two planes, Height-Time and Width-Time, to capture the dynamics of video thoroughly. Secondly, our model is designed on top of 2D CNN backbones, and model complexity is kept well in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module that exploits video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework that can be specialized into existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet achieves state-of-the-art performance with 2D CNN complexity.
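The sketch below illustrates the general idea of modelling a video feature tensor along the H-W, H-T, and W-T planes with depthwise (separable) 3D convolutions. The kernel sizes, the residual sum, and the depthwise choice are illustrative assumptions; MVFNet's actual fusion and channel weighting are not reproduced here.

```python
# Sketch: convolve a (B, C, T, H, W) feature tensor along three viewing planes
# with depthwise 3D convolutions and merge the results with a residual sum.
import torch
import torch.nn as nn

class MultiViewConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        dw = dict(groups=channels, bias=False)          # depthwise for efficiency
        self.hw = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1), **dw)
        self.ht = nn.Conv3d(channels, channels, (3, 3, 1), padding=(1, 1, 0), **dw)
        self.wt = nn.Conv3d(channels, channels, (3, 1, 3), padding=(1, 0, 1), **dw)

    def forward(self, x):                               # x: (B, C, T, H, W)
        return x + self.hw(x) + self.ht(x) + self.wt(x)

clip = torch.randn(2, 64, 8, 56, 56)                    # batch of feature clips
print(MultiViewConv(64)(clip).shape)                    # shape is preserved
```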

Many tasks in natural language processing can be viewed as multi-label classification problems. However, most existing models are trained with the standard cross-entropy loss function and use a fixed prediction policy (e.g., a threshold of 0.5) for all labels, which completely ignores the complexity and dependencies among different labels. In this paper, we propose a meta-learning method to capture these complex label dependencies. More specifically, our method utilizes a meta-learner to jointly learn the training policies and prediction policies for different labels. The training policies are then used to train the classifier with the cross-entropy loss function, and the prediction policies are applied at inference time. Experimental results on fine-grained entity typing and text classification demonstrate that our proposed method obtains more accurate multi-label classification results.
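To make the fixed-threshold issue concrete, the sketch below replaces the global 0.5 cutoff with one threshold per label, held as learnable parameters. This is only a minimal stand-in: the paper's meta-learner, which jointly produces training and prediction policies, is not shown.

```python
# Per-label prediction thresholds instead of a single fixed 0.5 cutoff.
import torch
import torch.nn as nn

class PerLabelThreshold(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        # one logit-space threshold per label, initialised to probability 0.5
        self.thresholds = nn.Parameter(torch.zeros(num_labels))

    def forward(self, logits):
        probs = torch.sigmoid(logits)                          # (batch, num_labels)
        # hard cutoff shown for clarity; training these thresholds would need
        # a differentiable surrogate or a policy-learning procedure
        return (probs > torch.sigmoid(self.thresholds)).float()

head = PerLabelThreshold(num_labels=5)
print(head(torch.randn(2, 5)))     # binary multi-label predictions
```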

With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition has inevitably been influenced by this wave of revolution, consequently entering the era of deep learning. In recent years, the community has witnessed substantial advancements in mindset, approach, and performance. This survey aims to summarize and analyze the major changes and significant progress in scene text detection and recognition in the deep learning era. Through this article, we aim to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead to future trends. Specifically, we emphasize the dramatic differences brought by deep learning and the grand challenges that still remain. We expect this review to serve as a reference for researchers in this field. Related resources are also collected and compiled in our GitHub repository: //github.com/Jyouhou/SceneTextPapers.

Current state-of-the-art semantic role labeling (SRL) uses a deep neural network with no explicit linguistic features. However, prior work has shown that gold syntax trees can dramatically improve SRL decoding, suggesting the possibility of increased accuracy from explicit modeling of syntax. In this work, we present linguistically-informed self-attention (LISA): a neural network model that combines multi-head self-attention with multi-task learning across dependency parsing, part-of-speech tagging, predicate detection, and SRL. Unlike previous models, which require significant pre-processing to prepare linguistic features, LISA can incorporate syntax using merely raw tokens as input, encoding the sequence only once to simultaneously perform parsing, predicate detection, and role labeling for all predicates. Syntax is incorporated by training one attention head to attend to the syntactic parent of each token. Moreover, if a high-quality syntactic parse is already available, it can be beneficially injected at test time without re-training our SRL model. In experiments on CoNLL-2005 SRL, LISA achieves new state-of-the-art performance for a model using predicted predicates and standard word embeddings, attaining 2.5 F1 absolute above the previous state of the art on newswire and more than 3.5 F1 on out-of-domain data, a nearly 10% reduction in error. On CoNLL-2012 English SRL we also show an improvement of more than 2.5 F1. LISA also outperforms the state of the art with contextually-encoded (ELMo) word representations, by nearly 1.0 F1 on news and more than 2.0 F1 on out-of-domain text.
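The core mechanism, training one attention head to attend to each token's syntactic parent, can be sketched as below: a single-head attention score matrix is supervised with a cross-entropy loss against gold head indices. The dimensions, toy batch, and random gold heads are illustrative assumptions, not LISA's actual configuration.

```python
# Supervising one attention head to point at each token's syntactic parent.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyntaxHead(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, seq, dim)
        scores = self.q(x) @ self.k(x).transpose(1, 2) / x.shape[-1] ** 0.5
        return scores                           # (batch, seq, seq) attention logits

tokens = torch.randn(2, 6, 64)                  # toy batch of encoded sentences
gold_heads = torch.randint(0, 6, (2, 6))        # index of each token's parent
head = SyntaxHead(64)
logits = head(tokens)
# each token's attention row is trained to select its syntactic parent
loss = F.cross_entropy(logits.reshape(-1, 6), gold_heads.reshape(-1))
print(loss.item())
```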
