Recent advances in text-to-image diffusion models have achieved remarkable success in generating high-quality, realistic images from given text prompts. However, previous methods fail to perform accurate modality alignment between text concepts and generated images due to the lack of fine-level semantic guidance that successfully diagnoses the modality discrepancy. In this paper, we propose FineRewards to improve the alignment between text and images in text-to-image diffusion models by introducing two new fine-grained semantic rewards: the caption reward and the Semantic Segment Anything (SAM) reward. From the global semantic view, the caption reward generates a corresponding detailed caption that depicts all important contents in the synthetic image via a BLIP-2 model and then calculates the reward score by measuring the similarity between the generated caption and the given prompt. From the local semantic view, the SAM reward segments the generated images into local parts with category labels, and scores the segmented parts by measuring the likelihood of each category appearing in the prompted scene via a large language model, i.e., Vicuna-7B. Additionally, we adopt an assemble reward-ranked learning strategy to enable the integration of multiple reward functions to jointly guide the model training. Adaptation results of text-to-image models on the MS-COCO benchmark show that the proposed semantic rewards outperform other baseline reward functions by a considerable margin in both visual quality and semantic similarity with the input prompt. Moreover, with the assemble reward-ranked learning strategy, we demonstrate that model performance improves further when the proposed semantic rewards are combined with existing image rewards during adaptation.
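To make the caption reward concrete, below is a minimal, hedged sketch of one way such a reward could be computed: caption the synthesized image with BLIP-2 and score the caption against the prompt via cosine similarity of sentence embeddings. The specific checkpoints (Salesforce/blip2-opt-2.7b, all-MiniLM-L6-v2) and the choice of similarity metric are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of a BLIP-2-based caption reward (assumed similarity metric:
# cosine similarity of sentence embeddings; the paper may use a different scorer).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"
blip_processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip_model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed sentence encoder

def caption_reward(image: Image.Image, prompt: str) -> float:
    # 1) Describe the synthesized image with BLIP-2.
    inputs = blip_processor(images=image, return_tensors="pt").to(device)
    out = blip_model.generate(**inputs, max_new_tokens=40)
    caption = blip_processor.batch_decode(out, skip_special_tokens=True)[0].strip()
    # 2) Score the similarity between the generated caption and the original prompt.
    emb = text_encoder.encode([caption, prompt], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```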
The demand for artificial intelligence (AI) in healthcare is rapidly increasing. However, significant challenges arise from data scarcity and privacy concerns, particularly in medical imaging. While existing generative models have achieved success in image synthesis and image-to-image translation tasks, there remains a gap in the generation of 3D semantic medical images. To address this gap, we introduce Med-DDPM, a diffusion model specifically designed for semantic 3D medical image synthesis, effectively tackling data scarcity and privacy issues. The novelty of Med-DDPM lies in its incorporation of semantic conditioning, enabling precise control during the image generation process. Our model outperforms Generative Adversarial Networks (GANs) in terms of stability and performance, generating diverse and anatomically coherent images with high visual fidelity. Comparative analysis against state-of-the-art augmentation techniques demonstrates that Med-DDPM produces comparable results, highlighting its potential as a data augmentation tool for enhancing model accuracy. In conclusion, Med-DDPM pioneers 3D semantic medical image synthesis by delivering high-quality and anatomically coherent images. Furthermore, the integration of semantic conditioning with Med-DDPM holds promise for image anonymization in the field of biomedical imaging, showcasing the capabilities of the model in addressing challenges related to data scarcity and privacy concerns.
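One common way to realize semantic conditioning in a diffusion model, sketched below purely as an assumption, is to concatenate a one-hot label volume with the noisy image along the channel dimension at every denoising step. The tiny network, shapes, and conditioning scheme are placeholders, not Med-DDPM's actual architecture.

```python
# Hedged sketch: semantic conditioning of a 3D denoiser via channel-wise
# concatenation of a one-hot label map with the noisy volume (illustrative only).
import torch
import torch.nn as nn

class Tiny3DDenoiser(nn.Module):
    def __init__(self, img_channels=1, num_classes=4, hidden=32):
        super().__init__()
        in_ch = img_channels + num_classes  # noisy image + one-hot semantic mask
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv3d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv3d(hidden, img_channels, 3, padding=1),  # predicts the noise
        )

    def forward(self, x_t, mask_onehot, t):
        # A real model would also embed the timestep t; omitted here for brevity.
        return self.net(torch.cat([x_t, mask_onehot], dim=1))

x_t = torch.randn(1, 1, 32, 32, 32)                          # noisy volume
mask = torch.randint(0, 4, (1, 32, 32, 32))                  # semantic labels
mask_onehot = nn.functional.one_hot(mask, 4).permute(0, 4, 1, 2, 3).float()
eps_pred = Tiny3DDenoiser()(x_t, mask_onehot, t=torch.tensor([10]))
```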
Recent vision-language models outperform vision-only models on many image classification tasks. However, because of the absence of paired text/image descriptions, it remains difficult to fine-tune these models for fine-grained image classification. In this work, we propose a method, GIST, for generating image-specific fine-grained text descriptions from image-only datasets, and show that these text descriptions can be used to improve classification. Key parts of our method include 1. prompting a pretrained large language model with domain-specific prompts to generate diverse fine-grained text descriptions for each class and 2. using a pretrained vision-language model to match each image to label-preserving text descriptions that capture relevant visual features in the image. We demonstrate the utility of GIST by fine-tuning vision-language models on the image-and-generated-text pairs to learn an aligned vision-language representation space for improved classification. We evaluate our learned representation space in full-shot and few-shot scenarios across four diverse fine-grained classification datasets, each from a different domain. Our method achieves an average improvement of $4.1\%$ in accuracy over CLIP linear probes and an average improvement of $1.1\%$ in accuracy over the previous state-of-the-art image-text classification method on the full-shot datasets. Our method achieves similar improvements across few-shot regimes. Code is available at //github.com/emu1729/GIST.
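The matching step could look like the hedged sketch below: a pretrained CLIP model pairs each image with the most similar description among the LLM-generated descriptions of its own class, so the label is preserved. The checkpoint, the example class, and the "pick the highest-scoring same-class description" rule are illustrative assumptions.

```python
# Hedged sketch of matching images to label-preserving generated descriptions
# with a pretrained CLIP model (checkpoint and matching rule are assumptions).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def match_description(image: Image.Image, class_descriptions: list[str]) -> str:
    """Return the generated description of the image's class that CLIP scores highest."""
    inputs = processor(text=class_descriptions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits_per_image = model(**inputs).logits_per_image  # (1, num_descriptions)
    return class_descriptions[logits_per_image.argmax(dim=-1).item()]

# Example: hypothetical LLM-generated descriptions for the class "chickadee".
descriptions = ["a small bird with a black cap and white cheeks",
                "a chickadee perched on a branch with grey wings"]
best = match_description(Image.new("RGB", (224, 224)), descriptions)
```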
Incremental semantic segmentation aims to continually learn the segmentation of newly arriving classes without accessing the training data of previously learned classes. However, most current methods fail to address catastrophic forgetting and background shift since they 1) treat all previous classes equally without considering the different forgetting paces caused by imbalanced gradient back-propagation; 2) lack strong semantic guidance between classes. To tackle the above challenges, in this paper, we propose a Gradient-Semantic Compensation (GSC) model, which handles incremental semantic segmentation from both the gradient and semantic perspectives. Specifically, to address catastrophic forgetting from the gradient aspect, we develop a step-aware gradient compensation that balances the forgetting paces of previously seen classes by re-weighting gradient back-propagation. Meanwhile, we propose a soft-sharp semantic relation distillation to distill consistent inter-class semantic relations via soft labels, alleviating catastrophic forgetting from the semantic aspect. In addition, we develop a prototypical pseudo re-labeling scheme that provides strong semantic guidance to mitigate background shift. It produces high-quality pseudo labels for old classes in the background by measuring distances between pixels and class-wise prototypes. Extensive experiments on three public datasets, i.e., Pascal VOC 2012, ADE20K, and Cityscapes, demonstrate the effectiveness of our proposed GSC model.
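The prototypical pseudo re-labeling idea can be sketched as below: background pixels are re-labeled as an old class when their features lie close enough to that class's prototype. The feature shapes, the Euclidean distance, and the confidence threshold are assumptions for illustration, not the GSC model's exact recipe.

```python
# Hedged sketch of prototypical pseudo re-labeling for old classes hidden in
# the background (shapes, distance, and threshold are illustrative assumptions).
import torch

def pseudo_relabel(features, labels, prototypes, bg_index=0, threshold=1.0):
    """
    features:   (C, H, W) pixel embeddings from the frozen old model
    labels:     (H, W) current ground truth, old classes collapsed into bg_index
    prototypes: (K, C) mean feature of each previously learned class
    """
    C, H, W = features.shape
    flat = features.permute(1, 2, 0).reshape(-1, C)          # (H*W, C)
    dists = torch.cdist(flat, prototypes)                     # (H*W, K)
    nearest_dist, nearest_cls = dists.min(dim=1)
    pseudo = labels.clone().reshape(-1)
    is_bg = pseudo == bg_index
    confident = nearest_dist < threshold                      # assumed confidence rule
    pseudo[is_bg & confident] = nearest_cls[is_bg & confident] + 1  # old classes start at 1
    return pseudo.reshape(H, W)

pseudo = pseudo_relabel(torch.randn(64, 65, 65),
                        torch.zeros(65, 65, dtype=torch.long),
                        torch.randn(5, 64))
```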
Recent text-to-image diffusion models have demonstrated an astonishing capacity to generate high-quality images. However, researchers have mainly studied how to synthesize images with only text prompts. While some works have explored using other modalities as conditions, considerable paired data, e.g., box/mask-image pairs, and fine-tuning time are required to train such models. As such paired data is time-consuming and labor-intensive to acquire and restricted to a closed set, this potentially becomes the bottleneck for applications in an open world. This paper focuses on the simplest form of user-provided conditions, e.g., boxes or scribbles. To mitigate the aforementioned problem, we propose a training-free method to control the objects and contexts in the synthesized images so that they adhere to the given spatial conditions. Specifically, three spatial constraints, i.e., Inner-Box, Outer-Box, and Corner Constraints, are designed and seamlessly integrated into the denoising step of diffusion models, requiring neither additional training nor massive annotated layout data. Extensive results show that the proposed constraints can control what to present and where to present it in the images, while retaining the ability of the Stable Diffusion model to synthesize with high fidelity and diverse concept coverage. The code is publicly available at //github.com/Sierkinhane/BoxDiff.
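A hedged sketch of the general idea of box-style constraints on a cross-attention map follows: encourage high responses for a target token inside the user box and low responses outside it. Using the top-k responses and this exact loss form are illustrative assumptions rather than the paper's precise formulation.

```python
# Hedged sketch of inner/outer box constraints on one token's cross-attention
# map (top-k selection and loss form are assumptions, not BoxDiff's exact math).
import torch

def box_constraint_loss(attn_map, box_mask, k=10):
    """
    attn_map: (H, W) cross-attention of one text token, values in [0, 1]
    box_mask: (H, W) binary mask of the user-provided box
    """
    inside = attn_map[box_mask.bool()]
    outside = attn_map[~box_mask.bool()]
    inner_loss = 1.0 - inside.topk(min(k, inside.numel())).values.mean()
    outer_loss = outside.topk(min(k, outside.numel())).values.mean()
    return inner_loss + outer_loss

# During denoising, the gradient of such a loss w.r.t. the latent could be used
# to nudge the sample toward the desired layout without any extra training.
attn = torch.rand(16, 16)
mask = torch.zeros(16, 16)
mask[4:12, 4:12] = 1
loss = box_constraint_loss(attn, mask)
```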
The CLIP model has recently been proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated by vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), which, in a novel way, unifies the learning of a contrastive visual-semantic space with the addition of generated images and texts on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics like CIDEr and SPICE and reference-free metrics like CLIP-Score. Finally, we test the system-level correlation of the proposed metric when considering popular image captioning approaches, and assess the impact of employing different cross-modal features. Our source code and trained models are publicly available at: //github.com/aimagelab/pacscore.
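For orientation, the reference-free family of metrics mentioned here follows the general form score = w * max(cos(image_emb, text_emb), 0); PAC-S keeps this recipe but with a backbone fine-tuned on positive-augmented pairs. The sketch below uses an off-the-shelf CLIP checkpoint and a scaling factor w = 2.5 purely as assumptions for illustration.

```python
# Hedged sketch of a reference-free, CLIP-Score-style captioning metric
# (checkpoint and scaling factor are illustrative assumptions, not PAC-S itself).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_style_score(image: Image.Image, caption: str, w: float = 2.5) -> float:
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img * txt).sum(dim=-1).item()
    return w * max(cos, 0.0)

score = clip_style_score(Image.new("RGB", (224, 224)), "a dog running on the beach")
```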
Scene text image super-resolution (STISR), aiming to improve image quality while boosting downstream scene text recognition accuracy, has recently achieved great success. However, most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process, and neglect the disturbance from the complex background, thus limiting the performance. To address these issues, in this paper, we propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution. To model the location of characters effectively, we propose the location enhancement module to extract character region features based on the attention map sequence. Besides, we propose the multi-modal alignment module to perform bidirectional visual-semantic alignment to generate high-quality prior guidance, which is then incorporated into the super-resolution branch in an adaptive manner using the proposed adaptive fusion module. Experiments on TextZoom and four scene text recognition benchmarks demonstrate the superiority of our method over other state-of-the-art methods. Code is available at //github.com/csguoh/LEMMA.
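In the spirit of such a location enhancement module, one could pool a feature vector per character region by weighting visual features with a sequence of per-character attention maps, as in the hedged sketch below. The shapes and the simple weighted pooling are assumptions, not LEMMA's actual module.

```python
# Hedged sketch: pooling character-region features from an attention-map
# sequence (one map per character position); shapes and pooling are assumptions.
import torch

def pool_character_features(feat, attn):
    """
    feat: (B, C, H, W) visual features of the low-resolution text image
    attn: (B, T, H, W) attention maps, one per character position, normalized over H*W
    returns: (B, T, C) one feature vector per character region
    """
    flat_feat = feat.flatten(2)                       # (B, C, H*W)
    flat_attn = attn.flatten(2)                       # (B, T, H*W)
    return torch.bmm(flat_attn, flat_feat.transpose(1, 2))

feat = torch.randn(2, 64, 16, 64)
attn = torch.softmax(torch.randn(2, 26, 16 * 64), dim=-1).reshape(2, 26, 16, 64)
char_feats = pool_character_features(feat, attn)      # (2, 26, 64)
```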
Deep learning shows great potential in generation tasks thanks to deep latent representations. Generative models are classes of models that can generate observations randomly with respect to certain implied parameters. Recently, the diffusion model has become a rising class of generative models by virtue of its powerful generative ability, and great achievements have been reached. Beyond computer vision, speech generation, bioinformatics, and natural language processing, more applications remain to be explored in this field. However, the diffusion model has an inherent drawback of a slow generation process, which has motivated many enhanced works. This survey summarizes the field of diffusion models. We first state the main problem via two landmark works, DDPM and DSM. Then, we present a diverse range of advanced techniques to speed up diffusion models: training schedule, training-free sampling, mixed-modeling, and score & diffusion unification. Regarding existing models, we also provide a benchmark of FID score, IS, and NLL at specific NFEs (numbers of function evaluations). Moreover, applications of diffusion models are introduced, including computer vision, sequence modeling, audio, and AI for science. Finally, we summarize the field together with its limitations and future directions.
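For reference, the standard DDPM formulation that such works build on can be written as follows (textbook equations from Ho et al., 2020, included only as a reminder, not as quantities specific to this survey):

```latex
% Forward (noising) process with variance schedule \beta_t, where
% \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s:
\[
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\, \mathbf{I}\right).
\]
% Learned reverse (denoising) process:
\[
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right),
\]
% trained in practice with the simplified noise-prediction objective:
\[
L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\!\left[\,\bigl\lVert \epsilon - \epsilon_\theta\!\bigl(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\bigr)\bigr\rVert^2\,\right].
\]
```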
Fine-grained image analysis (FGIA) is a longstanding and fundamental problem in computer vision and pattern recognition, and underpins a diverse set of real-world applications. The task of FGIA targets analyzing visual objects from subordinate categories, e.g., species of birds or models of cars. The small inter-class and large intra-class variation inherent to fine-grained image analysis makes it a challenging problem. Capitalizing on advances in deep learning, in recent years we have witnessed remarkable progress in deep learning powered FGIA. In this paper we present a systematic survey of these advances, where we attempt to re-define and broaden the field of FGIA by consolidating two fundamental fine-grained research areas -- fine-grained image recognition and fine-grained image retrieval. In addition, we also review other key issues of FGIA, such as publicly available benchmark datasets and related domain-specific applications. We conclude by highlighting several research directions and open problems which need further exploration from the community.
Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, in the last few years, a large research effort has been devoted to image captioning, i.e. the task of describing images with syntactically and semantically meaningful sentences. Starting from 2015, the task has generally been addressed with pipelines composed of a visual encoding step and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions, attributes, and relationships and the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, despite the impressive results obtained, research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview and categorization of image captioning approaches, from visual encoding and text generation to training strategies, used datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in image captioning architectures and training strategies. Moreover, many variants of the problem and its open challenges are analyzed and discussed. The final goal of this work is to serve as a tool for understanding the existing state of the art and highlighting the future directions for an area of research where Computer Vision and Natural Language Processing can find an optimal synergy.
Image-to-image translation (I2I) aims to transfer images from a source domain to a target domain while preserving the content representations. I2I has drawn increasing attention and made tremendous progress in recent years because of its wide range of applications in many computer vision and image processing problems, such as image synthesis, segmentation, style transfer, restoration, and pose estimation. In this paper, we provide an overview of the I2I works developed in recent years. We will analyze the key techniques of the existing I2I works and clarify the main progress the community has made. Additionally, we will elaborate on the effect of I2I on the research and industry community and point out remaining challenges in related fields.