Text-to-image diffusion models (SD) exhibit significant advancements while requiring extensive computational resources. Existing acceleration methods usually require extensive training and are not universally applicable. LCM-LoRA, trainable once for diverse models, offers universality but rarely considers ensuring the consistency of generated content before and after acceleration. This paper proposes SpeedUpNet (SUN), an innovative acceleration module, to address the challenges of universality and consistency. Exploiting the role of cross-attention layers in U-Net for SD models, we introduce an adapter specifically designed for these layers, quantifying the offset in image generation caused by negative prompts relative to positive prompts. This learned offset demonstrates stability across a range of models, enhancing SUN's universality. To improve output consistency, we propose a Multi-Step Consistency (MSC) loss, which stabilizes the offset and ensures fidelity in accelerated content. Experiments on SD v1.5 show that SUN leads to an overall speedup of more than 10 times compared to the baseline 25-step DPM-solver++, and offers two extra advantages: (1) training-free integration into various fine-tuned Stable-Diffusion models and (2) state-of-the-art FIDs of the generated data set before and after acceleration guided by random combinations of positive and negative prompts. Code is available: //williechai.github.io/speedup-plugin-for-stable-diffusions.github.io.
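For context, the offset between positive-prompt and negative-prompt predictions is the quantity that standard classifier-free guidance computes at every denoising step, using two U-Net evaluations. The NumPy sketch below shows only that standard combination; it is not SUN's adapter or the MSC loss, and the function name, array shapes, and guidance scale are illustrative assumptions.

```python
import numpy as np

def guided_noise(eps_pos, eps_neg, guidance_scale):
    """Standard classifier-free guidance (not SUN itself).

    eps_pos: noise predicted with the positive prompt
    eps_neg: noise predicted with the negative (or empty) prompt
    """
    offset = eps_pos - eps_neg  # offset of the positive prompt relative to the negative one
    return eps_neg + guidance_scale * offset

# Random arrays stand in for two U-Net outputs on one latent (hypothetical shapes).
rng = np.random.default_rng(0)
eps_pos = rng.standard_normal((4, 8, 8))
eps_neg = rng.standard_normal((4, 8, 8))
eps = guided_noise(eps_pos, eps_neg, guidance_scale=7.5)
```

With a guidance scale of 1 the result reduces to the positive-prompt prediction, and with 0 it reduces to the negative-prompt prediction.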

Related content

November 8, 2024

Process-and-Forward: Deep Joint Source-Channel Coding Over Cooperative Relay Networks

Chenghong Bian,Yulin Shao,Haotian Wu,Emre Ozfatura,Deniz Gunduz

from arxiv, Accepted to IEEE JSAC, 2024

We introduce deep joint source-channel coding (DeepJSCC) schemes for image transmission over cooperative relay channels. The relay either amplifies-and-forwards its received signal, called DeepJSCC-AF, or leverages neural networks to extract relevant features from its received signal, called DeepJSCC-PF (Process-and-Forward). We consider both half- and full-duplex relays, and propose a novel transformer-based model at the relay. For a half-duplex relay, it is shown that the proposed scheme learns to generate correlated signals at the relay and source to obtain beamforming gains. In the full-duplex case, we introduce a novel block-based transmission strategy, in which the source transmits in blocks, and the relay updates its knowledge about the input signal after each block and generates its own signal. To enhance practicality, a single transformer-based model is used at the relay at each block, together with an adaptive transmission module, which allows the model to seamlessly adapt to different channel qualities and transmission powers. Simulation results demonstrate the superior performance of DeepJSCC-PF compared to the state-of-the-art BPG image compression algorithm operating at the maximum achievable rate of conventional decode-and-forward and compress-and-forward protocols, in both half- and full-duplex relay scenarios over AWGN and Rayleigh fading channels.
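The amplify-and-forward step can be illustrated in a few lines: the relay rescales whatever it received, noise included, so that the forwarded signal meets its power budget. This is a generic textbook sketch over an AWGN source-to-relay link, not the DeepJSCC-AF implementation; the function name, noise level, and power value are assumptions.

```python
import numpy as np

def amplify_and_forward(received, relay_power):
    """Scale the received signal so its average power equals relay_power."""
    avg_power = np.mean(np.abs(received) ** 2)
    return np.sqrt(relay_power / avg_power) * received

rng = np.random.default_rng(0)
n = 1024
# Unit-power complex source symbols (stand-in for encoder output).
source = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
# Complex Gaussian noise on the source-to-relay link (assumed noise level).
noise = 0.1 * (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
received = source + noise
relayed = amplify_and_forward(received, relay_power=2.0)
```

Because the gain is a single scalar, the relay amplifies the noise together with the signal, which is the limitation a processing relay such as DeepJSCC-PF is meant to address.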

November 7, 2024

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Weixin Liang,Lili Yu,Liang Luo,Srinivasan Iyer,Ning Dong,Chunting Zhou,Gargi Ghosh,Mike Lewis,Wen-tau Yih,Luke Zettlemoyer,Xi Victoria Lin

The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
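The idea of modality-specific parameters combined with global self-attention can be sketched as a toy single-head block in NumPy. This is an illustration of the description in the abstract only, not the authors' implementation: layer normalization and causal masking are omitted for brevity, and all names, sizes, and the ReLU feed-forward are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def per_modality_linear(x, weights, modality):
    """Apply weights[m] to the tokens whose modality id is m."""
    out = np.zeros((x.shape[0], weights.shape[-1]))
    for m in range(weights.shape[0]):
        mask = modality == m
        out[mask] = x[mask] @ weights[m]
    return out

def mot_block(x, modality, params):
    d = x.shape[-1]
    # Projections use each token's own modality weights.
    q = per_modality_linear(x, params["wq"], modality)
    k = per_modality_linear(x, params["wk"], modality)
    v = per_modality_linear(x, params["wv"], modality)
    # Attention itself is global: every token attends to the full sequence.
    attn = softmax(q @ k.T / np.sqrt(d))
    h = x + per_modality_linear(attn @ v, params["wo"], modality)
    # Feed-forward network is also selected per modality.
    hidden = np.maximum(per_modality_linear(h, params["w1"], modality), 0.0)
    return h + per_modality_linear(hidden, params["w2"], modality)

rng = np.random.default_rng(0)
num_modalities, d, d_ff, seq_len = 2, 8, 16, 6
shapes = {"wq": (d, d), "wk": (d, d), "wv": (d, d), "wo": (d, d),
          "w1": (d, d_ff), "w2": (d_ff, d)}
params = {name: 0.1 * rng.standard_normal((num_modalities, *shape))
          for name, shape in shapes.items()}
x = rng.standard_normal((seq_len, d))
modality = np.array([0, 0, 1, 1, 0, 1])  # e.g. 0 = text token, 1 = image token
y = mot_block(x, modality, params)
```

In this sketch each token only ever touches one modality's weight matrices, so the per-token compute matches a dense block of the same width while the parameters are split by modality.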

November 7, 2024

SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

Koichi Namekata,Sherwin Bahmani,Ziyi Wu,Yash Kant,Igor Gilitschenski,David B. Lindell

from arxiv, Project page: //kmcode1.github.io/Projects/SG-I2V/

Methods for image-to-video generation have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error, e.g., involving re-generating videos with different random seeds. Recent techniques address this issue by fine-tuning a pre-trained model to follow conditioning signals, such as bounding boxes or point trajectories. Yet, this fine-tuning procedure can be computationally expensive, and it requires datasets with annotated object motion, which can be difficult to procure. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided, offering zero-shot control by relying solely on the knowledge present in a pre-trained image-to-video diffusion model without the need for fine-tuning or external knowledge. Our zero-shot method outperforms unsupervised baselines while being competitive with supervised models in terms of visual quality and motion fidelity.

November 7, 2024

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Shehan Munasinghe,Hanan Gani,Wenqi Zhu,Jiale Cao,Eric Xing,Fahad Shahbaz Khan,Salman Khan

from arxiv, Technical Report of VideoGLaMM

Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design seamlessly connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated via tunable V-L and L-V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually-grounded conversations using a semiautomatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.

November 6, 2024

Gaussian Deja-vu: Creating Controllable 3D Gaussian Head-Avatars with Enhanced Generalization and Personalization Abilities

Peizhi Yan,Rabab Ward,Qiang Tang,Shan Du

from arxiv, 11 pages, Accepted by WACV 2025 in Round 1

Recent advancements in 3D Gaussian Splatting (3DGS) have unlocked significant potential for modeling 3D head avatars, providing greater flexibility than mesh-based methods and more efficient rendering compared to NeRF-based approaches. Despite these advancements, the creation of controllable 3DGS-based head avatars remains time-intensive, often requiring tens of minutes to hours. To expedite this process, we here introduce the "Gaussian Deja-vu" framework, which first obtains a generalized model of the head avatar and then personalizes the result. The generalized model is trained on large 2D (synthetic and real) image datasets. This model provides a well-initialized 3D Gaussian head that is further refined using a monocular video to achieve the personalized head avatar. For personalizing, we propose learnable expression-aware rectification blendmaps to correct the initial 3D Gaussians, ensuring rapid convergence without the reliance on neural networks. Experiments demonstrate that the proposed method meets its objectives. It outperforms state-of-the-art 3D Gaussian head avatars in terms of photorealistic quality and reduces training time to at least a quarter of that of existing methods, producing the avatar in minutes.

November 6, 2024

H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models

Nhi Pham,Michael Schott

from arxiv, Poster at //sites.google.com/berkeley.edu/bb-stat/home

By leveraging both texts and images, large vision language models (LVLMs) have shown significant progress in various multi-modal tasks. Nevertheless, these models often suffer from hallucinations, e.g., they exhibit inconsistencies between the visual input and the textual output. To address this, we propose H-POPE, a coarse-to-fine-grained benchmark that systematically assesses hallucination in object existence and attributes. Our evaluation shows that models are prone to hallucinations on object existence, and even more so on fine-grained attributes. We further investigate whether these models rely on visual input to formulate the output texts.

November 6, 2024

ReEdit: Multimodal Exemplar-Based Image Editing with Diffusion Models

Ashutosh Srivastava,Tarun Ram Menta,Abhinav Java,Avadhoot Jadhav,Silky Singh,Surgan Jandial,Balaji Krishnamurthy

from arxiv, First three authors contributed equally to this work

Modern Text-to-Image (T2I) Diffusion models have revolutionized image editing by enabling the generation of high-quality photorealistic images. While the de facto method for performing edits with T2I models is through text instructions, this approach is non-trivial due to the complex many-to-many mapping between natural language and images. In this work, we address exemplar-based image editing -- the task of transferring an edit from an exemplar pair to one or more content images. We propose ReEdit, a modular and efficient end-to-end framework that captures edits in both text and image modalities while ensuring the fidelity of the edited image. We validate the effectiveness of ReEdit through extensive comparisons with state-of-the-art baselines and sensitivity analyses of key design choices. Our results demonstrate that ReEdit consistently outperforms contemporary approaches both qualitatively and quantitatively. Additionally, ReEdit boasts high practical applicability, as it does not require any task-specific optimization and is four times faster than the next best baseline.

November 6, 2024

HRDecoder: High-Resolution Decoder Network for Fundus Image Lesion Segmentation

Ziyuan Ding,Yixiong Liang,Shichao Kan,Qing Liu

from arxiv, 11 pages, 3 figures, accepted by MICCAI 2024, the revised version

High resolution is crucial for precise segmentation in fundus images, yet handling high-resolution inputs incurs considerable GPU memory costs, with diminishing performance gains as overhead increases. To address this issue while tackling the challenge of segmenting tiny objects, recent studies have explored local-global fusion methods. These methods preserve fine details using local regions and capture long-range context information from downscaled global images. However, the necessity of multiple forward passes inevitably incurs significant computational overhead, adversely affecting inference speed. In this paper, we propose HRDecoder, a simple High-Resolution Decoder network for fundus lesion segmentation. It integrates a high-resolution representation learning module to capture fine-grained local features and a high-resolution fusion module to fuse multi-scale predictions. Our method effectively improves the overall segmentation accuracy of fundus lesions while consuming reasonable memory and computational overhead, and maintaining satisfying inference speed. Experimental results on the IDRID and DDR datasets demonstrate the effectiveness of our method. Code is available at //github.com/CVIU-CSU/HRDecoder.

November 6, 2024

IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation

Yinwei Wu,Xianpan Zhou,Bing Ma,Xuefeng Su,Kai Ma,Xinchao Wang

While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of individual instances, they struggle to accurately position multiple instances and control the generation of their features. The Layout-to-Image (L2I) task was introduced to address the positioning challenges by incorporating bounding boxes as spatial control signals, but it still falls short in generating precise instance features. In response, we propose the Instance Feature Generation (IFG) task, which aims to ensure both positional accuracy and feature fidelity in generated instances. To address the IFG task, we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances feature depiction by incorporating additional appearance tokens and utilizing an Instance Semantic Map to align instance-level features with spatial locations. The IFAdapter guides the diffusion process as a plug-and-play module, making it adaptable to various community models. For evaluation, we contribute an IFG benchmark and develop a verification pipeline to objectively compare models' abilities to generate instances with accurate positioning and features. Experimental results demonstrate that IFAdapter outperforms other models in both quantitative and qualitative evaluations.

November 6, 2024

ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Contrastive Framework

Hengyuan Zhang,Chenming Shang,Sizhe Wang,Dongdong Zhang,Renliang Sun,Yiyao Yu,Yujiu Yang,Furu Wei

from arxiv, 23 pages, 11 figures

Although fine-tuning Large Language Models (LLMs) with multilingual data can rapidly enhance their multilingual capabilities, these models still exhibit a performance gap between the dominant language (e.g., English) and non-dominant ones due to the imbalance of training data across languages. To further enhance the performance of non-dominant languages, we propose ShifCon, a Shift-based Contrastive framework that aligns the internal forward process of other languages toward that of the dominant one. Specifically, it shifts the representations of non-dominant languages into the dominant language subspace, allowing them to access relatively rich information encoded in the model parameters. The enriched representations are then shifted back into their original language subspace before generation. Moreover, we introduce a subspace distance metric to pinpoint the optimal layer area for shifting representations and employ multilingual contrastive learning to further enhance the alignment of representations within this area. Experiments demonstrate that our ShifCon framework significantly enhances the performance of non-dominant languages, particularly for low-resource ones. Further analysis offers extra insights to verify the effectiveness of ShifCon and propel future research.