We propose a novel image editing technique that enables 3D manipulations on single images, such as object rotation and translation. Existing 3D-aware image editing approaches typically rely on synthetic multi-view datasets for training specialized models, thus constraining their effectiveness on open-domain images featuring significantly more varied layouts and styles. In contrast, our method directly leverages powerful image diffusion models trained on a broad spectrum of text-image pairs and thus retains their exceptional generalization abilities. This objective is realized through the development of an iterative novel view synthesis and geometry alignment algorithm. The algorithm harnesses diffusion models for dual purposes: they provide an appearance prior by predicting novel views of the selected object using estimated depth maps, and they act as a geometry critic by correcting misalignments in 3D shapes across the sampled views. Our method can generate high-quality 3D-aware image edits with large viewpoint transformations and high appearance and shape consistency with the input image, pushing the boundaries of what is possible with single-image 3D-aware editing.
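A minimal sketch of the iterative loop described above is shown below. The functions estimate_depth, render_novel_view, diffusion_inpaint, and align_geometry are hypothetical stand-ins for the depth estimator, the depth-based warp, the diffusion appearance prior, and the geometry critic; the sketch illustrates the alternation only, not the authors' implementation.

```python
# Sketch of iterative novel-view synthesis with geometry alignment.
# All components below are placeholder stubs (assumptions), not the paper's code.
import numpy as np

def estimate_depth(image: np.ndarray) -> np.ndarray:
    # placeholder monocular depth estimator
    return np.ones(image.shape[:2], dtype=np.float32)

def render_novel_view(image, depth, rotation_deg):
    # placeholder: warp the selected object to the target pose using the depth map
    return image.copy()

def diffusion_inpaint(warped):
    # placeholder: diffusion model fills disocclusions / refines appearance
    return warped

def align_geometry(view, depth):
    # placeholder: diffusion model acts as a geometry critic and corrects the depth
    return depth

def edit_3d(image, target_rotation_deg, n_iters=3):
    depth = estimate_depth(image)
    view = image
    for _ in range(n_iters):
        view = render_novel_view(image, depth, target_rotation_deg)
        view = diffusion_inpaint(view)        # appearance prior
        depth = align_geometry(view, depth)   # geometry critic
    return view

edited = edit_3d(np.zeros((256, 256, 3), dtype=np.float32), target_rotation_deg=45)
```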
Recent advances in diffusion models have made it possible to generate high-quality, visually striking images from text. However, multi-turn image generation, which is in high demand in real-world scenarios, still faces challenges in maintaining semantic consistency between images and texts, as well as contextual consistency of the same subject across multiple interactive turns. To address this issue, we introduce TheaterGen, a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models to provide the capability of multi-turn image generation. Within this framework, LLMs, acting as a "Screenwriter", engage in multi-turn interaction, generating and managing a standardized prompt book that encompasses prompts and layout designs for each character in the target image. Based on these, TheaterGen generates a list of character images and extracts guidance information, akin to a "Rehearsal". Subsequently, by incorporating the prompt book and guidance information into the reverse denoising process of T2I diffusion models, TheaterGen generates the final image, conducting the "Final Performance". With the effective management of prompt books and character images, TheaterGen significantly improves semantic and contextual consistency in synthesized images. Furthermore, we introduce a dedicated benchmark, CMIGBench (Consistent Multi-turn Image Generation Benchmark), with 8000 multi-turn instructions. Unlike previous multi-turn benchmarks, CMIGBench does not define characters in advance. Both story generation and multi-turn editing tasks are included in CMIGBench for comprehensive evaluation. Extensive experimental results show that TheaterGen outperforms state-of-the-art methods significantly. It raises the performance bar of the cutting-edge Mini DALLE 3 model by 21% in average character-character similarity and 19% in average text-image similarity.
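The following sketch illustrates the "Screenwriter, Rehearsal, Final Performance" flow described above under simplifying assumptions. The functions call_llm, generate_character_image, and guided_denoise, and the prompt-book fields, are illustrative placeholders rather than TheaterGen's actual API.

```python
# Illustrative multi-turn generation pipeline (stubs only, not TheaterGen's code).
from typing import Dict, List

def call_llm(dialogue_history: List[str]) -> List[Dict]:
    # placeholder LLM producing a standardized prompt book:
    # one prompt plus a layout (bounding box) per character
    return [{"character": "dog", "prompt": "a corgi sitting", "bbox": [0.1, 0.4, 0.4, 0.9]}]

def generate_character_image(entry: Dict) -> Dict:
    # placeholder per-character T2I generation ("Rehearsal")
    return {"image": None, "guidance": entry["bbox"]}

def guided_denoise(prompt_book: List[Dict], guidance: List) -> str:
    # placeholder reverse denoising conditioned on the prompt book and guidance
    return "final_image"

def multi_turn_generate(dialogue_history: List[str]) -> str:
    prompt_book = call_llm(dialogue_history)                         # Screenwriter
    rehearsal = [generate_character_image(e) for e in prompt_book]   # Rehearsal
    guidance = [r["guidance"] for r in rehearsal]
    return guided_denoise(prompt_book, guidance)                     # Final Performance

image = multi_turn_generate(["Turn 1: a corgi in a park", "Turn 2: make it rain"])
```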
State-of-the-art end-to-end Optical Music Recognition (OMR) has, to date, primarily been carried out using monophonic transcription techniques to handle complex score layouts, such as polyphony, often by resorting to simplifications or specific adaptations. Despite their efficacy, these approaches entail scalability challenges and inherent limitations. This paper presents the Sheet Music Transformer, the first end-to-end OMR model designed to transcribe complex musical scores without relying solely on monophonic strategies. Our model employs a Transformer-based image-to-sequence framework that predicts score transcriptions in a standard digital music encoding format from input images. The model has been tested on two polyphonic music datasets and has proven capable of handling these intricate music structures effectively. The experimental outcomes not only indicate the competence of the model, but also show that it outperforms state-of-the-art methods, thus contributing to advancements in end-to-end OMR transcription.
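As a rough illustration of the image-to-sequence formulation described above, the sketch below pairs a small convolutional feature extractor with a standard Transformer that autoregressively predicts music-encoding tokens. The dimensions, vocabulary size, and backbone are assumptions for demonstration, not the Sheet Music Transformer's architecture.

```python
# Minimal image-to-sequence model sketch (assumed dimensions, not the authors' code).
import torch
import torch.nn as nn

class Img2SeqOMR(nn.Module):
    def __init__(self, vocab_size=512, d_model=256):
        super().__init__()
        self.backbone = nn.Sequential(  # tiny conv feature extractor over the score image
            nn.Conv2d(1, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, score_image, tgt_tokens):
        feats = self.backbone(score_image)      # (B, C, H, W)
        src = feats.flatten(2).transpose(1, 2)  # (B, H*W, C) image "tokens"
        tgt = self.tok_emb(tgt_tokens)          # (B, T, C) shifted target tokens
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.head(out)                   # logits over the music-encoding vocabulary

model = Img2SeqOMR()
logits = model(torch.randn(2, 1, 64, 128), torch.randint(0, 512, (2, 16)))
```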
This paper introduces InverseMatrixVT3D, an efficient method for transforming multi-view image features into 3D feature volumes for 3D semantic occupancy prediction. Existing methods for constructing 3D volumes often rely on depth estimation, device-specific operators, or transformer queries, which hinders the widespread adoption of 3D occupancy models. In contrast, our approach leverages two projection matrices to store the static mapping relationships and uses matrix multiplications to efficiently generate global Bird's Eye View (BEV) features and local 3D feature volumes. Specifically, we achieve this by performing matrix multiplications between multi-view image feature maps and two sparse projection matrices. We introduce a sparse matrix handling technique for the projection matrices to optimize GPU memory usage. Moreover, a global-local attention fusion module is proposed to integrate the global BEV features with the local 3D feature volumes to obtain the final 3D volume. We also employ a multi-scale supervision mechanism to further enhance performance. Extensive experiments on the nuScenes and SemanticKITTI datasets show that our approach not only stands out for its simplicity and effectiveness but also achieves top performance in detecting vulnerable road users (VRUs), which is crucial for autonomous driving and road safety. The code has been made available at: //github.com/DanielMing123/InverseMatrixVT3D
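The core projection step described above can be pictured as two sparse matrix multiplications, as in the sketch below. The shapes, sparsity pattern, and averaging weights are illustrative assumptions (random here for demonstration); in the actual method the matrices encode the static camera-to-grid mapping.

```python
# Sketch: multi-view image features -> BEV features and a 3D volume via two
# precomputed sparse projection matrices (illustrative shapes and weights).
import torch

N_cam, H, W, C = 6, 16, 44, 64          # cameras, feature-map size, channels
n_pix = N_cam * H * W
X, Y, Z = 50, 50, 4                      # BEV grid size and volume height

# flatten all per-camera feature maps into one (n_pix, C) matrix
img_feats = torch.randn(n_pix, C)

def random_projection(n_rows, n_pix, nnz_per_row=4):
    # each BEV cell / voxel gathers a few image pixels with averaging weights
    rows = torch.arange(n_rows).repeat_interleave(nnz_per_row)
    cols = torch.randint(0, n_pix, (rows.numel(),))
    vals = torch.full((rows.numel(),), 1.0 / nnz_per_row)
    return torch.sparse_coo_tensor(torch.stack([rows, cols]), vals,
                                   (n_rows, n_pix)).coalesce()

P_bev = random_projection(X * Y, n_pix)       # global BEV projection matrix
P_vol = random_projection(X * Y * Z, n_pix)   # local 3D-volume projection matrix

bev = torch.sparse.mm(P_bev, img_feats).view(X, Y, C)       # global BEV features
vol = torch.sparse.mm(P_vol, img_feats).view(X, Y, Z, C)    # local 3D feature volume
```

Storing the projection matrices in sparse form keeps memory proportional to the number of nonzero pixel-to-cell assignments rather than to the full grid-by-pixel product.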
Despite recent advancements in audio-text modeling, audio-text contrastive models still lag behind their image-text counterparts in scale and performance. We propose a method to improve both the scale and the training of audio-text contrastive models. Specifically, we craft a large-scale audio-text dataset containing 13,000 hours of text-labeled audio, using pretrained language models to process noisy text descriptions and automatic captioning to obtain text descriptions for unlabeled audio samples. We first train on audio-only data with a masked autoencoder (MAE) objective, which allows us to benefit from the scalability of unlabeled audio datasets. We then, initializing our audio encoder from the MAE model, train a contrastive model with an auxiliary captioning objective. Our final model, which we name Cacophony, achieves state-of-the-art performance on audio-text retrieval tasks, and exhibits competitive results on the HEAR benchmark and other downstream tasks such as zero-shot classification.
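The second training stage described above can be sketched as a symmetric audio-text contrastive loss combined with an auxiliary captioning loss, as below. The encoders are omitted and the loss weighting is an assumed placeholder; this is not Cacophony's implementation.

```python
# Sketch of a contrastive objective with an auxiliary captioning term.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    # symmetric InfoNCE: audio-to-text and text-to-audio
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def training_step(audio_emb, text_emb, caption_logits, caption_tokens, alpha=1.0):
    l_con = contrastive_loss(audio_emb, text_emb)
    # auxiliary captioning objective: predict the text description tokens
    l_cap = F.cross_entropy(caption_logits.transpose(1, 2), caption_tokens)
    return l_con + alpha * l_cap

B, D, T, V = 8, 512, 20, 1000
loss = training_step(torch.randn(B, D), torch.randn(B, D),
                     torch.randn(B, T, V), torch.randint(0, V, (B, T)))
```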
The ability to provide fine-grained control for generating and editing visual imagery has profound implications for computer vision and its applications. Previous works have explored extending controllability in two directions: instruction tuning with text-based prompts and multi-modal conditioning. However, these works make one or more unnatural assumptions about the number and/or type of modality inputs used to express controllability. We propose InstructAny2Pix, a flexible multi-modal instruction-following system that enables users to edit an input image using instructions involving audio, images, and text. InstructAny2Pix consists of three building blocks that facilitate this capability: a multi-modal encoder that encodes different modalities such as images and audio into a unified latent space, a diffusion model that learns to decode representations in this latent space into images, and a multi-modal LLM that can understand instructions involving multiple images and audio pieces and generate a conditional embedding of the desired output, which can be used by the diffusion decoder. Additionally, to improve training efficiency and generation quality, we include a refinement prior module that enhances the visual quality of the LLM outputs. These designs are critical to the performance of our system. We demonstrate that our system can perform a series of novel instruction-guided editing tasks. The code is available at //github.com/jacklishufan/InstructAny2Pix.git
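A possible inference flow through the three building blocks and the refinement prior described above is sketched below. All components are stubbed with assumed interfaces for illustration; they do not reflect the released InstructAny2Pix code.

```python
# Illustrative inference flow (placeholder stubs, assumed interfaces).
import numpy as np

def multimodal_encode(item):          # image / audio -> shared latent space
    return np.random.randn(768)

def mm_llm(instruction, latents):     # instruction + input latents -> output embedding
    return np.mean(latents, axis=0)

def refine_prior(embedding):          # refinement prior improves visual quality
    return embedding / np.linalg.norm(embedding)

def diffusion_decode(embedding):      # latent-conditioned diffusion decoder
    return np.zeros((512, 512, 3))

def edit(instruction, image, audio_clips):
    latents = np.stack([multimodal_encode(x) for x in [image, *audio_clips]])
    cond = mm_llm(instruction, latents)
    cond = refine_prior(cond)
    return diffusion_decode(cond)

out = edit("add the sound's mood to the photo", image="img.png", audio_clips=["rain.wav"])
```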
Recent advancements in subject-driven image generation have made significant strides. However, current methods still fall short in diverse application scenarios, as they require test-time tuning and cannot accept interleaved multi-image and text input. These limitations keep them far from the ultimate goal of "image as a foreign language in image generation." This paper presents Kosmos-G, a model that leverages the advanced multimodal perception capabilities of Multimodal Large Language Models (MLLMs) to tackle this challenge. Our approach aligns the output space of the MLLM with CLIP using the textual modality as an anchor and performs compositional instruction tuning on curated data. Kosmos-G demonstrates an impressive capability of zero-shot subject-driven generation with interleaved multi-image and text input. Notably, the score distillation instruction tuning requires no modifications to the image decoder. This allows for a seamless substitution of CLIP and effortless integration with a myriad of U-Net techniques ranging from fine-grained controls to personalized image decoder variants. We posit Kosmos-G as an initial attempt toward the goal of "image as a foreign language in image generation." The code can be found at //aka.ms/Kosmos-G
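The text-anchored alignment idea described above can be sketched as regressing the MLLM's output embeddings for a caption onto a frozen CLIP text encoder's embeddings of the same caption. The stubs mllm_encode and clip_text_encode and the embedding shapes are assumptions for illustration, not Kosmos-G's training code.

```python
# Sketch: align the MLLM output space with CLIP using text as the anchor.
import torch
import torch.nn.functional as F

def mllm_encode(captions):          # placeholder trainable MLLM output embeddings
    return torch.randn(len(captions), 77, 768, requires_grad=True)

def clip_text_encode(captions):     # placeholder frozen CLIP text encoder
    return torch.randn(len(captions), 77, 768)

def alignment_loss(captions):
    mllm_out = mllm_encode(captions)           # trainable side
    with torch.no_grad():
        clip_out = clip_text_encode(captions)  # frozen anchor
    return F.mse_loss(mllm_out, clip_out)

loss = alignment_loss(["a photo of a corgi wearing sunglasses"])
loss.backward()
```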
Advances in Generative AI tools have allowed designers to manipulate existing 3D models using text- or image-based prompts, enabling creators to explore different design goals. Photochromic color-changing systems, on the other hand, allow the surface texture of 3D models to be reprogrammed, enabling easy customization of physical objects and opening up the possibility of using object surfaces for data display. However, existing photochromic systems require the user to manually design the desired texture, inspect the simulation of the pattern on the object, and verify the efficacy of the generated pattern. These manual design, inspection, and verification steps prevent the user from efficiently exploring the design space of possible patterns. Thus, by designing an automated workflow for an end-to-end texture application process, we can allow rapid iteration over different practicable patterns. In this workshop paper, we discuss the possibilities of extending generative AI systems with material and design constraints for reprogrammable surfaces based on photochromic materials. By constraining generative AI systems to colors and materials that can be physically realized with photochromic dyes, we can create tools that allow users to explore different viable patterns with text- and image-based prompts. We identify two focus areas in this topic: photochromic material constraints and design constraints for data-encoded textures. We highlight the current limitations of using generative AI tools to create viable textures using photochromic material. Finally, we present possible approaches to augmenting generative AI methods to take the photochromic material constraints into account, allowing viable photochromic textures to be created rapidly and easily.
We present LearnedFTL, a new on-demand page-level flash translation layer (FTL) design that employs learned indexes to improve the address translation efficiency of flash-based SSDs. The first of its kind, it reduces the number of double reads induced by address translation in random read accesses. LearnedFTL proposes three key techniques: an in-place-update linear model to build learned indexes efficiently, a virtual PPN representation to obtain contiguous PPNs for sorted LPNs, and a group-based allocation and model training via GC/rewrite strategy to reduce the training overhead. By tightly integrating these key techniques, LearnedFTL considerably speeds up address translation while reducing the number of flash read accesses caused by address translation. Our extensive experiments on a FEMU-based prototype show that LearnedFTL can reduce address-translation-induced double reads by up to 55.5\%. As a result, LearnedFTL reduces the P99 tail latency by 2.9$\times$$\sim$12.2$\times$, with averages of 5.5$\times$ and 8.2$\times$ compared to the state-of-the-art TPFTL and LeaFTL schemes, respectively.
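A toy sketch of the learned-index idea described above is given below: a linear model maps a logical page number (LPN) to an approximate position among contiguous virtual PPNs, so a bounded local search can replace the extra mapping-page read. The error bound, grouping, and in-place-update training are simplified assumptions, not LearnedFTL's implementation.

```python
# Toy learned-index lookup for LPN -> PPN translation (simplified illustration).
import numpy as np

class LinearSegment:
    def __init__(self, lpns, ppns, max_err=4):
        # fit ppn ~= slope * lpn + intercept over a group of sorted LPNs
        self.slope, self.intercept = np.polyfit(lpns, ppns, deg=1)
        self.lpns, self.ppns = np.asarray(lpns), np.asarray(ppns)
        self.max_err = max_err

    def lookup(self, lpn):
        guess = int(round(self.slope * lpn + self.intercept))
        lo, hi = guess - self.max_err, guess + self.max_err
        # bounded local search instead of reading a mapping page from flash
        window = (self.ppns >= lo) & (self.ppns <= hi)
        hits = self.ppns[window][self.lpns[window] == lpn]
        return int(hits[0]) if hits.size else None

seg = LinearSegment(lpns=[100, 101, 102, 103], ppns=[5000, 5001, 5002, 5003])
assert seg.lookup(102) == 5002
```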
Colorizing grayscale images offers an engaging visual experience. Existing automatic colorization methods often fail to generate satisfactory results due to incorrect semantic colors and unsaturated colors. In this work, we propose an automatic colorization pipeline to overcome these challenges. We leverage the extraordinary generative ability of the diffusion prior to synthesize colors with plausible semantics. To overcome the artifacts introduced by the diffusion prior, we apply luminance-conditional guidance. Moreover, we adopt multimodal high-level semantic priors to help the model understand the image content and deliver saturated colors. Additionally, a luminance-aware decoder is designed to restore details and enhance overall visual quality. The proposed pipeline synthesizes saturated colors while maintaining plausible semantics. Experiments indicate that our proposed method balances diversity and fidelity, surpassing previous methods in perceptual realism and receiving the most human preference.
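A minimal sketch of luminance-conditional guidance in the spirit of the pipeline above: the diffusion prior proposes colors, and after each denoising step the result is projected back onto the known input luminance. denoise_step is a placeholder for the diffusion prior, and the Lab-space formulation is an assumption for illustration.

```python
# Sketch: diffusion-based colorization with a luminance-conditioning step.
import numpy as np

def denoise_step(lab, t):
    # placeholder diffusion-prior update on the color (a, b) channels
    lab = lab.copy()
    lab[..., 1:] += 0.01 * np.random.randn(*lab[..., 1:].shape)
    return lab

def colorize(gray_L, n_steps=50):
    h, w = gray_L.shape
    lab = np.concatenate([gray_L[..., None], np.zeros((h, w, 2))], axis=-1)
    for t in reversed(range(n_steps)):
        lab = denoise_step(lab, t)
        lab[..., 0] = gray_L   # luminance guidance: keep the input L channel fixed
    return lab                 # convert Lab -> RGB downstream

colored = colorize(np.random.rand(64, 64))
```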
The representation of a Configuration Space C plays a vital role in accelerating the search for a collision-free path for sampling-based motion planners, where the majority of computation time is spent on collision checking of states. Traditionally, planners evaluate representations of C through limited evaluations of collision-free paths using the collision checker or by reducing the dimensionality of C for visualization. However, a collision checker may indicate high accuracy even when only a subset of the original C is represented, limiting the motion planner's ability to find paths comparable to those in the original C. Additionally, dealing with high-dimensional Cs is challenging, as qualitative evaluations become increasingly difficult in dimensions higher than three, where reduced-dimensional evaluation of C may decrease accuracy in cluttered environments. In this paper, we present a novel approach for visualizing representations of high-dimensional Cs of manipulator robots in a 2D format. We provide a new tool for the qualitative evaluation of high-dimensional C approximations without reducing the original dimension. This enhances our ability to compare the accuracy and coverage of two different high-dimensional Cs. Leveraging the kinematic chain of manipulator robots and human color perception, we show the efficacy of our method using the 7-degree-of-freedom C of a manipulator robot. This visualization offers qualitative insights into the joint boundaries of the robot and the coverage of collision state combinations without reducing the dimensionality of the original data. To support our claims, we conduct a numerical evaluation of the proposed visualization.
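One way to picture the idea described above is sketched below: a 7-DoF configuration is drawn in 2D by mapping some joints to image coordinates and the remaining joints to a color, so collision states of the full-dimensional C remain distinguishable without dimensionality reduction. The particular joint-to-axis/color split is an illustrative assumption, not the paper's exact encoding.

```python
# Toy sketch: encode a 7-DoF configuration as a 2D position plus a color.
import numpy as np

def config_to_pixel_and_color(q, joint_limits):
    # normalize each joint angle to [0, 1] using its limits
    lo, hi = joint_limits[:, 0], joint_limits[:, 1]
    u = (np.asarray(q) - lo) / (hi - lo)
    xy = u[:2]              # first two joints -> 2D position in the image
    rgb = u[2:5]            # next three joints -> color channels
    shade = u[5:7].mean()   # remaining joints modulate brightness
    return xy, np.clip(rgb * shade, 0, 1)

limits = np.tile(np.array([[-np.pi, np.pi]]), (7, 1))
xy, color = config_to_pixel_and_color(np.zeros(7), limits)
```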