
from arxiv, The project (including the collected dataset VidProM and related code) is publicly available at //vidprom.github.io under the CC-BY-NC 4.0 License

The arrival of Sora marks a new era for text-to-video diffusion models, bringing significant advancements in video generation and potential applications. However, Sora, along with other text-to-video diffusion models, is highly reliant on prompts, and there is no publicly available dataset that features a study of text-to-video prompts. In this paper, we introduce VidProM, the first large-scale dataset comprising 1.67 Million unique text-to-Video Prompts from real users. Additionally, this dataset includes 6.69 million videos generated by four state-of-the-art diffusion models, alongside some related data. We initially discuss the curation of this large-scale dataset, a process that is both time-consuming and costly. Subsequently, we underscore the need for a new prompt dataset specifically designed for text-to-video generation by illustrating how VidProM differs from DiffusionDB, a large-scale prompt-gallery dataset for image generation. Our extensive and diverse dataset also opens up many exciting new research areas. For instance, we suggest exploring text-to-video prompt engineering, efficient video generation, and video copy detection for diffusion models to develop better, more efficient, and safer models. The project (including the collected dataset VidProM and related code) is publicly available at //vidprom.github.io under the CC-BY-NC 4.0 License.

Related content

Datasets


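A dataset, also called a data set or data collection, is a collection of data.
A data set (or dataset) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question and lists that member's value for each variable, such as the height and weight of an object or the value of a random number. Each value is known as a datum. Depending on the number of rows, the data set may contain data for one or more members.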

MoDELS · Performer · Comprehensibility · Model evaluation · Sample

June 27, 2024

ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos

Jr-Jen Chen, Yu-Chien Liao, Hsi-Che Lin, Yu-Chu Yu, Yen-Chun Chen, Yu-Chiang Frank Wang

We introduce ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning within video events. Specifically, ReXTime focuses on reasoning across time, i.e. human-like understanding when the question and its corresponding answer occur in different video segments. This form of reasoning, requiring advanced understanding of cause-and-effect relationships across video segments, poses significant challenges to even the frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotations. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still lag behind human performance by a significant 14.3% accuracy gap. Additionally, our pipeline creates a training dataset of 9,695 machine generated samples without manual effort, which empirical studies suggest can enhance the across-time reasoning via fine-tuning.

SLIM · Performer · Multimodal · Comprehensibility · Tokenizer

June 27, 2024

DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

Jiaxin Zhang, Wentao Yang, Songxuan Lai, Zecheng Xie, Lianwen Jin

Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception, it also leads to longer sequences of visual tokens, increasing computational costs and straining the models' ability to handle long contexts. To address these challenges, we introduce DocKylin, a document-centric MLLM that performs visual content slimming at both the pixel and token levels, thereby reducing token sequence length in VDU scenarios. DocKylin utilizes an Adaptive Pixel Slimming (APS) preprocessing module to perform pixel-level slimming, increasing the proportion of informative pixels. Moreover, DocKylin incorporates a novel Dynamic Token Slimming (DTS) module to conduct token-level slimming, filtering essential tokens and removing others to create a compressed, adaptive visual sequence. Experiments demonstrate DocKylin's promising performance across various VDU benchmarks. Notably, both the proposed APS and DTS are parameter-free, facilitating easy integration into existing MLLMs, and our experiments indicate their potential for broader applications.

MoDELS · Multimodal · Supervision · Score · Generative model

June 27, 2024

EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models

Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Mengping Yang, Cheng Zhang, Hao Li

from arxiv, Github Repository: //github.com/SAIS-FUXI/EvalAlign

Recent advancements in text-to-image generative models have been remarkable. Yet the field lacks evaluation metrics that accurately reflect the performance of these models, and in particular lacks fine-grained metrics that can guide their optimization. In this paper, we propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) pre-trained on extensive datasets. We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment. Each protocol comprises a set of detailed, fine-grained instructions linked to specific scoring options, enabling precise manual scoring of the generated images. We apply supervised fine-tuning (SFT) to the MLLM so that it aligns closely with human evaluative judgments, resulting in a robust evaluation model. Our comprehensive tests across 24 text-to-image generation models demonstrate that EvalAlign not only provides superior metric stability but also aligns more closely with human preferences than existing metrics, confirming its effectiveness and utility in model assessment.

Presentation · INTERACT · TOOLS · Directed · Attention

June 25, 2024

VisConductor: Affect-Varying Widgets for Animated Data Storytelling in Gesture-Aware Augmented Video Presentation

Temiloluwa Femi-Gege, Matthew Brehmer, Jian Zhao

from arxiv, To appear in ACM ISS'24

Augmented video presentation tools provide a natural way for presenters to interact with their content, resulting in engaging experiences for remote audiences, such as when a presenter uses hand gestures to manipulate and direct attention to visual aids overlaid on their webcam feed. However, authoring and customizing these presentations can be challenging, particularly when presenting dynamic data visualization (i.e., animated charts). To this end, we introduce VisConductor, an authoring and presentation tool that equips presenters with the ability to configure gestures that control affect-varying visualization animation, foreshadow visualization transitions, direct attention to notable data points, and animate the disclosure of annotations. These gestures are integrated into configurable widgets, allowing presenters to trigger content transformations by executing gestures within widget boundaries, with feedback visible only to them. Altogether, our palette of widgets provides a level of flexibility appropriate for improvisational presentations and ad-hoc content transformations, such as when responding to audience engagement. To evaluate VisConductor, we conducted two studies focusing on presenters (N = 11) and audience members (N = 11). Our findings indicate that our approach taken with VisConductor can facilitate interactive and engaging remote presentations with dynamic visual aids. Reflecting on our findings, we also offer insights to inform the future of augmented video presentation tools.

MoDELS · Transformation · Language modeling · Speech synthesis · Noise

June 25, 2024

SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka

from arxiv, To appear in TASLP. See //aka.ms/speechx for demo samples

Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See //aka.ms/speechx for demo samples.

Identifiable · Reducible · Range · Diversity · Presentation

June 25, 2024

Soundify: Matching Sound Effects to Video

David Chuan-En Lin, Anastasis Germanidis, Cristóbal Valenzuela, Yining Shi, Nikolas Martelaro

from arxiv, //soundify.cc

In the art of video editing, sound helps add character to an object and immerse the viewer within a space. Through formative interviews with professional editors (N=10), we found that the task of adding sounds to video can be challenging. This paper presents Soundify, a system that assists editors in matching sounds to video. Given a video, Soundify identifies matching sounds, synchronizes the sounds to the video, and dynamically adjusts panning and volume to create spatial audio. In a human evaluation study (N=889), we show that Soundify is capable of matching sounds to video out-of-the-box for a diverse range of audio categories. In a within-subjects expert study (N=12), we demonstrate the usefulness of Soundify in helping video editors match sounds to video with lighter workload, reduced task completion time, and improved usability.

3D · Extensibility · Performance · Separated · HTTPS

June 24, 2024

ClotheDreamer: Text-Guided Garment Generation with 3D Gaussians

Yufei Liu, Junshu Tang, Chu Zheng, Shijie Zhang, Jinkun Hao, Junwei Zhu, Dongjin Huang

from arxiv, Project Page: //ggxxii.github.io/clothedreamer

High-fidelity 3D garment synthesis from text is desirable yet challenging for digital avatar creation. Recent diffusion-based approaches using Score Distillation Sampling (SDS) have enabled new possibilities, but the garments they produce are either tightly coupled with the human body or difficult to reuse. We introduce ClotheDreamer, a 3D Gaussian-based method for generating wearable, production-ready 3D garment assets from text prompts. We propose a novel representation, Disentangled Clothe Gaussian Splatting (DCGS), to enable separate optimization. DCGS represents the clothed avatar as a single Gaussian model but freezes the body Gaussian splats. To enhance quality and completeness, we incorporate bidirectional SDS to supervise the clothed avatar and garment RGBD renderings respectively with pose conditions, and we propose a new pruning strategy for loose clothing. Our approach also supports custom clothing templates as input. As a result of this design, the synthesized 3D garments can be easily applied to virtual try-on and support physically accurate animation. Extensive experiments showcase our method's superior and competitive performance. Our project page is at //ggxxii.github.io/clothedreamer.

Correlation coefficient · MoDELS · Score · Dataset · Learning

June 24, 2024

VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, Wenhu Chen

Recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metrics is able to provide reliable scores for generated videos. The main barrier is the lack of a large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect scores for 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further results on the held-out EvalCrafter, GenAI-Bench, and VBench show that VideoScore consistently has a much higher correlation with human judges than other metrics. Given these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress and (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.
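
For readers unfamiliar with the measure cited above, the following is a minimal sketch of how a Spearman rank correlation between an automatic metric and human ratings can be computed. It is not the authors' evaluation code: the function names `average_ranks` and `spearman` and the sample scores are hypothetical, and the reading of the reported 77.1 as a correlation multiplied by 100 is an assumption.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x ** 0.5 * var_y ** 0.5)


# Hypothetical data: metric scores and human ratings for five videos.
metric_scores = [0.10, 0.35, 0.50, 0.72, 0.90]
human_ratings = [2, 1, 4, 3, 5]
print(round(spearman(metric_scores, human_ratings), 3))  # approximately 0.8
```

Because only the ranks matter, the metric and the human ratings may use different scales; a value near 1 means the metric orders the videos almost the same way human raters do.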

state-of-the-art · MoDELS · Diversity · Performer · Extensibility

June 24, 2024

GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization

Yirui Chen, Xudong Huang, Quan Zhang, Wei Li, Mingjian Zhu, Qiangyu Yan, Simiao Li, Hanting Chen, Hailin Hu, Jie Yang, Wei Liu, Jie Hu

from arxiv, Code page: //github.com/chenyirui/GIM

The extraordinary ability of generative models has emerged as a new trend in image editing and realistic image generation, posing a serious threat to the trustworthiness of multimedia data and driving research on image manipulation detection and localization (IMDL). However, the lack of a large-scale data foundation makes the IMDL task unattainable. In this paper, a local manipulation pipeline is designed, incorporating the powerful SAM, ChatGPT, and generative models. On this basis, we propose the GIM dataset, which has the following advantages: 1) Large scale, including over one million pairs of AI-manipulated images and real images. 2) Rich image content, encompassing a broad range of image classes. 3) Diverse generative manipulation, with images manipulated by state-of-the-art generators across various manipulation tasks. These advantages allow for a more comprehensive evaluation of IMDL methods, extending their applicability to diverse images. We introduce two benchmark settings to evaluate the generalization capability and comprehensive performance of baseline methods. In addition, we propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, a Frequency-Spatial Block (FSB), and a Multi-window Anomalous Modelling (MWAM) module. Extensive experiments on GIM demonstrate that GIMFormer significantly surpasses previous state-of-the-art works on two different benchmarks.

MoDELS · Language modeling · Large language model · Multimodal · Robustness

June 23, 2024

AudioBench: A Universal Benchmark for Audio Large Language Models

Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, Nancy F. Chen

from arxiv, 20 pages; Preprint; Code: //github.com/AudioLLMs/AudioBench

We introduce AudioBench, a new benchmark designed to evaluate audio large language models (AudioLLMs). AudioBench encompasses 8 distinct tasks and 26 carefully selected or newly curated datasets, focusing on speech understanding, voice interpretation, and audio scene understanding. Despite the rapid advancement of large language models, including multimodal versions, a significant gap exists in comprehensive benchmarks for thoroughly evaluating their capabilities. AudioBench addresses this gap by providing relevant datasets and evaluation metrics. In our study, we evaluated the capabilities of four models across various aspects and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-source code, data, and leaderboard will offer a robust testbed for future model developments.