99热日韩这里只有国产中文精品_高中小鲜肉自慰GAY免费_美女视频网站黄还免费_青青青免费国产在线91_1024手机看片你懂得_国产欧美日韩一区二区免费_国产精品久久久久无码AV网站

from arxiv, Project page with all the artifacts - //ruchitrawal.github.io/cinepile/. Updated version with adversarial refinement pipeline and more model evaluations

Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset. The findings indicate that although current models underperform compared to humans, fine-tuning these models can lead to significant improvements in their performance.

相關內容

數據集

關注 88

數據集，又稱為資料集、數據集合或資料集合，是一種由數據所組成的集合。
Data set（或dataset）是一個數據的集合，通常以表格形式出現。每一列代表一個特定變量。每一行都對應于某一成員的數據集的問題。它列出的價值觀為每一個變量，如身高和體重的一個物體或價值的隨機數。每個數值被稱為數據資料。對應于行數，該數據集的數據可能包括一個或多個成員。

MoDELS · 控制器 · 相似度 · Pair · contrastive ·

2024 年 12 月 10 日

StyleMaster: Stylize Your Video with Artistic Generation and Translation

Zixuan Ye,Huijuan Huang,Xintao Wang,Pengfei Wan,Di Zhang,Wenhan Luo

from arxiv, Project webpage available at //zixuan-ye.github.io/stylemaster

Style control has been popular in video generation models. Existing methods often generate videos far from the given style, cause content leakage, and struggle to transfer one video to the desired style. Our first observation is that the style extraction stage matters, whereas existing methods emphasize global style but ignore local textures. In order to bring texture features while preventing content leakage, we filter content-related patches while retaining style ones based on prompt-patch similarity; for global style extraction, we generate a paired style dataset through model illusion to facilitate contrastive learning, which greatly enhances the absolute style consistency. Moreover, to fill in the image-to-video gap, we train a lightweight motion adapter on still videos, which implicitly enhances stylization extent, and enables our image-trained model to be seamlessly applied to videos. Benefited from these efforts, our approach, StyleMaster, not only achieves significant improvement in both style resemblance and temporal coherence, but also can easily generalize to video style transfer with a gray tile ControlNet. Extensive experiments and visualizations demonstrate that StyleMaster significantly outperforms competitors, effectively generating high-quality stylized videos that align with textual content and closely resemble the style of reference images. Our project page is at //zixuan-ye.github.io/stylemaster

MoDELS · SimPLe · 設計 · 穩健性 · 數據監管 ·

2024 年 12 月 10 日

STIV: Scalable Text and Image Conditioned Video Generation

Zongyu Lin,Wei Liu,Chen Chen,Jiasen Lu,Wenze Hu,Tsu-Jui Fu,Jesse Allardice,Zhengfeng Lai,Liangchen Song,Bowen Zhang,Cha Chen,Yiran Fei,Yifan Jiang,Lezhi Li,Yizhou Sun,Kai-Wei Chang,Yinfei Yang

The field of video generation has made remarkable advancements, yet there remains a pressing need for a clear, systematic recipe that can guide the development of robust and scalable models. In this work, we present a comprehensive study that systematically explores the interplay of model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method, named STIV. Our framework integrates image condition into a Diffusion Transformer (DiT) through frame replacement, while incorporating text conditioning via a joint image-text conditional classifier-free guidance. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously. Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation, multi-view generation, and long video generation, etc. With comprehensive ablation studies on T2I, T2V, and TI2V, STIV demonstrate strong performance, despite its simple design. An 8.7B model with 512 resolution achieves 83.1 on VBench T2V, surpassing both leading open and closed-source models like CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on VBench I2V task at 512 resolution. By providing a transparent and extensible recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress toward more versatile and reliable video generation solutions.

MoDELS · 監督 · Backbone · 分離的 · Extensibility ·

2024 年 12 月 10 日

Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation

Hyeonho Jeong,Chun-Hao Paul Huang,Jong Chul Ye,Niloy Mitra,Duygu Ceylan

from arxiv, Project page: hyeonho99.github.io/track4gen

While recent foundational video generators produce visually rich output, they still struggle with appearance drift, where objects gradually degrade or change inconsistently across frames, breaking visual coherence. We hypothesize that this is because there is no explicit supervision in terms of spatial tracking at the feature level. We propose Track4Gen, a spatially aware video generator that combines video diffusion loss with point tracking across frames, providing enhanced spatial supervision on the diffusion features. Track4Gen merges the video generation and point tracking tasks into a single network by making minimal changes to existing video generation architectures. Using Stable Video Diffusion as a backbone, Track4Gen demonstrates that it is possible to unify video generation and point tracking, which are typically handled as separate tasks. Our extensive evaluations show that Track4Gen effectively reduces appearance drift, resulting in temporally stable and visually coherent video generation. Project page: hyeonho99.github.io/track4gen

MoDELS · 多峰值 · 數據集 · LLaVA · 可理解性 ·

2024 年 12 月 10 日

Maya: An Instruction Finetuned Multilingual Multimodal Model

Nahid Alam,Karthik Reddy Kanjula,Surya Guthikonda,Timothy Chung,Bala Krishna S Vegesna,Abhipsha Das,Anthony Susevski,Ryan Sze-Yin Chan,S M Iftekhar Uddin,Shayekh Bin Islam,Roshan Santhosh,Snegha A,Drishti Sharma,Chen Liu,Isha Chaturvedi,Genta Indra Winata,Ashvanth. S,Snehanshu Mukherjee,Alham Fikri Aji

The rapid development of large Vision-Language Models (VLMs) has led to impressive results on academic benchmarks, primarily in widely spoken languages. However, significant gaps remain in the ability of current VLMs to handle low-resource languages and varied cultural contexts, largely due to a lack of high-quality, diverse, and safety-vetted data. Consequently, these models often struggle to understand low-resource languages and cultural nuances in a manner free from toxicity. To address these limitations, we introduce Maya, an open-source Multimodal Multilingual model. Our contributions are threefold: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; 2) a thorough analysis of toxicity within the LLaVA dataset, followed by the creation of a novel toxicity-free version across eight languages; and 3) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at //github.com/nahidalam/maya.

示例 · MoDELS · 大語言模型 · 特征提取器 · 語言模型化 ·

2024 年 12 月 9 日

OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning

Anwesa Choudhuri,Girish Chowdhary,Alexander G. Schwing

from arxiv, Project page: //anwesachoudhuri.github.io/OpenWorldVISCap/

We propose the new task 'open-world video instance segmentation and captioning'. It requires to detect, segment, track and describe with rich captions never before seen objects. This challenging task can be addressed by developing "abstractors" which connect a vision model and a language foundation model. Concretely, we connect a multi-scale visual feature extractor and a large language model (LLM) by developing an object abstractor and an object-to-text abstractor. The object abstractor, consisting of a prompt encoder and transformer blocks, introduces spatially-diverse open-world object queries to discover never before seen objects in videos. An inter-query contrastive loss further encourages the diversity of object queries. The object-to-text abstractor is augmented with masked cross-attention and acts as a bridge between the object queries and a frozen LLM to generate rich and descriptive object-centric captions for each detected object. Our generalized approach surpasses the baseline that jointly addresses the tasks of open-world video instance segmentation and dense video object captioning by 13% on never before seen objects, and by 10% on object-centric captions.

流 · MoDELS · 閾值 · Performer · INTERACT ·

2024 年 12 月 9 日

MVAD: A Multiple Visual Artifact Detector for Video Streaming

Chen Feng,Duolikun Danier,Fan Zhang,Alex Mackin,Andrew Collins,David Bull

from arxiv, Paper has been accpeted by WACV 2025

Visual artifacts are often introduced into streamed video content, due to prevailing conditions during content production and delivery. Since these can degrade the quality of the user's experience, it is important to automatically and accurately detect them in order to enable effective quality measurement and enhancement. Existing detection methods often focus on a single type of artifact and/or determine the presence of an artifact through thresholding objective quality indices. Such approaches have been reported to offer inconsistent prediction performance and are also impractical for real-world applications where multiple artifacts co-exist and interact. In this paper, we propose a Multiple Visual Artifact Detector, MVAD, for video streaming which, for the first time, is able to detect multiple artifacts using a single framework that is not reliant on video quality assessment models. Our approach employs a new Artifact-aware Dynamic Feature Extractor (ADFE) to obtain artifact-relevant spatial features within each frame for multiple artifact types. The extracted features are further processed by a Recurrent Memory Vision Transformer (RMViT) module, which captures both short-term and long-term temporal information within the input video. The proposed network architecture is optimized in an end-to-end manner based on a new, large and diverse training database that is generated by simulating the video streaming pipeline and based on Adversarial Data Augmentation. This model has been evaluated on two video artifact databases, Maxwell and BVI-Artifact, and achieves consistent and improved prediction results for ten target visual artifacts when compared to seven existing single and multiple artifact detectors. The source code and training database will be available at //chenfeng-bristol.github.io/MVAD/.

優化器 · 蒸餾 · 可行 · Guidance · 可約的 ·

2024 年 12 月 9 日

MoViE: Mobile Diffusion for Video Editing

Adil Karjauv,Noor Fathima,Ioannis Lelekas,Fatih Porikli,Amir Ghodrati,Amirhossein Habibian

from arxiv, 8 pages

Recent progress in diffusion-based video editing has shown remarkable potential for practical applications. However, these methods remain prohibitively expensive and challenging to deploy on mobile devices. In this study, we introduce a series of optimizations that render mobile video editing feasible. Building upon the existing image editing model, we first optimize its architecture and incorporate a lightweight autoencoder. Subsequently, we extend classifier-free guidance distillation to multiple modalities, resulting in a threefold on-device speedup. Finally, we reduce the number of sampling steps to one by introducing a novel adversarial distillation scheme which preserves the controllability of the editing process. Collectively, these optimizations enable video editing at 12 frames per second on mobile devices, while maintaining high quality. Our results are available at //qualcomm-ai-research.github.io/mobile-video-editing/

MoDELS · 多樣性 · Notability · 控制器 · contrastive ·

2024 年 12 月 6 日

MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models

Tuna Han Salih Meral,Hidir Yesiltepe,Connor Dunlop,Pinar Yanardag

from arxiv, Project Page: //motionflow-diffusion.github.io

Text-to-video models have demonstrated impressive capabilities in producing diverse and captivating video content, showcasing a notable advancement in generative AI. However, these models generally lack fine-grained control over motion patterns, limiting their practical applicability. We introduce MotionFlow, a novel framework designed for motion transfer in video diffusion models. Our method utilizes cross-attention maps to accurately capture and manipulate spatial and temporal dynamics, enabling seamless motion transfers across various contexts. Our approach does not require training and works on test-time by leveraging the inherent capabilities of pre-trained video diffusion models. In contrast to traditional approaches, which struggle with comprehensive scene changes while maintaining consistent motion, MotionFlow successfully handles such complex transformations through its attention-based mechanism. Our qualitative and quantitative experiments demonstrate that MotionFlow significantly outperforms existing models in both fidelity and versatility even during drastic scene alterations.

MoDELS · 生成模型 · Performer · 代碼 · 設計 ·

2024 年 12 月 6 日

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong,Qi Tian,Zijian Zhang,Rox Min,Zuozhuo Dai,Jin Zhou,Jiangfeng Xiong,Xin Li,Bo Wu,Jianwei Zhang,Kathrina Wu,Qin Lin,Junkun Yuan,Yanxin Long,Aladdin Wang,Andong Wang,Changlin Li,Duojun Huang,Fang Yang,Hao Tan,Hongmei Wang,Jacob Song,Jiawang Bai,Jianbing Wu,Jinbao Xue,Joey Wang,Kai Wang,Mengyang Liu,Pengyu Li,Shuai Li,Weiyan Wang,Wenqing Yu,Xinchi Deng,Yang Li,Yi Chen,Yutao Cui,Yuanbo Peng,Zhentao Yu,Zhiyu He,Zhiyong Xu,Zixiang Zhou,Zunnan Xu,Yangyu Tao,Qinglin Lu,Songtao Liu,Daquan Zhou,Hongfa Wang,Yong Yang,Di Wang,Yuhong Liu,Jie Jiang,Caesar Zhong

Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at //github.com/Tencent/HunyuanVideo.

Networking · CNN · MoDELS · Performer · 數學 ·

2023 年 3 月 5 日

DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network

Xuan Shen,Yaohua Wang,Ming Lin,Yilun Huang,Hao Tang,Xiuyu Sun,Yanzhi Wang

from arxiv, Accepted by CVPR 2023

The rapid advances in Vision Transformer (ViT) refresh the state-of-the-art performances in various vision tasks, overshadowing the conventional CNN-based models. This ignites a few recent striking-back research in the CNN world showing that pure CNN models can achieve as good performance as ViT models when carefully tuned. While encouraging, designing such high-performance CNN models is challenging, requiring non-trivial prior knowledge of network design. To this end, a novel framework termed Mathematical Architecture Design for Deep CNN (DeepMAD) is proposed to design high-performance CNN models in a principled way. In DeepMAD, a CNN network is modeled as an information processing system whose expressiveness and effectiveness can be analytically formulated by their structural parameters. Then a constrained mathematical programming (MP) problem is proposed to optimize these structural parameters. The MP problem can be easily solved by off-the-shelf MP solvers on CPUs with a small memory footprint. In addition, DeepMAD is a pure mathematical framework: no GPU or training data is required during network design. The superiority of DeepMAD is validated on multiple large-scale computer vision benchmark datasets. Notably on ImageNet-1k, only using conventional convolutional layers, DeepMAD achieves 0.7% and 1.5% higher top-1 accuracy than ConvNeXt and Swin on Tiny level, and 0.8% and 0.9% higher on Small level.