97SE亚洲国产综合在线_18GAY国产小鲜肉可播放_理伦中文字幕去网站上观察_97福利视频精品第一导航_国产原创对白在线观看精品_国产99视频免费精品视看6_欧美一区二区三区综合在线

This paper presents a video inversion approach for zero-shot video editing, which models the input video with low-rank representation during the inversion process. The existing video editing methods usually apply the typical 2D DDIM inversion or naive spatial-temporal DDIM inversion before editing, which leverages time-varying representation for each frame to derive noisy latent. Unlike most existing approaches, we propose a Spatial-Temporal Expectation-Maximization (STEM) inversion, which formulates the dense video feature under an expectation-maximization manner and iteratively estimates a more compact basis set to represent the whole video. Each frame applies the fixed and global representation for inversion, which is more friendly for temporal consistency during reconstruction and editing. Extensive qualitative and quantitative experiments demonstrate that our STEM inversion can achieve consistent improvement on two state-of-the-art video editing methods. Project page: //stem-inv.github.io/page/.

相關內容

表示

關注 1

state-of-the-art · 貪心 · 貪心逐層預訓練 · Face Morphing · MoDELS ·

2024 年 7 月 2 日

Greedy-DiM: Greedy Algorithms for Unreasonably Effective Face Morphs

Zander W. Blasingame,Chen Liu

from arxiv, Accepted as a conference paper at IJCB 2024

Morphing attacks are an emerging threat to state-of-the-art Face Recognition (FR) systems, which aim to create a single image that contains the biometric information of multiple identities. Diffusion Morphs (DiM) are a recently proposed morphing attack that has achieved state-of-the-art performance for representation-based morphing attacks. However, none of the existing research on DiMs have leveraged the iterative nature of DiMs and left the DiM model as a black box, treating it no differently than one would a Generative Adversarial Network (GAN) or Varational AutoEncoder (VAE). We propose a greedy strategy on the iterative sampling process of DiM models which searches for an optimal step guided by an identity-based heuristic function. We compare our proposed algorithm against ten other state-of-the-art morphing algorithms using the open-source SYN-MAD 2022 competition dataset. We find that our proposed algorithm is unreasonably effective, fooling all of the tested FR systems with an MMPMR of 100%, outperforming all other morphing algorithms compared.

MoDELS · Performer · 可理解性 · 模型評估 · 樣本 ·

2024 年 7 月 2 日

ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos

Jr-Jen Chen,Yu-Chien Liao,Hsi-Che Lin,Yu-Chu Yu,Yen-Chun Chen,Yu-Chiang Frank Wang

from arxiv, Project page: //rextime.github.io/

We introduce ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning within video events. Specifically, ReXTime focuses on reasoning across time, i.e. human-like understanding when the question and its corresponding answer occur in different video segments. This form of reasoning, requiring advanced understanding of cause-and-effect relationships across video segments, poses significant challenges to even the frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotations. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still lag behind human performance by a significant 14.3% accuracy gap. Additionally, our pipeline creates a training dataset of 9,695 machine generated samples without manual effort, which empirical studies suggest can enhance the across-time reasoning via fine-tuning.

機器人 · Performer · 回合 · 變換 · 控制器 ·

2024 年 7 月 2 日

Real Time Collision Avoidance with GPU-Computed Distance Maps

Wendwosen Bellete Bedada,Gianluca Palli

from arxiv, 11 pages, 13 figures

This paper presents reactive obstacle and self-collision avoidance of redundant robotic manipulators within real time kinematic feedback control using GPU-computed distance transform. The proposed framework utilizes discretized representation of the robot and the environment to calculate 3D Euclidean distance transform for task-priority based kinematic control. The environment scene is represented using a 3D GPU-voxel map created and updated from a live pointcloud data while the robotic link model is converted into a voxels offline and inserted into the voxel map according to the joint state of the robot to form the self-obstacle map. The proposed approach is evaluated using the Tiago robot, showing that all obstacle and self collision avoidance constraints are respected within one framework even with fast moving obstacles while the robot performs end-effector pose tracking in real time. A comparison of related works that depend on GPU and CPU computed distance fields is also presented to highlight the time performance as well as accuracy of the GPU distance field.

MoDELS · 蒸餾 · 標注 · 知識 (knowledge) · Performer ·

2024 年 7 月 1 日

uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation via Large-Scale Pseudo Labelling

Abdul Waheed,Karima Kadaoui,Muhammad Abdul-Mageed

from arxiv, Work in progress

Recent work on distilling Whisper's knowledge into small models using pseudo-labels shows promising performance while reducing the size by up to 50\%. This results in small, efficient, and dedicated models. However, a critical step of distillation from pseudo-labels involves filtering high-quality predictions and using only those during training. This step requires ground truth to compare and filter bad examples making the whole process supervised. In addition to that, the distillation process requires a large amount of data thereby limiting the ability to distil models in low-resource settings. To address this challenge, we propose an unsupervised or label-free framework for distillation, thus eliminating the requirement for labeled data altogether. Through experimentation, we show that our best distilled models outperform the teacher model by 5-7 points in terms of WER. Additionally, our models are on par with or better than similar supervised data filtering setup. When we scale the data, our models significantly outperform all zero-shot and supervised models. In this work, we demonstrate that it's possible to distill large Whisper models into relatively small models without using any labeled data. As a result, our distilled models are 25-50\% more compute and memory efficient while maintaining performance equal to or better than the teacher model.

可理解性 · 流 · Performer · MoDELS · 在線 ·

2024 年 6 月 30 日

Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

Haoji Zhang,Yiqin Wang,Yansong Tang,Yong Liu,Jiashi Feng,Jifeng Dai,Xiaojie Jin

Benefiting from the advancements in large language models and cross-modal alignment, existing multi-modal video understanding methods have achieved prominent performance in offline scenario. However, online video streams, as one of the most common media forms in the real world, have seldom received attention. Compared to offline videos, the 'dynamic' nature of online video streams poses challenges for the direct application of existing models and introduces new problems, such as the storage of extremely long-term information, interaction between continuous visual content and 'asynchronous' user questions. Therefore, in this paper we present Flash-VStream, a video-language model that simulates the memory mechanism of human. Our model is able to process extremely long video streams in real-time and respond to user queries simultaneously. Compared to existing models, Flash-VStream achieves significant reductions in inference latency and VRAM consumption, which is intimately related to performing understanding of online streaming video. In addition, given that existing video understanding benchmarks predominantly concentrate on offline scenario, we propose VStream-QA, a novel question answering benchmark specifically designed for online video streaming understanding. Comparisons with popular existing methods on the proposed benchmark demonstrate the superiority of our method for such challenging setting. To verify the generalizability of our approach, we further evaluate it on existing video understanding benchmarks and achieves state-of-the-art performance in offline scenarios as well. All code, models, and datasets are available at the //invinciblewyq.github.io/vstream-page/

DeepFakes · 卷積 · Performer · 圖 · 穩健性 ·

2024 年 6 月 28 日

GRACE: Graph-Regularized Attentive Convolutional Entanglement with Laplacian Smoothing for Robust DeepFake Video Detection

Chih-Chung Hsu,Shao-Ning Chen,Mei-Hsuan Wu,Yi-Fang Wang,Chia-Ming Lee,Yi-Shiuan Chou

from arxiv, Submitted to TPAMI 2024

As DeepFake video manipulation techniques escalate, posing profound threats, the urgent need to develop efficient detection strategies is underscored. However, one particular issue lies with facial images being mis-detected, often originating from degraded videos or adversarial attacks, leading to unexpected temporal artifacts that can undermine the efficacy of DeepFake video detection techniques. This paper introduces a novel method for robust DeepFake video detection, harnessing the power of the proposed Graph-Regularized Attentive Convolutional Entanglement (GRACE) based on the graph convolutional network with graph Laplacian to address the aforementioned challenges. First, conventional Convolution Neural Networks are deployed to perform spatiotemporal features for the entire video. Then, the spatial and temporal features are mutually entangled by constructing a graph with sparse constraint, enforcing essential features of valid face images in the noisy face sequences remaining, thus augmenting stability and performance for DeepFake video detection. Furthermore, the Graph Laplacian prior is proposed in the graph convolutional network to remove the noise pattern in the feature space to further improve the performance. Comprehensive experiments are conducted to illustrate that our proposed method delivers state-of-the-art performance in DeepFake video detection under noisy face sequences. The source code is available at //github.com/ming053l/GRACE.

有偏 · 優化器 · MoDELS · 樣本 · Facebook AI Research ·

2024 年 6 月 28 日

PopAlign: Population-Level Alignment for Fair Text-to-Image Generation

Shufan Li,Harkanwar Singh,Aditya Grover

from arxiv, 18 pages, 10 figures

Text-to-image (T2I) models achieve high-fidelity generation through extensive training on large datasets. However, these models may unintentionally pick up undesirable biases of their training data, such as over-representation of particular identities in gender or ethnicity neutral prompts. Existing alignment methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) fail to address this problem effectively because they operate on pairwise preferences consisting of individual samples, while the aforementioned biases can only be measured at a population level. For example, a single sample for the prompt "doctor" could be male or female, but a model generating predominantly male doctors even with repeated sampling reflects a gender bias. To address this limitation, we introduce PopAlign, a novel approach for population-level preference optimization, while standard optimization would prefer entire sets of samples over others. We further derive a stochastic lower bound that directly optimizes for individual samples from preferred populations over others for scalable training. Using human evaluation and standard image quality and bias metrics, we show that PopAlign significantly mitigates the bias of pretrained T2I models while largely preserving the generation quality. Code is available at //github.com/jacklishufan/PopAlignSDXL.

Agent · Processing（編程語言） · 多樣性 · INTERACT · Integration ·

2024 年 6 月 28 日

Unlocking Varied Perspectives: A Persona-Based Multi-Agent Framework with Debate-Driven Text Planning for Argument Generation

Zhe Hu,Hou Pong Chan,Jing Li,Yu Yin

Writing persuasive arguments is a challenging task for both humans and machines. It entails incorporating high-level beliefs from various perspectives on the topic, along with deliberate reasoning and planning to construct a coherent narrative. Current language models often generate surface tokens autoregressively, lacking explicit integration of these underlying controls, resulting in limited output diversity and coherence. In this work, we propose a persona-based multi-agent framework for argument writing. Inspired by the human debate, we first assign each agent a persona representing its high-level beliefs from a unique perspective, and then design an agent interaction process so that the agents can collaboratively debate and discuss the idea to form an overall plan for argument writing. Such debate process enables fluid and nonlinear development of ideas. We evaluate our framework on argumentative essay writing. The results show that our framework can generate more diverse and persuasive arguments through both automatic and human evaluations.

優化器 · 設計 · 穩健性 · 訓練集 · 情景 ·

2024 年 6 月 28 日

Robust Model-Based Optimization for Challenging Fitness Landscapes

Saba Ghaffari,Ehsan Saleh,Alexander G. Schwing,Yu-Xiong Wang,Martin D. Burke,Saurabh Sinha

Protein design, a grand challenge of the day, involves optimization on a fitness landscape, and leading methods adopt a model-based approach where a model is trained on a training set (protein sequences and fitness) and proposes candidates to explore next. These methods are challenged by sparsity of high-fitness samples in the training set, a problem that has been in the literature. A less recognized but equally important problem stems from the distribution of training samples in the design space: leading methods are not designed for scenarios where the desired optimum is in a region that is not only poorly represented in training data, but also relatively far from the highly represented low-fitness regions. We show that this problem of "separation" in the design space is a significant bottleneck in existing model-based optimization tools and propose a new approach that uses a novel VAE as its search model to overcome the problem. We demonstrate its advantage over prior methods in robustly finding improved samples, regardless of the imbalance and separation between low- and high-fitness samples. Our comprehensive benchmark on real and semi-synthetic protein datasets as well as solution design for physics-informed neural networks, showcases the generality of our approach in discrete and continuous design spaces. Our implementation is available at //github.com/sabagh1994/PGVAE.

Networking · MoDELS · Stable Diffusion · Processing（編程語言） · 變換 ·

2024 年 6 月 28 日

Network Bending of Diffusion Models for Audio-Visual Generation

Luke Dzwonczyk,Carmine Emanuele Cella,David Ban

from arxiv, 8 pages, 5 figures, to be published in the proceedings of the 27th International Conference on Digital Audio Effects (DAFx24), for additional image and video examples see //dzluke.github.io/DAFX2024/

In this paper we present the first steps towards the creation of a tool which enables artists to create music visualizations using pre-trained, generative, machine learning models. First, we investigate the application of network bending, the process of applying transforms within the layers of a generative network, to image generation diffusion models by utilizing a range of point-wise, tensor-wise, and morphological operators. We identify a number of visual effects that result from various operators, including some that are not easily recreated with standard image editing tools. We find that this process allows for continuous, fine-grain control of image generation which can be helpful for creative applications. Next, we generate music-reactive videos using Stable Diffusion by passing audio features as parameters to network bending operators. Finally, we comment on certain transforms which radically shift the image and the possibilities of learning more about the latent space of Stable Diffusion based on these transforms.