国产日黄色大片一区二区-高中小鲜肉自慰GAY免费

As DeepFake video manipulation techniques escalate, posing profound threats, the urgent need to develop efficient detection strategies is underscored. However, one particular issue lies with facial images being mis-detected, often originating from degraded videos or adversarial attacks, leading to unexpected temporal artifacts that can undermine the efficacy of DeepFake video detection techniques. This paper introduces a novel method for robust DeepFake video detection, harnessing the power of the proposed Graph-Regularized Attentive Convolutional Entanglement (GRACE) based on the graph convolutional network with graph Laplacian to address the aforementioned challenges. First, conventional Convolution Neural Networks are deployed to perform spatiotemporal features for the entire video. Then, the spatial and temporal features are mutually entangled by constructing a graph with sparse constraint, enforcing essential features of valid face images in the noisy face sequences remaining, thus augmenting stability and performance for DeepFake video detection. Furthermore, the Graph Laplacian prior is proposed in the graph convolutional network to remove the noise pattern in the feature space to further improve the performance. Comprehensive experiments are conducted to illustrate that our proposed method delivers state-of-the-art performance in DeepFake video detection under noisy face sequences. The source code is available at //github.com/ming053l/GRACE.

相關內容

DeepFakes

關注 4

相關系數 · SLAM · 樣本 · 相似度 · MoDELS ·

2024 年 10 月 30 日

LGU-SLAM: Learnable Gaussian Uncertainty Matching with Deformable Correlation Sampling for Deep Visual SLAM

Yucheng Huang,Luping Ji,Hudong Liu,Mao Ye

Deep visual Simultaneous Localization and Mapping (SLAM) techniques, e.g., DROID, have made significant advancements by leveraging deep visual odometry on dense flow fields. In general, they heavily rely on global visual similarity matching. However, the ambiguous similarity interference in uncertain regions could often lead to excessive noise in correspondences, ultimately misleading SLAM in geometric modeling. To address this issue, we propose a Learnable Gaussian Uncertainty (LGU) matching. It mainly focuses on precise correspondence construction. In our scheme, a learnable 2D Gaussian uncertainty model is designed to associate matching-frame pairs. It could generate input-dependent Gaussian distributions for each correspondence map. Additionally, a multi-scale deformable correlation sampling strategy is devised to adaptively fine-tune the sampling of each direction by a priori look-up ranges, enabling reliable correlation construction. Furthermore, a KAN-bias GRU component is adopted to improve a temporal iterative enhancement for accomplishing sophisticated spatio-temporal modeling with limited parameters. The extensive experiments on real-world and synthetic datasets are conducted to validate the effectiveness and superiority of our method.

數據集 · Pair · 數據預處理 · state-of-the-art · Performer ·

2024 年 10 月 29 日

BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays

Yang Zhou,Tan Li Hui Faith,Yanyu Xu,Sicong Leng,Xinxing Xu,Yong Liu,Rick Siow Mong Goh

from arxiv, Accepted to NeurIPS24 Datasets and Benchmarks Track

Medical Vision-Language Pretraining (MedVLP) shows promise in learning generalizable and transferable visual representations from paired and unpaired medical images and reports. MedVLP can provide useful features to downstream tasks and facilitate adapting task-specific models to new setups using fewer examples. However, existing MedVLP methods often differ in terms of datasets, preprocessing, and finetuning implementations. This pose great challenges in evaluating how well a MedVLP method generalizes to various clinically-relevant tasks due to the lack of unified, standardized, and comprehensive benchmark. To fill this gap, we propose BenchX, a unified benchmark framework that enables head-to-head comparison and systematical analysis between MedVLP methods using public chest X-ray datasets. Specifically, BenchX is composed of three components: 1) Comprehensive datasets covering nine datasets and four medical tasks; 2) Benchmark suites to standardize data preprocessing, train-test splits, and parameter selection; 3) Unified finetuning protocols that accommodate heterogeneous MedVLP methods for consistent task adaptation in classification, segmentation, and report generation, respectively. Utilizing BenchX, we establish baselines for nine state-of-the-art MedVLP methods and found that the performance of some early MedVLP methods can be enhanced to surpass more recent ones, prompting a revisiting of the developments and conclusions from prior works in MedVLP. Our code are available at //github.com/yangzhou12/BenchX.

詞元分析器 · 離散化 · MoDELS · Learning · 增強現實（AR） ·

2024 年 10 月 28 日

LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior

Hanyu Wang,Saksham Suri,Yixuan Ren,Hao Chen,Abhinav Shrivastava

from arxiv, Project page: //hywang66.github.io/larp/

We present LARP, a novel video tokenizer designed to overcome limitations in current video tokenization methods for autoregressive (AR) generative models. Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme that gathers information from the visual content using a set of learned holistic queries. This design allows LARP to capture more global and semantic representations, rather than being limited to local patch-level information. Furthermore, it offers flexibility by supporting an arbitrary number of discrete tokens, enabling adaptive and efficient tokenization based on the specific requirements of the task. To align the discrete token space with downstream AR generation tasks, LARP integrates a lightweight AR transformer as a training-time prior model that predicts the next token on its discrete latent space. By incorporating the prior model during training, LARP learns a latent space that is not only optimized for video reconstruction but is also structured in a way that is more conducive to autoregressive generation. Moreover, this process defines a sequential order for the discrete tokens, progressively pushing them toward an optimal configuration during training, ensuring smoother and more accurate AR generation at inference time. Comprehensive experiments demonstrate LARP's strong performance, achieving state-of-the-art FVD on the UCF101 class-conditional video generation benchmark. LARP enhances the compatibility of AR models with videos and opens up the potential to build unified high-fidelity multimodal large language models (MLLMs).

Integration · 小樣本學習 · GROUP · Extensibility · Networking ·

2024 年 10 月 23 日

TAGE: Trustworthy Attribute Group Editing for Stable Few-shot Image Generation

Ruicheng Zhang,Guoheng Huang,Yejing Huo,Xiaochen Yuan,Zhizhen Zhou,Xuhang Chen,Guo Zhong

from arxiv, Accepted by International Conference on Signal Processing Systems Conference

Generative Adversarial Networks (GANs) have emerged as a prominent research focus for image editing tasks, leveraging the powerful image generation capabilities of the GAN framework to produce remarkable results.However, prevailing approaches are contingent upon extensive training datasets and explicit supervision, presenting a significant challenge in manipulating the diverse attributes of new image classes with limited sample availability. To surmount this hurdle, we introduce TAGE, an innovative image generation network comprising three integral modules: the Codebook Learning Module (CLM), the Code Prediction Module (CPM) and the Prompt-driven Semantic Module (PSM). The CPM module delves into the semantic dimensions of category-agnostic attributes, encapsulating them within a discrete codebook. This module is predicated on the concept that images are assemblages of attributes, and thus, by editing these category-independent attributes, it is theoretically possible to generate images from unseen categories. Subsequently, the CPM module facilitates naturalistic image editing by predicting indices of category-independent attribute vectors within the codebook. Additionally, the PSM module generates semantic cues that are seamlessly integrated into the Transformer architecture of the CPM, enhancing the model's comprehension of the targeted attributes for editing. With these semantic cues, the model can generate images that accentuate desired attributes more prominently while maintaining the integrity of the original category, even with a limited number of samples. We have conducted extensive experiments utilizing the Animal Faces, Flowers, and VGGFaces datasets. The results of these experiments demonstrate that our proposed method not only achieves superior performance but also exhibits a high degree of stability when compared to other few-shot image generation techniques.

MoDELS · 語言模型化 · INFORMS · 覆蓋 · 大語言模型 ·

2024 年 10 月 22 日

ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage

Taewhoo Lee,Chanwoong Yoon,Kyochul Jang,Donghyeon Lee,Minju Song,Hyunjae Kim,Jaewoo Kang

from arxiv, 15 pages, 5 figures

Recent advancements in large language models (LLM) capable of processing extremely long texts highlight the need for a dedicated evaluation benchmark to assess their long-context capabilities. However, existing methods, like the needle-in-a-haystack test, do not effectively assess whether these models fully utilize contextual information, raising concerns about the reliability of current evaluation techniques. To thoroughly examine the effectiveness of existing benchmarks, we introduce a new metric called information coverage (IC), which quantifies the proportion of the input context necessary for answering queries. Our findings indicate that current benchmarks exhibit low IC; although the input context may be extensive, the actual usable context is often limited. To address this, we present ETHIC, a novel benchmark designed to assess LLMs' ability to leverage the entire context. Our benchmark comprises 2,648 test instances spanning four long-context tasks with high IC scores in the domains of books, debates, medicine, and law. Our evaluations reveal significant performance drops in contemporary LLMs, highlighting a critical challenge in managing long contexts. Our benchmark is available at //github.com/dmis-lab/ETHIC.

控制器 · 表示 · REST · Weight · Guidance ·

2024 年 10 月 22 日

MotionClone: Training-Free Motion Cloning for Controllable Video Generation

Pengyang Ling,Jiazi Bu,Pan Zhang,Xiaoyi Dong,Yuhang Zang,Tong Wu,Huaian Chen,Jiaqi Wang,Yi Jin

from arxiv, 18 pages, 14 figures, //bujiazi.github.io/motionclone.github.io/

Motion-based controllable video generation offers the potential for creating captivating visual content. Existing methods typically necessitate model training to encode particular motion cues or incorporate fine-tuning to inject certain motion patterns, resulting in limited flexibility and generalization. In this work, we propose MotionClone, a training-free framework that enables motion cloning from reference videos to versatile motion-controlled video generation, including text-to-video and image-to-video. Based on the observation that the dominant components in temporal-attention maps drive motion synthesis, while the rest mainly capture noisy or very subtle motions, MotionClone utilizes sparse temporal attention weights as motion representations for motion guidance, facilitating diverse motion transfer across varying scenarios. Meanwhile, MotionClone allows for the direct extraction of motion representation through a single denoising step, bypassing the cumbersome inversion processes and thus promoting both efficiency and flexibility. Extensive experiments demonstrate that MotionClone exhibits proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.

MoDELS · Pyramid · 卷積 · 可理解性 · 可辨認的 ·

2024 年 10 月 20 日

ContextDet: Temporal Action Detection with Adaptive Context Aggregation

Ning Wang,Yun Xiao,Xiaopeng Peng,Xiaojun Chang,Xuanhong Wang,Dingyi Fang

Temporal action detection (TAD), which locates and recognizes action segments, remains a challenging task in video understanding due to variable segment lengths and ambiguous boundaries. Existing methods treat neighboring contexts of an action segment indiscriminately, leading to imprecise boundary predictions. We introduce a single-stage ContextDet framework, which makes use of large-kernel convolutions in TAD for the first time. Our model features a pyramid adaptive context aggragation (ACA) architecture, capturing long context and improving action discriminability. Each ACA level consists of two novel modules. The context attention module (CAM) identifies salient contextual information, encourages context diversity, and preserves context integrity through a context gating block (CGB). The long context module (LCM) makes use of a mixture of large- and small-kernel convolutions to adaptively gather long-range context and fine-grained local features. Additionally, by varying the length of these large kernels across the ACA pyramid, our model provides lightweight yet effective context aggregation and action discrimination. We conducted extensive experiments and compared our model with a number of advanced TAD methods on six challenging TAD benchmarks: MultiThumos, Charades, FineAction, EPIC-Kitchens 100, Thumos14, and HACS, demonstrating superior accuracy at reduced inference speed.

控制器 · 損失 · 正則的 · 優化器 · 評論員 ·

2024 年 10 月 18 日

Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization

Bin Lin,Yanzhen Yu,Jianhao Ye,Ruitao Lv,Yuguang Yang,Ruoye Xie,Pan Yu,Hongbin Zhou

from arxiv, under review

Existing audio-driven facial animation methods face critical challenges, including expression leakage, ineffective subtle expression transfer, and imprecise audio-driven synchronization. We discovered that these issues stem from limitations in motion representation and the lack of fine-grained control over facial expressions. To address these problems, we present Takin-ADA, a novel two-stage approach for real-time audio-driven portrait animation. In the first stage, we introduce a specialized loss function that enhances subtle expression transfer while reducing unwanted expression leakage. The second stage utilizes an advanced audio processing technique to improve lip-sync accuracy. Our method not only generates precise lip movements but also allows flexible control over facial expressions and head motions. Takin-ADA achieves high-resolution (512x512) facial animations at up to 42 FPS on an RTX 4090 GPU, outperforming existing commercial solutions. Extensive experiments demonstrate that our model significantly surpasses previous methods in video quality, facial dynamics realism, and natural head movements, setting a new benchmark in the field of audio-driven facial animation.

Color · Mamba · MoDELS · 通道 · 知識 (knowledge) ·

2024 年 10 月 18 日

MambaSCI: Efficient Mamba-UNet for Quad-Bayer Patterned Video Snapshot Compressive Imaging

Zhenghao Pan,Haijin Zeng,Jiezhang Cao,Yongyong Chen,Kai Zhang,Yong Xu

from arxiv, NeurIPS 2024

Color video snapshot compressive imaging (SCI) employs computational imaging techniques to capture multiple sequential video frames in a single Bayer-patterned measurement. With the increasing popularity of quad-Bayer pattern in mainstream smartphone cameras for capturing high-resolution videos, mobile photography has become more accessible to a wider audience. However, existing color video SCI reconstruction algorithms are designed based on the traditional Bayer pattern. When applied to videos captured by quad-Bayer cameras, these algorithms often result in color distortion and ineffective demosaicing, rendering them impractical for primary equipment. To address this challenge, we propose the MambaSCI method, which leverages the Mamba and UNet architectures for efficient reconstruction of quad-Bayer patterned color video SCI. To the best of our knowledge, our work presents the first algorithm for quad-Bayer patterned SCI reconstruction, and also the initial application of the Mamba model to this task. Specifically, we customize Residual-Mamba-Blocks, which residually connect the Spatial-Temporal Mamba (STMamba), Edge-Detail-Reconstruction (EDR) module, and Channel Attention (CA) module. Respectively, STMamba is used to model long-range spatial-temporal dependencies with linear complexity, EDR is for better edge-detail reconstruction, and CA is used to compensate for the missing channel information interaction in Mamba model. Experiments demonstrate that MambaSCI surpasses state-of-the-art methods with lower computational and memory costs. PyTorch style pseudo-code for the core modules is provided in the supplementary materials.

Networking · 圖 · 去噪 · Ad hoc · Ad Hoc Networks ·

2024 年 10 月 9 日

NetDiff: Deep Graph Denoising Diffusion for Ad Hoc Network Topology Generation

Félix Marcoccia,Cédric Adjih,Paul Mühlethaler

This work introduces NetDiff, an expressive graph denoising diffusion probabilistic architecture that generates wireless ad hoc network link topologies. Such networks, with directional antennas, can achieve unmatched performance when the communication links are designed to provide good geometric properties, notably by reducing interference between these links while respecting diverse physical constraints. How to craft such a link assignment algorithm is yet a real problem. Deep graph generation offers multiple advantages compared to traditional approaches: it allows to relieve the network nodes of the communication burden caused by the search of viable links and to avoid resorting to heavy combinatorial methods to find a good link topology. Denoising diffusion also provides a built-in method to update the network over time. Given that graph neural networks sometimes tend to struggle with global, structural properties, we augment the popular graph transformer with cross-attentive modulation tokens in order to improve global control over the predicted topology. We also incorporate simple node and edge features, as well as additional loss terms, to facilitate the compliance with the network topology physical constraints. A network evolution algorithm based on partial diffusion is also proposed to maintain a stable network topology over time when the nodes move. Our results show that the generated links are realistic, present structural properties similar to the dataset graphs', and require only minor corrections and verification steps to be operational.