Humans can easily imagine a scene from auditory information based on their prior knowledge of audio-visual events. In this paper, we mimic this innate human ability in deep learning models to improve the quality of video inpainting. To implement the prior knowledge, we first train an audio-visual network, which learns the correspondence between auditory and visual information. Then, the audio-visual network is employed as a guide that conveys the prior knowledge of audio-visual correspondence to the video inpainting network. This prior knowledge is transferred through two novel losses that we propose: an audio-visual attention loss and an audio-visual pseudo-class consistency loss. These two losses further improve video inpainting performance by encouraging the inpainting result to correspond closely to its synchronized audio. Experimental results demonstrate that our proposed method can restore a wider domain of video scenes and is particularly effective when the sounding object in the scene is partially blinded.

Related content

Image inpainting

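Inpainting is the process of reconstructing lost or damaged parts of images and videos. In museums, for example, this work is usually carried out by experienced conservators or art restorers. In the digital world, inpainting (also known as image interpolation or video interpolation) uses sophisticated algorithms to replace lost or corrupted image data, mainly small regions and blemishes.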

November 28, 2023

Self-Supervised Motion Magnification by Backpropagating Through Optical Flow

Zhaoying Pan,Daniel Geng,Andrew Owens

This paper presents a simple, self-supervised method for magnifying subtle motions in video: given an input video and a magnification factor, we manipulate the video such that its new optical flow is scaled by the desired amount. To train our model, we propose a loss function that estimates the optical flow of the generated video and penalizes how far it deviates from the given magnification factor. Thus, training involves differentiating through a pretrained optical flow network. Since our model is self-supervised, we can further improve its performance through test-time adaptation, by finetuning it on the input video. It can also be easily extended to magnify the motions of only user-selected objects. Our approach avoids the need for synthetic magnification datasets that have been used to train prior learning-based approaches. Instead, it leverages the existing capabilities of off-the-shelf motion estimators. We demonstrate the effectiveness of our method through evaluations of both visual quality and quantitative metrics on a range of real-world and synthetic videos, and we show our method works for both supervised and unsupervised optical flow methods.
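The loss described in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the penalty, so the mean L1 distance used here is an assumption, and `magnification_loss` is a hypothetical name. In the paper the flows come from a pretrained optical flow network that is differentiated through; here they are supplied as plain lists of vectors so that only the objective is shown.

```python
from typing import List, Tuple

# One (dx, dy) flow vector per pixel, flattened into a list.
Flow = List[Tuple[float, float]]


def magnification_loss(flow_in: Flow, flow_out: Flow, alpha: float) -> float:
    """Mean L1 distance between the output flow and alpha times the input flow.

    Assumption: L1 is chosen for illustration; the abstract only says the
    loss penalizes deviation from the requested magnification factor.
    """
    if len(flow_in) != len(flow_out) or not flow_in:
        raise ValueError("flows must be non-empty and of equal length")
    total = 0.0
    for (u_in, v_in), (u_out, v_out) in zip(flow_in, flow_out):
        total += abs(u_out - alpha * u_in) + abs(v_out - alpha * v_in)
    return total / len(flow_in)


# Toy flows: a video whose motion is exactly doubled has zero loss for alpha=2,
# while an unchanged video is penalized.
flow_input = [(1.0, 0.0), (0.0, -0.5)]
flow_doubled = [(2.0, 0.0), (0.0, -1.0)]
loss_doubled = magnification_loss(flow_input, flow_doubled, alpha=2.0)
loss_unchanged = magnification_loss(flow_input, flow_input, alpha=2.0)
```

In a real training loop the same quantity would be computed on tensors so that gradients can reach the generator through the flow estimator.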

November 28, 2023

Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer

Danah Yatim,Rafail Fridman,Omer Bar Tal,Yoni Kasten,Tali Dekel

from arxiv, Project page: //diffusion-motion-transfer.github.io/

We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining an input video's motion and scene layout. Prior methods are confined to transferring motion across two subjects within the same or closely related object categories and are applicable for limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits.

November 28, 2023

Continuously Controllable Facial Expression Editing in Talking Face Videos

Zhiyao Sun,Yu-Hui Wen,Tian Lv,Yanan Sun,Ziyang Zhang,Yaoyuan Wang,Yong-Jin Liu

from arxiv, Accepted by IEEE Transactions on Affective Computing (DOI: 10.1109/TAFFC.2023.3334511). Demo video: //youtu.be/WD-bNVya6kM . Project page: //raineggplant.github.io/FEE4TV

Recently, audio-driven talking face video generation has attracted considerable attention. However, very few studies address emotional editing of these talking face videos with continuously controllable expressions, for which there is strong demand in industry. The challenge is that speech-related expressions and emotion-related expressions are often highly coupled. Meanwhile, traditional image-to-image translation methods do not work well in our application because expressions are coupled with other attributes such as pose, i.e., translating the expression of the character in each frame may simultaneously change the head pose due to bias in the training data distribution. In this paper, we propose a high-quality facial expression editing method for talking face videos, allowing the user to control the target emotion in the edited video continuously. We present a new perspective on this task as a special case of motion information editing, where we use a 3DMM to capture major facial movements and an associated texture map modeled by a StyleGAN to capture appearance details. Both representations (3DMM and texture map) contain emotional information and can be continuously modified by neural networks and easily smoothed by averaging in coefficient/latent spaces, making our method simple yet effective. We also introduce a mouth shape preservation loss to control the trade-off between lip synchronization and the degree of exaggeration of the edited expression. Extensive experiments and a user study show that our method achieves state-of-the-art performance across various evaluation criteria.

November 28, 2023

Enhanced Low-Complexity FDD System Feedback with Variable Bit Lengths via Generative Modeling

Nurettin Turan,Benedikt Fesl,Wolfgang Utschick

Recently, a versatile limited feedback scheme based on a Gaussian mixture model (GMM) was proposed for frequency division duplex (FDD) systems. This scheme provides high flexibility regarding various system parameters and is applicable to both point-to-point multiple-input multiple-output (MIMO) and multi-user MIMO (MU-MIMO) communications. The GMM is learned to cover the operation of all mobile terminals (MTs) located inside the base station (BS) cell, and each MT only needs to evaluate its strongest mixture component as feedback, eliminating the need for channel estimation at the MT. In this work, we extend the GMM-based feedback scheme to variable feedback lengths by leveraging a single learned GMM through merging or pruning of dispensable mixture components. Additionally, the GMM covariances are restricted to Toeplitz or circulant structure through model-based insights. These extensions significantly reduce the offloading amount and enhance the clustering ability of the GMM which, in turn, leads to an improved system performance. Simulation results for both point-to-point and multi-user systems demonstrate the effectiveness of the proposed extensions.
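The feedback idea in this abstract (report only the index of the strongest mixture component, and shorten the feedback by discarding components) can be sketched on a toy model. This is a simplified illustration under stated assumptions: the mixture is one-dimensional and real-valued, whereas the scheme described operates on channel observations; the function names are hypothetical; and pruning by smallest mixture weight is an assumed criterion, since the abstract does not say how dispensable components are identified.

```python
import math
from typing import List, Tuple

# One mixture component: (weight, mean, variance). Toy 1-D real-valued GMM.
Component = Tuple[float, float, float]


def log_gaussian(x: float, mean: float, var: float) -> float:
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def feedback_index(y: float, gmm: List[Component]) -> int:
    """Index of the component with the highest responsibility for y.

    The normalizing constant is shared by all components, so comparing
    log(weight) + log-density is enough.
    """
    scores = [math.log(w) + log_gaussian(y, m, v) for w, m, v in gmm]
    return max(range(len(gmm)), key=scores.__getitem__)


def feedback_bits(index: int, num_components: int) -> str:
    """Encode the component index with ceil(log2(K)) bits."""
    n_bits = max(1, math.ceil(math.log2(num_components)))
    return format(index, f"0{n_bits}b")


def prune(gmm: List[Component], n_bits: int) -> List[Component]:
    """Keep the 2**n_bits heaviest components and renormalize the weights.

    Assumption: 'dispensable' is taken to mean 'smallest weight'.
    """
    keep = sorted(range(len(gmm)), key=lambda k: gmm[k][0], reverse=True)[: 2 ** n_bits]
    total = sum(gmm[k][0] for k in keep)
    return [(gmm[k][0] / total, gmm[k][1], gmm[k][2]) for k in sorted(keep)]


gmm4 = [(0.4, -3.0, 1.0), (0.1, -1.0, 1.0), (0.3, 1.0, 1.0), (0.2, 3.0, 1.0)]
observation = 0.8
two_bit_feedback = feedback_bits(feedback_index(observation, gmm4), len(gmm4))
gmm2 = prune(gmm4, n_bits=1)
one_bit_feedback = feedback_bits(feedback_index(observation, gmm2), len(gmm2))
```

The same learned mixture serves both feedback lengths: the two-bit and one-bit codes are derived from `gmm4` without retraining.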

November 27, 2023

A Closer Look at Audio-Visual Segmentation

Yuanhong Chen,Yuyuan Liu,Hu Wang,Fengbei Liu,Chong Wang,Gustavo Carneiro

Audio-visual segmentation (AVS) is a complex task that involves accurately segmenting the corresponding sounding object based on audio-visual queries. Successful audio-visual learning requires two essential components: 1) an unbiased dataset with high-quality pixel-level multi-class labels, and 2) a model capable of effectively linking audio information with its corresponding visual object. However, these two requirements are only partially addressed by current methods, with training sets containing biased audio-visual data, and models that generalise poorly beyond this biased training set. In this work, we propose a new strategy to build cost-effective and relatively unbiased audio-visual semantic segmentation benchmarks. Our strategy, called Visual Post-production (VPO), explores the observation that it is not necessary to have explicit audio-visual pairs extracted from single video sources to build such benchmarks. We also refine the previously proposed AVSBench to transform it into the audio-visual semantic segmentation benchmark AVSBench-Single+. Furthermore, this paper introduces a new pixel-wise audio-visual contrastive learning method to enable a better generalisation of the model beyond the training set. We verify the validity of the VPO strategy by showing that state-of-the-art (SOTA) models trained with datasets built by matching audio and visual data from different sources or with datasets containing audio and visual data from the same video source produce almost the same accuracy. Then, using the proposed VPO benchmarks and AVSBench-Single+, we show that our method produces more accurate audio-visual semantic segmentation than SOTA models. Code and dataset will be available.

November 27, 2023

Scale-Adaptive Feature Aggregation for Efficient Space-Time Video Super-Resolution

Zhewei Huang,Ailin Huang,Xiaotao Hu,Chen Hu,Jun Xu,Shuchang Zhou

from arxiv, WACV2024, 16 pages

The Space-Time Video Super-Resolution (STVSR) task aims to enhance the visual quality of videos by simultaneously performing video frame interpolation (VFI) and video super-resolution (VSR). However, faced with the additional temporal dimension and scale inconsistency, most existing STVSR methods are complex and inflexible in dynamically modeling different motion amplitudes. In this work, we find that choosing an appropriate processing scale achieves remarkable benefits in flow-based feature propagation. We propose a novel Scale-Adaptive Feature Aggregation (SAFA) network that adaptively selects sub-networks with different processing scales for individual samples. Experiments on four public STVSR benchmarks demonstrate that SAFA achieves state-of-the-art performance. Our SAFA network outperforms recent state-of-the-art methods such as TMNet and VideoINR by an average of over 0.5 dB in PSNR, while requiring less than half the number of parameters and only one third of the computational cost.
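To put the reported 0.5 dB gain in context, the sketch below computes PSNR from its standard definition, 10 * log10(MAX^2 / MSE), and converts a dB gain into the corresponding reduction in mean squared error. The function names and the 8-bit peak value of 255 are assumptions made for illustration; the abstract does not state the evaluation protocol.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], estimate: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks when PSNR improves by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


# A 0.5 dB PSNR gain corresponds to roughly an 11% lower mean squared error.
ratio = mse_ratio_for_gain(0.5)
example_psnr = psnr([100.0, 100.0], [110.0, 90.0])
```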

November 23, 2023

Weakly-Supervised Video Moment Retrieval via Regularized Two-Branch Proposal Networks with Erasing Mechanism

Haoyuan Li,Zhou Zhao,Zhu Zhang,Zhijie Lin

Video moment retrieval aims to identify the target moment in an untrimmed video according to a given sentence. Because temporal boundary annotations are extremely time-consuming to acquire, the weakly-supervised setting, where only video-sentence pairs are available during training, has received increasing attention. Most existing weakly-supervised methods adopt a MIL-based framework to develop inter-sample confrontment, but neglect the intra-sample confrontment between moments with similar semantics. Therefore, these methods fail to distinguish the correct moment from plausible negative moments. Further, previous attention models in cross-modal interaction tend to focus excessively on a few dominant words, ignoring the comprehensive video-sentence correspondence. In this paper, we propose a novel Regularized Two-Branch Proposal Network with Erasing Mechanism to consider the inter-sample and intra-sample confrontments simultaneously. Concretely, we first devise a language-aware visual filter to generate both enhanced and suppressed video streams. Then, we design a sharable two-branch proposal module to generate positive and plausible negative proposals from the enhanced and suppressed branches respectively, contributing to sufficient confrontment. Besides, we introduce an attention-guided dynamic erasing mechanism in the enhanced branch to discover the complementary video-sentence relation. Moreover, we apply two types of proposal regularization to stabilize the training process and improve model performance. Extensive experiments on the ActivityCaption, Charades-STA and DiDeMo datasets show the effectiveness of our method.

April 29, 2021

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

Christoph Feichtenhofer,Haoqi Fan,Bo Xiong,Ross Girshick,Kaiming He

from arxiv, CVPR 2021

We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart. Code is made available at //github.com/facebookresearch/SlowFast

March 18, 2020

Few-Shot Graph Classification with Model Agnostic Meta-Learning

Ning Ma,Jiajun Bu,Jieyu Yang,Zhen Zhang,Chengwei Yao,Zhi Yu

Graph classification aims to perform accurate information extraction and classification over graph-structured data. In the past few years, Graph Neural Networks (GNNs) have achieved satisfactory performance on graph classification tasks. However, most GNN-based methods focus on designing graph convolutional operations and graph pooling operations, overlooking that collecting or labeling graph-structured data is more difficult than grid-based data. We utilize meta-learning for few-shot graph classification to alleviate the scarcity of labeled graph samples when training new tasks. More specifically, to boost the learning of graph classification tasks, we leverage GNNs as the graph embedding backbone and meta-learning as the training paradigm to capture task-specific knowledge rapidly in graph classification tasks and transfer it to new tasks. To enhance the robustness of the meta-learner, we design a novel step controller driven by Reinforcement Learning. The experiments demonstrate that our framework works well compared to baselines.

February 27, 2020

Meta-Transfer Learning for Zero-Shot Super-Resolution

Jae Woong Soh,Sunwoo Cho,Nam Ik Cho

from arxiv, Will be presented in CVPR 2020

Convolutional neural networks (CNNs) have shown dramatic improvements in single image super-resolution (SISR) by using large-scale external samples. Despite their remarkable performance based on the external dataset, they cannot exploit internal information within a specific image. Another problem is that they are applicable only to the specific data condition under which they were supervised. For instance, the low-resolution (LR) image should be a "bicubic" downsampled noise-free image from a high-resolution (HR) one. To address both issues, zero-shot super-resolution (ZSSR) has been proposed for flexible internal learning. However, ZSSR requires thousands of gradient updates, i.e., a long inference time. In this paper, we present Meta-Transfer Learning for Zero-Shot Super-Resolution (MZSR), which leverages ZSSR. Precisely, it is based on finding a generic initial parameter that is suitable for internal learning. Thus, we can exploit both external and internal information, where one single gradient update can yield quite considerable results. (See Figure 1). With our method, the network can quickly adapt to a given image condition. In this respect, our method can be applied to a large spectrum of image conditions within a fast adaptation process.
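The idea of learning an initial parameter from which a single gradient update already helps can be illustrated on a toy problem. The sketch below is a MAML-style loop on a scalar model, used here as a stand-in and not as the MZSR training procedure: each "task" is the quadratic loss s * (w - slope)^2, and all names, step sizes, and task values are assumptions chosen for illustration.

```python
from typing import Sequence


def inner_step(w: float, slope: float, s: float, alpha: float) -> float:
    """One gradient step on the task loss s * (w - slope)**2."""
    return w - alpha * 2.0 * s * (w - slope)


def task_loss(w: float, slope: float, s: float) -> float:
    return s * (w - slope) ** 2


def meta_train(slopes: Sequence[float], s: float = 1.0, alpha: float = 0.1,
               beta: float = 0.1, steps: int = 500, w0: float = 0.0) -> float:
    """Find an initialization that performs well after one inner step per task."""
    w = w0
    shrink = 1.0 - 2.0 * alpha * s  # derivative of the adapted weight w.r.t. w
    for _ in range(steps):
        grad = 0.0
        for slope in slopes:
            w_adapted = inner_step(w, slope, s, alpha)
            # Chain rule: d loss(w_adapted) / d w
            grad += 2.0 * s * (w_adapted - slope) * shrink
        w -= beta * grad / len(slopes)
    return w


tasks = [1.0, 2.0, 3.0]
w_init = meta_train(tasks)
# Adapting to one task with a single step lowers that task's loss.
loss_before = task_loss(w_init, 3.0, 1.0)
loss_after = task_loss(inner_step(w_init, 3.0, 1.0, 0.1), 3.0, 1.0)
```

For this symmetric toy problem the learned initialization settles at the mean of the task optima, which is the point from which one step reduces the loss equally well for every task.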