We present MoDist as a novel method to explicitly distill motion information into self-supervised video representations. Compared to previous video representation learning methods that mostly focus on learning motion cues implicitly from RGB inputs, we show that the representation learned with our MoDist method focuses more on foreground motion regions and thus generalizes better to downstream tasks. To achieve this, MoDist enriches standard contrastive learning objectives for RGB video clips with a cross-modal learning objective between a Motion pathway and a Visual pathway. We evaluate MoDist on several datasets for both action recognition (UCF101/HMDB51/SSv2) and action detection (AVA), and demonstrate state-of-the-art self-supervised performance on all datasets. Furthermore, we show that the MoDist representation can be as effective as (in some cases even better than) representations learned with full supervision. Given its simplicity, we hope MoDist could serve as a strong baseline for future research in self-supervised video representation learning.
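
The cross-modal objective described above can be illustrated with a minimal InfoNCE-style sketch. This is not the MoDist implementation: the function name `cross_modal_info_nce`, the temperature value, and the use of in-batch negatives are all assumptions made for illustration. Each RGB embedding is pulled toward the motion embedding of the same clip and pushed away from the motion embeddings of the other clips.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    # Assumes v is a non-zero vector.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cross_modal_info_nce(rgb_embs, motion_embs, temperature=0.1):
    """Mean InfoNCE loss; rgb_embs[i] and motion_embs[i] come from the same clip.

    Hypothetical helper: the temperature and in-batch negatives are
    illustrative assumptions, not values taken from the paper.
    """
    rgb = [l2_normalize(v) for v in rgb_embs]
    motion = [l2_normalize(v) for v in motion_embs]
    total = 0.0
    for i, query in enumerate(rgb):
        logits = [dot(query, key) / temperature for key in motion]
        # Numerically stable log-sum-exp over all motion keys.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the matching motion key.
        total += log_denominator - logits[i]
    return total / len(rgb)

rgb_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = cross_modal_info_nce(rgb_batch, [[0.9, 0.1], [0.1, 0.9]])
shuffled = cross_modal_info_nce(rgb_batch, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

The loss is lower when each RGB embedding lines up with its own clip's motion embedding than when the pairing is shuffled, which is the behaviour a cross-modal contrastive objective rewards.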

Related content

Representation learning

Representation learning learns vector representations from training data, which overcomes the limitations of hand-crafted methods. It is usually divided into two broad categories: unsupervised and supervised representation learning. Most unsupervised methods use the latent variables of autoencoders (such as denoising autoencoders and sparse autoencoders) as the representation. The more recent variational autoencoders tolerate noise and outliers better. However, exactly inferring the latent structure of given data is almost impossible, so several approximate inference strategies are used instead. In addition, some unsupervised methods aim to approximate a specific similarity measure. One proposed unsupervised, similarity-preserving representation learning framework uses matrix factorization to preserve pairwise dynamic time warping (DTW) similarities. Another learns DTW-preserving shapelets, so that the Euclidean distance in the transformed space approximates the true DTW distance on the original data. Supervised representation learning methods can exploit label information to better capture the semantic structure of the data. Siamese networks and triplet networks are two currently popular models; their goal is to maximize the distance between classes and minimize the distance within each class.

April 4, 2021

Dense Contrastive Learning for Self-Supervised Visual Pre-Training

Xinlong Wang,Rufeng Zhang,Chunhua Shen,Tao Kong,Lei Li

from arxiv, 11 pages. Accepted to IEEE/CVF Conf. Comp. Vision Pattern Recognition (CVPR) 2021; Oral paper

To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only <1% slower), but demonstrates consistently superior performance when transferring to downstream dense prediction tasks including object detection, semantic segmentation and instance segmentation; and outperforms the state-of-the-art methods by a large margin. Specifically, over the strong MoCo-v2 baseline, our method achieves significant improvements of 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation and 1.8% mIoU on Cityscapes semantic segmentation. Code is available at: //git.io/AdelaiDet
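
A pixel-level contrastive loss of the kind described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the abstract does not say how correspondences are found, so matching each local feature to its most cosine-similar feature in the other view is an assumption, as are the names `match_local_features` and `dense_contrastive_loss`, the temperature, and the explicit list of negatives.

```python
import math

def cosine(a, b):
    # Cosine similarity; assumes both vectors are non-zero.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def match_local_features(view1, view2):
    """For each local feature in view1, return the index of its best match in view2.

    Assumption: correspondence = highest cosine similarity.
    """
    return [max(range(len(view2)), key=lambda j: cosine(f, view2[j])) for f in view1]

def dense_contrastive_loss(view1, view2, negatives, temperature=0.2):
    """Mean contrastive loss over local features (hypothetical helper).

    view1, view2: lists of local feature vectors from two views of one image.
    negatives: local feature vectors taken from other images.
    """
    matches = match_local_features(view1, view2)
    total = 0.0
    for feature, j in zip(view1, matches):
        positive = math.exp(cosine(feature, view2[j]) / temperature)
        negative = sum(math.exp(cosine(feature, n) / temperature) for n in negatives)
        total += -math.log(positive / (positive + negative))
    return total / len(view1)

view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.1, 1.0], [1.0, 0.2]]
matches = match_local_features(view_a, view_b)
loss = dense_contrastive_loss(view_a, view_b, negatives=[[-1.0, -1.0]])
print(matches, loss)
```

Each location contributes its own positive pair, in contrast to an image-level loss that uses a single pooled feature per view.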

April 2, 2021

Self-supervised Video Representation Learning by Context and Motion Decoupling

Lianghua Huang,Yu Liu,Bin Wang,Pan Pan,Yinghui Xu,Rong Jin

from arxiv, Accepted by CVPR2021

A key challenge in self-supervised video representation learning is how to effectively capture motion information besides context bias. While most existing works implicitly achieve this with video-specific pretext tasks (e.g., predicting clip orders, time arrows, and paces), we develop a method that explicitly decouples motion supervision from context bias through a carefully designed pretext task. Specifically, we take the keyframes and motion vectors in compressed videos (e.g., in H.264 format) as the supervision sources for context and motion, respectively, which can be efficiently extracted at over 500 fps on the CPU. Then we design two pretext tasks that are jointly optimized: a context matching task where a pairwise contrastive loss is cast between video clip and keyframe features; and a motion prediction task where clip features, passed through an encoder-decoder network, are used to estimate motion features in a near future. These two tasks use a shared video backbone and separate MLP heads. Experiments show that our approach improves the quality of the learned video representation over previous works, where we obtain absolute gains of 16.0% and 11.1% in video retrieval recall on UCF101 and HMDB51, respectively. Moreover, we find the motion prediction to be a strong regularization for video networks, where using it as an auxiliary task improves the accuracy of action recognition with a margin of 7.4%~13.8%.

December 4, 2020

Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion

Jinpeng Wang,Yuting Gao,Ke Li,Jianguo Hu,Xinyang Jiang,Xiaowei Guo,Rongrong Ji,Xing Sun

from arxiv, AAAI2021

One significant factor we expect video representation learning to capture, especially in contrast with image representation learning, is object motion. However, we found that in the current mainstream video datasets, some action categories are highly related to the scene where the action happens, which pushes the model to degrade to a solution where only the scene information is encoded. For example, a trained model may predict a video as playing football simply because it sees the field, neglecting that the subject is dancing as a cheerleader on the field. This runs against the original intention of video representation learning and may introduce a scene bias across datasets that cannot be ignored. To tackle this problem, we propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to the motion information. Specifically, we construct a positive clip and a negative clip for each video. Compared to the original video, the positive clip is motion-untouched but scene-broken, and the negative clip is motion-broken but scene-untouched, by Spatial Local Disturbance and Temporal Local Disturbance. Our objective is to pull the positive closer to the original clip in the latent space while pushing the negative farther away. In this way, the impact of the scene is weakened while the temporal sensitivity of the network is further enhanced. We conduct experiments on two tasks with various backbones and different pre-training datasets, and find that our method surpasses the SOTA methods with remarkable 8.1% and 8.8% improvements on the action recognition task on the UCF101 and HMDB51 datasets, respectively, using the same backbone.
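
The pull/push objective described above can be illustrated with a margin-based triplet loss. The abstract does not give the exact loss, so the margin form, the Euclidean distance, the margin value, and the name `triplet_loss` are assumptions for illustration only; the embeddings stand in for the original, positive (scene-broken), and negative (motion-broken) clips.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge loss that is zero once the negative is at least `margin`
    farther from the anchor than the positive (hypothetical helper)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

original_clip = [0.0, 0.0]
scene_broken = [0.1, 0.0]   # positive: motion kept, scene disturbed
motion_broken = [1.0, 0.0]  # negative: scene kept, motion disturbed

satisfied = triplet_loss(original_clip, scene_broken, motion_broken)
violated = triplet_loss(original_clip, motion_broken, scene_broken)
print(satisfied, violated)
```

The first call returns zero because the embeddings already respect the desired ordering; the second is positive because the roles are swapped, so gradients would push the representation toward motion sensitivity.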

October 26, 2020

Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Humam Alwassel,Dhruv Mahajan,Bruno Korbar,Lorenzo Torresani,Bernard Ghanem,Du Tran

from arxiv, Accepted to NeurIPS 2020 (spotlight presentation)

Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full-supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
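
The idea of using clusters from one modality as supervision for the other can be sketched as below. This is a toy illustration, not the XDC pipeline: the abstract only says "unsupervised clustering", so plain k-means, the deterministic initialization from the first `k` points, and the names `kmeans_assign` and `video_clip_ids` are assumptions.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_assign(points, k, iterations=10):
    """Tiny k-means returning one cluster index per point (hypothetical helper).

    Centroids start at the first k points so the result is deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments

# Audio features for four clips; the cluster ids become pseudo-labels
# that a video classifier would be trained to predict for the same clips.
audio_features = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
video_clip_ids = ["clip_a", "clip_b", "clip_c", "clip_d"]
pseudo_labels = kmeans_assign(audio_features, k=2)
video_training_pairs = list(zip(video_clip_ids, pseudo_labels))
print(video_training_pairs)
```

The video model never sees human labels here; its targets come entirely from structure discovered in the audio stream, and the roles of the two modalities can be swapped.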

April 28, 2020

CURL: Contrastive Unsupervised Representations for Reinforcement Learning

Aravind Srinivas,Michael Laskin,Pieter Abbeel

from arxiv, First two authors contributed equally, website: //mishalaskin.github.io/curl code: //github.com/MishaLaskin/curl

We present CURL: Contrastive Unsupervised Representations for Reinforcement Learning. CURL extracts high-level features from raw pixels using contrastive learning and performs off-policy control on top of the extracted features. CURL outperforms prior pixel-based methods, both model-based and model-free, on complex tasks in the DeepMind Control Suite and Atari Games showing 1.9x and 1.6x performance gains at the 100K environment and interaction steps benchmarks respectively. On the DeepMind Control Suite, CURL is the first image-based algorithm to nearly match the sample-efficiency and performance of methods that use state-based features.

February 26, 2020

Evolving Losses for Unsupervised Video Representation Learning

AJ Piergiovanni,Anelia Angelova,Michael S. Ryoo

from arxiv, arXiv admin note: text overlap with arXiv:1906.03248

We present a new method to learn video representations from large-scale unlabeled video data. Ideally, this representation will be generic and transferable, directly usable for new tasks such as action recognition and zero-shot or few-shot learning. We formulate unsupervised representation learning as a multi-modal, multi-task learning problem, where the representations are shared across different modalities via distillation. Further, we introduce the concept of loss function evolution by using an evolutionary search algorithm to automatically find an optimal combination of loss functions capturing many (self-supervised) tasks and modalities. Thirdly, we propose an unsupervised representation evaluation metric using distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law. This unsupervised constraint, which is not guided by any labeling, produces similar results to weakly-supervised, task-specific ones. The proposed unsupervised representation learning results in a single RGB network and outperforms previous methods. Notably, it is also more effective than several label-based methods (e.g., ImageNet), with the exception of large, fully labeled video datasets.

February 13, 2020

A Simple Framework for Contrastive Learning of Visual Representations

Ting Chen,Simon Kornblith,Mohammad Norouzi,Geoffrey Hinton

This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.

October 31, 2019

Continual Unsupervised Representation Learning

Dushyant Rao,Francesco Visin,Andrei A. Rusu,Yee Whye Teh,Razvan Pascanu,Raia Hadsell

from arxiv, NeurIPS 2019

Continual learning aims to improve the ability of modern learning systems to deal with non-stationary distributions, typically by attempting to learn a series of tasks sequentially. Prior art in the field has largely considered supervised or reinforcement learning tasks, and often assumes full knowledge of task labels and boundaries. In this work, we propose an approach (CURL) to tackle a more general problem that we will refer to as unsupervised continual learning. The focus is on learning representations without any knowledge about task identity, and we explore scenarios when there are abrupt changes between tasks, smooth transitions from one task to another, or even when the data is shuffled. The proposed approach performs task inference directly within the model, is able to dynamically expand to capture new concepts over its lifetime, and incorporates additional rehearsal-based techniques to deal with catastrophic forgetting. We demonstrate the efficacy of CURL in an unsupervised learning setting with MNIST and Omniglot, where the lack of labels ensures no information is leaked about the task. Further, we demonstrate strong performance compared to prior art in an i.i.d. setting, or when adapting the technique to supervised tasks such as incremental class learning.

December 11, 2018

Learning Discriminative Motion Features Through Detection

Gedas Bertasius,Christoph Feichtenhofer,Du Tran,Jianbo Shi,Lorenzo Torresani

Despite huge success in the image domain, modern detection models such as Faster R-CNN have not been used nearly as much for video analysis. This is arguably due to the fact that detection models are designed to operate on single frames and as a result do not have a mechanism for learning motion representations directly from video. We propose a learning procedure that allows detection models such as Faster R-CNN to learn motion features directly from the RGB video data while being optimized with respect to a pose estimation task. Given a pair of video frames, Frame A and Frame B, we force our model to predict human pose in Frame A using the features from Frame B. We do so by leveraging deformable convolutions across space and time. Our network learns to spatially sample features from Frame B in order to maximize pose detection accuracy in Frame A. This naturally encourages our network to learn motion offsets encoding the spatial correspondences between the two frames. We refer to these motion offsets as DiMoFs (Discriminative Motion Features). In our experiments we show that our training scheme helps learn effective motion cues, which can be used to estimate and localize salient human motion. Furthermore, we demonstrate that as a byproduct, our model also learns features that lead to improved pose detection in still images, and better keypoint tracking. Finally, we show how to leverage our learned model for the tasks of spatiotemporal action localization and fine-grained action recognition.