
Video classification · Vision · Processing (programming language) · MoDELS · Range

January 20, 2022

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

Chao-Yuan Wu, Yanghao Li, Karttikeya Mangalam, Haoqi Fan, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

from arxiv, Technical report

While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of a video without hitting the computation or memory bottlenecks. In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration. Through the memory, the model can reference prior context for long-term modeling, with only a marginal cost. Based on this idea, we build MeMViT, a Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x longer than existing models with only 4.5% more compute; traditional methods need >3,000% more compute to do the same. On a wide range of settings, the increased temporal support enabled by MeMViT brings large gains in recognition accuracy consistently. MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets. Code and models will be made publicly available.
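The online, memory-caching idea described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say what exactly is cached, so the sketch assumes raw clip tokens are stored, and the names `attend`, `process_video_online`, and `memory_len` are hypothetical.

```python
import math
from collections import deque


def attend(queries, keys, values):
    """Scaled dot-product attention over plain Python lists of vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs


def process_video_online(clips, memory_len=2):
    """Process clips in order; each clip attends to itself plus cached clips."""
    memory = deque(maxlen=memory_len)  # assumed cache of past clip tokens
    results = []
    context_sizes = []
    for tokens in clips:
        cached = [t for past in memory for t in past]
        context = cached + tokens  # keys/values cover past and present
        results.append(attend(tokens, context, context))
        context_sizes.append(len(context))
        memory.append(tokens)  # make this clip available to later iterations
    return results, context_sizes
```

Only the current clip produces queries, so the per-step cost grows with the size of the cached context rather than with the full video length, which is the intuition behind the "marginal cost" claim.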

Related content

Video classification

Attention mechanism · Video classification · Correlation coefficient · Performer · Extensibility

April 20, 2022

Attention in Attention: Modeling Context Correlation for Efficient Video Classification

Yanbin Hao, Shuo Wang, Pei Cao, Xinjian Gao, Tong Xu, Jinmeng Wu, Xiangnan He

from arxiv, 13 pages

Attention mechanisms have significantly boosted the performance of video classification neural networks thanks to the utilization of perspective contexts. However, current research on video attention generally focuses on adopting a specific aspect of contexts (e.g., channel, spatial/temporal, or global context) to refine the features and neglects their underlying correlation when computing attentions. This leads to incomplete context utilization and hence limited performance improvement. To tackle the problem, this paper proposes an efficient attention-in-attention (AIA) method for element-wise feature refinement, which investigates the feasibility of inserting the channel context into the spatio-temporal attention learning module, referred to as CinST, and also its reverse variant, referred to as STinC. Specifically, we instantiate the video feature contexts as dynamics aggregated along a specific axis with global average and max pooling operations. In an AIA module, the first attention block uses one kind of context information to guide the calculation of the gating weights of the second attention block, which targets the other context. Moreover, all the computational operations in attention units act on the pooled dimension, which results in a very small increase in computational cost (<0.02%). To verify our method, we densely integrate it into two classical video network backbones and conduct extensive experiments on several standard video classification benchmarks. The source code of our AIA is available at //github.com/haoyanbin918/Attention-in-Attention.

Knowledge · Learning · Performer · MoDELS · Extensibility

April 20, 2022

K-LITE: Learning Transferable Visual Models with External Knowledge

Sheng Shen, Chunyuan Li, Xiaowei Hu, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Anna Rohrbach, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Jianfeng Gao

from arxiv, Preprint. The first three authors contribute equally

Recent state-of-the-art computer vision systems are trained from natural language supervision, ranging from simple object category names to descriptive captions. This free form of supervision ensures high generality and usability of the learned visual models, based on extensive heuristics on data collection to cover as many visual concepts as possible. Alternatively, learning with external knowledge about images is a promising way which leverages a much more structured source of supervision. In this paper, we propose K-LITE (Knowledge-augmented Language-Image Training and Evaluation), a simple strategy to leverage external knowledge to build transferable visual systems: In training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that can understand both visual concepts and their knowledge; In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification and object detection, benchmarking on 20 and 13 different existing datasets, respectively. The proposed knowledge-augmented models show significant improvement in transfer learning performance over existing methods.

Transformation · Learning · Tokenizer · Extensibility · Reducible

April 20, 2022

Learning Trajectory-Aware Transformer for Video Super-Resolution

Chengxu Liu, Huan Yang, Jianlong Fu, Xueming Qian

from arxiv, CVPR 2022 Oral

Video super-resolution (VSR) aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts. Although some progress has been made, effectively utilizing temporal dependency across entire video sequences remains a grand challenge. Existing approaches usually align and aggregate video frames from limited adjacent frames (e.g., 5 or 7 frames), which prevents these approaches from achieving satisfactory results. In this paper, we take one step further to enable effective spatio-temporal learning in videos. We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR). In particular, we formulate video frames into several pre-aligned trajectories which consist of continuous visual tokens. For a query token, self-attention is only learned on relevant visual tokens along spatio-temporal trajectories. Compared with vanilla vision Transformers, such a design significantly reduces the computational cost and enables Transformers to model long-range features. We further propose a cross-scale feature tokenization module to overcome scale-changing problems that often occur in long-range videos. Experimental results demonstrate the superiority of the proposed TTVSR over state-of-the-art models, through extensive quantitative and qualitative evaluations on four widely used video super-resolution benchmarks. Both code and pre-trained models can be downloaded at //github.com/researchmm/TTVSR.

Transformation · Example · Vision · INFORMS · Backbone

April 18, 2022

Temporally Efficient Vision Transformer for Video Instance Segmentation

Shusheng Yang, Xinggang Wang, Yu Li, Yuxin Fang, Jiemin Fang, Wenyu Liu, Xun Zhao, Ying Shan

from arxiv, To appear in CVPR 2022

Recently, vision transformers have achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Different from previous transformer-based VIS methods, TeViT is nearly convolution-free and consists of a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free messenger shift mechanism for early temporal context fusion. In the head stages, we propose a parameter-shared spatiotemporal query interaction mechanism to build the one-to-one correspondence between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results and maintains high inference speed, e.g., 46.6 AP with 68.9 FPS on YouTube-VIS-2019. Code is available at //github.com/hustvl/TeViT.

Processing (programming language) · Lecture notes · Transformer · state-of-the-art · Model evaluation

April 18, 2022

Event Transformer. A sparse-aware solution for efficient event data processing

Alberto Sabater, Luis Montesano, Ana C. Murillo

Event cameras are sensors of great interest for many applications that run in low-resource and challenging environments. They log sparse illumination changes with high temporal resolution and high dynamic range, while they present minimal power consumption. However, top-performing methods often ignore specific event-data properties, leading to the development of generic but computationally expensive algorithms. Efforts toward efficient solutions usually do not achieve top-accuracy results for complex tasks. This work proposes a novel framework, Event Transformer (EvT), that effectively takes advantage of event-data properties to be highly efficient and accurate. We introduce a new patch-based event representation and a compact transformer-like architecture to process it. EvT is evaluated on different event-based benchmarks for action and gesture recognition. Evaluation results show better or comparable accuracy to the state-of-the-art while requiring significantly less computation resources, which makes EvT able to work with minimal latency both on GPU and CPU.
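As a rough illustration of what a sparse, patch-based event representation could look like, the sketch below groups raw events into spatial patches and keeps only patches that received enough events. The abstract does not give the actual representation, so the event tuple layout `(x, y, t, polarity)`, the function name `events_to_patches`, and the parameters `patch_size` and `min_events` are all assumptions made for this example.

```python
from collections import defaultdict


def events_to_patches(events, patch_size=4, min_events=2):
    """Group (x, y, t, polarity) events by patch; keep patches with enough events."""
    buckets = defaultdict(list)
    for x, y, t, polarity in events:
        # Integer division maps a pixel coordinate to its patch index.
        buckets[(x // patch_size, y // patch_size)].append((x, y, t, polarity))
    # Patches with too few events are dropped, so empty regions cost nothing.
    return {key: evs for key, evs in buckets.items() if len(evs) >= min_events}
```

Because only active patches are returned, any downstream model would process a number of tokens proportional to scene activity rather than to sensor resolution.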

Transformation · Performer · state-of-the-art · Precision/Accuracy · MoDELS

April 15, 2022

TubeR: Tubelet Transformer for Video Action Detection

Jiaojiao Zhao, Yanyi Zhang, Xinyu Li, Hao Chen, Shuai Bing, Mingze Xu, Chunhui Liu, Kaustav Kundu, Yuanjun Xiong, Davide Modolo, Ivan Marsic, Cees G. M. Snoek, Joseph Tighe

from arxiv, Accepted at CVPR 2022 (Oral)

We propose TubeR: a simple solution for spatio-temporal video action detection. Different from existing methods that depend on either an off-line actor detector or hand-designed actor-positional hypotheses like proposals or anchors, we propose to directly detect an action tubelet in a video by simultaneously performing action localization and recognition from a single representation. TubeR learns a set of tubelet-queries and utilizes a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively reinforces the model capacity compared to using actor-positional hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, we propose a context aware classification head to utilize short-term and long-term context to strengthen action classification, and an action switch regression head for detecting the precise temporal action extent. TubeR directly produces action tubelets with variable lengths and even maintains good results for long video clips. TubeR outperforms the previous state-of-the-art on commonly used action detection datasets AVA, UCF101-24 and JHMDB51-21.

Multimodal · Learning · Extensibility · Deep learning · Processing (programming language)

May 24, 2021

Recent Advances and Trends in Multimodal Deep Learning: A Review

Jabeen Summaira, Xi Li, Amin Muhammad Shoib, Songyuan Li, Jabbar Abdul

Deep learning has been applied to a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning is to create models that can process and link information using various modalities. Despite the extensive development of unimodal learning, it still cannot cover all the aspects of human learning. Multimodal learning helps to understand and analyze better when various senses are engaged in the processing of information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. A detailed analysis of past and current baseline approaches and an in-depth study of recent advancements in multimodal deep learning applications are provided. A fine-grained taxonomy of various multimodal deep learning applications is proposed, elaborating on different applications in more depth. Architectures and datasets used in these applications are also discussed, along with their evaluation metrics. Lastly, the main issues are highlighted separately for each domain, along with possible future research directions.

contrastive · Learning · Contrastive learning · Object detection · Optimizer

April 4, 2021

Dense Contrastive Learning for Self-Supervised Visual Pre-Training

Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, Lei Li

from arxiv, 11 pages. Accepted to IEEE/CVF Conf. Comp. Vision Pattern Recognition (CVPR) 2021; Oral paper

To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only <1% slower), but demonstrates consistently superior performance when transferring to downstream dense prediction tasks including object detection, semantic segmentation and instance segmentation; and outperforms the state-of-the-art methods by a large margin. Specifically, over the strong MoCo-v2 baseline, our method achieves significant improvements of 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation and 1.8% mIoU on Cityscapes semantic segmentation. Code is available at: //git.io/AdelaiDet
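A pairwise contrastive loss applied to local features, as described above, might be sketched like this. It is an illustrative simplification rather than the paper's method: matching each local feature to its most similar counterpart in the other view, passing negatives in as an explicit list, and the temperature value 0.2 are all assumptions, as are the function names. Feature vectors are assumed to be non-zero.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(query, positive, negatives, temperature=0.2):
    """Contrastive loss for one query: pull the positive close, push negatives away."""
    pos = math.exp(cosine(query, positive) / temperature)
    neg = sum(math.exp(cosine(query, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def dense_contrastive_loss(view_a, view_b, negatives, temperature=0.2):
    """Average the per-location loss over all local features of view_a."""
    losses = []
    for feat in view_a:
        # Assumed correspondence rule: the most similar local feature in view_b.
        match = max(view_b, key=lambda other: cosine(feat, other))
        losses.append(info_nce(feat, match, negatives, temperature))
    return sum(losses) / len(losses)
```

The difference from an image-level loss is that the sum runs over spatial locations, so every local feature gets its own positive pair.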

MoDELS · state-of-the-art · Video classification · Extensibility · Networking

January 5, 2021

MVFNet: Multi-View Fusion Network for Efficient Video Recognition

Wenhao Wu, Dongliang He, Tianwei Lin, Fu Li, Chuang Gan, Errui Ding

from arxiv, Accepted by AAAI2021

Conventionally, the spatiotemporal modeling network and its complexity are the two most studied research topics in video action recognition. Existing state-of-the-art methods have achieved excellent accuracy regardless of complexity, while efficient spatiotemporal modeling solutions are slightly inferior in performance. In this paper, we attempt to achieve both efficiency and effectiveness simultaneously. First of all, besides traditionally treating H x W x T video frames as a space-time signal (viewing from the Height-Width spatial plane), we propose to also model video from the other two planes, Height-Time and Width-Time, to capture the dynamics of video thoroughly. Secondly, our model is designed based on 2D CNN backbones, and model complexity is kept well in mind by design. Specifically, we introduce a novel multi-view fusion (MVF) module to exploit video dynamics using separable convolution for efficiency. It is a plug-and-play module and can be inserted into off-the-shelf 2D CNNs to form a simple yet effective model called MVFNet. Moreover, MVFNet can be thought of as a generalized video modeling framework, and it can specialize to existing methods such as C2D, SlowOnly, and TSM under different settings. Extensive experiments are conducted on popular benchmarks (i.e., Something-Something V1 & V2, Kinetics, UCF-101, and HMDB-51) to show its superiority. The proposed MVFNet can achieve state-of-the-art performance with 2D CNN's complexity.

Video classification · Networking · Reducible · FAST · INFORMS

December 10, 2018

SlowFast Networks for Video Recognition

Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He

from arxiv, Technical report

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report 79.0% accuracy on the Kinetics dataset without using any pre-training, largely surpassing the previous best results of this kind. On AVA action detection we achieve a new state-of-the-art of 28.3 mAP. Code will be made publicly available.
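The two-rate sampling behind the Slow and Fast pathways can be illustrated with a small frame-index sketch. The abstract gives no concrete sampling rates, so the stride of 16 and the speed ratio of 8 below are illustrative assumptions, and `sample_pathways` is a hypothetical helper rather than part of any released code.

```python
def sample_pathways(num_frames, slow_stride=16, speed_ratio=8):
    """Return frame indices for a low-rate (slow) and high-rate (fast) pathway."""
    if slow_stride % speed_ratio != 0:
        raise ValueError("slow_stride must be divisible by speed_ratio")
    fast_stride = slow_stride // speed_ratio
    slow = list(range(0, num_frames, slow_stride))  # few frames: spatial semantics
    fast = list(range(0, num_frames, fast_stride))  # many frames: fine motion
    return slow, fast
```

With these assumed settings, a 64-frame clip yields 4 slow frames and 32 fast frames, so the Fast pathway sees 8 times as many frames, which is why the abstract stresses keeping it lightweight by reducing its channel capacity.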
