
Understandability · Multimodal · MoDELS · Extensibility · Performer

February 15, 2020

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Huaishao Luo,Lei Ji,Botian Shi,Haoyang Huang,Nan Duan,Tianrui Li,Xilin Chen,Ming Zhou

We propose UniViLM: a Unified Video and Language pre-training Model for multimodal understanding and generation. Motivated by the recent success of BERT-based pre-training techniques for NLP and image-language tasks, VideoBERT and CBT were proposed to exploit the BERT model for video and language pre-training using narrated instructional videos. Unlike these works, which pre-train only for understanding tasks, we propose a unified video-language pre-training model for both understanding and generation tasks. Our model comprises four components: two single-modal encoders, a cross encoder, and a decoder, all with a Transformer backbone. We first pre-train our model to learn a universal representation for both video and language on a large instructional video dataset. Then we fine-tune the model on two multimodal tasks: an understanding task (text-based video retrieval) and a generation task (multimodal video captioning). Our extensive experiments show that our method improves the performance of both understanding and generation tasks and achieves state-of-the-art results.

Related content

Understandability

Image Captioning · Visual Question Answering · MoDELS · Understandability · Question Answering

October 3, 2019

Unified Vision-Language Pre-Training for Image Captioning and VQA

Luowei Zhou,Hamid Palangi,Lei Zhang,Houdong Hu,Jason J. Corso,Jianfeng Gao

From arXiv. The code and the pre-trained models are available at //github.com/LuoweiZhou/VLP

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at //github.com/LuoweiZhou/VLP.
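The role of the self-attention masks described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `build_attention_mask`, the split into "context" positions (for example image regions) and "target" positions (caption tokens), and the 0/1 encoding are assumptions made for illustration. In the bidirectional mode every position may attend to every other; in the seq2seq mode target positions see the full context but only earlier targets and themselves, and context positions do not see the targets.

```python
def build_attention_mask(num_context, num_target, mode):
    """Return an n x n list of lists; mask[i][j] == 1 means position i may attend to j.

    Positions 0..num_context-1 are context (hypothetically, image regions);
    the remaining positions are target tokens (hypothetically, caption words).
    """
    if mode not in ("bidirectional", "seq2seq"):
        raise ValueError("mode must be 'bidirectional' or 'seq2seq'")
    n = num_context + num_target
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if mode == "bidirectional":
                # Every position conditions on the whole sequence.
                mask[i][j] = 1
            elif j < num_context:
                # seq2seq: the context is visible to every position.
                mask[i][j] = 1
            elif i >= num_context and j <= i:
                # seq2seq: a target token sees itself and earlier targets only.
                mask[i][j] = 1
    return mask


if __name__ == "__main__":
    for row in build_attention_mask(2, 2, "seq2seq"):
        print(row)
```

The same network weights are used in both modes; only the mask passed to self-attention changes, which is what lets one shared transformer serve as both encoder and decoder.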

BERT · MoDELS · Language Modeling · Transformation · state-of-the-art

August 22, 2019

Text Summarization with Pretrained Encoders

Yang Liu,Mirella Lapata

From arXiv. To appear in EMNLP 2019

Bidirectional Encoder Representations from Transformers (BERT) represents the latest incarnation of pretrained language models which have recently advanced a wide range of natural language processing tasks. In this paper, we showcase how BERT can be usefully applied in text summarization and propose a general framework for both extractive and abstractive models. We introduce a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences. Our extractive model is built on top of this encoder by stacking several inter-sentence Transformer layers. For abstractive summarization, we propose a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two (the former is pretrained while the latter is not). We also demonstrate that a two-staged fine-tuning approach can further boost the quality of the generated summaries. Experiments on three datasets show that our model achieves state-of-the-art results across the board in both extractive and abstractive settings. Our code is available at //github.com/nlpyang/PreSumm
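The idea of giving the pretrained encoder and the randomly initialised decoder different optimizers can be sketched as two independent learning-rate schedules. The abstract does not specify the schedule or any hyperparameters, so everything concrete below is an assumption for illustration: the warmup-then-inverse-square-root form (familiar from Transformer training), the function name `warmup_lr`, and the numeric values in `ENCODER` and `DECODER`. The sketch only shows the intent: the encoder gets a smaller peak rate and a longer warmup so that it changes slowly while the decoder is still unstable.

```python
def warmup_lr(step, base_lr, warmup):
    """Linear warmup for `warmup` steps, then decay proportional to step ** -0.5."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)


# Hypothetical settings, chosen only to illustrate the asymmetry.
ENCODER = {"base_lr": 2e-3, "warmup": 20000}  # pretrained: small, slow updates
DECODER = {"base_lr": 0.1, "warmup": 10000}   # trained from scratch: larger updates

if __name__ == "__main__":
    for step in (1000, 10000, 20000, 50000):
        enc = warmup_lr(step, **ENCODER)
        dec = warmup_lr(step, **DECODER)
        print(f"step {step:>6}: encoder lr = {enc:.2e}, decoder lr = {dec:.2e}")
```

In a real training loop each schedule would drive its own optimizer instance, one over the encoder parameters and one over the decoder parameters.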

BERT · Language Representation · state-of-the-art · Understandability · MoDELS

May 24, 2019

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin,Ming-Wei Chang,Kenton Lee,Kristina Toutanova

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
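The claim that fine-tuning needs "just one additional output layer" can be made concrete with a small sketch of such a layer for classification: a single linear map followed by a softmax, applied to a fixed-size sentence vector. This is a plain-Python illustration, not BERT's implementation. The names `classify`, `pooled`, `weights`, and `bias` are hypothetical, and the two-dimensional input stands in for the much larger vector a pre-trained encoder would produce.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, weights, bias):
    """Apply one linear layer plus softmax to a pooled sentence vector.

    pooled:  list of floats (stand-in for the encoder's sentence representation)
    weights: one row of floats per class, each the same length as `pooled`
    bias:    one float per class
    """
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


if __name__ == "__main__":
    probs = classify([1.0, -1.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    print(probs)
```

During fine-tuning, only `weights` and `bias` are new parameters; the encoder that produces `pooled` starts from its pre-trained weights and is updated jointly.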

Text Classification · Language Modeling · BERT · state-of-the-art · MoDELS

May 14, 2019

How to Fine-Tune BERT for Text Classification?

Chi Sun,Xipeng Qiu,Yige Xu,Xuanjing Huang

Language model pre-training has proven to be useful in learning universal language representations. As a state-of-the-art language model pre-training model, BERT (Bidirectional Encoder Representations from Transformers) has achieved amazing results in many language understanding tasks. In this paper, we conduct exhaustive experiments to investigate different fine-tuning methods of BERT on the text classification task and provide a general solution for BERT fine-tuning. Finally, the proposed solution obtains new state-of-the-art results on eight widely-studied text classification datasets.

Multimodal · Video Captioning · Attention Mechanism · Weight · Networking

May 8, 2019

Multimodal Semantic Attention Network for Video Captioning

Liang Sun,Bing Li,Chunfeng Yuan,Zhengjun Zha,Weiming Hu

From arXiv. 6 pages, 4 figures, accepted by IEEE International Conference on Multimedia and Expo (ICME) 2019

Inspired by the fact that different modalities in videos carry complementary information, we propose a Multimodal Semantic Attention Network (MSAN), which is a new encoder-decoder framework incorporating multimodal semantic attributes for video captioning. In the encoding phase, we detect and generate multimodal semantic attributes by formulating it as a multi-label classification problem. Moreover, we add an auxiliary classification loss to our model that can obtain more effective visual features and high-level multimodal semantic attribute distributions for sufficient video encoding. In the decoding phase, we extend each weight matrix of the conventional LSTM to an ensemble of attribute-dependent weight matrices, and employ an attention mechanism to pay attention to different attributes at each time step of the captioning process. We evaluate our algorithm on two popular public benchmarks, MSVD and MSR-VTT, achieving competitive results with the current state-of-the-art across six evaluation metrics.

Language Modeling · Machine Translation · MoDELS · Integration · BLEU

April 1, 2019

Pre-trained Language Model Representations for Language Generation

Sergey Edunov,Alexei Baevski,Michael Auli

From arXiv. NAACL 2019

Pre-trained language model representations have been successful in a wide range of language understanding tasks. In this paper, we examine different strategies to integrate pre-trained representations into sequence-to-sequence models and apply them to neural machine translation and abstractive summarization. We find that pre-trained representations are most effective when added to the encoder network, which slows inference by only 14%. Our experiments in machine translation show gains of up to 5.3 BLEU in a simulated resource-poor setup. While returns diminish with more labeled data, we still observe improvements when millions of sentence-pairs are available. Finally, on abstractive summarization we achieve a new state of the art on the full text version of CNN/DailyMail.

Video Captioning · GRU · Decoding · Language Modeling · Extensibility

July 8, 2018

Video Captioning with Boundary-aware Hierarchical Language Decoding and Joint Video Prediction

Xiangxi Shi,Jianfei Cai,Jiuxiang Gu,Shafiq Joty

The explosion of video data on the internet requires effective and efficient technology to generate captions automatically for people who are not able to watch the videos. Despite the great progress of video captioning research, particularly on video feature encoding, the language decoder is still largely based on the prevailing RNN decoder such as LSTM, which tends to prefer the frequent word that aligns with the video. In this paper, we propose a boundary-aware hierarchical language decoder for video captioning, which consists of a high-level GRU based language decoder, working as a global (caption-level) language model, and a low-level GRU based language decoder, working as a local (phrase-level) language model. Most importantly, we introduce a binary gate into the low-level GRU language decoder to detect the language boundaries. Together with other advanced components including joint video prediction, shared soft attention, and boundary-aware video encoding, our integrated video captioning framework can discover hierarchical language information and distinguish the subject and the object in a sentence, which are usually confusing during the language generation. Extensive experiments on two widely-used video captioning datasets, MSR-Video-to-Text (MSR-VTT) and YouTube-to-Text (MSVD), show that our method is highly competitive, compared with the state-of-the-art methods.

Video Captioning · Extensibility · Global Optimization · state-of-the-art · Integration

April 23, 2018

Jointly Localizing and Describing Events for Dense Video Captioning

Yehao Li,Ting Yao,Yingwei Pan,Hongyang Chao,Tao Mei

From arXiv. CVPR 2018 Spotlight, Rank 1 in ActivityNet Captions Challenge 2017

Automatically describing a video with natural language is regarded as a fundamental challenge in computer vision. The problem nevertheless is not trivial especially when a video contains multiple events to be worthy of mention, which often happens in real videos. A valid question is how to temporally localize and then describe events, which is known as "dense video captioning." In this paper, we present a novel framework for dense video captioning that unifies the localization of temporal event proposals and sentence generation of each proposal, by jointly training them in an end-to-end manner. To combine these two worlds, we integrate a new design, namely descriptiveness regression, into a single shot detection structure to infer the descriptive complexity of each detected proposal via sentence generation. This in turn adjusts the temporal locations of each event proposal. Our model differs from existing dense video captioning methods since we propose a joint and global optimization of detection and captioning, and the framework uniquely capitalizes on an attribute-augmented video captioning architecture. Extensive experiments are conducted on ActivityNet Captions dataset and our framework shows clear improvements when compared to the state-of-the-art techniques. More remarkably, we obtain a new record: METEOR of 12.96% on ActivityNet Captions official test set.

Video Captioning · Networking · Backward · Forward · State Sequence

March 30, 2018

Reconstruction Network for Video Captioning

Bairui Wang,Lin Ma,Wei Zhang,Wei Liu

From arXiv. Accepted by CVPR 2018

In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning. Specifically, the encoder-decoder makes use of the forward flow to produce the sentence description based on the encoded video semantic features. Two types of reconstructors are customized to employ the backward flow and reproduce the video features based on the hidden state sequence generated by the decoder. The generation loss yielded by the encoder-decoder and the reconstruction loss introduced by the reconstructor are jointly used to train the proposed RecNet in an end-to-end fashion. Experimental results on benchmark datasets demonstrate that the proposed reconstructor can boost the encoder-decoder models and lead to significant gains in video caption accuracy.
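How a generation loss and a reconstruction loss might be combined into one training objective can be sketched as a weighted sum. The abstract states only that the two losses are trained jointly, so the specifics here are assumptions for illustration: negative log-likelihood for the caption, mean Euclidean distance between original and reconstructed video features, and the trade-off parameter `weight` with its default value. All function names are hypothetical.

```python
import math


def generation_loss(word_probs):
    """Negative log-likelihood of the reference caption.

    word_probs: the probability the decoder assigned to each reference word.
    """
    return -sum(math.log(p) for p in word_probs)


def reconstruction_loss(original, reconstructed):
    """Mean Euclidean distance between original and reconstructed feature vectors."""
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(orig_vec, rec_vec)))
        for orig_vec, rec_vec in zip(original, reconstructed)
    ]
    return sum(distances) / len(distances)


def joint_loss(word_probs, original, reconstructed, weight=0.2):
    """Forward-flow loss plus weighted backward-flow loss (weight is an assumed knob)."""
    return generation_loss(word_probs) + weight * reconstruction_loss(
        original, reconstructed
    )


if __name__ == "__main__":
    loss = joint_loss(
        word_probs=[0.9, 0.8, 0.95],
        original=[[0.0, 0.0], [1.0, 1.0]],
        reconstructed=[[0.1, 0.0], [1.0, 0.8]],
    )
    print(f"joint loss: {loss:.4f}")
```

Because both terms depend on the decoder's hidden states, minimising the sum pushes those states to retain enough information to describe the video and to recover its features.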

Video Classification · Video Captioning · INFORMS · AIM · Deep Learning

February 22, 2018

Deep Learning for Video Classification and Captioning

Zuxuan Wu,Ting Yao,Yanwei Fu,Yu-Gang Jiang

From arXiv. Book chapter in Frontiers of Multimedia Research

Accelerated by the tremendous increase in Internet bandwidth and storage space, video data has been generated, published and spread explosively, becoming an indispensable part of today's big data. In this paper, we focus on reviewing two lines of research aiming to stimulate the comprehension of videos with deep learning: video classification and video captioning. While video classification concentrates on automatically labeling video clips based on their semantic contents like human actions or complex events, video captioning attempts to generate a complete and natural sentence, enriching the single label as in video classification, to capture the most informative dynamics in videos. In addition, we also provide a review of popular benchmarks and competitions, which are critical for evaluating the technical progress of this vibrant field.
