Multi-modal pretraining for learning high-level multi-modal representations is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, InterBERT (BERT for Interaction), which is the first model in our series of multimodal pretraining methods M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The model has a strong capability for modeling the interaction between the information flows of different modalities. The single-stream interaction module effectively processes information from multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance degradation in single-modal tasks. We pretrain the model with three pretraining tasks, masked segment modeling (MSM), masked region modeling (MRM) and image-text matching (ITM), and finetune the model on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods, and the analysis shows that MSM and MRM are effective for pretraining and that our method achieves performance comparable to BERT on single-modal tasks. In addition, we propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT, which is the first Chinese multi-modal pretrained model. We pretrain the Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from the mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and recently we deployed the model online for topic-based recommendation.
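The abstract names masked segment modeling but does not spell out the masking procedure. The following is a minimal sketch of one plausible reading, in which contiguous spans of text tokens (rather than isolated tokens) are replaced by a mask symbol. The function name `mask_segments`, the `[MASK]` symbol, the 15% ratio and the maximum span length are all illustrative assumptions, not details taken from InterBERT.

```python
import random

MASK = "[MASK]"  # assumed mask symbol, for illustration only


def mask_segments(tokens, mask_ratio=0.15, max_span=3, seed=0):
    """Mask contiguous spans until about mask_ratio of the tokens are covered.

    Returns (masked, labels). labels[i] holds the original token at every
    masked position and None elsewhere, so a model could be trained to
    recover the masked segments.
    """
    n = len(tokens)
    if n == 0:
        return [], []
    rng = random.Random(seed)
    budget = max(1, round(n * mask_ratio))  # number of tokens to mask
    masked = list(tokens)
    labels = [None] * n
    covered = 0
    attempts = 0
    # The attempt cap guarantees termination when free spans are hard to find.
    while covered < budget and attempts < 10 * n:
        attempts += 1
        span = rng.randint(1, min(max_span, budget - covered))
        start = rng.randint(0, n - span)
        positions = range(start, start + span)
        if any(labels[i] is not None for i in positions):
            continue  # overlaps an already masked span, try again
        for i in positions:
            labels[i] = tokens[i]
            masked[i] = MASK
        covered += span
    return masked, labels
```

Calling `mask_segments("a red cotton dress with long sleeves".split())` returns the masked token list together with the per-position targets. Masked region modeling could be sketched the same way over image-region features instead of tokens.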

Related content

INTERACT

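The IFIP TC13 Conference on Human-Computer Interaction is an important venue where researchers and practitioners in human-computer interaction present their work. Over the years, these conferences have attracted researchers from several countries and cultures. Official website: · MoDELS · Transformation · INFORMS · Performer ·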

May 29, 2021

K-XLNet: A General Method for Combining Explicit Knowledge with Language Model Pretraining

Ruiqing Yan,Lanchang Sun,Fang Wang,Xiaoming Zhang

Though pre-trained language models such as BERT and XLNet have rapidly advanced the state of the art on many NLP tasks, they capture semantics only implicitly, relying on surface information between words in the corpus. Intuitively, background knowledge influences the efficacy of understanding. Inspired by this common sense, we focus on improving model pretraining by leveraging explicit knowledge. Unlike recent research that optimizes pretraining models through knowledge masking strategies, we propose a simple but general method to combine explicit knowledge with pretraining. Specifically, we first match knowledge facts from a knowledge graph (KG) and then add a knowledge injection layer to the transformer directly, without changing its architecture. The present study seeks to find the direct impact of explicit knowledge on transformer pre-training. We conduct experiments on various datasets for different downstream tasks. The experimental results show that simply adding external knowledge to the transformer can improve learning performance on many NLP tasks.

Multimodal · MoDELS · Performer · Downstream tasks · Range ·

May 29, 2021

M6: A Chinese Multimodal Pretrainer

Junyang Lin,Rui Men,An Yang,Chang Zhou,Ming Ding,Yichang Zhang,Peng Wang,Ang Wang,Le Jiang,Xianyan Jia,Jie Zhang,Jianwei Zhang,Xu Zou,Zhikang Li,Xiaodong Deng,Jie Liu,Jinbao Xue,Huiling Zhou,Jianxin Ma,Jin Yu,Yong Li,Wei Lin,Jingren Zhou,Jie Tang,Hongxia Yang

from arxiv, 12 pages, technical report. Extension of paper "M6" accepted to KDD 2021

In this work, we construct the largest dataset for multimodal pretraining in Chinese, which consists of over 1.9TB images and 292GB texts that cover a wide range of domains. We propose a cross-modal pretraining method called M6, referring to Multi-Modality to Multi-Modality Multitask Mega-transformer, for unified pretraining on the data of single modality and multiple modalities. We scale the model size up to 10 billion and 100 billion parameters, and build the largest pretrained model in Chinese. We apply the model to a series of downstream applications, and demonstrate its outstanding performance in comparison with strong baselines. Furthermore, we specifically design a downstream task of text-guided image generation, and show that the finetuned M6 can create high-quality images with high resolution and abundant details.

Language modeling · Active learning · MoDELS · Unlabeled · Downstream tasks ·

April 16, 2021

Bayesian Active Learning with Pretrained Language Models

Katerina Margatina,Loic Barrault,Nikolaos Aletras

Active Learning (AL) is a method to iteratively select data for annotation from a pool of unlabeled data, aiming to achieve better model performance than random selection. Previous AL approaches in Natural Language Processing (NLP) have been limited either to task-specific models that are trained from scratch at each iteration using only the labeled data at hand, or to off-the-shelf pretrained language models (LMs) that are not adapted effectively to the downstream task. In this paper, we address these limitations by introducing BALM: Bayesian Active Learning with pretrained language Models. We first propose to adapt the pretrained LM to the downstream task by continuing training with all the available unlabeled data and then use it for AL. We also suggest a simple yet effective fine-tuning method to ensure that the adapted LM is properly trained in both low and high resource scenarios during AL. We finally apply Monte Carlo dropout to the downstream model to obtain well-calibrated confidence scores for data selection with uncertainty sampling. Our experiments on five standard natural language understanding tasks demonstrate that BALM provides substantial data efficiency improvements compared to various combinations of acquisition functions, models and fine-tuning methods proposed in recent AL literature.
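The last step of the abstract, Monte Carlo dropout followed by uncertainty sampling, can be illustrated with a toy example. The sketch below assumes a tiny linear classifier with input dropout standing in for the fine-tuned language model, and uses predictive entropy as the acquisition score. All function names and default values are hypothetical, and this is not the BALM implementation.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout_forward(x, weights, p_drop, rng):
    """One stochastic pass of a toy linear classifier with input dropout."""
    keep = 1.0 - p_drop
    dropped = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    logits = [sum(w * xi for w, xi in zip(row, dropped)) for row in weights]
    return softmax(logits)


def mc_dropout_probs(x, weights, passes=20, p_drop=0.3, rng=None):
    """Average class probabilities over several dropout-perturbed passes."""
    if rng is None:
        rng = random.Random(0)
    runs = [dropout_forward(x, weights, p_drop, rng) for _ in range(passes)]
    return [sum(col) / passes for col in zip(*runs)]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_annotation(pool, weights, k, passes=20, p_drop=0.3, seed=0):
    """Return indices of the k pool items with the highest predictive entropy."""
    rng = random.Random(seed)
    scored = [
        (entropy(mc_dropout_probs(x, weights, passes, p_drop, rng)), i)
        for i, x in enumerate(pool)
    ]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```

In an active learning loop, the selected indices would be sent for annotation, the model retrained on the enlarged labeled set, and the selection repeated on the remaining pool.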

March 17, 2021

WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Yuqi Huo,Manli Zhang,Guangzhen Liu,Haoyu Lu,Yizhao Gao,Guoxing Yang,Jingyuan Wen,Heng Zhang,Baogui Xu,Weihao Zheng,Zongzheng Xi,Yueqian Yang,Anwen Hu,Jinming Zhao,Ruichen Li,Yida Zhao,Liang Zhang,Yuqing Song,Xin Hong,Wanqing Cui,Danyang Hou,Yingyan Li,Junyi Li,Peiyu Liu,Zheng Gong,Chuhao Jin,Yuchong Sun,Shizhe Chen,Zhiwu Lu,Zhicheng Dou,Qin Jin,Yanyan Lan,Wayne Xin Zhao,Ruihua Song,Ji-Rong Wen

from arxiv, This paper is the outcome of the Chinese multi-modal pre-training project called 'WenLan'

Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project 'WenLan' led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples with limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
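To make the queue-based dictionary idea concrete, here is a minimal sketch of a contrastive (InfoNCE-style) loss whose negatives come from a fixed-size queue of keys encoded in earlier batches. It is a toy illustration under stated assumptions, not the BriVL code: embeddings are plain Python lists, the temperature of 0.07 and the class name `NegativeQueue` are hypothetical, and the momentum-updated key encoder that MoCo pairs with the queue is omitted.

```python
import math
from collections import deque


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Cross-entropy of identifying the positive key among all candidates."""
    q = normalize(query)
    logits = [dot(q, normalize(positive_key)) / temperature]
    logits += [dot(q, normalize(k)) / temperature for k in negative_keys]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[0]


class NegativeQueue:
    """Fixed-size FIFO of keys from earlier batches, reused as negatives."""

    def __init__(self, max_size):
        self.keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        # Once the queue is full, the oldest keys are dropped automatically.
        self.keys.extend(batch_keys)

    def negatives(self):
        return list(self.keys)
```

For an image-to-text loss, the image embedding would be the query, its paired text embedding the positive key, and `queue.negatives()` the text embeddings from previous batches. Because the queue decouples the number of negatives from the batch size, many negatives can be used without holding them all in one batch.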

Extensibility · Multi-task learning · Learning · Optimizer · Networking ·

September 16, 2020

Multi-Task Learning for Dense Prediction Tasks: A Survey

Simon Vandenhende,Stamatios Georgoulis,Wouter Van Gansbeke,Marc Proesmans,Dengxin Dai,Luc Van Gool

from arxiv, Code is available: //github.com/SimonVandenhende/Multi-Task-Learning-PyTorch

With the advent of deep learning, many dense prediction tasks, i.e. tasks that produce pixel-level predictions, have seen significant performance improvements. The typical approach is to learn these tasks in isolation, that is, a separate neural network is trained for each individual task. Yet, recent multi-task learning (MTL) techniques have shown promising results w.r.t. performance, computations and/or memory footprint, by jointly tackling multiple tasks through a learned shared representation. In this survey, we provide a well-rounded view on state-of-the-art deep learning approaches for MTL in computer vision, with an explicit emphasis on dense prediction tasks. Our contributions are as follows. First, we consider MTL from a network architecture point of view. We include an extensive overview and discuss the advantages and disadvantages of recent popular MTL models. Second, we examine various optimization methods to tackle the joint learning of multiple tasks. We summarize the qualitative elements of these works and explore their commonalities and differences. Finally, we provide an extensive experimental evaluation across a variety of dense prediction benchmarks to examine the pros and cons of the different methods, including both architectural and optimization-based strategies.

Representation learning · Performer · Regularization term · Learning · Visual question answering ·

June 11, 2020

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

Zhe Gan,Yen-Chun Chen,Linjie Li,Chen Zhu,Yu Cheng,Jingjing Liu

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.

Language representation · Few-shot learning · Text classification · Learning · Performer ·

August 22, 2019

Improving Few-shot Text Classification via Pretrained Language Representations

Ningyu Zhang,Zhanlin Sun,Shumin Deng,Jiaoyan Chen,Huajun Chen

from arxiv, arXiv admin note: substantial text overlap with arXiv:1902.10482, arXiv:1803.02400 by other authors

Text classification tends to be difficult when the data is deficient or when it is required to adapt to unseen classes. In such challenging scenarios, recent studies have often used meta-learning to simulate the few-shot task, thus negating explicit common linguistic features across tasks. Deep language representations have proven to be very effective forms of unsupervised pretraining, yielding contextualized features that capture linguistic properties and benefit downstream natural language understanding tasks. However, the effect of pretrained language representations on few-shot learning for text classification tasks is still not well understood. In this study, we design a few-shot learning model with pretrained language representations and report the empirical results. We show that our approach is not only simple but also produces state-of-the-art performance on a well-studied sentiment classification dataset. This suggests that pretraining could be a promising solution for few-shot learning in many other NLP tasks. The code and the dataset to replicate the experiments are made available at //github.com/zxlzr/FewShotNLP.

BERT · MoDELS · Language modeling · Transformation · state-of-the-art ·

August 22, 2019

Text Summarization with Pretrained Encoders

Yang Liu,Mirella Lapata

from arxiv, To appear in EMNLP 2019

Bidirectional Encoder Representations from Transformers (BERT) represents the latest incarnation of pretrained language models which have recently advanced a wide range of natural language processing tasks. In this paper, we showcase how BERT can be usefully applied in text summarization and propose a general framework for both extractive and abstractive models. We introduce a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences. Our extractive model is built on top of this encoder by stacking several inter-sentence Transformer layers. For abstractive summarization, we propose a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two (the former is pretrained while the latter is not). We also demonstrate that a two-staged fine-tuning approach can further boost the quality of the generated summaries. Experiments on three datasets show that our model achieves state-of-the-art results across the board in both extractive and abstractive settings. Our code is available at //github.com/nlpyang/PreSumm
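The fine-tuning schedule with separate optimizers can be pictured as two learning-rate schedules, one per component. The sketch below assumes a linear-warmup, inverse-square-root-decay schedule and gives the pretrained encoder a smaller peak rate and a longer warmup than the randomly initialized decoder. The schedule form and every numeric value here are illustrative assumptions, not details stated in the abstract.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup up to warmup_steps, then inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5
    return base_lr * min(step ** -0.5, step * warmup_steps ** -1.5)


def learning_rates(step, enc=(2e-3, 20000), dec=(0.1, 10000)):
    """Per-component rates; each tuple is (base_lr, warmup_steps), assumed values."""
    return {
        "encoder": warmup_lr(step, *enc),
        "decoder": warmup_lr(step, *dec),
    }
```

With these settings the encoder is updated gently early on, while the decoder, which starts from scratch, takes larger steps. In a framework such as PyTorch the same effect would typically be obtained with two optimizer instances or two parameter groups.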

state-of-the-art · Comprehensibility · BERT · Denoising autoencoder · Performer ·

June 19, 2019

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Zhilin Yang,Zihang Dai,Yiming Yang,Jaime Carbonell,Ruslan Salakhutdinov,Quoc V. Le

from arxiv, Pretrained models and code are available at //github.com/zihangdai/xlnet

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, natural language inference, sentiment analysis, and document ranking.
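The idea of training over permutations of the factorization order can be illustrated with an attention mask. In the minimal sketch below, a random order over positions is sampled and position i is allowed to attend to position j only if j comes strictly earlier in that order, so each token is predicted from a different subset of the others while the input sequence itself is never reordered. The function name is hypothetical and this is not the XLNet code; the actual model involves further components that are not shown here.

```python
import random


def permutation_attention_mask(seq_len, seed=None):
    """Sample a factorization order and the attention mask it induces.

    Returns (order, mask) where order[t] is the position predicted at step t
    and mask[i][j] is True when position i may attend to position j.
    """
    rng = random.Random(seed)
    order = list(range(seq_len))
    rng.shuffle(order)
    rank = [0] * seq_len  # rank[pos] = step at which pos is predicted
    for step, pos in enumerate(order):
        rank[pos] = step
    mask = [
        [rank[j] < rank[i] for j in range(seq_len)]
        for i in range(seq_len)
    ]
    return order, mask
```

Averaged over many sampled orders, every position ends up conditioning on tokens from both its left and its right, which is how an autoregressive objective can capture bidirectional context.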

Visual question answering · Question answering · Multimodal · MoDELS · Automator ·

August 29, 2018

From VQA to Multimodal CQA: Adapting Visual QA Models for Community QA Tasks

Avikalp Srivastava,Hsin Wen Liu,Sumio Fujita

from arxiv, Submitted for review at AAAI 2019

In this work, we present novel methods to adapt visual QA models for community QA tasks of practical significance - automated question category classification and finding experts for question answering - on questions containing both text and image. To the best of our knowledge, this is the first work to tackle the multimodality challenge in CQA, and is an enabling step towards basic question-answering on image-based CQA. First, we analyze the differences between visual QA and community QA datasets, discussing the limitations of applying VQA models directly to CQA tasks, and then we propose novel augmentations to VQA-based models to best address those limitations. Our model, with the augmentations of an image-text combination method tailored for CQA and use of auxiliary tasks for learning better grounding features, significantly outperforms the text-only and VQA model baselines for both tasks on real-world CQA data from Yahoo! Chiebukuro, a Japanese counterpart of Yahoo! Answers.