嘘嘘中国免费观看网站_99久久国产精品综合久久国产_久久中文字幕亚洲精品视色_91午夜在线视频_97狠狠色丁香婷婷综合久久_日韩亚洲无码国产AV午夜精品_久久国内精品久久久久久久久

In this paper, we propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above. SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it. Finally, the outputs from the above modules are processed by a global-local attentional answering module to produce an answer splicing together tokens from both OCR and general vocabulary iteratively by following M4C. Our proposed model outperforms the SoTA models on TextVQA dataset and two tasks of ST-VQA dataset among all models except pre-training based TAP. Demonstrating strong reasoning ability, it also won first place in TextVQA Challenge 2020. We extensively test different OCR methods on several reasoning models and investigate the impact of gradually increased OCR performance on TextVQA benchmark. With better OCR results, different models share dramatic improvement over the VQA accuracy, but our model benefits most blessed by strong textual-visual reasoning ability. To grant our method an upper bound and make a fair testing base available for further works, we also provide human-annotated ground-truth OCR annotations for the TextVQA dataset, which were not given in the original release. The code and ground-truth OCR annotations for the TextVQA dataset are available at //github.com/ChenyuGAO-CS/SMA

相關內容

多(duo)峰值(zhi)

關注 2

視頻分類 · 數據集 · MVC · 自動問答 · 多模態學習 ·

2022 年 1 月 30 日

A Dataset for Medical Instructional Video Classification and Question Answering

Deepak Gupta,Kush Attal,Dina Demner-Fushman

This paper introduces a new challenge and datasets to foster research toward designing systems that can understand medical videos and provide visual answers to natural language questions. We believe medical videos may provide the best possible answers to many first aids, medical emergency, and medical education questions. Toward this, we created the MedVidCL and MedVidQA datasets and introduce the tasks of Medical Video Classification (MVC) and Medical Visual Answer Localization (MVAL), two tasks that focus on cross-modal (medical language and medical video) understanding. The proposed tasks and datasets have the potential to support the development of sophisticated downstream applications that can benefit the public and medical practitioners. Our datasets consist of 6,117 annotated videos for the MVC task and 3,010 annotated questions and answers timestamps from 899 videos for the MVAL task. These datasets have been verified and corrected by medical informatics experts. We have also benchmarked each task with the created MedVidCL and MedVidQA datasets and proposed the multimodal learning methods that set competitive baselines for future research.

多峰值 · 光學字符識別 · 變換 · MoDELS · OCR（光學字符識別） ·

2020 年 12 月 7 日

Confidence-aware Non-repetitive Multimodal Transformers for TextCaps

Zhaokai Wang,Renda Bao,Qi Wu,Si Liu

from arxiv, 9 pages; Accepted by AAAI 2021

When describing an image, reading text in the visual scene is crucial to understand the key information. Recent work explores the TextCaps task, \emph{i.e.} image captioning with reading Optical Character Recognition (OCR) tokens, which requires models to read text and cover them in generated captions. Existing approaches fail to generate accurate descriptions because of their (1) poor reading ability; (2) inability to choose the crucial words among all extracted OCR tokens; (3) repetition of words in predicted captions. To this end, we propose a Confidence-aware Non-repetitive Multimodal Transformers (CNMT) to tackle the above challenges. Our CNMT consists of a reading, a reasoning and a generation modules, in which Reading Module employs better OCR systems to enhance text reading ability and a confidence embedding to select the most noteworthy tokens. To address the issue of word redundancy in captions, our Generation Module includes a repetition mask to avoid predicting repeated word in captions. Our model outperforms state-of-the-art models on TextCaps dataset, improving from 81.0 to 93.0 in CIDEr. Our source code is publicly available.

自動問答 · Integration · 圖 · 知識圖譜 · MoDELS ·

2020 年 10 月 30 日

Dynamic Neuro-Symbolic Knowledge Graph Construction for Zero-shot Commonsense Question Answering

Antoine Bosselut,Ronan Le Bras,Yejin Choi

Understanding narratives requires reasoning about implicit world knowledge related to the causes, effects, and states of situations described in text. At the core of this challenge is how to access contextually relevant knowledge on demand and reason over it. In this paper, we present initial studies toward zero-shot commonsense question answering by formulating the task as inference over dynamically generated commonsense knowledge graphs. In contrast to previous studies for knowledge integration that rely on retrieval of existing knowledge from static knowledge graphs, our study requires commonsense knowledge integration where contextually relevant knowledge is often not present in existing knowledge bases. Therefore, we present a novel approach that generates contextually-relevant symbolic knowledge structures on demand using generative neural commonsense knowledge models. Empirical results on two datasets demonstrate the efficacy of our neuro-symbolic approach for dynamically constructing knowledge graphs for reasoning. Our approach achieves significant performance boosts over pretrained language models and vanilla knowledge models, all while providing interpretable reasoning paths for its predictions.

Performer · MoDELS · Networking · 歸納偏好 · state-of-the-art ·

2019 年 12 月 10 日

Neural Module Networks for Reasoning over Text

Nitish Gupta,Kevin Lin,Dan Roth,Sameer Singh,Matt Gardner

Answering compositional questions that require multiple steps of reasoning against text is challenging, especially when they involve discrete, symbolic operations. Neural module networks (NMNs) learn to parse such questions as executable programs composed of learnable modules, performing well on synthetic visual QA domains. However, we find that it is challenging to learn these models for non-synthetic questions on open-domain text, where a model needs to deal with the diversity of natural language and perform a broader range of reasoning. We extend NMNs by: (a) introducing modules that reason over a paragraph of text, performing symbolic reasoning (such as arithmetic, sorting, counting) over numbers and dates in a probabilistic and differentiable manner; and (b) proposing an unsupervised auxiliary loss to help extract arguments associated with the events in text. Additionally, we show that a limited amount of heuristically-obtained question program and intermediate module output supervision provides sufficient inductive bias for accurate learning. Our proposed model significantly outperforms state-of-the-art models on a subset of the DROP dataset that poses a variety of reasoning challenges that are covered by our modules.

視覺問答 · 圖 · 自動問答 · 學習器 · 學成 ·

2018 年 6 月 20 日

Learning Conditioned Graph Structures for Interpretable Visual Question Answering

Will Norcliffe-Brown,Efstathios Vafeias,Sarah Parisot

from arxiv, 11 pages, 6 figures, submitted to NIPS 2018

Visual Question answering is a challenging problem requiring a combination of concepts from Computer Vision and Natural Language Processing. Most existing approaches use a two streams strategy, computing image and question features that are consequently merged using a variety of techniques. Nonetheless, very few rely on higher level image representations, which allow to capture semantic and spatial relationships. In this paper, we propose a novel graph-based approach for Visual Question Answering. Our method combines a graph learner module, which learns a question specific graph representation of the input image, with the recent concept of graph convolutions, aiming to learn image representations that capture question specific interactions. We test our approach on the VQA v2 dataset using a simple baseline architecture enhanced by the proposed graph learner module. We obtain state of the art results with 65.77% accuracy and demonstrate the interpretability of the proposed method.

視覺問答 · entity · 注意力機制 · 多峰值 · 學成 ·

2018 年 5 月 24 日

R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering

Pan Lu,Lei Ji,Wei Zhang,Nan Duan,Ming Zhou,Jianyong Wang

from arxiv, 10 pages, 5 figures, accepted in SIGKDD 2018

Recently, Visual Question Answering (VQA) has emerged as one of the most significant tasks in multimodal learning as it requires understanding both visual and textual modalities. Existing methods mainly rely on extracting image and question features to learn their joint feature embedding via multimodal fusion or attention mechanism. Some recent studies utilize external VQA-independent models to detect candidate entities or attributes in images, which serve as semantic knowledge complementary to the VQA task. However, these candidate entities or attributes might be unrelated to the VQA task and have limited semantic capacities. To better utilize semantic knowledge in images, we propose a novel framework to learn visual relation facts for VQA. Specifically, we build up a Relation-VQA (R-VQA) dataset based on the Visual Genome dataset via a semantic similarity module, in which each data consists of an image, a corresponding question, a correct answer and a supporting relation fact. A well-defined relation detector is then adopted to predict visual question-related relation facts. We further propose a multi-step attention model composed of visual attention and semantic attention sequentially to extract related visual knowledge and semantic knowledge. We conduct comprehensive experiments on the two benchmark datasets, demonstrating that our model achieves state-of-the-art performance and verifying the benefit of considering visual relation facts.

視覺問答 · 注意力機制 · INFORMS · 自動問答 · Performer ·

2018 年 5 月 11 日

Reciprocal Attention Fusion for Visual Question Answering

Moshiur R Farazi,Salman Khan

Existing attention mechanisms either attend to local image grid or object level features for Visual Question Answering (VQA). Motivated by the observation that questions can relate to both object instances and their parts, we propose a novel attention mechanism that jointly considers reciprocal relationships between the two levels of visual details. The bottom-up attention thus generated is further coalesced with the top-down information to only focus on the scene elements that are most relevant to a given question. Our design hierarchically fuses multi-modal information i.e., language, object- and gird-level features, through an efficient tensor decomposition scheme. The proposed model improves the state-of-the-art single model performances from 67.9% to 68.2% on VQAv1 and from 65.3% to 67.4% on VQAv2, demonstrating a significant boost.

視覺問答 · 注意力機制 · Performer · state-of-the-art · MoDELS ·

2018 年 2 月 1 日

Dual Recurrent Attention Units for Visual Question Answering

Ahmed Osman,Wojciech Samek

from arxiv, 10 pages, 5 figures

We propose an architecture for VQA which utilizes recurrent layers to generate visual and textual attention. The memory characteristic of the proposed recurrent attention units offers a rich joint embedding of visual and textual features and enables the model to reason relations between several parts of the image and question. Our single model outperforms the first place winner on the VQA 1.0 dataset, performs within margin to the current state-of-the-art ensemble model. We also experiment with replacing attention mechanisms in other state-of-the-art models with our implementation and show increased accuracy. In both cases, our recurrent attention mechanism improves performance in tasks requiring sequential or relational reasoning on the VQA dataset.

視頻描述生成（Video Caption） · Neural Networks · Networking · 多峰值 · state-of-the-art ·

2018 年 1 月 29 日

Predicting Visual Features from Text for Image and Video Caption Retrieval

Jianfeng Dong,Xirong Li,Cees G. M. Snoek

from arxiv, 12 pages, 7 figures, under view of TMM

This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do so in a visual space exclusively. Apart from this conceptual novelty, we contribute \emph{Word2VisualVec}, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and further transferred into a deep visual feature of choice via a simple multi-layer perceptron. We further generalize Word2VisualVec for video caption retrieval, by predicting from text both 3-D convolutional neural network features as well as a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset and the very recent NIST TrecVid challenge for video caption retrieval detail Word2VisualVec's properties, its benefit over textual embeddings, the potential for multimodal query composition and its state-of-the-art results.

視覺問答 · Performer · INTERACT · 注意力機制 · 自動問答 ·

2018 年 1 月 24 日

Structured Triplet Learning with POS-tag Guided Attention for Visual Question Answering

Zhe Wang,Xiaoyi Liu,Liangjian Chen,Limin Wang,Yu Qiao,Xiaohui Xie,Charless Fowlkes

from arxiv, 8 pages, 5 figures, state-of-the-art VQA system; //github.com/wangzheallen/STL-VQA

Visual question answering (VQA) is of significant interest due to its potential to be a strong test of image understanding systems and to probe the connection between language and vision. Despite much recent progress, general VQA is far from a solved problem. In this paper, we focus on the VQA multiple-choice task, and provide some good practices for designing an effective VQA model that can capture language-vision interactions and perform joint reasoning. We explore mechanisms of incorporating part-of-speech (POS) tag guided attention, convolutional n-grams, triplet attention interactions between the image, question and candidate answer, and structured learning for triplets based on image-question pairs. We evaluate our models on two popular datasets: Visual7W and VQA Real Multiple Choice. Our final model achieves the state-of-the-art performance of 68.2% on Visual7W, and a very competitive performance of 69.6% on the test-standard split of VQA Real Multiple Choice.