Over the past few years, Meta AI has produced a series of research projects, each tackling an important challenge in multimodal perception: from addressing the shortage of publicly available training data (Hateful Memes), to creating a single algorithm for vision, speech, and text (Data2vec), to building foundation models that work across multiple tasks (FLAVA), to finding the right model parameters (Omnivore), among many others. Taken together, they point to a clear trend: in the near future, multimodal understanding will be essential to smarter AI systems.
In recent years, considerable effort has gone into applying neural models to natural-language generation. The challenge lies in generating natural, human-like text while keeping the generation process controllable. This paper presents a task-agnostic survey of neural text generation. Progress has come through a large body of work, which we group under four headings: data construction, neural frameworks, training and inference strategies, and evaluation metrics. Finally, future directions for neural text generation are discussed, including neural pipelines and the exploitation of background knowledge.
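To give a concrete flavor of the "inference strategies" heading, here is a minimal, generic sketch of top-k sampling, a common decoding strategy for neural text generation. This is not taken from the paper; the vocabulary size and logits are toy values.

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample a token id from the k highest-scoring logits only.

    Truncating the distribution to its top-k entries is a common way
    to trade off diversity against fluency at inference time.
    """
    logits = np.asarray(logits, dtype=float)
    top = np.argsort(logits)[-k:]            # indices of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                     # softmax over the top-k only
    return int(rng.choice(top, p=probs))

rng = np.random.default_rng(0)
logits = [2.0, 0.5, 1.5, -1.0, 0.0]          # toy vocabulary of 5 tokens
token = top_k_sample(logits, k=2, rng=rng)
assert token in (0, 2)                        # only the two best tokens can appear
```

With k equal to the vocabulary size this reduces to ordinary sampling; with k = 1 it reduces to greedy decoding.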
This paper presents a comprehensive survey of vision-language (VL) intelligence from a historical perspective. The survey is motivated by the remarkable progress in both computer vision and natural language processing, and by the recent shift from single-modality processing to multimodal understanding. We summarize the development of this field into three eras: task-specific methods, vision-language pre-training (VLP) methods, and large models trained on large-scale weakly labeled data. We first take some common VL tasks as examples to introduce task-specific approaches. We then focus on VLP methods and comprehensively review the key components of model architectures and training methods. After that, we show how recent work exploits large-scale raw image-text data to learn language-aligned visual representations that generalize better on zero-shot or few-shot learning tasks. Finally, we discuss some potential future directions in modality cooperation, unified representation, and knowledge integration. We believe this survey will be helpful to researchers and practitioners in AI and ML, especially those interested in computer vision and natural language processing.
【End-to-end Generative Pretraining for Multimodal Video Captioning】
● Abstract: Recent video-and-language pre-training frameworks lack the ability to generate sentences. We present Multimodal Video Generative Pre-training (MV-GPT), a new pre-training framework for learning from unlabeled videos that can be applied effectively to generative tasks such as multimodal video captioning. Unlike recent video-language pre-training frameworks, our framework trains a multimodal video encoder and a sentence decoder jointly. To overcome the lack of captions in unlabeled videos, we leverage future utterances as an additional text source and propose a bi-directional generation objective: we generate the future utterance given the current multimodal context, and the current utterance given the future observation. With this objective, we train an end-to-end encoder-decoder model to generate captions directly from raw pixels and transcribed speech. Our model achieves state-of-the-art performance on multimodal video captioning across four standard benchmarks, as well as on other video-understanding tasks such as VideoQA, video retrieval, and action classification.
● Paper link: https://arxiv.org/abs/2201.08264
● Affiliation: Google Research
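The bi-directional generation objective described in the abstract can be sketched as the sum of two cross-entropy losses, one per generation direction. The sketch below is a toy illustration with random stand-in logits, not MV-GPT's actual implementation; the `nll` helper, the vocabulary size, and all shapes are assumptions.

```python
import numpy as np

def nll(logits, target_ids):
    """Mean negative log-likelihood of target token ids under per-step logits."""
    logits = np.asarray(logits, dtype=float)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(logp[np.arange(len(target_ids)), target_ids])

# Stand-ins for decoder outputs: rows = decoding steps, cols = toy vocab of 4.
rng = np.random.default_rng(1)
logits_future_given_present = rng.normal(size=(3, 4))  # decode the future utterance
logits_present_given_future = rng.normal(size=(3, 4))  # decode the present utterance
future_ids = [1, 3, 0]
present_ids = [2, 2, 1]

# Bi-directional objective: sum the losses of both generation directions.
loss = (nll(logits_future_given_present, future_ids)
        + nll(logits_present_given_future, present_ids))
assert loss > 0.0
```

The point of the two-term loss is that every unlabeled clip yields supervision in both directions, so no human-written captions are needed during pre-training.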
Deep Learning (DL) is the most widely used tool in the contemporary field of computer vision. Its ability to accurately solve complex problems is employed in vision research to learn deep neural models for a variety of tasks, including security-critical applications. However, it is now known that DL is vulnerable to adversarial attacks that can manipulate its predictions by introducing visually imperceptible perturbations in images and videos. Since the discovery of this phenomenon in 2013 [1], it has attracted significant attention from researchers in multiple sub-fields of machine intelligence. In [2], we reviewed the contributions made by the computer vision community in adversarial attacks on deep learning (and their defenses) up to 2018. Many of those contributions have inspired new directions in this area, which has matured significantly since the first-generation methods. Hence, as a sequel to [2], this literature review focuses on the advances in this area since 2018. To ensure authenticity, we mainly consider peer-reviewed contributions published in prestigious venues of computer vision and machine learning research. Besides a comprehensive literature review, the article also provides concise definitions of technical terminology for non-experts in this domain. Finally, this article discusses challenges and the future outlook of this direction based on the literature reviewed herein and in [2].
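To make the notion of a visually imperceptible adversarial perturbation concrete, here is a minimal FGSM-style sketch (the fast gradient sign method is one of the first-generation attacks this literature covers) applied to a toy logistic-regression classifier. The model, weights, and inputs are illustrative assumptions, not anything from the review.

```python
import numpy as np

# Toy linear classifier: p(y=1 | x) = sigmoid(w.x + b)
w = np.array([1.0, -2.0, 0.5])
b = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, y):
    # Binary cross-entropy for a single example.
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm(x, y, eps):
    # For this model the input gradient of the BCE loss is (p - y) * w.
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w
    # Step in the direction that increases the loss, bounded per-pixel by eps.
    return x + eps * np.sign(grad_x)

x = np.array([0.5, -0.5, 1.0])   # clean input, true label y = 1
y = 1.0
x_adv = fgsm(x, y, eps=0.3)

# The perturbation is bounded (max-norm eps) yet increases the loss.
assert loss(x_adv, y) > loss(x, y)
```

On real image classifiers the same one-step sign update, with eps small relative to the pixel range, is what produces perturbations invisible to a human but damaging to the model.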
Deep learning has enabled a wide range of applications and has become increasingly popular in recent years. The goal of multimodal deep learning is to create models that can process and link information from various modalities. Despite the extensive development of unimodal learning, it still cannot cover all aspects of human learning. Multimodal learning helps models understand and analyze better when various senses are engaged in processing information. This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals. A detailed analysis of past and current baseline approaches and an in-depth study of recent advancements in multimodal deep learning applications are provided. A fine-grained taxonomy of various multimodal deep learning applications is proposed, elaborating on different applications in more depth. The architectures and datasets used in these applications are also discussed, along with their evaluation metrics. Finally, the main issues of each domain are highlighted separately, along with possible directions for future research.
Paper title: Recent Advances in Deep Learning for Object Detection
Abstract: Object detection is a fundamental visual recognition problem in computer vision and has been widely studied over the past decades. Object detection aims to locate specific objects with precise bounding boxes in a given image and to assign each object a corresponding class label. Owing to the tremendous success of deep-learning-based image classification, object detection techniques using deep learning have been actively studied in recent years. In this paper, we give a comprehensive survey of recent advances in visual object detection with deep learning. By reviewing a large body of recent related work, we systematically analyze existing object detection frameworks and organize the survey into three major parts: (i) detection components, (ii) learning strategies, and (iii) applications and benchmarks. In the survey, we cover in detail a variety of factors that affect detection performance, such as detector architectures, feature learning, proposal generation, and sampling strategies. Finally, we discuss several future directions to facilitate and spur future research on visual object detection with deep learning.
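One building block behind "precise localization" in the abstract above is how a predicted box is matched against a ground-truth box. Here is a minimal sketch of intersection-over-union (IoU), the standard localization criterion; the box coordinates are illustrative.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection is typically counted as correct when its IoU with a
# ground-truth box exceeds a threshold (0.5 is the classic PASCAL VOC
# criterion; COCO averages over thresholds from 0.5 to 0.95).
pred = (10, 10, 50, 50)
gt = (20, 20, 60, 60)
assert abs(iou(pred, gt) - 9 / 23) < 1e-9   # 900 overlap / 2300 union
```

The same function underlies proposal matching, sampling strategies, and non-maximum suppression, which is why it recurs throughout the detection components the survey organizes.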
Meta-reinforcement learning algorithms can enable robots to acquire new skills much more quickly, by leveraging prior experience to learn how to learn. However, much of the current research on meta-reinforcement learning focuses on task distributions that are very narrow. For example, a commonly used meta-reinforcement learning benchmark uses different running velocities for a simulated robot as different tasks. When policies are meta-trained on such narrow task distributions, they cannot possibly generalize to acquire entirely new tasks more quickly. Therefore, if the aim of these methods is to enable faster acquisition of entirely new behaviors, we must evaluate them on task distributions that are sufficiently broad to enable generalization to new behaviors. In this paper, we propose an open-source simulated benchmark for meta-reinforcement learning and multi-task learning consisting of 50 distinct robotic manipulation tasks. Our aim is to make it possible to develop algorithms that generalize to accelerate the acquisition of entirely new, held-out tasks. We evaluate 6 state-of-the-art meta-reinforcement learning and multi-task learning algorithms on these tasks. Surprisingly, while each task and its variations (e.g., with different object positions) can be learned with reasonable success, these algorithms struggle to learn multiple tasks at the same time, even with as few as ten distinct training tasks. Our analysis and open-source environments pave the way for future research in multi-task learning and meta-learning that can enable meaningful generalization, thereby unlocking the full potential of these methods.