Title: MULTI-TASK SELF-SUPERVISED LEARNING FOR ROBUST SPEECH RECOGNITION
Abstract: Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabeled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), which combines a convolutional encoder with multiple neural networks, called workers, tasked with solving self-supervised problems that require no manually annotated ground truth. PASE was shown to capture relevant speech information, including speaker voice-print and phonemes. This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments. To this end, we employ an online speech distortion module that contaminates the input signal with a variety of random disturbances. We then propose a revised encoder that better learns short- and long-term speech dynamics through an effective combination of recurrent and convolutional networks. Finally, we refine the workers used for self-supervision to encourage better cooperation.
Results on TIMIT, DIRHA, and CHiME-5 show that PASE+ significantly outperforms the previous version of PASE as well as common acoustic features. Interestingly, PASE+ learns transferable features that remain suitable under highly mismatched acoustic conditions.
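For intuition, here is a minimal sketch of what such an online distortion module could look like, assuming additive noise and reverberation are among the random disturbances (the abstract names noisy and reverberant conditions). The SNR range, impulse-response shape, and all names are illustrative, not PASE+'s actual implementation.

```python
import torch
import torch.nn.functional as F

def distort(wave, noise_bank, sample_rate=16000):
    """Minimal sketch of an online distortion module: randomly contaminate a
    mono waveform `wave` of shape (1, T) with additive noise and/or synthetic
    reverberation. `noise_bank` is an (N, T_noise) tensor of noise clips with
    T_noise >= T (assumed). PASE+'s actual disturbance set may differ."""
    T = wave.size(1)
    if torch.rand(1) < 0.5:                        # additive noise at random SNR
        noise = noise_bank[torch.randint(len(noise_bank), (1,))][:, :T]
        snr_db = 10 * torch.rand(1)                # assumed 0-10 dB range
        gain = wave.norm() / (noise.norm() * 10 ** (snr_db / 20) + 1e-8)
        wave = wave + gain * noise
    if torch.rand(1) < 0.5:                        # reverberation via a random
        ir = torch.randn(1, 1, sample_rate // 10)  # exponentially decaying IR
        ir *= torch.exp(-torch.linspace(0, 8, ir.size(-1)))
        wave = F.conv1d(wave.unsqueeze(0), ir, padding=ir.size(-1) // 2)[0][:, :T]
    return wave
```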
Self-supervised VO methods have achieved great success at jointly estimating camera pose and depth from videos. However, like most data-driven methods, existing VO networks suffer a significant performance drop when facing scenes that differ from the training data, making them unsuitable for practical applications. In this paper, we propose an online meta-learning algorithm that enables a VO network to continuously adapt to new environments in a self-supervised manner. The approach uses a convolutional long short-term memory (convLSTM) network to aggregate rich spatio-temporal information from the past, so that the network can memorize and learn from past experience in order to better estimate the current frame and adapt to it quickly. To cope with changing environments when running VO in the open world, we further propose an online feature alignment method that aligns feature distributions across different moments in time. Our VO network can thus adapt seamlessly to different environments. Extensive experiments on unseen outdoor scenes, virtual-to-real-world transfer, and outdoor-to-indoor transfer demonstrate that our method consistently outperforms state-of-the-art self-supervised VO baselines.
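The online feature alignment step can be sketched as keeping running statistics of past features and penalizing the current frame's drift away from them; the paper's exact alignment criterion may differ, and all names here are illustrative.

```python
import torch

class OnlineFeatureAlignment:
    """Minimal sketch of aligning feature distributions over time: keep running
    statistics of past features and penalize the drift of the current batch's
    statistics away from them. The paper's actual criterion may differ."""
    def __init__(self, dim, momentum=0.99):
        self.mean = torch.zeros(dim)
        self.var = torch.ones(dim)
        self.momentum = momentum

    def loss(self, feats):                        # feats: (batch, dim)
        cur_mean, cur_var = feats.mean(0), feats.var(0)
        align = ((cur_mean - self.mean) ** 2 + (cur_var - self.var) ** 2).sum()
        with torch.no_grad():                     # update the running statistics
            self.mean = self.momentum * self.mean + (1 - self.momentum) * cur_mean
            self.var = self.momentum * self.var + (1 - self.momentum) * cur_var
        return align
```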
Title: Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
Abstract: Learning representations without supervision remains an open problem in machine learning, and it is particularly challenging for speech signals, which are often characterized by long sequences and a complex hierarchical structure. Some recent work, however, has shown that useful speech representations can be obtained with a self-supervised encoder-discriminator approach. This paper proposes an improved self-supervised method in which a single neural encoder is followed by multiple workers that jointly solve different self-supervised tasks. The consensus required across the different tasks naturally imposes meaningful constraints on the encoder, helping it discover general representations and minimizing the risk of learning superficial ones. Experiments show that the proposed approach can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues. In addition, a number of design choices make the encoder easily exportable, facilitating its direct use or adaptation to different problems.
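To make the encoder/worker setup concrete, here is a minimal sketch of one shared encoder followed by several workers whose self-supervised losses are summed; the layer sizes and the two worker targets are illustrative stand-ins, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

# One shared encoder; each worker is a small head solving its own
# self-supervised regression task on the encoded frames.
encoder = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
)
workers = nn.ModuleDict({
    "waveform": nn.Conv1d(64, 5, 1),   # e.g. regress 5 waveform samples per frame (assumed)
    "log_power": nn.Conv1d(64, 1, 1),  # e.g. regress per-frame log energy (assumed)
})

def total_loss(wave, targets):
    """wave: (batch, 1, T); targets: dict of per-task tensors whose shapes
    match each worker's output. The joint loss is simply the sum."""
    z = encoder(wave)
    return sum(nn.functional.l1_loss(head(z), targets[name])
               for name, head in workers.items())
```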
Self-supervised learning is a new paradigm that sits between unsupervised and supervised learning, aiming to reduce the demanding need for large amounts of annotated data. It provides proxy supervision signals for feature learning by defining annotation-free pretext tasks. jason718 maintains a curated collection of the latest papers on self-supervised learning that is well worth a look! (A minimal sketch of one classic pretext task follows the link below.)
Address: //github.com/jason718/awesome-self-supervised-learning
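As promised, here is a minimal sketch of one classic pretext task, rotation prediction (from "Unsupervised Representation Learning by Predicting Image Rotations", listed below): the rotation applied to an unlabeled image serves as a free label, giving exactly the kind of proxy supervision signal described above. The backbone here is a placeholder.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(                        # placeholder image encoder
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
rotation_head = nn.Linear(64, 4)                 # classify one of 4 rotations

def rotation_batch(images):
    """Rotate each image by 0/90/180/270 degrees; the rotation id is the label."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                           for img, k in zip(images, labels)])
    return rotated, labels

images = torch.randn(8, 3, 32, 32)               # stand-in for unlabeled images
x, y = rotation_batch(images)
loss = nn.functional.cross_entropy(rotation_head(backbone(x)), y)
loss.backward()                                  # proxy signal trains the encoder
```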
A curated list of awesome Self-Supervised Learning resources.
Self-Supervised Learning has become an exciting direction in the AI community.
Please help contribute to this list by submitting a pull request using the Markdown format below.
Markdown format:
- Paper Name.
[[pdf]](link)
[[code]](link)
- Author 1, Author 2, and Author 3. *Conference Year*
FAIR Self-Supervision Benchmark: various benchmark (and legacy) tasks for evaluating the quality of visual representations learned by various self-supervision approaches.
Unsupervised Visual Representation Learning by Context Prediction.
Unsupervised Learning of Visual Representations using Videos.
Learning to See by Moving.
Learning image representations tied to ego-motion.
Joint Unsupervised Learning of Deep Representations and Image Clusters.
Unsupervised Deep Embedding for Clustering Analysis.
Slow and steady feature analysis: higher order temporal coherence in video.
Context Encoders: Feature Learning by Inpainting.
Colorful Image Colorization.
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles.
Ambient Sound Provides Supervision for Visual Learning.
Learning Representations for Automatic Colorization.
Unsupervised Visual Representation Learning by Graph-based Consistent Constraints.
Adversarial Feature Learning.
Self-supervised learning of visual features through embedding images into text topic spaces.
Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction.
Learning Features by Watching Objects Move.
Colorization as a Proxy Task for Visual Understanding.
DeepPermNet: Visual Permutation Learning.
Unsupervised Learning by Predicting Noise.
Multi-task Self-Supervised Visual Learning.
Representation Learning by Learning to Count.
Transitive Invariance for Self-supervised Visual Representation Learning.
Look, Listen and Learn.
Unsupervised Representation Learning by Sorting Sequences.
Unsupervised Feature Learning via Non-Parametric Instance Discrimination.
Learning Image Representations by Completing Damaged Jigsaw Puzzles.
Unsupervised Representation Learning by Predicting Image Rotations.
Learning Latent Representations in Neural Networks for Clustering through Pseudo Supervision and Graph-based Activity Regularization.
Improvements to context based self-supervised learning.
Self-Supervised Feature Learning by Learning to Spot Artifacts.
Boosting Self-Supervised Learning via Knowledge Transfer.
Cross-domain Self-supervised Multi-task Feature Learning Using Synthetic Imagery.
ShapeCodes: Self-Supervised Feature Learning by Lifting Views to Viewgrids.
Deep Clustering for Unsupervised Learning of Visual Features
Cross Pixel Optical-Flow Similarity for Self-Supervised Learning.
Representation Learning with Contrastive Predictive Coding.
Self-Supervised Learning via Conditional Motion Propagation.
Self-Supervised Representation Learning by Rotation Feature Decoupling.
Revisiting Self-Supervised Visual Representation Learning.
AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data.
Unsupervised Deep Learning by Neighbourhood Discovery.
Contrastive Multiview Coding.
Large Scale Adversarial Representation Learning.
Learning Representations by Maximizing Mutual Information Across Views.
Selfie: Self-supervised Pretraining for Image Embedding.
Data-Efficient Image Recognition with Contrastive Predictive Coding
Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty
Boosting Few-Shot Visual Learning with Self-Supervision
Self-Supervised Generalisation with Meta Auxiliary Learning
Wasserstein Dependency Measure for Representation Learning
Scaling and Benchmarking Self-Supervised Visual Representation Learning
A critical analysis of self-supervision, or what we can learn from a single image
On Mutual Information Maximization for Representation Learning
Understanding the Limitations of Variational Mutual Information Estimators
Automatic Shortcut Removal for Self-Supervised Representation Learning
Momentum Contrast for Unsupervised Visual Representation Learning
A Simple Framework for Contrastive Learning of Visual Representations
ClusterFit: Improving Generalization of Visual Representations
Self-Supervised Learning of Pretext-Invariant Representations
Unsupervised Learning of Video Representations using LSTMs.
Shuffle and Learn: Unsupervised Learning using Temporal Order Verification.
LSTM Self-Supervision for Detailed Behavior Analysis
Self-Supervised Video Representation Learning With Odd-One-Out Networks.
Unsupervised Learning of Long-Term Motion Dynamics for Videos.
Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning.
Improving Spatiotemporal Self-Supervision by Deep Reinforcement Learning.
Self-supervised learning of a facial attribute embedding from video.
Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles.
Self-Supervised Spatio-Temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics.
DynamoNet: Dynamic Action and Motion Network.
Learning Correspondence from the Cycle-consistency of Time.
Joint-task Self-supervised Learning for Temporal Correspondence.
Self-supervised Learning of Motion Capture.
Unsupervised Learning of Depth and Ego-Motion from Video.
Active Stereo Net: End-to-End Self-Supervised Learning for Active Stereo Systems.
Self-Supervised Relative Depth Learning for Urban Scene Understanding.
Geometry-Aware Learning of Maps for Camera Localization.
Self-supervised Learning of Geometrically Stable Features Through Probabilistic Introspection.
Self-Supervised Learning of 3D Human Pose Using Multi-View Geometry.
SelFlow: Self-Supervised Learning of Optical Flow.
Unsupervised Learning of Landmarks by Descriptor Vector Exchange.
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features.
Objects that Sound.
Learning to Separate Object Sounds by Watching Unlabeled Video.
The Sound of Pixels.
Learnable PINs: Cross-Modal Embeddings for Person Identity.
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization.
Self-Supervised Generation of Spatial Audio for 360° Video.
TriCycle: Audio Representation Learning from Sensor Network Data Using Self-Supervision
Self-taught Learning: Transfer Learning from Unlabeled Data.
Representation Learning: A Review and New Perspectives.
Curiosity-driven Exploration by Self-supervised Prediction.
Large-Scale Study of Curiosity-Driven Learning.
Playing hard exploration games by watching YouTube.
Unsupervised State Representation Learning in Atari.
Improving Robot Navigation Through Self-Supervised Online Learning
Reverse Optical Flow for Self-Supervised Adaptive Autonomous Robot Navigation
Online self-supervised learning for dynamic object segmentation
Self-Supervised Online Learning of Basic Object Push Affordances
Self-supervised learning of grasp dependent tool affordances on the iCub Humanoid robot
Persistent self-supervised learning principle: from stereo to monocular vision for obstacle avoidance
The Curious Robot: Learning Visual Representations via Physical Interactions.
Learning to Poke by Poking: Experiential Learning of Intuitive Physics.
Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours.
Supervision via Competition: Robot Adversaries for Learning Tasks.
Multi-view Self-supervised Deep Learning for 6D Pose Estimation in the Amazon Picking Challenge.
Combining Self-Supervised Learning and Imitation for Vision-Based Rope Manipulation.
Learning to Fly by Crashing
Self-supervised learning as an enabling technology for future space exploration robots: ISS experiments on monocular distance learning
Unsupervised Perceptual Rewards for Imitation Learning.
Self-Supervised Visual Planning with Temporal Skip Connections.
CASSL: Curriculum Accelerated Self-Supervised Learning.
Time-Contrastive Networks: Self-Supervised Learning from Video.
Self-Supervised Deep Reinforcement Learning with Generalized Computation Graphs for Robot Navigation.
Learning Actionable Representations from Visual Observations.
Learning Synergies between Pushing and Grasping with Self-supervised Deep Reinforcement Learning.
Visual Reinforcement Learning with Imagined Goals.
Grasp2Vec: Learning Object Representations from Self-Supervised Grasping.
Robustness via Retrying: Closed-Loop Robotic Manipulation with Self-Supervised Learning.
Learning Long-Range Perception Using Self-Supervision from Short-Range Sensors and Odometry.
Learning Latent Plans from Play.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
Self-Supervised Dialogue Learning
Self-Supervised Learning for Contextualized Extractive Summarization
A Mutual Information Maximization Perspective of Language Representation Learning
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Learning Robust and Multilingual Speech Representations
Unsupervised pretraining transfers well across languages
wav2vec: Unsupervised Pre-Training for Speech Recognition
vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
Effectiveness of self-supervised pre-training for speech recognition
Towards Unsupervised Speech Recognition and Synthesis with Quantized Speech Representation Learning
Self-Training for End-to-End Speech Recognition
Generative Pre-Training for Speech with Autoregressive Predictive Coding
Title: Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
Abstract: Large-scale labeled data are usually required to train deep neural networks that achieve good visual feature learning performance from images or videos in computer vision applications. To avoid the heavy cost of collecting and annotating large-scale datasets, self-supervised learning methods, a subset of unsupervised learning methods, have been proposed to learn general image and video features from large-scale unlabeled data without any human-annotated labels. This paper provides an extensive review of deep-learning-based self-supervised general visual feature learning methods. First, the motivation, general pipeline, and terminology of this field are described. The common deep neural network architectures used for self-supervised learning are then summarized. Next, the schema and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used image and video datasets and the existing self-supervised visual feature learning methods. Quantitative performance comparisons of the reviewed methods on benchmark datasets are then summarized and discussed for both image and video feature learning. Finally, the paper concludes with a set of promising future directions for self-supervised visual feature learning.
Title: Self-supervised learning for audio-visual speaker diarization
Abstract: Speaker diarization is a technique for finding the speech segments of a specific speaker, and it is widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method that addresses the speaker diarization problem without massive labeling effort. We improve on previous approaches by introducing two new loss functions: a dynamic triplet loss and a multinomial loss. We evaluate the method on a real-world human-computer interaction system, and the results show that our best model yields a remarkable gain of +8% in F1 score and also reduces the diarization error rate. Finally, we introduce a new large-scale audio-video corpus to fill the gap in Chinese audio-video datasets.
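The paper's dynamic triplet loss is not spelled out in this digest; below is a minimal sketch of an ordinary triplet loss for audio-visual synchronization, with in-sync audio as the positive and time-shifted audio as the negative. How the "dynamic" variant adapts the choice of negatives or margins is not reproduced here.

```python
import torch
import torch.nn.functional as F

def av_triplet_loss(audio_emb, video_emb, margin=0.2):
    """audio_emb, video_emb: (batch, dim) embeddings of aligned audio-video
    pairs. Each video embedding's positive is its in-sync audio embedding;
    a batch-rolled (out-of-sync) audio embedding serves as the negative."""
    positive = audio_emb
    negative = audio_emb.roll(shifts=1, dims=0)  # misaligned audio as negatives
    d_pos = F.pairwise_distance(video_emb, positive)
    d_neg = F.pairwise_distance(video_emb, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```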
Title: Adversarial-Learned Loss for Domain Adaptation
Abstract: In recent years, remarkable progress has been made in learning transferable representations across domains. Previous work on domain adaptation is mainly based on two techniques: domain-adversarial learning and self-training. However, domain-adversarial learning only aligns the feature distributions between domains, without considering whether the target features are discriminative. Self-training, on the other hand, exploits the model's predictions to enhance the discrimination of target features, but it cannot explicitly align the domain distributions. To combine the strengths of both methods, we propose a novel Adversarial-Learned Loss for Domain Adaptation (ALDA), starting from an analysis of pseudo-labeling, a typical self-training method. There is, however, a gap between the pseudo-labels and the ground truth, which can lead to incorrect training. We therefore introduce a confusion matrix, learned adversarially in ALDA, to reduce this gap and align the feature distributions. Finally, a new loss function is automatically constructed from the learned confusion matrix and serves as the loss for the unlabeled target samples. Our ALDA outperforms state-of-the-art approaches on four standard domain adaptation datasets.
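To illustrate the idea of a loss constructed from a learned confusion matrix, here is a minimal sketch in which noisy pseudo-labels are corrected by a row-stochastic confusion matrix before a soft cross-entropy is applied. The adversarial training of the matrix itself, which is the core of ALDA, is omitted, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def corrected_target_loss(logits, confusion, num_classes):
    """Sketch of a target-sample loss built from a learned confusion matrix.
    logits: (batch, C) classifier outputs on unlabeled target samples;
    confusion: (C, C) row-stochastic matrix (assumed already learned).
    The pseudo-label one-hot is multiplied by the confusion matrix to
    correct likely label noise, then used as a soft cross-entropy target."""
    pseudo = F.one_hot(logits.argmax(1), num_classes).float()  # noisy pseudo-labels
    corrected = pseudo @ confusion                   # corrected label distribution
    log_prob = F.log_softmax(logits, dim=1)
    return -(corrected * log_prob).sum(1).mean()     # soft cross-entropy
```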
Author bio: Haifeng Liu, Ph.D., is an associate professor in the College of Computer Science at Zhejiang University. Homepage: //person.zju.edu.cn/en/hfliu
Title: Unsupervised pre-training for sequence to sequence speech recognition
Abstract: This paper proposes a novel approach to pre-training an encoder-decoder sequence-to-sequence (seq2seq) model. Our pre-training method consists of two stages: acoustic pre-training and linguistic pre-training. In the acoustic pre-training stage, we use a large amount of speech to pre-train the encoder by predicting masked speech feature chunks from their context. In the linguistic pre-training stage, we generate synthesized speech from a large amount of text using a single-speaker text-to-speech (TTS) system and pre-train the decoder on the synthesized paired data. This two-stage pre-training approach integrates rich acoustic and linguistic knowledge into the seq2seq model, which benefits the downstream automatic speech recognition (ASR) task. The unsupervised pre-training is done on AISHELL-2, and we apply the pre-trained model to AISHELL-1 and HKUST with various ratios of paired data. Our relative error rate reductions range from 38.24% to 7.88% on AISHELL-1 and from 12.00% to 1.20% on HKUST. We also apply the pre-trained model to a cross-lingual case with the CALLHOME dataset; for all six languages in CALLHOME, our pre-training method makes the model consistently outperform the baseline.
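The acoustic pre-training stage lends itself to a short sketch: mask random blocks of the input feature sequence and train the encoder to reconstruct the masked frames from their context. The block lengths, counts, and the L1 criterion below are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

def masked_feature_loss(encoder, feats, mask_len=8, n_masks=2):
    """Sketch of masked-chunk acoustic pre-training. feats: (batch, time, dim)
    speech features with time > mask_len; encoder must map (batch, time, dim)
    to (batch, time, dim). Random blocks are zeroed out and the encoder is
    trained to predict the original frames at the masked positions."""
    masked = feats.clone()
    mask = torch.zeros(feats.shape[:2], dtype=torch.bool)
    for _ in range(n_masks):
        start = torch.randint(0, feats.size(1) - mask_len, (1,)).item()
        masked[:, start:start + mask_len] = 0.0
        mask[:, start:start + mask_len] = True
    pred = encoder(masked)
    return nn.functional.l1_loss(pred[mask], feats[mask])
```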
Author: Bo Xu, professor, graduated from Zhejiang University in 1988 and is currently the director of the Institute of Automation, Chinese Academy of Sciences. His research areas include multilingual speech recognition and machine translation, intelligent processing of multimedia network content, and interactive immersive 3D internet.
Title: Adversarial Cross-Domain Action Recognition with Co-Attention
Abstract: Action recognition has been a widely studied topic, with a focus on supervised learning over a sufficient number of videos. However, the problem of cross-domain action recognition, where training and test videos are drawn from different underlying distributions, remains largely under-explored. Previous methods directly apply techniques for cross-domain image recognition and easily suffer from a severe temporal misalignment problem. We propose a Temporal Co-attention Network (TCoN), which matches the distributions of temporally aligned action features between the source and target domains using a novel cross-domain co-attention mechanism. Experimental results on three cross-domain action recognition datasets show that TCoN significantly improves over both previous single-domain and cross-domain methods in the cross-domain setting.
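As a rough illustration of co-attention over time across domains, here is a minimal sketch in which each target-video segment attends over source-video segments, so that temporally relevant source segments dominate the alignment. TCoN's actual mechanism, and the adversarial matching built on top of it, is more involved.

```python
import torch
import torch.nn.functional as F

def temporal_co_attention(src_feats, tgt_feats):
    """src_feats: (T_s, dim) per-segment features of a source video;
    tgt_feats: (T_t, dim) per-segment features of a target video.
    Returns (T_t, dim) source features re-weighted toward the segments
    most relevant to each target segment (scaled dot-product attention)."""
    scores = tgt_feats @ src_feats.t() / src_feats.size(1) ** 0.5  # (T_t, T_s)
    attn = F.softmax(scores, dim=1)
    return attn @ src_feats
```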
Author bios: Boxiao Pan is a master's student in the Stanford Vision and Learning Lab. He is fascinated by building intelligent systems that can interpret and understand human-centered behaviors, scenes, and events, especially from video input. //cs.stanford.edu/~bxpan/
Zhangjie Cao is a Ph.D. student in the Computer Science Department at Stanford University.
Abstract: Most existing approaches to disfluency detection rely heavily on human-annotated data, which is expensive to obtain in practice. To tackle the training-data bottleneck, we investigate methods for combining multiple self-supervised tasks, in which the supervised data can be collected without human labeling. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled news data, and we propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noise words, and (ii) sentence classification to distinguish original sentences from grammatically incorrect ones. We then combine the two tasks to jointly train a single network. The pre-trained network is subsequently fine-tuned on human-annotated disfluency detection training data. Experimental results on the commonly used English Switchboard test set show that, using less than 1% (1,000 sentences) of the training data, our approach achieves performance competitive with previous systems trained on the full dataset. Trained on the full dataset, our method significantly outperforms previous methods, reducing the error rate by 21% on English Switchboard.
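The pseudo-data construction is concrete enough to sketch: randomly insert noise words (drawn here from the sentence itself, an assumption) and randomly delete words, with inserted words tagged 1 for the tagging task, original words tagged 0, and any edited sentence serving as a "grammatically incorrect" example for the classification task.

```python
import random

def make_pseudo_example(words, p_add=0.1, p_del=0.1):
    """Sketch of pseudo training data for disfluency detection: insert noise
    words (sampled from the sentence itself here, an assumption) and delete
    words at random. Inserted words get tag 1 ("added noise"), kept original
    words get tag 0, so the tagging task needs no human labels; deletions
    yield grammatically broken sentences for the classification task."""
    out, tags = [], []
    for w in words:
        if random.random() < p_add:          # insert a random noise word
            out.append(random.choice(words))
            tags.append(1)
        if random.random() >= p_del:         # keep the original word (or drop it)
            out.append(w)
            tags.append(0)
    return out, tags

sentence = "the market rallied after the announcement".split()
noisy_words, noise_tags = make_pseudo_example(sentence)
```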
Paper title: A Divergence Minimization Perspective on Imitation Learning Methods
Paper abstract: In many settings, it is desirable to learn decision-making and control policies by learning from, or bootstrapping off, expert demonstrations. The most common approaches under this imitation learning (IL) framework are behavioral cloning (BC) and inverse reinforcement learning (IRL). Recent IRL methods have demonstrated the ability to learn effective policies from a very limited set of demonstrations, a setting in which BC methods often fail. Unfortunately, directly comparing these methods does not provide adequate intuition for understanding this difference in performance, owing to the many factors of variation involved. In this work, we present a unified probabilistic perspective on IL algorithms based on divergence minimization. We present f-MAX, a generalization of AIRL, a state-of-the-art IRL method. f-MAX enables us to relate previous IRL methods such as GAIL and AIRL and to understand their algorithmic properties. Through the lens of divergence minimization, we can tease apart the differences between BC and successful IRL methods and empirically evaluate these nuances on simulated high-dimensional continuous control domains. Our findings conclusively identify IRL's state-marginal matching objective as the largest contributor to its superior performance. Lastly, we apply our new understanding of IL methods to the problem of state-marginal matching, where we demonstrate that, in a simulated arm-pushing environment, we can teach agents a diverse range of behaviors using simple hand-specified state distributions, with no reward functions or expert demonstrations.
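In symbols, the unifying view can be written as matching state-action occupancy measures under an f-divergence (standard notation, not copied from the paper):

```latex
% Unified imitation-learning objective: make the policy's state-action
% occupancy measure \rho^{\pi} match the expert's \rho^{\mathrm{exp}}
% under some f-divergence D_f.
\min_{\pi} \; D_f\!\left(\rho^{\mathrm{exp}}(s, a) \,\middle\|\, \rho^{\pi}(s, a)\right)
```

Different choices of f recover different IL methods; GAIL, for instance, corresponds to a Jensen-Shannon-style divergence between the expert's and the policy's occupancies.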
Paper author: Richard Zemel is a co-founder and the research director of the Vector Institute for Artificial Intelligence, an industrial research chair in machine learning at the University of Toronto, and a senior fellow of the Canadian Institute for Advanced Research. His research interests include generative models of images and text, graph-based machine learning, learning from few examples, words and pictures, and fairness.
GitHub link: //github.com/KamyarGh/rl_swiss/blob/master/reproducing/fmax_paper.md