Title: Wavesplit: End-to-End Speech Separation by Speaker Clustering
Abstract:
We introduce Wavesplit, an end-to-end system for speech separation. From a single recording of mixed speech, the model infers and clusters representations of each speaker, and then estimates each source signal conditioned on the inferred representations. The model is trained end-to-end on the raw waveform to perform both tasks jointly. By deriving a set of speaker representations through clustering, it addresses the fundamental permutation problem of speech separation. Moreover, compared with previous approaches, sequence-wide speaker representations provide more robust separation of long, challenging sequences. We show that Wavesplit outperforms the previous state of the art on mixtures of 2 or 3 speakers (WSJ0-2mix, WSJ0-3mix), as well as under noisy (WHAM!) and reverberant (WHAMR!) conditions. Furthermore, we improve our model further by introducing online data augmentation.
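The permutation problem mentioned above arises because the ordering of estimated sources relative to reference speakers is arbitrary. The sketch below shows the generic permutation-invariant objective commonly used for this (Wavesplit's actual clustering-based loss differs; this is only an illustrative baseline):

```python
import itertools
import numpy as np

def permutation_invariant_loss(estimates, targets):
    """Minimum mean-squared error over all assignments of
    estimated sources to reference speakers (generic PIT loss,
    not Wavesplit's clustering objective)."""
    n = len(targets)
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        err = np.mean([np.mean((estimates[perm[i]] - targets[i]) ** 2)
                       for i in range(n)])
        best = min(best, err)
    return best

# The loss is zero even when the estimates come back in swapped order.
targets = [np.ones(4), np.zeros(4)]
estimates = [np.zeros(4), np.ones(4)]  # same sources, swapped
```

Because the loss minimizes over all speaker orderings, the model is never penalized for emitting correct sources in a different order than the references.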
Title: Self-supervised learning for audio-visual speaker diarization
Abstract:
Speaker diarization is a technique for locating the speech segments of a specific speaker, and it is widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-visual synchronization learning method to address speaker diarization without massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We evaluate on a real-world human-computer interaction system, and the results show that our best model achieves a remarkable gain of +8% in F1 score, along with a reduction in diarization error rate. Finally, we introduce a new large-scale audio-visual corpus to fill the gap in Chinese audio-visual datasets.
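The abstract's dynamic triplet loss builds on the standard triplet margin loss over embeddings. A minimal sketch of that standard form (the paper's "dynamic" variant adapts it further; the margin value here is an assumption for illustration):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: pull the anchor embedding
    toward the positive (same speaker) and push it away from the
    negative (different speaker) by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return max(d_pos - d_neg + margin, 0.0)

# Matching anchor/positive with a distant negative incurs no loss;
# swapping positive and negative incurs d_pos - d_neg + margin.
a = np.zeros(2)
p = np.zeros(2)
n = np.array([1.0, 0.0])
```

For audio-visual synchronization, the anchor and positive would typically be audio and video embeddings from the same time window, with the negative drawn from a mismatched window.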
Title: Speech and Language Processing
Abstract: Speech and Language Processing is a comprehensive, reader-friendly, and up-to-date guide to computational linguistics, covering both statistical and symbolic approaches and their applications. It will appeal both to advanced undergraduates with little specialized background and to researchers, who will find it a helpful guide to new techniques in a rapidly evolving research field.
We study the use of the Wave-U-Net architecture for speech enhancement, a model introduced by Stoller et al. for the separation of music vocals and accompaniment. This end-to-end learning method for audio source separation operates directly in the time domain, permitting the integrated modelling of phase information and being able to take large temporal contexts into account. Our experiments show that the proposed method improves several metrics, namely PESQ, CSIG, CBAK, COVL and SSNR, over the state of the art with respect to the speech enhancement task on the Voice Bank corpus (VCTK) dataset. We find that a reduced number of hidden layers is sufficient for speech enhancement in comparison to the original system designed for singing voice separation in music. We see this initial result as an encouraging signal to further explore speech enhancement in the time domain, both as an end in itself and as a pre-processing step to speech recognition systems.
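The Wave-U-Net abstract describes a time-domain encoder/decoder with successive downsampling, upsampling, and skip connections. A toy NumPy sketch of that shape-preserving structure (the real model uses learned multi-channel convolutions; the kernel, depth, and activation here are illustrative assumptions):

```python
import numpy as np

def block(x, kernel):
    """1-D convolution followed by a leaky ReLU (toy feature extractor)."""
    y = np.convolve(x, kernel, mode="same")
    return np.where(y > 0, y, 0.01 * y)

def wave_u_net_sketch(mixture, depth=3):
    """Toy time-domain U-Net: at each encoder level, filter and
    decimate by 2, saving the pre-decimation features as skips;
    the decoder upsamples by repetition and adds the skips back."""
    kernel = np.array([0.25, 0.5, 0.25])  # fixed smoothing kernel (assumption)
    skips = []
    x = np.asarray(mixture, dtype=float)
    for _ in range(depth):                # encoder: conv + downsample
        x = block(x, kernel)
        skips.append(x)
        x = x[::2]
    for _ in range(depth):                # decoder: upsample + skip connection
        up = np.repeat(x, 2)[: len(skips[-1])]
        x = block(up + skips.pop(), kernel)
    return x

# The output has the same length as the input waveform, as required
# for sample-level enhancement.
mixture = np.random.RandomState(0).randn(16)
```

The skip connections carry fine temporal detail around the bottleneck, which is what lets such architectures model phase implicitly in the time domain.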