Title: Wavesplit: End-to-End Speech Separation by Speaker Clustering
Abstract:
We introduce Wavesplit, an end-to-end system for speech separation. From a single recording of mixed speech, the model infers and clusters representations of each speaker, and then estimates each source signal conditioned on the inferred representations. The model is trained end-to-end on the raw waveform to perform both tasks jointly. By deriving a set of speaker representations through clustering, it addresses the fundamental permutation problem of speech separation. Moreover, compared with previous approaches, sequence-wide speaker representations provide more robust separation of long, challenging sequences. We show that Wavesplit outperforms the previous state of the art on mixtures of 2 or 3 speakers (WSJ0-2mix, WSJ0-3mix), as well as under noisy (WHAM!) and reverberant (WHAMR!) conditions. Furthermore, we improve our model further by introducing online data augmentation.
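The permutation problem mentioned above arises because the ordering of estimated sources relative to reference speakers is arbitrary. The sketch below shows the generic permutation-invariant objective commonly used for this (Wavesplit's actual clustering-based loss differs; this is only an illustrative baseline):

```python
import itertools
import numpy as np

def permutation_invariant_loss(estimates, targets):
    """Minimum mean-squared error over all assignments of
    estimated sources to reference speakers (generic PIT loss,
    not Wavesplit's clustering objective)."""
    n = len(targets)
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        err = np.mean([np.mean((estimates[perm[i]] - targets[i]) ** 2)
                       for i in range(n)])
        best = min(best, err)
    return best

# The loss is zero even when the estimates come back in swapped order.
targets = [np.ones(4), np.zeros(4)]
estimates = [np.zeros(4), np.ones(4)]  # same sources, swapped
```

Because the loss minimizes over all speaker orderings, the model is never penalized for emitting correct sources in a different order than the references.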
Title: Self-supervised learning for audio-visual speaker diarization
Abstract:
Speaker diarization is a technique for locating the speech segments of a specific speaker, and it is widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-visual synchronization learning method to address speaker diarization without massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We evaluate on a real-world human-computer interaction system, and the results show that our best model achieves a remarkable gain of +8% in F1 score, along with a reduction in diarization error rate. Finally, we introduce a new large-scale audio-visual corpus to fill the gap in Chinese audio-visual datasets.
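The abstract's dynamic triplet loss builds on the standard triplet margin loss over embeddings. A minimal sketch of that standard form (the paper's "dynamic" variant adapts it further; the margin value here is an assumption for illustration):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: pull the anchor embedding
    toward the positive (same speaker) and push it away from the
    negative (different speaker) by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to negative
    return max(d_pos - d_neg + margin, 0.0)

# Matching anchor/positive with a distant negative incurs no loss;
# swapping positive and negative incurs d_pos - d_neg + margin.
a = np.zeros(2)
p = np.zeros(2)
n = np.array([1.0, 0.0])
```

For audio-visual synchronization, the anchor and positive would typically be audio and video embeddings from the same time window, with the negative drawn from a mismatched window.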
Title: Speech and Language Processing
Abstract: Speech and Language Processing is a comprehensive, reader-friendly, and up-to-date guide to computational linguistics, covering both statistical and symbolic approaches and their applications. It will appeal both to advanced undergraduates with little specialized background and to researchers, who will find it a helpful guide to new techniques in a rapidly evolving research field.
We study the use of the Wave-U-Net architecture for speech enhancement, a model introduced by Stoller et al. for the separation of music vocals and accompaniment. This end-to-end learning method for audio source separation operates directly in the time domain, permitting the integrated modelling of phase information and being able to take large temporal contexts into account. Our experiments show that the proposed method improves several metrics, namely PESQ, CSIG, CBAK, COVL and SSNR, over the state of the art with respect to the speech enhancement task on the Voice Bank corpus (VCTK) dataset. We find that a reduced number of hidden layers is sufficient for speech enhancement in comparison to the original system designed for singing voice separation in music. We see this initial result as an encouraging signal to further explore speech enhancement in the time domain, both as an end in itself and as a pre-processing step to speech recognition systems.
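The Wave-U-Net abstract describes a time-domain encoder/decoder with successive downsampling, upsampling, and skip connections. A toy NumPy sketch of that shape-preserving structure (the real model uses learned multi-channel convolutions; the kernel, depth, and activation here are illustrative assumptions):

```python
import numpy as np

def block(x, kernel):
    """1-D convolution followed by a leaky ReLU (toy feature extractor)."""
    y = np.convolve(x, kernel, mode="same")
    return np.where(y > 0, y, 0.01 * y)

def wave_u_net_sketch(mixture, depth=3):
    """Toy time-domain U-Net: at each encoder level, filter and
    decimate by 2, saving the pre-decimation features as skips;
    the decoder upsamples by repetition and adds the skips back."""
    kernel = np.array([0.25, 0.5, 0.25])  # fixed smoothing kernel (assumption)
    skips = []
    x = np.asarray(mixture, dtype=float)
    for _ in range(depth):                # encoder: conv + downsample
        x = block(x, kernel)
        skips.append(x)
        x = x[::2]
    for _ in range(depth):                # decoder: upsample + skip connection
        up = np.repeat(x, 2)[: len(skips[-1])]
        x = block(up + skips.pop(), kernel)
    return x

# The output has the same length as the input waveform, as required
# for sample-level enhancement.
mixture = np.random.RandomState(0).randn(16)
```

The skip connections carry fine temporal detail around the bottleneck, which is what lets such architectures model phase implicitly in the time domain.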