Title: Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems
Abstract: The out-of-vocabulary (OOV) problem is typical for any speech recognition system: hybrid systems are usually built to recognize a fixed set of words and rarely cover all the words that will be encountered once the system is deployed. One popular way to handle OOVs is to use subword units instead of words. Such a system can potentially recognize any previously unseen word, provided it can be assembled from the available subword units, but it can also recognize non-existent words. Another popular approach is to modify the HMM part of the system so that it can be easily and efficiently extended with a custom set of words we want to add. In this paper, we explore existing variants of this solution at the level of both graph construction and search methods. We also present a novel vocabulary expansion technique that resolves some common issues in the internal subroutines for recognition graph processing.
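A minimal sketch of the general idea (not the paper's implementation): keeping the pronunciation lexicon as a phone-level prefix tree so that custom words can be added locally, without rebuilding the whole recognition graph. The words and pronunciations below are illustrative placeholders.

```python
class LexiconTrie:
    """Phone-level prefix tree over pronunciations; adding a word only creates new arcs."""
    def __init__(self):
        self.root = {}                              # phone -> child node

    def add_word(self, word, phones):
        """Insert a pronunciation without touching the rest of the tree,
        so the decoding graph built on top only needs a local update."""
        node = self.root
        for p in phones:
            node = node.setdefault(p, {})
        node.setdefault("#words", set()).add(word)  # mark a word-end state

lex = LexiconTrie()
lex.add_word("zurich", ["z", "uh", "r", "ih", "k"])  # hypothetical custom OOV word
lex.add_word("zoom",   ["z", "uw", "m"])
```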
Title: Natural Language Processing and Query Expansion
Introduction:
The availability of large knowledge resources has stimulated a great deal of work on developing and enhancing information retrieval techniques. Users' information needs are expressed in natural language, and successful retrieval depends heavily on how effectively the intended purpose is communicated. Natural language queries contain a variety of linguistic features that represent the intended search goal. Linguistic characteristics that cause semantic ambiguity and misinterpretation of queries, together with other factors such as a lack of familiarity with the search environment, limit users' ability to represent their information needs accurately, a phenomenon known as the intention gap. This gap directly affects the relevance of the returned search results, which may not satisfy the user, and is therefore a major issue for the effectiveness of information retrieval systems. Central to our discussion is identifying the significant constituents that characterize the query intent and enriching them, manually or automatically, by capturing meaningful terms, phrases, or even latent representations of the intended meaning. Specifically, we discuss techniques for achieving this enrichment, in particular those that exploit statistical processing of term dependencies within a document corpus or information gathered from external knowledge sources such as ontologies. We lay out the anatomy of a generic linguistics-based query expansion framework and propose a module-based decomposition covering topical issues from query processing, information retrieval, computational linguistics, and ontology engineering. For each module, we review state-of-the-art solutions from the literature, categorized and analyzed in light of the techniques used.
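As an illustration of the statistical side of such enrichment, the sketch below expands a query with terms that co-occur strongly with it in a toy document collection, scored by pointwise mutual information (PMI). The corpus, scoring choice, and cutoff are assumptions made for the example, not a method prescribed by the survey.

```python
import math
from collections import Counter
from itertools import combinations

# Toy document collection; each document is treated as a bag of terms.
docs = [
    "neural network speech recognition",
    "speech recognition acoustic model",
    "query expansion information retrieval",
    "information retrieval ranking model",
]

term_count = Counter()
pair_count = Counter()
for d in docs:
    terms = set(d.split())
    term_count.update(terms)
    pair_count.update(frozenset(p) for p in combinations(sorted(terms), 2))

def pmi(a, b, n_docs=len(docs)):
    """Pointwise mutual information between two terms at document level."""
    joint = pair_count[frozenset((a, b))]
    if joint == 0:
        return float("-inf")
    return math.log((joint / n_docs) / ((term_count[a] / n_docs) * (term_count[b] / n_docs)))

def expand(query, k=2):
    """Add the k candidate terms with the strongest association to any query term."""
    candidates = {t for d in docs for t in d.split()} - set(query)
    scored = [(max(pmi(t, q) for q in query), t) for t in candidates]
    return query + [t for _, t in sorted(scored, reverse=True)[:k]]

print(expand(["speech", "recognition"]))
```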
Title: MULTI-TASK SELF-SUPERVISED LEARNING FOR ROBUST SPEECH RECOGNITION
Abstract: Despite growing interest in unsupervised learning, extracting meaningful knowledge from unlabeled audio remains an open challenge. As a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), which combines a convolutional encoder with multiple neural networks, called workers, tasked with solving self-supervised problems that require no manually annotated ground truth. PASE was shown to capture relevant speech information, including the speaker's voice print and phonemes. This paper proposes PASE+, an improved version for robust speech recognition in noisy and reverberant environments. To this end, we employ an online speech distortion module that contaminates the input signal with a variety of random disturbances. We then propose a revised encoder that better learns short- and long-term speech dynamics through an efficient combination of recurrent and convolutional networks. Finally, we refine the set of workers used for self-supervision to encourage better cooperation.
Results on TIMIT, DIRHA, and CHiME-5 show that PASE+ significantly outperforms both the previous version of PASE and common acoustic features. Interestingly, PASE+ learns transferable features that remain effective under highly mismatched acoustic conditions.
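A minimal PyTorch sketch of the PASE-style setup described above: a shared convolutional-plus-recurrent encoder feeding several self-supervised worker heads trained jointly. Layer sizes, strides, and the regression targets are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared speech encoder: convolutions for local structure, a GRU for longer-term dynamics."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
        )
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, wav):                    # wav: (batch, samples)
        h = self.conv(wav.unsqueeze(1))        # (batch, dim, frames)
        h, _ = self.rnn(h.transpose(1, 2))     # (batch, frames, dim)
        return h

class Worker(nn.Module):
    """One self-supervised head; each worker regresses a different target."""
    def __init__(self, dim, target_dim):
        super().__init__()
        self.head = nn.Linear(dim, target_dim)

    def forward(self, h, target):
        return nn.functional.mse_loss(self.head(h), target)

encoder = Encoder()
workers = nn.ModuleList([Worker(128, 40), Worker(128, 13)])   # e.g. filterbank / MFCC-like targets

wav = torch.randn(2, 16000)                    # one second of fake audio, batch of 2
feats = encoder(wav)
# Fake targets with matching time resolution, just to show the joint multi-task loss.
targets = [torch.randn(feats.size(0), feats.size(1), w.head.out_features) for w in workers]
loss = sum(w(feats, t) for w, t in zip(workers, targets))
loss.backward()
```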
Title: Speech and Language Processing
Abstract: Speech and Language Processing is a comprehensive, reader-friendly, and up-to-date guide to computational linguistics, covering both statistical and symbolic approaches and their applications. It will appeal both to upper-level undergraduates without a strong technical background and to researchers, who will find it a helpful introduction to new techniques in a rapidly evolving field.
This paper investigates the impact of word-based RNN language models (RNN-LMs) on the performance of end-to-end automatic speech recognition (ASR). In our prior work, we proposed a multi-level LM, in which character-based and word-based RNN-LMs are combined in hybrid CTC/attention-based ASR. Although this multi-level approach achieves significant error reduction on the Wall Street Journal (WSJ) task, two different LMs need to be trained and used for decoding, which increases the computational cost and memory usage. In this paper, we further propose a novel word-based RNN-LM that allows decoding with the word-based LM alone: it provides look-ahead word probabilities to predict the next characters in place of the character-based LM, leading to competitive accuracy with less computation than the multi-level LM. We demonstrate the efficacy of the word-based RNN-LMs on a larger corpus, LibriSpeech, in addition to the WSJ corpus used in our prior work. Furthermore, we show that the proposed model achieves a 5.1% WER on the WSJ Eval'92 test set when the vocabulary size is increased, which is the best WER reported for end-to-end ASR systems on this benchmark.
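A minimal sketch of the look-ahead idea, using a toy unigram word LM rather than the paper's RNN-LM: the probability of the next character is obtained by summing the mass of all vocabulary words consistent with the current character prefix. The vocabulary and probabilities are made up for the example.

```python
from collections import defaultdict

word_lm = {"speech": 0.4, "speed": 0.3, "spell": 0.2, "cat": 0.1}   # toy P(word)

def next_char_probs(prefix):
    """Look-ahead distribution over the next character given a character prefix."""
    mass = defaultdict(float)
    total = 0.0
    for w, p in word_lm.items():
        if w.startswith(prefix) and len(w) > len(prefix):
            mass[w[len(prefix)]] += p          # accumulate word mass on its next character
            total += p
    return {c: p / total for c, p in mass.items()} if total else {}

print(next_char_probs("spe"))                  # {'e': ~0.78, 'l': ~0.22}
```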
State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first is inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015), and the second by the scattering transform (Zeghidour et al., 2017). We propose two modifications to these architectures and systematically compare them to mel-filterbanks on the Wall Street Journal dataset. The first modification is the addition of an instance normalization layer, which greatly improves the gammatone-based trainable filterbanks and speeds up the training of the scattering-based filterbanks. The second concerns the low-pass filter used in these approaches. These modifications consistently improve performance for both approaches and remove the need for careful initialization of scattering-based trainable filterbanks. In particular, we show a consistent improvement in word error rate of the trainable filterbanks relative to comparable mel-filterbanks. This is the first time end-to-end models trained from the raw signal significantly outperform mel-filterbanks on a large vocabulary task under clean recording conditions.
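A minimal PyTorch sketch of a trainable filterbank front-end in the spirit of the approaches compared above: a learnable convolution over the raw waveform, log compression, and the instance normalization layer reported as helpful. Filter count, window length, and hop are illustrative and do not reproduce the gammatone- or scattering-based designs.

```python
import torch
import torch.nn as nn

class TrainableFilterbank(nn.Module):
    def __init__(self, n_filters=40, win=400, hop=160):
        super().__init__()
        # Learnable band-pass filters applied directly to the raw waveform.
        self.filters = nn.Conv1d(1, n_filters, kernel_size=win, stride=hop, bias=False)
        # Instance normalization over time, per utterance and per filter.
        self.norm = nn.InstanceNorm1d(n_filters)

    def forward(self, wav):                     # wav: (batch, samples)
        x = self.filters(wav.unsqueeze(1))      # (batch, n_filters, frames)
        x = torch.log1p(x.abs())                # rectification + log compression
        return self.norm(x)

frontend = TrainableFilterbank()
feats = frontend(torch.randn(4, 16000))        # -> (4, 40, 98) learned "filterbank" features
```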
Sequence-to-sequence attention-based models, which integrate the acoustic, pronunciation, and language model into a single neural network, have recently shown very promising results on automatic speech recognition (ASR) tasks. Among these models, the Transformer, a sequence-to-sequence attention-based model relying entirely on self-attention without RNNs or convolutions, achieves a new single-model state-of-the-art BLEU on neural machine translation (NMT) tasks. Given the outstanding performance of the Transformer, we extend it to speech and adopt it as the basic architecture of a sequence-to-sequence attention-based model for Mandarin Chinese ASR. Furthermore, we compare a syllable-based model with a context-independent phoneme (CI-phoneme) based model, both using the Transformer, on Mandarin Chinese. Additionally, a greedy cascading decoder with the Transformer is proposed for mapping CI-phoneme sequences and syllable sequences into word sequences. Experiments on the HKUST dataset demonstrate that the syllable-based model with the Transformer outperforms the CI-phoneme based counterpart and achieves a character error rate (CER) of 28.77%, which is competitive with the state-of-the-art CER of 28.0% from the joint CTC-attention based encoder-decoder network.
Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS) subsume the acoustic, pronunciation, and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In our previous work, we showed that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear whether they would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model that significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We introduce a multi-head attention architecture, which offers improvements over the commonly used single-head attention. On the optimization side, we explore techniques such as synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, all of which are shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500-hour voice search task, we find that the proposed changes improve the WER of the LAS system from 9.2% to 5.6%, while the best conventional system achieves a 6.7% WER. We also test both models on a dictation dataset, where our model provides a 4.1% WER while the conventional system provides a 5.0% WER.
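A minimal sketch of label smoothing, one of the optimization techniques listed above; the vocabulary size and smoothing weight below are arbitrary example values, not those of the paper.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, epsilon=0.1):
    """Cross entropy against a smoothed target distribution:
    1 - epsilon on the gold label, epsilon spread over the remaining classes."""
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, epsilon / (n_classes - 1))
    smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - epsilon)
    return -(smooth * log_probs).sum(dim=-1).mean()

logits = torch.randn(8, 1000)                  # batch of 8, hypothetical 1000-piece vocabulary
targets = torch.randint(0, 1000, (8,))
loss = label_smoothing_loss(logits, targets)
```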