Neural speech synthesis models have recently demonstrated the ability to synthesize high-quality speech for text-to-speech and compression applications. These new models often require powerful GPUs to achieve real-time operation, so reducing their complexity would open the way for many new applications. We propose LPCNet, a WaveRNN variant that combines linear prediction with recurrent neural networks to significantly improve the efficiency of speech synthesis. We demonstrate that LPCNet can achieve significantly higher quality than WaveRNN for the same network size and that high-quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS. This makes it easier to deploy neural synthesis applications on lower-power devices, such as embedded systems and mobile phones.
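The linear-prediction half of that combination can be illustrated with a small sketch. It is not LPCNet itself: it only shows classical linear predictive coding (autocorrelation plus the Levinson-Durbin recursion) on a toy signal, to make concrete why a predictor leaves a low-energy excitation for a network to model. The resonator coefficients, the signal length, and all function names are assumptions chosen for illustration.

```python
import math

def autocorrelation(x, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of a finite signal."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; a[j] weights the sample at lag j + 1."""
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / err
        prev = a[:]
        a[i] = k
        for j in range(i):
            a[j] = prev[j] - k * prev[i - 1 - j]
        err *= 1.0 - k * k
    return a, err

def residual(x, a):
    """Excitation e[n] = x[n] - sum_j a[j] * x[n - 1 - j]."""
    out = []
    for n in range(len(x)):
        pred = sum(a[j] * x[n - 1 - j] for j in range(len(a)) if n - 1 - j >= 0)
        out.append(x[n] - pred)
    return out

# Toy "speech": impulse response of a two-pole resonator (hypothetical values).
TRUE_A = [2 * 0.9 * math.cos(0.3), -0.81]
signal = []
for n in range(400):
    s = 1.0 if n == 0 else 0.0
    if n >= 1:
        s += TRUE_A[0] * signal[n - 1]
    if n >= 2:
        s += TRUE_A[1] * signal[n - 2]
    signal.append(s)

lpc, _ = levinson_durbin(autocorrelation(signal, 2), 2)
excitation = residual(signal, lpc)
```

On this toy signal the recovered coefficients match the resonator, and the excitation carries only a small fraction of the signal energy. The cheap linear predictor handles the spectral envelope, leaving the neural network a simpler residual to model.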

Related content

Speech synthesis

Speech synthesis, also called text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric synthesis, and then to hybrid synthesis; the quality and naturalness of synthesized speech have improved markedly and largely meet the needs of certain specific applications. Speech synthesis is now widely used in information announcement systems in banks and hospitals, in car navigation systems, and in automated-response call centers, with substantial economic benefit. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, its applications are gradually extending into entertainment, language teaching, rehabilitation therapy, and similar fields. Speech synthesis can be said to be influencing every aspect of people's lives.

Phoneme · Transformation · Attention mechanism · Speech synthesis · Speech recognition

April 14, 2020

Transformer based Grapheme-to-Phoneme Conversion

Sevinj Yolchuyeva,Géza Németh,Bálint Gyires-Tóth

from arxiv, INTERSPEECH 2019

The attention mechanism is one of the most successful techniques in deep-learning-based Natural Language Processing (NLP). The transformer network architecture is completely based on attention mechanisms, and it outperforms sequence-to-sequence models in neural machine translation without recurrent and convolutional layers. Grapheme-to-phoneme (G2P) conversion is the task of converting letters (grapheme sequence) to their pronunciations (phoneme sequence). It plays a significant role in text-to-speech (TTS) and automatic speech recognition (ASR) systems. In this paper, we investigate the application of the transformer architecture to G2P conversion and compare its performance with recurrent and convolutional neural network based approaches. Phoneme and word error rates are evaluated on the CMUDict dataset for US English and the NetTalk dataset. The results show that transformer-based G2P outperforms the convolutional-based approach in terms of word error rate, and our results significantly exceed previous recurrent approaches (without attention) regarding word and phoneme error rates on both datasets. Furthermore, the size of the proposed model is much smaller than the size of the previous approaches.
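The two metrics named in the abstract can be sketched as follows. The sketch assumes the common definitions: phoneme error rate is the edit distance between predicted and reference phoneme sequences divided by the number of reference phonemes, and word error rate is the fraction of words whose predicted pronunciation is not an exact match. It also assumes a single reference pronunciation per word, which is a simplification. The toy data and function names are hypothetical and not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def error_rates(references, hypotheses):
    """Return (PER, WER) for parallel lists of phoneme sequences."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    phonemes = sum(len(r) for r in references)
    wrong_words = sum(r != h for r, h in zip(references, hypotheses))
    return edits / phonemes, wrong_words / len(references)

# Hypothetical toy data in ARPAbet-style symbols.
refs = [["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]]
hyps = [["HH", "AH", "L", "OW"], ["W", "AO", "L", "D"]]
per, wer = error_rates(refs, hyps)
```

Here one substituted phoneme out of eight gives a phoneme error rate of 0.125, while one wrong word out of two gives a word error rate of 0.5. Word error rate is therefore the stricter measure.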

Transformer · Machine translation

October 17, 2019

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

K. Murray, J. Kinnison, T. Q. Nguyen, W. Scheirer, D. Chiang (University of Notre Dame, 2019)

Speech synthesis · Networking · Performer · Transformation · state-of-the-art

January 30, 2019

Neural Speech Synthesis with Transformer Network

Naihan Li,Shujie Liu,Yanqing Liu,Sheng Zhao,Ming Liu,Ming Zhou

Although end-to-end neural text-to-speech (TTS) methods (such as Tacotron2) have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in Tacotron2. With the help of multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves the training efficiency. Meanwhile, any two inputs at different times are connected directly by the self-attention mechanism, which solves the long-range dependency problem effectively. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder to output the final audio results. Experiments are conducted to test the efficiency and performance of our new network. For efficiency, our Transformer TTS network speeds up training by about 4.25 times compared with Tacotron2. For performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforming Tacotron2 by a gap of 0.048) and is very close to human quality (4.39 vs 4.44 in MOS).
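The claim that any two inputs are connected directly can be made concrete with a minimal sketch of scaled dot-product self-attention. It is a simplification under stated assumptions: a single head, identity query/key/value projections instead of learned ones, and a toy four-frame input. The function names and input values are hypothetical, and this is not the paper's network.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product self-attention with identity projections.

    x is a list of T vectors of dimension d. Real models apply learned
    query/key/value projections and several heads; both are omitted here.
    """
    d = len(x[0])
    scores = [[sum(q * k for q, k in zip(xi, xj)) / math.sqrt(d) for xj in x]
              for xi in x]
    weights = [softmax(row) for row in scores]
    outputs = [[sum(w * xj[c] for w, xj in zip(row, x)) for c in range(d)]
               for row in weights]
    return outputs, weights

# Hypothetical 4-step input sequence with 2-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = self_attention(frames)
```

Every row of `weights` assigns nonzero weight to every position, so the first and last frames interact in one step instead of through a chain of recurrent updates. Each output row depends only on the input, not on other output rows, which is what allows all positions to be computed in parallel during training.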

Learned · Neural Networks · Representation learning · Separate · Performer

December 10, 2018

Adaptive Neural Trees

Ryutaro Tanno,Kai Arulkumaran,Daniel C. Alexander,Antonio Criminisi,Aditya Nori

Deep neural networks and decision trees operate on largely separate paradigms; typically, the former performs representation learning with pre-specified architectures, while the latter is characterised by learning hierarchies over pre-specified features with data-driven architectures. We unite the two via adaptive neural trees (ANTs), a model that incorporates representation learning into edges, routing functions and leaf nodes of a decision tree, along with a backpropagation-based training algorithm that adaptively grows the architecture from primitive modules (e.g., convolutional layers). ANTs allow increased interpretability via hierarchical clustering, e.g., learning meaningful class associations, such as separating natural vs. man-made objects. We demonstrate this on classification and regression tasks, achieving over 99% and 90% accuracy on the MNIST and CIFAR-10 datasets, and outperforming standard neural networks, random forests and gradient boosted trees on the SARCOS dataset. Furthermore, ANT optimisation naturally adapts the architecture to the size and complexity of the training data.

Denoising · White-box · Robustness · Networking · Model evaluation

December 9, 2018

Feature Denoising for Improving Adversarial Robustness

Cihang Xie,Yuxin Wu,Laurens van der Maaten,Alan Yuille,Kaiming He

from arxiv, tech report

Adversarial attacks on image classification systems present challenges to convolutional networks and opportunities for understanding them. This study suggests that adversarial perturbations on images lead to noise in the features constructed by these networks. Motivated by this observation, we develop new network architectures that increase adversarial robustness by performing feature denoising. Specifically, our networks contain blocks that denoise the features using non-local means or other filters; the entire networks are trained end-to-end. When combined with adversarial training, our feature denoising networks substantially improve the state-of-the-art in adversarial robustness in both white-box and black-box attack settings. On ImageNet, under 10-iteration PGD white-box attacks where prior art has 27.9% accuracy, our method achieves 55.7%; even under extreme 2000-iteration PGD white-box attacks, our method secures 42.6% accuracy. A network based on our method was ranked first in the Competition on Adversarial Attacks and Defenses (CAAD) 2018; it achieved 50.6% classification accuracy on a secret, ImageNet-like test dataset against 48 unknown attackers, surpassing the runner-up approach by ~10%. Code and models will be made publicly available.

Neural Networks · Question answering · Networking · Functional · Knowledge base

October 5, 2018

Improving Question Answering by Commonsense-Based Pre-Training

Wanjun Zhong,Duyu Tang,Nan Duan,Ming Zhou,Jiahai Wang,Jian Yin

from arxiv, 8 pages

Although neural network approaches achieve remarkable success on a variety of NLP tasks, many of them struggle to answer questions that require commonsense knowledge. We believe the main reason is the lack of commonsense connections between concepts. To remedy this, we provide a simple and effective method that leverages an external commonsense knowledge base such as ConceptNet. We pre-train direct and indirect relational functions between concepts, and show that these pre-trained functions could be easily added to existing neural network models. Results show that incorporating the commonsense-based functions improves the state-of-the-art on two question answering tasks that require commonsense reasoning. Further analysis shows that our system discovers and leverages useful evidence from an external commonsense knowledge base, which is missing in existing neural network models and helps derive the correct answer.

Language modeling · Speech recognition · End-to-end · RNN · MoDELS

August 8, 2018

End-to-end Speech Recognition with Word-based RNN Language Models

Takaaki Hori,Jaejin Cho,Shinji Watanabe

This paper investigates the impact of word-based RNN language models (RNN-LMs) on the performance of end-to-end automatic speech recognition (ASR). In our prior work, we have proposed a multi-level LM, in which character-based and word-based RNN-LMs are combined in hybrid CTC/attention-based ASR. Although this multi-level approach achieves significant error reduction in the Wall Street Journal (WSJ) task, two different LMs need to be trained and used for decoding, which increases the computational cost and memory usage. In this paper, we further propose a novel word-based RNN-LM, which allows us to decode with only the word-based LM, where it provides look-ahead word probabilities to predict next characters instead of the character-based LM, leading to competitive accuracy with less computation compared to the multi-level LM. We demonstrate the efficacy of the word-based RNN-LMs using a larger corpus, LibriSpeech, in addition to the WSJ corpus used in the prior work. Furthermore, we show that the proposed model achieves 5.1% WER on the WSJ Eval'92 test set when the vocabulary size is increased, which is the best WER reported for end-to-end ASR systems on this benchmark.
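The look-ahead idea can be illustrated with a rough sketch: given word-level probabilities, the probability of the next character after a partial word is the probability mass of words that continue the prefix with that character, divided by the mass of all words sharing the prefix. This is an assumed simplification, not the paper's implementation. A fixed toy distribution stands in for the RNN-LM output (which in practice depends on the word history), a linear scan stands in for a prefix tree, and the `</w>` end-of-word marker and all names are hypothetical.

```python
def lookahead_char_probs(word_probs, prefix):
    """P(next character | prefix) derived from word-level probabilities."""
    prefix_mass = 0.0
    next_mass = {}
    for word, p in word_probs.items():
        if not word.startswith(prefix):
            continue
        prefix_mass += p
        # "</w>" marks the end of the word when the prefix is already complete.
        nxt = word[len(prefix)] if len(word) > len(prefix) else "</w>"
        next_mass[nxt] = next_mass.get(nxt, 0.0) + p
    if prefix_mass == 0.0:
        return {}  # prefix not covered by the vocabulary
    return {c: m / prefix_mass for c, m in next_mass.items()}

# Toy word distribution standing in for one RNN-LM prediction step.
word_probs = {"the": 0.5, "then": 0.2, "this": 0.2, "cat": 0.1}
probs = lookahead_char_probs(word_probs, "th")
```

With the prefix `th`, the words `the` and `then` contribute to the character `e`, and `this` contributes to `i`. The decoder can therefore score character hypotheses using only the word-level model.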

Speaker recognition · Better · Learned · Networking · Neural Networks

July 29, 2018

Speaker Recognition from raw waveform with SincNet

Mirco Ravanelli,Yoshua Bengio

from arxiv, Submitted to SLT 2018

Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have been recently obtained with Convolutional Neural Networks (CNNs) when fed by raw speech samples directly. Rather than employing standard hand-crafted features, the latter CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants. Proper design of the neural network is crucial to achieve this goal. This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, which learn all elements of each filter, only low and high cutoff frequencies are directly learned from data with the proposed method. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application. Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed architecture converges faster and performs better than a standard CNN on raw waveforms.
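A band-pass filter defined by only two cutoff frequencies can be sketched as the difference of two sinc low-pass filters, shaped by a Hamming window. In SincNet the two cutoffs are learned; in this sketch they are fixed, hypothetical values in cycles per sample, and the filter length of 101 taps and the function names are likewise assumptions for illustration.

```python
import math

def sinc(x):
    """Unnormalized sinc, sin(x) / x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x

def sinc_bandpass(f_low, f_high, length):
    """Hamming-windowed band-pass taps from two cutoffs (length must be odd)."""
    half = (length - 1) // 2
    taps = []
    for i in range(length):
        n = i - half
        # Difference of two ideal low-pass impulse responses.
        ideal = (2 * f_high * sinc(2 * math.pi * f_high * n)
                 - 2 * f_low * sinc(2 * math.pi * f_low * n))
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
        taps.append(ideal * window)
    return taps

def gain(taps, freq):
    """Magnitude of the filter's frequency response at freq (cycles/sample)."""
    re = sum(t * math.cos(2 * math.pi * freq * n) for n, t in enumerate(taps))
    im = sum(t * math.sin(2 * math.pi * freq * n) for n, t in enumerate(taps))
    return math.hypot(re, im)

taps = sinc_bandpass(0.10, 0.20, 101)
```

All 101 taps are determined by two numbers, which is the compactness the abstract refers to. Evaluating `gain` inside the band (for example at 0.15) and outside it (for example at 0.02 or 0.35) shows the band-pass shape.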

Conjugation · Machine Translation · Vocabulary · NMT · Softmax function

May 25, 2018

Japanese Predicate Conjugation for Neural Machine Translation

Michiki Kurosawa,Yukio Matsumura,Hayahide Yamagishi,Mamoru Komachi

from arxiv, 6 pages; NAACL 2018 Student Research Workshop

Neural machine translation (NMT) has a drawback in that it can generate only high-frequency words owing to the computational costs of the softmax function in the output layer. In Japanese-English NMT, Japanese predicate conjugation causes an increase in vocabulary size. For example, one verb can have as many as 19 surface varieties. In this research, we focus on predicate conjugation for compressing the vocabulary size in Japanese. The vocabulary list is filled with the various forms of verbs. We propose methods using predicate conjugation information without discarding linguistic information. The proposed methods can generate low-frequency words and deal with unknown words. Two methods were considered to introduce conjugation information: the first considers it as a token (conjugation token) and the second considers it as an embedded vector (conjugation feature). The results using these methods demonstrate that the vocabulary size can be compressed by approximately 86.1% (Tanaka corpus) and the NMT models can output words not in the training data set. Furthermore, BLEU scores improved by 0.91 points in Japanese-to-English translation, and 0.32 points in English-to-Japanese translation with ASPEC.

NMT · Vocabulary · Reducible · INFORMS · Performer

January 11, 2018

Improved English to Russian Translation by Neural Suffix Prediction

Kai Song,Yue Zhang,Min Zhang,Weihua Luo

from arxiv, 8 pages, 3 figures, 5 tables

Neural machine translation (NMT) suffers a performance deficiency when a limited vocabulary fails to cover the source or target side adequately, which happens frequently when dealing with morphologically rich languages. To address this problem, previous work focused on adjusting translation granularity or expanding the vocabulary size. However, morphological information, which may further improve translation quality, is relatively under-considered in NMT architectures. We propose a novel method, which can not only reduce data sparsity but also model morphology through a simple but effective mechanism. By predicting the stem and suffix separately during decoding, our system achieves an improvement of up to 1.98 BLEU compared with previous work on English to Russian translation. Our method is orthogonal to different NMT architectures and stably gains improvements on various domains.