Neural network based end-to-end Text-to-Speech (TTS) has greatly improved the quality of synthesized speech. However, how to efficiently use massive amounts of spontaneous speech without transcription remains an open problem. In this paper, we propose MHTTS, a fast multi-speaker TTS system that is robust to transcription errors and speaking style speech data. Specifically, we introduce a multi-head model and transfer text information from a high-quality corpus with manual transcription to spontaneous speech with imperfectly recognized transcription by jointly training on them. MHTTS has three advantages: 1) Our system synthesizes better quality multi-speaker voice with faster inference speed. 2) Our system is capable of transferring correct text information to data with imperfect transcription, whether simulated using corruption or provided by an Automatic Speech Recogniser (ASR). 3) Our system can utilize massive real spontaneous speech with imperfect transcription and synthesize expressive voice.

Related content

Speech synthesis

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has developed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, and automated call centers, and has produced substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, applications of speech synthesis are gradually extending into entertainment, language teaching, rehabilitation therapy, and similar fields. Speech synthesis can be said to be affecting every aspect of people's lives.

Speech synthesis · Speech script · Topic · MoDELS · Robustness

April 20, 2022

KazakhTTS2: Extending the Open-Source Kazakh TTS Corpus With More Data, Speakers, and Topics

Saida Mussakhojayeva,Yerbolat Khassanov,Huseyin Atakan Varol

from arxiv, 8 pages, 2 figures, 5 tables, accepted to LREC 2022

We present an expanded version of our previously released Kazakh text-to-speech (KazakhTTS) synthesis corpus. In the new KazakhTTS2 corpus, the overall size has increased from 93 hours to 271 hours, the number of speakers has risen from two to five (three females and two males), and the topic coverage has been diversified with the help of new sources, including a book and Wikipedia articles. This corpus is necessary for building high-quality TTS systems for Kazakh, a Central Asian agglutinative language from the Turkic family, which presents several linguistic challenges. We describe the corpus construction process and provide the details of the training and evaluation procedures for the TTS system. Our experimental results indicate that the constructed corpus is sufficient to build robust TTS models for real-world applications, with a subjective mean opinion score ranging from 3.6 to 4.2 for all the five speakers. We believe that our corpus will facilitate speech and language research for Kazakh and other Turkic languages, which are widely considered to be low-resource due to the limited availability of free linguistic data. The constructed corpus, code, and pretrained models are publicly available in our GitHub repository.

Estimation/Estimator · SOTA · MoDELS · Better · Performer

April 18, 2022

Deep Equilibrium Optical Flow Estimation

Shaojie Bai,Zhengyang Geng,Yash Savani,J. Zico Kolter

from arxiv, CVPR 2022

Many recent state-of-the-art (SOTA) optical flow models use finite-step recurrent update operations to emulate traditional algorithms by encouraging iterative refinements toward a stable flow estimation. However, these RNNs impose large computation and memory overheads, and are not directly trained to model such stable estimation. They can converge poorly and thereby suffer from performance degradation. To combat these drawbacks, we propose deep equilibrium (DEQ) flow estimators, an approach that directly solves for the flow as the infinite-level fixed point of an implicit layer (using any black-box solver), and differentiates through this fixed point analytically (thus requiring $O(1)$ training memory). This implicit-depth approach is not predicated on any specific model, and thus can be applied to a wide range of SOTA flow estimation model designs. The use of these DEQ flow estimators allows us to compute the flow faster using, e.g., fixed-point reuse and inexact gradients, consumes $4\sim6\times$ less training memory than the recurrent counterpart, and achieves better results with the same computation budget. In addition, we propose a novel, sparse fixed-point correction scheme to stabilize our DEQ flow estimators, which addresses a longstanding challenge for DEQ models in general. We test our approach in various realistic settings and show that it improves SOTA methods on the Sintel and KITTI datasets with substantially better computational and memory efficiency.
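
The idea of solving for an output as the fixed point of a layer, rather than unrolling a fixed number of recurrent updates, can be illustrated with a toy scalar example. This is only a sketch under stated assumptions: `layer` is a hypothetical contractive function chosen for illustration, and plain fixed-point iteration stands in for the "black-box solver"; none of it is the authors' flow estimator.

```python
import math

def solve_fixed_point(f, z0, tol=1e-10, max_iter=1000):
    """Iterate z <- f(z) until successive iterates differ by less than tol."""
    z = z0
    for step in range(1, max_iter + 1):
        z_next = f(z)
        if abs(z_next - z) < tol:
            return z_next, step
        z = z_next
    return z, max_iter

# Hypothetical toy "layer": its slope is bounded by 0.5 in magnitude,
# so it is a contraction and the iteration converges to a unique fixed point.
def layer(z):
    return 0.5 * math.cos(z)

z_star, steps = solve_fixed_point(layer, z0=0.0)
residual = abs(layer(z_star) - z_star)  # how far z_star is from satisfying z = f(z)
```

The solver stops when the iterate stops changing, so the number of steps adapts to the problem instead of being fixed in advance.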

Swin Transformer · INFORMS · Pyramid · Transform · state-of-the-art

April 18, 2022

BSRT: Improving Burst Super-Resolution with Swin Transformer and Flow-Guided Deformable Alignment

Ziwei Luo,Youwei Li,Shen Cheng,Lei Yu,Qi Wu,Zhihong Wen,Haoqiang Fan,Jian Sun,Shuaicheng Liu

from arxiv, Winner method in NTIRE Burst Super-Resolution Challenge Real-World Track

This work uses a new architecture to address the Burst Super-Resolution (BurstSR) task, which requires restoring a high-quality image from a sequence of noisy, misaligned, and low-resolution RAW bursts. To overcome the challenges in BurstSR, we propose a Burst Super-Resolution Transformer (BSRT), which can significantly improve the capability of extracting inter-frame information and reconstruction. To achieve this goal, we propose a Pyramid Flow-Guided Deformable Convolution Network (Pyramid FG-DCN) and incorporate Swin Transformer Blocks and Groups as our main backbone. More specifically, we combine optical flows and deformable convolutions, so that our BSRT can handle misalignment and aggregate the potential texture information in multi-frames more efficiently. In addition, our Transformer-based structure can capture long-range dependency to further improve the performance. The evaluation on both synthetic and real-world tracks demonstrates that our approach achieves a new state-of-the-art on the BurstSR task. Further, our BSRT wins the championship in the NTIRE2022 Burst Super-Resolution Challenge.

Performer · Speech script · Identifiable · Speech recognition · Allo

April 18, 2022

Intent Classification Using Pre-trained Language Agnostic Embeddings For Low Resource Languages

Hemant Yadav,Akshat Gupta,Sai Krishna Rallabandi,Alan W Black,Rajiv Ratn Shah

Building Spoken Language Understanding (SLU) systems that do not rely on language-specific Automatic Speech Recognition (ASR) is an important yet less explored problem in language processing. In this paper, we present a comparative study aimed at employing a pre-trained acoustic model to perform SLU in low resource scenarios. Specifically, we use three different embeddings extracted using Allosaurus, a pre-trained universal phone decoder: (1) Phone, (2) Panphone, and (3) Allo embeddings. These embeddings are then used in identifying the spoken intent. We perform experiments across three different languages, English, Sinhala, and Tamil, each with different data sizes to simulate high, medium, and low resource scenarios. Our system improves on the state-of-the-art (SOTA) intent classification accuracy by approximately 2.11% for Sinhala and 7.00% for Tamil and achieves competitive results on English. Furthermore, we present a quantitative analysis of how the performance scales with the number of training examples used per intent.

Pairwise · Transcription system · Likelihood · Estimation/Estimator · Learned

April 17, 2022

A Data-Driven Methodology for Considering Feasibility and Pairwise Likelihood in Deep Learning Based Guitar Tablature Transcription Systems

Frank Cwitkowitz,Jonathan Driedger,Zhiyao Duan

from arxiv, Sound and Music Computing Conference (SMC) 2022

Guitar tablature transcription is an important but understudied problem within the field of music information retrieval. Traditional signal processing approaches offer only limited performance on the task, and there is little acoustic data with transcription labels for training machine learning models. However, guitar transcription labels alone are more widely available in the form of tablature, which is commonly shared among guitarists online. In this work, a collection of symbolic tablature is leveraged to estimate the pairwise likelihood of notes on the guitar. The output layer of a baseline tablature transcription model is reformulated, such that an inhibition loss can be incorporated to discourage the co-activation of unlikely note pairs. This naturally enforces playability constraints for guitar, and yields tablature which is more consistent with the symbolic data used to estimate pairwise likelihoods. With this methodology, we show that symbolic tablature can be used to shape the distribution of a tablature transcription model's predictions, even when little acoustic data is available.

Factor analysis · Factorized · Performer · Learned · Latent

April 16, 2022

Graph-incorporated Latent Factor Analysis for High-dimensional and Sparse Matrices

Di Wu,Yi He,Xin Luo

A high-dimensional and sparse (HiDS) matrix is frequently encountered in big data-related applications such as e-commerce systems or social network services. Performing highly accurate representation learning on it is of great significance, given the strong demand for extracting latent knowledge and patterns from it. Latent factor analysis (LFA), which represents an HiDS matrix by learning low-rank embeddings based on its observed entries only, is one of the most effective and efficient approaches to this issue. However, most existing LFA-based models perform such embeddings on an HiDS matrix directly without exploiting its hidden graph structures, thereby resulting in accuracy loss. To address this issue, this paper proposes a graph-incorporated latent factor analysis (GLFA) model. It adopts two ideas: 1) a graph is constructed for identifying the hidden high-order interaction (HOI) among nodes described by an HiDS matrix, and 2) a recurrent LFA structure is carefully designed with the incorporation of HOI, thereby improving the representation learning ability of the resultant model. Experimental results on three real-world datasets demonstrate that GLFA outperforms six state-of-the-art models in predicting the missing data of an HiDS matrix, which supports its strong representation learning ability on HiDS data.
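
Learning low-rank embeddings from observed entries only can be sketched with plain stochastic gradient descent on a tiny sparse matrix. This is a minimal illustration of basic LFA, not the GLFA model: the function names, the rank, the learning rate, the regularization weight, and the toy data are all assumptions made for this example.

```python
import random

def train_lfa(observed, n_rows, n_cols, rank=2, lr=0.02, reg=0.01,
              epochs=3000, seed=0):
    """Fit row factors P and column factors Q so that P[i] . Q[j] ~ value,
    using only the (i, j) pairs present in `observed`."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for (i, j), value in observed.items():
            err = value - sum(P[i][k] * Q[j][k] for k in range(rank))
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step on squared error with L2 regularization.
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return P, Q

def predict(P, Q, i, j):
    """Estimate entry (i, j), including entries that were never observed."""
    return sum(p * q for p, q in zip(P[i], Q[j]))

# Hypothetical 3x3 matrix with 6 of 9 entries observed.
observed = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0,
            (1, 2): 2.0, (2, 1): 1.0, (2, 2): 5.0}
P, Q = train_lfa(observed, n_rows=3, n_cols=3)
train_rmse = (sum((v - predict(P, Q, i, j)) ** 2
                  for (i, j), v in observed.items()) / len(observed)) ** 0.5
```

Calling `predict(P, Q, 0, 2)` then gives an estimate for an entry that was missing from `observed`, which is the prediction task the abstract evaluates.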

Speech enhancement · Loss function (machine learning) · INFORMS · Fourier transform · Microsoft Windows

April 15, 2022

Improving Frame-Online Neural Speech Enhancement with Overlapped-Frame Prediction

Zhong-Qiu Wang,Shinji Watanabe

from arxiv, in submission

Frame-online speech enhancement systems in the short-time Fourier transform (STFT) domain usually have an algorithmic latency equal to the window size due to the use of the overlap-add algorithm in the inverse STFT (iSTFT). This algorithmic latency allows the enhancement models to leverage future contextual information up to a length equal to the window size. However, current frame-online systems only partially leverage this future information. To fully exploit this information, this study proposes an overlapped-frame prediction technique for deep learning based frame-online speech enhancement, where at each frame our deep neural network (DNN) predicts the current and several past frames that are necessary for overlap-add, instead of only predicting the current frame. In addition, we propose a novel loss function to account for the scale difference between predicted and oracle target signals. Evaluation results on a noisy-reverberant speech enhancement task show the effectiveness of the proposed algorithms.
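
The link between overlap-add and latency can be seen in a time-domain toy example: each output sample is the sum of contributions from every frame that covers it, so a sample is final only once the last overlapping frame has arrived. The sketch below assumes a 4-sample periodic Hann window with 50% overlap and skips the STFT itself; it illustrates plain overlap-add, not the overlapped-frame prediction technique proposed in the paper.

```python
import math

def frame_signal(x, win, hop):
    """Cut x into windowed frames that start every `hop` samples."""
    n = len(win)
    return [[x[s + k] * win[k] for k in range(n)]
            for s in range(0, len(x) - n + 1, hop)]

def overlap_add(frames, hop, length):
    """Sum each frame back into the output at its original offset."""
    y = [0.0] * length
    for t, frame in enumerate(frames):
        for k, value in enumerate(frame):
            y[t * hop + k] += value
    return y

N, hop = 4, 2
# Periodic Hann window; copies shifted by `hop` sum to 1 (constant overlap-add).
win = [0.5 - 0.5 * math.cos(2 * math.pi * k / N) for k in range(N)]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
frames = frame_signal(x, win, hop)
y = overlap_add(frames, hop, len(x))
# Samples covered by two frames are reconstructed; edge samples are not.
interior_ok = all(abs(y[i] - x[i]) < 1e-9 for i in range(2, 6))
```

Samples 2 and 3, for instance, receive contributions from both the first and the second frame, so they cannot be emitted until the second frame has been processed.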

Transform · Extensibility · INFORMS · Performer · MoDELS

December 17, 2020

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

Haoyi Zhou,Shanghang Zhang,Jieqi Peng,Shuai Zhang,Jianxin Li,Hui Xiong,Wancai Zhang

from arxiv, 7 pages (main), 5 pages (appendix), to appear in AAAI 2021

Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input efficiently. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues with Transformer that prevent it from being directly applicable to LSTF, such as quadratic time complexity, high memory usage, and the inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics: (i) a $ProbSparse$ Self-attention mechanism, which achieves $O(L \log L)$ in time complexity and memory usage, and has comparable performance on sequences' dependency alignment; (ii) the self-attention distilling, which highlights dominating attention by halving cascading layer input and efficiently handles extremely long input sequences; and (iii) the generative style decoder, which, while conceptually simple, predicts the long time-series sequences in one forward operation rather than step by step, which drastically improves the inference speed of long-sequence predictions. Extensive experiments on four large-scale datasets demonstrate that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem.

Meta-learning · Speech recognition · MAML · Learned · End-to-end

October 26, 2019

Meta Learning for End-to-End Low-Resource Speech Recognition

Jui-Yang Hsu,Yuan-Jui Chen,Hung-yi Lee

from arxiv, 5 pages, submitted to ICASSP 2020

In this paper, we propose applying a meta-learning approach to low-resource automatic speech recognition (ASR). We formulate ASR for different languages as different tasks, and meta-learn the initialization parameters from many pretraining languages to achieve fast adaptation on an unseen target language, via the recently proposed model-agnostic meta-learning algorithm (MAML). We evaluate the proposed approach using six languages as pretraining tasks and four languages as target tasks. Preliminary results show that the proposed method, MetaASR, significantly outperforms the state-of-the-art multitask pretraining approach on all target languages with different combinations of pretraining languages. In addition, owing to MAML's model-agnostic property, this paper also opens a new research direction of applying meta-learning to more speech-related applications.
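
Meta-learning an initialization with MAML can be sketched on a one-parameter toy problem. Everything here is an assumption made for illustration: each "task" is a quadratic loss (theta - c)^2 with a hypothetical target c standing in for a language, and the gradients are written out analytically. It shows the inner-step and outer-step structure of MAML, not the MetaASR system.

```python
def maml_1d(task_targets, theta=0.0, inner_lr=0.1, outer_lr=0.1, steps=200):
    """Meta-learn a scalar initialization theta over tasks with loss (theta - c)^2."""
    for _ in range(steps):
        meta_grad = 0.0
        for c in task_targets:
            # Inner loop: one gradient step on this task, starting from theta.
            adapted = theta - inner_lr * 2.0 * (theta - c)
            # Outer gradient of the post-adaptation loss (adapted - c)^2 with
            # respect to theta; d(adapted)/d(theta) = 1 - 2 * inner_lr.
            meta_grad += 2.0 * (adapted - c) * (1.0 - 2.0 * inner_lr)
        # Outer loop: update the shared initialization.
        theta -= outer_lr * meta_grad / len(task_targets)
    return theta

# Hypothetical task targets; in this symmetric toy setting the learned
# initialization ends up at their mean.
theta_init = maml_1d([1.0, 2.0, 6.0])
```

From `theta_init`, a single inner gradient step on any one task moves the parameter toward that task's target, which is the "fast adaptation" the abstract refers to.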

prototype · Networking · Few-shot learning · Better · Robustness

October 25, 2019

Meta-Learning with Dynamic-Memory-Based Prototypical Network for Few-Shot Event Detection

Shumin Deng,Ningyu Zhang,Jiaojian Kang,Yichi Zhang,Wei Zhang,Huajun Chen

from arxiv, Accepted by WSDM 2020

Event detection (ED), a sub-task of event extraction, involves identifying triggers and categorizing event mentions. Existing methods primarily rely upon supervised learning and require large-scale labeled event datasets, which are unfortunately not readily available in many real-life applications. In this paper, we consider and reformulate the ED task with limited labeled data as a Few-Shot Learning problem. We propose a Dynamic-Memory-Based Prototypical Network (DMB-PN), which exploits a Dynamic Memory Network (DMN) not only to learn better prototypes for event types, but also to produce more robust sentence encodings for event mentions. Unlike vanilla prototypical networks, which simply compute event prototypes by averaging and consume event mentions only once, our model is more robust and can distill contextual information from event mentions multiple times thanks to the multi-hop mechanism of DMNs. The experiments show that DMB-PN not only deals with sample scarcity better than a series of baseline models but also performs more robustly when the variety of event types is relatively large and the instance quantity is extremely small.
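
The vanilla prototypical-network baseline mentioned above, in which each class prototype is the average of its support embeddings and a query is assigned to the nearest prototype, can be sketched as follows. The event-type labels and the 2-D "embeddings" are hypothetical values for illustration; this is the averaging baseline, not DMB-PN.

```python
def compute_prototypes(support):
    """Average the support embeddings of each class into one prototype."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[d] for v in vectors) / len(vectors)
                         for d in range(dim)]
    return protos

def classify(query, protos):
    """Return the label whose prototype is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda label: sq_dist(query, protos[label]))

# Hypothetical 2-way, 2-shot episode with made-up sentence embeddings.
support = {
    "attack": [[1.0, 0.0], [0.8, 0.2]],
    "transfer": [[0.0, 1.0], [0.2, 0.8]],
}
protos = compute_prototypes(support)
predicted = classify([0.7, 0.3], protos)
```

Each support embedding is read exactly once when the mean is taken, which is the limitation the multi-hop memory mechanism is meant to address.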