Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one, which suffer from unstable prosody, word skipping/repeating issues, and poor voice quality. In this paper, we develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors and uses a diffusion model to generate these latent vectors conditioned on text input. To enhance the zero-shot capability that is important for diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor. We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers. NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting, and performs novel zero-shot singing synthesis with only a speech prompt. Audio samples are available at //speechresearch.github.io/naturalspeech2.
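To make the modeling recipe concrete, here is a minimal sketch of one training step for a NaturalSpeech 2-style latent-diffusion TTS model: a denoiser predicts clean codec latents from a noised latent sequence, conditioned on text and speech-prompt encodings. All module names, dimensions, and the simplistic noising schedule below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of latent-diffusion TTS over neural codec latents.
# LatentDenoiser is a hypothetical module; conditioning is a stand-in for
# concatenated text and speech-prompt encodings.
import torch
import torch.nn as nn

class LatentDenoiser(nn.Module):
    def __init__(self, d_latent=128, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.in_proj = nn.Linear(d_latent, d_model)
        self.time_emb = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)   # cross-attends to the conditioning
        self.out_proj = nn.Linear(d_model, d_latent)

    def forward(self, z_t, t, cond):
        # z_t: (B, T, d_latent) noised codec latents; t: (B, 1) diffusion step in [0, 1]
        # cond: (B, L, d_model) text + speech-prompt encodings
        h = self.in_proj(z_t) + self.time_emb(t).unsqueeze(1)
        h = self.decoder(tgt=h, memory=cond)
        return self.out_proj(h)                                  # predicted clean latents z0

# Toy training step: predict z0 and form a simple L2 diffusion loss.
B, T, L = 2, 200, 120
z0 = torch.randn(B, T, 128)                                      # latents from the codec's RVQ (stand-in)
t = torch.rand(B, 1)
noise = torch.randn_like(z0)
z_t = (1 - t).unsqueeze(-1) * z0 + t.unsqueeze(-1) * noise       # simplistic noising schedule
cond = torch.randn(B, L, 512)                                     # text + prompt encodings (stand-in)
model = LatentDenoiser()
loss = nn.functional.mse_loss(model(z_t, t, cond), z0)
```

At inference time, the same denoiser would be applied iteratively starting from noise, and the codec decoder would convert the predicted latents into a waveform.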
Prompt tuning, like CoOp, has recently shown promising visual recognition and transfer learning ability on various downstream tasks with the emergence of large pre-trained vision-language models like CLIP. However, we identify that existing uni-modal prompt tuning approaches may result in sub-optimal performance, since this uni-modal design breaks the original alignment of textual and visual representations in the pre-trained model. Inspired by the nature of pre-trained vision-language models, we aim to achieve completeness in prompt tuning and propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed MuDPT, which extends independent multi-modal prompt tuning by additionally learning a model-agnostic transformative network to allow deep hierarchical bi-directional prompt fusion. We evaluate the effectiveness of MuDPT on few-shot visual recognition and out-of-domain generalization tasks. Compared with state-of-the-art methods, MuDPT achieves better recognition and generalization ability by a clear margin, thanks to the synergistic alignment of textual and visual representations. Our code is available at: //github.com/Mechrev0/MuDPT.
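The core fusion idea can be illustrated with a small sketch; the prompt depth, token counts, dimensions, and the use of two linear cross-modal maps below are assumptions for illustration rather than MuDPT's exact transformative network.

```python
# Sketch of deep multi-modal prompt tuning with bi-directional fusion:
# learnable prompt tokens are kept per layer for both branches, and each
# branch's prompts are projected into the other branch's space and added.
import torch
import torch.nn as nn

class BiDirectionalPromptFusion(nn.Module):
    def __init__(self, n_layers=4, n_tokens=4, d_text=512, d_vision=768):
        super().__init__()
        self.text_prompts = nn.Parameter(torch.randn(n_layers, n_tokens, d_text) * 0.02)
        self.vision_prompts = nn.Parameter(torch.randn(n_layers, n_tokens, d_vision) * 0.02)
        # lightweight cross-modal projections (an assumed stand-in for the
        # transformative network)
        self.text_to_vision = nn.Linear(d_text, d_vision)
        self.vision_to_text = nn.Linear(d_vision, d_text)

    def prompts_for_layer(self, layer_idx):
        t, v = self.text_prompts[layer_idx], self.vision_prompts[layer_idx]
        fused_text = t + self.vision_to_text(v)      # visual information flows into text prompts
        fused_vision = v + self.text_to_vision(t)    # textual information flows into visual prompts
        return fused_text, fused_vision

fusion = BiDirectionalPromptFusion()
text_p, vision_p = fusion.prompts_for_layer(0)       # (4, 512) and (4, 768) fused prompt tokens
```

The fused prompts would then be prepended to the token sequences of the corresponding transformer layers in the frozen text and image encoders.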
Large language models (LLMs) have gained considerable attention for Artificial Intelligence Generated Content (AIGC), particularly with the emergence of ChatGPT. However, the direct adaptation of continuous speech to LLMs that process discrete tokens remains an unsolved challenge, hindering the application of LLMs for speech generation. Advanced speech LMs are just around the corner, as speech signals encapsulate a wealth of information beyond textual data alone, including speaker identity and emotion. Prompt tuning has demonstrated notable gains in parameter efficiency and competitive performance on some speech classification tasks. However, the extent to which prompts can effectively elicit generation tasks from speech LMs remains an open question. In this paper, we present pioneering research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks within a unified framework called SpeechGen, with around 10M trainable parameters. The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs, which will significantly enhance its capabilities. The code and demos of SpeechGen will be available on the project website: \url{//ga642381.github.io/SpeechPrompt/speechgen}
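As a rough illustration of prompt tuning a frozen speech LM for generation (hypothetical interfaces; SpeechGen's actual design may differ), only a small matrix of prompt vectors prepended to the embedded discrete speech units is trained, which keeps the trainable parameter count in the millions.

```python
# Sketch of prompt tuning: the pretrained speech LM stays frozen and only the
# prepended prompt vectors are optimized. Assumes an LM that accepts
# precomputed input embeddings (e.g., an `inputs_embeds` argument).
import torch
import torch.nn as nn

class PromptedSpeechLM(nn.Module):
    def __init__(self, speech_lm, n_prompts=100, d_model=768):
        super().__init__()
        self.speech_lm = speech_lm                     # pretrained, kept frozen
        for p in self.speech_lm.parameters():
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

    def forward(self, unit_embeddings):
        # unit_embeddings: (B, T, d_model) embeddings of discrete speech units
        B = unit_embeddings.size(0)
        prefix = self.prompts.unsqueeze(0).expand(B, -1, -1)
        return self.speech_lm(inputs_embeds=torch.cat([prefix, unit_embeddings], dim=1))
```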
Language model (LM) based audio generation frameworks, e.g., AudioLM, have recently achieved new state-of-the-art performance in zero-shot audio generation. In this paper, we explore the feasibility of LMs for zero-shot voice conversion. An intuitive approach is to follow AudioLM: tokenize speech into semantic and acoustic tokens with HuBERT and SoundStream, respectively, and convert source semantic tokens to target acoustic tokens conditioned on acoustic tokens of the target speaker. However, such an approach encounters several issues: 1) the linguistic content contained in semantic tokens may get dispersed during multi-layer modeling, while the lengthy speech input in the voice conversion task makes contextual learning even harder; 2) the semantic tokens still contain speaker-related information, which may leak into the target speech and lower the target speaker similarity; 3) the generation diversity in the sampling of the LM can lead to unexpected outcomes during inference, causing unnatural pronunciation and speech quality degradation. To mitigate these problems, we propose LM-VC, a two-stage language modeling approach that generates coarse acoustic tokens for recovering the source linguistic content and the target speaker's timbre, and then reconstructs the fine acoustic details to produce the converted speech. Specifically, to enhance content preservation and facilitate better disentanglement, a masked prefix LM with a mask prediction strategy is used for coarse acoustic modeling. This model is encouraged to recover the masked content from the surrounding context and to generate target speech based on the target speaker's utterance and corrupted semantic tokens. Besides, to further alleviate sampling errors during generation, an external LM, which employs window attention to capture local acoustic relations, is introduced to participate in the coarse acoustic modeling.
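A high-level sketch of the resulting two-stage inference flow is shown below; the tokenizer, LM, and decoder interfaces are hypothetical placeholders rather than the paper's code.

```python
# Sketch of the coarse-then-fine LM voice-conversion pipeline: source semantic
# tokens carry content, a target-speaker acoustic prompt carries timbre.
def lm_voice_conversion(source_wav, target_wav,
                        semantic_tokenizer, acoustic_tokenizer,
                        coarse_lm, fine_lm, codec_decoder):
    semantic = semantic_tokenizer(source_wav)      # source linguistic content (e.g., HuBERT units)
    prompt = acoustic_tokenizer(target_wav)        # target speaker timbre (e.g., codec tokens)
    # Stage 1: coarse acoustic tokens conditioned on the speaker prompt and the
    # (possibly corrupted) semantic tokens.
    coarse = coarse_lm.generate(prompt=prompt, semantic=semantic)
    # Stage 2: fine acoustic tokens restoring acoustic detail.
    fine = fine_lm.generate(coarse=coarse)
    return codec_decoder(coarse, fine)             # converted speech waveform
```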
Many neural text-to-speech architectures can synthesize nearly natural speech from text inputs. These architectures must be trained with tens of hours of annotated and high-quality speech data. Compiling such large databases for every new voice requires a lot of time and effort. In this paper, we describe a method to extend the popular Tacotron-2 architecture and its training with data augmentation to enable single-speaker synthesis using a limited amount of specific training data. In contrast to elaborate augmentation methods proposed in the literature, we use simple stationary noises for data augmentation. Our extension is easy to implement and adds almost no computational overhead during training and inference. Using only two hours of training data, our approach was rated by human listeners to be on par with the baseline Tacotron-2 trained with 23.5 hours of LJSpeech data. In addition, we tested our model with a semantically unpredictable sentences test, which showed that both models exhibit similar intelligibility levels.
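The augmentation itself can be as simple as the following sketch, which mixes white noise into each training waveform at a randomly drawn SNR; the paper's exact noise types and SNR range are not reproduced here.

```python
# Stationary-noise data augmentation: add scaled white noise at a random SNR.
import numpy as np

def add_stationary_noise(wav, snr_db_range=(10.0, 30.0), rng=None):
    if rng is None:
        rng = np.random.default_rng()
    snr_db = rng.uniform(*snr_db_range)
    noise = rng.standard_normal(wav.shape)
    signal_power = np.mean(wav ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(signal_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return wav + scale * noise

augmented = add_stationary_noise(np.random.randn(22050))  # 1 s of dummy audio at 22.05 kHz
```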
This paper presents a novel approach for text/speech-driven animation of a photo-realistic head model based on blend-shape geometry, dynamic textures, and neural rendering. Training a VAE for geometry and texture yields a parametric model for accurate capture and realistic synthesis of facial expressions from a latent feature vector. Our animation method is based on a conditional CNN that transforms text or speech into a sequence of animation parameters. In contrast to previous approaches, our animation model learns to disentangle and synthesize different acting styles in an unsupervised manner, requiring only phonetic labels that describe the content of the training sequences. For realistic real-time rendering, we train a U-Net that refines rasterization-based renderings by computing improved pixel colors and a foreground matte. We compare our framework qualitatively and quantitatively against recent methods for head modeling and facial animation, and evaluate the perceived rendering/animation quality in a user study, which indicates large improvements compared to state-of-the-art approaches.
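As an illustration of the animation component only (assumed architecture and hyper-parameters, not the authors' model), a conditional temporal CNN can map per-frame speech or phonetic features, together with a style code, to a sequence of animation parameters.

```python
# Sketch of a conditional CNN for speech-to-animation regression: per-frame
# features plus a latent acting-style code are mapped to animation parameters.
import torch
import torch.nn as nn

class Speech2AnimationCNN(nn.Module):
    def __init__(self, d_feat=80, d_style=16, d_anim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_feat + d_style, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, d_anim, kernel_size=1),
        )

    def forward(self, features, style):
        # features: (B, T, d_feat) per-frame speech/phonetic features
        # style:    (B, d_style) latent acting-style code, broadcast over time
        style = style.unsqueeze(1).expand(-1, features.size(1), -1)
        x = torch.cat([features, style], dim=-1).transpose(1, 2)   # (B, C, T) for Conv1d
        return self.net(x).transpose(1, 2)                          # (B, T, d_anim) animation params

model = Speech2AnimationCNN()
params = model(torch.randn(2, 100, 80), torch.randn(2, 16))         # (2, 100, 128)
```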
We introduce a Loss Discounting Framework for model and forecast combination which generalises and combines Bayesian model synthesis and generalized Bayes methodologies. We use a loss function to score the performance of different models and introduce a multilevel discounting scheme which allows a flexible specification of the dynamics of the model weights. This novel and simple model combination approach can be easily applied to large-scale model averaging/selection, can handle unusual features such as sudden regime changes, and can be tailored to different forecasting problems. We compare our method to both established methodologies and state-of-the-art methods for a number of macroeconomic forecasting examples. We find that the proposed method offers an attractive, computationally efficient alternative to the benchmark methodologies and often outperforms more complex techniques.
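A minimal single-discount sketch conveys the mechanism: each model's past losses are accumulated with exponential discounting and turned into combination weights via an exponential tilt. The multilevel discounting scheme described above generalizes this by combining several discount factors; the single factor and learning rate below are assumptions for illustration.

```python
# Loss-discounted model combination weights (single discount factor).
import numpy as np

def discounted_weights(losses, discount=0.95, learning_rate=1.0):
    # losses: (T, J) array, losses[t, j] = loss of model j on observation t
    T, J = losses.shape
    cum = np.zeros(J)
    weights = np.full((T, J), 1.0 / J)                 # equal weights before any data
    for t in range(1, T):
        cum = discount * cum + losses[t - 1]           # discount old losses, add the newest
        w = np.exp(-learning_rate * (cum - cum.min())) # subtract the min for numerical stability
        weights[t] = w / w.sum()
    return weights

w = discounted_weights(np.abs(np.random.randn(100, 3)))  # toy forecast losses for 3 models
```

A discount close to 1 yields slowly varying, averaging-like weights, while a smaller discount lets the weights react quickly to regime changes.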
Diffusion models have shown incredible capabilities as generative models; indeed, they power the current state-of-the-art models on text-conditioned image generation such as Imagen and DALL-E 2. In this work we review, demystify, and unify the understanding of diffusion models across both variational and score-based perspectives. We first derive Variational Diffusion Models (VDM) as a special case of a Markovian Hierarchical Variational Autoencoder, where three key assumptions enable tractable computation and scalable optimization of the ELBO. We then prove that optimizing a VDM boils down to learning a neural network to predict one of three potential objectives: the original source input from any arbitrary noisification of it, the original source noise from any arbitrarily noisified input, or the score function of a noisified input at any arbitrary noise level. We then dive deeper into what it means to learn the score function, and connect the variational perspective of a diffusion model explicitly with the Score-based Generative Modeling perspective through Tweedie's Formula. Lastly, we cover how to learn a conditional distribution using diffusion models via guidance.
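For concreteness, the relations behind the three equivalent objectives can be written in standard DDPM-style notation (a brief sketch consistent with the usual VDM setup; the notation is assumed here rather than quoted from the paper). With the forward process $q(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\big)$, a noised sample can be written as
\[
  x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I),
\]
so the conditional score satisfies
\[
  \nabla_{x_t} \log q(x_t \mid x_0)
  = -\frac{x_t - \sqrt{\bar{\alpha}_t}\,x_0}{1-\bar{\alpha}_t}
  = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}},
\]
which is why predicting $x_0$, predicting $\epsilon$, and predicting the score are interchangeable up to known scalings in $x_t$ and $\bar{\alpha}_t$. Tweedie's formula supplies the marginal counterpart that links the denoiser to the score of $p(x_t)$:
\[
  \sqrt{\bar{\alpha}_t}\,\mathbb{E}[x_0 \mid x_t] = x_t + (1-\bar{\alpha}_t)\,\nabla_{x_t} \log p(x_t).
\]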
With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential to investigate ways to adapt these models to downstream datasets. A recently proposed method named Context Optimization (CoOp) introduces the concept of prompt learning -- a recent trend in NLP -- to the vision domain for adapting pre-trained vision-language models. Specifically, CoOp turns context words in a prompt into a set of learnable vectors and, with only a few labeled images for learning, can achieve huge improvements over intensively-tuned manual prompts. In our study we identify a critical problem of CoOp: the learned context is not generalizable to wider unseen classes within the same dataset, suggesting that CoOp overfits base classes observed during training. To address the problem, we propose Conditional Context Optimization (CoCoOp), which extends CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector). Compared to CoOp's static prompts, our dynamic prompts adapt to each instance and are thus less sensitive to class shift. Extensive experiments show that CoCoOp generalizes much better than CoOp to unseen classes, even showing promising transferability beyond a single dataset; and yields stronger domain generalization performance as well. Code is available at //github.com/KaiyangZhou/CoOp.
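The instance-conditioning itself is lightweight; a minimal sketch (with illustrative dimensions and meta-network design) looks as follows: a small network maps each image feature to a single token that is added to every learnable context vector, so the prompt depends on the input image.

```python
# Sketch of input-conditional prompt learning in the CoCoOp spirit.
import torch
import torch.nn as nn

class ConditionalContext(nn.Module):
    def __init__(self, n_ctx=4, d_ctx=512, d_img=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, d_ctx) * 0.02)   # shared learnable context
        self.meta_net = nn.Sequential(                               # lightweight bottleneck MLP
            nn.Linear(d_img, d_img // 16), nn.ReLU(), nn.Linear(d_img // 16, d_ctx)
        )

    def forward(self, image_features):
        # image_features: (B, d_img) from the frozen CLIP image encoder
        pi = self.meta_net(image_features).unsqueeze(1)              # (B, 1, d_ctx) conditional token
        return self.ctx.unsqueeze(0) + pi                            # (B, n_ctx, d_ctx) per-image prompts

ctx = ConditionalContext()
prompts = ctx(torch.randn(8, 512))                                   # prompts for a batch of 8 images
```

The per-image prompts are then concatenated with each class-name embedding and fed to the frozen text encoder to score the image against all classes.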
Zero-shot Learning (ZSL), which aims to make predictions for classes that have never appeared in the training data, has attracted significant research interest. The key to implementing ZSL is to leverage prior knowledge of classes, which builds the semantic relationship between classes and enables the transfer of learned models (e.g., features) from training classes (i.e., seen classes) to unseen classes. However, the priors adopted by existing methods are relatively limited, with incomplete semantics. In this paper, we explore richer and more competitive prior knowledge to model the inter-class relationship for ZSL via ontology-based knowledge representation and semantic embedding. Meanwhile, to address the data imbalance between seen classes and unseen classes, we develop a generative ZSL framework with Generative Adversarial Networks (GANs). Our main findings include: (i) an ontology-enhanced ZSL framework that can be applied to different domains, such as image classification (IMGC) and knowledge graph completion (KGC); (ii) a comprehensive evaluation with multiple zero-shot datasets from different domains, where our method often achieves better performance than the state-of-the-art models. In particular, on four representative ZSL baselines of IMGC, the ontology-based class semantics outperform previous priors, e.g., the word embeddings of classes, by an average of 12.4 accuracy points in the standard ZSL setting across two example datasets (see Figure 4).
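A simplified sketch of the generative ZSL idea (illustrative dimensions, not the paper's implementation): a conditional generator, given noise and a class embedding such as an ontology-derived vector, synthesizes visual features for unseen classes so that an ordinary classifier can be trained on them.

```python
# Conditional feature generator for generative zero-shot learning.
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    def __init__(self, d_sem=300, d_noise=128, d_feat=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_sem + d_noise, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, d_feat), nn.ReLU(),
        )

    def forward(self, class_embedding, noise):
        # class_embedding: (N, d_sem) class semantics; noise: (N, d_noise)
        return self.net(torch.cat([class_embedding, noise], dim=-1))

gen = FeatureGenerator()
unseen_class_emb = torch.randn(5, 300)                        # ontology embeddings of 5 unseen classes
fake_features = gen(unseen_class_emb.repeat_interleave(100, dim=0),
                    torch.randn(500, 128))                    # 100 synthetic features per class
```

In a full GAN setup, a discriminator trained on real seen-class features would provide the adversarial signal that makes these synthetic features realistic.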
The recent GPT-3 model (Brown et al., 2020) achieves remarkable few-shot performance solely by leveraging a natural-language prompt and a few task demonstrations as input context. Inspired by their findings, we study few-shot learning in a more practical scenario, where we use smaller language models for which fine-tuning is computationally efficient. We present LM-BFF--better few-shot fine-tuning of language models--a suite of simple and complementary techniques for fine-tuning language models on a small number of annotated examples. Our approach includes (1) prompt-based fine-tuning together with a novel pipeline for automating prompt generation; and (2) a refined strategy for dynamically and selectively incorporating demonstrations into each context. Finally, we present a systematic evaluation for analyzing few-shot performance on a range of NLP tasks, including classification and regression. Our experiments demonstrate that our methods combine to dramatically outperform standard fine-tuning procedures in this low-resource setting, achieving up to 30% absolute improvement, and 11% on average, across all tasks. Our approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning.
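A toy sketch of prompt-based fine-tuning in this spirit: a classification example is rendered with a template containing a mask token, and the masked LM's logits over a small set of label words act as class scores. The template and label words below are illustrative choices, and we assume each label word is a single token in the model's vocabulary.

```python
# Prompt-based scoring with a masked LM (illustrative template and label words).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
label_words = {"positive": "great", "negative": "terrible"}

def class_scores(sentence):
    text = f"{sentence} It was {tokenizer.mask_token}."           # prompt template
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    logits = model(**inputs).logits[0, mask_pos]                  # logits over the vocabulary
    ids = {lbl: tokenizer.convert_tokens_to_ids(w) for lbl, w in label_words.items()}
    return {lbl: logits[i].item() for lbl, i in ids.items()}

print(class_scores("A gorgeous, witty, seductive movie."))
```

During fine-tuning, a cross-entropy loss over these label-word logits replaces the usual task-specific classification head, so the few labeled examples update the LM through its native masked-language-modeling interface.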