
Speech synthesis · Pair · MoDELS · Unsupervised · Learning

February 5, 2023

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

Takaaki Saeki,Soumi Maiti,Xinjian Li,Shinji Watanabe,Shinnosuke Takamichi,Hiroshi Saruwatari

While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model with paired data in a supervised manner, while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language. All experiments were conducted using public datasets and the implementation will be made available for reproducibility.
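The abstract reports intelligibility as a character error rate (CER). As a minimal sketch of how CER is commonly computed (Levenshtein edit distance between a reference transcript and a recognized transcript, divided by the reference length), the following may help. The function names are hypothetical, and this is not the authors' evaluation script, which may normalize text differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a CER below 12% means fewer than 12 character edits per 100 reference characters, assuming the synthesized speech is transcribed by a recognizer and compared with the input text.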

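Related content

Speech synthesis

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on many disciplines, including artificial intelligence, psychology, acoustics, linguistics, digital signal processing, and computer science, and is a frontier technology in the field of information processing. As computer technology has advanced, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, and automated call centers, and has yielded substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, applications of speech synthesis are extending into entertainment, language teaching, and rehabilitation therapy. Speech synthesis can be said to be influencing every aspect of people's lives.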

Speech recognition · Zero-shot · Recognition · Sample · Adaptation

March 29, 2023

AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

Paul Hongsuck Seo,Arsha Nagrani,Cordelia Schmid

from arxiv, CVPR 2023

Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training fully supervised multimodal models for this task from scratch, however, is limited by the need for large labelled audiovisual datasets (in each downstream domain of interest). We present AVFormer, a simple method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation. We do this by (i) injecting visual embeddings into a frozen ASR model using lightweight trainable adaptors. We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters. (ii) We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state of the art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech). Qualitative results show that our model effectively leverages visual information for robust speech recognition.

Context · Fine-tuning · Context embedding · Context-aware · Benchmark

March 28, 2023

Context-aware Fine-tuning of Self-supervised Speech Models

Suwon Shon,Felix Wu,Kwangyoun Kim,Prashant Sridhar,Karen Livescu,Shinji Watanabe

Self-supervised pre-trained transformers have improved the state of the art on a variety of speech tasks. Due to the quadratic time and space complexity of self-attention, they usually operate at the level of relatively short (e.g., utterance) segments. In this paper, we study the use of context, i.e., surrounding segments, during fine-tuning and propose a new approach called context-aware fine-tuning. We attach a context module on top of the last layer of a pre-trained model to encode the whole segment into a context embedding vector which is then used as an additional feature for the final prediction. During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments. This allows the model to make predictions without access to these surrounding segments at inference time and requires only a tiny overhead compared to standard fine-tuned models. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: Automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). The results show that context-aware fine-tuning not only outperforms a standard fine-tuning baseline but also rivals a strong context injection baseline that uses neighboring speech segments during inference.
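The auxiliary loss described in this abstract can be illustrated with a small sketch. The abstract does not say which similarity measure or loss weight is used, so the squared L2 distance between adjacent segments' context embeddings and the `aux_weight` parameter below are assumptions made purely for illustration. The function names are hypothetical, and this is not the authors' implementation.

```python
from typing import Sequence


def context_similarity_loss(embeddings: Sequence[Sequence[float]]) -> float:
    """Mean squared L2 distance between context embeddings of adjacent segments.

    `embeddings[k]` is the context embedding of the k-th segment in a recording.
    Minimizing this value pulls each embedding toward those of its neighbours.
    """
    if len(embeddings) < 2:
        return 0.0  # a single segment has no neighbours to match
    total = 0.0
    pairs = 0
    for current, neighbour in zip(embeddings, embeddings[1:]):
        total += sum((x - y) ** 2 for x, y in zip(current, neighbour))
        pairs += 1
    return total / pairs


def total_loss(task_loss: float,
               embeddings: Sequence[Sequence[float]],
               aux_weight: float = 0.1) -> float:
    """Task loss plus the weighted auxiliary term (the weight is an assumed hyper-parameter)."""
    return task_loss + aux_weight * context_similarity_loss(embeddings)
```

Because the auxiliary term is only part of the training objective, nothing in this sketch needs neighbouring segments at inference time, which matches the property the abstract highlights.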

Adaptation · Pretrained language model · Generalization ability · Language model · Pretraining

March 28, 2023

AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models

Alexandra Chronopoulou,Matthew E. Peters,Alexander Fraser,Jesse Dodge

from arxiv, Accepted at EACL 2023; camera-ready version; fixed typo in related work

Pretrained language models (PLMs) are trained on massive corpora, but often need to specialize to specific domains. A parameter-efficient adaptation method suggests training an adapter for each domain on the task of language modeling. This leads to good in-domain scores but can be impractical for domain- or resource-restricted settings. A solution is to use a related-domain adapter for the novel domain at test time. In this paper, we introduce AdapterSoup, an approach that performs weight-space averaging of adapters trained on different domains. Our approach is embarrassingly parallel: first, we train a set of domain-specific adapters; then, for each novel domain, we determine which adapters should be averaged at test time. We present extensive experiments showing that AdapterSoup consistently improves performance on new domains without extra training. We also explore weight averaging of adapters trained on the same domain with different hyper-parameters, and show that it preserves the performance of a PLM on new domains while obtaining strong in-domain results. We explore various approaches for choosing which adapters to combine, such as text clustering and semantic similarity. We find that using clustering leads to the most competitive results on novel domains.
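Weight-space averaging of adapters, as named in this abstract, can be sketched as an element-wise mean over parameters that share a name. The representation below (a dict mapping parameter names to flat lists of floats) and the function name are assumptions for illustration. Choosing which adapters to average (for example by text clustering) is out of scope here, and this is not the authors' code.

```python
from typing import Dict, List

Adapter = Dict[str, List[float]]  # parameter name -> flattened weights


def average_adapters(adapters: List[Adapter]) -> Adapter:
    """Return the uniform element-wise average of adapters with identical shapes."""
    if not adapters:
        raise ValueError("need at least one adapter to average")
    reference = adapters[0]
    for adapter in adapters[1:]:
        if adapter.keys() != reference.keys():
            raise ValueError("adapters must have the same parameter names")
        for name, weights in adapter.items():
            if len(weights) != len(reference[name]):
                raise ValueError(f"size mismatch for parameter {name!r}")
    count = len(adapters)
    return {
        name: [sum(values) / count
               for values in zip(*(adapter[name] for adapter in adapters))]
        for name in reference
    }
```

The averaged adapter has the same size as any single adapter, so inference cost does not grow with the number of adapters combined.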

Unsupervised pretraining · Speech synthesis · Transcription · Synthesis · Low-resource

March 28, 2023

Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages

Seongyeon Park,Myungseo Song,Bohyung Kim,Tae-Hyun Oh

from arxiv, ICASSP 2023

Neural text-to-speech (TTS) models can synthesize natural human speech when trained on large amounts of transcribed speech. However, collecting such large-scale transcribed data is expensive. This paper proposes an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data. With our pre-training, we can remarkably reduce the amount of paired transcribed data required to train the model for the target downstream TTS task. The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones, which may allow the model to learn proper temporal assignment relation between input and output sequences. In addition, we propose a data augmentation method that further improves the data efficiency in fine-tuning. We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios, achieving outstanding performance compared to competing methods. The code and audio samples are available at: //github.com/cnaigithub/SpeechDewarping

Test set · Model diagnosis · Zero-shot · Sensitivity · Counterfactual

March 27, 2023

Zero-shot Model Diagnosis

Jinqi Luo,Zhaoning Wang,Chen Henry Wu,Dong Huang,Fernando De la Torre

from arxiv, Accepted in CVPR 2023

When it comes to deploying deep vision models, the behavior of these systems must be explicable to ensure confidence in their reliability and fairness. A common approach to evaluate deep learning models is to build a labeled test set with attributes of interest and assess how well it performs. However, creating a balanced test set (i.e., one that is uniformly sampled over all the important traits) is often time-consuming, expensive, and prone to mistakes. The question we try to address is: can we evaluate the sensitivity of deep learning models to arbitrary visual attributes without an annotated test set? This paper argues the case that Zero-shot Model Diagnosis (ZOOM) is possible without the need for a test set or labeling. To avoid the need for test sets, our system relies on a generative model and CLIP. The key idea is enabling the user to select a set of prompts (relevant to the problem) and our system will automatically search for semantic counterfactual images (i.e., synthesized images that flip the prediction in the case of a binary classifier) using the generative model. We evaluate several visual tasks (classification, key-point detection, and segmentation) in multiple visual domains to demonstrate the viability of our methodology. Extensive experiments demonstrate that our method is capable of producing counterfactual images and offering sensitivity analysis for model diagnosis without the need for a test set.

Zero-shot · Diffusion model · Generative pretraining · Sample · Classifier

March 27, 2023

Text-to-Image Diffusion Models are Zero-Shot Classifiers

Kevin Clark,Priyank Jaini

The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is using a diffusion model's ability to denoise a noised image given a text description of a label as a proxy for that label's likelihood. We apply our method to Imagen, using it to probe fine-grained aspects of Imagen's knowledge and comparing it with CLIP's zero-shot abilities. Imagen performs competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, it achieves state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding while CLIP cannot. Although generative pre-training is prevalent in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for vision and vision-language problems.
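The key idea in this abstract, scoring each label by how well a text-conditioned model denoises a noised image, can be sketched in a heavily simplified form. Everything below is an assumption for illustration: a single noise level, a plain squared reconstruction error, and a toy denoiser (`toy_denoiser`, built on hypothetical `PROTOTYPES`) standing in for a real diffusion model. It is not the procedure applied to Imagen in the paper.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
# (noisy image, label prompt) -> estimate of the clean image
Denoiser = Callable[[Vector, str], Vector]


def classify_by_denoising(image: Vector,
                          labels: Sequence[str],
                          denoise: Denoiser,
                          noise_std: float = 0.1,
                          num_trials: int = 16,
                          seed: int = 0) -> Tuple[str, Dict[str, float]]:
    """Pick the label whose conditioning gives the lowest mean denoising error."""
    rng = random.Random(seed)
    # Reuse the same noise draws for every label so the errors are comparable.
    noises = [[rng.gauss(0.0, noise_std) for _ in image] for _ in range(num_trials)]
    errors: Dict[str, float] = {}
    for label in labels:
        total = 0.0
        for noise in noises:
            noisy = [x + e for x, e in zip(image, noise)]
            estimate = denoise(noisy, label)
            total += sum((r - x) ** 2 for r, x in zip(estimate, image))
        errors[label] = total / num_trials
    return min(errors, key=errors.get), errors


# Hypothetical stand-in for a text-conditioned denoiser: it pulls the noisy
# input halfway toward a fixed prototype for the given label.
PROTOTYPES: Dict[str, Vector] = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}


def toy_denoiser(noisy: Vector, label: str) -> Vector:
    return [0.5 * n + 0.5 * p for n, p in zip(noisy, PROTOTYPES[label])]
```

Calling `classify_by_denoising([0.9, 0.1], ["cat", "dog"], toy_denoiser)` returns the chosen label together with the per-label errors; a lower error plays the role of a higher label likelihood.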

Zero-shot · Combinatorics · Attribute · Sample · Similarity measure

March 27, 2023

Learning Attention as Disentangler for Compositional Zero-shot Learning

Shaozhe Hao,Kai Han,Kwan-Yee K. Wong

from arxiv, CVPR 2023, available at //haoosz.github.io/ade-czsl/

Compositional zero-shot learning (CZSL) aims at learning visual concepts (i.e., attributes and objects) from seen compositions and combining concept knowledge into unseen compositions. The key to CZSL is learning the disentanglement of the attribute-object composition. To this end, we propose to exploit cross-attentions as compositional disentanglers to learn disentangled concept embeddings. For example, if we want to recognize an unseen composition "yellow flower", we can learn the attribute concept "yellow" and object concept "flower" from different yellow objects and different flowers respectively. To further constrain the disentanglers to learn the concept of interest, we employ a regularization at the attention level. Specifically, we adapt the earth mover's distance (EMD) as a feature similarity metric in the cross-attention module. Moreover, benefiting from concept disentanglement, we improve the inference process and tune the prediction score by combining multiple concept probabilities. Comprehensive experiments on three CZSL benchmark datasets demonstrate that our method significantly outperforms previous works in both closed- and open-world settings, establishing a new state-of-the-art.

Speech recognition · Synthesis · Synthetic data · Speech synthesis · Controllable

March 27, 2023

Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis

Karren Yang,Ting-Yao Hu,Jen-Hao Rick Chang,Hema Swetha Koppula,Oncel Tuzel

from arxiv, ICASSP 2023

Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data. Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis. Here, we ask two fundamental questions about this strategy: when is synthetic data effective for personalization, and why is it effective in those cases? To address the first question, we adapt a state-of-the-art automatic speech recognition (ASR) model to target speakers from four benchmark datasets representative of different speaker types. We show that ASR personalization with synthetic data is effective in all cases, but particularly when (i) the target speaker is underrepresented in the global data, and (ii) the capacity of the global model is limited. To address the second question of why personalized synthetic data is effective, we use controllable speech synthesis to generate speech with varied styles and content. Surprisingly, we find that the text content of the synthetic data, rather than style, is important for speaker adaptation. These results lead us to propose a data selection strategy for ASR personalization based on speech content.

Learned · Negative log-likelihood function · Log-likelihood function · Meta-learning · Optimizer

April 12, 2020

Pre-training Text Representations as Meta Learning

Shangwen Lv,Yuechen Wang,Daya Guo,Duyu Tang,Nan Duan,Fuqing Zhu,Ming Gong,Linjun Shou,Ryan Ma,Daxin Jiang,Guihong Cao,Ming Zhou,Songlin Hu

from arxiv, 2 figures, 3 tables

Pre-training text representations has recently been shown to significantly improve the state-of-the-art in many natural language processing tasks. The central goal of pre-training is to learn text representations that are useful for subsequent tasks. However, existing approaches are optimized by minimizing a proxy objective, such as the negative log likelihood of language modeling. In this work, we introduce a learning algorithm which directly optimizes the model's ability to learn text representations for effective learning of downstream tasks. We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps. The standard multi-task learning objective adopted in BERT is a special case of our learning algorithm where the depth of meta-train is zero. We study the problem in two settings: unsupervised pre-training and supervised pre-training with different pre-training objects to verify the generality of our approach. Experimental results show that our algorithm brings improvements and learns better initializations for a variety of downstream tasks.

Sentiment analysis · MoDELS · Recurrent neural network · entity · Neural Networks

June 8, 2018

Multilingual Sentiment Analysis: An RNN-Based Framework for Limited Data

Ethem F. Can,Aysu Ezen-Can,Fazli Can

from arxiv, ACM SIGIR 2018 Workshop on Learning from Limited or Noisy Data (LND4IR'18)

Sentiment analysis is a widely studied NLP task where the goal is to determine opinions, emotions, and evaluations of users towards a product, an entity or a service that they are reviewing. One of the biggest challenges for sentiment analysis is that it is highly language dependent. Word embeddings, sentiment lexicons, and even annotated data are language specific. Further, optimizing models for each language is very time consuming and labor intensive, especially for recurrent neural network models. From a resource perspective, it is very challenging to collect data for different languages. In this paper, we look for an answer to the following research question: can a sentiment analysis model trained on a language be reused for sentiment analysis in other languages, Russian, Spanish, Turkish, and Dutch, where the data is more limited? Our goal is to build a single model in the language with the largest dataset available for the task, and reuse it for languages that have limited resources. For this purpose, we train a sentiment analysis model using recurrent neural networks with reviews in English. We then translate reviews in other languages and reuse this model to evaluate the sentiments. Experimental results show that our robust approach of a single model trained on English reviews statistically significantly outperforms the baselines in several different languages.
