
Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods performed expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework, to model emotion at different levels. Specifically, the proposed method is a typical attention-based sequence-to-sequence model with three proposed modules: a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, utterance-level emotion variation, and syllable-level emotion strength, respectively. In addition to modeling emotion at different levels, the proposed method also allows us to synthesize emotional speech in different ways, i.e., transferring the emotion from reference audio, predicting the emotion from input text, and controlling the emotion strength manually. Extensive experiments conducted on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference audio-based and text-based emotional speech synthesis methods on emotion transfer speech synthesis and text-based emotion prediction speech synthesis, respectively. The experiments also show that the proposed method can control emotional expression flexibly. Detailed analysis shows the effectiveness of each module and the good design of the proposed method.

Related content

Speech synthesis

Speech synthesis, also known as text-to-speech (TTS) conversion, turns arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has steadily improved, speech synthesis has progressed from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, in car navigation systems, and in automated-response call centers, where it has produced substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, applications of speech synthesis are gradually extending into entertainment, spoken-language teaching, rehabilitation therapy, and other fields. Speech synthesis can be said to be influencing every aspect of people's lives.

Speech synthesis · Unsupervised · SimPLe · Unlabeled · Similarity

April 20, 2022

Simple and Effective Unsupervised Speech Synthesis

Alexander H. Liu,Cheng-I Jeff Lai,Wei-Ning Hsu,Michael Auli,Alexei Baevski,James Glass

from arxiv, preprint, equal contribution from first two authors

We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe. The framework leverages recent work in unsupervised speech recognition as well as existing neural-based speech synthesis. Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus. Experiments demonstrate the unsupervised system can synthesize speech similar to a supervised counterpart in terms of naturalness and intelligibility measured by human evaluation.

April 19, 2022

BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis

Haiyang Liu,Zihao Zhu,Naoya Iwamoto,Yichen Peng,Zhengqing Li,You Zhou,Elif Bozkurt,Bo Zheng

Achieving realistic, vivid, and human-like synthesized conversational gestures conditioned on multi-modal data is still an unsolved problem, due to the lack of available datasets, models, and standard evaluation metrics. To address this, we build the Body-Expression-Audio-Text dataset, BEAT, which has i) 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages, and ii) 32 million frame-level emotion and semantic relevance annotations. Our statistical analysis on BEAT demonstrates the correlation of conversational gestures with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity. Qualitative and quantitative experiments demonstrate the validity of the metrics, the quality of the ground truth data, and the state-of-the-art performance of the baseline. To the best of our knowledge, BEAT is the largest motion capture dataset for investigating human gestures, which may contribute to a number of different research fields including controllable gesture synthesis, cross-modality analysis, and emotional gesture recognition. The data, code, and model will be released for research.

Learning · Layer · Discriminator · Networking · Performer

April 18, 2022

TSception: Capturing Temporal Dynamics and Spatial Asymmetry from EEG for Emotion Recognition

Yi Ding,Neethu Robinson,Su Zhang,Qiuhao Zeng,Cuntai Guan

from arxiv, This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. (Accepted as a regular paper in IEEE Transactions on Affective Computing)

The high temporal resolution and the asymmetric spatial activations are essential attributes of electroencephalogram (EEG) underlying emotional processes in the brain. To learn the temporal dynamics and spatial asymmetry of EEG towards accurate and generalized emotion recognition, we propose TSception, a multi-scale convolutional neural network that can classify emotions from EEG. TSception consists of dynamic temporal, asymmetric spatial, and high-level fusion layers, which learn discriminative representations in the time and channel dimensions simultaneously. The dynamic temporal layer consists of multi-scale 1D convolutional kernels whose lengths are related to the sampling rate of EEG, which learns the dynamic temporal and frequency representations of EEG. The asymmetric spatial layer takes advantage of the asymmetric EEG patterns for emotion, learning the discriminative global and hemisphere representations. The learned spatial representations are then fused by a high-level fusion layer. Using more generalized cross-validation settings, the proposed method is evaluated on two publicly available datasets, DEAP and MAHNOB-HCI. The performance of the proposed network is compared with previously reported methods such as SVM, KNN, FBFgMDM, FBTSC, Unsupervised learning, DeepConvNet, ShallowConvNet, and EEGNet. TSception achieves higher classification accuracies and F1 scores than other methods in most of the experiments. The code is available at //github.com/yi-ding-cs/TSception
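The idea of tying temporal kernel lengths to the EEG sampling rate can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the ratios `(0.5, 0.25, 0.125)`, the 128 Hz sampling rate, and the function names are assumptions made for the example, and random kernels stand in for learned weights.

```python
import numpy as np

def temporal_kernel_lengths(sampling_rate, ratios=(0.5, 0.25, 0.125)):
    """Kernel lengths in samples, one per scale (ratios are assumed values)."""
    return [max(1, int(round(r * sampling_rate))) for r in ratios]

def multi_scale_conv1d(signal, kernels):
    """Apply each kernel as a 'valid' cross-correlation, as a conv layer would."""
    # np.convolve flips its second argument, so reverse the kernel first.
    return [np.convolve(signal, k[::-1], mode="valid") for k in kernels]

fs = 128                                  # assumed sampling rate in Hz
lengths = temporal_kernel_lengths(fs)     # one length per temporal scale
rng = np.random.default_rng(0)
signal = rng.standard_normal(4 * fs)      # 4 seconds of a single synthetic channel
kernels = [rng.standard_normal(n) for n in lengths]
outputs = multi_scale_conv1d(signal, kernels)

for n, out in zip(lengths, outputs):
    print(f"kernel length {n:3d} -> output length {out.shape[0]}")
```

Longer kernels span a larger time window and so respond to slower rhythms, while shorter kernels pick up faster activity; scaling the lengths with the sampling rate keeps each window's duration in seconds fixed across recordings.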

Reducible · Decoding · Dataset · Speech translation · Machine Translation

April 15, 2022

Consecutive Decoding for Speech-to-text Translation

Qianqian Dong,Mingxuan Wang,Hao Zhou,Shuang Xu,Bo Xu,Lei Li

from arxiv, Accepted by AAAI 2021, 11 pages, 3 figures, 13 tables

Speech-to-text translation (ST), which directly translates the source language speech to the target language text, has attracted intensive attention recently. However, the combination of speech recognition and machine translation in a single model poses a heavy burden on the direct cross-modal cross-lingual mapping. To reduce the learning difficulty, we propose COnSecutive Transcription and Translation (COSTT), an integral approach for speech-to-text translation. The key idea is to generate the source transcript and the target translation text with a single decoder. This benefits model training, because additional large parallel text corpora can be fully exploited to enhance speech translation training. Our method is verified on three mainstream datasets, including the Augmented LibriSpeech English-French dataset, the IWSLT2018 English-German dataset, and the TED English-Chinese dataset. Experiments show that our proposed COSTT outperforms or is on par with the previous state-of-the-art methods on the three datasets. We have released our code at //github.com/dqqcasia/st.

Guidance · Controller · MoDELS · Continuity · Denoising

April 14, 2022

More Control for Free! Image Synthesis with Semantic Diffusion Guidance

Xihui Liu,Dong Huk Park,Samaneh Azadi,Gong Zhang,Arman Chopikyan,Yuxiao Hu,Humphrey Shi,Anna Rohrbach,Trevor Darrell

from arxiv, Project page //xh-liu.github.io/sdg/

Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from a reference image. Recently, denoising diffusion probabilistic models have been shown to generate more realistic imagery than prior methods, and have been successfully demonstrated in unconditional and class-conditional settings. We investigate fine-grained, continuous control of this model class, and introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both. Guidance is injected into a pretrained unconditional diffusion model using the gradient of image-text or image matching scores. We explore CLIP-based language guidance as well as both content and style-based image guidance in a unified framework. Our text-guided synthesis approach can be applied to datasets without associated text annotations. We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis, synthesis of images related to a style or content reference image, and examples with both textual and image guidance.

Graph · Networking · Learning · Performer · Deep learning

October 9, 2020

Temporal Graph Networks for Deep Learning on Dynamic Graphs

Emanuele Rossi,Ben Chamberlain,Fabrizio Frasca,Davide Eynard,Federico Monti,Michael Bronstein

Graph Neural Networks (GNNs) have recently become increasingly popular due to their ability to learn complex systems of relations or interactions arising in a broad spectrum of problems ranging from biology and particle physics to social networks and recommendation systems. Despite the plethora of different models for deep learning on graphs, few approaches have been proposed thus far for dealing with graphs that present some sort of dynamic nature (e.g. evolving features or connectivity over time). In this paper, we present Temporal Graph Networks (TGNs), a generic, efficient framework for deep learning on dynamic graphs represented as sequences of timed events. Thanks to a novel combination of memory modules and graph-based operators, TGNs are able to significantly outperform previous approaches while being more computationally efficient. We furthermore show that several previous models for learning on dynamic graphs can be cast as specific instances of our framework. We perform a detailed ablation study of different components of our framework and devise the best configuration that achieves state-of-the-art performance on several transductive and inductive prediction tasks for dynamic graphs.

Loss function (machine learning) · Learning to learn · Learning · entity · Functional

September 9, 2019

Learning to Learn and Predict: A Meta-Learning Approach for Multi-Label Classification

Jiawei Wu,Wenhan Xiong,William Yang Wang

from arxiv, 11pages, 5 figures, accepted to EMNLP 2019

Many tasks in natural language processing can be viewed as multi-label classification problems. However, most of the existing models are trained with the standard cross-entropy loss function and use a fixed prediction policy (e.g., a threshold of 0.5) for all the labels, which completely ignores the complexity and dependencies among different labels. In this paper, we propose a meta-learning method to capture these complex label dependencies. More specifically, our method utilizes a meta-learner to jointly learn the training policies and prediction policies for different labels. The training policies are then used to train the classifier with the cross-entropy loss function, and the prediction policies are further implemented for prediction. Experimental results on fine-grained entity typing and text classification demonstrate that our proposed method can obtain more accurate multi-label classification results.
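The difference between a fixed prediction policy and a label-specific one can be shown with a small example. This is only an illustrative sketch: the probabilities and per-label thresholds below are made-up values, and the paper's meta-learner, which would learn such policies, is not implemented here.

```python
import numpy as np

def predict_labels(probs, thresholds):
    """Return a 0/1 matrix: label j is on when probs[:, j] >= thresholds[j]."""
    probs = np.asarray(probs, dtype=float)
    # A scalar threshold is broadcast to one value per label.
    thresholds = np.broadcast_to(np.asarray(thresholds, dtype=float), probs.shape[-1:])
    return (probs >= thresholds).astype(int)

# Hypothetical classifier outputs: 2 examples, 3 labels.
probs = np.array([[0.62, 0.41, 0.08],
                  [0.30, 0.55, 0.47]])

fixed = predict_labels(probs, 0.5)                  # same cutoff for every label
per_label = predict_labels(probs, [0.5, 0.4, 0.45]) # hypothetical per-label cutoffs

print(fixed)
print(per_label)
```

With the shared 0.5 cutoff, the borderline scores 0.41 and 0.47 are rejected; with label-specific cutoffs they are accepted, which is the kind of per-label behavior a fixed policy cannot express.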

Graph convolutional neural network / Graph convolutional network · Sentiment classification · Graph convolution · INFORMS · Convolution

September 8, 2019

Aspect-based Sentiment Classification with Aspect-specific Graph Convolutional Networks

Chen Zhang,Qiuchi Li,Dawei Song

from arxiv, 11 pages, 4 figures, accepted to EMNLP 2019

Due to their inherent capability in semantic alignment of aspects and their context words, attention mechanisms and Convolutional Neural Networks (CNNs) are widely applied for aspect-based sentiment classification. However, these models lack a mechanism to account for relevant syntactical constraints and long-range word dependencies, and hence may mistakenly recognize syntactically irrelevant contextual words as clues for judging aspect sentiment. To tackle this problem, we propose to build a Graph Convolutional Network (GCN) over the dependency tree of a sentence to exploit syntactical information and word dependencies. Based on it, a novel aspect-specific sentiment classification framework is proposed. Experiments on three benchmarking collections illustrate that our proposed model has comparable effectiveness to a range of state-of-the-art models, and further demonstrate that both syntactical information and long-range word dependencies are properly captured by the graph convolution structure.
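A graph convolution over a dependency tree can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's model: the head indices are a hand-written parse of a four-token sentence, the edges are treated as undirected with self-loops, degree normalization is one common choice, and the features and weights are random.

```python
import numpy as np

def dependency_adjacency(heads):
    """Build an undirected adjacency matrix with self-loops.

    heads[i] is the index of token i's syntactic head, or -1 for the root.
    """
    n = len(heads)
    adj = np.eye(n)
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child, head] = adj[head, child] = 1.0
    return adj

def gcn_layer(adj, features, weight):
    """One layer: average each token's neighborhood, project, apply ReLU."""
    degree = adj.sum(axis=1, keepdims=True)
    return np.maximum(0.0, (adj @ features @ weight) / degree)

# Hypothetical parse of "The food was great": The->food, food->great, was->great.
heads = [1, 3, 3, -1]
adj = dependency_adjacency(heads)

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 16))   # stand-in word representations
weight = rng.standard_normal((16, 8))     # stand-in layer weights
hidden = gcn_layer(adj, features, weight)
print(hidden.shape)
```

After one layer each token mixes information only with its syntactic neighbors, so "food" is updated from "The" and "great" but not directly from "was"; stacking layers lets information travel along longer dependency paths.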

INFORMS · Image segmentation · Networking · state-of-the-art · Integration

April 9, 2019

Cross-Modal Self-Attention Network for Referring Image Segmentation

Linwei Ye,Mrigank Rochan,Zhi Liu,Yang Wang

from arxiv, Accepted to CVPR2019

We consider the problem of referring image segmentation. Given an input image and a natural language expression, the goal is to segment the object referred by the language expression in the image. Existing works in this area treat the language expression and the input image separately in their representations. They do not sufficiently capture long-range correlations between these two modalities. In this paper, we propose a cross-modal self-attention (CMSA) module that effectively captures the long-range dependencies between linguistic and visual features. Our model can adaptively focus on informative words in the referring expression and important regions in the input image. In addition, we propose a gated multi-level fusion module to selectively integrate self-attentive cross-modal features corresponding to different levels in the image. This module controls the information flow of features at different levels. We validate the proposed approach on four evaluation datasets. Our proposed approach consistently outperforms existing state-of-the-art methods.

state-of-the-art · Learning · Extensibility · INTERACT · Networking

January 28, 2018

Multi-Pointer Co-Attention Networks for Recommendation

Yi Tay,Luu Anh Tuan,Siu Cheung Hui

Many recent state-of-the-art recommender systems such as D-ATT, TransNet and DeepCoNN exploit reviews for representation learning. This paper proposes a new neural architecture for recommendation with reviews. Our model operates on a multi-hierarchical paradigm and is based on the intuition that not all reviews are created equal, i.e., only a select few are important. The importance, however, should be dynamically inferred depending on the current target. To this end, we propose a review-by-review pointer-based learning scheme that extracts important reviews, subsequently matching them in a word-by-word fashion. This enables not only the most informative reviews to be utilized for prediction but also a deeper word-level interaction. Our pointer-based method operates with a novel gumbel-softmax based pointer mechanism that enables the incorporation of discrete vectors within differentiable neural architectures. Our pointer mechanism is co-attentive in nature, learning pointers which are co-dependent on user-item relationships. Finally, we propose a multi-pointer learning scheme that learns to combine multiple views of interactions between user and item. Overall, we demonstrate the effectiveness of our proposed model via extensive experiments on 24 benchmark datasets from Amazon and Yelp. Empirical results show that our approach significantly outperforms existing state-of-the-art, with up to 19% and 71% relative improvement when compared to TransNet and DeepCoNN respectively. We study the behavior of our multi-pointer learning mechanism, shedding light on evidence aggregation patterns in review-based recommender systems.
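The Gumbel-softmax trick that the pointer mechanism builds on can be sketched generically. The NumPy function below is an assumption-laden illustration, not the paper's pointer: the function name, temperature, and scores are made up, and because NumPy has no automatic differentiation, the straight-through gradient used in practice is not shown.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, hard=False, rng=None):
    """Draw a relaxed (or one-hot, if hard=True) sample from a categorical distribution."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    # Gumbel(0, 1) noise via the inverse CDF; the lower bound avoids log(0).
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)          # numerical stability
    soft = np.exp(y) / np.exp(y).sum(axis=-1, keepdims=True)
    if not hard:
        return soft
    one_hot = np.zeros_like(soft)
    idx = np.expand_dims(soft.argmax(axis=-1), -1)
    np.put_along_axis(one_hot, idx, 1.0, axis=-1)
    return one_hot

rng = np.random.default_rng(0)
scores = np.array([2.0, 0.5, -1.0])               # hypothetical review relevance scores
soft_sample = gumbel_softmax(scores, tau=0.5, rng=rng)
hard_sample = gumbel_softmax(scores, tau=0.5, hard=True, rng=rng)
print(soft_sample, hard_sample)
```

A lower temperature `tau` pushes the soft sample toward a one-hot vector, which is what lets a discrete selection, such as pointing at one review, sit inside an otherwise differentiable model.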