亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

Text-to-speech (TTS) and singing voice synthesis (SVS) aim at generating high-quality speaking and singing voice according to textual input and music scores, respectively. Unifying TTS and SVS into a single system is crucial to the applications requiring both of them. Existing methods usually suffer from some limitations, which rely on either both singing and speaking data from the same person or cascaded models of multiple tasks. To address these problems, a simplified elegant framework for TTS and SVS, named UniSyn, is proposed in this paper. It is an end-to-end unified model that can make a voice speak and sing with only singing or speaking data from this person. To be specific, a multi-conditional variational autoencoder (MC-VAE), which constructs two independent latent sub-spaces with the speaker- and style-related (i.e. speak or sing) conditions for flexible control, is proposed in UniSyn. Moreover, supervised guided-VAE and timbre perturbation with the Wasserstein distance constraint are leveraged to further disentangle the speaker timbre and style. Experiments conducted on two speakers and two singers demonstrate that UniSyn can generate natural speaking and singing voice without corresponding training data. The proposed approach outperforms the state-of-the-art end-to-end voice generation work, which proves the effectiveness and advantages of UniSyn.

相關內容

語音合成(Speech Synthesis),也稱為文語轉換(Text-to-Speech, TTS,它是將任意的輸入文本轉換成自然流暢的語音輸出。語音合成涉及到人工智能、心理學、聲學、語言學、數字信號處理、計算機科學等多個學科技術,是信息處理領域中的一項前沿技術。 隨著計算機技術的不斷提高,語音合成技術從早期的共振峰合成,逐步發展為波形拼接合成和統計參數語音合成,再發展到混合語音合成;合成語音的質量、自然度已經得到明顯提高,基本能滿足一些特定場合的應用需求。目前,語音合成技術在銀行、醫院等的信息播報系統、汽車導航系統、自動應答呼叫中心等都有廣泛應用,取得了巨大的經濟效益。 另外,隨著智能手機、MP3、PDA 等與我們生活密切相關的媒介的大量涌現,語音合成的應用也在逐漸向娛樂、語音教學、康復治療等領域深入。可以說語音合成正在影響著人們生活的方方面面。

We introduce SPEAR-TTS, a multi-speaker text-to-speech (TTS) system that can be trained with minimal supervision. By combining two types of discrete speech representations, we cast TTS as a composition of two sequence-to-sequence tasks: from text to high-level semantic tokens (akin to "reading") and from semantic tokens to low-level acoustic tokens ("speaking"). Decoupling these two tasks enables training of the "speaking" module using abundant audio-only data, and unlocks the highly efficient combination of pretraining and backtranslation to reduce the need for parallel data when training the "reading" component. To control the speaker identity, we adopt example prompting, which allows SPEAR-TTS to generalize to unseen speakers using only a short sample of 3 seconds, without any explicit speaker representation or speaker-id labels. Our experiments demonstrate that SPEAR-TTS achieves a character error rate that is competitive with state-of-the-art methods using only 15 minutes of parallel data, while matching ground-truth speech in terms of naturalness and acoustic quality, as measured in subjective tests.

Self-supervised learning is a popular and powerful method for utilizing large amounts of unlabeled data, for which a wide variety of training objectives have been proposed in the literature. In this study, we perform a Bayesian analysis of state-of-the-art self-supervised learning objectives and propose a unified formulation based on likelihood learning. Our analysis suggests a simple method for integrating self-supervised learning with generative models, allowing for the joint training of these two seemingly distinct approaches. We refer to this combined framework as GEDI, which stands for GEnerative and DIscriminative training. Additionally, we demonstrate an instantiation of the GEDI framework by integrating an energy-based model with a cluster-based self-supervised learning model. Through experiments on synthetic and real-world data, including SVHN, CIFAR10, and CIFAR100, we show that GEDI outperforms existing self-supervised learning strategies in terms of clustering performance by a wide margin. We also demonstrate that GEDI can be integrated into a neural-symbolic framework to address tasks in the small data regime, where it can use logical constraints to further improve clustering and classification performance.

Text-guided generative diffusion models unlock powerful image creation and editing tools. While these have been extended to video generation, current approaches that edit the content of existing footage while retaining structure require expensive re-training for every input or rely on error-prone propagation of image edits across frames. In this work, we present a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output. Conflicts between user-provided content edits and structure representations occur due to insufficient disentanglement between the two aspects. As a solution, we show that training on monocular depth estimates with varying levels of detail provides control over structure and content fidelity. Our model is trained jointly on images and videos which also exposes explicit control of temporal consistency through a novel guidance method. Our experiments demonstrate a wide variety of successes; fine-grained control over output characteristics, customization based on a few reference images, and a strong user preference towards results by our model.

Real-world non-mydriatic retinal fundus photography is prone to artifacts, imperfections and low-quality when certain ocular or systemic co-morbidities exist. Artifacts may result in inaccuracy or ambiguity in clinical diagnoses. In this paper, we proposed a simple but effective end-to-end framework for enhancing poor-quality retinal fundus images. Leveraging the optimal transport theory, we proposed an unpaired image-to-image translation scheme for transporting low-quality images to their high-quality counterparts. We theoretically proved that a Generative Adversarial Networks (GAN) model with a generator and discriminator is sufficient for this task. Furthermore, to mitigate the inconsistency of information between the low-quality images and their enhancements, an information consistency mechanism was proposed to maximally maintain structural consistency (optical discs, blood vessels, lesions) between the source and enhanced domains. Extensive experiments were conducted on the EyeQ dataset to demonstrate the superiority of our proposed method perceptually and quantitatively.

Multimodal emotion recognition (MER) is a fundamental complex research problem due to the uncertainty of human emotional expression and the heterogeneity gap between different modalities. Audio and text modalities are particularly important for a human participant in understanding emotions. Although many successful attempts have been designed multimodal representations for MER, there still exist multiple challenges to be addressed: 1) bridging the heterogeneity gap between multimodal features and model inter- and intra-modal interactions of multiple modalities; 2) effectively and efficiently modelling the contextual dynamics in the conversation sequence. In this paper, we propose Cross-Modal RoBERTa (CM-RoBERTa) model for emotion detection from spoken audio and corresponding transcripts. As the core unit of the CM-RoBERTa, parallel self- and cross- attention is designed to dynamically capture inter- and intra-modal interactions of audio and text. Specially, the mid-level fusion and residual module are employed to model long-term contextual dependencies and learn modality-specific patterns. We evaluate the approach on the MELD dataset and the experimental results show the proposed approach achieves the state-of-art performance on the dataset.

Federated learning enables multiple parties to collaboratively train a machine learning model without communicating their local data. A key challenge in federated learning is to handle the heterogeneity of local data distribution across parties. Although many studies have been proposed to address this challenge, we find that they fail to achieve high performance in image datasets with deep learning models. In this paper, we propose MOON: model-contrastive federated learning. MOON is a simple and effective federated learning framework. The key idea of MOON is to utilize the similarity between model representations to correct the local training of individual parties, i.e., conducting contrastive learning in model-level. Our extensive experiments show that MOON significantly outperforms the other state-of-the-art federated learning algorithms on various image classification tasks.

Sequential recommendation as an emerging topic has attracted increasing attention due to its important practical significance. Models based on deep learning and attention mechanism have achieved good performance in sequential recommendation. Recently, the generative models based on Variational Autoencoder (VAE) have shown the unique advantage in collaborative filtering. In particular, the sequential VAE model as a recurrent version of VAE can effectively capture temporal dependencies among items in user sequence and perform sequential recommendation. However, VAE-based models suffer from a common limitation that the representational ability of the obtained approximate posterior distribution is limited, resulting in lower quality of generated samples. This is especially true for generating sequences. To solve the above problem, in this work, we propose a novel method called Adversarial and Contrastive Variational Autoencoder (ACVAE) for sequential recommendation. Specifically, we first introduce the adversarial training for sequence generation under the Adversarial Variational Bayes (AVB) framework, which enables our model to generate high-quality latent variables. Then, we employ the contrastive loss. The latent variables will be able to learn more personalized and salient characteristics by minimizing the contrastive loss. Besides, when encoding the sequence, we apply a recurrent and convolutional structure to capture global and local relationships in the sequence. Finally, we conduct extensive experiments on four real-world datasets. The experimental results show that our proposed ACVAE model outperforms other state-of-the-art methods.

In this paper, we proposed to apply meta learning approach for low-resource automatic speech recognition (ASR). We formulated ASR for different languages as different tasks, and meta-learned the initialization parameters from many pretraining languages to achieve fast adaptation on unseen target language, via recently proposed model-agnostic meta learning algorithm (MAML). We evaluated the proposed approach using six languages as pretraining tasks and four languages as target tasks. Preliminary results showed that the proposed method, MetaASR, significantly outperforms the state-of-the-art multitask pretraining approach on all target languages with different combinations of pretraining languages. In addition, since MAML's model-agnostic property, this paper also opens new research direction of applying meta learning to more speech-related applications.

ASR (automatic speech recognition) systems like Siri, Alexa, Google Voice or Cortana has become quite popular recently. One of the key techniques enabling the practical use of such systems in people's daily life is deep learning. Though deep learning in computer vision is known to be vulnerable to adversarial perturbations, little is known whether such perturbations are still valid on the practical speech recognition. In this paper, we not only demonstrate such attacks can happen in reality, but also show that the attacks can be systematically conducted. To minimize users' attention, we choose to embed the voice commands into a song, called CommandSong. In this way, the song carrying the command can spread through radio, TV or even any media player installed in the portable devices like smartphones, potentially impacting millions of users in long distance. In particular, we overcome two major challenges: minimizing the revision of a song in the process of embedding commands, and letting the CommandSong spread through the air without losing the voice "command". Our evaluation demonstrates that we can craft random songs to "carry" any commands and the modify is extremely difficult to be noticed. Specially, the physical attack that we play the CommandSongs over the air and record them can success with 94 percentage.

Most previous event extraction studies have relied heavily on features derived from annotated event mentions, thus cannot be applied to new event types without annotation effort. In this work, we take a fresh look at event extraction and model it as a grounding problem. We design a transferable neural architecture, mapping event mentions and types jointly into a shared semantic space using structural and compositional neural networks, where the type of each event mention can be determined by the closest of all candidate types . By leveraging (1)~available manual annotations for a small set of existing event types and (2)~existing event ontologies, our framework applies to new event types without requiring additional annotation. Experiments on both existing event types (e.g., ACE, ERE) and new event types (e.g., FrameNet) demonstrate the effectiveness of our approach. \textit{Without any manual annotations} for 23 new event types, our zero-shot framework achieved performance comparable to a state-of-the-art supervised model which is trained from the annotations of 500 event mentions.

北京阿比特科技有限公司