
Recent language model-based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues because errors accumulate across speech unit predictions during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage phonetically rich self-supervised representations as the training target for the autoregressive language model. A non-autoregressive model is then employed to predict the discrete acoustic codecs that contain fine-grained acoustic details. The TTS model thus focuses solely on linguistic modeling during autoregressive training, while acoustic details are recovered non-autoregressively, reducing the error propagation inherent in autoregressive acoustic prediction. Both objective and subjective evaluations validate the effectiveness of our proposed method.
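
As a rough illustration of this two-stage factorization, the sketch below pairs an autoregressive decoder that predicts phonetically rich tokens from text with a non-autoregressive encoder that maps those tokens to codec logits. All module names, layer sizes, and vocabularies (PhoneticLM, AcousticNAR, n_phon, n_codec) are our own illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Stage 1: an autoregressive LM predicts phonetically rich SSL tokens
# from text. Stage 2: a non-autoregressive model maps those tokens to
# discrete acoustic codec tokens carrying fine-grained detail.

class PhoneticLM(nn.Module):
    def __init__(self, n_text=256, n_phon=512, d=256):
        super().__init__()
        self.text_emb = nn.Embedding(n_text, d)
        self.tok_emb = nn.Embedding(n_phon, d)
        layer = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
        self.dec = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_phon)

    def forward(self, text_ids, phon_ids):
        L = phon_ids.size(1)
        # additive causal mask so each position sees only its past
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.dec(self.tok_emb(phon_ids), self.text_emb(text_ids),
                     tgt_mask=causal)
        return self.head(h)  # next phonetic-token logits

class AcousticNAR(nn.Module):
    def __init__(self, n_phon=512, n_codec=1024, d=256):
        super().__init__()
        self.emb = nn.Embedding(n_phon, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_codec)

    def forward(self, phon_ids):  # all positions predicted in parallel
        return self.head(self.enc(self.emb(phon_ids)))

text = torch.randint(0, 256, (1, 20))
phon = torch.randint(0, 512, (1, 50))
print(PhoneticLM()(text, phon).shape)   # torch.Size([1, 50, 512])
print(AcousticNAR()(phon).shape)        # torch.Size([1, 50, 1024])
```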

Related Content

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform-concatenation synthesis and statistical parametric speech synthesis, and then to hybrid approaches; the quality and naturalness of synthesized speech have improved markedly and can now largely meet the requirements of certain application scenarios. At present, speech synthesis is widely deployed in information announcement systems for banks and hospitals, in-car navigation systems, and automated answering call centers, yielding substantial economic benefits. Moreover, with the proliferation of everyday devices such as smartphones, MP3 players, and PDAs, its applications are extending into entertainment, language teaching, rehabilitation therapy, and other areas. It is fair to say that speech synthesis is influencing every aspect of people's lives.

Existing knowledge graph (KG) embedding models have primarily focused on static KGs. However, real-world KGs do not remain static; they evolve and grow in tandem with the development of KG applications. Consequently, new facts and previously unseen entities and relations continually emerge, necessitating an embedding model that can quickly learn and transfer new knowledge as the KG grows. Motivated by this, we delve into an expanding field of KG embedding: lifelong KG embedding. We consider knowledge transfer and retention when learning on growing snapshots of a KG, without relearning embeddings from scratch. The proposed model combines a masked KG autoencoder for embedding learning and update, an embedding transfer strategy that injects the learned knowledge into the embeddings of new entities and relations, and an embedding regularization method that avoids catastrophic forgetting. To investigate the impact of different aspects of KG growth, we construct four datasets to evaluate the performance of lifelong KG embedding. Experimental results show that the proposed model outperforms state-of-the-art inductive and lifelong embedding baselines.
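
The sketch below illustrates the two continual-learning ingredients named above: an embedding-drift penalty against the previous snapshot (regularization) and neighbor-based initialization of unseen entities (transfer). The quadratic penalty and the mean-of-neighbors rule are our assumptions, not the paper's exact formulation.

```python
import torch

def lifelong_loss(task_loss, emb, old_emb, old_ids, lam=0.1):
    """Task loss plus a regularizer that keeps embeddings of entities
    from earlier snapshots close to their previous values, mitigating
    catastrophic forgetting (illustrative L2 form)."""
    drift = ((emb[old_ids] - old_emb[old_ids]) ** 2).sum(dim=1).mean()
    return task_loss + lam * drift

def init_new_entity(emb, neighbor_ids):
    """Embedding transfer: initialize an unseen entity from the
    embeddings of its already-known neighbors (mean as a stand-in)."""
    return emb[neighbor_ids].mean(dim=0)

emb = torch.randn(100, 64, requires_grad=True)   # current snapshot
old_emb = emb.detach().clone()                   # frozen previous snapshot
old_ids = torch.arange(80)                       # entities seen before
loss = lifelong_loss(torch.tensor(0.5), emb, old_emb, old_ids)
new_vec = init_new_entity(emb.detach(), torch.tensor([1, 5, 9]))
```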

Vision-language representation learning largely benefits from image-text alignment through contrastive losses (e.g., the InfoNCE loss). The success of this alignment strategy is attributed to its capability to maximize the mutual information (MI) between an image and its matched text. However, simply performing cross-modal alignment (CMA) ignores the data potential within each modality, which may result in degraded representations. For instance, although CMA-based models can map image-text pairs close together in the embedding space, they fail to ensure that similar inputs from the same modality stay close by. This problem can become even worse when the pre-training data is noisy. In this paper, we propose triple contrastive learning (TCL) for vision-language pre-training, leveraging both cross-modal and intra-modal self-supervision. Besides CMA, TCL introduces an intra-modal contrastive objective to provide complementary benefits in representation learning. To take advantage of localized and structural information from image and text input, TCL further maximizes the average MI between local regions of the image/text and their global summary. To the best of our knowledge, ours is the first work that takes local structure information into account for multi-modality representation learning. Experimental evaluations show that our approach is competitive and achieves the new state of the art on various common downstream vision-language tasks such as image-text retrieval and visual question answering.
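
A minimal sketch of the loss structure, assuming standard InfoNCE for each term: one cross-modal alignment term plus one intra-modal term per modality, with augmented views as intra-modal positives. The local-region MI term is omitted, and all function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, tau=0.07):
    """InfoNCE over a batch: matched (q_i, k_i) are positives,
    all other pairs in the batch serve as negatives."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau
    labels = torch.arange(q.size(0))
    return F.cross_entropy(logits, labels)

def triple_contrastive_loss(img, txt, img_aug, txt_aug):
    """Three terms: symmetric cross-modal alignment plus an
    intra-modal objective per modality (augmented views as positives)."""
    cma = 0.5 * (info_nce(img, txt) + info_nce(txt, img))
    imc = info_nce(img, img_aug) + info_nce(txt, txt_aug)
    return cma + imc

img, txt = torch.randn(8, 256), torch.randn(8, 256)
img_aug, txt_aug = torch.randn(8, 256), torch.randn(8, 256)
print(triple_contrastive_loss(img, txt, img_aug, txt_aug))
```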

We propose GAN-Supervised Learning, a framework for learning discriminative models and their GAN-generated training data jointly, end-to-end. We apply our framework to the dense visual alignment problem. Inspired by the classic Congealing method, our GANgealing algorithm trains a Spatial Transformer to map random samples from a GAN trained on unaligned data to a common, jointly learned target mode. We show results on eight datasets, all of which demonstrate that our method successfully aligns complex data and discovers dense correspondences. GANgealing significantly outperforms past self-supervised correspondence algorithms and performs on par with (and sometimes exceeds) state-of-the-art supervised correspondence algorithms on several datasets, without making use of any correspondence supervision or data augmentation and despite being trained exclusively on GAN-generated data. For precise correspondence, we improve upon state-of-the-art supervised methods by as much as $3\times$. We show applications of our method to augmented reality, image editing, and automated pre-processing of image datasets for downstream GAN training.
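
A toy, self-contained rendering of the core training signal: a spatial transformer warps stand-in GAN samples toward a jointly learned target. In the actual method the target mode lives in the GAN's latent space and the loss is perceptual; here we use a learnable pixel-space target and an MSE loss purely for illustration.

```python
import torch
import torch.nn as nn

class ToySTN(nn.Module):
    """Predicts a per-image affine warp (stand-in for the paper's STN)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 6))

    def forward(self, x):
        theta = self.net(x).view(-1, 2, 3)           # affine params per image
        grid = nn.functional.affine_grid(theta, x.shape, align_corners=False)
        return nn.functional.grid_sample(x, grid, align_corners=False)

stn = ToySTN()
target = nn.Parameter(torch.zeros(1, 3, 64, 64))     # jointly learned target
opt = torch.optim.Adam(list(stn.parameters()) + [target], lr=1e-4)

fake = torch.randn(8, 3, 64, 64)                     # stand-in GAN samples
aligned = stn(fake)
loss = ((aligned - target) ** 2).mean()              # paper uses a perceptual loss
loss.backward()
opt.step()
```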

As an effective strategy, data augmentation (DA) alleviates data scarcity scenarios where deep learning techniques may fail. It has been widely applied in computer vision and was later introduced to natural language processing, where it achieves improvements in many tasks. One of the main focuses of DA methods is to improve the diversity of training data, thereby helping the model generalize better to unseen testing data. In this survey, we frame DA methods into three categories based on the diversity of the augmented data: paraphrasing, noising, and sampling. Our paper analyzes DA methods in detail according to these categories. Further, we also introduce their applications in NLP tasks as well as the remaining challenges.
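
As a concrete instance of the noising category, the sketch below applies label-preserving token deletion and adjacent swaps; the probabilities and the specific operations are illustrative choices, not drawn from the survey.

```python
import random

def noise_augment(tokens, p_delete=0.1, p_swap=0.1):
    """Noising-style augmentation: randomly delete tokens and swap
    adjacent ones, keeping the example's label unchanged."""
    # keep at least one token even if every deletion fires
    out = [t for t in tokens if random.random() > p_delete] or tokens[:1]
    for i in range(len(out) - 1):
        if random.random() < p_swap:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

print(noise_augment("the movie was surprisingly good".split()))
```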

Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task: for example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, in this work we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels as text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization on questions that have rare answers. In addition, we show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving performance similar to separately optimized single-task models. Our code will be publicly available at: //github.com/j-min/VL-T5
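
The text-to-text formulation can be sketched with an off-the-shelf encoder-decoder; the snippet below uses a plain T5 checkpoint via the HuggingFace transformers API and omits the visual-feature conditioning that VL-T5 adds, so the task prefix shown is illustrative.

```python
# Sketch of the unified text-generation formulation (text side only;
# VL-T5 additionally feeds region-level visual features, omitted here).
# Requires: pip install transformers sentencepiece
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task becomes "prefix + input -> label as text".
prompt = "vqa: question: what color is the bus?"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))
```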

This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.
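
The NT-Xent objective at the core of SimCLR is compact enough to state directly; the sketch below is a straightforward PyTorch rendering in which each of the 2N augmented views treats its counterpart as the positive and the remaining 2N - 2 views as negatives.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss: z1[i] and z2[i] are two augmented views of the
    same image; every other view in the batch is a negative."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)     # (2N, d)
    sim = z @ z.t() / tau
    n = z1.size(0)
    sim.fill_diagonal_(float("-inf"))               # exclude self-pairs
    # view i's positive sits at i + N (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(4, 128), torch.randn(4, 128)   # two augmented views
print(nt_xent(z1, z2))
```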

Knowledge graphs (KGs) serve as useful resources for various natural language processing applications. Previous KG completion approaches require a large number of training instances (i.e., head-tail entity pairs) for every relation. In reality, however, very few entity pairs are available for most relations. Few-shot KG completion has not yet been well studied: existing work on one-shot learning limits method generalizability to few-shot scenarios and does not fully use the supervisory information. In this work, we propose a novel few-shot relation learning model (FSRL) that aims at discovering facts of new relations with few-shot references. FSRL can effectively capture knowledge from heterogeneous graph structure, aggregate representations of few-shot references, and match similar entity pairs of the reference set for every relation. Extensive experiments on two public datasets demonstrate that FSRL outperforms the state-of-the-art.
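
The matching step can be sketched as scoring a candidate entity pair against an aggregate of the few reference pairs. FSRL learns both the pair encoder and the aggregator; the mean pooling and cosine score below are simplifying stand-ins.

```python
import torch
import torch.nn.functional as F

def match_score(query_pair, reference_pairs):
    """Aggregate the few-shot reference pair embeddings (mean pooling
    here; FSRL learns this aggregator) and score a candidate pair by
    cosine similarity to the aggregate."""
    ref = reference_pairs.mean(dim=0)               # (d,)
    return F.cosine_similarity(query_pair, ref, dim=0)

refs = torch.randn(3, 64)    # embeddings of 3 reference (head, tail) pairs
query = torch.randn(64)      # embedding of a candidate pair
print(match_score(query, refs))
```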

Few-shot learning aims to learn novel categories from very few samples, given some base categories with sufficient training samples. The main challenge is that novel categories are prone to be dominated by the color, texture, or shape of the object, or by background context (namely, specificity), which is distinctive for the given few training samples but not common to the corresponding categories. Fortunately, we find that transferring information from the correlated base categories can help learn the novel concepts and thus prevent them from being dominated by specificity. Besides, incorporating semantic correlations among different categories can effectively regularize this information transfer. In this work, we represent the semantic correlations in the form of a structured knowledge graph and integrate this graph into deep neural networks to promote few-shot learning via a novel Knowledge Graph Transfer Network (KGTN). Specifically, by initializing each node with the classifier weight of the corresponding category, a propagation mechanism is learned to adaptively propagate node messages through the graph, exploring node interactions and transferring classifier information from the base categories to the novel ones. Extensive experiments on the ImageNet dataset show significant performance improvement compared with current leading competitors. Furthermore, we construct an ImageNet-6K dataset that covers larger-scale categories, i.e., 6,000 categories, and experiments on this dataset further demonstrate the effectiveness of our proposed model.
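
The propagation mechanism can be sketched as repeated mixing of each category's classifier weight with a normalized sum over its graph neighbors; KGTN learns a gated update rather than the fixed linear rule used in this illustration.

```python
import torch

def propagate_weights(W, A, steps=2, alpha=0.5):
    """Node i holds the classifier weight of category i; each step
    mixes a node with the row-normalized sum of its neighbors, so
    base-category information flows toward novel categories."""
    A_hat = A / A.sum(dim=1, keepdim=True).clamp(min=1)
    for _ in range(steps):
        W = alpha * W + (1 - alpha) * A_hat @ W
    return W

W = torch.randn(10, 64)                    # classifier weights, 10 categories
A = (torch.rand(10, 10) > 0.7).float()     # semantic-correlation graph
print(propagate_weights(W, A).shape)       # torch.Size([10, 64])
```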

Meta-learning extracts the common knowledge acquired from learning different tasks and uses it for unseen tasks. It demonstrates a clear advantage on tasks that have insufficient training data, e.g., few-shot learning. In most meta-learning methods, tasks are implicitly related via the shared model or optimizer. In this paper, we show that a meta-learner that explicitly relates tasks on a graph describing the relations of their output dimensions (e.g., classes) can significantly improve the performance of few-shot learning. This type of graph is usually free or cheap to obtain but has rarely been explored in previous works. We study prototype-based few-shot classification, in which a prototype is generated for each class such that the nearest-neighbor search among the prototypes produces an accurate classification. We introduce the "Gated Propagation Network (GPN)", which learns to propagate messages between prototypes of different classes on the graph, so that learning the prototype of each class benefits from the data of other related classes. In GPN, an attention mechanism is used to aggregate messages from neighboring classes, and a gate is deployed to choose between the aggregated message and the message from the class itself. GPN is trained on a sequence of tasks, from many-shot to few-shot, generated by subgraph sampling. During training, it is able to reuse and update previously computed prototypes from memory in a lifelong learning cycle. In experiments, we vary the training-test discrepancy and the test-task generation settings for thorough evaluation. GPN outperforms recent meta-learning methods on two benchmark datasets in all studied cases.
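
One propagation step can be sketched as attention-weighted aggregation over graph neighbors followed by a gated mix with the class's own prototype. The dot-product attention and the sigmoid gate below are simplified stand-ins for GPN's learned modules.

```python
import torch
import torch.nn.functional as F

def gpn_step(P, A, tau=1.0):
    """One gated propagation step over class prototypes P (sketch).
    Attention comes from prototype similarity restricted to graph
    neighbors; a scalar gate mixes the aggregated neighbor message
    with each class's own prototype."""
    sim = P @ P.t() / tau
    sim = sim.masked_fill(A == 0, float("-inf"))    # neighbors only
    attn = F.softmax(sim, dim=1)
    attn = torch.nan_to_num(attn)                   # isolated nodes -> 0
    msg = attn @ P
    gate = torch.sigmoid((P * msg).sum(dim=1, keepdim=True))
    return gate * P + (1 - gate) * msg

P = torch.randn(6, 32)                    # one prototype per class
A = (torch.rand(6, 6) > 0.5).float()      # class-relation graph
print(gpn_step(P, A).shape)               # torch.Size([6, 32])
```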

This paper presents a new multi-objective deep reinforcement learning (MODRL) framework based on deep Q-networks. We propose the use of linear and non-linear methods to develop the MODRL framework, which includes both single-policy and multi-policy strategies. Experimental results on two benchmark problems, the two-objective deep sea treasure environment and the three-objective mountain car problem, indicate that the proposed framework is able to converge to optimal Pareto solutions effectively. The framework is generic, allowing different deep reinforcement learning algorithms to be implemented in different complex environments, and thereby overcomes many of the difficulties of standard multi-objective reinforcement learning (MORL) methods in the current literature. The framework serves as a testbed platform for developing methods to solve various problems associated with current MORL. Details of the framework implementation are available at //www.deakin.edu.au/~thanhthi/drl.htm.
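
For the single-policy, linear-scalarization case, the Q-learning target can be sketched as follows: vector-valued Q-estimates are collapsed with a preference weight vector to pick the greedy action, and the per-objective target is then formed from that action's Q-vector. Shapes and names are illustrative assumptions, not the framework's code.

```python
import torch

def scalarized_q_target(q_next, rewards, weights, gamma=0.99):
    """Linear-scalarization sketch for multi-objective DQN.
    q_next: (batch, n_actions, n_objectives) next-state Q-vectors.
    rewards: (batch, n_objectives) per-objective rewards."""
    scalar_q = (q_next * weights).sum(dim=-1)          # (batch, n_actions)
    best = scalar_q.argmax(dim=1)                      # greedy actions
    q_best = q_next[torch.arange(q_next.size(0)), best]  # (batch, n_obj)
    return rewards + gamma * q_best                    # vector-valued target

q_next = torch.randn(4, 3, 2)       # 4 states, 3 actions, 2 objectives
rewards = torch.randn(4, 2)
w = torch.tensor([0.7, 0.3])        # preference over objectives
print(scalarized_q_target(q_next, rewards, w).shape)  # torch.Size([4, 2])
```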
