Many neural text-to-speech architectures can synthesize nearly natural speech from text inputs. These architectures must be trained with tens of hours of annotated and high-quality speech data. Compiling such large databases for every new voice requires a lot of time and effort. In this paper, we describe a method to extend the popular Tacotron-2 architecture and its training with data augmentation to enable single-speaker synthesis using a limited amount of specific training data. In contrast to elaborate augmentation methods proposed in the literature, we use simple stationary noises for data augmentation. Our extension is easy to implement and adds almost no computational overhead during training and inference. Using only two hours of training data, our approach was rated by human listeners to be on par with the baseline Tacotron-2 trained with 23.5 hours of LJSpeech data. In addition, we tested our model with a semantically unpredictable sentences test, which showed that both models exhibit similar intelligibility levels.
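The abstract states only that simple stationary noises are used for augmentation. As a minimal sketch of what that could look like, the following mixes white Gaussian noise into a waveform at a chosen signal-to-noise ratio. The function names, the choice of white Gaussian noise, the 20 dB level, and the sine tone standing in for speech are all assumptions for illustration, not details from the paper.

```python
import math
import random


def add_stationary_noise(signal, snr_db, rng):
    """Return `signal` plus white Gaussian noise scaled to the requested SNR in dB."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose the scale so that signal_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB of `noisy` relative to `clean`."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


rng = random.Random(0)  # fixed seed so the example is reproducible
# Hypothetical stand-in for a speech waveform: one second of a 220 Hz tone at 16 kHz.
clean = [math.sin(2.0 * math.pi * 220.0 * t / 16000.0) for t in range(16000)]
noisy = add_stationary_noise(clean, snr_db=20.0, rng=rng)
print(round(measured_snr_db(clean, noisy), 3))
```

Because the noise is rescaled using its empirical power, the measured SNR matches the requested value up to floating-point error.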

Related content

Speech Synthesis

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on many disciplines, including artificial intelligence, psychology, acoustics, linguistics, digital signal processing, and computer science, and is a frontier technology in information processing. As computing technology has advanced, speech synthesis has evolved from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, in car navigation systems, and in automated call centers, where it has yielded substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, applications of speech synthesis are gradually extending into entertainment, language teaching, rehabilitation therapy, and other fields. Speech synthesis can be said to be influencing every aspect of people's lives.

Networking · Neural Networks · Design · Noise · Model evaluation

August 11, 2023

Noise-Resilient Designs for Optical Neural Networks

Gianluca Kosmella,Ripalta Stabile,Jaron Sanders

from arxiv, 17 pages, 6 figures

All analog signal processing is fundamentally subject to noise, and this is also the case in modern implementations of Optical Neural Networks (ONNs). Therefore, to mitigate noise in ONNs, we propose two designs that are constructed from a given, possibly trained, Neural Network (NN) that one wishes to implement. Both designs ensure that the resulting ONNs give outputs close to those of the desired NN. To establish the latter, we analyze the designs mathematically. Specifically, we investigate a probabilistic framework for the first design that establishes that the design is correct, i.e., for any feed-forward NN with Lipschitz continuous activation functions, an ONN can be constructed that produces output arbitrarily close to the original. ONNs constructed with the first design thus also inherit the universal approximation property of NNs. For the second design, we restrict the analysis to NNs with linear activation functions and characterize the ONNs' output distribution using exact formulas. Finally, we report on numerical experiments with LeNet ONNs that give insight into the number of components required in these designs for certain accuracy gains. We specifically study the effect of noise as a function of the depth of an ONN. The results indicate that in practice, adding just a few components in the manner of the first or the second design can already be expected to increase the accuracy of ONNs considerably.
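The abstract does not spell out the two designs, so the following is not an implementation of either. It is only a generic sketch of why extra components can help: if each layer's output is the average of several independently noisy replicas, the noise accumulated over depth shrinks. The identity layers, additive Gaussian noise, and all parameter values are assumptions chosen for illustration.

```python
import random


def noisy_identity_chain(x, depth, copies, noise_std, rng):
    """Pass x through `depth` identity layers; each layer averages `copies` noisy replicas."""
    for _ in range(depth):
        replicas = [x + rng.gauss(0.0, noise_std) for _ in range(copies)]
        x = sum(replicas) / copies
    return x


def output_variance(depth, copies, noise_std, trials, rng):
    """Empirical variance of the chain output over repeated noisy runs."""
    outputs = [noisy_identity_chain(1.0, depth, copies, noise_std, rng) for _ in range(trials)]
    mean = sum(outputs) / trials
    return sum((o - mean) ** 2 for o in outputs) / trials


rng = random.Random(0)  # fixed seed so the example is reproducible
var_single = output_variance(depth=5, copies=1, noise_std=0.1, trials=2000, rng=rng)
var_averaged = output_variance(depth=5, copies=16, noise_std=0.1, trials=2000, rng=rng)
print(var_single, var_averaged)
```

In this toy setting the output variance grows linearly with depth and falls roughly in proportion to the number of averaged replicas.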

Estimation/Estimator · Reconstruction error · MoDELS · motivation · Performer

August 11, 2023

Out-of-Distribution Detection for Monocular Depth Estimation

Julia Hornauer,Adrian Holzbock,Vasileios Belagiannis

from arxiv, Accepted to ICCV 2023

In monocular depth estimation, uncertainty estimation approaches mainly target the data uncertainty introduced by image noise. In contrast to prior work, we address the uncertainty due to lack of knowledge, which is relevant for the detection of data not represented by the training distribution, the so-called out-of-distribution (OOD) data. Motivated by anomaly detection, we propose to detect OOD images from an encoder-decoder depth estimation model based on the reconstruction error. Given the features extracted with the fixed depth encoder, we train an image decoder for image reconstruction using only in-distribution data. Consequently, OOD images result in a high reconstruction error, which we use to distinguish between in- and out-of-distribution samples. We built our experiments on the standard NYU Depth V2 and KITTI benchmarks as in-distribution data. Our post hoc method performs astonishingly well on different models and outperforms existing uncertainty estimation approaches without modifying the trained encoder-decoder depth estimation model.
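The decision step described above, flagging an image as OOD when its reconstruction error is high, can be sketched as follows. The mean squared error, the quantile-based threshold, and the toy error values are assumptions for illustration. The abstract does not say which error measure or threshold the authors use.

```python
def reconstruction_error(image, reconstruction):
    """Mean squared error between a flattened image and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(image, reconstruction)) / len(image)


def fit_threshold(in_distribution_errors, quantile=0.95):
    """Pick a threshold at the given quantile of errors seen on in-distribution data."""
    ordered = sorted(in_distribution_errors)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def is_ood(error, threshold):
    """Flag a sample as out-of-distribution when its error exceeds the threshold."""
    return error > threshold


# Hypothetical errors measured on held-out in-distribution images.
in_distribution_errors = [0.01, 0.02, 0.015, 0.03, 0.012]
threshold = fit_threshold(in_distribution_errors)

# A toy two-pixel "image" that the decoder reconstructs poorly.
error = reconstruction_error([0.0, 1.0], [0.5, 0.5])
print(threshold, error, is_ood(error, threshold))
```

In practice the threshold would be tuned on validation data. The quantile rule here is just one simple option.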

Data augmentation · Examples · Classification model · Weight · Text classification

August 9, 2023

Adversarial Word Dilution as Text Data Augmentation in Low-Resource Regime

Junfan Chen,Richong Zhang,Zheyan Luo,Chunming Hu,Yongyi Mao

from arxiv, Preprint, Accepted by AAAI 2023

Data augmentation is widely used in text classification, especially in the low-resource regime where only a few examples for each class are available during training. Despite this success, generating data augmentations as hard positive examples that may increase their effectiveness is under-explored. This paper proposes an Adversarial Word Dilution (AWD) method that can generate hard positive examples as text data augmentations to train the low-resource text classification model efficiently. Our idea of augmenting the text data is to dilute the embedding of strong positive words by weighted mixing with unknown-word embedding, making the augmented inputs hard for the classification model to recognize as positive. We adversarially learn the dilution weights through a constrained min-max optimization process with the guidance of the labels. Empirical studies on three benchmark datasets show that AWD can generate more effective data augmentations and outperform the state-of-the-art text data augmentation methods. The additional analysis demonstrates that the data augmentations generated by AWD are interpretable and can flexibly extend to new examples without further training.
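The dilution step, weighted mixing of a word embedding with the unknown-word embedding, can be sketched as a convex combination. The convex form, the function name, and the toy two-dimensional vectors are assumptions. The adversarial min-max procedure that learns the weights is not shown here.

```python
def dilute(word_embedding, unk_embedding, weight):
    """Mix a word embedding with the unknown-word embedding.

    weight = 0 leaves the word unchanged; weight = 1 replaces it with the unknown word.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [(1.0 - weight) * w + weight * u for w, u in zip(word_embedding, unk_embedding)]


# Toy embeddings: a strongly positive word and the unknown-word token (hypothetical values).
positive_word = [1.0, 0.0]
unknown_word = [0.0, 1.0]
diluted = dilute(positive_word, unknown_word, weight=0.25)
print(diluted)
```

A larger weight pushes the input further toward the uninformative unknown-word embedding, which is what makes the augmented example harder to classify.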

Representation · Comprehensibility · NeRF · 3D · Decoding

August 9, 2023

One-Shot Neural Fields for 3D Object Understanding

Valts Blukis,Taeyeop Lee,Jonathan Tremblay,Bowen Wen,In So Kweon,Kuk-Jin Yoon,Dieter Fox,Stan Birchfield

from arxiv, IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW) on XRNeRF: Advances in NeRF for the Metaverse 2023

We present a unified and compact scene representation for robotics, where each object in the scene is depicted by a latent code capturing geometry and appearance. This representation can be decoded for various tasks such as novel view rendering, 3D reconstruction (e.g. recovering depth, point clouds, or voxel maps), collision checking, and stable grasp prediction. We build our representation from a single RGB input image at test time by leveraging recent advances in Neural Radiance Fields (NeRF) that learn category-level priors on large multiview datasets, then fine-tune on novel objects from one or few views. We expand the NeRF model for additional grasp outputs and explore ways to leverage this representation for robotics. At test time, we build the representation from a single RGB input image observing the scene from only one viewpoint. We find that the recovered representation allows rendering from novel views, including of occluded object parts, and also predicting successful stable grasps. Grasp poses can be directly decoded from our latent representation with an implicit grasp decoder. We experimented in both simulation and the real world and demonstrated the capability for robust robotic grasping using such a compact representation. Website: //nerfgrasp.github.io

Graphics processing unit · Graph · Better · Neural Networks · Visual question answering

March 31, 2020

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text

Difei Gao,Ke Li,Ruiping Wang,Shiguang Shan,Xilin Chen

from arxiv, Published as a CVPR2020 paper

Answering questions that require reading texts in an image is challenging for current models. One key difficulty of this task is that rare, polysemous, and ambiguous words frequently appear in images, e.g., names of places, products, and sports teams. To overcome this difficulty, resorting only to pre-trained word embedding models is far from enough. A desirable model should utilize the rich information in multiple modalities of the image to help understand the meaning of scene texts, e.g., the prominent text on a bottle is most likely to be the brand. Following this idea, we propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN). It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively. Then, we introduce three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities, so as to refine the features of nodes. The updated nodes have better features for the downstream question answering module. Experimental evaluations show that our MM-GNN represents the scene texts better and clearly improves performance on two VQA tasks that require reading scene texts.

Few-shot learning · Object detection · Networking · Dataset · Scenario

March 31, 2020

Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector

Qi Fan,Wei Zhuo,Chi-Keung Tang,Yu-Wing Tai

from arxiv, CVPR2020 Camera Ready. (Fix Figure 3 and Table 5. More implementation details in the supplementary material.)

Conventional methods for object detection typically require a substantial amount of training data, and preparing such high-quality training data is very labor-intensive. In this paper, we propose a novel few-shot object detection network that aims at detecting objects of unseen categories with only a few annotated examples. Central to our method are our Attention-RPN, Multi-Relation Detector and Contrastive Training strategy, which exploit the similarity between the few-shot support set and query set to detect novel objects while suppressing false detections in the background. To train our network, we contribute a new dataset that contains 1000 categories of various objects with high-quality annotations. To the best of our knowledge, this is one of the first datasets specifically designed for few-shot object detection. Once our few-shot network is trained, it can detect objects of unseen categories without further training or fine-tuning. Our method is general and has a wide range of potential applications. We achieve new state-of-the-art performance on different datasets in the few-shot setting. The dataset link is //github.com/fanq15/Few-Shot-Object-Detection-Dataset.

MoDELS · entity · CC · Performer · Learned

March 12, 2020

Learning Conceptual-Contextual Embeddings for Medical Text

Xiao Zhang,Dejing Dou,Ji Wu

External knowledge is often useful for natural language understanding tasks. We introduce a contextual text representation model called Conceptual-Contextual (CC) embeddings, which incorporates structured knowledge into text representations. Unlike entity embedding methods, our approach encodes a knowledge graph into a context model. CC embeddings can be easily reused for a wide range of tasks just like pre-trained language models. Our model effectively encodes the huge UMLS database by leveraging semantic generalizability. Experiments on electronic health records (EHRs) and medical text processing benchmarks showed our model gives a major boost to the performance of supervised medical NLP tasks.

MoDELS · entity · CC · Performer · Learned

August 16, 2019

Learning Conceptual-Contexual Embeddings for Medical Text

Xiao Zhang,Dejing Dou,Ji Wu

Attribute space · Diversity · Pair · MoDELS · Training data

August 2, 2018

Diverse Image-to-Image Translation via Disentangled Representations

Hsin-Ying Lee,Hung-Yu Tseng,Jia-Bin Huang,Maneesh Kumar Singh,Ming-Hsuan Yang

from arxiv, ECCV 2018 (Oral). Project page: //vllab.ucmerced.edu/hylee/DRIT/ Code: //github.com/HsinYingLee/DRIT/

Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for many applications: 1) the lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for producing diverse outputs without paired training images. To achieve diversity, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time. To handle unpaired training data, we introduce a novel cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative comparisons, we measure realism with a user study and diversity with a perceptual distance metric. We apply the proposed model to domain adaptation and show competitive performance when compared to the state-of-the-art on the MNIST-M and the LineMod datasets.

Visual question answering · Top-down · Image captioning · Attention mechanism · Bottom-up

March 14, 2018

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Peter Anderson,Xiaodong He,Chris Buehler,Damien Teney,Mark Johnson,Stephen Gould,Lei Zhang

from arxiv, CVPR 2018 full oral, winner of the 2017 Visual Question Answering challenge

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
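The division of labor described above, where a bottom-up stage supplies region feature vectors and a top-down stage assigns them weights, can be sketched as softmax-weighted pooling. The dot-product scoring, the function names, and the toy features are assumptions for illustration. The abstract does not specify the scoring function, and no Faster R-CNN is involved here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, context):
    """Weight region features by their (assumed dot-product) relevance to a context vector."""
    scores = [sum(f * c for f, c in zip(region, context)) for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    attended = [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, attended


# Two hypothetical region features, as a bottom-up detector might propose.
regions = [[1.0, 0.0], [0.0, 1.0]]
weights, attended = attend(regions, context=[2.0, 0.0])
print(weights, attended)
```

The context vector stands in for whatever task signal drives the top-down stage, such as a question encoding or a partially generated caption.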