
Recent advances in neural text-to-speech (TTS) models have brought thousands of TTS applications into daily life, where models are deployed in the cloud to serve customers. Among these models are diffusion probabilistic models (DPMs), which can be trained stably and are more parameter-efficient than other generative models. Because transmitting data between customers and the cloud introduces high latency and risks exposing private data, deploying TTS models on edge devices is preferred. Deploying DPMs on edge devices raises two practical problems. First, current DPMs are not lightweight enough for resource-constrained devices. Second, DPMs require many denoising steps at inference time, which increases latency. In this work, we present LightGrad, a lightweight DPM for TTS. LightGrad is equipped with a lightweight U-Net diffusion decoder and a training-free fast sampling technique, reducing both model parameters and inference latency. Streaming inference is also implemented in LightGrad to reduce latency further. Compared with Grad-TTS, LightGrad achieves a 62.2% reduction in parameters and a 65.7% reduction in latency, while preserving comparable speech quality on both Mandarin Chinese and English with 4 denoising steps.

Related content

Speech synthesis


Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computer technology has advanced, speech synthesis has evolved from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can now largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, car navigation systems, and automated call centers, and has produced substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, its applications are gradually extending into entertainment, language teaching, rehabilitation therapy, and other areas. Speech synthesis can be said to be affecting every aspect of people's lives.

Tolerance · Reducible · Eagle · Performer · Language modeling

October 18, 2023

TRANSOM: An Efficient Fault-Tolerant System for Training LLMs

Baodong Wu, Lei Xia, Qingping Li, Kangyu Li, Xu Chen, Yongqiang Guo, Tieyao Xiang, Yuheng Chen, Shigang Li

from arXiv, 14 pages, 9 figures

Large language models (LLMs) with hundreds of billions or trillions of parameters, represented by ChatGPT, have had a profound impact on various fields. However, training LLMs at this scale requires large high-performance GPU clusters and training periods lasting months. Because hardware and software failures are inevitable in large-scale clusters, maintaining uninterrupted, long-duration training is extremely challenging. As a result, a substantial amount of training time is devoted to saving and loading task checkpoints, rescheduling and restarting tasks, and manually checking tasks for anomalies, which greatly harms overall training efficiency. To address these issues, we propose TRANSOM, a novel fault-tolerant LLM training system. In this work, we design three key subsystems: an automatic fault tolerance and recovery mechanism for the training pipeline, named Transom Operator and Launcher (TOL); an automatic anomaly detection system over multi-dimensional training task metrics, named Transom Eagle Eye (TEE); and an asynchronous checkpoint access technology with automatic fault tolerance and recovery, named Transom Checkpoint Engine (TCE). TOL manages the lifecycle of training tasks, while TEE is responsible for task monitoring and anomaly reporting. TEE detects training anomalies and reports them to TOL, which automatically applies its fault tolerance strategy to eliminate abnormal nodes and restart the training task. The asynchronous checkpoint saving and loading provided by TCE greatly reduces the fault tolerance overhead. The experimental results indicate that TRANSOM significantly enhances the efficiency of large-scale LLM training on clusters. Specifically, the pre-training time for GPT3-175B is reduced by 28%, while checkpoint saving and loading performance improves by a factor of 20.
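
The idea of asynchronous checkpoint saving can be illustrated with a minimal sketch. This is not the TCE implementation, which the abstract does not describe in detail; the function names `save_checkpoint_async` and `load_checkpoint`, the JSON format, and the single background thread are all assumptions made for illustration. The point is only that the training loop snapshots its state quickly and lets a background worker do the slow write.

```python
import json
import os
import tempfile
import threading


def save_checkpoint_async(state: dict, path: str) -> threading.Thread:
    """Snapshot `state` now, then write it to `path` on a background thread.

    Hypothetical helper for illustration; not part of TRANSOM.
    """
    # Serialize immediately so later updates to `state` cannot leak into the file.
    snapshot = json.dumps(state)

    def _write() -> None:
        # Write to a temporary file first, then rename it into place, so an
        # interrupted write does not leave a partial checkpoint at `path`.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            f.write(snapshot)
        os.replace(tmp_path, path)

    worker = threading.Thread(target=_write)
    worker.start()
    return worker


def load_checkpoint(path: str) -> dict:
    """Read back a checkpoint written by save_checkpoint_async."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    ckpt_path = os.path.join(tempfile.mkdtemp(), "step_100.json")
    state = {"step": 100, "weights": [0.1, 0.2]}
    writer = save_checkpoint_async(state, ckpt_path)
    state["step"] = 101  # training continues while the write is in flight
    writer.join()        # wait only when the checkpoint is actually needed
    print(load_checkpoint(ckpt_path)["step"])
```

In this sketch the training loop is blocked only for the in-memory serialization, not for the disk write. A real system would also have to handle much larger tensors and failed writes.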

Round · MoDELS · Loss function (machine learning) · HTTPS · Dataset

October 18, 2023

SegmATRon: Embodied Adaptive Semantic Segmentation for Indoor Environment

Tatiana Zemskova, Margarita Kichik, Dmitry Yudin, Aleksei Staroverov, Aleksandr Panov

from arXiv, 14 pages, 6 figures

This paper presents an adaptive transformer model named SegmATRon for embodied image semantic segmentation. Its distinctive feature is the adaptation of model weights during inference on several images using a hybrid multicomponent loss function. We studied this model on datasets collected in the photorealistic Habitat and the synthetic AI2-THOR simulators. We showed that obtaining additional images using the agent's actions in an indoor environment can improve the quality of semantic segmentation. The code for the proposed approach and the datasets are publicly available at //github.com/wingrune/SegmATRon.

Speech recognition · Performer · Continuity · MoDELS · Origin

October 18, 2023

Replay to Remember: Continual Layer-Specific Fine-tuning for German Speech Recognition

Theresa Pekarek Rosin, Stefan Wermter

from arXiv, 13 pages, 7 figures, accepted and presented at ICANN 2023

While Automatic Speech Recognition (ASR) models have shown significant advances with the introduction of unsupervised or self-supervised training techniques, these improvements are still limited to a subset of languages and speakers. Transfer learning enables the adaptation of large-scale multilingual models not only to low-resource languages but also to more specific speaker groups. However, fine-tuning on data from new domains is usually accompanied by a decrease in performance on the original domain. Therefore, in our experiments, we examine how well the performance of large-scale ASR models can be approximated for smaller domains, using our own dataset of German Senior Voice Commands (SVC-de), and how much of the general speech recognition performance can be preserved by selectively freezing parts of the model during training. To further increase the robustness of the ASR model to vocabulary and speakers outside of the fine-tuned domain, we apply Experience Replay for continual learning. By adding only a fraction of data from the original domain, we are able to reach Word Error Rates (WERs) below 5% on the new domain, while stabilizing performance for general speech recognition at acceptable WERs.
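
For readers unfamiliar with the metric, the following sketch shows the standard definition of word error rate: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; the paper's own evaluation code may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("schalte das licht an", "schalte das licht aus"))
```

A WER below 5% therefore means fewer than one word-level error per twenty reference words on average.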

MoDELS · Generative model · Prompt · state-of-the-art · Performer

October 18, 2023

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, Ying Shan

from arXiv, Technical Report, Project page: //evalcrafter.github.io/

Vision and language generative models have grown rapidly in recent years. For video generation, various open-source models and publicly available services have been released for generating videos of high visual quality. However, these methods often use only a few academic metrics, for example FVD or IS, to evaluate performance. We argue that it is hard to judge large conditional generative models with such simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities. Thus, we propose a new framework and pipeline to exhaustively evaluate the performance of the generated videos. To achieve this, we first construct a new prompt list for text-to-video generation by analyzing real-world prompt lists with the help of a large language model. Then, we evaluate state-of-the-art video generative models on our carefully designed benchmarks in terms of visual quality, content quality, motion quality, and text-caption alignment, using around 18 objective metrics. To obtain the final leaderboard of the models, we also fit a series of coefficients to align the objective metrics with users' opinions. Based on the proposed opinion alignment method, our final score shows a higher correlation than simply averaging the metrics, demonstrating the effectiveness of the proposed evaluation method.
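
One way to picture the opinion alignment step is as a regression from objective metric values to user ratings. The sketch below assumes a plain linear least-squares fit without an intercept, which the abstract does not specify; the function names `fit_coefficients` and `aligned_score` and the toy data are hypothetical.

```python
def fit_coefficients(metrics, opinions):
    """Least-squares weights w minimizing sum((metrics[i] . w - opinions[i])^2).

    `metrics` is a list of rows, one row of metric values per rated video;
    `opinions` is the matching list of user scores.
    """
    n = len(metrics[0])
    # Build the normal equations (X^T X) w = X^T y.
    a = [[sum(row[i] * row[j] for row in metrics) for j in range(n)]
         for i in range(n)]
    b = [sum(row[i] * y for row, y in zip(metrics, opinions)) for i in range(n)]

    # Solve with Gauss-Jordan elimination and partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("metrics are linearly dependent; cannot fit")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
                b[r] -= factor * b[col]
    return [b[i] / a[i][i] for i in range(n)]


def aligned_score(metric_values, coefficients):
    """Weighted sum of metric values using the fitted coefficients."""
    return sum(m * c for m, c in zip(metric_values, coefficients))


if __name__ == "__main__":
    # Toy data: two metrics per video, with user scores that happen to
    # weight the first metric more heavily than the second.
    metrics = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
    opinions = [0.7, 0.3, 1.0, 1.7]
    weights = fit_coefficients(metrics, opinions)
    print(weights, aligned_score([1.0, 2.0], weights))
```

A simple average would weight both metrics equally; the fitted coefficients instead reflect how strongly each metric tracks the ratings in the toy data.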

MoDELS · Language modeling · Learning · motivation · state-of-the-art

October 17, 2023

TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models

Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, Idan Szpektor

from arXiv, Accepted as a long paper in EMNLP 2023

Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, the data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and have limited coverage of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries using an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using our data substantially outperforms both the state-of-the-art model with similar capacity and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain shift. We also show that our method generalizes to multilingual scenarios. Lastly, we release our large-scale synthetic dataset (1.4M examples), generated using TrueTeacher, and a checkpoint trained on this data.

MoDELS · Bounding box · Object detection · Controller · Extensibility

October 17, 2023

GeoDiffusion: Text-Prompted Geometric Control for Object Detection Data Generation

Kai Chen, Enze Xie, Zhe Chen, Yibo Wang, Lanqing Hong, Zhenguo Li, Dit-Yan Yeung

from arXiv, Project Page: //kaichen1998.github.io/projects/geodiffusion/

Diffusion models have attracted significant attention due to their remarkable ability to create content and generate data for tasks like image classification. However, the use of diffusion models to generate high-quality object detection data remains an underexplored area, where not only image-level perceptual quality but also geometric conditions such as bounding boxes and camera views are essential. Previous studies have utilized either copy-paste synthesis or layout-to-image (L2I) generation with specifically designed modules to encode semantic layouts. In this paper, we propose GeoDiffusion, a simple framework that can flexibly translate various geometric conditions into text prompts and empower pre-trained text-to-image (T2I) diffusion models for high-quality detection data generation. Unlike previous L2I methods, GeoDiffusion is able to encode not only bounding boxes but also extra geometric conditions such as camera views in self-driving scenes. Extensive experiments demonstrate that GeoDiffusion outperforms previous L2I methods while training 4x faster. To the best of our knowledge, this is the first work to adopt diffusion models for layout-to-image generation with geometric conditions and to demonstrate that L2I-generated images can be beneficial for improving the performance of object detectors.
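
To make "translating geometric conditions into text prompts" concrete, here is a rough sketch that quantizes box corners into grid cells and writes them as location tokens inside a prompt. The token format `<L…>`, the prompt template, the grid size, and the function names are all assumptions for illustration. The abstract does not give GeoDiffusion's actual encoding.

```python
def box_to_tokens(box, image_size, bins):
    """Quantize (x_min, y_min, x_max, y_max) in pixels into two grid-cell tokens."""
    width, height = image_size
    x_min, y_min, x_max, y_max = box

    def quantize(value, extent):
        # Map a pixel coordinate to a bin index in [0, bins - 1].
        return min(int(value / extent * bins), bins - 1)

    # Each corner becomes the index of the grid cell it falls into,
    # counting cells row by row from the top-left of the image.
    top_left = quantize(y_min, height) * bins + quantize(x_min, width)
    bottom_right = quantize(y_max, height) * bins + quantize(x_max, width)
    return f"<L{top_left}> <L{bottom_right}>"


def layout_to_prompt(camera_view, objects, image_size, bins=16):
    """Build a text prompt from a camera view and (label, box) pairs."""
    parts = [f"An image of {camera_view} camera with"]
    for label, box in objects:
        parts.append(f"{label} {box_to_tokens(box, image_size, bins)}")
    return " ".join(parts)


if __name__ == "__main__":
    objects = [("car", (0, 0, 80, 80)), ("pedestrian", (100, 40, 120, 90))]
    print(layout_to_prompt("front", objects, image_size=(160, 160)))
```

Once the layout is plain text, an ordinary text encoder can consume it. Under this sketch's assumptions the only extension needed is adding the location tokens to its vocabulary.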

Language modeling · Question answering · MoDELS · Controller · Output

October 14, 2023

CarExpert: Leveraging Large Language Models for In-Car Conversational Question Answering

Md Rashad Al Hasan Rony, Christian Suess, Sinchana Ramakanth Bhat, Viju Sudhi, Julia Schneider, Maximilian Vogel, Roman Teucher, Ken E. Friedl, Soumya Sahoo

from arXiv, Accepted into EMNLP 2023 (industry track), corresponding author: Md Rashad Al Hasan Rony

Large language models (LLMs) have demonstrated remarkable performance by following natural language instructions without being fine-tuned on domain-specific tasks and data. However, leveraging LLMs for domain-specific question answering suffers from severe limitations. The generated answer tends to contain hallucinations due to the training data cutoff (when using off-the-shelf models), complex user utterances, and wrong retrieval (in retrieval-augmented generation). Furthermore, due to the lack of awareness about the domain and expected output, such LLMs may generate unexpected and unsafe answers that are not tailored to the target domain. In this paper, we propose CarExpert, an in-car retrieval-augmented conversational question-answering system leveraging LLMs for different tasks. Specifically, CarExpert employs LLMs to control the input, provide domain-specific documents to the extractive and generative answering components, and control the output to ensure safe and domain-specific answers. A comprehensive empirical evaluation shows that CarExpert outperforms state-of-the-art LLMs in generating natural, safe, and car-specific answers.

Performer · Reducible · Transform · Inference · Tokenizer

October 13, 2023

PIM-GPT: A Hybrid Process-in-Memory Accelerator for Autoregressive Transformers

Yuting Wu, Ziyu Wang, Wei D. Lu

Decoder-only Transformer models such as GPT have demonstrated superior performance in text generation by autoregressively predicting the next token. However, the performance of GPT is bounded by a low compute-to-memory ratio and heavy memory access. Throughput-oriented architectures such as GPUs target parallel processing rather than sequential token generation, and are not efficient for GPT acceleration, particularly for on-device inference applications. Process-in-memory (PIM) architectures can significantly reduce data movement and provide high computation parallelism, and are promising candidates for accelerating GPT inference. In this work, we propose PIM-GPT, which aims to achieve high throughput, high energy efficiency, and end-to-end acceleration of GPT inference. PIM-GPT leverages DRAM-based PIM solutions to perform multiply-accumulate (MAC) operations on the DRAM chips, greatly reducing data movement. A compact application-specific integrated circuit (ASIC) is designed and synthesized to issue instructions to the PIM chips and to support data communication along with the necessary arithmetic computations. At the software level, the mapping scheme is designed to maximize data locality and computation parallelism by partitioning a matrix among DRAM channels and banks so that all in-bank computation resources are used concurrently. We develop an event-driven, clock-cycle-accurate simulator to validate the efficacy of the proposed PIM-GPT architecture. Overall, PIM-GPT achieves 41-137×, 631-1074× speedup and 339-1085×, 890-1632× energy efficiency over GPU and CPU baselines, respectively, on 8 GPT models with up to 1.4 billion parameters.
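
The mapping idea, partitioning a matrix among channels and banks so that each bank performs multiply-accumulate on its own slice, can be sketched in software as follows. The row-wise split, the function names, and the sequential loop standing in for concurrent banks are assumptions for illustration. The abstract does not describe the actual PIM-GPT mapping scheme at this level of detail.

```python
def partition_rows(n_rows, n_channels, banks_per_channel):
    """Assign contiguous row ranges of a matrix to (channel, bank) pairs."""
    n_units = n_channels * banks_per_channel
    base, extra = divmod(n_rows, n_units)
    assignment = {}
    start = 0
    for unit in range(n_units):
        # Spread any remainder rows over the first `extra` banks.
        size = base + (1 if unit < extra else 0)
        channel, bank = divmod(unit, banks_per_channel)
        assignment[(channel, bank)] = range(start, start + size)
        start += size
    return assignment


def bank_mac(matrix, vector, rows):
    """Multiply-accumulate performed by one bank on the rows it stores."""
    return [sum(w * x for w, x in zip(matrix[r], vector)) for r in rows]


def pim_matvec(matrix, vector, n_channels, banks_per_channel):
    """Matrix-vector product assembled from per-bank partial results."""
    assignment = partition_rows(len(matrix), n_channels, banks_per_channel)
    result = []
    # In hardware the banks would work concurrently; this loop visits them
    # one by one in (channel, bank) order, which matches row order.
    for unit in sorted(assignment):
        result.extend(bank_mac(matrix, vector, assignment[unit]))
    return result


if __name__ == "__main__":
    weights = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
    print(pim_matvec(weights, [1, 1], n_channels=2, banks_per_channel=2))
```

Because each bank only reads rows it already holds, the input vector is the only data that has to be broadcast, which is the data-locality property the abstract emphasizes.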

INTERACT · 3D · Estimation/estimator · Scenario · MoDELS

October 13, 2023

Decaf: Monocular Deformation Capture for Face and Hand Interactions

Soshi Shimada, Vladislav Golyanik, Patrick Pérez, Christian Theobalt

Existing methods for 3D tracking from monocular RGB videos predominantly consider articulated and rigid objects. Modelling dense non-rigid object deformations in this setting has remained largely unaddressed so far, although such effects can improve the realism of downstream applications such as AR/VR and avatar communications. This is due to the severe ill-posedness of the monocular view setting and the associated challenges. While it is possible to naively track multiple non-rigid objects independently using 3D templates or parametric 3D models, such an approach would suffer from multiple artefacts in the resulting 3D estimates, such as depth ambiguity, unnatural intra-object collisions, and missing or implausible deformations. Hence, this paper introduces the first method that addresses the fundamental challenges described above and that allows tracking human hands interacting with human faces in 3D from single monocular RGB videos. We model hands as articulated objects inducing non-rigid face deformations during an active interaction. Our method relies on a new hand-face motion and interaction capture dataset with realistic face deformations acquired with a markerless multi-view camera system. As a pivotal step in its creation, we process the reconstructed raw 3D shapes with position-based dynamics and an approach for non-uniform stiffness estimation of the head tissues, which results in plausible annotations of the surface deformations, hand-face contact regions, and head-hand positions. At the core of our neural approach are a variational auto-encoder supplying the hand-face depth prior and modules that guide the 3D tracking by estimating the contacts and the deformations. Our final 3D hand and face reconstructions are realistic and more plausible than those of several baselines applicable in our setting, both quantitatively and qualitatively. //vcai.mpi-inf.mpg.de/projects/Decaf

state-of-the-art · MoDELS · Performer · BERT · Question answering

October 13, 2023

DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew

Shaltiel Shmidman, Avi Shmidman, Moshe Koppel

from arXiv, Updated second version, with links to two question-answering models

We present DictaBERT, a new state-of-the-art pre-trained BERT model for modern Hebrew, outperforming existing models on most benchmarks. Additionally, we release three fine-tuned versions of the model, designed to perform three specific foundational tasks in the analysis of Hebrew texts: prefix segmentation, morphological tagging, and question answering. These fine-tuned models allow any developer to perform prefix segmentation, morphological tagging, and question answering of a Hebrew input with a single call to a HuggingFace model, without the need to integrate any additional libraries or code. In this paper we describe the details of the training as well as the results on the different benchmarks. We release the models to the community, along with sample code demonstrating their use, as part of our goal to help further research and development in Hebrew NLP.