Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and show that they are effective in improving naturalness, the similarity of speech characteristics in a multi-speaker model, and the efficiency of training and inference. Furthermore, we demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method, enabling a fully end-to-end single-stage approach.
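The abstract does not spell out the architecture, but single-stage models in the VITS lineage are typically trained as conditional VAEs with adversarial and duration-prediction losses. The sketch below only illustrates how such a combined objective might be assembled; all module names, shapes, and loss weights are assumptions for illustration, not the paper's actual objective.

```python
import torch
from torch.distributions import Normal, kl_divergence
import torch.nn.functional as F

# Hypothetical combination of VITS-style training terms; names and weights
# are illustrative assumptions, not VITS2's actual objective.
def single_stage_tts_loss(mel_hat, mel, z_posterior, z_prior, disc_fake,
                          dur_hat, dur, w_mel=45.0, w_kl=1.0, w_adv=1.0, w_dur=1.0):
    l_mel = F.l1_loss(mel_hat, mel)                    # spectrogram reconstruction
    l_kl = kl_divergence(z_posterior, z_prior).mean()  # latent prior matching
    l_adv = ((disc_fake - 1.0) ** 2).mean()            # LSGAN-style generator term
    l_dur = F.mse_loss(dur_hat, dur)                   # duration prediction
    return w_mel * l_mel + w_kl * l_kl + w_adv * l_adv + w_dur * l_dur

# Toy call with random tensors, just to show the shapes involved.
loss = single_stage_tts_loss(
    torch.randn(2, 80, 100), torch.randn(2, 80, 100),
    Normal(torch.zeros(2, 192), torch.ones(2, 192)),
    Normal(torch.zeros(2, 192), torch.ones(2, 192)),
    torch.randn(2, 1), torch.randn(2, 50), torch.randn(2, 50),
)
```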

Related content

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has evolved from early formant synthesis to concatenative (waveform-splicing) synthesis and statistical parametric synthesis, and then to hybrid synthesis; the quality and naturalness of synthesized speech have improved markedly and can now largely meet the needs of specific applications. Today, speech synthesis is widely used in information announcement systems for banks and hospitals, in car navigation systems, and in automated call centers, yielding substantial economic benefits. Moreover, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, speech synthesis applications are gradually expanding into entertainment, language teaching, rehabilitation therapy, and other areas. It is fair to say that speech synthesis is touching every aspect of people's lives.

Recent CNN- and Transformer-based models have tried to exploit frequency and periodicity information for long-term time series forecasting. However, most existing work is based on the Fourier transform, which cannot capture fine-grained and local frequency structure. In this paper, we propose a Wavelet-Fourier Transform Network (WFTNet) for long-term time series forecasting. WFTNet uses both Fourier and wavelet transforms to extract comprehensive temporal-frequency information from the signal: the Fourier transform captures global periodic patterns, while the wavelet transform captures local ones. Furthermore, we introduce a Periodicity-Weighted Coefficient (PWC) to adaptively balance the importance of global and local frequency patterns. Extensive experiments on various time series datasets show that WFTNet consistently outperforms other state-of-the-art baselines.
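As a rough illustration of the idea (not the paper's implementation), the sketch below extracts a global spectrum with the FFT and local time-frequency structure with a continuous wavelet transform, then computes a scalar weight standing in for the Periodicity-Weighted Coefficient. The energy-ratio heuristic for the PWC is an assumption; the paper's actual definition is not shown here.

```python
import numpy as np
import pywt  # PyWavelets

def global_local_features(x, scales=np.arange(1, 33), wavelet="morl"):
    """Toy global/local frequency features for a 1-D series x."""
    spec = np.abs(np.fft.rfft(x))             # global periodic structure
    coefs, _ = pywt.cwt(x, scales, wavelet)   # local time-frequency structure
    # Assumed stand-in for the Periodicity-Weighted Coefficient: how much of
    # the spectral energy is concentrated in the strongest few peaks.
    top = np.sort(spec)[::-1][:4]
    pwc = top.sum() / (spec.sum() + 1e-8)
    return pwc, spec, np.abs(coefs)

x = np.sin(2 * np.pi * np.arange(512) / 24) + 0.1 * np.random.randn(512)
pwc, g, l = global_local_features(x)
print(f"PWC={pwc:.2f}, global shape={g.shape}, local shape={l.shape}")
```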

Explainable recommender systems (RS) have traditionally followed a one-size-fits-all approach, delivering explanations at the same level of detail to every user regardless of their individual needs and goals. Further, explanations in RS have so far been presented mostly in a static, non-interactive manner. To fill these research gaps, we adopt a user-centered, interactive explanation model that provides explanations at different levels of detail and empowers users to interact with, control, and personalize the explanations based on their needs and preferences. We followed a user-centered approach to design interactive explanations with three levels of detail (basic, intermediate, and advanced) and implemented them in the transparent Recommendation and Interest Modeling Application (RIMA). We conducted a qualitative user study (N=14) to investigate the impact of interactive explanations with varying levels of detail on users' perception of the explainable RS. The study provides qualitative evidence that fostering interaction and giving users control over which explanation they see can meet the demands of users with different needs, preferences, and goals, and consequently can improve crucial aspects of explainable recommendation, including transparency, trust, satisfaction, and user experience.
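A minimal sketch of what a three-level, user-controllable explanation payload could look like; every field name and value below is hypothetical and is not RIMA's actual schema.

```python
# Hypothetical structure for one recommendation explanation with three
# user-selectable levels of detail (not RIMA's actual data model).
explanation = {
    "item": "paper-4711",
    "levels": {
        "basic": "Recommended because it matches your interest in 'recommender systems'.",
        "intermediate": {
            "matched_interests": ["recommender systems", "explainability"],
            "similarity": 0.82,
        },
        "advanced": {
            "interest_model": {"recommender systems": 0.61, "explainability": 0.21},
            "scoring": "cosine(user_interest_vector, item_keyword_vector)",
        },
    },
}

def render(expl, level="basic"):
    """The user interactively picks the level; the UI renders only that view."""
    return expl["levels"][level]
```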

We introduce RotateIt, a system that enables fingertip-based object rotation along multiple axes by leveraging multimodal sensory inputs. Our system is trained in simulation, where it has access to ground-truth object shapes and physical properties. Then we distill it to operate on realistic yet noisy simulated visuotactile and proprioceptive sensory inputs. These multimodal inputs are fused via a visuotactile transformer, enabling online inference of object shapes and physical properties during deployment. We show significant performance improvements over prior methods and the importance of visual and tactile sensing.
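A bare-bones sketch of the kind of fusion described, with one token per modality fed through a transformer encoder; all dimensions and the module layout are assumptions for illustration, not the system's actual architecture.

```python
import torch
import torch.nn as nn

class VisuotactileFusion(nn.Module):
    """Toy multimodal fusion: one token per modality, fused by a transformer."""
    def __init__(self, d=128):
        super().__init__()
        self.vision = nn.Linear(512, d)   # e.g. pooled visual features
        self.touch = nn.Linear(32, d)     # e.g. fingertip tactile readings
        self.proprio = nn.Linear(24, d)   # e.g. joint positions/velocities
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, 16)      # e.g. action or shape/property estimate

    def forward(self, v, t, p):
        tokens = torch.stack([self.vision(v), self.touch(t), self.proprio(p)], dim=1)
        return self.head(self.fuse(tokens).mean(dim=1))

out = VisuotactileFusion()(torch.randn(2, 512), torch.randn(2, 32), torch.randn(2, 24))
```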

Diffusion models have shown promising results in speech enhancement, using a task-adapted diffusion process for the conditional generation of clean speech given a noisy mixture. However, at test time, the neural network used for score estimation is called multiple times to solve the iterative reverse process. This makes inference slow and causes discretization errors that accumulate over the sampling trajectory. In this paper, we address these limitations with a two-stage training approach. In the first stage, we train the diffusion model in the usual way with the generative denoising score matching loss. In the second stage, we compute the enhanced signal by solving the reverse process and compare the resulting estimate to the clean speech target with a predictive loss. We show that this second training stage lets the model match the baseline's performance with only 5 function evaluations (NFEs) instead of 60. While the performance of standard generative diffusion algorithms drops dramatically as the NFE count is lowered toward single-step diffusion, our method maintains steady performance, largely outperforming the diffusion baseline in this setting and generalizing better than its predictive counterpart.
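To make the two stages concrete, here is a heavily simplified sketch on toy tensors: stage one trains a score model with denoising score matching, and stage two unrolls a short reverse trajectory and applies a predictive loss against the clean target. The noise schedule, network, and update rule are placeholders, not the paper's.

```python
import torch
import torch.nn as nn

# Toy score network over 64-dim "speech" vectors plus one noise-level input.
score = nn.Sequential(nn.Linear(65, 128), nn.SiLU(), nn.Linear(128, 64))
opt = torch.optim.Adam(score.parameters(), lr=1e-3)

def eps_hat(x_t, sigma):
    """Predict the noise contained in x_t at noise level sigma."""
    return score(torch.cat([x_t, sigma.expand(x_t.shape[0], 1)], dim=1))

clean = torch.randn(8, 64)  # stand-in for clean speech features

# Stage 1: generative denoising score matching (predict the added noise).
for _ in range(100):
    sigma = torch.rand(1) * 0.9 + 0.1
    eps = torch.randn_like(clean)
    loss = ((eps_hat(clean + sigma * eps, sigma) - eps) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: unroll a short reverse trajectory (only 5 NFEs) and apply a
# predictive loss against the clean target, backpropagating through all steps.
sigmas = torch.linspace(1.0, 0.1, 6)
x = clean + torch.randn_like(clean)  # crude stand-in for a noisy mixture
for s, s_next in zip(sigmas[:-1], sigmas[1:]):
    x = x - (s - s_next) * eps_hat(x, s.view(1))  # simple Euler-style update
pred_loss = (x - clean).abs().mean()
opt.zero_grad(); pred_loss.backward(); opt.step()
```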

Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks, and the lack of standardized benchmarks hinders fair comparison across approaches. We therefore present Dynamic-SUPERB, a benchmark for building universal speech models that leverage instruction tuning to perform multiple tasks in a zero-shot fashion. To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark. To begin with, Dynamic-SUPERB features 55 evaluation instances formed by combining 33 tasks and 22 datasets, spanning a broad spectrum of dimensions and providing a comprehensive evaluation platform. We also propose several approaches to establish benchmark baselines, including speech models, text language models, and a multimodal encoder. Evaluation results indicate that while these baselines perform reasonably on seen tasks, they struggle with unseen ones. We further conducted an ablation study to assess robustness and identify directions for improvement. We release all materials to the public and welcome researchers to collaborate on the project, advancing technologies in the field together.
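To illustrate what an instruction-tuned, zero-shot evaluation instance pairs together (a task, a source dataset, and a natural-language instruction), here is a schematic example; the field names and the instruction text are invented for illustration, not Dynamic-SUPERB's actual format.

```python
# Schematic evaluation instance: the model receives the instruction and the
# audio inputs and must answer zero-shot (fields invented for illustration).
instance = {
    "task": "SpeakerVerification",
    "dataset": "LibriSpeech-test-clean",
    "instruction": "Listen to the two utterances and answer 'yes' if they are "
                   "spoken by the same person, otherwise 'no'.",
    "inputs": ["utt1.wav", "utt2.wav"],
    "label": "yes",
}
```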

The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads. We also prove that attention heads can perform the components of an MLP (linear transformations and activation functions) separately. Finally, we prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
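One way to see the core trick for SiLU: with two positions, a softmax over scores [s, 0] puts weight sigmoid(s) on the first position, so attending to a value equal to s yields s * sigmoid(s) = SiLU(s). The sketch below checks this numerically; it conveys the flavor of the construction rather than the paper's exact one.

```python
import numpy as np

def silu_via_attention(s):
    """Compute SiLU(s) = s * sigmoid(s) with a 2-position softmax attention."""
    scores = np.array([s, 0.0])   # score s on the real key, 0 on a fixed bias key
    weights = np.exp(scores) / np.exp(scores).sum()  # -> [sigmoid(s), 1 - sigmoid(s)]
    values = np.array([s, 0.0])   # value s at the real position, 0 at the bias
    return weights @ values       # = sigmoid(s) * s = SiLU(s)

for s in (-2.0, 0.5, 3.0):
    assert np.isclose(silu_via_attention(s), s / (1.0 + np.exp(-s)))
```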

The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models (LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further development within the community. In response, this paper presents "WanJuan", a large-scale multimodal dataset composed of both Chinese and English data collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was used to train InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations against models of similar scale. All data can be accessed at //opendatalab.org.cn/WanJuan1.0.

Hardware-firmware co-verification is critical for designing trustworthy systems. While formal methods can provide verification guarantees, the complexity of firmware and hardware can lead to state space explosion. Promising avenues for reducing the state space during firmware verification include manual abstraction of hardware and manual generation of hints; however, manual development of abstractions or hints requires domain expertise and can be time-consuming and error-prone, leading to incorrect proofs or inaccurate results. In this paper, we effectively combine the scalability of simulation-based validation with the completeness of formal verification. Our approach is applicable to actual firmware and hardware implementations and requires no manual intervention during formal model generation or hint extraction. To reduce state space complexity, we use both static module-level analysis and dynamic execution of verification scenarios to automatically generate system-level hints. These hints guide the underlying solver to perform scalable equivalence checking using proofs, and they are validated against the implementation before being used. Experimental evaluation on RISC-V based systems demonstrates that the proposed framework is scalable thanks to scenario-based decomposition and automated hint extraction, and that it can identify complex bugs in actual firmware-hardware implementations.

Diffusion models are a class of deep generative models that have shown impressive results on various tasks and rest on a solid theoretical foundation. Although diffusion models achieve better sample quality and diversity than other state-of-the-art models, they still suffer from a costly sampling procedure and sub-optimal likelihood estimation. Recent studies have shown great enthusiasm for improving the performance of diffusion models. In this article, we present the first comprehensive review of existing variants of diffusion models. Specifically, we provide the first taxonomy of diffusion models, categorizing the variants into three types: sampling-acceleration enhancement, likelihood-maximization enhancement, and data-generalization enhancement. We also introduce in detail five other generative models (variational autoencoders, generative adversarial networks, normalizing flows, autoregressive models, and energy-based models) and clarify the connections between diffusion models and these model families. We then thoroughly investigate the applications of diffusion models, including computer vision, natural language processing, waveform signal processing, multi-modal modeling, molecular graph generation, time series modeling, and adversarial purification. Furthermore, we propose new perspectives on the development of this class of generative models.
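For readers new to the area, the standard DDPM-style formulation that most of the surveyed variants build on can be stated in two lines; this is the textbook form (Ho et al., 2020), not anything specific to this survey.

```latex
% Forward (noising) process and its closed-form marginal, with
% \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s:
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right)

% Learned reverse (denoising) process:
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```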

Multi-agent influence diagrams (MAIDs) are a popular form of graphical model that, for certain classes of games, have been shown to offer key complexity and explainability advantages over traditional extensive form game (EFG) representations. In this paper, we extend previous work on MAIDs by introducing the concept of a MAID subgame, as well as subgame perfect and trembling hand perfect equilibrium refinements. We then prove several equivalence results between MAIDs and EFGs. Finally, we describe an open source implementation for reasoning about MAIDs and computing their equilibria.
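As a tiny, self-contained illustration of the equilibrium computations such a tool automates, the sketch below brute-forces the pure-strategy Nash equilibria of a two-player normal-form game. MAIDs generalize this to decision variables embedded in a graphical model, which this sketch does not attempt to capture.

```python
import numpy as np

# Payoff tables for a 2x2 game (rows: player 1's action, cols: player 2's).
# These are prisoner's-dilemma-style payoffs, chosen only as an example.
u1 = np.array([[3, 0], [5, 1]])
u2 = np.array([[3, 5], [0, 1]])

def pure_nash(u1, u2):
    """Return all action profiles where neither player can gain by deviating."""
    eq = []
    for i in range(u1.shape[0]):
        for j in range(u1.shape[1]):
            if u1[i, j] >= u1[:, j].max() and u2[i, j] >= u2[i, :].max():
                eq.append((i, j))
    return eq

print(pure_nash(u1, u2))  # -> [(1, 1)]: mutual defection is the only equilibrium
```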
