Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and show that they are effective in improving naturalness, the similarity of speech characteristics in a multi-speaker model, and the efficiency of training and inference. Furthermore, we demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method, enabling a fully end-to-end single-stage approach.
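The abstract does not spell out the architecture, but single-stage models in the VITS lineage are typically trained as conditional VAEs with adversarial and duration-prediction losses. The sketch below only illustrates how such a combined objective might be assembled; all module names, shapes, and loss weights are assumptions for illustration, not the paper's actual objective.

```python
import torch
from torch.distributions import Normal, kl_divergence
import torch.nn.functional as F

# Hypothetical combination of VITS-style training terms; names and weights
# are illustrative assumptions, not VITS2's actual objective.
def single_stage_tts_loss(mel_hat, mel, z_posterior, z_prior, disc_fake,
                          dur_hat, dur, w_mel=45.0, w_kl=1.0, w_adv=1.0, w_dur=1.0):
    l_mel = F.l1_loss(mel_hat, mel)                    # spectrogram reconstruction
    l_kl = kl_divergence(z_posterior, z_prior).mean()  # latent prior matching
    l_adv = ((disc_fake - 1.0) ** 2).mean()            # LSGAN-style generator term
    l_dur = F.mse_loss(dur_hat, dur)                   # duration prediction
    return w_mel * l_mel + w_kl * l_kl + w_adv * l_adv + w_dur * l_dur

# Toy call with random tensors, just to show the shapes involved.
loss = single_stage_tts_loss(
    torch.randn(2, 80, 100), torch.randn(2, 80, 100),
    Normal(torch.zeros(2, 192), torch.ones(2, 192)),
    Normal(torch.zeros(2, 192), torch.ones(2, 192)),
    torch.randn(2, 1), torch.randn(2, 50), torch.randn(2, 50),
)
```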

Related content

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has evolved from early formant synthesis to concatenative (waveform-splicing) synthesis and statistical parametric synthesis, and then to hybrid synthesis; the quality and naturalness of synthesized speech have improved markedly and can now largely meet the needs of specific applications. Today, speech synthesis is widely used in information announcement systems for banks and hospitals, in car navigation systems, and in automated call centers, yielding substantial economic benefits. Moreover, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, speech synthesis applications are gradually expanding into entertainment, language teaching, rehabilitation therapy, and other areas. It is fair to say that speech synthesis is touching every aspect of people's lives.

Recent CNN- and Transformer-based models have tried to exploit frequency and periodicity information for long-term time series forecasting. However, most existing work is based on the Fourier transform, which cannot capture fine-grained and local frequency structure. In this paper, we propose a Wavelet-Fourier Transform Network (WFTNet) for long-term time series forecasting. WFTNet uses both Fourier and wavelet transforms to extract comprehensive temporal-frequency information from the signal: the Fourier transform captures global periodic patterns, while the wavelet transform captures local ones. Furthermore, we introduce a Periodicity-Weighted Coefficient (PWC) to adaptively balance the importance of global and local frequency patterns. Extensive experiments on various time series datasets show that WFTNet consistently outperforms other state-of-the-art baselines.
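As a rough illustration of the idea (not the paper's implementation), the sketch below extracts a global spectrum with the FFT and local time-frequency structure with a continuous wavelet transform, then computes a scalar weight standing in for the Periodicity-Weighted Coefficient. The energy-ratio heuristic for the PWC is an assumption; the paper's actual definition is not shown here.

```python
import numpy as np
import pywt  # PyWavelets

def global_local_features(x, scales=np.arange(1, 33), wavelet="morl"):
    """Toy global/local frequency features for a 1-D series x."""
    spec = np.abs(np.fft.rfft(x))             # global periodic structure
    coefs, _ = pywt.cwt(x, scales, wavelet)   # local time-frequency structure
    # Assumed stand-in for the Periodicity-Weighted Coefficient: how much of
    # the spectral energy is concentrated in the strongest few peaks.
    top = np.sort(spec)[::-1][:4]
    pwc = top.sum() / (spec.sum() + 1e-8)
    return pwc, spec, np.abs(coefs)

x = np.sin(2 * np.pi * np.arange(512) / 24) + 0.1 * np.random.randn(512)
pwc, g, l = global_local_features(x)
print(f"PWC={pwc:.2f}, global shape={g.shape}, local shape={l.shape}")
```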

Explainable recommender systems (RS) have traditionally followed a one-size-fits-all approach, delivering explanations at the same level of detail to every user regardless of their individual needs and goals. Further, explanations in RS have so far been presented mostly in a static, non-interactive manner. To fill these research gaps, we adopt a user-centered, interactive explanation model that provides explanations at different levels of detail and empowers users to interact with, control, and personalize the explanations based on their needs and preferences. We followed a user-centered approach to design interactive explanations with three levels of detail (basic, intermediate, and advanced) and implemented them in the transparent Recommendation and Interest Modeling Application (RIMA). We conducted a qualitative user study (N=14) to investigate the impact of interactive explanations with varying levels of detail on users' perception of the explainable RS. The study provides qualitative evidence that fostering interaction and giving users control over which explanation they see can meet the demands of users with different needs, preferences, and goals, and consequently can improve crucial aspects of explainable recommendation, including transparency, trust, satisfaction, and user experience.
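A minimal sketch of what a three-level, user-controllable explanation payload could look like; every field name and value below is hypothetical and is not RIMA's actual schema.

```python
# Hypothetical structure for one recommendation explanation with three
# user-selectable levels of detail (not RIMA's actual data model).
explanation = {
    "item": "paper-4711",
    "levels": {
        "basic": "Recommended because it matches your interest in 'recommender systems'.",
        "intermediate": {
            "matched_interests": ["recommender systems", "explainability"],
            "similarity": 0.82,
        },
        "advanced": {
            "interest_model": {"recommender systems": 0.61, "explainability": 0.21},
            "scoring": "cosine(user_interest_vector, item_keyword_vector)",
        },
    },
}

def render(expl, level="basic"):
    """The user interactively picks the level; the UI renders only that view."""
    return expl["levels"][level]
```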

We introduce RotateIt, a system that enables fingertip-based object rotation along multiple axes by leveraging multimodal sensory inputs. Our system is trained in simulation, where it has access to ground-truth object shapes and physical properties. Then we distill it to operate on realistic yet noisy simulated visuotactile and proprioceptive sensory inputs. These multimodal inputs are fused via a visuotactile transformer, enabling online inference of object shapes and physical properties during deployment. We show significant performance improvements over prior methods and the importance of visual and tactile sensing.
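A bare-bones sketch of the kind of fusion described, with one token per modality fed through a transformer encoder; all dimensions and the module layout are assumptions for illustration, not the system's actual architecture.

```python
import torch
import torch.nn as nn

class VisuotactileFusion(nn.Module):
    """Toy multimodal fusion: one token per modality, fused by a transformer."""
    def __init__(self, d=128):
        super().__init__()
        self.vision = nn.Linear(512, d)   # e.g. pooled visual features
        self.touch = nn.Linear(32, d)     # e.g. fingertip tactile readings
        self.proprio = nn.Linear(24, d)   # e.g. joint positions/velocities
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, 16)      # e.g. action or shape/property estimate

    def forward(self, v, t, p):
        tokens = torch.stack([self.vision(v), self.touch(t), self.proprio(p)], dim=1)
        return self.head(self.fuse(tokens).mean(dim=1))

out = VisuotactileFusion()(torch.randn(2, 512), torch.randn(2, 32), torch.randn(2, 24))
```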

Diffusion models have shown promising results in speech enhancement, using a task-adapted diffusion process for the conditional generation of clean speech given a noisy mixture. However, at test time, the neural network used for score estimation is called multiple times to solve the iterative reverse process. This makes inference slow and causes discretization errors that accumulate over the sampling trajectory. In this paper, we address these limitations with a two-stage training approach. In the first stage, we train the diffusion model in the usual way with the generative denoising score matching loss. In the second stage, we compute the enhanced signal by solving the reverse process and compare the resulting estimate to the clean speech target with a predictive loss. We show that this second training stage lets the model match the baseline's performance with only 5 function evaluations (NFEs) instead of 60. While the performance of standard generative diffusion algorithms drops dramatically as the NFE count is lowered toward single-step diffusion, our method maintains steady performance, largely outperforming the diffusion baseline in this setting and generalizing better than its predictive counterpart.
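To make the two stages concrete, here is a heavily simplified sketch on toy tensors: stage one trains a score model with denoising score matching, and stage two unrolls a short reverse trajectory and applies a predictive loss against the clean target. The noise schedule, network, and update rule are placeholders, not the paper's.

```python
import torch
import torch.nn as nn

# Toy score network over 64-dim "speech" vectors plus one noise-level input.
score = nn.Sequential(nn.Linear(65, 128), nn.SiLU(), nn.Linear(128, 64))
opt = torch.optim.Adam(score.parameters(), lr=1e-3)

def eps_hat(x_t, sigma):
    """Predict the noise contained in x_t at noise level sigma."""
    return score(torch.cat([x_t, sigma.expand(x_t.shape[0], 1)], dim=1))

clean = torch.randn(8, 64)  # stand-in for clean speech features

# Stage 1: generative denoising score matching (predict the added noise).
for _ in range(100):
    sigma = torch.rand(1) * 0.9 + 0.1
    eps = torch.randn_like(clean)
    loss = ((eps_hat(clean + sigma * eps, sigma) - eps) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: unroll a short reverse trajectory (only 5 NFEs) and apply a
# predictive loss against the clean target, backpropagating through all steps.
sigmas = torch.linspace(1.0, 0.1, 6)
x = clean + torch.randn_like(clean)  # crude stand-in for a noisy mixture
for s, s_next in zip(sigmas[:-1], sigmas[1:]):
    x = x - (s - s_next) * eps_hat(x, s.view(1))  # simple Euler-style update
pred_loss = (x - clean).abs().mean()
opt.zero_grad(); pred_loss.backward(); opt.step()
```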

Text language models have shown remarkable zero-shot capability in generalizing to unseen tasks when provided with well-formulated instructions. However, existing studies in speech processing primarily focus on limited or specific tasks, and the lack of standardized benchmarks hinders fair comparison across approaches. We therefore present Dynamic-SUPERB, a benchmark for building universal speech models that leverage instruction tuning to perform multiple tasks in a zero-shot fashion. To achieve comprehensive coverage of diverse speech tasks and harness instruction tuning, we invite the community to collaborate and contribute, facilitating the dynamic growth of the benchmark. To begin with, Dynamic-SUPERB features 55 evaluation instances formed by combining 33 tasks and 22 datasets, spanning a broad spectrum of dimensions and providing a comprehensive evaluation platform. We also propose several approaches to establish benchmark baselines, including speech models, text language models, and a multimodal encoder. Evaluation results indicate that while these baselines perform reasonably on seen tasks, they struggle with unseen ones. We further conducted an ablation study to assess robustness and identify directions for improvement. We release all materials to the public and welcome researchers to collaborate on the project, advancing technologies in the field together.
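To illustrate what an instruction-tuned, zero-shot evaluation instance pairs together (a task, a source dataset, and a natural-language instruction), here is a schematic example; the field names and the instruction text are invented for illustration, not Dynamic-SUPERB's actual format.

```python
# Schematic evaluation instance: the model receives the instruction and the
# audio inputs and must answer zero-shot (fields invented for illustration).
instance = {
    "task": "SpeakerVerification",
    "dataset": "LibriSpeech-test-clean",
    "instruction": "Listen to the two utterances and answer 'yes' if they are "
                   "spoken by the same person, otherwise 'no'.",
    "inputs": ["utt1.wav", "utt2.wav"],
    "label": "yes",
}
```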

The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads. We also prove that attention heads can perform the components of an MLP (linear transformations and activation functions) separately. Finally, we prove that attention heads can encode arbitrary masking patterns in their weight matrices to within arbitrarily small error.
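One way to see the core trick for SiLU: with two positions, a softmax over scores [s, 0] puts weight sigmoid(s) on the first position, so attending to a value equal to s yields s * sigmoid(s) = SiLU(s). The sketch below checks this numerically; it conveys the flavor of the construction rather than the paper's exact one.

```python
import numpy as np

def silu_via_attention(s):
    """Compute SiLU(s) = s * sigmoid(s) with a 2-position softmax attention."""
    scores = np.array([s, 0.0])   # score s on the real key, 0 on a fixed bias key
    weights = np.exp(scores) / np.exp(scores).sum()  # -> [sigmoid(s), 1 - sigmoid(s)]
    values = np.array([s, 0.0])   # value s at the real position, 0 at the bias
    return weights @ values       # = sigmoid(s) * s = SiLU(s)

for s in (-2.0, 0.5, 3.0):
    assert np.isclose(silu_via_attention(s), s / (1.0 + np.exp(-s)))
```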

The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models (LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further development within the community. In response, this paper presents "WanJuan", a large-scale multimodal dataset composed of both Chinese and English data collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was used to train InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations against models of similar scale. All data can be accessed at //opendatalab.org.cn/WanJuan1.0.

Hardware-firmware co-verification is critical for designing trustworthy systems. While formal methods can provide verification guarantees, the complexity of firmware and hardware can lead to state space explosion. Promising avenues for reducing the state space during firmware verification include manual abstraction of hardware and manual generation of hints; however, manual development of abstractions or hints requires domain expertise and can be time-consuming and error-prone, leading to incorrect proofs or inaccurate results. In this paper, we effectively combine the scalability of simulation-based validation with the completeness of formal verification. Our approach is applicable to actual firmware and hardware implementations and requires no manual intervention during formal model generation or hint extraction. To reduce state space complexity, we use both static module-level analysis and dynamic execution of verification scenarios to automatically generate system-level hints. These hints guide the underlying solver to perform scalable equivalence checking using proofs, and they are validated against the implementation before being used. Experimental evaluation on RISC-V based systems demonstrates that the proposed framework is scalable thanks to scenario-based decomposition and automated hint extraction, and that it can identify complex bugs in actual firmware-hardware implementations.

Diffusion models are a class of deep generative models that have shown impressive results on various tasks and rest on a solid theoretical foundation. Although diffusion models achieve better sample quality and diversity than other state-of-the-art models, they still suffer from a costly sampling procedure and sub-optimal likelihood estimation. Recent studies have shown great enthusiasm for improving the performance of diffusion models. In this article, we present the first comprehensive review of existing variants of diffusion models. Specifically, we provide the first taxonomy of diffusion models, categorizing the variants into three types: sampling-acceleration enhancement, likelihood-maximization enhancement, and data-generalization enhancement. We also introduce in detail five other generative models (variational autoencoders, generative adversarial networks, normalizing flows, autoregressive models, and energy-based models) and clarify the connections between diffusion models and these model families. We then thoroughly investigate the applications of diffusion models, including computer vision, natural language processing, waveform signal processing, multi-modal modeling, molecular graph generation, time series modeling, and adversarial purification. Furthermore, we propose new perspectives on the development of this class of generative models.
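For readers new to the area, the standard DDPM-style formulation that most of the surveyed variants build on can be stated in two lines; this is the textbook form (Ho et al., 2020), not anything specific to this survey.

```latex
% Forward (noising) process and its closed-form marginal, with
% \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s:
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right)

% Learned reverse (denoising) process:
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```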

Multi-agent influence diagrams (MAIDs) are a popular form of graphical model that, for certain classes of games, have been shown to offer key complexity and explainability advantages over traditional extensive form game (EFG) representations. In this paper, we extend previous work on MAIDs by introducing the concept of a MAID subgame, as well as subgame perfect and trembling hand perfect equilibrium refinements. We then prove several equivalence results between MAIDs and EFGs. Finally, we describe an open source implementation for reasoning about MAIDs and computing their equilibria.
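As a tiny, self-contained illustration of the equilibrium computations such a tool automates, the sketch below brute-forces the pure-strategy Nash equilibria of a two-player normal-form game. MAIDs generalize this to decision variables embedded in a graphical model, which this sketch does not attempt to capture.

```python
import numpy as np

# Payoff tables for a 2x2 game (rows: player 1's action, cols: player 2's).
# These are prisoner's-dilemma-style payoffs, chosen only as an example.
u1 = np.array([[3, 0], [5, 1]])
u2 = np.array([[3, 5], [0, 1]])

def pure_nash(u1, u2):
    """Return all action profiles where neither player can gain by deviating."""
    eq = []
    for i in range(u1.shape[0]):
        for j in range(u1.shape[1]):
            if u1[i, j] >= u1[:, j].max() and u2[i, j] >= u2[i, :].max():
                eq.append((i, j))
    return eq

print(pure_nash(u1, u2))  # -> [(1, 1)]: mutual defection is the only equilibrium
```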
