
Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require a large number of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (semantic & acoustic) and using two sequence-to-sequence tasks to enable training with minimal supervision. However, existing methods suffer from information redundancy and dimension explosion in the semantic representation, and from high-frequency waveform distortion in the discrete acoustic representation. Autoregressive frameworks exhibit typical instability and uncontrollability issues, while non-autoregressive frameworks suffer from the prosodic averaging caused by duration prediction models. To address these issues, we propose a minimally-supervised high-fidelity speech synthesis method in which all modules are built on diffusion models. The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression. Contrastive Token-Acoustic Pretraining (CTAP) is used as the intermediate semantic representation to solve the problems of information redundancy and dimension explosion in existing semantic coding methods. The mel-spectrogram is used as the acoustic representation. Both the semantic and acoustic representations are predicted by continuous variable regression tasks, which avoids the distortion of high-frequency fine-grained waveform detail. Experimental results show that our proposed method outperforms the baseline method. We provide audio samples on our website.
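
To make the two-stage pipeline concrete, the sketch below expresses both continuous-regression stages as generic DDPM-style ancestral samplers. This is a minimal illustration under stated assumptions, not the paper's networks: the denoisers are untrained stand-ins, and the feature sizes (256-d semantic frames, 80-bin mel) are placeholders.

```python
# Minimal sketch: text -> semantic features -> mel-spectrogram, each stage a
# conditional diffusion model over a continuous representation (assumptions:
# dummy denoisers and dimensions; the paper's actual models differ).
import numpy as np

def ddpm_sample(denoise_fn, cond, shape, steps=50, rng=None):
    """Generic DDPM ancestral sampler over a continuous representation."""
    rng = rng or np.random.default_rng(0)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                   # start from Gaussian noise
    for t in reversed(range(steps)):
        eps = denoise_fn(x, t, cond)                 # predicted noise
        x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

dummy = lambda x, t, cond: np.zeros_like(x)          # untrained stand-in denoiser
text_emb = np.zeros((32, 256))                       # hypothetical phoneme embeddings
# Stage 1: text -> frame-level semantic features (continuous regression).
semantic = ddpm_sample(dummy, text_emb, shape=(100, 256))
# Stage 2: semantic features -> mel-spectrogram (continuous regression).
mel = ddpm_sample(dummy, semantic, shape=(100, 80))  # 80-bin mel frames
```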

Related Content

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and several other disciplines, and is a frontier technology in information processing. As computing has advanced, speech synthesis has progressed from early formant synthesis to waveform-concatenation synthesis and statistical parametric synthesis, and then to hybrid approaches; the quality and naturalness of synthesized speech have improved markedly and can now meet the needs of many specific applications. Today, speech synthesis is widely deployed in announcement systems for banks and hospitals, in-car navigation systems, and automated call centers, with substantial economic benefits. Moreover, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, its applications are extending into entertainment, language teaching, and rehabilitation therapy. Speech synthesis is, in short, touching every aspect of people's lives.

Real-time communication applications require consistently low latency, which is often disrupted by latency spikes caused by competing flows, especially Web traffic. We identify the root cause of such disruptions as the mismatch between the abrupt bandwidth reallocation of queue scheduling and the gradual congestion-window adjustment of congestion control. For example, when a sudden burst of new Web flows arrives, queue schedulers abruptly shift bandwidth away from the existing real-time flow(s); the real-time flow then needs several RTTs to converge to the new available bandwidth, during which severe stalls occur. In this paper, we present Confucius, a practical queue management scheme designed to provide real-time traffic with consistently low latency regardless of competing flows. Confucius slows down bandwidth reallocation to match the reaction of congestion control, so that the end host can reduce its sending rate without incurring latency spikes. Importantly, Confucius requires neither the collaboration of end hosts (e.g., labels on packets) nor manual parameter tuning to achieve good performance. Extensive experiments show that Confucius outperforms existing practical queueing schemes, reducing stall duration by more than 50%, while competing flows still fairly enjoy on-par performance.
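
The core idea, slowing the scheduler's reallocation to a pace that congestion control can track, can be sketched as exponential smoothing of per-flow bandwidth shares. This is an illustrative model only; Confucius's actual update rule and parameters follow the paper.

```python
# Sketch: move each flow's allocated share gradually toward its fair share, so
# an existing real-time flow loses bandwidth over several RTTs rather than
# instantly when new flows arrive (alpha is a hypothetical smoothing factor).
def update_shares(shares, fair_shares, alpha=0.25):
    return {f: (1 - alpha) * shares.get(f, 0.0) + alpha * fair_shares[f]
            for f in fair_shares}

shares = {"rtc": 1.0}     # the real-time flow initially holds all bandwidth
fair = {"rtc": 0.25, "web1": 0.25, "web2": 0.25, "web3": 0.25}
for rtt in range(5):      # a burst of three Web flows has just arrived
    shares = update_shares(shares, fair)
    print(rtt, round(shares["rtc"], 3))   # rtc share decays smoothly to 0.25
```

Because each step is a convex combination of two normalized allocations, the shares always sum to the full link capacity while the transition is underway.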

Recently, Profile-based Spoken Language Understanding (SLU) has gained increasing attention; it aims to incorporate various types of supplementary profile information (i.e., Knowledge Graph, User Profile, Context Awareness) to resolve the prevalent ambiguities in user utterances. However, existing approaches can only model different profile information separately, without considering the interrelationships among them or excluding the irrelevant and conflicting information within them. To address these issues, we introduce Pro-HAN, a Heterogeneous Graph Attention Network that reasons across multiple sources of profile information. Specifically, we design three types of edges, denoted intra-Pro, inter-Pro, and utterance-Pro, to capture the interrelationships among multiple Pros. We establish a new state of the art on the ProSLU dataset, with an improvement of approximately 8% across all three metrics. Further analysis experiments also confirm the effectiveness of our method in modeling multi-source profile information.
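
A minimal sketch of the typed-edge construction and one round of attention into the utterance node follows; the node inventory, feature sizes, and dot-product scoring are illustrative assumptions, not Pro-HAN's exact architecture.

```python
# Sketch: a heterogeneous graph with the three edge types named above, and a
# single typed-attention aggregation step (all names/sizes are hypothetical).
import numpy as np

nodes = ["utt", "kg:concert", "kg:venue", "up:prefers_music", "ca:at_stadium"]
edges = [
    ("kg:concert", "kg:venue", "intra-Pro"),          # within one profile source
    ("up:prefers_music", "kg:concert", "inter-Pro"),  # across profile sources
    ("kg:concert", "utt", "utterance-Pro"),           # profile node -> utterance
    ("up:prefers_music", "utt", "utterance-Pro"),
    ("ca:at_stadium", "utt", "utterance-Pro"),
]
idx = {n: i for i, n in enumerate(nodes)}
rng = np.random.default_rng(0)
H = rng.standard_normal((len(nodes), 16))             # node features
W = {t: rng.standard_normal((16, 16))                 # per-edge-type projections
     for t in {"intra-Pro", "inter-Pro", "utterance-Pro"}}

def aggregate(dst):
    """One round of typed-edge attention into node `dst`."""
    msgs = np.stack([H[idx[s]] @ W[t] for s, d, t in edges if d == dst])
    scores = msgs @ H[idx[dst]]                       # dot-product attention
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ msgs                             # profile-aware representation

utt_repr = aggregate("utt")                           # shape (16,)
```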

Multi-antenna relays and intelligent reflecting surfaces (IRSs) have been utilized to construct favorable channels that improve the performance of wireless systems. A common feature of relay systems and IRS-aided systems is the two-hop multiple-input multiple-output (MIMO) channel. As a result, the mutual information (MI) of two-hop MIMO channels has been widely investigated, with encouraging results. However, a rigorous investigation of the fundamental limits of two-hop MIMO channels, i.e., a first- and second-order analysis of the MI, is not yet available in the literature, owing to the difficulties caused by the two-hop (product) channel and the noise introduced by the relay (active IRS). In this paper, we employ large-scale random matrix theory (RMT), specifically Gaussian tools, to derive closed-form deterministic approximations for the mean and variance of the MI. Additionally, we determine the convergence rates for the mean, variance, and characteristic function of the MI, and prove its asymptotic Gaussianity. Furthermore, we investigate the analytical properties of the fundamental equations that describe the closed-form approximation and prove the existence and uniqueness of their solution. An iterative algorithm is then proposed to solve the fundamental equations. Numerical results validate the accuracy of the theoretical analysis.
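
For reference, one common formulation of the quantity in question is given below, assuming Gaussian signaling with identity input covariance; the notation (channel matrices H1, H2, relay/active-IRS noise n1, receiver noise n2) is ours, not necessarily the paper's.

```latex
% Two-hop MIMO channel with noisy relay (a common model; an assumption here):
\[
  \mathbf{y} = \mathbf{H}_2\!\left(\mathbf{H}_1\mathbf{x} + \mathbf{n}_1\right) + \mathbf{n}_2,
  \qquad
  \mathbf{n}_i \sim \mathcal{CN}\!\left(\mathbf{0}, \sigma_i^2\mathbf{I}\right).
\]
% The relay noise is amplified by the second hop, so the effective noise
% covariance is \(\sigma_1^2\mathbf{H}_2\mathbf{H}_2^{H} + \sigma_2^2\mathbf{I}\),
% which couples the MI to the product channel and complicates the analysis:
\[
  I = \log\det\!\Big(\mathbf{I} +
      \big(\sigma_1^2\,\mathbf{H}_2\mathbf{H}_2^{H} + \sigma_2^2\,\mathbf{I}\big)^{-1}
      \mathbf{H}_2\mathbf{H}_1\mathbf{H}_1^{H}\mathbf{H}_2^{H}\Big).
\]
```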

Several photonic microring resonator (MRR) based analog accelerators have been proposed to accelerate the inference of integer-quantized CNNs with remarkably higher throughput and energy efficiency than their electronic counterparts. However, existing analog photonic accelerators suffer from three shortcomings: (i) severe hampering of wavelength parallelism due to various crosstalk effects, (ii) inflexibility in supporting dataflows other than the weight-stationary dataflow, and (iii) failure to fully leverage the ability of photodetectors to perform in-situ accumulation. These shortcomings collectively hamper the performance and energy efficiency of prior accelerators. To tackle them, we present HEANA, a novel Hybrid timE Amplitude aNalog optical Accelerator. HEANA employs hybrid time-amplitude analog optical multipliers (TAOMs) that increase its flexibility to support multiple dataflows. A spectrally hitless arrangement of TAOMs significantly reduces crosstalk effects, thereby increasing wavelength parallelism. Moreover, HEANA employs our balanced photo-charge accumulators (BPCAs), which enable buffer-less, in-situ, temporal accumulation, eliminating the need for reduction networks and relieving HEANA of the associated latency and energy overheads. Our evaluation of the inference of four modern CNNs indicates that HEANA provides improvements of at least 66x in frames-per-second (FPS) and 84x in FPS/W (energy efficiency), for equal-area comparisons, on geometric mean over two MRR-based analog CNN accelerators from prior work.
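
A purely behavioral sketch of balanced photo-charge accumulation is given below: two detectors integrate positive and negative partial products over time, and the result is read out as their difference. Device physics and the actual TAOM signaling are abstracted away; this only illustrates why no separate reduction network is needed.

```python
# Behavioral model (an assumption, not the device design): a balanced pair of
# photodetectors accumulates charge from signed partial products over time.
import numpy as np

def bpca_dot(weights, activations):
    """Signed dot product via temporal charge accumulation on two detectors."""
    q_pos = q_neg = 0.0
    for w, a in zip(weights, activations):  # one partial product per time step
        p = w * a
        if p >= 0:
            q_pos += p       # charge on the 'positive' photodetector
        else:
            q_neg += -p      # charge on the 'negative' photodetector
    return q_pos - q_neg     # balanced read-out; accumulation happens in situ

w = np.array([0.5, -1.0, 0.25])
a = np.array([1.0, 0.5, 2.0])
assert np.isclose(bpca_dot(w, a), w @ a)
```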

Deep neural networks (DNNs) have demonstrated remarkable performance across various tasks, including image and speech recognition. However, maximizing their effectiveness requires meticulous optimization of numerous hyperparameters and network parameters through training, and high-performance DNNs entail many parameters, which consume significant energy during training. To overcome these challenges, researchers have turned to spiking neural networks (SNNs), which offer enhanced energy efficiency and biologically plausible data processing, rendering them highly suitable for sensory tasks, particularly on neuromorphic data. Despite these advantages, SNNs, like DNNs, are susceptible to various threats, including adversarial examples and backdoor attacks, yet they remain underexplored with respect to understanding and countering such attacks. This paper delves into backdoor attacks on SNNs using neuromorphic datasets and diverse triggers. Specifically, we explore backdoor triggers within neuromorphic data whose position and color can be manipulated, providing a broader scope of possibilities than conventional triggers in domains such as images. We present various attack strategies, achieving attack success rates of up to 100% while maintaining a negligible impact on clean accuracy. Furthermore, we assess the stealthiness of these attacks, revealing that our most potent attacks possess significant stealth capabilities. Lastly, we adapt several state-of-the-art defenses from the image domain, evaluate their efficacy on neuromorphic data, and uncover instances where they fall short, leading to compromised performance.
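
The sketch below stamps a position- and polarity-controllable trigger into DVS-style frame data; the (T, 2, H, W) tensor layout and the square-patch trigger shape are assumptions for illustration, not the paper's exact attack.

```python
# Sketch: poison neuromorphic frames by writing a small patch into one
# polarity channel of every time step (layout and trigger form assumed).
import numpy as np

def add_trigger(frames, x=0, y=0, size=3, polarity=1, value=1.0):
    """Stamp a square trigger into polarity channel `polarity` of all frames.

    frames: array of shape (T, 2, H, W) -- time steps x ON/OFF channels x pixels.
    """
    poisoned = frames.copy()
    poisoned[:, polarity, y:y + size, x:x + size] = value
    return poisoned

clean = np.zeros((10, 2, 34, 34))            # e.g., an N-MNIST-sized sample
poisoned = add_trigger(clean, x=30, y=30)    # bottom-right corner trigger
# Training: pair each poisoned sample with the attacker's target label.
```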

Reinforcement Learning with Human Feedback (RLHF) has received significant attention for performing tasks without costly manual reward design, by aligning with human preferences. It is crucial to consider diverse human feedback types and various learning methods in different environments. However, quantifying progress in RLHF with diverse feedback is challenging due to the lack of standardized annotation platforms and widely used unified benchmarks. To bridge this gap, we introduce Uni-RLHF, a comprehensive system implementation tailored for RLHF, which aims to provide a complete workflow from real human feedback and foster progress on practical problems. Uni-RLHF contains three packages: 1) a universal multi-feedback annotation platform, 2) large-scale crowdsourced feedback datasets, and 3) modular offline RLHF baseline implementations. Uni-RLHF provides a user-friendly annotation interface tailored to various feedback types and compatible with a wide range of mainstream RL environments. We then establish a systematic pipeline of crowdsourced annotation, resulting in large-scale datasets comprising more than 15 million steps across 30+ popular tasks. Through extensive experiments, we show that agents trained on the collected datasets achieve performance competitive with those trained on well-designed manual rewards. We evaluate various design choices and offer insights into their strengths and potential areas of improvement. We hope to build valuable open-source platforms, datasets, and baselines that facilitate the development of more robust and reliable RLHF solutions based on realistic human feedback. The website is available at //uni-rlhf.github.io/.
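
As one example of the baseline family, the sketch below shows Bradley-Terry reward learning from pairwise comparative feedback, a common component of offline RLHF; the network size and segment format are placeholders, and Uni-RLHF's actual implementations may differ.

```python
# Sketch: learn a reward model from annotator preferences between two
# trajectory segments (Bradley-Terry loss; dimensions are hypothetical).
import torch
import torch.nn as nn

reward_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))

def bt_loss(seg_a, seg_b, prefer_a):
    """seg_a, seg_b: (batch, steps, obs_dim) segments shown to the annotator;
    prefer_a: (batch,) 1.0 where the annotator preferred segment A."""
    r_a = reward_net(seg_a).sum(dim=(1, 2))   # summed predicted reward of A
    r_b = reward_net(seg_b).sum(dim=(1, 2))   # summed predicted reward of B
    logits = r_a - r_b                        # P(A preferred) = sigmoid(r_a - r_b)
    return nn.functional.binary_cross_entropy_with_logits(logits, prefer_a)

a, b = torch.randn(4, 25, 8), torch.randn(4, 25, 8)
loss = bt_loss(a, b, torch.ones(4))
loss.backward()   # the learned reward then drives a standard offline RL algorithm
```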

The effective receptive field (ERF) plays an important role in transform coding: it determines how much redundancy can be removed during the transform and how many spatial priors can be utilized to synthesize textures during the inverse transform. Existing methods rely either on stacks of small kernels, whose ERF remains insufficiently large, or on heavy non-local attention mechanisms, which limit the potential of high-resolution image coding. To tackle this issue, we propose Large Receptive Field Transform Coding with Adaptive Weights for Learned Image Compression (LLIC). Specifically, for the first time in the learned image compression community, we introduce a few large-kernel depth-wise convolutions to remove more redundancy while maintaining modest complexity. Given the wide diversity of images, we propose to enhance the adaptability of the convolutions by generating their weights in a self-conditioned manner. The large kernels cooperate with non-linear embeddings and gate mechanisms for better expressiveness and lighter point-wise interactions. We also investigate improved training techniques to fully exploit the potential of large kernels. In addition, to enhance the interactions among channels, we propose adaptive channel-wise bit allocation via channel importance factors generated in a self-conditioned manner. To demonstrate the effectiveness of the proposed transform coding, we align the entropy model with existing transform methods for comparison and obtain the models LLIC-STF, LLIC-ELIC, and LLIC-TCM. Extensive experiments demonstrate that our LLIC models achieve significant improvements over the corresponding baselines, state-of-the-art performance, and a better trade-off between performance and complexity.
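
The two ingredients named above, a large-kernel depth-wise convolution and self-conditioned channel reweighting, can be sketched as follows; the layer sizes and the sigmoid gating form are illustrative assumptions rather than LLIC's exact blocks.

```python
# Sketch: large-kernel depth-wise conv for a wide ERF at modest cost, gated by
# weights generated from the input itself (sizes and gating form assumed).
import torch
import torch.nn as nn

class LargeDWBlock(nn.Module):
    def __init__(self, ch=192, k=11):
        super().__init__()
        self.dw = nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch)  # depth-wise, large kernel
        self.cond = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(ch, ch, 1),
                                  nn.Sigmoid())                    # self-conditioned weights
        self.pw = nn.Conv2d(ch, ch, 1)                             # light point-wise mixing

    def forward(self, x):
        y = self.dw(x)
        y = y * self.cond(x)   # per-channel reweighting conditioned on the input
        return x + self.pw(y)  # residual connection

x = torch.randn(1, 192, 64, 64)
out = LargeDWBlock()(x)        # same shape as the input
```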

Code switching (CS) is a very common phenomenon in written and spoken communication, but one that is handled poorly by many natural language processing applications. With the application of building CS corpora in mind, we explore CS language identification (LID) for corpus building. We make the task more realistic by scaling it to more languages and considering models with simpler architectures for faster inference. We also reformulate the task as sentence-level multi-label tagging to make it more tractable. Having defined the task, we investigate three reasonable models and define metrics that better reflect the desired performance. We present empirical evidence that no current approach is adequate, and finally provide recommendations for future work in this area.
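
The multi-label reformulation is straightforward to express: each sentence gets an independent sigmoid per language, so a code-switched sentence can carry several labels at once. The feature extractor and label set below are placeholders, not the paper's models.

```python
# Sketch: sentence-level multi-label LID with one sigmoid per language
# (the 1024-d features and the label inventory are hypothetical).
import torch
import torch.nn as nn

LANGS = ["eng", "spa", "hin", "deu"]          # hypothetical label set
clf = nn.Linear(1024, len(LANGS))             # one logit per language

def predict(features, threshold=0.5):
    """Tag a sentence with every language whose probability exceeds the
    threshold, so code-switched sentences receive two or more labels."""
    probs = torch.sigmoid(clf(features))
    return [[LANGS[j] for j in range(len(LANGS)) if p[j] > threshold]
            for p in probs]

feats = torch.randn(2, 1024)                  # two sentences' feature vectors
print(predict(feats))
```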

ChatGPT and other general large language models (LLMs) have achieved remarkable success, but they have also raised concerns about the misuse of AI-generated text. Existing AI-generated text detection models, such as those based on BERT and RoBERTa, are prone to in-domain over-fitting, leading to poor out-of-domain (OOD) detection performance. In this paper, we first collect Chinese text responses to questions from multiple domains, generated by human experts and 9 types of LLMs, and further create a dataset that mixes human-written sentences with sentences polished by LLMs. We then propose LLM-Detector, a novel method for both document-level and sentence-level text detection through instruction tuning of LLMs. Our method leverages the wealth of knowledge LLMs acquire during pre-training, enabling them to detect the kind of text they generate, while instruction tuning aligns the model's responses with the user's expected detection task. Experimental results show that previous methods struggle with sentence-level AI-generated text detection and OOD detection. In contrast, our proposed method not only significantly outperforms baseline methods in both sentence-level and document-level detection but also demonstrates strong generalization. Furthermore, since LLM-Detector is trained on open-source LLMs, it is easy to customize for deployment.
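
Casting detection as instruction tuning amounts to rewriting each labeled example as an instruction-response pair; the prompt wording and field names below are assumptions, not the paper's exact template.

```python
# Sketch: format one detection example for instruction tuning
# (the template and JSON-style fields are hypothetical).
def make_sample(text, label, level="sentence"):
    """Turn a labeled text into an instruction-response training pair."""
    return {
        "instruction": f"Decide whether the following {level} was written by a "
                       f"human or generated by an AI model. Answer 'Human' or 'AI'.",
        "input": text,
        "output": "AI" if label == 1 else "Human",
    }

sample = make_sample("今天天气很好。", label=1)
# A corpus of such samples is then used to fine-tune an open-source LLM.
```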

We propose a novel single-shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector via a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground truth, i.e., no extra annotation is required. In conjunction with it, we employ a global activation module that learns the relationships between channels and object classes in a self-supervised manner. Comprehensive experimental results on both the PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16-based DES, we achieve an mAP of 81.7 on VOC2007 test and an mAP of 32.8 on COCO test-dev, with an inference speed of 31.5 milliseconds per image on a Titan Xp GPU. With a lower-resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference speed of 13.0 milliseconds per image.
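
One way to realize a global activation module is squeeze-and-excitation-style channel recalibration with an auxiliary class head that ties channels to object classes; the sketch below takes that form as an assumption, as the paper defines the exact design.

```python
# Sketch: global pooling -> channel gates, plus a class head linking channels
# to object classes (an assumed realization, not DES's exact module).
import torch
import torch.nn as nn

class GlobalActivation(nn.Module):
    def __init__(self, ch=512, classes=20, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                nn.Linear(ch // r, ch), nn.Sigmoid())
        self.cls = nn.Linear(ch, classes)     # auxiliary channel-to-class head

    def forward(self, x):                     # x: (N, C, H, W) detection features
        g = x.mean(dim=(2, 3))                # global average pool -> (N, C)
        x = x * self.fc(g)[:, :, None, None]  # channel-wise recalibration
        return x, self.cls(g)                 # reweighted features + class logits

feats = torch.randn(2, 512, 38, 38)
out, logits = GlobalActivation()(feats)
```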
