
Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require a large number of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (semantic & acoustic) and using two sequence-to-sequence tasks to enable training with minimal supervision. However, existing methods suffer from information redundancy and dimension explosion in the semantic representation, and from high-frequency waveform distortion in the discrete acoustic representation. Autoregressive frameworks exhibit typical instability and uncontrollability issues, while non-autoregressive frameworks suffer from the prosodic averaging caused by duration prediction models. To address these issues, we propose a minimally-supervised high-fidelity speech synthesis method in which all modules are built on diffusion models. The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression. Contrastive Token-Acoustic Pretraining (CTAP) is used as the intermediate semantic representation to solve the problems of information redundancy and dimension explosion in existing semantic coding methods. The mel-spectrogram is used as the acoustic representation. Both the semantic and acoustic representations are predicted by continuous variable regression tasks, which avoids the distortion of high-frequency fine-grained waveform detail. Experimental results show that our proposed method outperforms the baseline method. We provide audio samples on our website.
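
To make the two-stage pipeline concrete, the sketch below expresses both continuous-regression stages as generic DDPM-style ancestral samplers. This is a minimal illustration under stated assumptions, not the paper's networks: the denoisers are untrained stand-ins, and the feature sizes (256-d semantic frames, 80-bin mel) are placeholders.

```python
# Minimal sketch: text -> semantic features -> mel-spectrogram, each stage a
# conditional diffusion model over a continuous representation (assumptions:
# dummy denoisers and dimensions; the paper's actual models differ).
import numpy as np

def ddpm_sample(denoise_fn, cond, shape, steps=50, rng=None):
    """Generic DDPM ancestral sampler over a continuous representation."""
    rng = rng or np.random.default_rng(0)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                   # start from Gaussian noise
    for t in reversed(range(steps)):
        eps = denoise_fn(x, t, cond)                 # predicted noise
        x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

dummy = lambda x, t, cond: np.zeros_like(x)          # untrained stand-in denoiser
text_emb = np.zeros((32, 256))                       # hypothetical phoneme embeddings
# Stage 1: text -> frame-level semantic features (continuous regression).
semantic = ddpm_sample(dummy, text_emb, shape=(100, 256))
# Stage 2: semantic features -> mel-spectrogram (continuous regression).
mel = ddpm_sample(dummy, semantic, shape=(100, 80))  # 80-bin mel frames
```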

Related Content

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and several other disciplines, and is a frontier technology in information processing. As computing has advanced, speech synthesis has progressed from early formant synthesis to waveform-concatenation synthesis and statistical parametric synthesis, and then to hybrid approaches; the quality and naturalness of synthesized speech have improved markedly and can now meet the needs of many specific applications. Today, speech synthesis is widely deployed in announcement systems for banks and hospitals, in-car navigation systems, and automated call centers, with substantial economic benefits. Moreover, with the proliferation of smartphones, MP3 players, PDAs, and other media closely tied to daily life, its applications are extending into entertainment, language teaching, and rehabilitation therapy. Speech synthesis is, in short, touching every aspect of people's lives.

Real-time communication applications require consistently low latency, which is often disrupted by latency spikes caused by competing flows, especially Web traffic. We identify the root cause of such disruptions as the mismatch between the abrupt bandwidth reallocation of queue scheduling and the gradual congestion-window adjustment of congestion control. For example, when a sudden burst of new Web flows arrives, queue schedulers abruptly shift bandwidth away from the existing real-time flow(s); the real-time flow then needs several RTTs to converge to the new available bandwidth, during which severe stalls occur. In this paper, we present Confucius, a practical queue management scheme designed to provide real-time traffic with consistently low latency regardless of competing flows. Confucius slows down bandwidth reallocation to match the reaction of congestion control, so that the end host can reduce its sending rate without incurring latency spikes. Importantly, Confucius requires neither the collaboration of end hosts (e.g., labels on packets) nor manual parameter tuning to achieve good performance. Extensive experiments show that Confucius outperforms existing practical queueing schemes, reducing stall duration by more than 50%, while competing flows still fairly enjoy on-par performance.
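
The core idea, slowing the scheduler's reallocation to a pace that congestion control can track, can be sketched as exponential smoothing of per-flow bandwidth shares. This is an illustrative model only; Confucius's actual update rule and parameters follow the paper.

```python
# Sketch: move each flow's allocated share gradually toward its fair share, so
# an existing real-time flow loses bandwidth over several RTTs rather than
# instantly when new flows arrive (alpha is a hypothetical smoothing factor).
def update_shares(shares, fair_shares, alpha=0.25):
    return {f: (1 - alpha) * shares.get(f, 0.0) + alpha * fair_shares[f]
            for f in fair_shares}

shares = {"rtc": 1.0}     # the real-time flow initially holds all bandwidth
fair = {"rtc": 0.25, "web1": 0.25, "web2": 0.25, "web3": 0.25}
for rtt in range(5):      # a burst of three Web flows has just arrived
    shares = update_shares(shares, fair)
    print(rtt, round(shares["rtc"], 3))   # rtc share decays smoothly to 0.25
```

Because each step is a convex combination of two normalized allocations, the shares always sum to the full link capacity while the transition is underway.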

Recently, Profile-based Spoken Language Understanding (SLU) has gained increasing attention; it aims to incorporate various types of supplementary profile information (i.e., Knowledge Graph, User Profile, Context Awareness) to resolve the prevalent ambiguities in user utterances. However, existing approaches can only model different profile information separately, without considering the interrelationships among them or excluding the irrelevant and conflicting information within them. To address these issues, we introduce Pro-HAN, a Heterogeneous Graph Attention Network that reasons across multiple sources of profile information. Specifically, we design three types of edges, denoted intra-Pro, inter-Pro, and utterance-Pro, to capture the interrelationships among multiple Pros. We establish a new state of the art on the ProSLU dataset, with an improvement of approximately 8% across all three metrics. Further analysis experiments also confirm the effectiveness of our method in modeling multi-source profile information.
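
A minimal sketch of the typed-edge construction and one round of attention into the utterance node follows; the node inventory, feature sizes, and dot-product scoring are illustrative assumptions, not Pro-HAN's exact architecture.

```python
# Sketch: a heterogeneous graph with the three edge types named above, and a
# single typed-attention aggregation step (all names/sizes are hypothetical).
import numpy as np

nodes = ["utt", "kg:concert", "kg:venue", "up:prefers_music", "ca:at_stadium"]
edges = [
    ("kg:concert", "kg:venue", "intra-Pro"),          # within one profile source
    ("up:prefers_music", "kg:concert", "inter-Pro"),  # across profile sources
    ("kg:concert", "utt", "utterance-Pro"),           # profile node -> utterance
    ("up:prefers_music", "utt", "utterance-Pro"),
    ("ca:at_stadium", "utt", "utterance-Pro"),
]
idx = {n: i for i, n in enumerate(nodes)}
rng = np.random.default_rng(0)
H = rng.standard_normal((len(nodes), 16))             # node features
W = {t: rng.standard_normal((16, 16))                 # per-edge-type projections
     for t in {"intra-Pro", "inter-Pro", "utterance-Pro"}}

def aggregate(dst):
    """One round of typed-edge attention into node `dst`."""
    msgs = np.stack([H[idx[s]] @ W[t] for s, d, t in edges if d == dst])
    scores = msgs @ H[idx[dst]]                       # dot-product attention
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ msgs                             # profile-aware representation

utt_repr = aggregate("utt")                           # shape (16,)
```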

Multi-antenna relays and intelligent reflecting surfaces (IRSs) have been utilized to construct favorable channels that improve the performance of wireless systems. A common feature of relay systems and IRS-aided systems is the two-hop multiple-input multiple-output (MIMO) channel. As a result, the mutual information (MI) of two-hop MIMO channels has been widely investigated, with encouraging results. However, a rigorous investigation of the fundamental limits of two-hop MIMO channels, i.e., a first- and second-order analysis of the MI, is not yet available in the literature, owing to the difficulties caused by the two-hop (product) channel and the noise introduced by the relay (active IRS). In this paper, we employ large-scale random matrix theory (RMT), specifically Gaussian tools, to derive closed-form deterministic approximations for the mean and variance of the MI. Additionally, we determine the convergence rates for the mean, variance, and characteristic function of the MI, and prove its asymptotic Gaussianity. Furthermore, we investigate the analytical properties of the fundamental equations that describe the closed-form approximation and prove the existence and uniqueness of their solution. An iterative algorithm is then proposed to solve the fundamental equations. Numerical results validate the accuracy of the theoretical analysis.
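
For reference, one common formulation of the quantity in question is given below, assuming Gaussian signaling with identity input covariance; the notation (channel matrices H1, H2, relay/active-IRS noise n1, receiver noise n2) is ours, not necessarily the paper's.

```latex
% Two-hop MIMO channel with noisy relay (a common model; an assumption here):
\[
  \mathbf{y} = \mathbf{H}_2\!\left(\mathbf{H}_1\mathbf{x} + \mathbf{n}_1\right) + \mathbf{n}_2,
  \qquad
  \mathbf{n}_i \sim \mathcal{CN}\!\left(\mathbf{0}, \sigma_i^2\mathbf{I}\right).
\]
% The relay noise is amplified by the second hop, so the effective noise
% covariance is \(\sigma_1^2\mathbf{H}_2\mathbf{H}_2^{H} + \sigma_2^2\mathbf{I}\),
% which couples the MI to the product channel and complicates the analysis:
\[
  I = \log\det\!\Big(\mathbf{I} +
      \big(\sigma_1^2\,\mathbf{H}_2\mathbf{H}_2^{H} + \sigma_2^2\,\mathbf{I}\big)^{-1}
      \mathbf{H}_2\mathbf{H}_1\mathbf{H}_1^{H}\mathbf{H}_2^{H}\Big).
\]
```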

Several photonic microring resonator (MRR) based analog accelerators have been proposed to accelerate the inference of integer-quantized CNNs with remarkably higher throughput and energy efficiency than their electronic counterparts. However, existing analog photonic accelerators suffer from three shortcomings: (i) severe hampering of wavelength parallelism due to various crosstalk effects, (ii) inflexibility in supporting dataflows other than the weight-stationary dataflow, and (iii) failure to fully leverage the ability of photodetectors to perform in-situ accumulation. These shortcomings collectively hamper the performance and energy efficiency of prior accelerators. To tackle them, we present HEANA, a novel Hybrid timE Amplitude aNalog optical Accelerator. HEANA employs hybrid time-amplitude analog optical multipliers (TAOMs) that increase its flexibility to support multiple dataflows. A spectrally hitless arrangement of TAOMs significantly reduces crosstalk effects, thereby increasing wavelength parallelism. Moreover, HEANA employs our balanced photo-charge accumulators (BPCAs), which enable buffer-less, in-situ, temporal accumulation, eliminating the need for reduction networks and relieving HEANA of the associated latency and energy overheads. Our evaluation of the inference of four modern CNNs indicates that HEANA provides improvements of at least 66x in frames-per-second (FPS) and 84x in FPS/W (energy efficiency), for equal-area comparisons, on geometric mean over two MRR-based analog CNN accelerators from prior work.
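
A purely behavioral sketch of balanced photo-charge accumulation is given below: two detectors integrate positive and negative partial products over time, and the result is read out as their difference. Device physics and the actual TAOM signaling are abstracted away; this only illustrates why no separate reduction network is needed.

```python
# Behavioral model (an assumption, not the device design): a balanced pair of
# photodetectors accumulates charge from signed partial products over time.
import numpy as np

def bpca_dot(weights, activations):
    """Signed dot product via temporal charge accumulation on two detectors."""
    q_pos = q_neg = 0.0
    for w, a in zip(weights, activations):  # one partial product per time step
        p = w * a
        if p >= 0:
            q_pos += p       # charge on the 'positive' photodetector
        else:
            q_neg += -p      # charge on the 'negative' photodetector
    return q_pos - q_neg     # balanced read-out; accumulation happens in situ

w = np.array([0.5, -1.0, 0.25])
a = np.array([1.0, 0.5, 2.0])
assert np.isclose(bpca_dot(w, a), w @ a)
```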

Deep neural networks (DNNs) have demonstrated remarkable performance across various tasks, including image and speech recognition. However, maximizing their effectiveness requires meticulous optimization of numerous hyperparameters and network parameters through training, and high-performance DNNs entail many parameters, which consume significant energy during training. To overcome these challenges, researchers have turned to spiking neural networks (SNNs), which offer enhanced energy efficiency and biologically plausible data processing, rendering them highly suitable for sensory tasks, particularly on neuromorphic data. Despite these advantages, SNNs, like DNNs, are susceptible to various threats, including adversarial examples and backdoor attacks, yet they remain underexplored with respect to understanding and countering such attacks. This paper delves into backdoor attacks on SNNs using neuromorphic datasets and diverse triggers. Specifically, we explore backdoor triggers within neuromorphic data whose position and color can be manipulated, providing a broader scope of possibilities than conventional triggers in domains such as images. We present various attack strategies, achieving attack success rates of up to 100% while maintaining a negligible impact on clean accuracy. Furthermore, we assess the stealthiness of these attacks, revealing that our most potent attacks possess significant stealth capabilities. Lastly, we adapt several state-of-the-art defenses from the image domain, evaluate their efficacy on neuromorphic data, and uncover instances where they fall short, leading to compromised performance.
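
The sketch below stamps a position- and polarity-controllable trigger into DVS-style frame data; the (T, 2, H, W) tensor layout and the square-patch trigger shape are assumptions for illustration, not the paper's exact attack.

```python
# Sketch: poison neuromorphic frames by writing a small patch into one
# polarity channel of every time step (layout and trigger form assumed).
import numpy as np

def add_trigger(frames, x=0, y=0, size=3, polarity=1, value=1.0):
    """Stamp a square trigger into polarity channel `polarity` of all frames.

    frames: array of shape (T, 2, H, W) -- time steps x ON/OFF channels x pixels.
    """
    poisoned = frames.copy()
    poisoned[:, polarity, y:y + size, x:x + size] = value
    return poisoned

clean = np.zeros((10, 2, 34, 34))            # e.g., an N-MNIST-sized sample
poisoned = add_trigger(clean, x=30, y=30)    # bottom-right corner trigger
# Training: pair each poisoned sample with the attacker's target label.
```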

Reinforcement Learning with Human Feedback (RLHF) has received significant attention for performing tasks without costly manual reward design, by aligning with human preferences. It is crucial to consider diverse human feedback types and various learning methods in different environments. However, quantifying progress in RLHF with diverse feedback is challenging due to the lack of standardized annotation platforms and widely used unified benchmarks. To bridge this gap, we introduce Uni-RLHF, a comprehensive system implementation tailored for RLHF, which aims to provide a complete workflow from real human feedback and foster progress on practical problems. Uni-RLHF contains three packages: 1) a universal multi-feedback annotation platform, 2) large-scale crowdsourced feedback datasets, and 3) modular offline RLHF baseline implementations. Uni-RLHF provides a user-friendly annotation interface tailored to various feedback types and compatible with a wide range of mainstream RL environments. We then establish a systematic pipeline of crowdsourced annotation, resulting in large-scale datasets comprising more than 15 million steps across 30+ popular tasks. Through extensive experiments, we show that agents trained on the collected datasets achieve performance competitive with those trained on well-designed manual rewards. We evaluate various design choices and offer insights into their strengths and potential areas of improvement. We hope to build valuable open-source platforms, datasets, and baselines that facilitate the development of more robust and reliable RLHF solutions based on realistic human feedback. The website is available at //uni-rlhf.github.io/.
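
As one example of the baseline family, the sketch below shows Bradley-Terry reward learning from pairwise comparative feedback, a common component of offline RLHF; the network size and segment format are placeholders, and Uni-RLHF's actual implementations may differ.

```python
# Sketch: learn a reward model from annotator preferences between two
# trajectory segments (Bradley-Terry loss; dimensions are hypothetical).
import torch
import torch.nn as nn

reward_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))

def bt_loss(seg_a, seg_b, prefer_a):
    """seg_a, seg_b: (batch, steps, obs_dim) segments shown to the annotator;
    prefer_a: (batch,) 1.0 where the annotator preferred segment A."""
    r_a = reward_net(seg_a).sum(dim=(1, 2))   # summed predicted reward of A
    r_b = reward_net(seg_b).sum(dim=(1, 2))   # summed predicted reward of B
    logits = r_a - r_b                        # P(A preferred) = sigmoid(r_a - r_b)
    return nn.functional.binary_cross_entropy_with_logits(logits, prefer_a)

a, b = torch.randn(4, 25, 8), torch.randn(4, 25, 8)
loss = bt_loss(a, b, torch.ones(4))
loss.backward()   # the learned reward then drives a standard offline RL algorithm
```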

The effective receptive field (ERF) plays an important role in transform coding: it determines how much redundancy can be removed during the transform and how many spatial priors can be utilized to synthesize textures during the inverse transform. Existing methods rely either on stacks of small kernels, whose ERF remains insufficiently large, or on heavy non-local attention mechanisms, which limit the potential of high-resolution image coding. To tackle this issue, we propose Large Receptive Field Transform Coding with Adaptive Weights for Learned Image Compression (LLIC). Specifically, for the first time in the learned image compression community, we introduce a few large-kernel depth-wise convolutions to remove more redundancy while maintaining modest complexity. Given the wide diversity of images, we propose to enhance the adaptability of the convolutions by generating their weights in a self-conditioned manner. The large kernels cooperate with non-linear embeddings and gate mechanisms for better expressiveness and lighter point-wise interactions. We also investigate improved training techniques to fully exploit the potential of large kernels. In addition, to enhance the interactions among channels, we propose adaptive channel-wise bit allocation via channel importance factors generated in a self-conditioned manner. To demonstrate the effectiveness of the proposed transform coding, we align the entropy model with existing transform methods for comparison and obtain the models LLIC-STF, LLIC-ELIC, and LLIC-TCM. Extensive experiments demonstrate that our LLIC models achieve significant improvements over the corresponding baselines, state-of-the-art performance, and a better trade-off between performance and complexity.
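
The two ingredients named above, a large-kernel depth-wise convolution and self-conditioned channel reweighting, can be sketched as follows; the layer sizes and the sigmoid gating form are illustrative assumptions rather than LLIC's exact blocks.

```python
# Sketch: large-kernel depth-wise conv for a wide ERF at modest cost, gated by
# weights generated from the input itself (sizes and gating form assumed).
import torch
import torch.nn as nn

class LargeDWBlock(nn.Module):
    def __init__(self, ch=192, k=11):
        super().__init__()
        self.dw = nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch)  # depth-wise, large kernel
        self.cond = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(ch, ch, 1),
                                  nn.Sigmoid())                    # self-conditioned weights
        self.pw = nn.Conv2d(ch, ch, 1)                             # light point-wise mixing

    def forward(self, x):
        y = self.dw(x)
        y = y * self.cond(x)   # per-channel reweighting conditioned on the input
        return x + self.pw(y)  # residual connection

x = torch.randn(1, 192, 64, 64)
out = LargeDWBlock()(x)        # same shape as the input
```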

Code switching (CS) is a very common phenomenon in written and spoken communication, but one that is handled poorly by many natural language processing applications. With the application of building CS corpora in mind, we explore CS language identification (LID) for corpus building. We make the task more realistic by scaling it to more languages and considering models with simpler architectures for faster inference. We also reformulate the task as sentence-level multi-label tagging to make it more tractable. Having defined the task, we investigate three reasonable models and define metrics that better reflect the desired performance. We present empirical evidence that no current approach is adequate, and finally provide recommendations for future work in this area.
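
The multi-label reformulation is straightforward to express: each sentence gets an independent sigmoid per language, so a code-switched sentence can carry several labels at once. The feature extractor and label set below are placeholders, not the paper's models.

```python
# Sketch: sentence-level multi-label LID with one sigmoid per language
# (the 1024-d features and the label inventory are hypothetical).
import torch
import torch.nn as nn

LANGS = ["eng", "spa", "hin", "deu"]          # hypothetical label set
clf = nn.Linear(1024, len(LANGS))             # one logit per language

def predict(features, threshold=0.5):
    """Tag a sentence with every language whose probability exceeds the
    threshold, so code-switched sentences receive two or more labels."""
    probs = torch.sigmoid(clf(features))
    return [[LANGS[j] for j in range(len(LANGS)) if p[j] > threshold]
            for p in probs]

feats = torch.randn(2, 1024)                  # two sentences' feature vectors
print(predict(feats))
```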

ChatGPT and other general large language models (LLMs) have achieved remarkable success, but they have also raised concerns about the misuse of AI-generated text. Existing AI-generated text detection models, such as those based on BERT and RoBERTa, are prone to in-domain over-fitting, leading to poor out-of-domain (OOD) detection performance. In this paper, we first collect Chinese text responses to questions from multiple domains, generated by human experts and 9 types of LLMs, and further create a dataset that mixes human-written sentences with sentences polished by LLMs. We then propose LLM-Detector, a novel method for both document-level and sentence-level text detection through instruction tuning of LLMs. Our method leverages the wealth of knowledge LLMs acquire during pre-training, enabling them to detect the kind of text they generate, while instruction tuning aligns the model's responses with the user's expected detection task. Experimental results show that previous methods struggle with sentence-level AI-generated text detection and OOD detection. In contrast, our proposed method not only significantly outperforms baseline methods in both sentence-level and document-level detection but also demonstrates strong generalization. Furthermore, since LLM-Detector is trained on open-source LLMs, it is easy to customize for deployment.
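
Casting detection as instruction tuning amounts to rewriting each labeled example as an instruction-response pair; the prompt wording and field names below are assumptions, not the paper's exact template.

```python
# Sketch: format one detection example for instruction tuning
# (the template and JSON-style fields are hypothetical).
def make_sample(text, label, level="sentence"):
    """Turn a labeled text into an instruction-response training pair."""
    return {
        "instruction": f"Decide whether the following {level} was written by a "
                       f"human or generated by an AI model. Answer 'Human' or 'AI'.",
        "input": text,
        "output": "AI" if label == 1 else "Human",
    }

sample = make_sample("今天天气很好。", label=1)
# A corpus of such samples is then used to fine-tune an open-source LLM.
```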

We propose a novel single-shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector via a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground truth, i.e., no extra annotation is required. In conjunction with it, we employ a global activation module that learns the relationships between channels and object classes in a self-supervised manner. Comprehensive experimental results on both the PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16-based DES, we achieve an mAP of 81.7 on VOC2007 test and an mAP of 32.8 on COCO test-dev, with an inference speed of 31.5 milliseconds per image on a Titan Xp GPU. With a lower-resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference speed of 13.0 milliseconds per image.
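
One way to realize a global activation module is squeeze-and-excitation-style channel recalibration with an auxiliary class head that ties channels to object classes; the sketch below takes that form as an assumption, as the paper defines the exact design.

```python
# Sketch: global pooling -> channel gates, plus a class head linking channels
# to object classes (an assumed realization, not DES's exact module).
import torch
import torch.nn as nn

class GlobalActivation(nn.Module):
    def __init__(self, ch=512, classes=20, r=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                nn.Linear(ch // r, ch), nn.Sigmoid())
        self.cls = nn.Linear(ch, classes)     # auxiliary channel-to-class head

    def forward(self, x):                     # x: (N, C, H, W) detection features
        g = x.mean(dim=(2, 3))                # global average pool -> (N, C)
        x = x * self.fc(g)[:, :, None, None]  # channel-wise recalibration
        return x, self.cls(g)                 # reweighted features + class logits

feats = torch.randn(2, 512, 38, 38)
out, logits = GlobalActivation()(feats)
```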
