国产欧美日韩综合在线,污黄视频免费观看的网站网址

Speaker adaptation, which involves cloning voices from unseen speakers in the Text-to-Speech task, has garnered significant interest due to its numerous applications in multi-media fields. Despite recent advancements, existing methods often struggle with inadequate speaker representation accuracy and overfitting, particularly in limited reference speeches scenarios. To address these challenges, we propose an Agile Speaker Representation Reinforcement Learning strategy to enhance speaker similarity in speaker adaptation tasks. ASRRL is the first work to apply reinforcement learning to improve the modeling accuracy of speaker embeddings in speaker adaptation, addressing the challenge of decoupling voice content and timbre. Our approach introduces two action strategies tailored to different reference speeches scenarios. In the single-sentence scenario, a knowledge-oriented optimal routine searching RL method is employed to expedite the exploration and retrieval of refinement information on the fringe of speaker representations. In the few-sentence scenario, we utilize a dynamic RL method to adaptively fuse reference speeches, enhancing the robustness and accuracy of speaker modeling. To achieve optimal results in the target domain, a multi-scale fusion scoring mechanism based reward model that evaluates speaker similarity, speech quality, and intelligibility across three dimensions is proposed, ensuring that improvements in speaker similarity do not compromise speech quality or intelligibility. The experimental results on the LibriTTS and VCTK datasets within mainstream TTS frameworks demonstrate the extensibility and generalization capabilities of the proposed ASRRL method. The results indicate that the ASRRL method significantly outperforms traditional fine-tuning approaches, achieving higher speaker similarity and better overall speech quality with limited reference speeches.

相關內容

相似度

關注 2

流 · 優化器 · 可辨認的 · 哈希學習 · 可約的 ·

2024 年 8 月 23 日

2FA Sketch: Two-Factor Armor Sketch for Accurate and Efficient Heavy Hitter Detection in Data Streams

Xilai Liu,Xinyi Zhang,Bingqing Liu,Tao Li,Tong Yang,Gaogang Xie

from arxiv, 12 pages, 7 figures, added new contributions, supplemented experiments, and restructured the entire paper

Detecting heavy hitters, which are flows exceeding a specified threshold, is crucial for network measurement, but it faces challenges due to increasing throughput and memory constraints. Existing sketch-based solutions, particularly those using Comparative Counter Voting, have limitations in efficiently identifying heavy hitters. This paper introduces the Two-Factor Armor (2FA) Sketch, a novel data structure designed to enhance heavy hitter detection in data streams. 2FA Sketch implements dual-layer protection through an improved $\mathtt{Arbitration}$ strategy for in-bucket competition and a cross-bucket conflict $\mathtt{Avoidance}$ hashing scheme. By theoretically deriving an optimal $\lambda$ parameter and redesigning $vote^+_{new}$ as a conflict indicator, it optimizes the Comparative Counter Voting strategy. Experimental results show that 2FA Sketch outperforms the standard Elastic Sketch, reducing error rates by 2.5 to 19.7 times and increasing processing speed by 1.03 times.

Siamese · Wi-Fi · Networking · 孿生網絡 · Performer ·

2024 年 8 月 20 日

CrossFi: A Cross Domain Wi-Fi Sensing Framework Based on Siamese Network

Zijian Zhao,Tingwei Chen,Zhijie Cai,Hang Li,Xiaoyang Li,Qimei Chen,Guangxu Zhu

In recent years, Wi-Fi sensing has garnered significant attention due to its numerous benefits, such as privacy protection, low cost, and penetration ability. Extensive research has been conducted in this field, focusing on areas such as gesture recognition, people identification, and fall detection. However, many data-driven methods encounter challenges related to domain shift, where the model fails to perform well in environments different from the training data. One major factor contributing to this issue is the limited availability of Wi-Fi sensing datasets, which makes models learn excessive irrelevant information and over-fit to the training set. Unfortunately, collecting large-scale Wi-Fi sensing datasets across diverse scenarios is a challenging task. To address this problem, we propose CrossFi, a siamese network-based approach that excels in both in-domain scenario and cross-domain scenario, including few-shot, zero-shot scenarios, and even works in few-shot new-class scenario where testing set contains new categories. The core component of CrossFi is a sample-similarity calculation network called CSi-Net, which improves the structure of the siamese network by using an attention mechanism to capture similarity information, instead of simply calculating the distance or cosine similarity. Based on it, we develop an extra Weight-Net that can generate a template for each class, so that our CrossFi can work in different scenarios. Experimental results demonstrate that our CrossFi achieves state-of-the-art performance across various scenarios. In gesture recognition task, our CrossFi achieves an accuracy of 98.17% in in-domain scenario, 91.72% in one-shot cross-domain scenario, 64.81% in zero-shot cross-domain scenario, and 84.75% in one-shot new-class scenario. To facilitate future research, we will release the code for our model upon publication.

數據集 · MoDELS · 估計/估計量 · Automator · 可約的 ·

2024 年 8 月 20 日

V-RoAst: A New Dataset for Visual Road Assessment

Natchapon Jongwiriyanurak,Zichao Zeng,June Moh Goo,Xinglei Wang,Ilya Ilyankou,Kerkritt Srirrongvikrai,Meihui Wang,James Haworth

Road traffic crashes cause millions of deaths annually and have a significant economic impact, particularly in low- and middle-income countries (LMICs). This paper presents an approach using Vision Language Models (VLMs) for road safety assessment, overcoming the limitations of traditional Convolutional Neural Networks (CNNs). We introduce a new task ,V-RoAst (Visual question answering for Road Assessment), with a real-world dataset. Our approach optimizes prompt engineering and evaluates advanced VLMs, including Gemini-1.5-flash and GPT-4o-mini. The models effectively examine attributes for road assessment. Using crowdsourced imagery from Mapillary, our scalable solution influentially estimates road safety levels. In addition, this approach is designed for local stakeholders who lack resources, as it does not require training data. It offers a cost-effective and automated methods for global road safety assessments, potentially saving lives and reducing economic burdens.

2024 年 8 月 20 日

F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods

Yu Sun,Keyu Chen,Shujie Wang,Peiji Li,Qipeng Guo,Hang Yan,Xipeng Qiu,Xuanjing Huang,Dahua Lin

from arxiv, ACL 2024

Large language models (LLMs) garner significant attention for their unprecedented performance, leading to an increasing number of researches evaluating LLMs. However, these evaluation benchmarks are limited to assessing the instruction-following capabilities, overlooking the fundamental abilities that emerge during the pre-training stage. Previous subjective evaluation methods mainly reply on scoring by API models. However, in the absence of references, large models have shown limited ability to discern subtle differences. To bridge the gap, we propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic. The tasks in F-Eval include multi-choice objective tasks, open-ended objective tasks, reference-based subjective tasks and reference-free subjective tasks. For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models. We conduct evaluations on 13 advanced LLMs. Results show that our evaluation methods show higher correlation coefficients and larger distinction than other evaluators. Additionally, we discuss the influence of different model sizes, dimensions, and normalization methods. We anticipate that F-Eval will facilitate the study of LLMs' fundamental abilities.

Agent · 大語言模型 · Performer · HTTPS · 語言模型化 ·

2024 年 8 月 19 日

MegaAgent: A Practical Framework for Autonomous Cooperation in Large-Scale LLM Agent Systems

Qian Wang,Tianyu Wang,Qinbin Li,Jingsheng Liang,Bingsheng He

With the emergence of large language models (LLMs), LLM-powered multi-agent systems (LLM-MA systems) have been proposed to tackle real-world tasks. However, their agents mostly follow predefined Standard Operating Procedures (SOPs) that remain unchanged across the whole interaction, lacking autonomy and scalability. Additionally, current solutions often overlook the necessity for effective agent cooperation. To address the above limitations, we propose MegaAgent, a practical framework designed for autonomous cooperation in large-scale LLM Agent systems. MegaAgent leverages the autonomy of agents to dynamically generate agents based on task requirements, incorporating features such as automatically dividing tasks, systematic planning and monitoring of agent activities, and managing concurrent operations. In addition, MegaAgent is designed with a hierarchical structure and employs system-level parallelism to enhance performance and boost communication. We demonstrate the effectiveness of MegaAgent through Gobang game development, showing that it outperforms popular LLM-MA systems; and national policy simulation, demonstrating its high autonomy and potential to rapidly scale up to 590 agents while ensuring effective cooperation among them. Our results indicate that MegaAgent is the first autonomous large-scale LLM-MA system with no pre-defined SOPs, high effectiveness and scalability, paving the way for further research in this field. Our code is at //anonymous.4open.science/r/MegaAgent-81F3.

SOFT · contrastive · 掩碼 · MoDELS · 樣本 ·

2024 年 8 月 17 日

EEG-SCMM: Soft Contrastive Masked Modeling for Cross-Corpus EEG-Based Emotion Recognition

Qile Liu,Weishan Ye,Yulu Liu,Zhen Liang

from arxiv, 16 pages, 8 figures, 15 tables, submitted to AAAI 2025

Emotion recognition using electroencephalography (EEG) signals has garnered widespread attention in recent years. However, existing studies have struggled to develop a sufficiently generalized model suitable for different datasets without re-training (cross-corpus). This difficulty arises because distribution differences across datasets far exceed the intra-dataset variability. To solve this problem, we propose a novel Soft Contrastive Masked Modeling (SCMM) framework. Inspired by emotional continuity, SCMM integrates soft contrastive learning with a new hybrid masking strategy to effectively mine the "short-term continuity" characteristics inherent in human emotions. During the self-supervised learning process, soft weights are assigned to sample pairs, enabling adaptive learning of similarity relationships across samples. Furthermore, we introduce an aggregator that weightedly aggregates complementary information from multiple close samples based on pairwise similarities among samples to enhance fine-grained feature representation, which is then used for original sample reconstruction. Extensive experiments on the SEED, SEED-IV and DEAP datasets show that SCMM achieves state-of-the-art (SOTA) performance, outperforming the second-best method by an average accuracy of 4.26% under two types of cross-corpus conditions (same-class and different-class) for EEG-based emotion recognition.

圖 · INFORMS · 變換 · 模型評估 · Graph Transformer ·

2024 年 8 月 16 日

SE-SGformer: A Self-Explainable Signed Graph Transformer for Link Sign Prediction

Lu Li,Jiale Liu,Xingyu Ji,Maojun Wang,Zeyu Zhang

Signed Graph Neural Networks (SGNNs) have been shown to be effective in analyzing complex patterns in real-world situations where positive and negative links coexist. However, SGNN models suffer from poor explainability, which limit their adoptions in critical scenarios that require understanding the rationale behind predictions. To the best of our knowledge, there is currently no research work on the explainability of the SGNN models. Our goal is to address the explainability of decision-making for the downstream task of link sign prediction specific to signed graph neural networks. Since post-hoc explanations are not derived directly from the models, they may be biased and misrepresent the true explanations. Therefore, in this paper we introduce a Self-Explainable Signed Graph transformer (SE-SGformer) framework, which can not only outputs explainable information while ensuring high prediction accuracy. Specifically, We propose a new Transformer architecture for signed graphs and theoretically demonstrate that using positional encoding based on signed random walks has greater expressive power than current SGNN methods and other positional encoding graph Transformer-based approaches. We constructs a novel explainable decision process by discovering the $K$-nearest (farthest) positive (negative) neighbors of a node to replace the neural network-based decoder for predicting edge signs. These $K$ positive (negative) neighbors represent crucial information about the formation of positive (negative) edges between nodes and thus can serve as important explanatory information in the decision-making process. We conducted experiments on several real-world datasets to validate the effectiveness of SE-SGformer, which outperforms the state-of-the-art methods by improving 2.2\% prediction accuracy and 73.1\% explainablity accuracy in the best-case scenario.

大語言模型 · 語言模型化 · MoDELS · 推斷 · Processing（編程語言） ·

2024 年 8 月 14 日

LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference

Seungjae Moon,Jung-Hoon Kim,Junsoo Kim,Seongmin Hong,Junseo Cha,Minsu Kim,Sukbin Lim,Gyubin Choi,Dongjin Seo,Jongho Kim,Hunjong Lee,Hyunjun Park,Ryeowook Ko,Soongyu Choi,Jongse Park,Jinwon Lee,Joo-Young Kim

The explosive arrival of OpenAI's ChatGPT has fueled the globalization of large language model (LLM), which consists of billions of pretrained parameters that embodies the aspects of syntax and semantics. HyperAccel introduces latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for the acceleration of LLM inference. LPU perfectly balances the memory bandwidth and compute logic with streamlined dataflow to maximize performance and efficiency. LPU is equipped with expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements LPU as an intuitive software framework to run LLM applications. LPU achieves 1.25 ms/token and 20.9 ms/token for 1.3B and 66B model, respectively, which is 2.09x and 1.37x faster than the GPU. LPU, synthesized using Samsung 4nm process, has total area of 0.824 mm2 and power consumption of 284.31 mW. LPU-based servers achieve 1.33x and 1.32x energy efficiency over NVIDIA H100 and L4 servers, respectively.

INTERACT · 協同過濾 · 圖 · SimPLe · 特征提取 ·

2024 年 8 月 11 日

GraphTransfer: A Generic Feature Fusion Framework for Collaborative Filtering

Jiafeng Xia,Dongsheng Li,Hansu Gu,Tun Lu,Ning Gu

Graph Neural Networks (GNNs) have demonstrated effectiveness in collaborative filtering tasks due to their ability to extract powerful structural features. However, combining the graph features extracted from user-item interactions and auxiliary features extracted from user genres and item properties remains a challenge. Currently available fusion methods face two major issues: 1) simple methods such as concatenation and summation are generic, but not accurate in capturing feature relationships; 2) task-specific methods like attention mechanisms and meta paths may not be suitable for general feature fusion. To address these challenges, we present GraphTransfer, a simple but universal feature fusion framework for GNN-based collaborative filtering. Our method accurately fuses different types of features by first extracting graph features from the user-item interaction graph and auxiliary features from users and items using GCN. The proposed cross fusion module then effectively bridges the semantic gaps between the interaction scores of different features. Theoretical analysis and experiments on public datasets show that GraphTransfer outperforms other feature fusion methods in CF tasks. Additionally, we demonstrate the universality of our framework via empirical studies in three other scenarios, showing that GraphTransfer leads to significant improvements in the performance of CF algorithms.

INFORMS · 語音識別 · 自動語音識別 · Performer · 回合 ·

2024 年 8 月 11 日

LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition

Eunseop Yoon,Hee Suk Yoon,John Harvill,Mark Hasegawa-Johnson,Chang D. Yoo

from arxiv, INTERSPEECH 2024

Test-Time Adaptation (TTA) has emerged as a crucial solution to the domain shift challenge, wherein the target environment diverges from the original training environment. A prime exemplification is TTA for Automatic Speech Recognition (ASR), which enhances model performance by leveraging output prediction entropy minimization as a self-supervision signal. However, a key limitation of this self-supervision lies in its primary focus on acoustic features, with minimal attention to the linguistic properties of the input. To address this gap, we propose Language Informed Test-Time Adaptation (LI-TTA), which incorporates linguistic insights during TTA for ASR. LI-TTA integrates corrections from an external language model to merge linguistic with acoustic information by minimizing the CTC loss from the correction alongside the standard TTA loss. With extensive experiments, we show that LI-TTA effectively improves the performance of TTA for ASR in various distribution shift situations.