国产精品亚洲综合久久,甜味弥漫一区二区在线观看,亚洲国产人成中文幕一级二级

In singing voice synthesis (SVS), generating singing voices from musical scores faces challenges due to limited data availability, a constraint less common in text-to-speech (TTS). This study proposes a new approach to address this data scarcity. We utilize an existing singing voice synthesizer for data augmentation and apply precise manual tuning to reduce unnatural voice synthesis. Our development of two extensive singing voice corpora, ACE-Opencpop and KiSing-v2, facilitates large-scale, multi-singer voice synthesis. Utilizing pre-trained models derived from these corpora, we achieve notable improvements in voice quality, evident in both in-domain and out-of-domain scenarios. The corpora, pre-trained models, and their related training recipes are publicly available at Muskits-ESPnet (//github.com/espnet/espnet).

相關內容

語音合成

關注 491

語音合成（Speech Synthesis），也稱為文語轉換（Text-to-Speech, TTS,它是將任意的輸入文本轉換成自然流暢的語音輸出。語音合成涉及到人工智能、心理學、聲學、語言學、數字信號處理、計算機科學等多個學科技術，是信息處理領域中的一項前沿技術。隨著計算機技術的不斷提高，語音合成技術從早期的共振峰合成,逐步發展為波形拼接合成和統計參數語音合成，再發展到混合語音合成；合成語音的質量、自然度已經得到明顯提高，基本能滿足一些特定場合的應用需求。目前，語音合成技術在銀行、醫院等的信息播報系統、汽車導航系統、自動應答呼叫中心等都有廣泛應用，取得了巨大的經濟效益。另外，隨著智能手機、MP3、PDA 等與我們生活密切相關的媒介的大量涌現，語音合成的應用也在逐漸向娛樂、語音教學、康復治療等領域深入。可以說語音合成正在影響著人們生活的方方面面。

大語言模型 · 混合 · MoDELS · Learning · Branch ·

2024 年 3 月 12 日

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

Sainbayar Sukhbaatar,Olga Golovneva,Vasu Sharma,Hu Xu,Xi Victoria Lin,Baptiste Rozière,Jacob Kahn,Daniel Li,Wen-tau Yih,Jason Weston,Xian Li

We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts are asynchronously trained, BTX brings together their feedforward parameters as experts in Mixture-of-Expert (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.

分離的 · 無監督 · 模態 · CLIP · Performer ·

2024 年 3 月 11 日

Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation

Xinyao Li,Yuke Li,Zhekai Du,Fengling Li,Ke Lu,Jingjing Li

from arxiv, CVPR 2024 camera ready

Large vision-language models (VLMs) like CLIP have demonstrated good zero-shot learning performance in the unsupervised domain adaptation task. Yet, most transfer approaches for VLMs focus on either the language or visual branches, overlooking the nuanced interplay between both modalities. In this work, we introduce a Unified Modality Separation (UniMoS) framework for unsupervised domain adaptation. Leveraging insights from modality gap studies, we craft a nimble modality separation network that distinctly disentangles CLIP's features into language-associated and vision-associated components. Our proposed Modality-Ensemble Training (MET) method fosters the exchange of modality-agnostic information while maintaining modality-specific nuances. We align features across domains using a modality discriminator. Comprehensive evaluations on three benchmarks reveal our approach sets a new state-of-the-art with minimal computational costs. Code: //github.com/TL-UESTC/UniMoS

INTERACT · 磁流變材料 · 設計 · 增強現實（AR） · 3D ·

2024 年 3 月 11 日

Data Cubes in Hand: A Design Space of Tangible Cubes for Visualizing 3D Spatio-Temporal Data in Mixed Reality

Shuqi He,Haonan Yao,Luyan Jiang,Kaiwen Li,Nan Xiang,Yue Li,Hai-Ning Liang,Lingyun Yu

Tangible interfaces in mixed reality (MR) environments allow for intuitive data interactions. Tangible cubes, with their rich interaction affordances, high maneuverability, and stable structure, are particularly well-suited for exploring multi-dimensional data types. However, the design potential of these cubes is underexplored. This study introduces a design space for tangible cubes in MR, focusing on interaction space, visualization space, sizes, and multiplicity. Using spatio-temporal data, we explored the interaction affordances of these cubes in a workshop (N=24). We identified unique interactions like rotating, tapping, and stacking, which are linked to augmented reality (AR) visualization commands. Integrating user-identified interactions, we created a design space for tangible-cube interactions and visualization. A prototype visualizing global health spending with small cubes was developed and evaluated, supporting both individual and combined cube manipulation. This research enhances our grasp of tangible interaction in MR, offering insights for future design and application in diverse data contexts.

Performer · 知識 (knowledge) · MoDELS · 可理解性 · Learning ·

2024 年 3 月 11 日

RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback

Yanming Liu,Xinyue Peng,Xuhong Zhang,Weihao Liu,Jianwei Yin,Jiannan Cao,Tianyu Du

from arxiv, 15 pages, 4 figures. Providing first version RA-ISF

Large language models (LLMs) demonstrate exceptional performance in numerous tasks but still heavily rely on knowledge stored in their parameters. Moreover, updating this knowledge incurs high training costs. Retrieval-augmented generation (RAG) methods address this issue by integrating external knowledge. The model can answer questions it couldn't previously by retrieving knowledge relevant to the query. This approach improves performance in certain scenarios for specific tasks. However, if irrelevant texts are retrieved, it may impair model performance. In this paper, we propose Retrieval Augmented Iterative Self-Feedback (RA-ISF), a framework that iteratively decomposes tasks and processes them in three submodules to enhance the model's problem-solving capabilities. Experiments show that our method outperforms existing benchmarks, performing well on models like GPT3.5, Llama2, significantly enhancing factual reasoning capabilities and reducing hallucinations.

Attention · DCN · Learning · Softmax · 估計/估計量 ·

2024 年 3 月 11 日

LVC-LGMC: Joint Local and Global Motion Compensation for Learned Video Compression

Wei Jiang,Junru Li,Kai Zhang,Li Zhang

from arxiv, Accepted to ICASSP 2024 (lecture presentation). The first attempt to use cross attention for bits-free motion estimation and motion compensation

Existing learned video compression models employ flow net or deformable convolutional networks (DCN) to estimate motion information. However, the limited receptive fields of flow net and DCN inherently direct their attentiveness towards the local contexts. Global contexts, such as large-scale motions and global correlations among frames are ignored, presenting a significant bottleneck for capturing accurate motions. To address this issue, we propose a joint local and global motion compensation module (LGMC) for leaned video coding. More specifically, we adopt flow net for local motion compensation. To capture global context, we employ the cross attention in feature domain for motion compensation. In addition, to avoid the quadratic complexity of vanilla cross attention, we divide the softmax operations in attention into two independent softmax operations, leading to linear complexity. To validate the effectiveness of our proposed LGMC, we integrate it with DCVC-TCM and obtain learned video compression with joint local and global motion compensation (LVC-LGMC). Extensive experiments demonstrate that our LVC-LGMC has significant rate-distortion performance improvements over baseline DCVC-TCM.

MoDELS · Nuance · Stable Diffusion · state-of-the-art · Integration ·

2024 年 3 月 11 日

Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation

Shih-Ying Yeh,Yu-Guan Hsieh,Zhidong Gao,Bernard B W Yang,Giyeong Oh,Yanmin Gong

from arxiv, In International Conference on Learning Representations 12 (ICLR 2024) [79 pages, 54 figures, 7 tables]

Text-to-image generative models have garnered immense attention for their ability to produce high-fidelity images from text prompts. Among these, Stable Diffusion distinguishes itself as a leading open-source model in this fast-growing field. However, the intricacies of fine-tuning these models pose multiple challenges from new methodology integration to systematic evaluation. Addressing these issues, this paper introduces LyCORIS (Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion) [//github.com/KohakuBlueleaf/LyCORIS], an open-source library that offers a wide selection of fine-tuning methodologies for Stable Diffusion. Furthermore, we present a thorough framework for the systematic assessment of varied fine-tuning techniques. This framework employs a diverse suite of metrics and delves into multiple facets of fine-tuning, including hyperparameter adjustments and the evaluation with different prompt types across various concept categories. Through this comprehensive approach, our work provides essential insights into the nuanced effects of fine-tuning parameters, bridging the gap between state-of-the-art research and practical application.

Processing（編程語言） · 語言模型化 · Performer · MoDELS · 大語言模型 ·

2024 年 3 月 10 日

ArgMed-Agents: Explainable Clinical Decision Reasoning with Large Language Models via Argumentation Schemes

Shengxin Hong,Liang Xiao,Xin Zhang,Jianxia Chen

There are two main barriers to using large language models (LLMs) in clinical reasoning. Firstly, while LLMs exhibit significant promise in Natural Language Processing (NLP) tasks, their performance in complex reasoning and planning falls short of expectations. Secondly, LLMs use uninterpretable methods to make clinical decisions that are fundamentally different from the clinician's cognitive processes. This leads to user distrust. In this paper, we present a multi-agent framework called ArgMed-Agents, which aims to enable LLM-based agents to make explainable clinical decision reasoning through interaction. ArgMed-Agents performs self-argumentation iterations via Argumentation Scheme for Clinical Decision (a reasoning mechanism for modeling cognitive processes in clinical reasoning), and then constructs the argumentation process as a directed graph representing conflicting relationships. Ultimately, Reasoner(a symbolic solver) identify a series of rational and coherent arguments to support decision. ArgMed-Agents enables LLMs to mimic the process of clinical argumentative reasoning by generating explanations of reasoning in a self-directed manner. The setup experiments show that ArgMed-Agents not only improves accuracy in complex clinical decision reasoning problems compared to other prompt methods, but more importantly, it provides users with decision explanations that increase their confidence.

Learning · 表示 · 變換 · contrastive · 表示學習 ·

2024 年 3 月 10 日

MART: Learning Hierarchical Music Audio Representations with Part-Whole Transformer

Dong Yao,Jieming Zhu,Jiahao Xun,Shengyu Zhang,Zhou Zhao,Liqun Deng,Wenqiao Zhang,Zhenhua Dong,Xin Jiang

Recent research in self-supervised contrastive learning of music representations has demonstrated remarkable results across diverse downstream tasks. However, a prevailing trend in existing methods involves representing equally-sized music clips in either waveform or spectrogram formats, often overlooking the intrinsic part-whole hierarchies within music. In our quest to comprehend the bottom-up structure of music, we introduce MART, a hierarchical music representation learning approach that facilitates feature interactions among cropped music clips while considering their part-whole hierarchies. Specifically, we propose a hierarchical part-whole transformer to capture the structural relationships between music clips in a part-whole hierarchy. Furthermore, a hierarchical contrastive learning objective is crafted to align part-whole music representations at adjacent levels, progressively establishing a multi-hierarchy representation space. The effectiveness of our music representation learning from part-whole hierarchies has been empirically validated across multiple downstream tasks, including music classification and cover song identification.

DQN · INFORMS · 優化器 · Integration · 回合 ·

2024 年 3 月 7 日

ComTraQ-MPC: Meta-Trained DQN-MPC Integration for Trajectory Tracking with Limited Active Localization Updates

Gokul Puthumanaillam,Manav Vora,Melkior Ornik

from arxiv, * Equal contribution

Optimal decision-making for trajectory tracking in partially observable, stochastic environments where the number of active localization updates -- the process by which the agent obtains its true state information from the sensors -- are limited, presents a significant challenge. Traditional methods often struggle to balance resource conservation, accurate state estimation and precise tracking, resulting in suboptimal performance. This problem is particularly pronounced in environments with large action spaces, where the need for frequent, accurate state data is paramount, yet the capacity for active localization updates is restricted by external limitations. This paper introduces ComTraQ-MPC, a novel framework that combines Deep Q-Networks (DQN) and Model Predictive Control (MPC) to optimize trajectory tracking with constrained active localization updates. The meta-trained DQN ensures adaptive active localization scheduling, while the MPC leverages available state information to improve tracking. The central contribution of this work is their reciprocal interaction: DQN's update decisions inform MPC's control strategy, and MPC's outcomes refine DQN's learning, creating a cohesive, adaptive system. Empirical evaluations in simulated and real-world settings demonstrate that ComTraQ-MPC significantly enhances operational efficiency and accuracy, providing a generalizable and approximately optimal solution for trajectory tracking in complex partially observable environments.

Single-Shot · Branch · 目標檢測 · 推斷 · MS ·

2018 年 4 月 8 日

Single-Shot Object Detection with Enriched Semantics

Zhishuai Zhang,Siyuan Qiao,Cihang Xie,Wei Shen,Bo Wang,Alan L. Yuille

We propose a novel single shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector, by a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground-truth, i.e., no extra annotation is required. In conjunction with that, we employ a global activation module which learns relationship between channels and object classes in a self-supervised manner. Comprehensive experimental results on both PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16 based DES, we achieve an mAP of 81.7 on VOC2007 test and an mAP of 32.8 on COCO test-dev with an inference speed of 31.5 milliseconds per image on a Titan Xp GPU. With a lower resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference speed of 13.0 milliseconds per image.