This paper introduces VoxHakka, a text-to-speech (TTS) system designed for Taiwanese Hakka, a critically under-resourced language spoken in Taiwan. Leveraging the YourTTS framework, VoxHakka achieves high naturalness and accuracy and a low real-time factor in speech synthesis while supporting six distinct Hakka dialects. This is achieved by training the model with dialect-specific data, allowing for the generation of speaker-aware Hakka speech. To address the scarcity of publicly available Hakka speech corpora, we employed a cost-effective web scraping pipeline coupled with automatic speech recognition (ASR)-based data cleaning, yielding a high-quality, multi-speaker, multi-dialect dataset suitable for TTS training. Subjective listening tests using comparative mean opinion scores (CMOS) demonstrate that VoxHakka significantly outperforms existing publicly available Hakka TTS systems in pronunciation accuracy, tone correctness, and overall naturalness. This work represents a significant advance in Hakka language technology and provides a valuable resource for language preservation and revitalization efforts.
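As a rough illustration of how such ASR-based data cleaning might work, the sketch below keeps a scraped utterance only when an ASR transcript closely matches its reference text. The transcribe callable and the 0.1 character-error-rate threshold are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of ASR-based corpus filtering: keep a scraped
# utterance only if an ASR transcript agrees closely with its text.
from typing import Callable

def cer(ref: str, hyp: str) -> float:
    """Character error rate via single-row Levenshtein distance."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1] / max(len(ref), 1)

def keep_utterance(audio_path: str, reference: str,
                   asr: Callable[[str], str],
                   threshold: float = 0.1) -> bool:
    """Accept the clip only if the ASR hypothesis is close to its text."""
    return cer(reference, asr(audio_path)) <= threshold
```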
We introduce FinDVer, a comprehensive benchmark specifically designed to evaluate the explainable claim verification capabilities of LLMs over long, hybrid-content financial documents. FinDVer contains 2,400 expert-annotated examples divided into three subsets: information extraction, numerical reasoning, and knowledge-intensive reasoning, each addressing common scenarios encountered in real-world financial contexts. We assess a broad spectrum of LLMs under long-context and RAG settings. Our results show that even the current best-performing system, GPT-4o, still lags behind human experts. We further provide in-depth analyses of the long-context and RAG settings, Chain-of-Thought reasoning, and model reasoning errors, offering insights to drive future advancements. We believe FinDVer can serve as a valuable benchmark for evaluating LLMs in claim verification over complex, expert-domain documents.
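For intuition, here is a minimal sketch of what a RAG-style verification harness for such a benchmark might look like. The embed and llm callables, the chunk size, top-k, and the entailed/refuted label set are all assumptions for illustration, not FinDVer's actual protocol.

```python
# Hypothetical RAG-style claim verification over a long document:
# rank fixed-size chunks by cosine similarity to the claim, then ask
# an LLM for a verdict with a brief explanation.
from typing import Callable, List
import math

def top_k_chunks(document: str, claim: str,
                 embed: Callable[[str], List[float]],
                 k: int = 5, chunk_chars: int = 1000) -> List[str]:
    chunks = [document[i:i + chunk_chars]
              for i in range(0, len(document), chunk_chars)]
    q = embed(claim)
    qn = math.sqrt(sum(a * a for a in q))
    def cos(v: List[float]) -> float:  # cosine similarity to the claim
        dot = sum(a * b for a, b in zip(q, v))
        return dot / (qn * math.sqrt(sum(b * b for b in v)) + 1e-9)
    return sorted(chunks, key=lambda c: cos(embed(c)), reverse=True)[:k]

def verify(document: str, claim: str, embed, llm) -> str:
    evidence = "\n---\n".join(top_k_chunks(document, claim, embed))
    prompt = (f"Evidence:\n{evidence}\n\nClaim: {claim}\n"
              "Answer 'entailed' or 'refuted' and explain briefly.")
    return llm(prompt)
```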
The User-Managed Access (UMA) extension to OAuth 2.0 is a promising candidate for increasing Digital Trust in personal data ecosystems like Solid. With minor modifications, it can achieve many requirements regarding usage control and transaction contextualization, even though additional specification is needed to address delegation of control and retraction of usage policies.
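As a concrete glimpse of the protocol machinery involved, the sketch below shows the UMA 2.0 grant's token request, in which a client exchanges a permission ticket for a Requesting Party Token (RPT). The endpoint URL and credentials are placeholders, and the surrounding flow (permission ticket issuance, claims gathering, policy evaluation) is omitted; see the UMA 2.0 Grant for OAuth 2.0 Authorization specification for the full exchange.

```python
# Minimal sketch of the UMA 2.0 token request. Assumes the resource
# server has already returned a permission ticket to the client.
import requests

def request_rpt(token_endpoint: str, ticket: str,
                client_id: str, client_secret: str) -> str:
    """Exchange a permission ticket for a Requesting Party Token."""
    resp = requests.post(token_endpoint, data={
        "grant_type": "urn:ietf:params:oauth:grant-type:uma-ticket",
        "ticket": ticket,
        "client_id": client_id,
        "client_secret": client_secret,
    })
    resp.raise_for_status()
    return resp.json()["access_token"]  # the RPT
```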
Automatic Speech Translation (AST) datasets for Indian languages remain critically scarce, with public resources covering fewer than 10 of the 22 official languages. This scarcity has resulted in AST systems for Indian languages lagging far behind those available for high-resource languages like English. In this paper, we first evaluate the performance of widely used AST systems on Indian languages, identifying notable performance gaps and challenges. Our findings show that while these systems perform adequately on read speech, they struggle significantly with spontaneous speech, including disfluencies like pauses and hesitations. Additionally, there is a striking absence of systems capable of accurately translating colloquial and informal language, a key aspect of everyday communication. To this end, we introduce BhasaAnuvaad, the largest publicly available dataset for AST, involving 13 of the 22 scheduled Indian languages and English and spanning over 44,400 hours and 17M text segments. BhasaAnuvaad contains data for English speech to Indic text, as well as Indic speech to English text. This dataset comprises three key categories: (1) Curated datasets from existing resources, (2) Large-scale web mining, and (3) Synthetic data generation. By offering this diverse and expansive dataset, we aim to bridge the resource gap and promote advancements in AST for Indian languages, especially in handling spontaneous and informal speech patterns.
This paper proposes ProEdit, a simple yet effective framework for high-quality 3D scene editing guided by diffusion distillation in a novel progressive manner. Inspired by the crucial observation that multi-view inconsistency in scene editing is rooted in the diffusion model's large feasible output space (FOS), our framework controls the size of the FOS and reduces inconsistency by decomposing the overall editing task into several subtasks, which are then executed progressively on the scene. Within this framework, we design a difficulty-aware subtask decomposition scheduler and an adaptive 3D Gaussian splatting (3DGS) training strategy, ensuring high quality and efficiency in performing each subtask. Extensive evaluation shows that ProEdit achieves state-of-the-art results on various scenes and challenging editing tasks, all through a simple framework without any expensive or sophisticated add-ons like distillation losses, components, or training procedures. Notably, ProEdit also provides a new way to control, preview, and select the "aggressivity" of the editing operation during the editing process.
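To make the progressive idea concrete, here is a hypothetical sketch of difficulty-aware decomposition: a harder edit is split into more subtasks, each distilled into the scene in turn so that the feasible output space stays small at every step. The estimate_difficulty and edit_step callables are invented stand-ins, not the paper's actual scheduler or training strategy.

```python
# Illustrative sketch of progressive subtask decomposition in the
# spirit of ProEdit: interpolate toward the full edit in small steps.
from typing import Callable

def progressive_edit(scene: object, prompt: str,
                     estimate_difficulty: Callable[[str], float],  # in [0, 1]
                     edit_step: Callable[[object, str, float], object],
                     max_subtasks: int = 8) -> object:
    # Harder edits are decomposed into more, smaller subtasks.
    n = max(1, min(max_subtasks,
                   round(estimate_difficulty(prompt) * max_subtasks)))
    for i in range(1, n + 1):
        strength = i / n  # fraction of the full edit targeted so far
        scene = edit_step(scene, prompt, strength)  # one 3DGS training round
    return scene
```

Stopping the loop early at an intermediate strength is one way to realize the "aggressivity" preview and selection the abstract describes.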
This paper introduces the Asymptotic-Preserving Random Feature Method (APRFM) for the efficient resolution of multiscale radiative transfer equations. The APRFM effectively addresses the challenges posed by stiffness and multiscale characteristics inherent in radiative transfer equations through the application of a micro-macro decomposition strategy. This approach decomposes the distribution function into equilibrium and non-equilibrium components, allowing for the approximation of both parts through the random feature method (RFM) within a least squares minimization framework. The proposed method exhibits remarkable robustness across different scales and achieves high accuracy with fewer degrees of freedom and collocation points than the vanilla RFM. Additionally, compared to the deep neural network-based method, our approach offers significant advantages in terms of parameter efficiency and computational speed. These benefits have been substantiated through numerous numerical experiments conducted on both one- and two-dimensional problems.
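For concreteness, the micro-macro decomposition underlying the approach can be written in a standard form (a sketch only; the paper's exact scaling, notation, and boundary treatment may differ):

```latex
% Micro-macro decomposition of the radiative intensity f(x,\Omega,t):
% an equilibrium part \rho (its angular average) plus a non-equilibrium
% remainder g with zero angular average, each approximated by the RFM.
\[
  f(x,\Omega,t) = \rho(x,t) + g(x,\Omega,t), \qquad
  \rho = \langle f \rangle
       = \frac{1}{|\mathbb{S}^{d-1}|}\int_{\mathbb{S}^{d-1}} f \,\mathrm{d}\Omega,
  \qquad \langle g \rangle = 0,
\]
```

where the angular average of the non-equilibrium part vanishes by construction; both components are then expanded in random feature bases and fitted jointly by least squares at the collocation points.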
Large Language Models (LLMs) demonstrate outstanding knowledge and understanding capabilities, but they have also been shown to be prone to illegal or unethical responses when subjected to jailbreak attacks. To ensure their responsible deployment in critical applications, it is crucial to understand the safety capabilities and vulnerabilities of LLMs. Previous works mainly focus on jailbreaks in single-round dialogue, overlooking the potential jailbreak risks in multi-round dialogues, which are a vital way humans interact with and extract information from LLMs. Recent studies have begun to address the risks of jailbreaks in multi-round dialogues, typically through manually crafted templates or prompt engineering techniques; however, due to the inherent complexity of multi-round dialogues, their jailbreak performance is limited. To solve this problem, we propose a novel multi-round dialogue jailbreaking agent, emphasizing the importance of stealthiness in identifying and mitigating potential threats to human values posed by LLMs. We propose a risk decomposition strategy that distributes risks across multiple rounds of queries and utilizes psychological strategies to enhance attack strength. Extensive experiments show that our proposed method surpasses other attack methods and achieves a state-of-the-art attack success rate. The corresponding code and dataset will be released soon for future research.
This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis. VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD) and an honest instruction dataset comprising both factual and deceptive questions (HnstD). Unlike prevailing remote sensing image-text datasets, in which image captions focus on a few prominent objects and their relationships, VersaD captions provide detailed information about image properties, object attributes, and the overall scene. This comprehensive captioning enables VHM to thoroughly understand remote sensing images and perform diverse remote sensing tasks. Moreover, unlike existing remote sensing instruction datasets that include only factual questions, HnstD contains additional deceptive questions stemming from the non-existence of objects. This feature prevents VHM from producing affirmative answers to nonsense queries, thereby ensuring its honesty. In our experiments, VHM significantly outperforms various vision language models on the common tasks of scene classification, visual question answering, and visual grounding. Additionally, VHM achieves competent performance on several previously unexplored tasks, such as building vectorization, multi-label classification, and honest question answering. We will release the code, data, and model weights at https://github.com/opendatalab/VHM.
We introduce AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, of which 7 are newly proposed. The evaluation targets three main aspects: speech understanding, audio scene understanding, and voice (paralinguistic) understanding. Despite recent advancements, there is no comprehensive benchmark for AudioLLMs' instruction-following capabilities conditioned on audio signals. AudioBench addresses this gap by providing datasets along with appropriate evaluation metrics. In addition, we evaluated five popular models and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-sourced evaluation toolkit, data, and leaderboard will offer a robust testbed for future model development.
In recent years, Face Image Quality Assessment (FIQA) has become an indispensable part of face recognition systems, guaranteeing stable and reliable recognition performance in unconstrained scenarios. For this purpose, an FIQA method should consider both the intrinsic properties and the recognizability of the face image. Most previous works estimate sample-wise embedding uncertainty or pair-wise similarity as the quality score, which draws only on partial intra-class information. These methods ignore the valuable inter-class information that is essential for estimating the recognizability of a face image. In this work, we argue that a high-quality face image should be similar to its intra-class samples and dissimilar to its inter-class samples. Thus, we propose a novel unsupervised FIQA method that incorporates Similarity Distribution Distance for Face Image Quality Assessment (SDD-FIQA). Our method generates quality pseudo-labels by calculating the Wasserstein Distance (WD) between the intra-class and inter-class similarity distributions. With these quality pseudo-labels, we can train a regression network for quality prediction. Extensive experiments on benchmark datasets demonstrate that the proposed SDD-FIQA surpasses the state of the art by an impressive margin. Meanwhile, our method shows good generalization across different recognition systems.
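The pseudo-label computation can be sketched compactly. The snippet below compares a sample's intra-class and inter-class cosine similarity distributions with the Wasserstein distance, in the spirit of SDD-FIQA; the normalization and sampling details are illustrative assumptions, not the paper's exact recipe.

```python
# Illustrative Wasserstein-distance quality pseudo-label: a larger gap
# between intra-class and inter-class similarity distributions means
# the face sits close to its own identity and far from others.
import numpy as np
from scipy.stats import wasserstein_distance

def quality_pseudo_label(embedding: np.ndarray,
                         intra_embs: np.ndarray,   # same-identity samples
                         inter_embs: np.ndarray    # other-identity samples
                         ) -> float:
    def cos_sims(others: np.ndarray) -> np.ndarray:
        e = embedding / np.linalg.norm(embedding)
        o = others / np.linalg.norm(others, axis=1, keepdims=True)
        return o @ e  # cosine similarities to this sample
    intra = cos_sims(intra_embs)
    inter = cos_sims(inter_embs)
    return wasserstein_distance(intra, inter)
```

These pseudo-labels would then supervise the quality regression network described in the abstract, so quality prediction at test time needs no identity labels.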