销魂美女一区二区三区AV_影888午夜理论不卡_一级A爱片视频在线观看国产_思思久久ER99精品亚洲_9691精品人妻无码久久久_久久91精品国产91久久小草_精品夜夜骚妇一区二区三区

2024 年 5 月 29 日

TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms

Yueyuan Sui,Minghui Zhao,Junxi Xia,Xiaofan Jiang,Stephen Xia

We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone conduction speech enhancement, suitable for mobile and wearable platforms. Bone conduction speech enhancement has been impractical to adopt in mobile and wearable platforms for several reasons: (i) data collection is labor-intensive, resulting in scarcity; (ii) there exists a performance gap between state of-art models with memory footprints of hundreds of MBs and methods better suited for resource-constrained systems. To adapt TRAMBA to vibration-based sensing modalities, we pre-train TRAMBA with audio speech datasets that are widely available. Then, users fine-tune with a small amount of bone conduction data. TRAMBA outperforms state-of-art GANs by up to 7.3% in PESQ and 1.8% in STOI, with an order of magnitude smaller memory footprint and an inference speed up of up to 465 times. We integrate TRAMBA into real systems and show that TRAMBA (i) improves battery life of wearables by up to 160% by requiring less data sampling and transmission; (ii) generates higher quality voice in noisy environments than over-the-air speech; (iii) requires a memory footprint of less than 20.0 MB.

相關內容

可穿戴設備

關注 127

可穿戴設備即直接穿在身上，或是整合到用戶的衣服或配件的一種便攜式設備。可穿戴設備不僅僅是一種硬件設備，更是通過軟件支持以及數據交互、云端交互來實現強大的功能，可穿戴設備將會對我們的生活、感知帶來很大的轉變。

Subspace · 隨機子空間 · LORA · 推斷 · 全 ·

2024 年 7 月 10 日

ROSA: Random Subspace Adaptation for Efficient Fine-Tuning

Marawan Gamal Abdel Hameed,Aristides Milios,Siva Reddy,Guillaume Rabusseau

Model training requires significantly more memory, compared with inference. Parameter efficient fine-tuning (PEFT) methods provide a means of adapting large models to downstream tasks using less memory. However, existing methods such as adapters, prompt tuning or low-rank adaptation (LoRA) either introduce latency overhead at inference time or achieve subpar downstream performance compared with full fine-tuning. In this work we propose Random Subspace Adaptation (ROSA), a method that outperforms previous PEFT methods by a significant margin, while maintaining a zero latency overhead during inference time. In contrast to previous methods, ROSA is able to adapt subspaces of arbitrarily large dimension, better approximating full-finetuning. We demonstrate both theoretically and experimentally that this makes ROSA strictly more expressive than LoRA, without consuming additional memory during runtime. As PEFT methods are especially useful in the natural language processing domain, where models operate on scales that make full fine-tuning very expensive, we evaluate ROSA in two common NLP scenarios: natural language generation (NLG) and natural language understanding (NLU) with GPT-2 and RoBERTa, respectively. We show that on almost every GLUE task ROSA outperforms LoRA by a significant margin, while also outperforming LoRA on NLG tasks. Our code is available at //github.com/rosa-paper/rosa

可理解性 · Next · 可辨認的 · Weight · 相同 ·

2024 年 7 月 9 日

Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology

Robin Netzorg,Alyssa Cote,Sumi Koshin,Klo Vivienne Garoute,Gopala Krishna Anumanchipalli

As experts in voice modification, trans-feminine gender-affirming voice teachers have unique perspectives on voice that confound current understandings of speaker identity. To demonstrate this, we present the Versatile Voice Dataset (VVD), a collection of three speakers modifying their voices along gendered axes. The VVD illustrates that current approaches in speaker modeling, based on categorical notions of gender and a static understanding of vocal texture, fail to account for the flexibility of the vocal tract. Utilizing publicly-available speaker embeddings, we demonstrate that gender classification systems are highly sensitive to voice modification, and speaker verification systems fail to identify voices as coming from the same speaker as voice modification becomes more drastic. As one path towards moving beyond categorical and static notions of speaker identity, we propose modeling individual qualities of vocal texture such as pitch, resonance, and weight.

tuning · Performance · 代價 · 查準率/準確率 · MoDELS ·

2024 年 7 月 9 日

PEER: Expertizing Domain-Specific Tasks with a Multi-Agent Framework and Tuning Methods

Yiying Wang,Xiaojing Li,Binzhu Wang,Yueyang Zhou,Han Ji,Hong Chen,Jinshi Zhang,Fei Yu,Zewei Zhao,Song Jin,Renji Gong,Wanqing Xu

In domain-specific applications, GPT-4, augmented with precise prompts or Retrieval-Augmented Generation (RAG), shows notable potential but faces the critical tri-lemma of performance, cost, and data privacy. High performance requires sophisticated processing techniques, yet managing multiple agents within a complex workflow often proves costly and challenging. To address this, we introduce the PEER (Plan, Execute, Express, Review) multi-agent framework. This systematizes domain-specific tasks by integrating precise question decomposition, advanced information retrieval, comprehensive summarization, and rigorous self-assessment. Given the concerns of cost and data privacy, enterprises are shifting from proprietary models like GPT-4 to custom models, striking a balance between cost, security, and performance. We developed industrial practices leveraging online data and user feedback for efficient model tuning. This study provides best practice guidelines for applying multi-agent systems in domain-specific problem-solving and implementing effective agent tuning strategies. Our empirical studies, particularly in the financial question-answering domain, demonstrate that our approach achieves 95.0% of GPT-4's performance, while effectively managing costs and ensuring data privacy.

多峰值 · 數據集 · INFORMS · 值域 · Processing（編程語言） ·

2024 年 7 月 8 日

MMIS: Multimodal Dataset for Interior Scene Visual Generation and Recognition

Hozaifa Kassab,Ahmed Mahmoud,Mohamed Bahaa,Ammar Mohamed,Ali Hamdi

We introduce MMIS, a novel dataset designed to advance MultiModal Interior Scene generation and recognition. MMIS consists of nearly 160,000 images. Each image within the dataset is accompanied by its corresponding textual description and an audio recording of that description, providing rich and diverse sources of information for scene generation and recognition. MMIS encompasses a wide range of interior spaces, capturing various styles, layouts, and furnishings. To construct this dataset, we employed careful processes involving the collection of images, the generation of textual descriptions, and corresponding speech annotations. The presented dataset contributes to research in multi-modal representation learning tasks such as image generation, retrieval, captioning, and classification.

AI · 穩健性 · MoDELS · AIM · 數學 ·

2024 年 7 月 8 日

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

David "davidad" Dalrymple,Joar Skalse,Yoshua Bengio,Stuart Russell,Max Tegmark,Sanjit Seshia,Steve Omohundro,Christian Szegedy,Ben Goldhaber,Nora Ammann,Alessandro Abate,Joe Halpern,Clark Barrett,Ding Zhao,Tan Zhi-Xuan,Jeannette Wing,Joshua Tenenbaum

Ensuring that AI systems reliably and robustly avoid harmful or dangerous behaviours is a crucial challenge, especially for AI systems with a high degree of autonomy and general intelligence, or systems used in safety-critical contexts. In this paper, we will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI. The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees. This is achieved by the interplay of three core components: a world model (which provides a mathematical description of how the AI system affects the outside world), a safety specification (which is a mathematical description of what effects are acceptable), and a verifier (which provides an auditable proof certificate that the AI satisfies the safety specification relative to the world model). We outline a number of approaches for creating each of these three core components, describe the main technical challenges, and suggest a number of potential solutions to them. We also argue for the necessity of this approach to AI safety, and for the inadequacy of the main alternative approaches.

MoDELS · Performer · 控制器 · 語音識別 · 語音翻譯 ·

2024 年 7 月 5 日

Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models

Vyas Raina,Mark Gales

Speech enabled foundation models, either in the form of flexible speech recognition based systems or audio-prompted large language models (LLMs), are becoming increasingly popular. One of the interesting aspects of these models is their ability to perform tasks other than automatic speech recognition (ASR) using an appropriate prompt. For example, the OpenAI Whisper model can perform both speech transcription and speech translation. With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks. Without any access to the model prompt it is possible to modify the behaviour of the system by appropriately changing the audio input. To illustrate this risk, we demonstrate that it is possible to prepend a short universal adversarial acoustic segment to any input speech signal to override the prompt setting of an ASR foundation model. Specifically, we successfully use a universal adversarial acoustic segment to control Whisper to always perform speech translation, despite being set to perform speech transcription. Overall, this work demonstrates a new form of adversarial attack on multi-tasking speech enabled foundation models that needs to be considered prior to the deployment of this form of model.

情景 · 圖 · 核化 · 可約的 · 樣例 ·

2024 年 7 月 5 日

Tight (Double) Exponential Bounds for Identification Problems: Locating-Dominating Set and Test Cover

Dipayan Chakraborty,Florent Foucaud,Diptapriyo Majumdar,Prafullkumar Tale

from arxiv, Abstract shortened to fit the requirements. This version contains a subset of results mentioned in the previous version

We investigate fine-grained algorithmic aspects of identification problems in graphs and set systems, with a focus on Locating-Dominating Set and Test Cover. We prove the (tight) conditional lower bounds for these problems when parameterized by treewidth and solution as. Formally, \textsc{Locating-Dominating Set} (respectively, \textsc{Test Cover}) parameterized by the treewidth of the input graph (respectively, of the natural auxiliary graph) does not admit an algorithm running in time $2^{2^{o(tw)}} \cdot poly(n)$ (respectively, $2^{2^{o(tw)}} \cdot poly(|U| + |\mathcal{F}|))$. This result augments the small list of NP-Complete problems that admit double exponential lower bounds when parameterized by treewidth. Then, we first prove that \textsc{Locating-Dominating Set} does not admit an algorithm running in time $2^{o(k^2)} \cdot poly(n)$, nor a polynomial time kernelization algorithm that reduces the solution size and outputs a kernel with $2^{o(k)}$ vertices, unless the \ETH\ fails. To the best of our knowledge, \textsc{Locating-Dominating Set} is the first problem that admits such an algorithmic lower-bound (with a quadratic function in the exponent) when parameterized by the solution size. Finally, we prove that \textsc{Test Cover} does not admit an algorithm running in time $2^{2^{o(k)}} \cdot poly(|U| + |\mathcal{F}|)$. This is also a rare example of the problem that admits a double exponential lower bound when parameterized by the solution size. We also present algorithms whose running times match the above lower bounds.

控制器 · MoDELS · 損失 · 生成模型 · Extensibility ·

2024 年 7 月 5 日

MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss

Yangyang Shu,Haiming Xu,Ziqin Zhou,Anton van den Hengel,Lingqiao Liu

from arxiv, Demo is available at: //ganperf.github.io/musebarcontrol.github.io/musebarcontrol/

Automatically generating symbolic music-music scores tailored to specific human needs-can be highly beneficial for musicians and enthusiasts. Recent studies have shown promising results using extensive datasets and advanced transformer architectures. However, these state-of-the-art models generally offer only basic control over aspects like tempo and style for the entire composition, lacking the ability to manage finer details, such as control at the level of individual bars. While fine-tuning a pre-trained symbolic music generation model might seem like a straightforward method for achieving this finer control, our research indicates challenges in this approach. The model often fails to respond adequately to new, fine-grained bar-level control signals. To address this, we propose two innovative solutions. First, we introduce a pre-training task designed to link control signals directly with corresponding musical tokens, which helps in achieving a more effective initialization for subsequent fine-tuning. Second, we implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts. Together, these techniques significantly enhance our ability to control music generation at the bar level, showing a 13.06\% improvement over conventional methods. Our subjective evaluations also confirm that this enhanced control does not compromise the musical quality of the original pre-trained generative model.

INTERACT · 數據集 · Extensibility · 可理解性 · MoDELS ·

2024 年 7 月 5 日

WOMD-Reasoning: A Large-Scale Language Dataset for Interaction and Driving Intentions Reasoning

Yiheng Li,Chongjian Ge,Chenran Li,Chenfeng Xu,Masayoshi Tomizuka,Chen Tang,Mingyu Ding,Wei Zhan

We propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a language annotation dataset built on WOMD, with a focus on describing and reasoning interactions and intentions in driving scenarios. Previous language datasets primarily captured interactions caused by close distances. However, interactions induced by traffic rules and human intentions, which can occur over long distances, are yet sufficiently covered, despite being very common and more challenging for prediction or planning models to understand. Therefore, our WOMD-Reasoning focuses extensively on these interactions, providing a total of 409k Q&As for varying types of interactions. Additionally, WOMD-Reasoning presents by far the largest Q&A dataset on real-world driving scenarios, with around 3 million Q&As covering various topics of autonomous driving from map descriptions, motion status descriptions, to narratives and analyses of agents' interactions, behaviors, and intentions. This extensive textual information enables fine-tuning driving-related Large Language Models (LLMs) for a wide range of applications like scene description, prediction, planning, etc. By incorporating interaction and intention language from WOMD-Reasoning, we see significant enhancements in the performance of the state-of-the-art trajectory prediction model, Multipath++, with improvements of 10.14% in $MR_6$ and 6.90% in $minFDE_6$, proving the effectiveness of WOMD-Reasoning. We hope WOMD-Reasoning would empower LLMs in driving to offer better interaction understanding and behavioral reasoning. The dataset is available on //waymo.com/open/download .

事件抽取 · Extensibility · 端到端 · 有向非循環圖 · state-of-the-art ·

2019 年 9 月 23 日

Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction

Shun Zheng,Wei Cao,Wei Xu,Jiang Bian

from arxiv, Accepted by EMNLP 2019

Most existing event extraction (EE) methods merely extract event arguments within the sentence scope. However, such sentence-level EE methods struggle to handle soaring amounts of documents from emerging applications, such as finance, legislation, health, etc., where event arguments always scatter across different sentences, and even multiple such event mentions frequently co-exist in the same document. To address these challenges, we propose a novel end-to-end model, Doc2EDAG, which can generate an entity-based directed acyclic graph to fulfill the document-level EE (DEE) effectively. Moreover, we reformalize a DEE task with the no-trigger-words design to ease the document-level event labeling. To demonstrate the effectiveness of Doc2EDAG, we build a large-scale real-world dataset consisting of Chinese financial announcements with the challenges mentioned above. Extensive experiments with comprehensive analyses illustrate the superiority of Doc2EDAG over state-of-the-art methods. Data and codes can be found at //github.com/dolphin-zs/Doc2EDAG.