
Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TTS) system with limited resources using SSL features and generate a large synthetic corpus for pre-training. Experimental results demonstrate that our proposed approach effectively reduces the demand for speech data by 90% with only slight performance degradation. To the best of our knowledge, this is the first work aiming to enhance low-resource self-supervised learning in speech processing.
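As a rough illustration of the pipeline described above (with a hypothetical `tts.synthesize` interface; nothing here reflects the paper's actual implementation), a small real corpus could be mixed with synthetic speech like this:

```python
# Minimal sketch of the corpus-construction step, under assumed interfaces.
# `tts` stands in for the paper's low-resource TTS system built on SSL
# features; the heavy synthetic share mirrors the reported 90% reduction
# in required real speech.
def build_pretraining_corpus(real_utterances, texts, tts):
    synthetic = [tts.synthesize(t) for t in texts]  # large synthetic portion
    return list(real_utterances) + synthetic        # mixed SSL pre-training set
```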

Related content

Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has progressed from early formant synthesis to waveform-concatenation and statistical parametric synthesis, and further to hybrid approaches; the quality and naturalness of synthesized speech have improved markedly and can now meet the requirements of certain applications. Today, speech synthesis is widely used in information-announcement systems in banks and hospitals, in-car navigation, and automated call centers, yielding substantial economic benefits. Moreover, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, applications of speech synthesis are expanding into entertainment, language teaching, rehabilitation therapy, and related fields. Speech synthesis is coming to touch nearly every aspect of everyday life.

Distribution shifts between training and test data are all but inevitable over the lifecycle of a deployed model and lead to performance decay. Adapting the model offers a way to mitigate this drop in performance. Yet, adaptation is challenging since it must be unsupervised: we usually do not have access to any labeled data at test time. In this paper, we propose a probabilistic state-space model that can adapt a deployed model subjected to distribution drift. Our model learns the dynamics induced by distribution shifts on the last set of hidden features. Without requiring labels, we infer time-evolving class prototypes that serve as a dynamic classification head. Moreover, our approach is lightweight, modifying only the model's last linear layer. In experiments on real-world distribution shifts and synthetic corruptions, we demonstrate that our approach performs competitively with methods that require back-propagation and access to the model backbone. Our model especially excels in the case of small test batches, the most difficult setting.
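A minimal sketch of the core idea, time-evolving class prototypes acting as the classifier, is given below. It replaces the paper's probabilistic state-space inference with a simple EM-style moving-average update; the class name, the `lr` parameter, and the batch-size gating heuristic are illustrative assumptions, not the authors' method.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class DynamicPrototypeHead:
    """Time-evolving class prototypes used as a drop-in linear head.

    A simplified stand-in for a probabilistic state-space model:
    prototypes drift over time and are updated from unlabeled test
    batches via soft assignments (an EM-style filter step).
    """

    def __init__(self, init_prototypes, lr=0.1):
        self.mu = init_prototypes.astype(np.float64)  # (C, D)
        self.lr = lr  # how fast prototypes track the drift

    def predict(self, feats):
        # logits = negative squared distance to each prototype
        d2 = ((feats[:, None, :] - self.mu[None, :, :]) ** 2).sum(-1)
        return softmax(-d2, axis=1)

    def adapt(self, feats):
        # E-step: soft responsibilities; M-step: move prototypes toward
        # the responsibility-weighted batch means (no labels needed).
        r = self.predict(feats)                          # (B, C)
        w = r.sum(0) + 1e-8                              # per-class mass
        batch_mu = (r.T @ feats) / w[:, None]            # (C, D)
        gate = w / (w + 1.0)                             # trust small batches less
        self.mu += self.lr * gate[:, None] * (batch_mu - self.mu)
        return r
```

In use, `mu` would be initialized from the rows of the deployed model's last linear layer, and `adapt` called once per unlabeled test batch as data streams in.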

This study explores the recently proposed and challenging multi-view Anomaly Detection (AD) task. Single-view approaches suffer from blind spots that other perspectives would cover, resulting in inaccurate sample-level predictions. We therefore introduce the \textbf{M}ulti-\textbf{V}iew \textbf{A}nomaly \textbf{D}etection (\textbf{MVAD}) framework, which learns and integrates features across views. Specifically, we propose a \textbf{M}ulti-\textbf{V}iew \textbf{A}daptive \textbf{S}election (\textbf{MVAS}) algorithm for feature learning and fusion across multiple views. The feature maps are divided into neighbourhood attention windows, and a semantic correlation matrix is computed between the windows of each single view and those of all other views; attention is then conducted between each single-view window and its top-K most correlated windows from the other views. By adjusting the window sizes and top-K, the computational complexity can be reduced to linear. Extensive cross-setting (multi/single-class) experiments on the Real-IAD dataset validate the effectiveness of our approach: it achieves state-of-the-art performance across ten metrics at the sample (\textbf{4.1\%}$\uparrow$), image (\textbf{5.6\%}$\uparrow$), and pixel (\textbf{6.7\%}$\uparrow$) levels, with only \textbf{18M} parameters and less GPU memory and training time.
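The sketch below shows one plausible reading of the window-level top-K cross-view attention (pooled window descriptors, cosine similarity, softmax over the K retained windows). Function and variable names are assumptions, it requires at least two views, and `k` must be smaller than the number of candidate windows.

```python
import numpy as np

def mvas_topk_attention(windows, k=4):
    """Sketch of Multi-View Adaptive Selection over window tokens.

    windows: (V, N, D) array - V views (V >= 2), N attention windows per
    view, each window pooled to a D-dim descriptor. Every window of a
    reference view attends only to its top-k most correlated windows
    from all *other* views, keeping the cost roughly linear in V*N.
    """
    V, N, D = windows.shape
    x = windows / (np.linalg.norm(windows, axis=-1, keepdims=True) + 1e-8)
    fused = np.empty_like(windows)
    for v in range(V):
        others = np.concatenate([x[u] for u in range(V) if u != v])  # ((V-1)N, D)
        sim = x[v] @ others.T                                        # (N, (V-1)N)
        idx = np.argpartition(-sim, k, axis=1)[:, :k]                # top-k per window
        top = np.take_along_axis(sim, idx, axis=1)                   # (N, k)
        w = np.exp(top - top.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        ctx = (others[idx] * w[..., None]).sum(axis=1)               # (N, D)
        fused[v] = windows[v] + ctx    # residual fusion of cross-view context
    return fused
```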

In offline reinforcement learning, the challenge of out-of-distribution (OOD) actions is pronounced. To address this, existing methods often constrain the learned policy through policy regularization. However, these methods frequently suffer from unnecessary conservativeness, hampering policy improvement, because they indiscriminately use all actions of the behavior policy that generated the offline dataset as constraints. The problem becomes particularly noticeable when the quality of the dataset is suboptimal. We therefore propose Adaptive Advantage-guided Policy Regularization (A2PR), which obtains high-advantage actions from an augmented behavior policy combined with a VAE to guide the learned policy. A2PR can select high-advantage actions that differ from those present in the dataset while still effectively maintaining conservatism toward OOD actions. This is achieved by harnessing the VAE's capacity to generate samples that match the distribution of the data points. We theoretically prove that the improvement of the behavior policy is guaranteed. Moreover, A2PR effectively mitigates value overestimation, with a bounded performance gap. Empirically, we conduct a series of experiments on the D4RL benchmark, where A2PR demonstrates state-of-the-art performance. Furthermore, results on additional suboptimal mixed datasets show that A2PR remains superior. Code is available at //github.com/ltlhuuu/A2PR.
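A hedged sketch of the action-selection idea follows: for each state, the dataset action competes with a VAE-generated action, and the policy is regularized toward whichever has the higher advantage. The network interfaces (`policy`, `q_net`, `v_net`, `vae.decode`) and the advantage-clipped weighting are placeholders, not the paper's exact loss.

```python
import torch

def a2pr_policy_loss(policy, q_net, v_net, vae, states, dataset_actions):
    """Sketch of advantage-guided policy regularization (the A2PR idea).

    For each state we pick, between the dataset action and a VAE sample
    (the augmented behavior policy), the action with the higher advantage
    A(s, a) = Q(s, a) - V(s), and regularize the policy toward it.
    Assumes a deterministic policy and Q/V nets returning shape (B,).
    """
    with torch.no_grad():
        gen_actions = vae.decode(states)                 # in-distribution samples
        adv_data = q_net(states, dataset_actions) - v_net(states)
        adv_gen = q_net(states, gen_actions) - v_net(states)
        use_gen = (adv_gen > adv_data).float().unsqueeze(-1)
        target = use_gen * gen_actions + (1 - use_gen) * dataset_actions
        weight = torch.clamp(torch.maximum(adv_data, adv_gen), min=0.0)

    pi_actions = policy(states)
    q_term = -q_net(states, pi_actions).mean()           # policy improvement
    bc_term = (weight * ((pi_actions - target) ** 2).sum(-1)).mean()
    return q_term + bc_term
```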

Robots equipped with reinforcement learning (RL) have the potential to learn a wide range of skills solely from a reward signal. However, obtaining a robust and dense reward signal for general manipulation tasks remains a challenge. Existing learning-based approaches require significant data, such as demonstrations or examples of success and failure, to learn task-specific reward functions. Recently, there has also been growing adoption of large multi-modal foundation models in robotics. These models can perform visual reasoning in physical contexts and generate coarse robot motions for various manipulation tasks. Motivated by this range of capabilities, we propose and study rewards shaped by vision-language models (VLMs). State-of-the-art VLMs have demonstrated an impressive ability to reason about affordances through keypoints in a zero-shot manner, and we leverage this to define dense rewards for robotic learning. On a real-world manipulation task specified by a natural language description, we find that these rewards improve the sample efficiency of autonomous RL and enable successful completion of the task within 20K online finetuning steps. Additionally, we demonstrate the robustness of the approach to reductions in the number of in-domain demonstrations used for pretraining, reaching comparable performance within 35K online finetuning steps.
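As a toy example of turning VLM-proposed keypoints into a dense reward (the VLM query itself is omitted; the exponential shaping and the `scale` parameter are assumptions, not the paper's formulation):

```python
import numpy as np

def keypoint_distance_reward(obj_kp, goal_kp, scale=10.0):
    """Sketch of a dense reward shaped from VLM keypoints.

    obj_kp / goal_kp: 2D (or 3D) keypoints a VLM proposed zero-shot for
    the manipulated object and the goal region in the current image.
    Returns a smooth, bounded reward in (0, 1].
    """
    dist = np.linalg.norm(np.asarray(obj_kp) - np.asarray(goal_kp))
    return float(np.exp(-dist / scale))   # 1 when aligned, ->0 when far apart
```

A reward of this shape is dense in the sense that it changes smoothly with every motion of the object, unlike a sparse success indicator.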

This technical report presents our frame-level embedding learning system for the DCASE2024 challenge on few-shot bioacoustic event detection (Task 5). In this work, we used log-mel and PCEN features extracted from the input audio, a Netmamba Encoder as the information-interaction network, and data augmentation strategies to improve the generalizability of the trained model, together with multiple post-processing methods. Our final system achieved an F-measure score of 56.4%, securing 2nd rank in the few-shot bioacoustic event detection category of the Detection and Classification of Acoustic Scenes and Events Challenge 2024.
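For reference, the two front-ends named above can be computed with librosa roughly as follows; the parameter values are common defaults, not the authors' settings.

```python
import librosa
import numpy as np

def extract_features(path, sr=22050, n_mels=128, hop_length=256):
    """Compute the two front-ends named in the report: log-mel and PCEN."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    log_mel = librosa.power_to_db(mel, ref=np.max)       # log compression
    pcen = librosa.pcen(mel * (2 ** 31), sr=sr,          # per-channel
                        hop_length=hop_length)           # energy normalization
    return np.stack([log_mel, pcen])                     # (2, n_mels, T)
```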

Multi-Instance Partial Label Learning (MI-PLL) is a weakly-supervised learning setting encompassing partial label learning, latent structural learning, and neurosymbolic learning. Unlike supervised learning, in MI-PLL the inputs to the classifiers at training time are tuples of instances $\textbf{x}$, while the supervision signal is generated by a function $\sigma$ over the gold labels of $\textbf{x}$. The gold labels are hidden during training. In this paper, we focus on characterizing and mitigating learning imbalances, i.e., differences in the errors that occur when classifying instances of different classes (also known as class-specific risks), under MI-PLL. The phenomenon of learning imbalances has been studied extensively in the context of long-tail learning; however, the nature of MI-PLL introduces new challenges. Our contributions are as follows. From a theoretical perspective, we characterize the learning imbalances by deriving class-specific risk bounds that depend upon the function $\sigma$. Our theory reveals that learning imbalances exist in MI-PLL even when the hidden labels are uniformly distributed. On the practical side, we introduce a technique for estimating the marginal of the hidden labels using only MI-PLL data. We then introduce algorithms that mitigate imbalances at training and testing time by treating the marginal of the hidden labels as a constraint. The first algorithm relies on a novel linear programming formulation of MI-PLL for pseudo-labeling; the second adjusts a model's scores based on robust optimal transport. We demonstrate the effectiveness of our techniques using strong neurosymbolic and long-tail learning baselines, and discuss open challenges.
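As a compact stand-in for the pseudo-labeling step, the sketch below enforces an estimated hidden-label marginal via entropy-regularized optimal transport (Sinkhorn) rather than the paper's linear program; it assumes `marginal` sums to one.

```python
import numpy as np

def marginal_constrained_pseudolabels(scores, marginal, n_iter=50):
    """Pseudo-labels constrained to match an estimated class marginal.

    scores: (N, C) class scores for N unlabeled instances.
    marginal: (C,) estimated distribution of the hidden labels.
    Sinkhorn scaling approximately enforces the column-marginal
    constraint that the paper's LP formulation imposes exactly.
    """
    P = np.exp(scores - scores.max(axis=1, keepdims=True))
    P /= P.sum()
    row = np.full(len(scores), 1.0 / len(scores))   # one unit of mass per row
    col = np.asarray(marginal, dtype=float)
    for _ in range(n_iter):
        P *= (row / P.sum(axis=1))[:, None]         # match per-instance mass
        P *= (col / P.sum(axis=0))[None, :]         # match class marginal
    return P.argmax(axis=1)                         # marginal-respecting labels
```

The same scaled matrix `P` could also serve as soft targets; the paper's robust-OT score adjustment refines this basic recipe at test time.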

Graph Convolutional Networks (GCNs) have achieved extraordinary success in learning effective task-specific representations of nodes in graphs. However, for Heterogeneous Information Networks (HINs), existing HIN-oriented GCN methods still suffer from two deficiencies: (1) they cannot flexibly explore all possible meta-paths and extract the most useful ones for a target object, which hinders both effectiveness and interpretability; (2) they often need to generate intermediate meta-path based dense graphs, which leads to high computational complexity. To address these issues, we propose an interpretable and efficient Heterogeneous Graph Convolutional Network (ie-HGCN) to learn the representations of objects in HINs. It is designed as a hierarchical aggregation architecture: object-level aggregation first, followed by type-level aggregation. The novel architecture can automatically extract useful meta-paths for each object from all possible meta-paths (within a length limit), which gives the model good interpretability. It also reduces computational cost by avoiding intermediate HIN transformation and neighborhood attention. We provide a theoretical analysis of the proposed ie-HGCN in terms of evaluating the usefulness of all possible meta-paths, its connection to spectral graph convolution on HINs, and its quasi-linear time complexity. Extensive experiments on three real network datasets demonstrate the superiority of ie-HGCN over state-of-the-art methods.
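A schematic of the two-level aggregation might look as follows (dense numpy stand-ins for sparse operations; the shapes, parameter names, and tanh scoring form are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

def ie_hgcn_layer(h_self, h_by_type, adj, W, q):
    """One hierarchical aggregation step in the ie-HGCN style.

    h_self: (N, D) target-object features; h_by_type: dict type -> (N_t, D)
    neighbor features; adj: dict type -> (N, N_t) row-normalized adjacency;
    W: dict with one (D, D_out) projection per type plus "self";
    q: (D_out,) type-level attention query.
    """
    # object-level: project and average neighbors, separately per type
    per_type = {t: adj[t] @ (h_by_type[t] @ W[t]) for t in adj}
    per_type["self"] = h_self @ W["self"]

    # type-level: attention over types; the learned weights expose which
    # neighbor types (and, across layers, meta-paths) matter per object
    keys = list(per_type)
    scores = np.stack([np.tanh(per_type[t]) @ q for t in keys])  # (T, N)
    alpha = np.exp(scores - scores.max(axis=0))
    alpha /= alpha.sum(axis=0)
    return sum(a[:, None] * per_type[t] for a, t in zip(alpha, keys))
```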

Graph Neural Networks (GNNs) have received considerable attention for learning on graph-structured data across a wide variety of tasks. The well-designed propagation mechanism, which has been demonstrated to be effective, is the most fundamental part of GNNs. Although most GNNs follow a message-passing scheme, little effort has been made to discover and analyze their essential relations. In this paper, we establish a surprising connection between different propagation mechanisms through a unified optimization problem, showing that despite the proliferation of various GNNs, their propagation mechanisms are in fact the optimal solutions of a feature-fitting function over a wide class of graph kernels with a graph regularization term. Our unified optimization framework, which summarizes the commonalities between several of the most representative GNNs, not only provides a macroscopic view of the relations between different GNNs, but also opens up new opportunities for flexibly designing new ones. With the proposed framework, we observe that existing works usually use naive graph convolutional kernels for the feature-fitting function, and we further develop two novel objective functions with adjustable graph kernels exhibiting low-pass or high-pass filtering capabilities, respectively. Moreover, we provide convergence proofs and expressive-power comparisons for the proposed models. Extensive experiments on benchmark datasets show that the proposed GNNs not only outperform state-of-the-art methods but also alleviate over-smoothing, further verifying the feasibility of designing GNNs within our unified optimization framework.
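To make the claim concrete, a generic instance of such a unified objective can be written as follows (the notation is illustrative rather than the paper's exact symbols: $H$ the input features, $Z$ the propagated representations, $\tilde{L}$ the normalized graph Laplacian, $F_{1}, F_{2}$ graph-kernel filters, $\xi$ a trade-off weight):

```latex
\min_{Z}\;
\underbrace{\bigl\lVert F_{1}Z - F_{2}H \bigr\rVert_F^{2}}_{\text{feature fitting}}
\;+\;
\xi\,\underbrace{\operatorname{tr}\bigl(Z^{\top}\tilde{L}Z\bigr)}_{\text{graph regularization}}
```

With $F_{1} = F_{2} = I$, setting the gradient to zero gives the closed-form solution $Z^{*} = (I + \xi\tilde{L})^{-1}H$, while a single gradient step from $Z = H$ with step size $1/(2\xi)$ yields $Z = (I - \tilde{L})H = \tilde{A}H$ (where $\tilde{A} = I - \tilde{L}$ is the normalized adjacency), i.e., one round of GCN-style propagation.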

Few-shot Knowledge Graph (KG) completion is a focus of current research, where each task aims to query unseen facts of a relation given its few-shot reference entity pairs. Recent attempts solve this problem by learning static representations of entities and references, ignoring their dynamic properties: entities may exhibit diverse roles within task relations, and references may make different contributions to queries. This work proposes an adaptive attentional network for few-shot KG completion that learns adaptive entity and reference representations. Specifically, entities are modeled by an adaptive neighbor encoder to discern their task-oriented roles, while references are modeled by an adaptive query-aware aggregator to differentiate their contributions. Through the attention mechanism, both entities and references capture their fine-grained semantic meanings and thus render more expressive representations, which is more predictive for knowledge acquisition in the few-shot scenario. Evaluation on link prediction over two public datasets shows that our approach achieves new state-of-the-art results across different few-shot sizes.
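In its simplest form, the query-aware aggregation idea reduces to attention of the query over the reference-pair embeddings; the dot-product scoring below is an assumption standing in for the paper's aggregator:

```python
import numpy as np

def query_aware_reference(query, references):
    """Aggregate few-shot references conditioned on the current query.

    query: (D,) embedding of the (head, ?) query pair.
    references: (K, D) embeddings of the K reference entity pairs.
    Instead of a static mean, each reference is weighted by its relevance
    to the query, so different queries see different relation representations.
    """
    scores = references @ query               # relevance per reference
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ references                 # query-specific relation vector
```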

Learning with limited data is a key challenge for visual recognition. Few-shot learning methods address this challenge by learning an instance embedding function from seen classes and applying the function to instances from unseen classes with limited labels. This style of transfer learning is task-agnostic: the embedding function is not optimized to be discriminative with respect to the unseen classes, even though discerning among them is the target task. In this paper, we propose a novel approach that adapts the embedding model to the target classification task, yielding embeddings that are task-specific and discriminative. To this end, we employ a self-attention mechanism, the Transformer, to transform the embeddings from task-agnostic to task-specific by relating the test instances to the training instances in both seen and unseen classes. Our approach also extends to transductive and generalized few-shot classification, two important settings with essential use cases. We verify the effectiveness of our model on two standard benchmark few-shot classification datasets, MiniImageNet and CUB, where it demonstrates state-of-the-art empirical performance.
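A minimal sketch of set-to-set embedding adaptation follows: class centroids from the support set attend to one another, and queries are classified by distance to the adapted centroids. A single `nn.MultiheadAttention` layer is used here as a stand-in for the paper's full adaptation module.

```python
import torch
import torch.nn as nn

class EmbeddingAdapter(nn.Module):
    """Task-specific adaptation of class centroids via self-attention."""

    def __init__(self, dim, heads=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, support_centroids, queries):
        # support_centroids: (C, D), one centroid per class; queries: (Q, D)
        s = support_centroids.unsqueeze(0)          # (1, C, D)
        adapted, _ = self.attn(s, s, s)             # set-to-set transform:
        adapted = (adapted + s).squeeze(0)          # each centroid is adjusted
                                                    # relative to the others
        d2 = torch.cdist(queries, adapted) ** 2     # (Q, C) squared distances
        return (-d2).softmax(dim=1)                 # nearest-centroid posterior
```

The residual connection keeps the adapted centroids close to the task-agnostic ones, so the attention layer only needs to learn the task-conditioned correction.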
