亚洲男人的天堂2018av,欧美草比,久久久久久免费视频精选,国色天香在线看免费,久久久久亚洲av成人片仓井空

Most existing masked audio modeling (MAM) methods learn audio representations by masking and reconstructing local spectrogram patches. However, the reconstruction loss mainly accounts for the signal-level quality of the reconstructed spectrogram and is still limited in extracting high-level audio semantics. In this paper, we propose to enhance the semantic modeling of MAM by distilling cross-modality knowledge from contrastive language-audio pretraining (CLAP) representations for both masked and unmasked regions (MAM-CLAP) and leveraging a multi-objective learning strategy with a supervised classification branch (SupMAM), thereby providing more semantic knowledge for MAM and enabling it to effectively learn global features from labels. Experiments show that our methods significantly improve the performance on multiple downstream tasks. Furthermore, by combining our MAM-CLAP with SupMAM, we can achieve new state-of-the-art results on various audio and speech classification tasks, exceeding previous self-supervised learning and supervised pretraining methods.

相關內容

Large language model (LLM) has achieved promising performance in multilingual machine translation tasks through zero/few-shot prompts or prompt-tuning. However, due to the mixture of multilingual data during the pre-training of LLM, the LLM-based translation models face the off-target issue in both prompt-based methods, including a series of phenomena, namely instruction misunderstanding, translation with wrong language and over-generation. For this issue, this paper introduces an \textbf{\underline{A}}uto-\textbf{\underline{C}}onstriction \textbf{\underline{T}}urning mechanism for \textbf{\underline{M}}ultilingual \textbf{\underline{N}}eural \textbf{\underline{M}}achine \textbf{\underline{T}}ranslation (\model), which is a novel supervised fine-tuning mechanism and orthogonal to the traditional prompt-based methods. In this method, \model automatically constructs a constrained template in the target side by adding trigger tokens ahead of the ground truth. Furthermore, trigger tokens can be arranged and combined freely to represent different task semantics, and they can be iteratively updated to maximize the label likelihood. Experiments are performed on WMT test sets with multiple metrics, and the experimental results demonstrate that \model achieves substantially improved performance across multiple translation directions and reduce the off-target phenomena in the translation.

The scale and quality of a dataset significantly impact the performance of deep models. However, acquiring large-scale annotated datasets is both a costly and time-consuming endeavor. To address this challenge, dataset expansion technologies aim to automatically augment datasets, unlocking the full potential of deep models. Current data expansion methods encompass image transformation-based and synthesis-based methods. The transformation-based methods introduce only local variations, resulting in poor diversity. While image synthesis-based methods can create entirely new content, significantly enhancing informativeness. However, existing synthesis methods carry the risk of distribution deviations, potentially degrading model performance with out-of-distribution samples. In this paper, we propose DistDiff, an effective data expansion framework based on the distribution-aware diffusion model. DistDiff constructs hierarchical prototypes to approximate the real data distribution, optimizing latent data points within diffusion models with hierarchical energy guidance. We demonstrate its ability to generate distribution-consistent samples, achieving substantial improvements in data expansion tasks. Specifically, without additional training, DistDiff achieves a 30.7% improvement in accuracy across six image datasets compared to the model trained on original datasets and a 9.8% improvement compared to the state-of-the-art diffusion-based method. Our code is available at //github.com/haoweiz23/DistDiff

Existing learned video compression models employ flow net or deformable convolutional networks (DCN) to estimate motion information. However, the limited receptive fields of flow net and DCN inherently direct their attentiveness towards the local contexts. Global contexts, such as large-scale motions and global correlations among frames are ignored, presenting a significant bottleneck for capturing accurate motions. To address this issue, we propose a joint local and global motion compensation module (LGMC) for leaned video coding. More specifically, we adopt flow net for local motion compensation. To capture global context, we employ the cross attention in feature domain for motion compensation. In addition, to avoid the quadratic complexity of vanilla cross attention, we divide the softmax operations in attention into two independent softmax operations, leading to linear complexity. To validate the effectiveness of our proposed LGMC, we integrate it with DCVC-TCM and obtain learned video compression with joint local and global motion compensation (LVC-LGMC). Extensive experiments demonstrate that our LVC-LGMC has significant rate-distortion performance improvements over baseline DCVC-TCM.

Large language models (LLMs) have demonstrated remarkable capabilities across various NLP tasks. However, their computational costs are prohibitively high. To address this issue, previous research has attempted to distill the knowledge of LLMs into smaller models by generating annotated data. Nonetheless, these works have mainly focused on the direct use of LLMs for text generation and labeling, without fully exploring their potential to comprehend the target task and acquire valuable knowledge. In this paper, we propose EvoKD: Evolving Knowledge Distillation, which leverages the concept of active learning to interactively enhance the process of data generation using large language models, simultaneously improving the task capabilities of small domain model (student model). Different from previous work, we actively analyze the student model's weaknesses, and then synthesize labeled samples based on the analysis. In addition, we provide iterative feedback to the LLMs regarding the student model's performance to continuously construct diversified and challenging samples. Experiments and analysis on different NLP tasks, namely, text classification and named entity recognition show the effectiveness of EvoKD.

In millimeter-wave (mmWave) integrated sensing and communication networks, users may be within the coverage of multiple access points (AP), which typically employ large-scale antenna arrays to mitigate obstacle occlusion and path loss. However, large-scale arrays generate pencil-shaped beams, which necessitate a higher number of training beams to cover the desired space. Furthermore, as the antenna aperture increases, users are more likely to be situated in the near-field region of the AP antenna array. This motivates our investigation into the near-field beam training problem to achieve effective positioning services. To address the high complexity and low identification accuracy of existing beam training techniques, we propose an efficient hashing multi-arm beam (HMB) training scheme for the near-field scenario. Specifically, we first construct a near-field single-beam training codebook for the uniform planar arrays. Then, the hash functions are chosen independently to construct the multi-arm beam training codebooks for each AP. All APs traverse the predefined multi-arm beam training codeword simultaneously and the multi-AP superimposed signals at the user are recorded. Finally, the soft decision and voting methods are applied to obtain the correctly aligned beams only based on the signal powers. In addition, we logically prove that the traversal complexity is at the logarithmic level. Simulation results show that our proposed near-field HMB training method can achieve 96.4% identification accuracy of the exhaustive beam training method and greatly reduce the training overhead. Furthermore, we verify its applicability under the far-field scenario as well.

Dynamic graph neural networks (DyGNNs) have demonstrated powerful predictive abilities by exploiting graph structural and temporal dynamics. However, the existing DyGNNs fail to handle distribution shifts, which naturally exist in dynamic graphs, mainly because the patterns exploited by DyGNNs may be variant with respect to labels under distribution shifts. In this paper, we propose Disentangled Intervention-based Dynamic graph Attention networks with Invariance Promotion (I-DIDA) to handle spatio-temporal distribution shifts in dynamic graphs by discovering and utilizing invariant patterns, i.e., structures and features whose predictive abilities are stable across distribution shifts. Specifically, we first propose a disentangled spatio-temporal attention network to capture the variant and invariant patterns. By utilizing the disentangled patterns, we design a spatio-temporal intervention mechanism to create multiple interventional distributions and an environment inference module to infer the latent spatio-temporal environments, and minimize the variance of predictions among these intervened distributions and environments, so that our model can make predictions based on invariant patterns with stable predictive abilities under distribution shifts. Extensive experiments demonstrate the superiority of our method over state-of-the-art baselines under distribution shifts. Our work is the first study of spatio-temporal distribution shifts in dynamic graphs, to the best of our knowledge.

A novel near-field transmission framework is proposed for dynamic metasurface antenna (DMA)-enabled non-orthogonal multiple access (NOMA) networks. The base station (BS) exploits the hybrid beamforming to communicate with multiple near users (NUs) and far users (FUs) using the NOMA principle. Based on this framework, two novel beamforming schemes are proposed. 1) For the case of the grouped users distributed in the same direction, a beam-steering scheme is developed. The metric of beam pattern error (BPE) is introduced for the characterization of the gap between the hybrid beamformers and the desired ideal beamformers, where a two-layer algorithm is proposed to minimize BPE by optimizing hybrid beamformers. Then, the optimal power allocation strategy is obtained to maximize the sum achievable rate of the network. 2) For the case of users randomly distributed, a beam-splitting scheme is proposed, where two sub-beamformers are extracted from the single beamformer to serve different users in the same group. An alternating optimization (AO) algorithm is proposed for hybrid beamformer optimization, and the optimal power allocation is also derived. Numerical results validate that: 1) the proposed beamforming schemes exhibit superior performance compared with the existing imperfect-resolution-based beamforming scheme; 2) the communication rate of the proposed transmission framework is sensitive to the imperfect distance knowledge of NUs but not to that of FUs.

Despite advancements in text-to-image generation (T2I), prior methods often face text-image misalignment problems such as relation confusion in generated images. Existing solutions involve cross-attention manipulation for better compositional understanding or integrating large language models for improved layout planning. However, the inherent alignment capabilities of T2I models are still inadequate. By reviewing the link between generative and discriminative modeling, we posit that T2I models' discriminative abilities may reflect their text-image alignment proficiency during generation. In this light, we advocate bolstering the discriminative abilities of T2I models to achieve more precise text-to-image alignment for generation. We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment. As a bonus of the discriminative adapter, a self-correction mechanism can leverage discriminative gradients to better align generated images to text prompts during inference. Comprehensive evaluations across three benchmark datasets, including both in-distribution and out-of-distribution scenarios, demonstrate our method's superior generation performance. Meanwhile, it achieves state-of-the-art discriminative performance on the two discriminative tasks compared to other generative models.

Silent Speech Interfaces (SSIs) offer a noninvasive alternative to brain-computer interfaces for soundless verbal communication. We introduce Multimodal Orofacial Neural Audio (MONA), a system that leverages cross-modal alignment through novel loss functions--cross-contrast (crossCon) and supervised temporal contrast (supTcon)--to train a multimodal model with a shared latent representation. This architecture enables the use of audio-only datasets like LibriSpeech to improve silent speech recognition. Additionally, our introduction of Large Language Model (LLM) Integrated Scoring Adjustment (LISA) significantly improves recognition accuracy. Together, MONA LISA reduces the state-of-the-art word error rate (WER) from 28.8% to 12.2% in the Gaddy (2020) benchmark dataset for silent speech on an open vocabulary. For vocal EMG recordings, our method improves the state-of-the-art from 23.3% to 3.7% WER. In the Brain-to-Text 2024 competition, LISA performs best, improving the top WER from 9.8% to 8.9%. To the best of our knowledge, this work represents the first instance where noninvasive silent speech recognition on an open vocabulary has cleared the threshold of 15% WER, demonstrating that SSIs can be a viable alternative to automatic speech recognition (ASR). Our work not only narrows the performance gap between silent and vocalized speech but also opens new possibilities in human-computer interaction, demonstrating the potential of cross-modal approaches in noisy and data-limited regimes.

High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristic of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we proposed an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we used a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopted a conditional random field to refine the preliminary classification results generated by GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieved encouraging classification accuracy using a small number of data for training.

北京阿比特科技有限公司