The subjective nature of music emotion introduces inherent bias in both recognition and generation, especially when relying on a single audio encoder, emotion classifier, or evaluation metric. In this work, we conduct a study on Music Emotion Recognition (MER) and Emotional Music Generation (EMG), employing diverse audio encoders alongside the Frechet Audio Distance (FAD), a reference-free evaluation metric. Our study begins with a benchmark evaluation of MER, highlighting the limitations associated with using a single audio encoder and the disparities observed across different measurements. We then propose assessing MER performance using FAD from multiple encoders to provide a more objective measure of music emotion. Furthermore, we introduce an enhanced EMG approach designed to improve both the variation and prominence of generated music emotion, thus enhancing realism. Additionally, we investigate the realism disparities between the emotions conveyed in real and synthetic music, comparing our EMG model against two baseline models. Experimental results underscore the emotion bias problem in both MER and EMG and demonstrate the potential of using FAD and diverse audio encoders to evaluate music emotion objectively.
The healthcare sector has experienced a rapid accumulation of digital data recently, especially in the form of electronic health records (EHRs). EHRs constitute a precious resource that IS researchers could utilize for clinical applications (e.g., morbidity prediction). Deep learning seems like the obvious choice to exploit this surfeit of data. However, numerous studies have shown that deep learning does not enjoy the same kind of success on EHR data as it has in other domains; simple models like logistic regression are frequently as good as sophisticated deep learning ones. Inspired by this observation, we develop a novel model called rational logistic regression (RLR) that has standard logistic regression (LR) as its special case (and thus inherits LR's inductive bias that aligns with EHR data). RLR has rational series as its theoretical underpinnings, works on longitudinal time-series data, and learns interpretable patterns. Empirical comparisons on real-world clinical tasks demonstrate RLR's efficacy.
Integrated sensing and communications (ISAC) has emerged as a means to efficiently utilize spectrum and thereby save cost and power. At the higher end of the spectrum, ISAC systems operate at wideband using large antenna arrays to meet the stringent demands for high-resolution sensing and enhanced communications capacity. On the other hand, the overall design should satisfy energy-efficiency and hardware constraints such as operating on low resolution components for a practical scenario. Therefore, this paper presents the design of Hybrid ANalog and Digital BeAmformers with Low resoLution (HANDBALL) digital-to-analog converters (DACs). We introduce a greedy-search-based approach to design the analog beamformers for multi-user multi-target ISAC scenario. Then, the quantization distortion is taken into account in order to design the baseband beamformer with low resolution DACs. We evaluated performance of the proposed HANDBALL technique in terms of both spectral efficiency and sensing beampattern, providing a satisfactory sensing and communication performance for both one-bit and few-bit designs.
Graph diffusion, which iteratively propagates real-valued substances among the graph, is used in numerous graph/network-involved applications. However, releasing diffusion vectors may reveal sensitive linking information in the data such as transaction information in financial network data. However, protecting the privacy of graph data is challenging due to its interconnected nature. This work proposes a novel graph diffusion framework with edge-level differential privacy guarantees by using noisy diffusion iterates. The algorithm injects Laplace noise per diffusion iteration and adopts a degree-based thresholding function to mitigate the high sensitivity induced by low-degree nodes. Our privacy loss analysis is based on Privacy Amplification by Iteration (PABI), which to our best knowledge, is the first effort that analyzes PABI with Laplace noise and provides relevant applications. We also introduce a novel Infinity-Wasserstein distance tracking method, which tightens the analysis of privacy leakage and makes PABI more applicable in practice. We evaluate this framework by applying it to Personalized Pagerank computation for ranking tasks. Experiments on real-world network data demonstrate the superiority of our method under stringent privacy conditions.
This paper presents an efficient deep learning solution for decoding motor movements from neural recordings in non-human primates. An Autoencoder Gated Recurrent Unit (AEGRU) model was adopted as the model architecture for this task. The autoencoder is only used during the training stage to achieve better generalization. Together with the preprocessing techniques, our model achieved 0.71 $R^2$ score, surpassing the baseline models in Neurobench and is ranked first for $R^2$ in the IEEE BioCAS 2024 Grand Challenge on Neural Decoding. Model pruning is also applied leading to a reduction of 41.4% of the multiply-accumulate (MAC) operations with little change in the $R^2$ score compared to the unpruned model.
This article establishes a method to answer a finite set of linear queries on a given dataset while ensuring differential privacy. To achieve this, we formulate the corresponding task as a saddle-point problem, i.e. an optimization problem whose solution corresponds to a distribution minimizing the difference between answers to the linear queries based on the true distribution and answers from a differentially private distribution. Against this background, we establish two new algorithms for corresponding differentially private data release: the first is based on the differentially private Frank-Wolfe method, the second combines randomized smoothing with stochastic convex optimization techniques for a solution to the saddle-point problem. While previous works assess the accuracy of differentially private algorithms with reference to the empirical data distribution, a key contribution of our work is a more natural evaluation of the proposed algorithms' accuracy with reference to the true data-generating distribution.
The cognitive faculty of visual reasoning necessitates the integration of multimodal perceptual processing and commonsense and external knowledge of the world. In recent years, a plethora of large vision-language models (LVLMs) have been proposed, demonstrating outstanding power and exceptional proficiency in commonsense reasoning across diverse domains and tasks. Nevertheless, training such LVLMs requires a lot of costly resources. Recent approaches, instead of training LVLMs from scratch on various large datasets, focus on exploring ways to take advantage of the capabilities of many different LVLMs, such as ensemble methods. In this work, we propose self-ensemble, a novel method that improves the generalization and visual reasoning of the model without updating any parameters, a training-free method. Our key insight is that we realized that LVLM itself can ensemble without the need for any other LVLMs, which helps to unlock their internal capabilities. Extensive experiments on various benchmarks demonstrate the effectiveness of our method in achieving state-of-the-art (SOTA) performance on SketchyVQA, Outside Knowledge VQA, and out-of-distribution VQA tasks.
With the advancement of technology, large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks, powering LLM-integrated applications like Microsoft Copilot. However, as LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks arise. These attacks trick LLMs into deviating from the original input instructions and executing the attacker's instructions injected in data content, such as retrieved results. Recent attack methods leverage LLMs' instruction-following abilities and their inabilities to distinguish instructions injected in the data content, and achieve a high attack success rate (ASR). When comparing the attack and defense methods, we interestingly find that they share similar design goals, of inducing the model to ignore unwanted instructions and instead to execute wanted instructions. Therefore, we raise an intuitive question: Could these attack techniques be utilized for defensive purposes? In this paper, we invert the intention of prompt injection methods to develop novel defense methods based on previous training-free attack methods, by repeating the attack process but with the original input instruction rather than the injected instruction. Our comprehensive experiments demonstrate that our defense techniques outperform existing training-free defense approaches, achieving state-of-the-art results.
Recent contrastive representation learning methods rely on estimating mutual information (MI) between multiple views of an underlying context. E.g., we can derive multiple views of a given image by applying data augmentation, or we can split a sequence into views comprising the past and future of some step in the sequence. Contrastive lower bounds on MI are easy to optimize, but have a strong underestimation bias when estimating large amounts of MI. We propose decomposing the full MI estimation problem into a sum of smaller estimation problems by splitting one of the views into progressively more informed subviews and by applying the chain rule on MI between the decomposed views. This expression contains a sum of unconditional and conditional MI terms, each measuring modest chunks of the total MI, which facilitates approximation via contrastive bounds. To maximize the sum, we formulate a contrastive lower bound on the conditional MI which can be approximated efficiently. We refer to our general approach as Decomposed Estimation of Mutual Information (DEMI). We show that DEMI can capture a larger amount of MI than standard non-decomposed contrastive bounds in a synthetic setting, and learns better representations in a vision domain and for dialogue generation.
Pre-trained deep neural network language models such as ELMo, GPT, BERT and XLNet have recently achieved state-of-the-art performance on a variety of language understanding tasks. However, their size makes them impractical for a number of scenarios, especially on mobile and edge devices. In particular, the input word embedding matrix accounts for a significant proportion of the model's memory footprint, due to the large input vocabulary and embedding dimensions. Knowledge distillation techniques have had success at compressing large neural network models, but they are ineffective at yielding student models with vocabularies different from the original teacher models. We introduce a novel knowledge distillation technique for training a student model with a significantly smaller vocabulary as well as lower embedding and hidden state dimensions. Specifically, we employ a dual-training mechanism that trains the teacher and student models simultaneously to obtain optimal word embeddings for the student vocabulary. We combine this approach with learning shared projection matrices that transfer layer-wise knowledge from the teacher model to the student model. Our method is able to compress the BERT_BASE model by more than 60x, with only a minor drop in downstream task metrics, resulting in a language model with a footprint of under 7MB. Experimental results also demonstrate higher compression efficiency and accuracy when compared with other state-of-the-art compression techniques.
Deep neural networks (DNNs) have been found to be vulnerable to adversarial examples resulting from adding small-magnitude perturbations to inputs. Such adversarial examples can mislead DNNs to produce adversary-selected results. Different attack strategies have been proposed to generate adversarial examples, but how to produce them with high perceptual quality and more efficiently requires more research efforts. In this paper, we propose AdvGAN to generate adversarial examples with generative adversarial networks (GANs), which can learn and approximate the distribution of original instances. For AdvGAN, once the generator is trained, it can generate adversarial perturbations efficiently for any instance, so as to potentially accelerate adversarial training as defenses. We apply AdvGAN in both semi-whitebox and black-box attack settings. In semi-whitebox attacks, there is no need to access the original target model after the generator is trained, in contrast to traditional white-box attacks. In black-box attacks, we dynamically train a distilled model for the black-box model and optimize the generator accordingly. Adversarial examples generated by AdvGAN on different target models have high attack success rate under state-of-the-art defenses compared to other attacks. Our attack has placed the first with 92.76% accuracy on a public MNIST black-box attack challenge.