Music genre classification has been widely studied in the past few years for its various applications in music information retrieval. Previous works tend to perform unsatisfactorily because they either use audio content alone or combine audio and lyrics content inefficiently. In addition, as genres normally co-occur in a music track, it is desirable to capture and model genre correlations to improve the performance of multi-label music genre classification. To solve these issues, we present a novel multi-modal method that leverages an audio-lyrics contrastive loss and two symmetric cross-modal attention modules to align and fuse features from audio and lyrics. Furthermore, based on the nature of multi-label classification, a genre correlation extraction module is presented to capture and model potential genre correlations. Extensive experiments demonstrate that our proposed method significantly surpasses other multi-label music genre classification methods and achieves state-of-the-art results on the Music4All dataset.
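As a rough illustration of what an audio-lyrics contrastive loss can look like, the sketch below implements a generic symmetric InfoNCE-style objective over paired embeddings. The function names, the temperature value, and the toy embeddings are assumptions made for illustration; the abstract does not specify the exact loss used.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true pair."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def audio_lyrics_contrastive_loss(audio, lyrics, temperature=0.1):
    """Symmetric contrastive loss; audio[i] and lyrics[i] belong to the same track."""
    a = [l2_normalize(v) for v in audio]
    t = [l2_normalize(v) for v in lyrics]
    n = len(a)
    # Cosine similarity between every audio/lyrics pair, scaled by the temperature.
    sim = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Audio-to-lyrics direction: each row should peak on the diagonal.
    a2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Lyrics-to-audio direction: each column should peak on the diagonal.
    t2a = sum(cross_entropy_row([sim[i][j] for i in range(n)], j)
              for j in range(n)) / n
    return 0.5 * (a2t + t2a)

# Hypothetical 2-D embeddings for two tracks.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned_lyrics = [[1.0, 0.0], [0.0, 1.0]]
swapped_lyrics = [[0.0, 1.0], [1.0, 0.0]]
print(audio_lyrics_contrastive_loss(audio, aligned_lyrics))
print(audio_lyrics_contrastive_loss(audio, swapped_lyrics))
```

The loss is small when each audio embedding is closest to its own lyrics embedding and large when the pairs are mismatched, which is the alignment pressure such a loss is meant to provide.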
Hyperbox-based classification has been seen as a promising technique in which decisions on the data are represented as a series of orthogonal, multidimensional boxes (i.e., hyperboxes) that are often interpretable and human-readable. However, existing methods are no longer capable of efficiently handling the increasing volume of data many application domains face nowadays. We address this gap by proposing a novel, fully differentiable framework for hyperbox-based classification via neural networks. In contrast to previous work, our hyperbox models can be efficiently trained in an end-to-end fashion, which leads to significantly reduced training times and superior classification results.
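To make the notion of a differentiable hyperbox concrete, the following sketch contrasts a hard membership test with a smooth relaxation built from sigmoids. The relaxation, the `sharpness` parameter, and the function names are assumptions chosen for illustration and are not taken from the framework described above.

```python
import math

def hard_membership(x, lower, upper):
    """Return True if x lies inside the axis-aligned box [lower, upper]."""
    return all(lo <= xi <= hi for xi, lo, hi in zip(x, lower, upper))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_membership(x, lower, upper, sharpness=10.0):
    """Smooth score in (0, 1): near 1 well inside the box, near 0 well outside.

    Each dimension contributes one sigmoid per face, so the score is
    differentiable with respect to x and to the box bounds.
    """
    score = 1.0
    for xi, lo, hi in zip(x, lower, upper):
        score *= sigmoid(sharpness * (xi - lo)) * sigmoid(sharpness * (hi - xi))
    return score

# Hypothetical unit box in two dimensions.
lower, upper = [0.0, 0.0], [1.0, 1.0]
print(hard_membership([0.5, 0.5], lower, upper), soft_membership([0.5, 0.5], lower, upper))
print(hard_membership([1.5, 0.5], lower, upper), soft_membership([1.5, 0.5], lower, upper))
```

Because the soft score varies smoothly with the bounds, the box corners could in principle be adjusted by gradient descent, which is the property an end-to-end trainable hyperbox model relies on.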
This paper presents a new radiometric fingerprint that is revealed by micro-signals in the channel state information (CSI) curves extracted from commodity Wi-Fi devices. We refer to this new fingerprint as "micro-CSI". Our experiments show that micro-CSI is likely to be caused by imperfections in the radio-frequency circuitry and is present in Wi-Fi 4/5/6 network interface cards (NICs). We conducted further experiments to determine the most effective CSI collection configuration to stabilize micro-CSI. To extract micro-CSI from varying CSI curves, we developed a signal space-based extraction algorithm that effectively separates distortions caused by wireless channels and hardware imperfections under line-of-sight (LoS) scenarios. Finally, we implemented a micro-CSI-based device authentication algorithm that uses the k-Nearest Neighbors (KNN) method to identify 11 COTS Wi-Fi NICs from the same manufacturer in typical indoor environments. Our experimental results demonstrate that the micro-CSI-based authentication algorithm can achieve an average attack detection rate of over 99% with a false alarm rate of 0%.
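The identification step can be sketched with a plain k-Nearest Neighbors classifier. The example below assumes that each device has already been reduced to a short feature vector; the two-dimensional features, the device labels, and the function name are hypothetical, and the sketch covers identification only, not the attack detection logic.

```python
import math
from collections import Counter

def knn_predict(train_features, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    # Pair each training vector's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical fingerprint features for two NICs.
train_features = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # samples from "nic_a"
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1],   # samples from "nic_b"
]
train_labels = ["nic_a"] * 3 + ["nic_b"] * 3

print(knn_predict(train_features, train_labels, [0.05, 0.05]))
print(knn_predict(train_features, train_labels, [0.9, 1.0]))
```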
Few existing studies focus on the source separation problem with unknown numbers of signals, and how to evaluate the performance of such systems is not yet clear. We propose a solution with a fixed number of output channels to address these two problems, enabling it to avoid the dimensional disaster caused by the permutation problem induced by the alignment of outputs to targets. Specifically, we propose a two-step algorithm based on autoencoders and a new performance evaluation method for situations with mute channels. Experiments conducted on simulated mixtures of radiated ship noise show that the proposed solution can achieve separation performance similar to that attained with a known number of signals. The proposed algorithm achieves performance competitive with that of two algorithms developed for known numbers of signals, is highly explainable and extensible, and achieves state-of-the-art results under this framework.
Location fingerprinting based on RSSI has become a mainstream indoor localization technique because it requires neither the installation of new infrastructure nor the modification of existing devices, especially given the prevalence of Wi-Fi-enabled devices and the ubiquitous Wi-Fi access in modern buildings. The use of AI/ML technologies like DNNs makes location fingerprinting more accurate and reliable, especially for large-scale multi-building and multi-floor indoor localization. The application of DNNs for indoor localization, however, depends on a large amount of preprocessed and deliberately labeled data for their training. Considering the difficulty of data collection in an indoor environment, especially under the current COVID-19 epidemic, we investigate three different methods of RSSI data augmentation based on Multi-Output Gaussian Process (MOGP), i.e., by a single floor, by neighboring floors, and by a single building. Unlike Single-Output Gaussian Process (SOGP), MOGP can take into account the correlation among RSSI observations from multiple Access Points (APs) deployed close to each other (e.g., APs on the same floor of a building) by handling them collectively. The feasibility of the MOGP-based RSSI data augmentation is demonstrated through experiments based on a state-of-the-art RNN indoor localization model and UJIIndoorLoc, the most popular publicly available multi-building and multi-floor indoor localization database. The RNN model trained with the UJIIndoorLoc database augmented by fitting an MOGP model to the whole RSSI data of a building (i.e., by a single building) outperforms the other two augmentation methods as well as the RNN model trained with the original UJIIndoorLoc database, resulting in a mean three-dimensional positioning error of 8.42 m.
Recent advances in using language models to obtain cross-modal audio-text representations have overcome the limitations of conventional training approaches that use predefined labels. This has allowed the community to make progress in tasks like zero-shot classification, which would otherwise not be possible. However, learning such representations requires a large amount of human-annotated audio-text pairs. In this paper, we study unsupervised approaches to improve the learning framework of such representations with unpaired text and audio. We explore domain-unspecific and domain-specific curation methods to create audio-text pairs that we use to further improve the model. We also show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we are able to obtain significant improvement in terms of zero-shot classification performance on downstream sound event classification or acoustic scene classification tasks.
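Zero-shot classification with a shared audio-text embedding space can be sketched as a nearest-label search: the audio clip is assigned the class whose text embedding is most similar to it. The embeddings, label names, and function names below are hypothetical placeholders; in practice the vectors would come from trained audio and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(audio_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the audio embedding."""
    return max(
        label_embeddings,
        key=lambda name: cosine(audio_embedding, label_embeddings[name]),
    )

# Hypothetical text embeddings for two sound event labels.
label_embeddings = {
    "dog bark": [1.0, 0.0, 0.0],
    "rain": [0.0, 1.0, 0.0],
}
print(zero_shot_classify([0.9, 0.1, 0.0], label_embeddings))
```

No class-specific training is needed at inference time: adding a new class only requires embedding its text label.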
This paper introduces a novel computational framework for solving alternating current optimal power flow (AC OPF) problems using graphics processing units (GPUs). While GPUs have demonstrated remarkable performance in various computing domains, their application in AC OPF has been limited due to challenges associated with porting sparse automatic differentiation (AD) and sparse linear solver routines to GPUs. We aim to address these issues with two key strategies. First, we utilize a single-instruction, multiple-data (SIMD) abstraction of nonlinear programs (NLPs). This approach enables the specification of model equations while preserving their parallelizable structure, and in turn, facilitates the implementation of AD routines that can exploit such structure. Second, we employ a condensed-space interior-point method (IPM) with an inequality relaxation strategy. This technique involves relaxing equality constraints to inequalities and condensing the Karush-Kuhn-Tucker (KKT) system into a much smaller positive definite system. This strategy offers the key advantage of being able to factorize the KKT matrix without numerical pivoting, the need for which has in the past hampered the parallelization of the IPM algorithm. By combining these two strategies, we can perform the majority of operations on GPUs while keeping the data residing in the device memory only. Comprehensive numerical benchmark results showcase the substantial computational advantage of our approach. Remarkably, for solving large-scale AC OPF problems to a moderate accuracy, our implementations -- MadNLP.jl and ExaModels.jl -- running on NVIDIA GPUs achieve an order of magnitude speedup compared to state-of-the-art tools running on contemporary CPUs.
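The benefit of condensation can be illustrated on a tiny dense example. The sketch below assumes the generic textbook form K = W + Jᵀ D J, where W is a positive definite Hessian-like block, J is a constraint Jacobian, and D is a positive diagonal arising from the barrier terms. These names and the toy numbers are illustrative assumptions and do not reproduce the formulation or the API of MadNLP.jl or ExaModels.jl.

```python
import math

def condensed_matrix(W, J, d):
    """Return K = W + J^T diag(d) J for dense list-of-lists inputs."""
    n = len(W)
    K = [row[:] for row in W]
    for r, jac_row in enumerate(J):
        for i in range(n):
            for j in range(n):
                K[i][j] += d[r] * jac_row[i] * jac_row[j]
    return K

def cholesky(K):
    """Lower-triangular L with L L^T = K, computed without any pivoting.

    This only succeeds when K is positive definite, which is the property
    that condensation is meant to provide.
    """
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

# Toy data: two variables, one relaxed constraint.
W = [[2.0, 0.0], [0.0, 1.0]]
J = [[1.0, 1.0]]
d = [3.0]

K = condensed_matrix(W, J, d)   # [[5.0, 3.0], [3.0, 4.0]]
L = cholesky(K)
print(K)
print(L)
```

Because the pivot order is fixed in advance, a factorization of this kind has a predictable structure, which is what makes it easier to parallelize than a pivoted factorization of an indefinite KKT matrix.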
Neural ranking models (NRMs) have undergone significant development and have become integral components of information retrieval (IR) systems. Unfortunately, recent research has unveiled the vulnerability of NRMs to adversarial document manipulations, which could be exploited by malicious search engine optimization practitioners. While progress in adversarial attack strategies aids in identifying the potential weaknesses of NRMs before their deployment, defensive measures against such attacks, like the detection of adversarial documents, remain inadequately explored. To bridge this gap, this paper establishes a benchmark dataset to facilitate the investigation of adversarial ranking defense and introduces two types of detection tasks for adversarial documents. A comprehensive investigation of the performance of several detection baselines is conducted, which involves examining spamicity, perplexity, and linguistic acceptability, and utilizing supervised classifiers. Experimental results demonstrate that a supervised classifier can effectively mitigate known attacks, but it performs poorly against unseen attacks. Furthermore, such a classifier should avoid using query text to prevent it from learning to classify on relevance, as this might lead to the inadvertent discarding of relevant documents.
Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents. To tackle the common data scarcity issue in existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale data for learning, which uses 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesizes 4.9 million instruction-trajectory pairs using fully accessible resources on the web. Importantly, we investigate the influence of each component in this paradigm on the agent's performance and study how to adequately apply the augmented data to pre-train and fine-tune an agent. Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute with regard to the previous SoTA) to a new best of 80% single-run success rate on the R2R test split by simple imitation learning. The long-standing generalization gap between navigating in seen and unseen environments is also reduced to less than 1% (versus 8% in the previous best method). Moreover, our paradigm also enables different models to achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments.
We introduce a multi-task setup of identifying and classifying entities, relations, and coreference clusters in scientific articles. We create SciERC, a dataset that includes annotations for all three tasks, and develop a unified framework called Scientific Information Extractor (SciIE) with shared span representations. The multi-task setup reduces cascading errors between tasks and leverages cross-sentence relations through coreference links. Experiments show that our multi-task model outperforms previous models in scientific information extraction without using any domain-specific features. We further show that the framework supports construction of a scientific knowledge graph, which we use to analyze information in scientific literature.
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristics of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we proposed an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we used a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopted a conditional random field to refine the preliminary classification results generated by GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieved encouraging classification accuracy using a small amount of training data.