Speaker extraction and diarization are two crucial enabling techniques for speech applications. Speaker extraction aims to extract a target speaker's voice from a multi-talk mixture, while speaker diarization demarcates speech segments by speaker, identifying `who spoke when'. The previous studies have typically treated the two tasks independently. However, the two tasks share a similar objective, that is to disentangle the speakers in the spectral domain for the former but in the temporal domain for the latter. It is logical to believe that the speaker turns obtained from speaker diarization can benefit speaker extraction, while the extracted speech offers more accurate speaker turns than the mixture speech. In this paper, we propose a unified framework called Universal Speaker Extraction and Diarization (USED). We extend the existing speaker extraction model to simultaneously extract the waveforms of all speakers. We also employ a scenario-aware differentiated loss function to address the problem of sparsely overlapped speech in real-world conversations. We show that the USED model significantly outperforms the baselines for both speaker extraction and diarization tasks, in both highly overlapped and sparsely overlapped scenarios. Audio samples are available at //ajyy.github.io/demo/USED/.
The artist similarity quest has become a crucial subject in social and scientific contexts. Modern research solutions facilitate music discovery according to user tastes. However, defining similarity among artists may involve several aspects, even related to a subjective perspective, and it often affects a recommendation. This paper presents GATSY, a recommendation system built upon graph attention networks and driven by a clusterized embedding of artists. The proposed framework takes advantage of a graph topology of the input data to achieve outstanding performance results without relying heavily on hand-crafted features. This flexibility allows us to introduce fictitious artists in a music dataset, create bridges to previously unrelated artists, and get recommendations conditioned by possibly heterogeneous sources. Experimental results prove the effectiveness of the proposed method with respect to state-of-the-art solutions.
In contrastive self-supervised learning, positive samples are typically drawn from the same image but in different augmented views, resulting in a relatively limited source of positive samples. An effective way to alleviate this problem is to incorporate the relationship between samples, which involves including the top-k nearest neighbors of positive samples in the framework. However, the problem of false neighbors (i.e., neighbors that do not belong to the same category as the positive sample) is an objective but often overlooked challenge due to the query of neighbor samples without human supervision. In this paper, we present a simple Self-supervised learning framework called Mixed Nearest-Neighbors for Self-Supervised Learning (MNN). MNN optimizes the influence of neighbor samples on the semantics of positive samples through an intuitive weighting approach and image mixture operations. The results of our study demonstrate that MNN exhibits exceptional generalization performance and training efficiency on four benchmark datasets.
ChipNeMo aims to explore the applications of large language models (LLMs) for industrial chip design. Instead of directly deploying off-the-shelf commercial or open-source LLMs, we instead adopt the following domain adaptation techniques: custom tokenizers, domain-adaptive continued pretraining, supervised fine-tuning (SFT) with domain-specific instructions, and domain-adapted retrieval models. We evaluate these methods on three selected LLM applications for chip design: an engineering assistant chatbot, EDA script generation, and bug summarization and analysis. Our results show that these domain adaptation techniques enable significant LLM performance improvements over general-purpose base models across the three evaluated applications, enabling up to 5x model size reduction with similar or better performance on a range of design tasks. Our findings also indicate that there's still room for improvement between our current results and ideal outcomes. We believe that further investigation of domain-adapted LLM approaches will help close this gap in the future.
Feature extraction and matching are the basic parts of many robotic vision tasks, such as 2D or 3D object detection, recognition, and registration. As known, 2D feature extraction and matching have already been achieved great success. Unfortunately, in the field of 3D, the current methods fail to support the extensive application of 3D LiDAR sensors in robotic vision tasks, due to the poor descriptiveness and inefficiency. To address this limitation, we propose a novel 3D feature representation method: Linear Keypoints representation for 3D LiDAR point cloud, called LinK3D. The novelty of LinK3D lies in that it fully considers the characteristics (such as the sparsity, and complexity of scenes) of LiDAR point clouds, and represents the keypoint with its robust neighbor keypoints, which provide strong distinction in the description of the keypoint. The proposed LinK3D has been evaluated on two public datasets (i.e., KITTI, Steven VLP16), and the experimental results show that our method greatly outperforms the state-of-the-art in matching performance. More importantly, LinK3D shows excellent real-time performance, faster than the sensor frame rate at 10 Hz of a typical rotating LiDAR sensor. LinK3D only takes an average of 32 milliseconds to extract features from the point cloud collected by a 64-beam LiDAR, and takes merely about 8 milliseconds to match two LiDAR scans when executed in a notebook with an Intel Core i7 @2.2 GHz processor. Moreover, our method can be widely extended to various 3D vision applications. In this paper, we apply the proposed LinK3D to the LiDAR odometry and place recognition task of LiDAR SLAM. The experimental results show that our method can improve the efficiency and accuracy of LiDAR SLAM system.
Prevalent supervised learning methods in natural language processing (NLP) are notoriously data-hungry, which demand large amounts of high-quality annotated data. In practice, acquiring such data is a costly endeavor. Recently, the superior few-shot performance of large language models (LLMs) has propelled the development of dataset generation, where the training data are solely synthesized from LLMs. However, such an approach usually suffers from low-quality issues, and requires orders of magnitude more labeled data to achieve satisfactory performance. To fully exploit the potential of LLMs and make use of massive unlabeled data, we propose LLMaAA, which takes LLMs as annotators and puts them into an active learning loop to determine what to annotate efficiently. To learn robustly with pseudo labels, we optimize both the annotation and training processes: (1) we draw k-NN examples from a small demonstration pool as in-context examples, and (2) we adopt the example reweighting technique to assign training samples with learnable weights. Compared with previous approaches, LLMaAA features both efficiency and reliability. We conduct experiments and analysis on two classic NLP tasks, named entity recognition and relation extraction. With LLMaAA, task-specific models trained from LLM-generated labels can outperform the teacher within only hundreds of annotated examples, which is much more cost-effective than other baselines.
Multi-scale features are essential for dense prediction tasks, such as object detection, instance segmentation, and semantic segmentation. The prevailing methods usually utilize a classification backbone to extract multi-scale features and then fuse these features using a lightweight module (e.g., the fusion module in FPN and BiFPN, two typical object detection methods). However, as these methods allocate most computational resources to the classification backbone, the multi-scale feature fusion in these methods is delayed, which may lead to inadequate feature fusion. While some methods perform feature fusion from early stages, they either fail to fully leverage high-level features to guide low-level feature learning or have complex structures, resulting in sub-optimal performance. We propose a streamlined cascade encoder-decoder network, dubbed CEDNet, tailored for dense \mbox{prediction} tasks. All stages in CEDNet share the same encoder-decoder structure and perform multi-scale feature fusion within the decoder. A hallmark of CEDNet is its ability to incorporate high-level features from early stages to guide low-level feature learning in subsequent stages, thereby enhancing the effectiveness of multi-scale feature fusion. We explored three well-known encoder-decoder structures: Hourglass, UNet, and FPN. When integrated into CEDNet, they performed much better than traditional methods that use a pre-designed classification backbone combined with a lightweight fusion module. Extensive experiments on object detection, instance segmentation, and semantic segmentation demonstrated the effectiveness of our method. The code is available at //github.com/zhanggang001/CEDNet.
Static and dynamic computational graphs represent two distinct approaches to constructing deep learning frameworks. The former prioritizes compiler-based optimizations, while the latter focuses on programmability and user-friendliness. The recent release of PyTorch 2.0, which supports compiling arbitrary deep learning programs in Python, signifies a new direction in the evolution of deep learning infrastructure to incorporate compiler techniques in a more dynamic manner and support more dynamic language features like dynamic control flows and closures. Given PyTorch's seamless integration with Python, its compiler aims to support arbitrary deep learning code written in Python. However, the inherent dynamism of Python poses challenges to the completeness and robustness of the compiler. While recent research has introduced fuzzing to test deep learning compilers, there is still a lack of comprehensive analysis on how to test dynamic features. To address this issue, we propose several code transformations to generate test cases involving dynamic features. These transformations preserve the program's semantics, ensuring that any discrepancy between the transformed and original programs indicates the presence of a bug. Through our approach, we have successfully identified twenty previously unknown bugs in the PyTorch compiler and its underlying tensor compiler Triton.
Causal Machine Learning (CausalML) is an umbrella term for machine learning methods that formalize the data-generation process as a structural causal model (SCM). This allows one to reason about the effects of changes to this process (i.e., interventions) and what would have happened in hindsight (i.e., counterfactuals). We categorize work in \causalml into five groups according to the problems they tackle: (1) causal supervised learning, (2) causal generative modeling, (3) causal explanations, (4) causal fairness, (5) causal reinforcement learning. For each category, we systematically compare its methods and point out open problems. Further, we review modality-specific applications in computer vision, natural language processing, and graph representation learning. Finally, we provide an overview of causal benchmarks and a critical discussion of the state of this nascent field, including recommendations for future work.
The design of deep graph models still remains to be investigated and the crucial part is how to explore and exploit the knowledge from different hops of neighbors in an efficient way. In this paper, we propose a novel RNN-like deep graph neural network architecture by incorporating AdaBoost into the computation of network; and the proposed graph convolutional network called AdaGCN~(AdaBoosting Graph Convolutional Network) has the ability to efficiently extract knowledge from high-order neighbors and integrate knowledge from different hops of neighbors into the network in an AdaBoost way. We also present the architectural difference between AdaGCN and existing graph convolutional methods to show the benefits of our proposal. Finally, extensive experiments demonstrate the state-of-the-art prediction performance and the computational advantage of our approach AdaGCN.
With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, natural language inference, sentiment analysis, and document ranking.