This report describes the NPU-HC speaker verification system submitted to the O-COCOSDA Multi-lingual Speaker Verification (MSV) Challenge 2022, which focuses on developing speaker verification systems for low-resource Asian languages. We participate in the I-MSV track, which aims to develop speaker verification systems for various Indian languages. In this challenge, we first explore different neural network frameworks for low-resource speaker verification. Then we leverage vanilla fine-tuning and weight transfer fine-tuning to transfer the out-domain pre-trained models to the in-domain Indian dataset. Specifically, the weight transfer fine-tuning aims to constrain the distance of the weights between the pre-trained model and the fine-tuned model, which takes advantage of the previously acquired discriminative ability from the large-scale out-domain datasets and avoids catastrophic forgetting and overfitting at the same time. Finally, score fusion is adopted to further improve performance. Together with the above contributions, we obtain 0.223% EER on the public evaluation set, ranking 2nd place on the leaderboard. On the private evaluation set, the EER of our submitted system is 2.123% and 0.630% for the constrained and unconstrained sub-tasks of the I-MSV track, leading to the 1st and 3rd place in the ranking, respectively.
As deep neural networks (DNNs) prove their importance and feasibility, more and more DNN-based apps, such as detection and classification of objects, have been developed and deployed on autonomous vehicles (AVs). To meet their growing expectations and requirements, AVs should "optimize" use of their limited onboard computing resources for multiple concurrent in-vehicle apps while satisfying their timing requirements (especially for safety). That is, real-time AV apps should share the limited on-board resources with other concurrent apps without missing their deadlines dictated by the frame rate of a camera that generates and provides input images to the apps. However, most, if not all, of existing DNN solutions focus on enhancing the concurrency of their specific hardware without dynamically optimizing/modifying the DNN apps' resource requirements, subject to the number of running apps, owing to their high computational cost. To mitigate this limitation, we propose DynaMIX (Dynamic MIXed-precision model construction), which optimizes the resource requirement of concurrent apps and aims to maximize execution accuracy. To realize a real-time resource optimization, we formulate an optimization problem using app performance profiles to consider both the accuracy and worst-case latency of each app. We also propose dynamic model reconfiguration by lazy loading only the selected layers at runtime to reduce the overhead of loading the entire model. DynaMIX is evaluated in terms of constraint satisfaction and inference accuracy for a multi-tasking system and compared against state-of-the-art solutions, demonstrating its effectiveness and feasibility under various environmental/operating conditions.
Correctly identifying the type of file under examination is a critical part of a forensic investigation. The file type alone suggests the embedded content, such as a picture, video, manuscript, spreadsheet, etc. In cases where a system owner might desire to keep their files inaccessible or file type concealed, we propose using an adversarially-trained machine learning neural network to determine a file's true type even if the extension or file header is obfuscated to complicate its discovery. Our semi-supervised generative adversarial network (SGAN) achieved 97.6% accuracy in classifying files across 11 different types. We also compared our network against a traditional standalone neural network and three other machine learning algorithms. The adversarially-trained network proved to be the most precise file classifier especially in scenarios with few supervised samples available. Our implementation of a file classifier using an SGAN is implemented on GitHub (//ksaintg.github.io/SGAN-File-Classier).
Knowledge Extraction (KE), aiming to extract structural information from unstructured texts, often suffers from data scarcity and emerging unseen types, i.e., low-resource scenarios. Many neural approaches to low-resource KE have been widely investigated and achieved impressive performance. In this paper, we present a literature review towards KE in low-resource scenarios, and systematically categorize existing works into three paradigms: (1) exploiting higher-resource data, (2) exploiting stronger models, and (3) exploiting data and models together. In addition, we highlight promising applications and outline some potential directions for future research. We hope that our survey can help both the academic and industrial communities to better understand this field, inspire more ideas, and boost broader applications.
Simplicity bias is the concerning tendency of deep networks to over-depend on simple, weakly predictive features, to the exclusion of stronger, more complex features. This causes biased, incorrect model predictions in many real-world applications, exacerbated by incomplete training data containing spurious feature-label correlations. We propose a direct, interventional method for addressing simplicity bias in DNNs, which we call the feature sieve. We aim to automatically identify and suppress easily-computable spurious features in lower layers of the network, thereby allowing the higher network levels to extract and utilize richer, more meaningful representations. We provide concrete evidence of this differential suppression & enhancement of relevant features on both controlled datasets and real-world images, and report substantial gains on many real-world debiasing benchmarks (11.4% relative gain on Imagenet-A; 3.2% on BAR, etc). Crucially, we outperform many baselines that incorporate knowledge about known spurious or biased attributes, despite our method not using any such information. We believe that our feature sieve work opens up exciting new research directions in automated adversarial feature extraction & representation learning for deep networks.
We investigate the problem of reducing mistake severity for fine-grained classification. Fine-grained classification can be challenging, mainly due to the requirement of knowledge or domain expertise for accurate annotation. However, humans are particularly adept at performing coarse classification as it requires relatively low levels of expertise. To this end, we present a novel approach for Post-Hoc Correction called Hierarchical Ensembles (HiE) that utilizes label hierarchy to improve the performance of fine-grained classification at test-time using the coarse-grained predictions. By only requiring the parents of leaf nodes, our method significantly reduces avg. mistake severity while improving top-1 accuracy on the iNaturalist-19 and tieredImageNet-H datasets, achieving a new state-of-the-art on both benchmarks. We also investigate the efficacy of our approach in the semi-supervised setting. Our approach brings notable gains in top-1 accuracy while significantly decreasing the severity of mistakes as training data decreases for the fine-grained classes. The simplicity and post-hoc nature of HiE render it practical to be used with any off-the-shelf trained model to improve its predictions further.
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad applications in the industry. As the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years. In this paper, we conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends. We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc. We further summarize resources related to TTS (e.g., datasets, opensource implementations) and discuss future research directions. This survey can serve both academic researchers and industry practitioners working on TTS.
In this paper, we proposed to apply meta learning approach for low-resource automatic speech recognition (ASR). We formulated ASR for different languages as different tasks, and meta-learned the initialization parameters from many pretraining languages to achieve fast adaptation on unseen target language, via recently proposed model-agnostic meta learning algorithm (MAML). We evaluated the proposed approach using six languages as pretraining tasks and four languages as target tasks. Preliminary results showed that the proposed method, MetaASR, significantly outperforms the state-of-the-art multitask pretraining approach on all target languages with different combinations of pretraining languages. In addition, since MAML's model-agnostic property, this paper also opens new research direction of applying meta learning to more speech-related applications.
High spectral dimensionality and the shortage of annotations make hyperspectral image (HSI) classification a challenging problem. Recent studies suggest that convolutional neural networks can learn discriminative spatial features, which play a paramount role in HSI interpretation. However, most of these methods ignore the distinctive spectral-spatial characteristic of hyperspectral data. In addition, a large amount of unlabeled data remains an unexploited gold mine for efficient data use. Therefore, we proposed an integration of generative adversarial networks (GANs) and probabilistic graphical models for HSI classification. Specifically, we used a spectral-spatial generator and a discriminator to identify land cover categories of hyperspectral cubes. Moreover, to take advantage of a large amount of unlabeled data, we adopted a conditional random field to refine the preliminary classification results generated by GANs. Experimental results obtained using two commonly studied datasets demonstrate that the proposed framework achieved encouraging classification accuracy using a small number of data for training.
Sufficient training data is normally required to train deeply learned models. However, the number of pedestrian images per ID in person re-identification (re-ID) datasets is usually limited, since manually annotations are required for multiple camera views. To produce more data for training deeply learned models, generative adversarial network (GAN) can be leveraged to generate samples for person re-ID. However, the samples generated by vanilla GAN usually do not have labels. So in this paper, we propose a virtual label called Multi-pseudo Regularized Label (MpRL) and assign it to the generated images. With MpRL, the generated samples will be used as supplementary of real training data to train a deep model in a semi-supervised learning fashion. Considering data bias between generated and real samples, MpRL utilizes different contributions from predefined training classes. The contribution-based virtual labels are automatically assigned to generated samples to reduce ambiguous prediction in training. Meanwhile, MpRL only relies on predefined training classes without using extra classes. Furthermore, to reduce over-fitting, a regularized manner is applied to MpRL to regularize the learning process. To verify the effectiveness of MpRL, two state-of-the-art convolutional neural networks (CNNs) are adopted in our experiments. Experiments demonstrate that by assigning MpRL to generated samples, we can further improve the person re-ID performance on three datasets i.e., Market-1501, DukeMTMCreID, and CUHK03. The proposed method obtains +6.29%, +6.30% and +5.58% improvements in rank-1 accuracy over a strong CNN baseline respectively, and outperforms the state-of-the- art methods.
Training a deep architecture using a ranking loss has become standard for the person re-identification task. Increasingly, these deep architectures include additional components that leverage part detections, attribute predictions, pose estimators and other auxiliary information, in order to more effectively localize and align discriminative image regions. In this paper we adopt a different approach and carefully design each component of a simple deep architecture and, critically, the strategy for training it effectively for person re-identification. We extensively evaluate each design choice, leading to a list of good practices for person re-identification. By following these practices, our approach outperforms the state of the art, including more complex methods with auxiliary components, by large margins on four benchmark datasets. We also provide a qualitative analysis of our trained representation which indicates that, while compact, it is able to capture information from localized and discriminative regions, in a manner akin to an implicit attention mechanism.