Inertial sensor-based human activity recognition (HAR) is the basis of many human-centered mobile applications. Deep learning-based fine-grained HAR models enable accurate classification in various complex application scenarios. Nevertheless, the large storage and computational overheads of existing fine-grained deep HAR models hinder their widespread deployment on resource-limited platforms. Inspired by knowledge distillation's capacity for model compression and potential performance improvement, we design a multi-level HAR modeling pipeline called Stage-Logits-Memory Distillation (SMLDist) based on the widely-used MobileNet. By paying more attention to frequency-related features during the distillation process, SMLDist improves the HAR classification robustness of the students. We also propose an auto-search mechanism in the heterogeneous classifiers to improve classification performance. Extensive simulation results demonstrate that SMLDist outperforms various state-of-the-art HAR frameworks in accuracy and macro F1 score. A practical evaluation on the Jetson Xavier AGX platform shows that the SMLDist model is both energy-efficient and computation-efficient. These experiments validate the reasonable balance between the robustness and efficiency of the proposed model. Comparative knowledge distillation experiments on six public datasets also demonstrate that SMLDist outperforms other advanced knowledge distillation methods in terms of student performance, which verifies that SMLDist generalizes well to other classification tasks, including but not limited to HAR.
Fine-Grained Visual Recognition (FGVR) tackles the problem of distinguishing highly similar categories. One of the main approaches to FGVR, namely subset learning, tries to leverage information from existing class taxonomies to improve the performance of deep neural networks. However, these methods rely on the existence of handcrafted hierarchies that are not necessarily optimal for the models. In this paper, we propose ELFIS, an expert learning framework for FGVR that clusters categories of the dataset into meta-categories using both dataset-inherent lexical and model-specific information. A set of neural network-based experts is trained on the meta-categories and integrated into a multi-task framework. Extensive experimentation shows improvements of up to +1.3% accuracy on SoTA FGVR benchmarks using both CNNs and transformer-based networks. Overall, the results show that ELFIS can be applied on top of any classification model, enabling it to reach SoTA results. The source code will be made public soon.
Facial expression recognition in the wild is essential for various interactive computing domains. In particular, "Emotional Reaction Intensity" (ERI) is an important topic in the facial expression recognition task. In this paper, we propose a multi-emotional task learning-based approach and present preliminary results for the ERI challenge introduced in the 5th affective behavior analysis in-the-wild (ABAW) competition. Our method achieved a mean PCC score of 0.3254.
To enhance the efficiency of incident response triage operations, it is not cost-effective to defend all systems equally in a complex cyber environment. Instead, prioritizing the defense of critical functionality and the most vulnerable systems is desirable. Threat intelligence is crucial for guiding SOC analysts' focus toward specific system activity and provides the primary contextual foundation for interpreting security alerts. This paper explores novel approaches for improving incident response triage operations, including for ransomware attacks and zero-day malware. Rapid prioritization of different ransomware is needed to formulate fast response plans that minimize the socioeconomic damage from the massive growth of ransomware attacks in recent years; it can also be extended to other incident responses. To address this concern, we propose a ransomware triage approach that can rapidly classify and prioritize different ransomware classes. We utilize a pre-trained ResNet18 network in a Siamese Neural Network (SNN) architecture to reduce biases in the weights and parameters. In addition, our approach uses entropy features obtained directly from the binary ransomware files to improve the feature representation; these features are resilient to obfuscation noise and computationally less expensive. Our evaluation also shows that the classification part of the proposed approach achieves an accuracy exceeding .... and outperforms other similar classification approaches. This new triage strategy, based on Task memory with meta-learning, evaluates the level of similarity matching across ransomware classes to identify any risky and unknown ransomware (e.g., zero-day attacks) so that the systems that support critical functionality can be defended.
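The entropy features mentioned above can be illustrated with a minimal sketch. It computes the Shannon entropy of raw bytes over fixed-size windows; the function names and the 256-byte window size are assumptions for illustration, not details taken from the abstract.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (between 0 and 8)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    counts = Counter(chunk)
    # Sum -p * log2(p) over the byte values that actually occur.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def entropy_profile(data: bytes, window: int = 256) -> List[float]:
    """Entropy of each non-overlapping window (window size is an assumption)."""
    return [
        shannon_entropy(data[i:i + window])
        for i in range(0, len(data), window)
    ]


if __name__ == "__main__":
    # Hypothetical input: a constant block followed by a uniformly spread block.
    sample = bytes(256) + bytes(range(256))
    print(entropy_profile(sample))
```

Packed or encrypted regions of a binary tend to show entropy close to 8 bits per byte, which is one reason such a profile can remain informative when the code itself is obfuscated.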
Low-resolution face recognition (LRFR) has become a challenging problem for modern deep face recognition systems. Existing methods mainly leverage prior information from high-resolution (HR) images by either reconstructing facial details with super-resolution techniques or learning a unified feature space. To address this issue, this paper proposes a novel approach which enforces the network to focus on the discriminative information stored in the low-frequency components of a low-resolution (LR) image. A cross-resolution knowledge distillation paradigm is first employed as the learning framework. An identity-preserving network, WaveResNet, and a wavelet similarity loss are then designed to capture low-frequency details and boost performance. Finally, an image degradation model is conceived to simulate more realistic LR training data. Consequently, extensive experimental results show that the proposed method consistently outperforms the baseline model and other state-of-the-art methods across a variety of image resolutions.
Image-based table recognition is a challenging task due to the diversity of table styles and the complexity of table structures. Most previous methods take a non-end-to-end approach that divides the problem into two separate sub-problems, table structure recognition and cell-content recognition, and then attempt to solve each sub-problem independently using two separate systems. In this paper, we propose an end-to-end multi-task learning model for image-based table recognition. The proposed model consists of one shared encoder, one shared decoder, and three separate decoders which are used for learning three sub-tasks of table recognition: table structure recognition, cell detection, and cell-content recognition. The whole system can be easily trained and run at inference time in an end-to-end manner. In the experiments, we evaluate the performance of the proposed model on two large-scale datasets: FinTabNet and PubTabNet. The experimental results show that the proposed model outperforms the state-of-the-art methods on all benchmark datasets.
Time series generation (TSG) studies have mainly focused on the use of Generative Adversarial Networks (GANs) combined with recurrent neural network (RNN) variants. However, the fundamental limitations and challenges of training GANs still remain. In addition, the RNN family typically has difficulties with temporal consistency between distant timesteps. Motivated by the successes in the image generation (IMG) domain, we propose TimeVQVAE, the first work, to our knowledge, that uses vector quantization (VQ) techniques to address the TSG problem. Moreover, the priors of the discrete latent spaces are learned with bidirectional transformer models that can better capture global temporal consistency. We also propose VQ modeling in a time-frequency domain, separated into low-frequency (LF) and high-frequency (HF) components. This allows us to retain important characteristics of the time series and, in turn, generate new synthetic signals that are of better quality, with sharper changes in modularity, than those of competing TSG methods. Our experimental evaluation is conducted on all datasets from the UCR archive, using well-established metrics in the IMG literature, such as Fréchet inception distance and inception scores. Our implementation is available on GitHub: \url{//github.com/ML4ITS/TimeVQVAE}.
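The core vector quantization step referred to above can be sketched as a nearest-neighbour lookup in a codebook. This is a generic illustration of VQ, not the TimeVQVAE implementation; the function names and the toy codebook are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    indices = []
    for z in latents:
        distances = [squared_distance(z, code) for code in codebook]
        indices.append(min(range(len(codebook)), key=distances.__getitem__))
    return indices


def dequantize(indices: Sequence[int], codebook: Sequence[Vector]) -> List[Vector]:
    """Replace each discrete index with its codebook vector."""
    return [codebook[i] for i in indices]


if __name__ == "__main__":
    # Hypothetical two-entry codebook and two latent vectors.
    codebook = [[0.0, 0.0], [1.0, 1.0]]
    latents = [[0.1, 0.0], [0.9, 1.1]]
    tokens = quantize(latents, codebook)
    print(tokens, dequantize(tokens, codebook))
```

The resulting discrete indices are the kind of tokens over which a prior model, such as the bidirectional transformer mentioned in the abstract, can be learned.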
Emotion recognition in conversation (ERC) aims to detect the emotion label for each utterance. Motivated by recent studies which have shown that feeding training examples in a meaningful order, rather than randomly, can boost the performance of models, we propose an ERC-oriented hybrid curriculum learning framework. Our framework consists of two curricula: (1) conversation-level curriculum (CC); and (2) utterance-level curriculum (UC). In CC, we construct a difficulty measurer based on "emotion shift" frequency within a conversation; the conversations are then scheduled in an "easy to hard" schema according to the difficulty score returned by the difficulty measurer. UC is implemented from an emotion-similarity perspective, which progressively strengthens the model's ability to identify confusing emotions. With the proposed model-agnostic hybrid curriculum learning strategy, we observe significant performance boosts over a wide range of existing ERC models and achieve new state-of-the-art results on four public ERC datasets.
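The conversation-level curriculum can be illustrated with a small sketch. It assumes the simplest reading of "emotion shift" frequency, namely the fraction of adjacent utterance pairs whose labels differ; the exact score used in the paper may be defined differently, and the function names here are hypothetical.

```python
from typing import List, Sequence


def emotion_shift_frequency(labels: Sequence[str]) -> float:
    """Fraction of adjacent utterance pairs whose emotion labels differ."""
    if len(labels) < 2:
        return 0.0
    shifts = sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)
    return shifts / (len(labels) - 1)


def easy_to_hard(conversations: List[Sequence[str]]) -> List[Sequence[str]]:
    """Order conversations by ascending difficulty score (stable sort)."""
    return sorted(conversations, key=emotion_shift_frequency)


if __name__ == "__main__":
    # Hypothetical conversations, each given as a list of utterance labels.
    conversations = [
        ["joy", "anger", "joy"],
        ["joy", "joy", "joy"],
        ["joy", "joy", "anger"],
    ]
    for conv in easy_to_hard(conversations):
        print(emotion_shift_frequency(conv), conv)
```

A training schedule would then draw from the front of this ordering first and gradually admit the harder conversations.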
Unsupervised domain adaptation has recently emerged as an effective paradigm for generalizing deep neural networks to new target domains. However, there is still enormous potential to be tapped to reach fully supervised performance. In this paper, we present a novel active learning strategy to assist knowledge transfer in the target domain, dubbed active domain adaptation. We start from the observation that energy-based models exhibit free energy biases when training (source) and test (target) data come from different distributions. Inspired by this inherent mechanism, we empirically show that a simple yet efficient energy-based sampling strategy selects more valuable target samples than existing approaches that require particular architectures or the computation of distances. Our algorithm, Energy-based Active Domain Adaptation (EADA), queries groups of target data that incorporate both domain characteristics and instance uncertainty in every selection round. Meanwhile, by compactly aligning the free energy of target data around the source domain via a regularization term, the domain gap can be implicitly diminished. Through extensive experiments, we show that EADA surpasses state-of-the-art methods on well-known challenging benchmarks with substantial improvements, making it a useful option in the open world. Code is available at //github.com/BIT-DA/EADA.
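The free energy on which this sampling strategy relies can be sketched as follows. The sketch uses the standard energy-based view of a classifier, where F(x) = -T log sum_k exp(f_k(x) / T) over the logits f_k(x), and simply ranks target samples by free energy. It is a simplified illustration, not the full EADA selection rule, which also accounts for instance uncertainty; the function names and example logits are hypothetical.

```python
import math
from typing import List, Sequence


def free_energy(logits: Sequence[float], temperature: float = 1.0) -> float:
    """F(x) = -T * log(sum_k exp(logit_k / T)), computed in a stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum to avoid overflow in exp
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp


def select_highest_energy(
    batch_logits: Sequence[Sequence[float]], budget: int
) -> List[int]:
    """Indices of the `budget` samples with the highest free energy."""
    energies = [free_energy(logits) for logits in batch_logits]
    order = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    return order[:budget]


if __name__ == "__main__":
    # Hypothetical logits for three unlabeled target samples.
    target_logits = [[10.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
    print(select_highest_energy(target_logits, budget=2))
```

Samples with high free energy are those the source-trained model scores as least compatible with its training distribution, which is why they are natural candidates for annotation.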
Multi-Task Learning (MTL) is a learning paradigm in machine learning and its aim is to leverage useful information contained in multiple related tasks to help improve the generalization performance of all the tasks. In this paper, we give a survey for MTL from the perspective of algorithmic modeling, applications and theoretical analyses. For algorithmic modeling, we give a definition of MTL and then classify different MTL algorithms into five categories, including feature learning approach, low-rank approach, task clustering approach, task relation learning approach and decomposition approach as well as discussing the characteristics of each approach. In order to improve the performance of learning tasks further, MTL can be combined with other learning paradigms including semi-supervised learning, active learning, unsupervised learning, reinforcement learning, multi-view learning and graphical models. When the number of tasks is large or the data dimensionality is high, we review online, parallel and distributed MTL models as well as dimensionality reduction and feature hashing to reveal their computational and storage advantages. Many real-world applications use MTL to boost their performance and we review representative works in this paper. Finally, we present theoretical analyses and discuss several future directions for MTL.
Deep neural networks have achieved remarkable success in computer vision tasks. Existing neural networks mainly operate in the spatial domain with fixed input sizes. For practical applications, images are usually large and have to be downsampled to the predetermined input size of neural networks. Even though downsampling operations reduce computation and the required communication bandwidth, they remove both redundant and salient information obliviously, which results in accuracy degradation. Inspired by digital signal processing theories, we analyze the spectral bias from the frequency perspective and propose a learning-based frequency selection method to identify the trivial frequency components which can be removed without accuracy loss. The proposed method of learning in the frequency domain leverages identical structures of well-known neural networks, such as ResNet-50, MobileNetV2, and Mask R-CNN, while accepting frequency-domain information as the input. Experimental results show that learning in the frequency domain with static channel selection can achieve higher accuracy than the conventional spatial downsampling approach while further reducing the input data size. Specifically, for ImageNet classification with the same input size, the proposed method achieves 1.41% and 0.66% top-1 accuracy improvements on ResNet-50 and MobileNetV2, respectively. Even with half the input size, the proposed method still improves the top-1 accuracy on ResNet-50 by 1%. In addition, we observe a 0.8% average precision improvement on Mask R-CNN for instance segmentation on the COCO dataset.