Gait, the manner of walking, has been proven to be a reliable biometric with uses in surveillance, marketing and security. A promising new direction for the field is training gait recognition systems without explicit human annotations, through self-supervised learning approaches. Such methods are heavily reliant on strong augmentations for the same walking sequence to induce more data variability and to simulate additional walking variations. Current data augmentation schemes are heuristic and cannot provide the necessary data variation as they are only able to provide simple temporal and spatial distortions. In this work, we propose GaitMorph, a novel method to modify the walking variation for an input gait sequence. Our method entails the training of a high-compression model for gait skeleton sequences that leverages unlabelled data to construct a discrete and interpretable latent space, which preserves identity-related features. Furthermore, we propose a method based on optimal transport theory to learn latent transport maps on the discrete codebook that morph gait sequences between variations. We perform extensive experiments and show that our method is suitable to synthesize additional views for an input sequence.
Deep Neural Networks (DNNs) have drawn attention because of their outstanding performance on various tasks. However, deploying full-fledged DNNs in resource-constrained devices (edge, mobile, IoT) is difficult due to their large size. To overcome the issue, various approaches are considered, like offloading part of the computation to the cloud for final inference (split computing) or performing the inference at an intermediary layer without passing through all layers (early exits). In this work, we propose combining both approaches by using early exits in split computing. In our approach, we decide up to what depth of DNNs computation to perform on the device (splitting layer) and whether a sample can exit from this layer or need to be offloaded. The decisions are based on a weighted combination of accuracy, computational, and communication costs. We develop an algorithm named SplitEE to learn an optimal policy. Since pre-trained DNNs are often deployed in new domains where the ground truths may be unavailable and samples arrive in a streaming fashion, SplitEE works in an online and unsupervised setup. We extensively perform experiments on five different datasets. SplitEE achieves a significant cost reduction ($>50\%$) with a slight drop in accuracy ($<2\%$) as compared to the case when all samples are inferred at the final layer. The anonymized source code is available at \url{//anonymous.4open.science/r/SplitEE_M-B989/README.md}.
Recent years have witnessed many successful trials in the robot learning field. For contact-rich robotic tasks, it is challenging to learn coordinated motor skills by reinforcement learning. Imitation learning solves this problem by using a mimic reward to encourage the robot to track a given reference trajectory. However, imitation learning is not so efficient and may constrain the learned motion. In this paper, we propose instruction learning, which is inspired by the human learning process and is highly efficient, flexible, and versatile for robot motion learning. Instead of using a reference signal in the reward, instruction learning applies a reference signal directly as a feedforward action, and it is combined with a feedback action learned by reinforcement learning to control the robot. Besides, we propose the action bounding technique and remove the mimic reward, which is shown to be crucial for efficient and flexible learning. We compare the performance of instruction learning with imitation learning, indicating that instruction learning can greatly speed up the training process and guarantee learning the desired motion correctly. The effectiveness of instruction learning is validated through a bunch of motion learning examples for a biped robot and a quadruped robot, where skills can be learned typically within several million steps. Besides, we also conduct sim-to-real transfer and online learning experiments on a real quadruped robot. Instruction learning has shown great merits and potential, making it a promising alternative for imitation learning.
With the rapid development of AI technology, we have witnessed numerous innovations and conveniences. However, along with these advancements come privacy threats and risks. Fully Homomorphic Encryption (FHE) emerges as a key technology for privacy-preserving computation, enabling computations while maintaining data privacy. Nevertheless, FHE has limitations in processing continuous non-polynomial functions as it is restricted to discrete integers and supports only addition and multiplication. Spiking Neural Networks (SNNs) operate on discrete spike signals, naturally aligning with the properties of FHE. In this paper, we present a framework called FHE-DiCSNN. This framework is based on the efficient TFHE scheme and leverages the discrete properties of SNNs to achieve high prediction performance on ciphertexts. Firstly, by employing bootstrapping techniques, we successfully implement computations of the Leaky Integrate-and-Fire neuron model on ciphertexts. Through bootstrapping, we can facilitate computations for SNNs of arbitrary depth. This framework can be extended to other spiking neuron models, providing a novel framework for the homomorphic evaluation of SNNs. Secondly, inspired by CNNs, we adopt convolutional methods to replace Poisson encoding. This not only enhances accuracy but also mitigates the issue of prolonged simulation time caused by random encoding. Furthermore, we employ engineering techniques to parallelize the computation of bootstrapping, resulting in a significant improvement in computational efficiency. Finally, we evaluate our model on the MNIST dataset. Experimental results demonstrate that, with the optimal parameter configuration, FHE-DiCSNN achieves an accuracy of 97.94% on ciphertexts, with a loss of only 0.53% compared to the original network's accuracy of 98.47%. Moreover, each prediction requires only 0.75 seconds of computation time
Modular Reconfigurable Robots (MRRs) represent an exciting path forward for industrial robotics, opening up new possibilities for robot design. Compared to monolithic manipulators, they promise greater flexibility, improved maintainability, and cost-efficiency. However, there is no tool or standardized way to model and simulate assemblies of modules in the same way it has been done for robotic manipulators for decades. We introduce the Toolbox for Industrial Modular Robotics (Timor), a Python toolbox to bridge this gap and integrate modular robotics into existing simulation and optimization pipelines. Our open-source library offers model generation and task-based configuration optimization for MRRs. It can easily be integrated with existing simulation tools - not least by offering URDF export of arbitrary modular robot assemblies. Moreover, our experimental study demonstrates the effectiveness of Timor as a tool for designing modular robots optimized for specific use cases.
Artificial intelligence has been applied in various aspects of online education to facilitate teaching and learning. However, few approaches has been made toward a complete AI-powered tutoring system. In this work, we explore the development of a full-fledged intelligent tutoring system powered by state-of-the-art large language models (LLMs), covering automatic course planning and adjusting, tailored instruction, and flexible quiz evaluation. To make the system robust to prolonged interaction and cater to individualized education, the system is decomposed into three inter-connected core processes-interaction, reflection, and reaction. Each process is implemented by chaining LLM-powered tools along with dynamically updated memory modules. Tools are LLMs prompted to execute one specific task at a time, while memories are data storage that gets updated during education process. Statistical results from learning logs demonstrate the effectiveness and mechanism of each tool usage. Subjective feedback from human users reveal the usability of each function, and comparison with ablation systems further testify the benefits of the designed processes in long-term interaction.
Speech Emotion Recognition (SER) is a critical enabler of emotion-aware communication in human-computer interactions. Recent advancements in Deep Learning (DL) have substantially enhanced the performance of SER models through increased model complexity. However, designing optimal DL architectures requires prior experience and experimental evaluations. Encouragingly, Neural Architecture Search (NAS) offers a promising avenue to automatically determine an optimal DL model. In particular, Differentiable Architecture Search (DARTS) is an efficient method of using NAS to search for optimised models. This paper proposes emoDARTS, a DARTS-optimised joint CNN and LSTM architecture, to improve SER performance, where the literature informs the selection of CNN and LSTM coupling to offer improved performance. While DARTS has previously been applied to CNN and LSTM combinations, our approach introduces a novel mechanism, particularly in selecting CNN operations using DARTS. In contrast to previous studies, we refrain from imposing constraints on the layer order for the CNN within the DARTS cell; instead, we allow DARTS to determine the optimal layer order autonomously. Experimenting with the IEMOCAP and MSP-IMPROV datasets, we demonstrate that emoDARTS achieves significantly higher SER accuracy than hand-engineering the CNN-LSTM configuration. It also outperforms the best-reported SER results achieved using DARTS on CNN-LSTM.
Accurate dietary intake estimation is critical for informing policies and programs to support healthy eating, as malnutrition has been directly linked to decreased quality of life. However self-reporting methods such as food diaries suffer from substantial bias. Other conventional dietary assessment techniques and emerging alternative approaches such as mobile applications incur high time costs and may necessitate trained personnel. Recent work has focused on using computer vision and machine learning to automatically estimate dietary intake from food images, but the lack of comprehensive datasets with diverse viewpoints, modalities and food annotations hinders the accuracy and realism of such methods. To address this limitation, we introduce NutritionVerse-Synth, the first large-scale dataset of 84,984 photorealistic synthetic 2D food images with associated dietary information and multimodal annotations (including depth images, instance masks, and semantic masks). Additionally, we collect a real image dataset, NutritionVerse-Real, containing 889 images of 251 dishes to evaluate realism. Leveraging these novel datasets, we develop and benchmark NutritionVerse, an empirical study of various dietary intake estimation approaches, including indirect segmentation-based and direct prediction networks. We further fine-tune models pretrained on synthetic data with real images to provide insights into the fusion of synthetic and real data. Finally, we release both datasets (NutritionVerse-Synth, NutritionVerse-Real) on //www.kaggle.com/nutritionverse/datasets as part of an open initiative to accelerate machine learning for dietary sensing.
Transformer is a promising neural network learner, and has achieved great success in various machine learning tasks. Thanks to the recent prevalence of multimodal applications and big data, Transformer-based multimodal learning has become a hot topic in AI research. This paper presents a comprehensive survey of Transformer techniques oriented at multimodal data. The main contents of this survey include: (1) a background of multimodal learning, Transformer ecosystem, and the multimodal big data era, (2) a theoretical review of Vanilla Transformer, Vision Transformer, and multimodal Transformers, from a geometrically topological perspective, (3) a review of multimodal Transformer applications, via two important paradigms, i.e., for multimodal pretraining and for specific multimodal tasks, (4) a summary of the common challenges and designs shared by the multimodal Transformer models and applications, and (5) a discussion of open problems and potential research directions for the community.
Graph neural networks (GNNs) have been a hot spot of recent research and are widely utilized in diverse applications. However, with the use of huger data and deeper models, an urgent demand is unsurprisingly made to accelerate GNNs for more efficient execution. In this paper, we provide a comprehensive survey on acceleration methods for GNNs from an algorithmic perspective. We first present a new taxonomy to classify existing acceleration methods into five categories. Based on the classification, we systematically discuss these methods and highlight their correlations. Next, we provide comparisons from aspects of the efficiency and characteristics of these methods. Finally, we suggest some promising prospects for future research.
It has been shown that deep neural networks are prone to overfitting on biased training data. Towards addressing this issue, meta-learning employs a meta model for correcting the training bias. Despite the promising performances, super slow training is currently the bottleneck in the meta learning approaches. In this paper, we introduce a novel Faster Meta Update Strategy (FaMUS) to replace the most expensive step in the meta gradient computation with a faster layer-wise approximation. We empirically find that FaMUS yields not only a reasonably accurate but also a low-variance approximation of the meta gradient. We conduct extensive experiments to verify the proposed method on two tasks. We show our method is able to save two-thirds of the training time while still maintaining the comparable or achieving even better generalization performance. In particular, our method achieves the state-of-the-art performance on both synthetic and realistic noisy labels, and obtains promising performance on long-tailed recognition on standard benchmarks.