Data-Free Knowledge Distillation (KD) allows knowledge transfer from a trained neural network (teacher) to a more compact one (student) in the absence of original training data. Existing works use a validation set to monitor the accuracy of the student over real data and report the highest performance throughout the entire process. However, validation data may not be available at distillation time either, making it infeasible to record the student snapshot that achieved the peak accuracy. Therefore, a practical data-free KD method should be robust and ideally provide monotonically increasing student accuracy during distillation. This is challenging because the student experiences knowledge degradation due to the distribution shift of the synthetic data. A straightforward approach to overcome this issue is to store and rehearse the generated samples periodically, which increases the memory footprint and creates privacy concerns. We propose to model the distribution of the previously observed synthetic samples with a generative network. In particular, we design a Variational Autoencoder (VAE) with a training objective that is customized to learn the synthetic data representations optimally. The student is rehearsed by the generative pseudo replay technique, with samples produced by the VAE. Hence knowledge degradation can be prevented without storing any samples. Experiments on image classification benchmarks show that our method optimizes the expected value of the distilled model accuracy while eliminating the large memory overhead incurred by the sample-storing methods.
Adversarial regularization can improve model generalization in many natural language processing tasks. However, conventional approaches are computationally expensive since they need to generate a perturbation for each sample in each epoch. We propose a new adversarial regularization method ARCH (adversarial regularization with caching), where perturbations are generated and cached once every several epochs. As caching all the perturbations imposes memory usage concerns, we adopt a K-nearest neighbors-based strategy to tackle this issue. The strategy only requires caching a small amount of perturbations, without introducing additional training time. We evaluate our proposed method on a set of neural machine translation and natural language understanding tasks. We observe that ARCH significantly eases the computational burden (saves up to 70% of computational time in comparison with conventional approaches). More surprisingly, by reducing the variance of stochastic gradients, ARCH produces a notably better (in most of the tasks) or comparable model generalization. Our code is available at //github.com/SimiaoZuo/Caching-Adv.
Point clouds scanned by real-world sensors are always incomplete, irregular, and noisy, making the point cloud completion task become increasingly more important. Though many point cloud completion methods have been proposed, most of them require a large number of paired complete-incomplete point clouds for training, which is labor exhausted. In contrast, this paper proposes a novel Reconstruction-Aware Prior Distillation semi-supervised point cloud completion method named RaPD, which takes advantage of a two-stage training scheme to reduce the dependence on a large-scale paired dataset. In training stage 1, the so-called deep semantic prior is learned from both unpaired complete and unpaired incomplete point clouds using a reconstruction-aware pretraining process. While in training stage 2, we introduce a semi-supervised prior distillation process, where an encoder-decoder-based completion network is trained by distilling the prior into the network utilizing only a small number of paired training samples. A self-supervised completion module is further introduced, excavating the value of a large number of unpaired incomplete point clouds, leading to an increase in the network's performance. Extensive experiments on several widely used datasets demonstrate that RaPD, the first semi-supervised point cloud completion method, achieves superior performance to previous methods on both homologous and heterologous scenarios.
Bayesian policy reuse (BPR) is a general policy transfer framework for selecting a source policy from an offline library by inferring the task belief based on some observation signals and a trained observation model. In this paper, we propose an improved BPR method to achieve more efficient policy transfer in deep reinforcement learning (DRL). First, most BPR algorithms use the episodic return as the observation signal that contains limited information and cannot be obtained until the end of an episode. Instead, we employ the state transition sample, which is informative and instantaneous, as the observation signal for faster and more accurate task inference. Second, BPR algorithms usually require numerous samples to estimate the probability distribution of the tabular-based observation model, which may be expensive and even infeasible to learn and maintain, especially when using the state transition sample as the signal. Hence, we propose a scalable observation model based on fitting state transition functions of source tasks from only a small number of samples, which can generalize to any signals observed in the target task. Moreover, we extend the offline-mode BPR to the continual learning setting by expanding the scalable observation model in a plug-and-play fashion, which can avoid negative transfer when faced with new unknown tasks. Experimental results show that our method can consistently facilitate faster and more efficient policy transfer.
Generative Adversarial Networks (GANs) have achieved remarkable achievements in image synthesis. These successes of GANs rely on large scale datasets, requiring too much cost. With limited training data, how to stable the training process of GANs and generate realistic images have attracted more attention. The challenges of Data-Efficient GANs (DE-GANs) mainly arise from three aspects: (i) Mismatch Between Training and Target Distributions, (ii) Overfitting of the Discriminator, and (iii) Imbalance Between Latent and Data Spaces. Although many augmentation and pre-training strategies have been proposed to alleviate these issues, there lacks a systematic survey to summarize the properties, challenges, and solutions of DE-GANs. In this paper, we revisit and define DE-GANs from the perspective of distribution optimization. We conclude and analyze the challenges of DE-GANs. Meanwhile, we propose a taxonomy, which classifies the existing methods into three categories: Data Selection, GANs Optimization, and Knowledge Sharing. Last but not the least, we attempt to highlight the current problems and the future directions.
Controlled text generation tasks such as unsupervised text style transfer have increasingly adopted the use of Reinforcement Learning (RL). A major challenge in applying RL to such tasks is the sparse reward, which is available only after the full text is generated. Sparse rewards, combined with a large action space make RL training sample-inefficient and difficult to converge. Recently proposed reward-shaping strategies to address this issue have shown only negligible gains. In contrast, this work proposes a novel approach that provides dense rewards to each generated token. We evaluate our approach by its usage in unsupervised text style transfer. Averaged across datasets, our style transfer system improves upon current state-of-art systems by 21\% on human evaluation and 12\% on automatic evaluation. Upon ablated comparison with the current reward shaping approach (the `roll-out strategy'), using dense rewards improves the overall style transfer quality by 22\% based on human evaluation. Further the RL training is 2.5 times as sample efficient, and 7 times faster.
Radio access network (RAN) slicing is an important pillar in cross-domain network slicing which covers RAN, edge, transport and core slicing. The evolving network architecture requires the orchestration of multiple network resources such as radio and cache resources. In recent years, machine learning (ML) techniques have been widely applied for network management. However, most existing works do not take advantage of the knowledge transfer capability in ML. In this paper, we propose a deep transfer reinforcement learning (DTRL) scheme for joint radio and cache resource allocation to serve 5G RAN slicing. We first define a hierarchical architecture for the joint resource allocation. Then we propose two DTRL algorithms: Q-value-based deep transfer reinforcement learning (QDTRL) and action selection-based deep transfer reinforcement learning (ADTRL). In the proposed schemes, learner agents utilize expert agents' knowledge to improve their performance on target tasks. The proposed algorithms are compared with both the model-free exploration bonus deep Q-learning (EB-DQN) and the model-based priority proportional fairness and time-to-live (PPF-TTL) algorithms. Compared with EB-DQN, our proposed DTRL based method presents 21.4% lower delay for Ultra Reliable Low Latency Communications (URLLC) slice and 22.4% higher throughput for enhanced Mobile Broad Band (eMBB) slice, while achieving significantly faster convergence than EB-DQN. Moreover, 40.8% lower URLLC delay and 59.8% higher eMBB throughput are observed with respect to PPF-TTL.
Recognizing human locomotion intent and activities is important for controlling the wearable robots while walking in complex environments. However, human-robot interface signals are usually user-dependent, which causes that the classifier trained on source subjects performs poorly on new subjects. To address this issue, this paper designs the ensemble diverse hypotheses and knowledge distillation (EDHKD) method to realize unsupervised cross-subject adaptation. EDH mitigates the divergence between labeled data of source subjects and unlabeled data of target subjects to accurately classify the locomotion modes of target subjects without labeling data. Compared to previous domain adaptation methods based on the single learner, which may only learn a subset of features from input signals, EDH can learn diverse features by incorporating multiple diverse feature generators and thus increases the accuracy and decreases the variance of classifying target data, but it sacrifices the efficiency. To solve this problem, EDHKD (student) distills the knowledge from the EDH (teacher) to a single network to remain efficient and accurate. The performance of the EDHKD is theoretically proved and experimentally validated on a 2D moon dataset and two public human locomotion datasets. Experimental results show that the EDHKD outperforms all other methods. The EDHKD can classify target data with 96.9%, 94.4%, and 97.4% average accuracy on the above three datasets with a short computing time (1 ms). Compared to a benchmark (BM) method, the EDHKD increases 1.3% and 7.1% average accuracy for classifying the locomotion modes of target subjects. The EDHKD also stabilizes the learning curves. Therefore, the EDHKD is significant for increasing the generalization ability and efficiency of the human intent prediction and human activity recognition system, which will improve human-robot interactions.
Recent advances in maximizing mutual information (MI) between the source and target have demonstrated its effectiveness in text generation. However, previous works paid little attention to modeling the backward network of MI (i.e., dependency from the target to the source), which is crucial to the tightness of the variational information maximization lower bound. In this paper, we propose Adversarial Mutual Information (AMI): a text generation framework which is formed as a novel saddle point (min-max) optimization aiming to identify joint interactions between the source and target. Within this framework, the forward and backward networks are able to iteratively promote or demote each other's generated instances by comparing the real and synthetic data distributions. We also develop a latent noise sampling strategy that leverages random variations at the high-level semantic space to enhance the long term dependency in the generation process. Extensive experiments based on different text generation tasks demonstrate that the proposed AMI framework can significantly outperform several strong baselines, and we also show that AMI has potential to lead to a tighter lower bound of maximum mutual information for the variational information maximization problem.
Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.
We introduce an effective model to overcome the problem of mode collapse when training Generative Adversarial Networks (GAN). Firstly, we propose a new generator objective that finds it better to tackle mode collapse. And, we apply an independent Autoencoders (AE) to constrain the generator and consider its reconstructed samples as "real" samples to slow down the convergence of discriminator that enables to reduce the gradient vanishing problem and stabilize the model. Secondly, from mappings between latent and data spaces provided by AE, we further regularize AE by the relative distance between the latent and data samples to explicitly prevent the generator falling into mode collapse setting. This idea comes when we find a new way to visualize the mode collapse on MNIST dataset. To the best of our knowledge, our method is the first to propose and apply successfully the relative distance of latent and data samples for stabilizing GAN. Thirdly, our proposed model, namely Generative Adversarial Autoencoder Networks (GAAN), is stable and has suffered from neither gradient vanishing nor mode collapse issues, as empirically demonstrated on synthetic, MNIST, MNIST-1K, CelebA and CIFAR-10 datasets. Experimental results show that our method can approximate well multi-modal distribution and achieve better results than state-of-the-art methods on these benchmark datasets. Our model implementation is published here: //github.com/tntrung/gaan