Foundation models are becoming the dominant deep learning technology. Pretraining a foundation model is highly time-consuming due to the large scale of both the model parameters and the training dataset. Besides being compute-intensive, the training process is extremely memory- and communication-intensive. These properties make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism, and tensor model parallelism, to achieve high training efficiency. To this end, custom software frameworks such as Megatron-LM and DeepSpeed have been developed. However, current 3D parallelism frameworks still face two issues: i) they are not transparent to model developers, who must manually modify the model to parallelize training; ii) they do not fully utilize computation, GPU memory, and network bandwidth. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak deploys automatically via a model partitioner that applies a graph-sharding algorithm to a proxy representation of the model, and it provides a non-intrusive API for scaling out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine in Merak. It uses several techniques to exploit the available training resources: a shifted critical path pipeline schedule that increases computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication with computation. Experiments on 64 GPUs show that, compared with state-of-the-art 3D parallelism frameworks, Merak speeds up training of models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42x, 1.39x, 1.43x, and 1.61x, respectively.
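As a rough illustration of what the three parallelism dimensions mean in practice (our own sketch, not Merak's actual API; the function name and rank layout are hypothetical), the following shows how a pool of workers is factored into data-parallel, pipeline-parallel, and tensor-parallel groups:

```python
# Illustrative sketch of 3D parallelism rank assignment (hypothetical,
# not Merak's API): factor `world_size` ranks into DP x PP x TP groups.

def build_3d_groups(world_size: int, pp: int, tp: int):
    """Assign each rank a (data, pipeline, tensor) coordinate."""
    assert world_size % (pp * tp) == 0, "pp * tp must divide world_size"
    dp = world_size // (pp * tp)
    coords = {}
    for rank in range(world_size):
        tp_idx = rank % tp                # fastest-varying: tensor-parallel peer
        pp_idx = (rank // tp) % pp        # then: pipeline stage
        dp_idx = rank // (tp * pp)        # slowest: data-parallel replica
        coords[rank] = (dp_idx, pp_idx, tp_idx)
    return dp, coords

if __name__ == "__main__":
    dp, coords = build_3d_groups(world_size=64, pp=4, tp=2)
    print(dp)            # 8 data-parallel replicas on 64 GPUs
    print(coords[63])    # (7, 3, 1): last replica, last stage, second shard
```

With 64 GPUs, a pipeline degree of 4 and a tensor degree of 2 leave a data-parallel degree of 8; the product of the three degrees always equals the worker count.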
We present a novel deep learning-based framework, Embedded Feature Similarity Optimization with Specific Parameter Initialization (SOPI), for 2D/3D registration, one of the most challenging problems due to difficulties such as dimensional mismatch, heavy computational load, and the lack of a gold evaluation standard. The framework includes a parameter specification module that efficiently chooses the initial pose parameters and a fine-registration network that aligns the images. The proposed framework extracts multi-scale features using a novel composite connection encoder together with special training techniques. We compare the method with both learning-based and optimization-based methods to further evaluate its performance. Our experiments demonstrate that the method improves registration performance and thereby outperforms existing methods in terms of accuracy and running time. We also show the potential of the proposed method as an initial pose estimator.
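To make the two-stage design concrete, here is a minimal sketch of the coarse-then-fine loop the abstract describes; the callables `init_module`, `fine_net`, and `drr_renderer` are hypothetical placeholders, not the paper's actual components:

```python
# Hypothetical sketch of coarse pose initialization followed by iterative
# fine registration; all modules are assumed placeholders.

def register(fixed_2d, moving_3d, init_module, fine_net, drr_renderer, steps=10):
    """fixed_2d: target 2D image; moving_3d: 3D volume to align."""
    pose = init_module(fixed_2d, moving_3d)         # coarse 6-DoF pose estimate
    for _ in range(steps):
        projection = drr_renderer(moving_3d, pose)  # render 2D view at current pose
        delta = fine_net(fixed_2d, projection)      # predict a pose correction
        pose = pose + delta                         # refine the rigid parameters
    return pose
```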
Perception is a mentally demanding process: it provides the means by which one's concept of the environment is formed, and it helps one learn about and interact with that environment. Previous studies have concluded that auditory performance improves when combined with visual stimuli, and vice versa. Building on this, the present work uses the two sensory pathways (vision and hearing) to carry out a series of multisensory training exercises, presented in different settings, with the purpose of introducing sound as a signal-detection tool. A website was also developed to allow execution of the designed training; it remains under development due to difficulties that arose and that exceed the scope of this final work. The work described in this report gave rise to a future doctoral thesis, supported by a CONICET scholarship, which proposes the development of new training exercises and the continued development of the website that will run them.
We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo. MagicVideo can generate smooth video clips that are concordant with the given text descriptions. Thanks to a novel and efficient 3D U-Net design and to modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card, taking around 64x fewer computations than the Video Diffusion Models (VDM) in terms of FLOPs. Specifically, unlike existing works that directly train video models in RGB space, we use a pre-trained VAE to map video clips into a low-dimensional latent space and learn the distribution of videos' latent codes via a diffusion model. In addition, we introduce two new designs to adapt the U-Net denoiser trained on image tasks to video data: a frame-wise lightweight adaptor for image-to-video distribution adjustment and a directed temporal attention module to capture temporal dependencies across frames. Thus, we can exploit the informative weights of the convolution operators from a text-to-image model to accelerate video training. To mitigate pixel dithering in the generated videos, we also propose a novel VideoVAE auto-encoder for better RGB reconstruction. We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content. Refer to \url{//magicvideo.github.io/#} for more examples.
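As a hedged sketch of the frame-wise adaptor idea (the module below is our own interpretation, not MagicVideo's released code), a lightweight per-frame learnable scale and shift can adjust image-model latents toward the video distribution before the shared denoiser runs:

```python
# Sketch (our interpretation, not MagicVideo's code): per-frame scale/shift
# applied to VAE latents as a lightweight image-to-video adjustment.
import torch
import torch.nn as nn

class FrameAdaptor(nn.Module):
    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        # one learned scale/shift pair per frame position
        self.scale = nn.Parameter(torch.ones(num_frames, channels, 1, 1))
        self.shift = nn.Parameter(torch.zeros(num_frames, channels, 1, 1))

    def forward(self, z):                  # z: (batch, frames, C, H, W)
        return z * self.scale + self.shift

batch, frames, c, h, w = 2, 8, 4, 32, 32
z = torch.randn(batch, frames, c, h, w)    # latents from a pretrained VAE
adapted = FrameAdaptor(c, frames)(z)       # per-frame distribution adjustment
print(adapted.shape)                       # torch.Size([2, 8, 4, 32, 32])
```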
Federated recommendation systems employ federated learning techniques to safeguard user privacy by transmitting model parameters instead of raw user data between user devices and the central server. Nevertheless, current federated recommender systems face challenges such as heterogeneity and personalization, model performance degradation, and communication bottlenecks. Previous studies have attempted to address these issues, but none has solved them simultaneously. In this paper, we propose a novel framework, named PerFedRec++, to enhance personalized federated recommendation with self-supervised pre-training. Specifically, we utilize the privacy-preserving mechanism of federated recommender systems to generate two augmented graph views, which serve as contrastive tasks in self-supervised graph learning to pre-train the model. Pre-training enhances the performance of federated models by improving the uniformity of representation learning. Also, by providing a better initial state for federated training, pre-training makes the overall training converge faster, thus alleviating the heavy communication burden. We then construct a collaborative graph to learn the client representation through a federated graph neural network. Based on these learned representations, we cluster users into different user groups and learn personalized models for each cluster. Each user learns a personalized model by combining the global federated model, the cluster-level federated model, and its own fine-tuned local model. Experiments on three real-world datasets show that our proposed method achieves superior performance over existing methods.
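A minimal sketch of the final combination step, under our own naming (the paper's exact weighting scheme may differ): each client blends the global, cluster-level, and locally fine-tuned weights into one personalized model:

```python
# Illustrative sketch (not the paper's code): convex combination of three
# parameter sets into a personalized model for one client.

def personalize(global_w, cluster_w, local_w, a=0.4, b=0.3, c=0.3):
    """Each argument: dict mapping parameter name -> tensor/array.
    The mixing weights a, b, c are arbitrary and should sum to 1."""
    assert abs(a + b + c - 1.0) < 1e-6
    return {name: a * global_w[name] + b * cluster_w[name] + c * local_w[name]
            for name in global_w}
```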
With the advent of exascale capability, allowing supercomputers to perform at least $10^{18}$ IEEE 754 double-precision (64-bit) operations per second, many concerns have been raised regarding the energy consumption of high-performance computing code. Recently, Frontier, operated by Oak Ridge National Laboratory, became the first supercomputer to break the exascale barrier. In total, it contains 9,408 CPUs, 37,632 GPUs, and 8,730,112 cores. This world-leading supercomputer consumes about 21 megawatts, which is truly remarkable, as it was also ranked first on the Green500 list before being recently displaced. The previous top Green500 machine, MN-3 in Japan, provided 39.38 gigaflops per watt, while Frontier delivered 62.68 gigaflops per watt. All these infrastructure and hardware improvements are just the tip of the iceberg. Energy-aware code is now required to minimize the energy consumption of distributed and/or multi-threaded software. For example, the data movement bottleneck is responsible for $35$-$60\%$ of a system's energy consumption during intra-node communication. In an HPC environment, additional energy is consumed through inter-node communication. This position paper aims to introduce future research directions for entering the age of energy-aware software. The paper is organized as follows. First, we introduce related work on energy measurement and optimization. Then we propose to focus on two different levels of granularity in energy optimization.
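As a starting point for energy-aware measurement, here is a minimal sketch that reads the Linux RAPL powercap counter around a code region; it assumes an Intel CPU, read permission on the sysfs file, and ignores counter wraparound for short runs:

```python
# Minimal sketch: measure energy of a code region via the Linux RAPL
# powercap interface (package-0 counter, in microjoules).

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"  # assumes Intel + read access

def read_energy_uj() -> int:
    with open(RAPL) as f:
        return int(f.read())

def measure(fn, *args):
    """Return (result, joules consumed) for one call to fn."""
    before = read_energy_uj()
    result = fn(*args)
    after = read_energy_uj()
    # counter wraps around periodically; ignored here for short runs
    return result, (after - before) / 1e6

if __name__ == "__main__":
    _, joules = measure(sum, range(10_000_000))
    print(f"{joules:.3f} J")
```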
The field of deep learning has grown exponentially, and the footprint of ML models such as BERT, GPT-3, and ResNet keeps expanding. While they work very well, training and deploying these large (and still growing) models in production is expensive. You might want to deploy your face-filter model on smartphones so that your users can add a puppy filter to their selfies, but it may be too big or too slow. Or you may want to improve the quality of your cloud-based spam detection model without paying for a larger cloud VM to host a more accurate but bigger model. What if your model does not have enough labeled data, or you cannot tune it by hand? All of this is daunting!
What if you could make your models more efficient: use fewer resources (model size, latency, training time, data, human involvement) while delivering better quality (accuracy, precision, recall, etc.)? That sounds wonderful! But how?
This book walks through the algorithms and techniques used by researchers and engineers at Google Research, Facebook AI Research (FAIR), and other prominent AI labs to train and deploy their models on devices ranging from large server-side machines to tiny microcontrollers. We present a balance of fundamentals and practical knowledge to fully empower you to optimize your model training and deployment workflows, so that your models perform as well as or better than before with a fraction of the resources. We also dive deep into popular models, infrastructure, and hardware, along with challenging projects to test your skills.
Table of Contents:
Part I: Introduction to Efficient Deep Learning
  Introduction: Introduction to Deep Learning; Efficient Deep Learning; Mental Model of Efficient Deep Learning; Summary
Part II: Efficiency Techniques
  Introduction to Compression Techniques: An Overview of Compression; Quantization; Exercise: Compressing images from the Mars Rover; Project: Quantizing a Deep Learning Model; Summary
  Introduction to Learning Techniques: Project: Increasing the accuracy of a speech identification model with Distillation; Project: Increasing the accuracy of an image classification model with Data Augmentation; Project: Increasing the accuracy of a text classification model with Data Augmentation; Learning Techniques and Efficiency; Data Augmentation; Distillation; Summary
  Efficient Architectures: Project: Snapchat-Like Filters for Pets; Project: News Classification Using RNN and Attention Models; Project: Using pre-trained embeddings to improve accuracy of an NLP task; Embeddings for Smaller and Faster Models; Learn Long-Term Dependencies Using Attention; Efficient On-Device Convolutions; Summary
  Advanced Compression Techniques: Exercise: Using clustering to compress a 1-D tensor; Exercise: Mars Rover beckons again! Can we do better with clustering?; Exercise: Simulating clustering on a dummy dense fully-connected layer; Project: Using Clustering to compress a deep learning model; Exercise: Sparsity improves compression; Project: Lightweight model for pet filters application; Model Compression Using Sparsity; Weight Sharing using Clustering; Summary
  Advanced Learning Techniques: Contrastive Learning; Unsupervised Pre-Training; Project: Learning to classify with 10% labels; Curriculum Learning
  Automation: Project: Layer-wise Sparsity to achieve a Pareto optimal model; Project: Searching over model architectures for boosting model accuracy; Project: Multi-objective tuning to get a smaller and more accurate model; Hyper-Parameter Tuning; AutoML; Compression Search
Part III: Infrastructure
  Software Infrastructure: PyTorch Ecosystem; iOS Ecosystem; Cloud Ecosystems
  Hardware Infrastructure: GPUs; Jetson; TPU; M1 / A4/5?; Microcontrollers
Part IV: Applied Deep Dives
  Deep-Dives: Tensorflow Platforms: Project: Training BERT efficiently with TPUs; Project: Face recognition on the web with TensorFlow.JS; Project: Speech detection on a microcontroller with TFMicro; Project: Benchmarking a tiny on-device model with TFLite; Mobile; Microcontrollers; Web; Google Tensor Processing Unit (TPU); Summary
  Deep-Dives: Efficient Models: Project: Efficient speech detection models; Project: Comparing efficient mobile models on Mobile; Project: Training efficient BERT models; BERT; MobileNet; EfficientNet architectures; Speech Detection
To deploy a deep learning model in production, it needs to be both accurate and compact to meet latency and memory constraints. This usually results in a network that is deep (to ensure performance) and yet thin (to improve computational efficiency). In this paper, we propose an efficient method to train a deep thin network with a theoretical guarantee. Our method is motivated by model compression and consists of three stages. In the first stage, we sufficiently widen the deep thin network and train it until convergence. In the second stage, we use this well-trained deep wide network to warm up (or initialize) the original deep thin network, by letting the thin network imitate the immediate outputs of the wide network from layer to layer. In the last stage, we further fine-tune this well-initialized deep thin network. The theoretical guarantee is established using mean-field analysis, which shows the advantage of layerwise imitation over traditionally training deep thin networks from scratch by backpropagation. We also conduct large-scale empirical experiments to validate our approach. Trained with our method, ResNet50 can outperform ResNet101, and BERT_BASE can be comparable with BERT_LARGE, where both of the latter models are trained via standard training procedures as in the literature.
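A minimal sketch of the stage-two imitation objective, under our own naming (the paper's exact loss may differ): the thin network is trained so that each layer's output, after a learned projection to the wide network's width, matches the wide teacher's corresponding output:

```python
# Sketch of layerwise imitation (our naming, not the paper's code): match
# the thin network's per-layer activations to the wide teacher's.
import torch
import torch.nn.functional as F

def imitation_loss(thin_feats, wide_feats, projections):
    """thin_feats/wide_feats: lists of per-layer activations (same depth);
    projections: per-layer linear maps from thin width to wide width."""
    loss = 0.0
    for t, w, proj in zip(thin_feats, wide_feats, projections):
        loss = loss + F.mse_loss(proj(t), w)   # match immediate outputs
    return loss
```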
Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy subject to the time and memory constraints of training and inference. We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute: self-supervised pretraining and high-resource machine translation. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. Moreover, this acceleration in convergence typically outpaces the additional computational overhead of using larger models. Therefore, the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models. However, we show that large models are more robust to compression techniques such as quantization and pruning than small models. Consequently, one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.
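As a hedged sketch of the compression side of this recipe, standard PyTorch utilities can apply magnitude pruning followed by dynamic int8 quantization to a trained model; the layer sizes and 60% sparsity level below are arbitrary, and the paper's exact compression settings may differ:

```python
# Sketch: compress a trained model with magnitude pruning + dynamic
# quantization using standard PyTorch utilities.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# 1) prune 60% of weights by magnitude in every Linear layer
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)
        prune.remove(module, "weight")   # make the sparsity permanent

# 2) quantize remaining weights to int8 for inference
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```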
In recent years, Graph Neural Networks (GNNs), which can naturally integrate node information and topological structure, have been demonstrated to be powerful in learning on graph data. These advantages of GNNs provide great potential for advancing social recommendation, since data in social recommender systems can be represented as a user-user social graph and a user-item graph, and learning latent factors of users and items is the key. However, building social recommender systems based on GNNs faces challenges. For example, the user-item graph encodes both interactions and their associated opinions; social relations have heterogeneous strengths; and users are involved in two graphs (i.e., the user-user social graph and the user-item graph). To address these three challenges simultaneously, in this paper we present a novel graph neural network framework (GraphRec) for social recommendation. In particular, we provide a principled approach to jointly capture interactions and opinions in the user-item graph, and we propose the framework GraphRec, which coherently models the two graphs and the heterogeneous strengths. Extensive experiments on two real-world datasets demonstrate the effectiveness of the proposed framework GraphRec. Our code is available at \url{//github.com/wenqifan03/GraphRec-WWW19}
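To illustrate one of the ideas the abstract names, capturing interactions jointly with opinions, here is a small sketch in our own notation (not GraphRec's released code) that fuses item embeddings with rating embeddings and attends over a user's neighbors:

```python
# Sketch (our notation, not GraphRec's code): opinion-aware aggregation of
# a user's item interactions into a single user representation.
import torch
import torch.nn as nn

class OpinionAwareAggregator(nn.Module):
    def __init__(self, dim: int, num_ratings: int = 5):
        super().__init__()
        self.opinion_emb = nn.Embedding(num_ratings, dim)
        self.fuse = nn.Linear(2 * dim, dim)     # fuse item + opinion
        self.attn = nn.Linear(dim, 1)           # attention over neighbors

    def forward(self, item_embs, ratings):      # (n, dim), (n,)
        x = torch.cat([item_embs, self.opinion_emb(ratings)], dim=-1)
        h = torch.relu(self.fuse(x))            # opinion-aware interaction
        w = torch.softmax(self.attn(h), dim=0)  # neighbor importance
        return (w * h).sum(dim=0)               # user representation

agg = OpinionAwareAggregator(dim=16)
user_vec = agg(torch.randn(7, 16), torch.randint(0, 5, (7,)))
print(user_vec.shape)                           # torch.Size([16])
```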