Talk Topic: Neural Architecture Search and Beyond
Talk Overview:
Neural Architecture Search (NAS) is a technique for automating the design of artificial neural networks. Because NAS can produce network architectures that match or even surpass hand-designed ones, it has become a research hotspot in the deep learning community over the past two years. Barret Zoph, a scientist from Google, gave a talk titled "Neural Architecture Search and Beyond" at ICCV 2019, presenting Google's latest research progress on NAS.
Speaker Bio:
Barret Zoph is currently a Senior Research Scientist on the Google Brain team. Previously, he worked at the Information Sciences Institute with Prof. Kevin Knight and Prof. Daniel Marcu on topics related to neural machine translation.
Title: A Survey of the Recent Architectures of Deep Convolutional Neural Networks
Abstract:
Deep Convolutional Neural Networks (CNNs) are a special type of neural network that have performed outstandingly in numerous competitions in fields such as computer vision and image processing. Interesting application areas of CNNs include image classification and segmentation, object detection, video processing, natural language processing, and speech recognition. The powerful learning ability of deep CNNs is largely due to their use of multiple feature extraction stages, which can automatically learn representations from data. The availability of large amounts of data and improvements in hardware technology have accelerated research on CNNs, and very interesting deep CNN architectures have recently been reported. Indeed, several interesting ideas have been explored to advance CNNs, such as the use of different activation and loss functions, parameter optimization, regularization, and architectural innovation. However, the major improvements in the representational capacity of deep CNNs have been achieved through architectural innovation. In particular, the ideas of exploiting spatial and channel information, architectural depth and width, and multi-path information processing have received wide attention. Likewise, the idea of using a block of layers as a structural unit is becoming increasingly popular. This survey therefore focuses on the intrinsic taxonomy of recently reported deep CNN architectures and classifies the recent innovations in CNN architectures into seven different categories, based respectively on spatial exploitation, depth, multi-path, width, feature-map exploitation, channel boosting, and attention. It also provides an elementary understanding of CNN components and sheds light on the current challenges and applications of CNNs.
Current state-of-the-art convolutional architectures for object detection are manually designed. Here we aim to learn a better architecture of feature pyramid network for object detection. We adopt Neural Architecture Search and discover a new feature pyramid architecture in a novel scalable search space covering all cross-scale connections. The discovered architecture, named NAS-FPN, consists of a combination of top-down and bottom-up connections to fuse features across scales. NAS-FPN, combined with various backbone models in the RetinaNet framework, achieves better accuracy and latency tradeoff compared to state-of-the-art object detection models. NAS-FPN improves mobile detection accuracy by 2 AP compared to state-of-the-art SSDLite with MobileNetV2 model in [32] and achieves 48.3 AP which surpasses Mask R-CNN [10] detection accuracy with less computation time.
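The core building block of such a feature pyramid is a merge that fuses two feature maps of different scales. A minimal sketch of a cross-scale "sum" merge follows; the nearest-neighbor resampling, tensor layout, and sizes are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def resample(feat, target_hw):
    """Nearest-neighbor resampling of an (H, W, C) feature map to a
    target spatial size; an illustrative stand-in for the up/down-sampling
    used when fusing pyramid levels of different resolutions."""
    h, w, c = feat.shape
    th, tw = target_hw
    rows = np.arange(th) * h // th
    cols = np.arange(tw) * w // tw
    return feat[rows][:, cols]

def sum_merge(feat_a, feat_b):
    """Cross-scale merge: bring feat_b to feat_a's resolution, then add.
    Chains of such merges form the top-down and bottom-up connections
    that fuse features across pyramid levels."""
    return feat_a + resample(feat_b, feat_a.shape[:2])

# Fuse a coarse 4x4 level into a fine 8x8 level:
p3 = np.ones((8, 8, 16))   # fine level
p4 = np.ones((4, 4, 16))   # coarse level
fused = sum_merge(p3, p4)  # shape (8, 8, 16)
```

A search over which levels to merge, and in which order, is what yields the discovered NAS-FPN topology.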
Deep Learning has enabled remarkable progress over the last years on a variety of tasks, such as image recognition, speech recognition, and machine translation. One crucial aspect of this progress is the design of novel neural architectures. Currently employed architectures have mostly been developed manually by human experts, which is a time-consuming and error-prone process. Because of this, there is growing interest in automated neural architecture search methods. We provide an overview of existing work in this field of research and categorize the methods according to three dimensions: search space, search strategy, and performance estimation strategy.
Designing convolutional neural networks (CNN) models for mobile devices is challenging because mobile models need to be small and fast, yet still accurate. Although significant effort has been dedicated to design and improve mobile models on all three dimensions, it is challenging to manually balance these trade-offs when there are so many architectural possibilities to consider. In this paper, we propose an automated neural architecture search approach for designing resource-constrained mobile CNN models. We propose to explicitly incorporate latency information into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency. Unlike in previous work, where mobile latency is considered via another, often inaccurate proxy (e.g., FLOPS), in our experiments, we directly measure real-world inference latency by executing the model on a particular platform, e.g., Pixel phones. To further strike the right balance between flexibility and search space size, we propose a novel factorized hierarchical search space that permits layer diversity throughout the network. Experimental results show that our approach consistently outperforms state-of-the-art mobile CNN models across multiple vision tasks. On the ImageNet classification task, our model achieves 74.0% top-1 accuracy with 76ms latency on a Pixel phone, which is 1.5x faster than MobileNetV2 (Sandler et al. 2018) and 2.4x faster than NASNet (Zoph et al. 2018) with the same top-1 accuracy. On the COCO object detection task, our model family achieves both higher mAP quality and lower latency than MobileNets.
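The weighted-product objective that folds measured latency into the search reward can be sketched as follows; the target latency and exponent values here are illustrative defaults, not the exact settings of every experiment.

```python
def search_reward(accuracy: float, latency_ms: float,
                  target_ms: float = 75.0, w: float = -0.07) -> float:
    """Latency-aware objective: reward = ACC * (LAT / T)^w.

    With a negative exponent w, models slower than the target latency T
    are penalized and faster models are mildly rewarded, so the search
    optimizes the accuracy/latency trade-off rather than accuracy alone.
    target_ms and w are illustrative values for this sketch.
    """
    return accuracy * (latency_ms / target_ms) ** w

# A model exactly at the latency target keeps its raw accuracy as reward;
# slower models score lower, faster models slightly higher.
r_on_target = search_reward(0.74, 75.0)
r_slow = search_reward(0.74, 150.0)
r_fast = search_reward(0.74, 40.0)
```

Because `latency_ms` is a real measurement on the target device (e.g., a Pixel phone) rather than a proxy like FLOPS, the ranking induced by this reward reflects actual deployment cost.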
This paper addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Unlike conventional approaches of applying evolution or reinforcement learning over a discrete and non-differentiable search space, our method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent. Extensive experiments on CIFAR-10, ImageNet, Penn Treebank and WikiText-2 show that our algorithm excels in discovering high-performance convolutional architectures for image classification and recurrent architectures for language modeling, while being orders of magnitude faster than state-of-the-art non-differentiable techniques.
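The continuous relaxation at the heart of this approach replaces the discrete choice of one operation per edge with a softmax-weighted sum of all candidates, making the architecture parameters differentiable. A minimal sketch, with toy stand-in operations instead of real convolutions and pooling:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_op(x, alphas, ops):
    """Continuous relaxation of a discrete operation choice.

    Instead of selecting a single operation for an edge, compute a
    softmax-weighted sum of all candidate operations. The architecture
    parameters `alphas` then receive gradients and can be optimized by
    gradient descent alongside the network weights.
    """
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))

# Toy candidates standing in for conv / pool / identity / zero ops:
ops = [lambda x: x, lambda x: 2 * x, lambda x: np.zeros_like(x)]
alphas = np.zeros(3)          # uniform weights: each op contributes 1/3
x = np.ones(4)
y = mixed_op(x, alphas, ops)  # (1/3)*x + (1/3)*2x + (1/3)*0 = x
```

After search, the relaxation is discretized by keeping, on each edge, the operation with the largest learned weight.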