F-CNN: An FPGA-based Framework for Training Convolutional Neural Networks
Abstract—This paper presents a novel reconfigurable framework for training Convolutional Neural Networks (CNNs). The proposed framework is based on reconfiguring a streaming datapath at runtime to cover the training cycle for the various layers in a CNN. The streaming datapath can support various parameterized modules, which can be customized to produce implementations with different trade-offs in performance and resource usage. The modules follow the same input and output data layout, simplifying configuration scheduling. For different layers, instances of the modules contain different computation kernels in parallel, which can be customized with different layer configurations and data precision. The associated performance, resource and bandwidth models can be used to derive parameters for the datapath, guiding the analysis of design trade-offs to meet application requirements or platform constraints. They enable estimation of the implementation specifications given different layer configurations, to maximize performance under the constraints on bandwidth and hardware resources. Experimental results indicate that the proposed module design targeting Maxeler technology can achieve a performance of 62.06 GFLOPS for 32-bit floating-point arithmetic, outperforming existing accelerators. Further evaluation based on training LeNet-5 shows that the proposed framework is about 4 times faster than the CPU implementation of Caffe and about 7.5 times more energy efficient than the GPU implementation of Caffe.
I. INTRODUCTION
The Convolutional Neural Network (CNN [1]) is one of the most successful deep learning models. FPGA-based designs [2]–[4] have been proposed to accelerate the CNN classification process (forward computation), achieving considerable speedup and higher energy efficiency than CPUs and GPUs [5].
The training process of CNNs shares a similar computation pattern with classification, which shows the potential for accelerating training on FPGA-based platforms. However, training involves much more computation and a more complicated workflow, so designing and accelerating the whole training process on FPGAs remains challenging. Specifically, the computation resources on modern FPGAs are too limited to implement the whole training process at once, which calls for careful task partitioning and scheduling. Moreover, classification typically targets a specific pre-trained CNN, while the training process must be flexible enough to support different network configurations, which is challenging for FPGA-based hardware designs.
In this paper, we address the above problems and present an FPGA-based CNN training framework. In summary, the main contributions of this paper are as follows:
• A novel reconfigurable design for CNN training called F-CNN. It involves reconfiguring a streaming datapath at runtime to cover the training tasks for the various layers in a CNN. The streaming datapath contains various parameterized modules, which can be customized to produce implementations with different configurations.
• Analytical models for the performance, bandwidth and resource usage of the modules. Given certain network configurations, these models can be used to estimate the implementation specifications and maximize performance under the constraints on bandwidth and hardware resources (see the illustrative sketch after this list).
• Evaluation of the proposed F-CNN prototype targeting a Maxeler FPGA platform with Altera Stratix V FPGAs. The convolutional modules for AlexNet achieve 62.06 GFLOPS for 32-bit floating point, outperforming most existing accelerators. The overall evaluation based on training LeNet-5 shows about 4 times speedup over the CPU-based implementation of Caffe and about 7.5 times higher energy efficiency than the GPU-based implementation of Caffe.
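To make the second contribution more concrete, the sketch below shows how a roofline-style analytical model could estimate the sustainable throughput of a convolutional module from its layer configuration, the available DSP budget and the platform bandwidth. The function names, parameters and formulas are illustrative assumptions for this sketch, not the exact models derived in the paper.

```python
# Illustrative sketch of an analytical performance model for a convolutional
# module. The formulas and parameter names are assumptions for illustration;
# the paper's actual models may differ.

def conv_workload_flops(out_h, out_w, out_ch, in_ch, k):
    """Multiply-accumulate workload of one convolutional layer (forward pass),
    counted as 2 FLOPs per multiply-add."""
    return 2 * out_h * out_w * out_ch * in_ch * k * k

def estimate_gflops(num_kernels, clock_mhz, dsp_per_kernel, dsp_budget,
                    bytes_per_cycle, peak_bandwidth_gbs):
    """Estimate sustained throughput of a module with `num_kernels` parallel
    computation kernels, limited by DSP resources and memory bandwidth."""
    # Resource constraint: cannot instantiate more kernels than the DSP budget allows.
    usable_kernels = min(num_kernels, dsp_budget // dsp_per_kernel)

    # Compute-bound throughput: each kernel performs one multiply-add (2 FLOPs) per cycle.
    compute_gflops = usable_kernels * 2 * clock_mhz / 1e3

    # Bandwidth-bound throughput: the streaming datapath consumes `bytes_per_cycle`
    # per kernel; scale the compute rate down if the memory link cannot keep up.
    required_gbs = usable_kernels * bytes_per_cycle * clock_mhz / 1e3
    if required_gbs > peak_bandwidth_gbs:
        compute_gflops *= peak_bandwidth_gbs / required_gbs
    return compute_gflops

if __name__ == "__main__":
    # Hypothetical numbers for a single convolutional module.
    print("workload (GFLOP):", conv_workload_flops(55, 55, 96, 3, 11) / 1e9)
    print("estimated throughput (GFLOPS):",
          estimate_gflops(num_kernels=128, clock_mhz=150, dsp_per_kernel=2,
                          dsp_budget=512, bytes_per_cycle=4,
                          peak_bandwidth_gbs=38.0))
```

An estimate of this kind lets a designer pick the largest number of parallel kernels that fits the resource budget without exceeding the available memory bandwidth, which is the trade-off the paper's models are intended to guide.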
The organization of the paper is as follows. Section II reviews the CNN training process and some existing FPGA- based accelerators. Section III introduces the F-CNN design. Section IV presents the design of computation modules and the analytical models. Section V includes the experimental results. Section VI covers conclusions.
II. BACKGROUND AND MOTIVATION
A. CNN Model
Figure 1 shows a typical LeNet-5 CNN [6] for handwritten digit recognition, which is simpler than many modern CNNs but sufficiently representative to introduce the basic CNN model. It has 2 convolutional layers, 2 max-pooling layers and 2 fully connected multilayer perceptron (MLP [7]) layers.
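For reference, the layer sequence of this LeNet-5 variant can be written down as a simple per-layer configuration, which is also the granularity at which the framework's parameterized modules are specialized. The field names and the feature-map and kernel sizes below follow the common LeNet-5 description and are assumptions for illustration, not values taken from this paper.

```python
# Illustrative layer configuration for the LeNet-5 variant described above
# (2 convolutional, 2 max-pooling, 2 fully connected layers). The exact
# channel counts and kernel sizes are assumptions based on the usual
# LeNet-5 description.

LENET5_LAYERS = [
    {"type": "conv",    "in_channels": 1,  "out_channels": 6,  "kernel": 5},
    {"type": "maxpool", "window": 2, "stride": 2},
    {"type": "conv",    "in_channels": 6,  "out_channels": 16, "kernel": 5},
    {"type": "maxpool", "window": 2, "stride": 2},
    {"type": "fc",      "in_features": 16 * 5 * 5, "out_features": 120},
    {"type": "fc",      "in_features": 120, "out_features": 10},
]

def describe(layers):
    """Print a one-line summary per layer, in streaming (execution) order."""
    for i, layer in enumerate(layers):
        params = ", ".join(f"{k}={v}" for k, v in layer.items() if k != "type")
        print(f"layer {i}: {layer['type']:7s} {params}")

if __name__ == "__main__":
    describe(LENET5_LAYERS)
```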