In recent years, convolutional neural networks (CNNs) have gained great popularity not only in computer vision applications such as image classification[1][2], object detection[3][4] and video analysis[5][6], but also in a wide range of other fields such as natural language processing[7], speech recognition[8] and text classification[9]. For instance, AlexNet took first place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 with a top-5 accuracy of 84.7%[10], a significant advance in image classification accuracy over traditional methods. Many subsequent works on CNN applications focus on improving the AlexNet model, refining the network structure and tuning the hyperparameters to meet specific application requirements[11][12].
A trained CNN model employed in real applications is computationally expensive, requiring over a billion operations per input image. General-purpose processors are inefficient for CNN implementation and can hardly meet the performance requirements, so customized accelerators based on FPGAs, GPUs and ASICs have attracted increasing attention because of their good performance, high energy efficiency and reconfigurability. In particular, FPGA technology has progressed remarkably, as shown by the Xilinx 7 series FPGAs and the forthcoming 3D-stacked FPGAs of the Altera Stratix 10 series. Some previous works considered only small CNN models for simple tasks[13][14][15][16]. Others implemented only convolutional layers, without examining fully-connected layers in depth[16][17]. In the OpenCL-based FPGA accelerator proposed in [18], 3-D convolutions were mapped to matrix multiplication operations, which brought portability but also overheads from padding and rearranging data. A folding technique was adopted in an accelerator for VGG16, which uses uniform 3×3 convolutional windows[19], so that hardware resources are shared efficiently among the convolutional layers. However, this approach may run into problems when implementing CNNs with non-uniform convolutional windows.
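The trade-off of mapping convolution to matrix multiplication can be illustrated with a small sketch. It uses a generic im2col-style unrolling on a single 2-D feature map, which is an assumption made for illustration and not necessarily the exact mapping used in [18]. The names `im2col` and `conv_as_matmul` are hypothetical, and the convolution follows the usual CNN convention of not flipping the kernel.

```python
def im2col(image, k, stride=1):
    """Unroll every k x k window of a 2-D image into one row of a matrix."""
    height, width = len(image), len(image[0])
    rows = []
    for top in range(0, height - k + 1, stride):
        for left in range(0, width - k + 1, stride):
            window = [image[top + dy][left + dx]
                      for dy in range(k) for dx in range(k)]
            rows.append(window)
    return rows


def conv_as_matmul(image, kernel, stride=1):
    """Convolve by multiplying the unrolled windows with the flattened kernel."""
    k = len(kernel)
    flat_kernel = [w for kernel_row in kernel for w in kernel_row]
    unrolled_windows = im2col(image, k, stride)
    # Each output pixel is one row of the unrolled matrix times the kernel vector.
    return [sum(x * w for x, w in zip(window, flat_kernel))
            for window in unrolled_windows]


# Illustrative data only (not taken from the paper).
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, 1]]

unrolled = im2col(image, k=2)
outputs = conv_as_matmul(image, kernel)

# Ratio of matrix entries to original pixels: the cost of rearranging data.
duplication = (len(unrolled) * len(unrolled[0])) / (len(image) * len(image[0]))
print(outputs)      # [7, 9, 11, 15, 17, 19, 23, 25, 27]
print(duplication)  # 2.25
```

Even in this tiny case the unrolled matrix holds 36 entries for 16 input pixels, which gives a sense of the data-rearrangement overhead the text mentions.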
The high capacity and improved memory speed of modern FPGA chips provide a larger design space than in the past. This paper focuses on efficiently mapping the feed-forward classification phase of CNNs onto reconfigurable FPGA platforms with large capacity and abundant resources. Our main contributions are the following:
We propose parallel structures for CNN layers to exploit their inherent parallelism. In addition, we implement efficient fixed-point computation engines that carry out the operations in convolutional and fully-connected layers.
We propose an automatic generator that produces Verilog HDL source code from high-level descriptions, each of which specifies the critical variables of one CNN layer. Execution time, DSP consumption and performance are analytically modeled in terms of these variables.
We demonstrate the automatic methodology by implementing two representative CNNs on a Xilinx VC709 board. The AlexNet implementation reaches a maximum performance of 222.1 GOP/s at 100 MHz.
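To put the reported throughput in perspective, the short sketch below converts 222.1 GOP/s at 100 MHz into operations per clock cycle. It also estimates how long one billion operations would take if that peak rate were sustained. The helper names are hypothetical, the one-billion figure is only the lower bound quoted in the introduction, and counting one multiply-accumulate as two operations is an assumption rather than something stated in the text.

```python
def ops_per_cycle(gops, clock_mhz):
    """Average operations completed per clock cycle for a given throughput."""
    ops_per_second = gops * 1e9
    cycles_per_second = clock_mhz * 1e6
    return ops_per_second / cycles_per_second


def execution_time_ms(total_ops, gops):
    """Time in milliseconds to run total_ops at a sustained rate of gops GOP/s."""
    return total_ops / (gops * 1e9) * 1e3


# Figures from the text: 222.1 GOP/s at 100 MHz.
alexnet_ops_per_cycle = ops_per_cycle(222.1, 100)

# Assumption: one multiply-accumulate (MAC) counts as two operations.
alexnet_macs_per_cycle = alexnet_ops_per_cycle / 2

# Illustrative only: one billion operations at the peak rate.
time_for_billion_ops = execution_time_ms(1e9, 222.1)

print(round(alexnet_ops_per_cycle))    # 2221
print(round(time_for_billion_ops, 2))  # 4.5
```

Roughly 2221 operations per cycle implies on the order of a thousand multiply-accumulate units working in parallel, which is why DSP consumption appears alongside execution time in the analytical model.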
The rest of the paper is organized as follows. Section 2 briefly introduces the basic operations of CNNs. Section 3 presents the parallel structures and computation engines for convolutional and fully-connected layers. Section 4 introduces the automatic generator and describes the critical design variables. Section 5 analytically models execution time, DSP consumption and performance. Section 6 presents the experimental results and analysis. Finally, Section 7 concludes the paper.
A typical CNN consists of multiple convolutional layers and fully-connected layers, interspersed with normalization, pooling and non-linear activation functions. Fig. 1 shows the architecture of AlexNet, a well-known model for large-scale image classification. In this section, we briefly review the operations in each type of CNN layer.
A convolutional layer receives n feature maps as input and generates m output feature maps. Each input feature map is convolved by a shifting window with a stride of s and a window size of k × k. The individual convolution results are then summed. A bias value is added to each pixel of the aggregated output, and a suitable nonlinear activation function f is applied to limit the pixel value to a reasonable range. Commonly used activation functions include tanh, sigmoid and ReLU. Mathematically, output feature maps in a convolutional layer is gi
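The layer described above can be sketched in plain Python. The sketch assumes the standard formulation: each output pixel is the bias plus the sum, over all n input maps, of a k × k window multiplied element-wise by the corresponding weights, and the result is passed through f. The function names, the nested-list data layout, the absence of padding and the choice of ReLU as the default f are assumptions made for illustration, not details taken from the paper.

```python
def relu(value):
    """ReLU activation: negative values are clipped to zero."""
    return max(0.0, value)


def conv_layer(inputs, weights, biases, stride=1, activation=relu):
    """Forward pass of one convolutional layer without padding.

    inputs  : n feature maps, indexed as inputs[i][row][col]
    weights : m x n kernels of size k x k, indexed as weights[j][i][u][v]
    biases  : one bias value per output feature map
    """
    n, m = len(inputs), len(weights)
    k = len(weights[0][0])
    in_h, in_w = len(inputs[0]), len(inputs[0][0])
    out_h = (in_h - k) // stride + 1
    out_w = (in_w - k) // stride + 1

    outputs = []
    for j in range(m):                       # each output feature map
        feature_map = []
        for y in range(out_h):
            row = []
            for x in range(out_w):
                acc = biases[j]
                for i in range(n):           # sum over all input feature maps
                    for u in range(k):
                        for v in range(k):
                            pixel = inputs[i][y * stride + u][x * stride + v]
                            acc += weights[j][i][u][v] * pixel
                row.append(activation(acc))
            feature_map.append(row)
        outputs.append(feature_map)
    return outputs


# Illustrative data only: n = 2 input maps of 3 x 3, m = 1 output map, k = 2, s = 1.
inputs = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
          [[1, 0, 1], [0, 1, 0], [1, 0, 1]]]
weights = [[[[1, 0], [0, 1]],
            [[1, 1], [1, 1]]]]
biases = [-9.0]

result = conv_layer(inputs, weights, biases)
print(result)  # [[[0.0, 1.0], [5.0, 7.0]]]
```

The six nested loops are independent across output maps and output pixels, which is the kind of parallelism a hardware implementation can exploit.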