Author:

Bao, Z.-S. | Guo, J.-N. | Zhang, W.-B. | Dang, H.-B.

Indexed by:

EI Scopus

Abstract:

Convolutional neural networks (CNNs) have achieved remarkable results in object detection, semantic segmentation, and image classification in recent years. Meeting high-precision requirements calls for deep CNN models. Because such models have many parameters and intensive computational demands, deploying low-latency CNN applications on resource-limited edge devices is a great challenge. Although GPUs can be used for theoretical verification of accelerated CNN computation, their customization cost and power consumption rule them out for practical low-power systems. In contrast, FPGAs combine low power consumption with high-performance computing capability and reconfigurability, which makes them suitable for customized CNN computing. The acceleration problem can therefore be addressed with customized computing that exploits FPGA reconfigurability: a composable accelerator can handle various CNN application scenarios, and its structure can be adjusted to suit each application to ensure power efficiency. The bottleneck of existing FPGA-based CNN accelerators lies in their poor adaptation to CNN algorithms, which leads to large computing gaps, wasted latency, and low utilization of computing resources. In this paper, we reorganize the dataflow structure to suit CNN parallel operation. Within the limited FPGA resources, units for matrix multiplication, convolution, pooling, and other operations are customized from the bottom up, and the Ultra accelerator (UltraAcc) is proposed. An evaluation model is designed for hyperparameter tuning; it evaluates storage resources, computing resources, and latency from the bottom-level unit to the computing-layer unit and then to the whole computing chain. Together with the precision results of CNN training, the whole application system is balanced and optimized across both software and hardware. UltraAcc achieves an average throughput of 126.72 GOPS on the Ultra96v2, 5.47 times higher than the first-place method in IEEE/ACM DAC-SDC'19 on the same platform. UltraAcc was entered in DAC-SDC'20, where it won first prize with an IoU of 0.65, a speed of 212.73 FPS, and an energy consumption of 1.64 kJ. © 2023 Science Press. All rights reserved.
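
The abstract does not spell out the evaluation model, so the following is only a rough sketch of how the latency and utilization of a single convolution unit might be estimated before tuning hyperparameters. Everything in it is an assumption made for illustration and is not taken from the paper: the `ConvLayer` fields, the processing-element (PE) array sized `pe_in` by `pe_out`, the one-MAC-per-PE-per-cycle cost model, and the clock frequency.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class ConvLayer:
    """Shape of one convolution layer (hypothetical fields, for illustration)."""
    in_ch: int    # input channels
    out_ch: int   # output channels
    out_h: int    # output feature-map height
    out_w: int    # output feature-map width
    kernel: int   # square kernel size

    def macs(self) -> int:
        # One multiply-accumulate per kernel tap, per input channel,
        # per output channel, per output pixel.
        return self.out_h * self.out_w * self.out_ch * self.in_ch * self.kernel ** 2


def estimate_cycles(layer: ConvLayer, pe_in: int, pe_out: int) -> int:
    """Cycles on an assumed PE array that consumes pe_in input channels and
    produces pe_out output channels per cycle. Channel counts that do not
    divide evenly are rounded up, which models idle PEs in the last tile."""
    tiles = ceil(layer.in_ch / pe_in) * ceil(layer.out_ch / pe_out)
    return tiles * layer.out_h * layer.out_w * layer.kernel ** 2


def utilization(layer: ConvLayer, pe_in: int, pe_out: int) -> float:
    """Fraction of PE slots doing useful work under the model above."""
    slots = estimate_cycles(layer, pe_in, pe_out) * pe_in * pe_out
    return layer.macs() / slots


def throughput_gops(layer: ConvLayer, pe_in: int, pe_out: int, clock_mhz: float) -> float:
    """Estimated throughput in GOPS, counting one MAC as two operations."""
    seconds = estimate_cycles(layer, pe_in, pe_out) / (clock_mhz * 1e6)
    return 2 * layer.macs() / seconds / 1e9


# A first layer with 3 input channels leaves one of 4 input lanes idle.
first_layer = ConvLayer(in_ch=3, out_ch=16, out_h=8, out_w=8, kernel=3)
cycles = estimate_cycles(first_layer, pe_in=4, pe_out=8)
util = utilization(first_layer, pe_in=4, pe_out=8)
gops = throughput_gops(first_layer, pe_in=4, pe_out=8, clock_mhz=100.0)
```

In this toy model a mismatch between the layer's channel counts and the array dimensions shows up directly as lost utilization, which is the kind of poor algorithm-to-hardware adaptation the abstract identifies as the bottleneck. A real evaluation model would also need to account for storage and memory bandwidth, which this sketch leaves out.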

Keyword:

CNN accelerator; hardware-software co-design; FPGA; dataflow

Author Community:

  • [ 1 ] [Bao Z.-S.] Faculty of Information Technology, Beijing University of Technology, Beijing, 100024, China
  • [ 2 ] [Guo J.-N.] Faculty of Information Technology, Beijing University of Technology, Beijing, 100024, China
  • [ 3 ] [Zhang W.-B.] Faculty of Information Technology, Beijing University of Technology, Beijing, 100024, China
  • [ 4 ] [Dang H.-B.] Faculty of Information Technology, Beijing University of Technology, Beijing, 100024, China

Source:

Chinese Journal of Computers

ISSN: 0254-4164

Year: 2023

Issue: 6

Volume: 46

Page: 1139-1155

Cited Count:

WoS CC Cited Count: 0
