Abstract:
Convolutional neural networks (CNNs) have achieved remarkable results in object detection, semantic segmentation, and image classification in recent years. Meeting high-precision requirements calls for CNN models with many layers. Because CNNs have a large number of parameters and intensive computational demands, deploying CNN applications with low-latency requirements on resource-limited edge devices is a great challenge. Although GPUs can be used for theoretical verification of accelerated CNN computation, their customization cost and power consumption prevent their use in practical low-power systems. In contrast, FPGAs combine low power consumption with high-performance computing capability and reconfigurability, which makes them suitable for customized CNN computing. The acceleration problem can therefore be addressed with customized computing that exploits FPGA reconfigurability: a composable accelerator can handle various CNN application scenarios, and its structure can be adjusted to suit the application to ensure power efficiency. The bottleneck of existing FPGA-based CNN accelerators lies in their poor adaptation to CNN algorithms, which leads to large computing gaps, wasted latency, and low utilization of computing resources. In this paper, we reorganize the dataflow structure to adapt to CNN parallel operation. Within the limited FPGA resources, the matrix multiplication, convolution, pooling, and other units are customized from the bottom up, and the Ultra accelerator (UltraAcc) is proposed. An evaluation model is designed for hyperparameter tuning: storage resources, computing resources, and latency are evaluated from the bottom-level units to the computing-layer units and then to the whole computing chain. 
Together with the precision results of CNN training, the whole application system is balanced and optimized across both software and hardware. UltraAcc achieves an average throughput of 126.72 GOPS on the Ultra96v2, 5.47 times higher than the first-place method in IEEE/ACM DAC-SDC'19 on the same platform. UltraAcc was entered in DAC-SDC'20, where we won first prize with an IoU accuracy of 0.65, a speed of 212.73 FPS, and an energy consumption of 1.64 kJ. © 2023 Science Press. All rights reserved.
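The layer-by-layer evaluation described in the abstract can be illustrated with a minimal sketch. It is not the paper's evaluation model: the class `ConvLayer`, the functions `conv_cycles`, `utilization`, and `throughput_gops`, and the parameters `pe_in`, `pe_out`, and `clock_hz` are all hypothetical, and the sketch assumes an idealized array that processes `pe_in` x `pe_out` multiply-accumulates per cycle with no memory stalls.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class ConvLayer:
    """Shape of one convolution layer (hypothetical helper, not from the paper)."""
    in_ch: int
    out_ch: int
    out_h: int
    out_w: int
    kernel: int

    def macs(self) -> int:
        # Multiply-accumulate operations needed for the whole layer.
        return self.in_ch * self.out_ch * self.out_h * self.out_w * self.kernel ** 2


def conv_cycles(layer: ConvLayer, pe_in: int, pe_out: int) -> int:
    # Assumption: each cycle the array handles pe_in input channels and
    # pe_out output channels for one kernel tap at one output pixel.
    in_tiles = ceil(layer.in_ch / pe_in)
    out_tiles = ceil(layer.out_ch / pe_out)
    return in_tiles * out_tiles * layer.out_h * layer.out_w * layer.kernel ** 2


def utilization(layer: ConvLayer, pe_in: int, pe_out: int) -> float:
    # Fraction of multiplier slots doing useful work; channel counts that do
    # not divide the array size leave part of the array idle.
    return layer.macs() / (conv_cycles(layer, pe_in, pe_out) * pe_in * pe_out)


def throughput_gops(layers, pe_in: int, pe_out: int, clock_hz: float) -> float:
    # One MAC is counted as two operations (multiply and add).
    total_ops = 2 * sum(layer.macs() for layer in layers)
    total_cycles = sum(conv_cycles(layer, pe_in, pe_out) for layer in layers)
    return total_ops / (total_cycles / clock_hz) / 1e9
```

Under these assumptions, a layer whose channel counts are multiples of the array dimensions reaches full utilization, while a 3-channel input layer on an 8-wide array uses only 3/8 of the multipliers. This is one simple way to see how poor adaptation to the CNN structure lowers resource utilization.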
Source:
Chinese Journal of Computers
ISSN: 0254-4164
Year: 2023
Issue: 6
Volume: 46
Page: 1139-1155
WoS CC Cited Count: 0
ESI Highly Cited Papers on the List: 0