[Paper] Cycle-accurate NPU Simulator and Performance Evaluation According to Data Access Strategies

권구윤; 박상우; 서태원

doi:10.14372/iemek.2022.17.4.217

Cycle-accurate NPU 시뮬레이터 및 데이터 접근 방식에 따른 NPU 성능평가
Cycle-accurate NPU Simulator and Performance Evaluation According to Data Access Strategies

IEMEK Journal of Embedded Systems and Applications (대한임베디드공학회논문지), Vol. 17, No. 4, pp. 217-228, 2022

권구윤 (Korea University), 박상우 (Korea University), 서태원 (Korea University)

Abstract

Demand is growing for deep neural networks (DNNs) in embedded applications such as classification and object detection. Because of constraints on power, performance, and area, DNN processing in the embedded domain often requires custom hardware such as a neural processing unit (NPU) for acceleration. Processing DNN models requires a large amount of data, and transferring it seamlessly to the NPU is crucial for performance. In this paper, we develop a cycle-accurate NPU simulator to evaluate diverse NPU microarchitectures. In addition, we propose a novel technique for reducing the number of memory accesses when processing convolutional layers of convolutional neural networks (CNNs) on the NPU. The main idea is to reuse data with memory interleaving: the data shared by the previous and current input windows is recycled rather than fetched again, and data memory interleaving makes it possible to quickly read consecutive data from unaligned locations. We implemented the proposed technique in the cycle-accurate NPU simulator and measured its performance with LeNet-5, VGGNet-16, and ResNet-50. The experiments show up to a 2.08x speedup in processing one convolutional layer compared to the baseline.
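The two ideas in the abstract, reusing the overlap between consecutive input windows and reading unaligned data through interleaved banks, can be illustrated with a small counting model. This is only a sketch under stated assumptions, not the paper's simulator. The function names `conv_band_reads` and `read_cycles`, the 8-bank memory, and the 32-element row width are hypothetical choices made for illustration. The model counts element fetches and access cycles only, so it does not reproduce the cycle-level results reported in the paper.

```python
from math import ceil


def conv_band_reads(width, kernel, stride=1, reuse=False):
    """Count elements fetched while sliding a kernel x kernel window
    horizontally across one band of `kernel` input rows (toy model)."""
    if width < kernel or stride < 1:
        raise ValueError("need width >= kernel and stride >= 1")
    windows = (width - kernel) // stride + 1
    per_window = kernel * kernel
    if not reuse:
        # Baseline: every window is fetched in full.
        return windows * per_window
    # With reuse: only the columns that newly enter the window are fetched.
    new_cols = min(stride, kernel)
    return per_window + (windows - 1) * kernel * new_cols


def read_cycles(start, count, banks, interleaved):
    """Access cycles needed to read `count` consecutive elements from `start`.

    Assumed model: without interleaving, the memory returns one aligned row
    of `banks` elements per cycle. With interleaving, address a lives in
    bank a % banks and each bank is addressed independently.
    """
    if interleaved:
        # Consecutive addresses spread evenly over the banks.
        return ceil(count / banks)
    # Number of aligned rows touched by the (possibly unaligned) span.
    return (start + count - 1) // banks - start // banks + 1


if __name__ == "__main__":
    print(conv_band_reads(32, 5), conv_band_reads(32, 5, reuse=True))
    print(read_cycles(6, 5, banks=8, interleaved=False),
          read_cycles(6, 5, banks=8, interleaved=True))
```

In this toy setting, a 5×5 window sliding with stride 1 over a 32-element-wide band fetches 700 elements without reuse and 160 with reuse. An unaligned 5-element read starting at address 6 touches two aligned rows without interleaving but completes in a single access with 8 interleaved banks.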

Tables/Figures (19)

Fig. 1. One Cycle of the Cycle-accurate Simulator
Fig. 2. (a) NPU Hardware Architecture (b) PE (Processing Element) Architecture
Fig. 3. Postprocessing Unit Architecture
Fig. 4. Example of an Unaligned Input Feature Map
Fig. 5. Example #1: Reading 5 Data Elements from Data Memory
Fig. 6. Example #2: Reading 8 Data Elements from Data Memory
Fig. 7. Data Fetcher Architecture
Fig. 8. Example #1: IFM SubFetcher Operation when IPA is Ready
Fig. 9. Example #2: IFM SubFetcher Operation when IPA is not Ready
Fig. 10. (a) Width-first Kernel Split of a 10×5 Kernel Composed by Combining Two 5×5 Kernels (b) Height-first Kernel Split of a 10×5 Kernel Composed by Combining Two 5×5 Kernels
Fig. 11. Internal Memory Size and Data Bus Width Settings for Experiments
Table 1. Test Accuracy of Quantized ResNet-50 by Calculation Method of $\hat{\alpha}$
Fig. 12. Speedup with Data Reuse Alone when FIFO Size is 128
Fig. 13. Speedup with Data Reuse and Memory Interleaving when FIFO Size is 128
Table 2. Total Number of Cycles for Computing the First Convolutional Layer in ResNet-50 (Kernel Size 7×7)
Table 3. Total Number of Cycles for Computing the First Convolutional Layer in LeNet-5 (Kernel Size 5×5)
Table 4. Total Number of Cycles for Computing the First Convolutional Layer in VGGNet-16 (Kernel Size 3×3)
Fig. 14. (a) Example #2 of Width-first Kernel Split (b) Example #2 of Height-first Kernel Split
Table 5. Speedup by Increasing FIFO Size from 128 to 256

References (20)

H. Esmaeilzadeh, A. Sampson, L. Ceze, D. Burger, "Neural Acceleration for General-purpose Approximate Programs," in 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 449-460, 2012.
Y. Chen, Y. Xie, L. Song, F. Chen, T. Tang, "A Survey of Accelerator Architectures for Deep Neural Networks," Engineering, Vol. 6, No. 3, pp. 264-274, 2020.
https://cloud.google.com/tpu
A. Skillman, T. Edso, "A Technical Overview of Cortex-M55 and Ethos-U55: Arm's Most Capable Processors for Endpoint AI," in 2020 IEEE Hot Chips 32 Symposium (HCS), pp. 1-20, 2020.
J. Choquette, W. Gandhi, O. Giroux, N. Stam, R. Krashinsky, "NVIDIA A100 Tensor Core GPU: Performance and Innovation," IEEE Micro, Vol. 41, No. 2, pp. 29-35, 2021.
http://deepx.musigndm.com/product/
https://coral.ai/products/accelerator-module
J. W. Jang, S. Lee, D. Kim, H. Park, A. S. Ardestani, Y. Choi, C. Kim, Y. Kim, H. Yu, H. Abdel-Aziz, J. S. Park, H. Lee, D. Lee, M. W. Kim, H. Jung, H. Nam, D. Lim, S. Lee, J. H. Song, S. Kwon, J. Hassoun, S. H. Lim, C. Choi, "Sparsity-aware and Re-configurable NPU Architecture for Samsung Flagship Mobile SoC," in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 15-28, 2021.
https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-n78
Y. Wang, D. Deng, L. Liu, S. Wei, S. Yin, "PL-NPU: An Energy-Efficient Edge-Device DNN Training Processor With Posit-Based Logarithm-Domain Computing," IEEE Transactions on Circuits and Systems I: Regular Papers, 2022.
L. Lu, Y. Liang, Q. Xiao, S. Yan, "Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs," in 2017 IEEE 25th Annual International Symposium on Field-programmable Custom Computing Machines (FCCM), pp. 101-108, 2017.
Y. H. Chen, T. Krishna, J. S. Emer, V. Sze, "Eyeriss: An Energy-efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE Journal of Solid-state Circuits, Vol. 52, No. 1, pp. 127-138, 2016.
Y. H. Chen, T. J. Yang, J. Emer, V. Sze, "Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, Vol. 9, No. 2, pp. 292-308, 2019.
K. Prabhu, A. Gural, Z. F. Khan, R. M. Radway, M. Giordano, K. Koul, R. Doshi, J. W. Kustin, T. Liu, G. B. Lopes, V. Turbiner, W. S. Khwa, Y. D. Chih, M. F. Chang, G. Lallement, B. Murmann, S. Mitra, P. Raina, "CHIMERA: A 0.92-TOPS, 2.2-TOPS/W Edge AI Accelerator With 2-MByte On-Chip Foundry Resistive RAM for Efficient Training and Inference," IEEE Journal of Solid-State Circuits, Vol. 57, No. 4, pp. 1013-1026, 2022.
L. Deng, "The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web]," IEEE Signal Processing Magazine, Vol. 29, No. 6, pp. 141-142, 2012.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, "ImageNet: A Large-scale Hierarchical Image Database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, 2009.
Y. LeCun, "LeNet-5, Convolutional Neural Networks," URL: http://yann.lecun.com/exdb/lenet, Vol. 20, No. 5, pp. 14, 2015.
K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
K. He, X. Zhang, S. Ren, J. Sun, "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
H. Wu, P. Judd, X. Zhang, M. Isaev, P. Micikevicius, "Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation," arXiv preprint arXiv:2004.09602, 2020.
