[Paper] 깊이별 분리 합성곱을 위한 다중 스레드 오버랩 시스톨릭 어레이

윤종호; 이승규; 강석형

doi:10.22895/tse.2024.0001

Multithreaded and Overlapped Systolic Array for Depthwise Separable Convolution

Transactions on Semiconductor Engineering (반도체공학회 논문지), vol. 2, no. 1, 2024, pp. 1-8

윤종호 (Department of Electrical Engineering, Pohang University of Science and Technology), 이승규 (Hyosung Ventures), 강석형 (Department of Electrical Engineering, Pohang University of Science and Technology)

Abstract

Low utilization of processing elements (PEs) is one of the limitations of a systolic array (SA) when it processes depthwise separable convolution. In this study, we propose a new SA architecture that maximizes the throughput of depthwise convolution. In addition, the proposed SA increases utilization by performing the subsequent pointwise convolution on idle PEs while the depthwise convolution is being computed. Once all depthwise convolution operations are complete, all PEs are used to accelerate the remaining pointwise convolution. As a result, when running MobileNetV3, the proposed 128×128 SA is 4.05× and 1.75× faster and consumes 66.7% and 25.4% less energy than the basic SA and RiSA, respectively.
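For readers unfamiliar with the operation the abstract refers to, the following is a minimal NumPy sketch of depthwise separable convolution as a software reference. It is not the authors' hardware mapping or implementation; the function names, the (channels, height, width) layout, stride 1, and the absence of padding are all assumptions made for illustration. The depthwise stage filters each channel with its own kernel and never sums across channels, while the pointwise stage is a dense 1×1 channel-mixing matrix multiply.

```python
import numpy as np

def depthwise_conv(x, w_dw):
    """Depthwise stage (assumed: stride 1, no padding).

    x:    input of shape (C, H, W)
    w_dw: one K x K kernel per channel, shape (C, K, K)
    Each output channel depends only on the matching input channel.
    """
    C, H, W = x.shape
    K = w_dw.shape[1]
    out = np.zeros((C, H - K + 1, W - K + 1), dtype=x.dtype)
    for c in range(C):
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                out[c, i, j] = np.sum(x[c, i:i + K, j:j + K] * w_dw[c])
    return out

def pointwise_conv(x, w_pw):
    """Pointwise (1x1) stage: mixes C input channels into M output channels.

    x:    shape (C, H, W)
    w_pw: shape (M, C)
    Equivalent to one (M x C) matrix multiply applied at every pixel.
    """
    return np.einsum("mc,chw->mhw", w_pw, x)

def depthwise_separable_conv(x, w_dw, w_pw):
    """Depthwise stage followed by the pointwise stage."""
    return pointwise_conv(depthwise_conv(x, w_dw), w_pw)

def mac_counts(C, M, K, H_out, W_out):
    """Multiply-accumulate counts of the two stages for one layer."""
    dw = C * K * K * H_out * W_out   # per-channel K x K filtering
    pw = M * C * H_out * W_out       # dense channel mixing
    return dw, pw

if __name__ == "__main__":
    x = np.ones((2, 4, 4))
    w_dw = np.ones((2, 3, 3))
    w_pw = np.ones((3, 2))
    y = depthwise_separable_conv(x, w_dw, w_pw)
    print(y.shape)
    print(mac_counts(C=2, M=3, K=3, H_out=2, W_out=2))
```

Because the pointwise stage is an ordinary matrix multiply, it fits a matrix-multiply-oriented array naturally, whereas the depthwise stage has no cross-channel reduction to spread across the array. This is the kind of structural mismatch that the low PE utilization in the abstract refers to, although the exact mapping used in the paper is not shown here.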


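Tables/Figures (12)

Figure 1. PE utilization of the SA by convolution type
Figure 2. Depthwise convolution mechanism of a 1D PE chain with two input data streams
Figure 3. Proposed SA architecture
Figure 4. Depthwise convolution mechanism of the proposed multithreaded 1D PE chain
Figure 5. Input data flow of the 1D PE chain
Figure 6. Mechanism for performing depthwise and pointwise convolution simultaneously in the overlapped structure
Figure 7. Proposed PE reallocation method
Figure 8. Inference time of several CNN models on the basic SA, RiSA, and the proposed SA
Figure 9. PE utilization for depthwise convolution
Figure 10. Area overhead of each SA design
Figure 11. Energy consumption of several CNN models on the basic SA, RiSA, and the proposed SA
Table 1. Area efficiency by SA design and array size (* design without the overlapped structure)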

References (16)

S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran et al., "cuDNN: Efficient Primitives for Deep Learning", arXiv preprint arXiv:1410.0759, 2014.
N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", Int. Symp. on Computer Architecture (ISCA), 2017, pp. 1-12.
S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, J. S. Vetter, "NVIDIA Tensor Core Programmability, Performance & Precision", Int. Parallel and Distributed Processing Symp. Workshops (IPDPSW), 2018, pp. 522-531.
A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", arXiv preprint arXiv:1704.04861, 2017.
M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L. C. Chen, "MobileNetV2: Inverted Residuals and Linear Bottlenecks", Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510-4520.
A. Howard, M. Sandler, G. Chu, L. C. Chen, B. Chen, M. Tan, "Searching for MobileNetV3", Int. Conf. on Computer Vision (ICCV), 2019, pp. 1314-1324.
M. Tan, Q. Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks", Proc. Machine Learning Research (PMLR), 2019, pp. 6105-6114.
Z. Liu, H. Mao, C. Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, "A ConvNet for the 2020s", arXiv preprint arXiv:2201.03545, 2022.
Z. Dai, H. Liu, Q. V. Le, M. Tan, "CoAtNet: Marrying Convolution and Attention for All Data Sizes", Advances in Neural Information Processing Systems 34, 2021, pp. 3965-3977.
S. Ghodrati, B. H. Ahn, J. Kim, S. Kinzer, B. R. Yatham et al., "Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks", Int. Symp. on Microarchitecture (MICRO), 2020, pp. 681-697.
J. Lee, J. Choi, J. Kim, J. Lee, Y. Kim, "Dataflow Mirroring: Architectural Support for Highly Efficient Fine-Grained Spatial Multitasking on Systolic Array NPUs", Design Automation Conf. (DAC), 2021, pp. 247-252.
H. Cho, "RiSA: A Reinforced Systolic Array for Depthwise Convolutions and Embedded Tensor Reshaping", Trans. Embedded Computing Systems (TECS) 20.5s, 2021, pp. 1-20.
R. Xu, S. Ma, Y. Wang, Y. Guo, "CMSA: Configurable Multi-directional Systolic Array for Convolutional Neural Networks", International Conference on Computer Design (ICCD), 2020, pp. 494-497.
L. Bai, Y. Zhao and X. Huang, "A CNN Accelerator on FPGA Using Depthwise Separable Convolution", IEEE Trans. Circuits and Syst. II, Exp. Briefs, vol. 65, no. 10, pp. 1415-1419, Oct. 2018.
R. Xu, S. Ma, Y. Wang, Y. Guo, "HeSA: Heterogeneous Systolic Array Architecture for Compact CNNs Hardware Accelerators", Design, Automation & Test in Europe Conf. & Exhibit. (DATE), 2021, pp. 657-662.
H. T. Kung, B. McDanel, S. Q. Zhang, "Adaptive Tiling: Apply Fixed-size Systolic Arrays to Sparse Convolutional Neural Networks", Int. Conf. on Pattern Recognition (ICPR), 2018, pp. 1006-1011.
