[Paper] Structured Pruning for Efficient Transformer Model Compression

류은지; 이영주

doi:10.22895/tse.2023.0004

Structured Pruning for Efficient Transformer Model Compression

Transactions on Semiconductor Engineering (반도체공학회 논문지), vol. 1, no. 1, 2023, pp. 23-30

류은지 (Pohang University of Science and Technology), 이영주 (Pohang University of Science and Technology)

Abstract

With the recent development of generative AI by major IT companies, Transformer models are growing exponentially in size and now exceed the trillion-parameter scale. Model compression is essential for keeping such AI services sustainable. In this paper, we propose a compression method for Transformer models that searches for hardware-friendly structured pruning patterns. Because the compression exploits the characteristics of the model's algorithm, model size is reduced while performance is preserved as far as possible. Experiments on the GPT-2 and BERT language models show that the proposed structured pruning performs almost on par with fine-grained pruning, even in the high-sparsity regime. The approach reduces model parameters by 80% with an accuracy loss of 0.003% relative to fine-grained pruning, and its structured form allows hardware acceleration.
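The abstract contrasts fine-grained pruning with structured pruning but does not spell out the pattern used. The sketch below is only an illustration of that contrast, not the authors' method: it assumes magnitude-based selection, and the 1 x 4 vector pattern, the function names, and the random 64 x 64 weight matrix are all hypothetical choices made for the example.

```python
import numpy as np


def fine_grained_prune(w, sparsity):
    """Zero the individual weights with the smallest magnitude (unstructured)."""
    k = int(round(w.size * sparsity))            # number of weights to remove
    order = np.argsort(np.abs(w), axis=None)     # flat indices, smallest first
    mask = np.ones(w.size, dtype=bool)
    mask[order[:k]] = False
    return w * mask.reshape(w.shape)


def structured_prune(w, sparsity, block=4):
    """Zero whole 1 x `block` vectors with the smallest L2 norm.

    The vector-wise pattern is an assumption for illustration only.
    """
    rows, cols = w.shape
    if cols % block != 0:
        raise ValueError("number of columns must be divisible by block")
    blocks = w.reshape(rows, cols // block, block)
    norms = np.linalg.norm(blocks, axis=2)       # one score per vector
    k = int(round(norms.size * sparsity))        # number of vectors to remove
    order = np.argsort(norms, axis=None)
    mask = np.ones(norms.size, dtype=bool)
    mask[order[:k]] = False
    mask = mask.reshape(norms.shape)[:, :, None]  # broadcast over each vector
    return (blocks * mask).reshape(rows, cols)


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))                # stand-in for a weight matrix

fg = fine_grained_prune(w, 0.8)
st = structured_prune(w, 0.8, block=4)

fg_sparsity = float(np.mean(fg == 0))
st_sparsity = float(np.mean(st == 0))
print("fine-grained sparsity:", fg_sparsity)
print("structured sparsity:  ", st_sparsity)
print("retained energy (fine-grained):", float(np.sum(fg ** 2) / np.sum(w ** 2)))
print("retained energy (structured):  ", float(np.sum(st ** 2) / np.sum(w ** 2)))
```

Both masks remove about 80% of the weights, but the structured mask leaves zeros in regular 1 x 4 runs, which is what makes it easier for hardware to skip them. On random weights the structured variant retains less weight energy than the fine-grained one; the abstract's claim is that, on trained GPT-2 and BERT models, a well-chosen structured pattern keeps that gap small.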


