[Paper] Research trends in parallel training methods for deep neural networks

Dongsuk Yook; Hyowon Lee; In-Chul Yoo

doi:10.7776/ask.2020.39.6.505

A survey on parallel training algorithms for deep neural networks

The Journal of the Acoustical Society of Korea, v.39 no.6, 2020, pp.505-514

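Dongsuk Yook (Artificial Intelligence Laboratory, Department of Computer Science, Korea University), Hyowon Lee (AI Research Center, KT Institute of Convergence Technology), In-Chul Yoo (Artificial Intelligence Laboratory, Department of Computer Science, Korea University)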

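Abstract

Training a deep neural network (DNN) model on a large amount of training data takes a long time, so parallel training methods are needed. DNNs are generally trained with Stochastic Gradient Descent (SGD), and because SGD fundamentally requires sequential processing, various approximation methods are applied in order to parallelize it. This paper introduces existing parallel DNN training algorithms and analyzes their computational cost, communication cost, and approximation methods.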

Abstract (original English)

Because training Deep Neural Networks (DNNs) typically requires a large amount of training data, training is slow and calls for a parallel approach. Stochastic Gradient Descent (SGD) is one of the most widely used methods for training DNNs. However, since SGD is an inherently sequential process, parallelizing it requires some form of approximation. In this paper, we review various efforts to parallelize the SGD algorithm and analyze their computational overhead, communication overhead, and the effects of the approximations.

Tables/Figures (8)

Fig. 1. Synchronous SGD for DNN training; (1) master model distribution; (2) local gradient computation; (3) local gradient upload; (4) synchronous master model update. Steps (1) through (4) are repeated.
Fig. 2. Asynchronous SGD for DNN training; (1) asynchronous master model download; (2) local gradient computation; (3) asynchronous local gradient upload; (4) asynchronous master model update. Steps (1) through (4) are repeated.
Fig. 3. General model parallelism for DNN training. The neurons of a DNN, represented by grey circles, are distributed over the computing nodes, which are represented as large dashed rectangles.
Fig. 4. Pipelined parallelism for DNN training. The layers of a DNN are distributed over the computing nodes. For simplicity, only two layers are shown in each computing node. In general, more than two layers may be assigned to a computing node.
Fig. 5. Parallel execution of pipelined SGD for DNN training, where $F^t_l$ and $B^t_l$ represent the forward pass and the backward pass, respectively, at layer $l$ for mini-batch $t$. For simplicity, it is assumed that a single layer is assigned to each computing node.
Fig. 6. Parallel execution of pipelined SGD for DNN training using the appropriate stale weights for the stale gradients. $W^t_l$ represents the stored weights at layer $l$ for mini-batch $t$. For simplicity, it is assumed that a single layer is assigned to each computing node.
Fig. 7. The pipelined SGD for DNN training using the corresponding stale weights for the stale gradients becomes the data parallel SGD.
Table 1. Computation and communication overhead of various parallel SGD methods for DNN training.

AI summary of the main text

* These sentences were identified automatically by AI and may include unsuitable ones; please use them with care.

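Problem definition

[14] This paper introduces methods for further parallelizing mini-batch SGD.
This paper examined several parallel algorithms for DNN training. DNN-based algorithms can be expected to improve recognition performance as larger networks are built from more data.
These approximation methods obtain the gradient required by SGD only approximately, so the effect of the approximate gradient needs to be minimized. This paper introduces existing SGD-based parallel DNN training algorithms and analyzes their computational cost, communication cost, and gradient approximation methods.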
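The sequential dependency that forces these approximations can be seen in a minimal mini-batch SGD loop. The sketch below is only an illustration on a toy one-weight regression problem; the function name `minibatch_sgd`, the data, and the hyperparameters are assumptions made for this example and do not come from the paper.

```python
import random


def minibatch_sgd(data, lr=0.5, batch_size=4, epochs=100, seed=0):
    """Fit y = w * x by mini-batch SGD on the squared error (toy, one weight)."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # The gradient is evaluated at the current w, so this update
            # cannot begin until the previous update has finished.
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w


# Hypothetical toy data generated from y = 2x.
samples = [(x / 10.0, 2.0 * x / 10.0) for x in range(1, 11)]
w_hat = minibatch_sgd(samples)
print(w_hat)  # close to 2.0
```

The samples inside one mini-batch can be processed in parallel, but consecutive mini-batches cannot, because each gradient depends on the weights produced by the previous update.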

Assumptions

To simplify the calculation of the computation and communication costs, this section assumes a MultiLayer Perceptron (MLP) model in which every layer has the same number of neurons.
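Under this uniform-width assumption, rough cost counts reduce to simple arithmetic. The sketch below is a back-of-the-envelope illustration, not a reproduction of the paper's Table 1: the function names are hypothetical, biases are ignored, and the traffic formulas assume that a data-parallel node downloads the full model and uploads a gradient of the same size once per update, while a layer-partitioned node exchanges only activations and error signals across a layer boundary.

```python
def mlp_weight_count(num_layers, width):
    """Weights in an MLP with num_layers weight matrices, each width x width."""
    return num_layers * width * width


def data_parallel_traffic(num_layers, width):
    """Values one node exchanges per update: model download plus gradient upload."""
    return 2 * mlp_weight_count(num_layers, width)


def pipeline_boundary_traffic(width, batch_size):
    """Values crossing one layer boundary per mini-batch:
    activations going forward plus error signals going backward."""
    return 2 * width * batch_size


# Hypothetical configuration: 6 weight layers, 2048 neurons each, mini-batch of 256.
print(mlp_weight_count(6, 2048))             # 25165824
print(data_parallel_traffic(6, 2048))        # 50331648
print(pipeline_boundary_traffic(2048, 256))  # 1048576
```

In this simplified count, the data-parallel traffic grows with the square of the layer width, while the traffic across a layer boundary grows only linearly with it.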
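To address this problem, one study assumed that updating the current master model parameters with a stale gradient can be regarded as a kind of momentum, because it uses a gradient derived from past master model parameters, and optimized the momentum weight experimentally.[27] Another study on the stale gradient problem proposed using a Taylor expansion to correct the stale gradient and thereby predict the gradient at the current master model parameters.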
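The idea of correcting a stale gradient with a Taylor expansion can be illustrated on a one-dimensional quadratic loss, where the first-order expansion of the gradient is exact. This is a minimal sketch under stated assumptions: the loss, the names `grad` and `compensate`, and the use of the exact second derivative as the curvature are all choices made for this example. A practical method has to approximate the curvature, and the procedure of the cited work is not reproduced here.

```python
def grad(w, a=3.0, b=1.0):
    """Gradient of the toy loss f(w) = 0.5 * a * w**2 - b * w."""
    return a * w - b


def compensate(stale_grad, w_stale, w_now, curvature):
    """First-order Taylor estimate of the gradient at w_now,
    built from a gradient that was computed at w_stale."""
    return stale_grad + curvature * (w_now - w_stale)


w_stale, w_now = 1.0, 0.4          # the master model moved while the node was computing
g_stale = grad(w_stale)            # gradient the node uploads (computed at old weights)
g_true = grad(w_now)               # gradient the master actually needs
g_fixed = compensate(g_stale, w_stale, w_now, curvature=3.0)

print(g_stale, g_true, g_fixed)    # g_fixed matches g_true up to rounding
```

For a non-quadratic loss the correction is only approximate, and its quality depends on how far the master parameters moved during the delay.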

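Proposed method

Here, a computing node may be a GPU or a Central Processing Unit (CPU) core, depending on the hardware configuration. A portion of the training data, obtained by dividing the full training set by the number of computing nodes, is sent to each computing node together with the DNN model parameters, and the nodes train simultaneously on different data. Because each node's DNN is trained on different data, the nodes end up with different local model parameter values; these local model parameters, that is, the weights of the local DNN models, must be merged to form a single master model.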
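One simple way to merge the local weights into a master model is to average them. The sketch below simulates that scheme on a toy one-weight regression problem. It is an illustration under assumptions, not the paper's procedure: the nodes are simulated one after another in a single process, and the names `local_sgd` and `model_averaging_round`, the data, and the hyperparameters are all hypothetical.

```python
def local_sgd(w, shard, lr=0.5, steps=20):
    """Train a copy of the master weight on one node's data shard."""
    for _ in range(steps):
        grad = sum((w * x - y) * x for x, y in shard) / len(shard)
        w -= lr * grad
    return w


def model_averaging_round(w_master, shards, **kwargs):
    """Send w_master to every node, train locally, then average the local weights."""
    local_weights = [local_sgd(w_master, shard, **kwargs) for shard in shards]
    return sum(local_weights) / len(local_weights)


# Hypothetical toy data from y = 2x, split evenly over three simulated nodes.
data = [(x / 10.0, 2.0 * x / 10.0) for x in range(1, 13)]
num_nodes = 3
shards = [data[i::num_nodes] for i in range(num_nodes)]

w = 0.0
for _ in range(10):
    w = model_averaging_round(w, shards)
print(w)  # close to 2.0
```

On a real system the list comprehension in `model_averaging_round` would run concurrently on separate nodes, and the averaging step is where the communication cost is paid.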

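Performance/effects

Data parallelism is divided into synchronous SGD and asynchronous SGD according to how the model parameters are synchronized. Model parallelism distributes the model parameters across several computing nodes for training; in particular, pipelined SGD, which distributes them layer by layer, was found to use computing resources efficiently while requiring little communication. Because both data parallelism and model parallelism parallelize the SGD algorithm, which requires sequential execution, they obtain gradients by approximate methods, and a number of studies have worked on minimizing the effect of these approximate gradients.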
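The synchronous variant can be sketched as one update of a simulated parameter server. The example assumes that the master averages the uploaded local gradients and that every node holds a local batch of the same size; under those assumptions one synchronous step equals one ordinary SGD step on the combined batch. The function names and the toy data are hypothetical.

```python
def gradient(w, batch):
    """Mean squared-error gradient for the toy model y = w * x."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def synchronous_step(w_master, node_batches, lr):
    """One synchronous update: every node starts from the same w_master,
    computes a local gradient, and the master applies the average."""
    local_grads = [gradient(w_master, batch) for batch in node_batches]
    return w_master - lr * sum(local_grads) / len(local_grads)


# Two simulated nodes, each with a local batch of two samples from y = 2x.
batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]

w_sync = synchronous_step(0.0, batches, lr=0.01)
w_big = 0.0 - 0.01 * gradient(0.0, batches[0] + batches[1])
print(w_sync, w_big)  # both approximately 0.15
```

An asynchronous scheme drops the requirement that all nodes start from the same `w_master`, which removes the waiting time but introduces the stale gradients discussed in the paper.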

References (39)

[1] G. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, 14, 1771-1800 (2002).
[2] G. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, 18, 1527-1554 (2006).
[3] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," Proc. Advances in Neural Information Processing Systems, 1097-1105 (2012).
[4] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, 29, 82-97 (2012).
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Proc. Advances in Neural Information Processing Systems, 2672-2680 (2014).
[6] D. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv:1312.6114 (2013).
[7] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. IEEE Conference on Computer Vision and Pattern Recognition, 770-778 (2016).
[8] D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by back-propagating errors," Nature, 323, 533-536 (1986).
[9] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "Pytorch: An imperative style, high-performance deep learning library," Proc. Advances in Neural Information Processing Systems, 8026-8037 (2019).
[10] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," arXiv:1603.04467 (2016).
[11] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hanneman, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," Proc. IEEE Automatic Speech Recognition and Understanding Workshop (2011).
[12] S. Smith and Q. Le, "A Bayesian perspective on generalization and stochastic gradient descent," Proc. Int. Conf. on Learning Representations (2018).
[13] T. Zhang, "Solving large scale linear prediction problems using stochastic gradient descent algorithms," Proc. Int. Conf. on Machine Learning (2004).
[14] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, "On parallelizability of stochastic gradient descent for speech DNNs," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 235-239 (2014).
[15] M. Zinkevich, M. Weimer, A. Smola, and L. Li, "Parallelized stochastic gradient descent," Proc. Advances in Neural Information Processing Systems, 2595-2603 (2010).
[16] L. Valiant, "A bridging model for parallel computation," Communications of the ACM, 33, 103-111 (1990).
[17] H. Su, H. Chen, and H. Xu, "Experiments on parallel training of deep neural network using model averaging," arXiv:1507.01239 (2015).
[18] J. Hermans, On scalable deep learning and parallelizing gradient descent, (Master's thesis, Maastricht University, 2017).
[19] S. Zhang, A. Choromanska, and Y. LeCun, "Deep learning with elastic averaging SGD," Proc. Advances in Neural Information Processing Systems, 685-693 (2015).
[20] K. Chen and Q. Huo, "Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 5880-5884 (2016).
[21] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, "Large scale distributed deep networks," Proc. Advances in Neural Information Processing Systems, 1223-1231 (2012).
[22] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz, "Revisiting distributed synchronous SGD," arXiv:1604.00981 (2016).
[23] F. Niu, B. Recht, C. Re, and S. Wright, "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent," Proc. Advances in Neural Information Processing Systems, 693-701 (2011).
[24] S. Sallinen, N. Satish, M. Smelyanskiy, S. Sury, and C. Re, "High performance parallel stochastic gradient descent in shared memory," Proc. IEEE International Parallel and Distributed Processing Symposium, 873-882 (2016).
[25] S. Zhao and W. Li, "Fast asynchronous parallel stochastic gradient descent: A lock-free approach with convergence guarantee," Proc. AAAI Conference on Artificial Intelligence, 2379-2385 (2016).
[26] X. Lian, W. Zhang, C. Zhang, and J. Liu, "Asynchronous decentralized parallel stochastic gradient descent," Proc. Int. Conf. on Machine Learning, 3049-3058 (2018).
[27] I. Mitliagkas, C. Zhang, S. Hadjis, and C. Re, "Asynchrony begets momentum, with an application to deep learning," Proc. Annual Allerton Conference, 997-1004 (2016).
[28] S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z. Ma, and T. Liu, "Asynchronous stochastic gradient descent with delay compensation," Proc. Int. Conf. on Machine Learning, 4120-4129 (2017).
[29] O. Yadan, K. Adams, Y. Taigman, and M. Ranzato, "Multi-GPU training of ConvNets," arXiv:1312.5853 (2013).
[30] A. Petrowski, G. Dreyfus, and C. Girault, "Performance analysis of a pipelined backpropagation parallel algorithm," IEEE Trans. on Neural Networks, 4, 970-981 (1993).
[31] X. Chen, A. Eversole, G. Li, D. Yu, and F. Seide, "Pipelined back-propagation for context-dependent deep neural networks," Proc. Interspeech, 26-29 (2012).
[32] Z. Huo, B. Gu, Q. Yang, and H. Huang, "Decoupled parallel backpropagation with convergence guarantee," Proc. Int. Conf. on Machine Learning, 2098-2106 (2018).
[33] Z. Huo, B. Gu, and H. Huang, "Training neural networks using features replay," Proc. Advances in Neural Information Processing Systems, 6659-6668 (2018).
[34] M. Jaderberg, W. Czarnecki, S. Osindero, O. Vinyals, A. Graves, D. Silver, and K. Kavukcuoglu, "Decoupled neural interfaces using synthetic gradients," Proc. Int. Conf. on Machine Learning, 1627-1635 (2017).
[35] C. Chen, C. Yang, and H. Cheng, "Efficient and robust parallel DNN training through model parallelism on multi-GPU platform," arXiv:1809.02839 (2018).
[36] Y. Huang, Y. Cheng, A. Bapna, O. Firat, M. Chen, D. Chen, H. Lee, J. Ngiam, Q. Le, Y. Wu, and Z. Chen, "GPipe: Efficient training of giant neural networks using pipeline parallelism," Proc. Advances in Neural Information Processing Systems, 103-112 (2019).
[37] H. Lee, K. Lee, I. Yoo, and D. Yook, "Analysis of parallel training algorithms for deep neural networks," Proc. Annual Conference on Computational Science and Computational Intelligence, 1462-1463 (2018).
[38] D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, P. Gibbons, and M. Zaharia, "PipeDream: Generalized pipeline parallelism for DNN training," ACM Symposium on Operating Systems Principles, 1-15 (2019).
[39] S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, V. Manohar, D. Povey, D. Raj, D. Snyder, A. Subramanian, J. Trmal, B. Yair, C. Boeddeker, Z. Ni, Y. Fujita, S. Horiguchi, N. Kanda, and T. Yoshioka, "CHiME-6 Challenge: Tackling multispeaker speech recognition for unsegmented recordings," Proc. Int. Workshop on Speech Processing in Everyday Environments (2020).
