
[International Paper] Empirical Performance Analysis of Collective Communication for Distributed Deep Learning in a Many-Core CPU Environment

Applied Sciences, vol. 10, no. 19, 2020, Article 6717

Woo, Junghoon; Choi, Hyeonseong; Lee, Jaehwan (School of Electronics and Information Engineering, Korea Aerospace University, 76 Hanggongdaehak-ro, Deogyang-gu, Goyang-si, Gyeonggi-do 10540, Korea)

Abstract

To accommodate large training datasets and complex training models, "distributed" deep learning training is employed more and more frequently. However, communication bottlenecks between distributed systems lead to poor performance of distributed deep learning training. In this stud...
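The abstract's subject, collective communication for synchronizing gradients across workers, can be illustrated with a small, purely illustrative sketch (not taken from the paper) of the ring all-reduce pattern popularized by libraries such as Horovod. Everything here is a single-process simulation: the "network" is plain array copying, and all names are hypothetical.

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring all-reduce: every worker ends up with the element-wise
    sum of all workers' gradients, exchanging only one chunk per step."""
    n = len(grads)
    # Each worker splits its gradient into n chunks of (nearly) equal size.
    chunks = [np.array_split(np.asarray(g, dtype=float), n) for g in grads]

    # Phase 1, reduce-scatter: after n-1 steps, worker i holds the fully
    # summed chunk (i+1) mod n. Sends are buffered first so that all
    # transfers within one step happen "simultaneously".
    for step in range(n - 1):
        sends = [((i + 1) % n, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for dst, c, data in sends:
            chunks[dst][c] += data

    # Phase 2, all-gather: circulate the completed chunks around the ring
    # until every worker holds every fully summed chunk.
    for step in range(n - 1):
        sends = [((i + 1) % n, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for dst, c, data in sends:
            chunks[dst][c] = data

    return [np.concatenate(c) for c in chunks]

# Four workers with different gradients; all should end with the same sum.
workers = [np.full(10, w, dtype=float) for w in range(4)]
results = ring_allreduce(workers)
```

Ring all-reduce is attractive precisely when communication is the bottleneck: each worker transmits roughly 2(n-1)/n times the gradient size in total, independent of the number of workers, instead of sending the whole gradient to a central node.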

Open Access (OA) type: GOLD (article published in an open-access journal)
