[논문]쿠버네티스에서 분산 학습 작업 성능 향상을 위한 오토스케일링 기반 동적 자원 조정 오퍼레이터

정진원; 유헌창

doi:10.3745/ktccs.2022.11.7.205

쿠버네티스에서 분산 학습 작업 성능 향상을 위한 오토스케일링 기반 동적 자원 조정 오퍼레이터
Dynamic Resource Adjustment Operator Based on Autoscaling for Improving Distributed Training Job Performance on Kubernetes 원문보기

정보처리학회논문지. KIPS transactions on computer and communication systems 컴퓨터 및 통신 시스템, v.11 no.7, 2022년, pp.205 - 216

초록
AI-Helper

딥러닝 분산 학습에 사용되는 많은 도구 중 하나는 컨테이너 오케스트레이션 도구인 쿠버네티스에서 실행되는 큐브플로우이다. 그리고 큐브플로우에서 기본적으로 제공하는 오퍼레이터를 사용하여 텐서플로우 학습 작업을 관리할 수 있다. 하지만 파라미터 서버 아키텍처 기반의 딥러닝 분산 학습 작업을 고려할 때 기존의 오퍼레이터가 사용하는 스케줄링 정책은 분산학습 작업의 태스크 친화도를 고려하지 않으며 자원을 동적으로 할당하거나 해제하는 기능을 제공하지 않는다. 이는 작업의 완료 시간이 오래 걸리거나 낮은 자원 활용률로 이어질 수 있다. 따라서 본 논문에서는 작업의 완료 시간을 단축시키고 자원 활용률을 높이기 위해 딥러닝 분산 학습 작업을 효율적으로 스케줄링하는 새로운 오퍼레이터를 제안한다. 기존 오퍼레이터를 수정하여 새로운 오퍼레이터를 구현하고 성능 평가를 위한 실험을 수행한 결과, 제안한 스케줄링 정책은 평균 작업 완료 시간 감소율을 최대 84%, 평균 CPU 활용 증가율을 최대 92%까지 향상시킬 수 있음을 보여준다.

Abstract ▼ AI-Helper

One of the many tools used for distributed deep learning training is Kubeflow, which runs on Kubernetes, a container orchestration tool. TensorFlow jobs can be managed using the existing operator provided by Kubeflow. However, when considering the distributed deep learning training jobs based on the parameter server architecture, the scheduling policy used by the existing operator does not consider the task affinity of the distributed training job and does not provide the ability to dynamically allocate or release resources. This can lead to long job completion time and low resource utilization rate. Therefore, in this paper we proposes a new operator that efficiently schedules distributed deep learning training jobs to minimize the job completion time and increase resource utilization rate. We implemented the new operator by modifying the existing operator and conducted experiments to evaluate its performance. The experiment results showed that our scheduling policy improved the average job completion time reduction rate of up to 84% and average CPU utilization increase rate of up to 92%.

주제어

표/그림 (11)

그림 Fig. 1. Parameter Server Architecture
그림 Fig. 2. Kubernetes Scheduling Lifecycle
그림 Fig. 3. Overall Architecture
그림 Fig. 4. Relationship between Training Time and Number of Workers
표 Table 1. Notation of Algorithm
그림 Fig. 5. State transition diagram of a TFJob
그림 Fig. 6. KOD2 Architecture
그림 Fig. 7. Average Job Completion Time
표 Table 2. Cluster Configuration
표 Table 3. Training Job Configuration
그림 Fig. 8. Average CPU Utilization

참고문헌 (27)

T. Ben-Nun and T. Hoefler, "Demystifying parallel and distributed deep learning: An in-depth concurrency analysis," In ACM Computing Surveys (CSUR), Vol.52, No.4, pp.1-43, 2019.
P. Goyal, et al., "Accurate, large minibatch sgd: Training imagenet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
X. Jia, et al., "Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes," arXiv preprint arXiv:1807.11205, 2018.
C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng, "Image classification at supercomputer scale," arXiv preprint arXiv:1811.06992, 2018.
S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.
C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl. "Measuring the effects of data parallelism on neural network training," arXiv:1811.03600, 2018.
M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D.G. Andersen, and A. J. Smola. "Parameter server for distributed machine learning," In Big Learning NIPS Workshop, 2013.
Kubeflow 2021, accessed 1 September 2021 [Internet], https://www.kubeflow.org.
TensorFlow Operator 2021, accessed 1 September 2021 [Internet], https://www.kubeflow.org/docs/components/training/tftraining/#installing-tensorflow-operator.
S. Li, et al., "Pytorch distributed: Experiences on accelerating data parallel training," arXiv preprint arXiv: 2006.15704, 2020.
H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing, "Geeps: Scalable deep learning on distributed gpus with a gpu-specialized parameter server," In Proceedings of the Eleventh European Conference on Computer Systems, 2016.
Kubernetes 2021, accessed 1 September 2021 [Internet], https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler.
Y. Peng, Y. Bao, Y. Chen, C. Wu, and C. Guo. "Optimus: An efficient dynamic resource scheduler for deep learning clusters," In Proceedings of ACM EuroSys, 2018.
Operator 2021, accessed 1 September 2021 [Internet], https://cloud.redhat.com/learn/topics/operators.
J. Gu, "Tiresias: A GPU cluster manager for distributed deep learning," In Proceedings of USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2019.
R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. "Multi-resource packing for cluster schedulers," In ACM SIGCOMM Computer Communication Review, Vol.44, No.4, pp.455-466, 2014.

상세보기
Y. Chen, "Convolutional neural network for sentence classification," MS thesis, University of Waterloo, 2015.
K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
Prometheus 2021, accessed 1 September 2021 [Internet], https://prometheus.io.
Grafana 2021, accessed 1 September 2021 [Internet], https://grafana.com
J. Geng, D. Li, and S. Wang. "Accelerating distributed machine learning by smart parameter server," In Proceedings of 3rd Asia-Pacific Workshop Networking, 2019.
E. Gebremeskel, "Analysis and comparison of distributed training techniques for deep neural networks in a dynamic environment," 2018.
W. Xiao, "Gandiva: Introspective cluster scheduling for deep learning," In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018.
Y. Bao, Y. Peng, C. Wu, and Z. Li, "Online job scheduling in distributed machine learning clusters," In IEEE INFOCOM 2018-IEEE Conference on Computer Communications, 2018.
M. Khalil-Hani and S. Liew, "A-sdlm: an asynchronous stochastic learning algorithm for fast distributed learning," In 13th Australasian Symposium on Parallel and Distributed Computing, 2015.

내보내기 구분	파일저장 인쇄 메일전송
구성항목	기본정보 상세정보 관리번호, 논문명, 저널/프로시딩명, 저자 , 발행년, 권, 호, 시작페이지, 끝페이지, 발행기관 관리번호, 논문명, 대등논문명, 저자 , 저널/프로시딩명, 발행기관, 발행년, 발행언어, 권, 호, 시작페이지, 끝페이지, ISBN, ISSN, 주제분야, 키워드, 초록(한글), 초록(영문), 저자(소속기관)
저장형식	Text(ASCII format) Excel format RefWorks Direct Export RIS format (for Reference Manager, ProCite, EndNote), Scholar's Aids, Mendeley
메일정보	받는사람 (필수) @ 보내는사람 (선택) @ 제목 내용 KISTI 검색결과 이메일 서비스
안내	총 건의 자료가 검색되었습니다. 다운받으실 자료의 인덱스를 입력하세요. (1-10,000) 검색결과의 순서대로 최대 10,000건 까지 다운로드가 가능합니다. 데이타가 많을 경우 속도가 느려질 수 있습니다.(최대 2~3분 소요) 다운로드 파일은 UTF-8 형태로 저장됩니다. 파일의 내용이 제대로 보이지 않을실 때는 웹브라우저 상단의 보기 -> 인코딩 -> 자동선택 여부를 확인하십시오. ~ Text(ASCII format) Excel format

연합인증