KIPS Transactions on Computer and Communication Systems, Vol.11, No.7, 2022, pp.205-216
Jinwon Jeong (Department of Computer Science, Korea University), Heonchang Yu (Department of Computer Science, Korea University)
One of the many tools used for distributed deep learning training is Kubeflow, which runs on Kubernetes, a container orchestration tool. TensorFlow jobs can be managed using the existing operator provided by Kubeflow. However, when considering the distributed deep learning training jobs based on the...
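The abstract refers to the existing TensorFlow operator that Kubeflow provides for managing training jobs. As a rough illustration of the kind of job description that operator consumes, the sketch below submits a TFJob custom resource with the official `kubernetes` Python client. It is not the paper's code: the namespace, job name, container image, and replica counts are illustrative assumptions; only the `kubeflow.org/v1` TFJob resource shape and the client API call are standard.

```python
# Minimal sketch: submitting a TFJob to the Kubeflow TF operator.
# Assumes a reachable cluster with the operator installed; the image,
# namespace, and replica counts below are placeholders, not the paper's setup.
from kubernetes import client, config

config.load_kube_config()  # load credentials from the local kubeconfig

tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "dist-train-example", "namespace": "kubeflow"},
    "spec": {
        "tfReplicaSpecs": {
            # Parameter-server architecture: PS pods hold the model
            # parameters, Worker pods compute and push gradient updates.
            "PS": {
                "replicas": 1,
                "restartPolicy": "Never",
                "template": {"spec": {"containers": [{
                    "name": "tensorflow",
                    "image": "example/dist-train:latest",  # assumed image
                }]}},
            },
            "Worker": {
                "replicas": 2,
                "restartPolicy": "Never",
                "template": {"spec": {"containers": [{
                    "name": "tensorflow",
                    "image": "example/dist-train:latest",  # assumed image
                }]}},
            },
        }
    },
}

# The operator watches TFJob resources and creates the underlying pods.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="kubeflow",
    plural="tfjobs", body=tfjob,
)
```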