[Paper] Lightening of Human Pose Estimation Algorithm Using MobileViT and Transfer Learning

Kunwoo Kim; Jonghyun Hong; Jonghyuk Park

doi:10.9708/jksci.2023.28.09.017


Journal of the Korea Society of Computer and Information, v.28 no.9, 2023, pp.17-25

Kunwoo Kim (Dept. of AI, Big Data & Management, Kookmin University), Jonghyun Hong (Dept. of AI, Big Data & Management, Kookmin University), Jonghyuk Park (Dept. of AI, Big Data & Management, Kookmin University)

Abstract

This paper proposes a MobileViT-based model for human pose estimation that uses fewer parameters and runs inference faster. The base model, MobileViT, achieves its light weight through a structure that combines features of convolutional neural networks with features of the Vision Transformer. Transformer-based models, whose core mechanism this study builds on, have grown in influence because they outperform convolutional neural network-based models in computer vision as well. The same holds for human pose estimation: the Vision Transformer-based ViTPose holds the best performance on human pose estimation benchmarks such as COCO, OCHuman, and MPII. However, the Vision Transformer is a heavy architecture with many parameters and a relatively large computational cost, which makes training expensive for users. The base model addresses the Vision Transformer's lack of inductive bias, which drives up its computational requirements, by learning local representations through a convolutional structure. On the validation set of the MS COCO human pose estimation benchmark, the proposed model reached a mean average precision of 69.4% (0.694) with 3.28 GFLOPs and 9.72 million parameters, which are 1/5 and 1/9 of the ViTPose figures, respectively.

Tables/Figures (7)

Fig. 1. System Architecture
Table 1. Dataset
Fig. 2. Comparison with Model Accuracy
Fig. 3. Comparison with Model Loss
Table 2. Comparative Experiment Results Table
Table 3. Comparative Experiment Results Table
Fig. 4. Mobile ViTPose Experimental Results Photos

AI Summary of the Main Text

Problem Definition

ViTPose delivers strong pose estimation performance, but at the cost of substantial computational resources. To overcome the weakness of this heavy and complex structure, this paper uses MobileViT to improve on model size.

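Proposed Method

Because the proposed MobileViTPose has a simple structure, its scale can be changed easily by freely adjusting the number of layers and the feature dimension. Taking advantage of this, the study evaluated performance with two backbones: the larger MobileViT-S, with 5.6 million parameters, and the smaller MobileViT-XXS, with 1.3 million parameters.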
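As a rough illustration of how layer count and feature dimension drive model size, the sketch below counts the learnable parameters of a generic Transformer encoder stack. It is not the MobileViT or MobileViTPose configuration: the function names, the `dim`, `mlp_dim`, and `num_layers` values, and the choice of a standard encoder block with biased linear layers are all assumptions made for illustration.

```python
def transformer_block_params(dim: int, mlp_dim: int) -> int:
    """Learnable parameters in one generic Transformer encoder block.

    Assumes biased linear layers and two LayerNorms (an illustrative
    layout, not the configuration used in the paper).
    """
    attention = 4 * (dim * dim + dim)  # Q, K, V and output projections
    mlp = (dim * mlp_dim + mlp_dim) + (mlp_dim * dim + dim)  # two linear layers
    layer_norms = 2 * (2 * dim)  # scale and shift for each LayerNorm
    return attention + mlp + layer_norms


def encoder_params(dim: int, mlp_dim: int, num_layers: int) -> int:
    """Parameters in a stack of identical encoder blocks."""
    return num_layers * transformer_block_params(dim, mlp_dim)


if __name__ == "__main__":
    # Hypothetical "narrow/shallow" and "wide/deep" settings.
    for name, dim, layers in [("narrow", 64, 2), ("wide", 144, 4)]:
        print(name, encoder_params(dim, 2 * dim, layers))
```

The attention and MLP terms grow with the square of the feature dimension and only linearly with the number of layers, so widening a model tends to add parameters much faster than deepening it.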
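To verify the merits of the proposed MobileViTPose, its performance was compared against variants of the following four models. First, ViTPose performs human pose estimation with a ViT encoder, a Transformer model designed for vision tasks, and currently holds the best performance on several benchmarks.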

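Target Data

This study ran its experiments on the MS COCO dataset. Table 1 gives the statistics of the dataset.
The goal of this paper is to propose MobileViTPose, a lightweight model for human pose estimation that uses MobileViT while minimizing performance loss. The MobileViT model used for this purpose consists of a sequence of MobileNetV2 blocks (MV2) and MobileViT blocks, as shown in Fig. 1.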
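The MV2 block is the inverted residual block of MobileNetV2: a 1x1 expansion, a depthwise 3x3 convolution, and a 1x1 linear projection. The following sketch counts its convolution weights to show why the depthwise stage is cheap compared with a dense convolution of the same width. The expansion factor of 4, the channel widths, and the helper names are hypothetical, and normalization and bias parameters are ignored.

```python
def dense_conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weights in a dense kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out


def depthwise_conv_params(channels: int, kernel: int = 3) -> int:
    """Weights in a depthwise convolution: one kernel x kernel filter per channel."""
    return kernel * kernel * channels


def mv2_block_params(c_in: int, c_out: int, expansion: int = 4, kernel: int = 3) -> int:
    """Weights in an inverted residual block (expand, depthwise, project)."""
    hidden = c_in * expansion
    expand = c_in * hidden  # 1x1 pointwise expansion
    depthwise = depthwise_conv_params(hidden, kernel)  # spatial filtering
    project = hidden * c_out  # 1x1 linear projection
    return expand + depthwise + project


if __name__ == "__main__":
    hidden = 64 * 4  # hypothetical hidden width
    print("dense 3x3 at hidden width:", dense_conv_params(hidden, hidden))
    print("depthwise 3x3 at hidden width:", depthwise_conv_params(hidden))
    print("whole MV2 block, 64 -> 64:", mv2_block_params(64, 64))
```

The depthwise stage grows linearly with the hidden width, while a dense 3x3 convolution at the same width grows quadratically, which is where most of the saving in such blocks comes from.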

Theory/Model

This study uses two evaluation metrics, Average Precision (AP) and Average Recall (AR). In human pose estimation, AP and AR measure the similarity between the estimated pose and the ground-truth pose, and are computed from Object Keypoint Similarity (OKS).
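A minimal sketch of OKS for a single person is given here, assuming the commonly used COCO-style definition: each labeled keypoint contributes exp(-d_i^2 / (2 s^2 k_i^2)), where d_i is the distance between predicted and ground-truth keypoint, s^2 is the object area, and k_i is a per-keypoint constant. The function name, argument layout, and the constants in the example are illustrative assumptions, not values from the paper.

```python
import math
from typing import Sequence, Tuple

Keypoint = Tuple[float, float]


def object_keypoint_similarity(
    predicted: Sequence[Keypoint],
    ground_truth: Sequence[Keypoint],
    visibility: Sequence[int],
    object_area: float,
    keypoint_constants: Sequence[float],
) -> float:
    """Mean of exp(-d_i^2 / (2 * s^2 * k_i^2)) over labeled keypoints."""
    total = 0.0
    labeled = 0
    for (px, py), (gx, gy), v, k in zip(
        predicted, ground_truth, visibility, keypoint_constants
    ):
        if v <= 0:  # keypoints without a ground-truth label are skipped
            continue
        squared_distance = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-squared_distance / (2.0 * object_area * k ** 2))
        labeled += 1
    return total / labeled if labeled else 0.0


if __name__ == "__main__":
    gt = [(0.0, 0.0), (10.0, 10.0)]
    pred = [(1.0, 0.0), (10.0, 10.0)]
    # Hypothetical area and per-keypoint constants.
    print(object_keypoint_similarity(pred, gt, [2, 2], 100.0, [0.1, 0.1]))
```

OKS plays the role that IoU plays in object detection: a prediction counts as correct when its OKS exceeds a chosen threshold, and AP and AR are then obtained from those matches, typically averaged over a range of thresholds.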

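Performance/Effects

Fig. 2 shows that accuracy is much higher and loss much lower than for the base model early in training, and that at the end of training, at epoch 210, the AP score is still 5.5 percentage points higher. Below, Fig.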
ViTPose, which is based on ViT, achieved the best performance on MS COCO, OCHuman, and MPII, the datasets that currently serve as the standard for human pose estimation. ViTPose uses the ViT architecture and performed well thanks to initial weights pretrained with Masked Image Modeling. At the same time, because the Transformer lacks inductive bias, training it takes large datasets and many parameters, and therefore a lot of resources.
The ViT-H model used about 632 million parameters, far more than the roughly 23 million parameters of ResNet-50.
This study proposes a model that estimates human pose using MobileViT, which combines the strengths of convolutional neural networks and ViT. In the experiments it is a lighter model than ViTPose, with less computation and fewer parameters, and it achieved a 69.4% AP score on the MS COCO validation dataset.
Table 2 shows the performance of the comparison models and the proposed model. The CNN-based comparison models are SimpleBaseline (ResNet-50), which builds a simple structure by joining a ResNet backbone to a deconvolution head network; HRNet (W32), which processes multiple image resolutions in parallel; and VGG, which forms a deep network by repeating convolution and pooling layers. Compared with these, the proposed MobileViTPose(S) scores 1, 5, and 0.4 percentage points lower in AP, respectively, but has 73%, 68%, and 52% fewer parameters and 66%, 62%, and 81% fewer FLOPs. Compared with ViTPose (ViT-B), which uses a ViT network as the backbone for feature learning and adds a deconvolution head network for pose estimation, MobileViTPose(S) scores 6.4 percentage points lower in AP but has 90% fewer parameters and 83% fewer FLOPs. In addition, the model-scale experiment shows that performance improves as the model grows from MobileViTPose-XXS to MobileViTPose-S, which indicates that MobileViTPose scales well and flexibly.
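The relative savings quoted against each comparison model can be turned back into approximate absolute sizes. The sketch below does this arithmetic under one assumption: that the 9.72 million parameters and 3.28 GFLOPs given in the abstract are the MobileViTPose(S) figures. The helper names are hypothetical, and because the reported percentages are rounded, the implied baseline values are only approximate and are not figures taken from the paper.

```python
PROPOSED_PARAMS_M = 9.72  # million parameters, from the abstract
PROPOSED_GFLOPS = 3.28  # from the abstract

# Fraction by which MobileViTPose(S) is smaller, as reported in the text.
REPORTED_REDUCTIONS = {
    "SimpleBaseline(ResNet-50)": {"params": 0.73, "flops": 0.66},
    "HRNet(W32)": {"params": 0.68, "flops": 0.62},
    "VGG": {"params": 0.52, "flops": 0.81},
    "ViTPose(ViT-B)": {"params": 0.90, "flops": 0.83},
}


def implied_baseline(proposed: float, reduction: float) -> float:
    """Baseline size implied by 'proposed is `reduction` smaller than baseline'."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return proposed / (1.0 - reduction)


def percent_fewer(proposed: float, baseline: float) -> float:
    """Percentage by which `proposed` is smaller than `baseline`."""
    return 100.0 * (1.0 - proposed / baseline)


if __name__ == "__main__":
    for model, cut in REPORTED_REDUCTIONS.items():
        params = implied_baseline(PROPOSED_PARAMS_M, cut["params"])
        flops = implied_baseline(PROPOSED_GFLOPS, cut["flops"])
        print(f"{model}: ~{params:.1f}M params, ~{flops:.1f} GFLOPs")
```

Read this way, a 90% reduction corresponds to a baseline roughly ten times larger, which is in the same range as the "1/9" ratio stated in the abstract once rounding is taken into account.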

