Recently, with the development of computer vision and deep learning technology, research on human action recognition has been actively conducted for video analysis, video surveillance, interactive multimedia, and human-machine interaction applications. Diverse techniques have been introduced by many researchers for human action understanding and classification using RGB images, depth images, skeleton data, and inertial data. However, skeleton-based action recognition is still a challenging research topic for human-machine interaction. In this paper, we propose an end-to-end mapping of the skeleton joints of an action for generating a spatio-temporal image, the so-called dynamic image. Then, an efficient deep convolutional neural network is devised to perform classification among the action classes. We use the publicly accessible UTD-MHAD skeleton dataset to evaluate the performance of the proposed method. The experimental results show that the proposed system outperforms the existing methods with a high accuracy of 97.45%.
Problem Definition
Digital devices such as computers, smartphones, and cameras are now an essential part of our daily life. The main motive of our research is to provide easy and comfortable methods for interacting with those machines. As research has advanced, the forms of interaction with those devices have also evolved.
Proposed Method
We introduce a novel method for skeleton-based action recognition using a deep convolutional neural network. First, the skeleton joints are mapped along the temporal direction, and the resulting images are then discriminated by the DCNN to decide the final class. We perform experiments on three different views along the XY, YZ, and ZX axes and then fuse all of them to generate the final results.
In this paper, we design an algorithm for discriminating various types of human actions performed by different parts of the human body. Initially, we generate dynamic images for all actions by mapping between the joint information of neighboring frames.
In this paper, we introduce a new method for action recognition using the skeleton joint mapping information of a 3-dimensional coordinate system. First, we convert all the joints along the XY, YZ, and ZX axes of every frame in a video into a single image by drawing a line between corresponding joints in neighboring frames, as sketched below.
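To make the mapping concrete, here is a minimal sketch, assuming a skeleton sequence stored as a (T, 20, 3) array (frames x joints x XYZ coordinates). The 256x256 canvas size and the hue-based time coloring are illustrative assumptions, not details taken from the paper; the core step is connecting each joint's positions in neighboring frames on one projection plane.

import numpy as np
import cv2

def dynamic_image(skeleton, plane=(0, 1), size=256):
    """Project a (T, 20, 3) joint sequence onto one plane (XY=(0,1),
    YZ=(1,2), ZX=(2,0)) and connect corresponding joints of neighboring
    frames with lines to form one spatio-temporal image."""
    pts = skeleton[:, :, list(plane)]                 # (T, 20, 2) projection
    lo, hi = pts.min(axis=(0, 1)), pts.max(axis=(0, 1))
    pts = (pts - lo) / (hi - lo + 1e-8) * (size - 20) + 10  # fit to canvas
    pts = pts.astype(np.int32)
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    T = pts.shape[0]
    for t in range(T - 1):
        # Hue encodes time, so the 2D image preserves temporal order.
        hsv = np.uint8([[[int(179 * t / max(T - 1, 1)), 255, 255]]])
        color = tuple(int(c) for c in cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)[0, 0])
        for j in range(pts.shape[1]):
            cv2.line(canvas, tuple(map(int, pts[t, j])),
                     tuple(map(int, pts[t + 1, j])), color, 1)
    return canvas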
The spatio-temporal images along the three different views are fed into the network to extract meaningful features, which are then fused to improve the classification rate. A modified version of AlexNet is introduced for classification among the 27 action classes.
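Since this excerpt does not spell out which AlexNet layers were modified, the following is a hedged PyTorch sketch of an AlexNet-style network with a 27-way output head; the channel widths and fully connected sizes are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class ModifiedAlexNet(nn.Module):
    def __init__(self, num_classes=27):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(256 * 6 * 6, 1024), nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):  # x: (N, 3, 227, 227) batch of dynamic images
        return self.classifier(self.features(x))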
Dollar et al. [4] illustrated an efficient method using a temporal Gabor filter and a spatial Gaussian filter for detecting spatio-temporal interest points (STIPs). Subsequently, other STIP detectors and descriptors were proposed to improve the results. Wu et al. [5] exploited human poses through silhouette analysis for action recognition.
We evaluate the method through three different experiments along the XY, YZ, and ZX axes and then fuse all of them by concatenating the three views to get better results. The experimental results are reported in terms of accuracy, given by equation (7):
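The body of equation (7) did not survive extraction; assuming it refers to the standard classification-accuracy definition, it presumably reads:

\begin{equation}
  \mathrm{Accuracy} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}} \times 100\% \tag{7}
\end{equation}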
The proposed method outperforms the existing systems for the side, front, top, and fused results.
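Whether the fusion concatenates features or per-view class scores is not specified in this excerpt; as one plausible reading, the sketch below concatenates the three 27-dimensional score vectors (XY, YZ, ZX) and maps them back to 27 classes with a small linear head.

import torch
import torch.nn as nn

class ScoreFusion(nn.Module):
    """Fuse per-view class scores by concatenation (illustrative)."""
    def __init__(self, num_views=3, num_classes=27):
        super().__init__()
        self.head = nn.Linear(num_views * num_classes, num_classes)

    def forward(self, scores_xy, scores_yz, scores_zx):
        fused = torch.cat([scores_xy, scores_yz, scores_zx], dim=1)  # (N, 81)
        return self.head(fused)                                      # (N, 27)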
Target Data
The dataset is captured using a Microsoft Kinect sensor in an indoor environment. It contains 27 different classes of skeleton data, in which each frame has 20 joints along the X, Y, and Z axes. The 27 classes of actions are performed by 8 different subjects, 4 female and 4 male. Each class has 32 videos, except three that are corrupted, making a total of 861 skeleton videos. Most of the actions are performed by the hands, such as swipe left, swipe right, wave, clap, throw, arm cross, basketball shoot, draw x, draw circle, draw triangle, bowling, boxing, baseball swing, tennis swing, arm curl, tennis serve, push, knock, catch, and pickup and throw. Some of the actions are performed by the legs, such as jogging, walking, and lunging.
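For readers reproducing the setup, the UTD-MHAD skeleton files are distributed as MATLAB .mat files; in the public release each file holds a variable named d_skel with a (20 joints x 3 coordinates x T frames) layout, but verify this against your own download, as the loading sketch below assumes it.

import numpy as np
from scipy.io import loadmat

def load_skeleton(path):
    """Return one skeleton sequence as a (T, 20, 3) float array."""
    mat = loadmat(path)
    skel = mat["d_skel"]                 # (20, 3, T) in the released files
    return np.transpose(skel, (2, 0, 1)).astype(np.float32)

# File names follow the pattern a<action>_s<subject>_t<trial>_skeleton.mat
seq = load_skeleton("a1_s1_t1_skeleton.mat")
print(seq.shape)                         # e.g. (T, 20, 3)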
Performance/Effects
By observing the results in Table 2, it is clear that we obtain 95.76% classification accuracy for mapping joints along the XY axes, which is higher than the results for the YZ and ZX axes (92.48% and 94.10%, respectively). The concatenation of the XY, YZ, and ZX axes has a strong tendency to classify the actions correctly, yielding about 97.45% accuracy.
References (22)
Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in IEEE Conference on Computer Vision and Pattern Recognition, Boston: MA, pp. 1110-1118, 2015.
X. Yang and Y. Tian, "Super normal vector for activity recognition using depth sequences," in IEEE Conference on Computer Vision and Pattern Recognition, Columbus: OH, pp. 804-811, 2014.
V. S. Kulkarni, and S. D. Lokhande, "Appearance based recognition of American Sign Language using gesture segmentation," International Journal on Computer Science and Engineering, No. 3, pp. 560-565, 2010.
P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Breckenridge: CO, pp. 65-72, 2005.
D. Wu, and L. Shao, “Silhouette analysis-based action recognition via exploiting human poses,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 23, No. 2, pp. 236-243, 2012.
M. Ahmad, and S. W. Lee, "HMM-based human action recognition using multiview image sequences," in 18th International Conference on Pattern Recognition (ICPR'06), Hong Kong, pp. 263-266, 2006.
L. Xia, C. C. Chen, and J. K. Aggarwal, "View invariant human action recognition using histograms of 3d joints," in 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence: RI, pp. 20-27, 2012.
J. Luo, W. Wang, and H. Qi, "Spatio-temporal feature extraction and representation for RGB-D human action recognition," Pattern Recognition Letters, Vol. 50, pp. 139-148, 2014.
V. Megavannan, B. Agarwal, and R. V. Babu, "Human action recognition using depth maps," in 2012 International Conference on Signal Processing and Communications (SPCOM), Piscataway: NJ, pp. 1-5, 2012.
J. Trelinski, and B. Kwolek, "Convolutional neural network-based action recognition on depth maps," in International Conference on Computer Vision and Graphics, Warsaw: Poland, pp. 209-221, 2018.
P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P.O. Ogunbona, “Action recognition from depth maps using deep convolutional neural networks,” IEEE Transactions on Human-Machine Systems, Vol. 46, No. 4, pp. 498-509, 2015.
K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, Montreal: Canada, pp. 568-576, 2014.
C. Li, Q. Zhong, D. Xie, and S. Pu, "Skeleton-based action recognition with convolutional neural networks," in 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, pp. 597-600, 2017.
M. E. Hussein, M. Torki, M. A. Gowayyed, and M. El-Saban, "Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations," in Twenty-Third International Joint Conference on Artificial Intelligence, Beijing: China, pp. 2466-2472, 2013.
Y. Du, Y. Fu, and L. Wang, "Skeleton based action recognition with convolutional neural network," in 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur: Malaysia, pp. 579-583, 2015.
P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proceedings of the 24th ACM International Conference on ACM Multimedia, Amsterdam: Netherlands, pp. 102-106, 2016.
Y. Hou, Z. Li, P. Wang, and W. Li, “Skeleton optical spectra-based action recognition using convolutional neural networks,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 28, No. 3, pp. 807-811, 2016.
C. Li, Y. Hou, P. Wang, and W. Li, “Joint distance maps-based action recognition with convolutional neural networks,” IEEE Signal Processing Letters, Vol. 24, No. 5, pp. 624-628, 2017.
J. Imran, and B. Raman, "Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition," Journal of Ambient Intelligence and Humanized Computing, pp. 1-20, 2019.
UTD-MHAD skeleton dataset, University of Texas at Dallas, [Internet]. Available: https://personal.utdallas.edu/~kehtar/UTD-MHAD.html
C. Shorten, and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," Journal of Big Data, Vol. 6, No. 1, Article 60, 2019.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems, Lake Tahoe: NV, pp. 1097-1105, 2012.