Recently, with the development of computer vision and deep learning technology, research on human action recognition has been actively conducted for video analysis, video surveillance, interactive multimedia, and human-machine interaction applications. Diverse techniques have been introduced by many researchers for human action understanding and classification using RGB images, depth images, skeleton data, and inertial data. However, skeleton-based action recognition is still a challenging research topic for human-machine interaction. In this paper, we propose an end-to-end mapping of the skeleton joints of an action for generating a spatio-temporal image, the so-called dynamic image. Then, an efficient deep convolutional neural network is devised to perform classification among the action classes. We use the publicly accessible UTD-MHAD skeleton dataset to evaluate the performance of the proposed method. The experimental results show that the proposed system outperforms the existing methods with a high accuracy of 97.45%.
Problem Definition
Digital devices such as computers, smartphones, and cameras are now an essential part of our daily life. The main motive of our research is to provide easy and comfortable methods for interacting with those machines. As research has advanced, the forms of interaction with those devices have also evolved.
Proposed Method
We introduce a novel method for skeleton-based action recognition using a deep convolutional neural network. First, the skeleton joints are mapped along the temporal direction, and the resulting images are then discriminated by the DCNN to decide the final class. We perform experiments on three different views along the XY, YZ, and ZX axes and then fuse all of them to generate the final results.
In this paper, we design an algorithm for discriminating various types of human actions performed by different parts of the human body. Initially, we generate dynamic images for all actions by mapping between the joint information of neighboring frames.
In this paper, we introduce a new method for action recognition using the skeleton joint mapping information of a 3-dimensional coordinate system. First, we convert all the joints along the XY, YZ, and ZX axes of every frame in a video into a single image by drawing a line between corresponding joints in neighboring frames, as sketched below.
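To make the mapping concrete, here is a minimal sketch, assuming a skeleton sequence stored as a (T, 20, 3) array (frames x joints x XYZ coordinates). The 256x256 canvas size and the hue-based time coloring are illustrative assumptions, not details taken from the paper; the core step is connecting each joint's positions in neighboring frames on one projection plane.

import numpy as np
import cv2

def dynamic_image(skeleton, plane=(0, 1), size=256):
    """Project a (T, 20, 3) joint sequence onto one plane (XY=(0,1),
    YZ=(1,2), ZX=(2,0)) and connect corresponding joints of neighboring
    frames with lines to form one spatio-temporal image."""
    pts = skeleton[:, :, list(plane)]                 # (T, 20, 2) projection
    lo, hi = pts.min(axis=(0, 1)), pts.max(axis=(0, 1))
    pts = (pts - lo) / (hi - lo + 1e-8) * (size - 20) + 10  # fit to canvas
    pts = pts.astype(np.int32)
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    T = pts.shape[0]
    for t in range(T - 1):
        # Hue encodes time, so the 2D image preserves temporal order.
        hsv = np.uint8([[[int(179 * t / max(T - 1, 1)), 255, 255]]])
        color = tuple(int(c) for c in cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)[0, 0])
        for j in range(pts.shape[1]):
            cv2.line(canvas, tuple(map(int, pts[t, j])),
                     tuple(map(int, pts[t + 1, j])), color, 1)
    return canvas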
The spatio-temporal images along the three different views are fed into the network to extract meaningful features, which are then fused to improve the classification rate. A modified version of AlexNet is introduced for classification among the 27 action classes.
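Since this excerpt does not spell out which AlexNet layers were modified, the following is a hedged PyTorch sketch of an AlexNet-style network with a 27-way output head; the channel widths and fully connected sizes are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class ModifiedAlexNet(nn.Module):
    def __init__(self, num_classes=27):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(256 * 6 * 6, 1024), nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):  # x: (N, 3, 227, 227) batch of dynamic images
        return self.classifier(self.features(x))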
Dollar et al. [4] illustrated an efficient method using a temporal Gabor filter and a spatial Gaussian filter for detecting spatio-temporal interest points (STIPs). Subsequently, other STIP detectors and descriptors were proposed to improve the results. Wu et al. [5] exploited human poses through silhouette analysis for action recognition.
We evaluate the method through three different experiments along the XY, YZ, and ZX axes and then fuse all of them by concatenating the three views to get better results. The experimental results are reported in terms of accuracy, given by equation (7):
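The body of equation (7) did not survive extraction; assuming it refers to the standard classification-accuracy definition, it presumably reads:

\begin{equation}
  \mathrm{Accuracy} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}} \times 100\% \tag{7}
\end{equation}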
The proposed method outperforms the existing systems for the side, front, top, and fused results.
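Whether the fusion concatenates features or per-view class scores is not specified in this excerpt; as one plausible reading, the sketch below concatenates the three 27-dimensional score vectors (XY, YZ, ZX) and maps them back to 27 classes with a small linear head.

import torch
import torch.nn as nn

class ScoreFusion(nn.Module):
    """Fuse per-view class scores by concatenation (illustrative)."""
    def __init__(self, num_views=3, num_classes=27):
        super().__init__()
        self.head = nn.Linear(num_views * num_classes, num_classes)

    def forward(self, scores_xy, scores_yz, scores_zx):
        fused = torch.cat([scores_xy, scores_yz, scores_zx], dim=1)  # (N, 81)
        return self.head(fused)                                      # (N, 27)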
Target Data
The dataset is captured using a Microsoft Kinect sensor in an indoor environment. It contains 27 different classes of skeleton data, in which each frame has 20 joints along the X, Y, and Z axes. The 27 classes of actions are performed by 8 different subjects, 4 female and 4 male. Each class has 32 videos, except three that are corrupted, making a total of 861 skeleton videos. Most of the actions are performed by the hands, such as swipe left, swipe right, wave, clap, throw, arm cross, basketball shoot, draw x, draw circle, draw triangle, bowling, boxing, baseball swing, tennis swing, arm curl, tennis serve, push, knock, catch, and pickup and throw. Some of the actions are performed by the legs, such as jogging, walking, and lunging.
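For readers reproducing the setup, the UTD-MHAD skeleton files are distributed as MATLAB .mat files; in the public release each file holds a variable named d_skel with a (20 joints x 3 coordinates x T frames) layout, but verify this against your own download, as the loading sketch below assumes it.

import numpy as np
from scipy.io import loadmat

def load_skeleton(path):
    """Return one skeleton sequence as a (T, 20, 3) float array."""
    mat = loadmat(path)
    skel = mat["d_skel"]                 # (20, 3, T) in the released files
    return np.transpose(skel, (2, 0, 1)).astype(np.float32)

# File names follow the pattern a<action>_s<subject>_t<trial>_skeleton.mat
seq = load_skeleton("a1_s1_t1_skeleton.mat")
print(seq.shape)                         # e.g. (T, 20, 3)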
Performance/Effects
By observing the results in Table 2, it is clear that we obtain 95.76% classification accuracy for mapping joints along the XY axes, which is higher than the results for the YZ and ZX axes (92.48% and 94.10%, respectively). The concatenation of the XY, YZ, and ZX axes has a strong tendency to classify the actions correctly, yielding about 97.45% accuracy.
References (22)
Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in IEEE Conference on Computer Vision and Pattern Recognition, Boston: MA, pp. 1110-1118, 2015.
X. Yang and Y. Tian, "Super normal vector for activity recognition using depth sequences," in IEEE Conference on Computer Vision and Pattern Recognition, Columbus: OH, pp. 804-811, 2014.
V. S. Kulkarni, and S. D. Lokhande, "Appearance based recognition of American Sign Language using gesture segmentation," International Journal on Computer Science and Engineering, No. 3, pp. 560-565, 2010.
P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Breckenridge: CO, pp. 65-72, 2005.
D. Wu, and L. Shao, “Silhouette analysis-based action recognition via exploiting human poses,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 23, No. 2, pp. 236-243, 2012.
M. Ahmad, and S. W. Lee, "HMM-based human action recognition using multiview image sequences," in 18th International Conference on Pattern Recognition (ICPR'06), Hong Kong, pp. 263-266, 2006.
L. Xia, C. C. Chen, and J. K. Aggarwal, "View invariant human action recognition using histograms of 3d joints," in 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence: RI, pp. 20-27, 2012.
J. Luo, W. Wang, and H. Qi, "Spatio-temporal feature extraction and representation for RGB-D human action recognition," Pattern Recognition Letters, Vol. 50, pp. 139-148, 2014.
V. Megavannan, B. Agarwal, and R. V. Babu, "Human action recognition using depth maps," in 2012 International Conference on Signal Processing and Communications (SPCOM), Piscataway: NJ, pp. 1-5, 2012.
J. Trelinski, and B. Kwolek, "Convolutional neural network-based action recognition on depth maps," in International Conference on Computer Vision and Graphics, Warsaw: Poland, pp. 209-221, 2018.
P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P.O. Ogunbona, “Action recognition from depth maps using deep convolutional neural networks,” IEEE Transactions on Human-Machine Systems, Vol. 46, No. 4, pp. 498-509, 2015.
K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, Montreal: Canada, pp. 568-576, 2014.
C. Li, Q. Zhong, D. Xie, and S. Pu, "Skeleton-based action recognition with convolutional neural networks," in 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, pp. 597-600, 2017.
M. E. Hussein, M. Torki, M. A. Gowayyed, and M. El-Saban, "Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations," in Twenty-Third International Joint Conference on Artificial Intelligence, Beijing: China, pp. 2466-2472, 2013.
Y. Du, Y. Fu, and L. Wang, "Skeleton based action recognition with convolutional neural network," in 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur: Malaysia, pp. 579-583, 2015.
P. Wang, Z. Li, Y. Hou, and W. Li, "Action recognition based on joint trajectory maps using convolutional neural networks," in Proceedings of the 24th ACM International Conference on ACM Multimedia, Amsterdam: Netherlands, pp. 102-106, 2016.
Y. Hou, Z. Li, P. Wang, and W. Li, “Skeleton optical spectra-based action recognition using convolutional neural networks,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 28, No. 3, pp. 807-811, 2016.
C. Li, Y. Hou, P. Wang, and W. Li, “Joint distance maps-based action recognition with convolutional neural networks,” IEEE Signal Processing Letters, Vol. 24, No. 5, pp. 624-628, 2017.
J. Imran, and B. Raman, "Evaluating fusion of RGB-D and inertial sensors for multimodal human action recognition," Journal of Ambient Intelligence and Humanized Computing, pp. 1-20, 2019.
UTD-MHAD skeleton dataset, University of Texas at Dallas, [Internet]. Available: https://personal.utdallas.edu/~kehtar/UTD-MHAD.html
C. Shorten, and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," Journal of Big Data, Vol. 6, No. 1, Article 60, 2019.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems, Lake Tahoe: NV, pp. 1097-1105, 2012.