[Paper] Three-Dimensional Convolutional Vision Transformer for Sign Language Translation

성호렬; 조현중

doi:10.3745/tkips.2024.13.3.140

The Transactions of the Korea Information Processing Society, v.13 no.3, 2024, pp.140-147

성호렬 (Department of Computer and Information Science, Korea University), 조현중 (Department of Computer Convergence Software, Korea University)

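Abstract (translated from the Korean)

In Korea, people with hearing impairments are the second-largest group of registered persons with disabilities, after those with physical disabilities. However, sign language machine translation has developed slowly because its market growth potential is small and rigorously annotated datasets are scarce. Meanwhile, many transformer-based models have recently been proposed in computer vision and pattern recognition, and they have shown high performance in areas such as action recognition and video classification. Accordingly, attempts to improve performance by introducing transformers are also being made in sign language machine translation. This paper proposes 3D-CvT, which fuses a transformer with a 3D-CNN, as the recognition component for sign language translation. Experiments on PHOENIX-Weather-2014T [1] further show empirically that the proposed model is efficient, achieving translation performance similar to that of existing models with less computation.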

Abstract

In the Republic of Korea, people with hearing impairments are the second-largest group within the registered disability community, after those with physical disabilities. Despite this demographic significance, research on sign language translation technology remains limited for several reasons, including the small market size and the lack of adequately annotated datasets. Nevertheless, researchers continue to improve the performance of sign language translation by adopting recent advances in deep learning such as the transformer architecture, since transformer-based models have demonstrated noteworthy performance in tasks such as action recognition and video classification. This study focuses on the recognition component of sign language translation and proposes 3D-CvT, which combines a transformer with a 3D-CNN. Through experiments on the PHOENIX-Weather-2014T dataset [1], we show that the proposed model achieves translation performance comparable to existing models while requiring fewer floating-point operations (FLOPs).

Tables/Figures (8)

Fig. 1. Translation Performance (BLEU) over Computation (GFLOPs) for MMTL (SoTA) and 3D-CvT (Ours)
Fig. 2. The Common Structure of Sign Language Translation Models
Fig. 3. The Structure of 3D-CvT
Fig. 4. Sample Frames of PHOENIX2014T (left) and Kinetics400 (right).
Table 1. The Detailed Structure of 3D-CvT. Input: T×3×224×224 (T: The Number of Frames), Conv. Embed.: Convolution Embedding, Conv. Proj.: Convolution Projection, MHSA: Multi-Head Self-Attention, H: The Number of Heads, D: Feature Dimension
Fig. 5. The Convolution Projection of 3D-CvT
Table 2. Performance Comparison on PHOENIX2014T. The Performance of 3D-CvT is Comparable to the SoTA Models, Two-Stream (RGB and Landmark Inputs) and MMTL (RGB Inputs). The Best Performance is Marked in Bold and the Second-Best Performance is Underlined for Each Column.
Table 3. FLOPs Comparison with the SoTA Models. 3D-CvT Shows Lower FLOPs than MMTL.

References (21)

[1] N. C. Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden, "Neural sign language translation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[2] N. C. Camgoz, O. Koller, S. Hadfield, and R. Bowden, "Sign language transformers: Joint end-to-end sign language recognition and translation," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[3] K. Yin and J. Read, "Better sign language translation with STMC-transformer," arXiv preprint arXiv:2004.00588, 2020.
[4] H. Zhou, W. Zhou, W. Qi, J. Pu, and H. Li, "Improving sign language translation with monolingual data by sign back-translation," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[5] Y. Chen, F. Wei, X. Sun, Z. Wu, and S. Lin, "A simple multi-modality transfer learning baseline for sign language translation," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[6] Y. Chen, R. Zuo, F. Wei, Y. Wu, S. Liu, and B. Mak, "Two-stream network for sign language recognition and translation," Advances in Neural Information Processing Systems, Vol.35, pp.17043-17056, 2022.
[7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[8] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan et al., "Cvt: Introducing convolutions to vision transformers," Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[9] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," Proceedings of the 23rd International Conference on Machine Learning, 2006.
[10] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification," Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[11] W. Kay et al., "The kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[12] D. Li, C. R. Opazo, X. Yu, and H. Li, "Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison," Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020.
[13] Y. Liu et al., "Multilingual denoising pre-training for neural machine translation," Transactions of the Association for Computational Linguistics, Vol.8, pp.726-742, 2020.
[14] K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, "Bleu: a method for automatic evaluation of machine translation," Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
[15] Y. Wang et al., "Internvideo: General video foundation models via generative and discriminative learning," arXiv preprint arXiv:2212.03191, 2022.
[16] A. J. Piergiovanni, W. Kuo, and A. Angelova, "Rethinking video vits: Sparse video tubes for joint image and video learning," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[17] G. Bertasius, H. Wang, and L. Torresani, "Is space-time attention all you need for video understanding?," Proceedings of the International Conference on Machine Learning (ICML), 2021.
[18] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lucic, and C. Schmid, "Vivit: A video vision transformer," Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
[19] M. Lewis et al., "Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," arXiv preprint arXiv:1910.13461, 2019.
[20] J. Guo et al., "Cmt: Convolutional neural networks meet vision transformers," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[21] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009.
