[논문]KMSAV: Korean multi-speaker spontaneous audiovisual dataset

Kiyoung Park; Changhan Oh; Sunghee Dong

doi:10.4218/etrij.2023-0352

KMSAV: Korean multi-speaker spontaneous audiovisual dataset 원문보기

ETRI journal, v.46 no.1, 2024년, pp.71 - 81

Kiyoung Park (Superintelligence Creative Research Laboratory, Electronics and Telecommunications Research Institute) , Changhan Oh (Superintelligence Creative Research Laboratory, Electronics and Telecommunications Research Institute) , Sunghee Dong (Superintelligence Creative Research Laboratory, Electronics and Telecommunications Research Institute)

Abstract ▼ AI-Helper

Recent advances in deep learning for speech and visual recognition have accelerated the development of multimodal speech recognition, yielding many innovative results. We introduce a Korean audiovisual speech recognition corpus. This dataset comprises approximately 150 h of manually transcribed and annotated audiovisual data supplemented with additional 2000 h of untranscribed videos collected from YouTube under the Creative Commons License. The dataset is intended to be freely accessible for unrestricted research purposes. Along with the corpus, we propose an open-source framework for automatic speech recognition (ASR) and audiovisual speech recognition (AVSR). We validate the effectiveness of the corpus with evaluations using state-of-the-art ASR and AVSR techniques, capitalizing on both pretrained models and fine-tuning processes. After fine-tuning, ASR and AVSR achieve character error rates of 11.1% and 18.9%, respectively. This error difference highlights the need for improvement in AVSR techniques. We expect that our corpus will be an instrumental resource to support improvements in AVSR.

주제어

참고문헌 (30)

T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A.？Zisserman, Deep audio-visual speech recognition, IEEE Trans.？Pattern Anal. Machine Intellig. 44 (2022), no. 12, 8717-8727.
S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and？M. Pantic, End-to-end audiovisual speech recognition, (IEEE？Int. Conf. Acoust. Speech Signal Process. (ICASSP), Calgary,？Canada), 2018, pp. 6548-6552.
J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, An overview of？noise-robust automatic speech recognition, IEEE/ACM Trans.？Audio Speech Lang. Process. 22 (2014), no. 4, 745-777.
J. Chung, A. Senior, O. Vinyals, and A. Zisserman, Lip reading？sentences in the wild, (IEEE Conf. Comput. Vision Pattern？Recognit. (CVPR), Honolulu, HI, USA), 2017, pp. 3444-3453.
P. Ma, A. Haliassos, A. Fernandez-Lopez, H. Chen, S. Petridis,？and M. Pantic, Auto-AVSR: audio-visual speech recognition？with automatic labels, (IEEE Int. Conf. Acoust. Speech Signal？Process. (ICASSP), Rhodes Island, Greece), 2023, pp. 1-5.
B. Shi, W.-N. Hsu, and A. Mohamed, Robust self-supervised？audio-visual speech recognition, arXiv preprint, 2022, DOI 10.48550/arXiv.2201.01763
I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R.？Harvey, Extraction of visual features for lipreading, IEEE Trans.？Pattern Anal. Machine Intell. 24 (2002), no. 2, 198-213.
E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy,？CUAVE: a new audio-visual database for multimodal human-computer interface research, (IEEE Int. Conf. Acoust. Speech？Signal Process., Orlando, FL, USA), 2002, DOI 10.1109/ICASSP.2002.5745028.
I. Anina, Z. Zhou, G. Zhao, and M. Pietikainen, OuluVS2: a？multi-view audiovisual database for non-rigid mouth motion？analysis, (11th IEEE Int. Conf. Workshops Autom. Face Gesture Recognit. (FG), Ljubljana, Slovenia), 2015, DOI 10.1109/FG.2015.7163155
T. J. Hazen, K. Saenko, C.-H. La, and J. R. Glass, A segment-based audio-visual speech recognizer: data collection, development, and initial experiments, (Proc. 6th Int. Conf. Multimodal？Interfaces, ICMI '04, Association for Computing Machinery,？New York, NY, USA), 2004, pp. 235-242.
J. Park, J.-W. Hwang, K. Choi, S.-H. Lee, J. H. Ahn, R.-H.？Park, and H.-M. Park, OLKAVS: an open large-scale Korean？audio-visual speech dataset, arXiv preprint, 2023, DOI 10.48550/arXiv.2301.06375.
A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, wav2vec 2.0: a？framework for self-supervised learning of speech representations,？(34th Conference Neural Information Processing Systems,？Vancouver, Canada), 2020, pp. 12449-12460.
Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency,？and R. Salakhutdinov, Multimodal transformer for unaligned？multimodal language sequences, (Proc. 57th Annu. Meet.？Assoc. Comput. Ling., Florence, Italy), 2019, pp. 6558-6569.
B. Shi, W.-N. Hsu, K. Lakhotia, and A. Mohamed, Learning？audio-visual speech representation by masked multimodal cluster prediction, arXiv preprint, 2022, DOI 10.48550/arXiv.2201.02184
T. Afouras, J. Son Chung, and A. Zisserman, LRS3-TED: a？large-scale dataset for visual speech recognition, arXiv preprint,？2018, DOI 10.48550/arXiv.1809.00496
A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A.？Hassidim, W. Freeman, and M. Rubinstein, Looking to listen at？the cocktail party: a speaker-independent audio-visual model for？speech separation, ACM Trans. Graph. 37 (2018), no. 4, 1-11.

상세보기
J. S. Chung, A. Nagrani, and A. Zisserman, VoxCeleb2: deep？speaker recognition, (Proc. INTERSPEECH, Hyderabad, India),？2018, pp. 1086-1090. DOI 10.21437/Interspeech.2018-1929
T. Likhomanenko, Q. Xu, V. Pratap, P. Tomasello, J. Kahn, G.？Avidov, R. Collobert, and G. Synnaeve, Rethinking evaluation？in ASR: are our models robust enough? (INTERSPEECH, Brno,？Czechia), 2021, pp. 311-315.
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and？I. Sutskever, Robust speech recognition via large-scale weak？supervision, (Int. Conf. Mach. Learn., Honolulu, HI, USA),？2023, pp. 28492-28518.
Y. Zhang, D. Park, W. Han, J. Qin, A. Gulati, J. Shor, A.？Jansen, Y. Xu, Y. Huang, S. Wang, Z. Zhou, B. Li, M. Ma, W.？Chan, J. Yu, Y. Wang, L. Cao, K. Sim, B. Ramabhadran, and？Y. Wu, BigSSL: exploring the frontier of large-scale semi-supervised learning for automatic speech recognition, IEEE？J. Sel. Top. Signal Process. 16 (2022), 1-14.
M. Cooke, J. Barker, S. P. Cunningham, and X. Shao, An？audio-visual corpus for speech perception and automatic speech？recognition, J. Acoust. Soc. Am. 120 (2006), no. 5, 2421-2424,？DOI 10.1121/1.2229005

상세보기
B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S.？Borys, M. Liu, and T. Huang, AVICAR: audio-visual speech？corpus in a car environment, (Proc. INTERSPEECH, Jeju, Rep.？of Korea), 2004, pp. 2489-2492.
G. Zhao, M. Barnard, and M. Pietikainen, Lipreading with local？spatiotemporal descriptors, IEEE Trans. Multimed. 11 (2009),？no. 7, 1254-1265.
J. S. Chung and A. Zisserman, Lip reading in the wild, (Proc.？Asian Conf. Comput. Vision, Taipei, Taiwan), 2016,？pp. 87-103.
A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, Voxceleb:？large-scale speaker verification in the wild, Comput. Speech？Lang. 60 (2020), 101027, DOI 10.1016/j.csl.2019.101027

상세보기
W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y.？Fu, and A. C. Berg, SSD: single shot multibox detector, Computer Vision-ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, (eds.), Lecture Notes in Computer Science, Vol. 9905,？Springer, Cham, 2016, pp. 21-37.
S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y.？Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N.？Chen, A. Renduchintala, and T. Ochiai, ESPnet: end-to-end？speech processing toolkit, (Proc. INTERSPEECH, Hyderabad,？India), 2018, pp. 2207-2211.
R. Tao, Z. Pan, R. K. Das, X. Qian, M. Z. Shou, and H. Li, Is？someone speaking? Exploring long-term temporal features for？audio-visual active speaker detection, (Proc. 29th ACM Int.？Conf. Multimedia, Association for Computing Machinery,？New York, NY, USA), 2021, pp. 3927-3935.
D. E. King, Dlib-ml: a machine learning toolkit, J. Mach. Learn.？Res. 10 (2009), 1755-1758.
D. Snyder, G. Chen, and D. Povey, MUSAN: a music, speech,？and noise corpus, arXiv preprint, 2015, DOI 10.48550/arXiv.1510.08484

내보내기 구분	파일저장 인쇄 메일전송
구성항목	기본정보 상세정보 관리번호, 논문명, 저널/프로시딩명, 저자 , 발행년, 권, 호, 시작페이지, 끝페이지, 발행기관 관리번호, 논문명, 대등논문명, 저자 , 저널/프로시딩명, 발행기관, 발행년, 발행언어, 권, 호, 시작페이지, 끝페이지, ISBN, ISSN, 주제분야, 키워드, 초록(한글), 초록(영문), 저자(소속기관)
저장형식	Text(ASCII format) Excel format RefWorks Direct Export RIS format (for Reference Manager, ProCite, EndNote), Scholar's Aids, Mendeley
메일정보	받는사람 (필수) @ 보내는사람 (선택) @ 제목 내용 KISTI 검색결과 이메일 서비스
안내	총 건의 자료가 검색되었습니다. 다운받으실 자료의 인덱스를 입력하세요. (1-10,000) 검색결과의 순서대로 최대 10,000건 까지 다운로드가 가능합니다. 데이타가 많을 경우 속도가 느려질 수 있습니다.(최대 2~3분 소요) 다운로드 파일은 UTF-8 형태로 저장됩니다. 파일의 내용이 제대로 보이지 않을실 때는 웹브라우저 상단의 보기 -> 인코딩 -> 자동선택 여부를 확인하십시오. ~ Text(ASCII format) Excel format

연합인증

KMSAV: Korean multi-speaker spontaneous audiovisual dataset 원문보기

Abstract ▼ AI-Helper

주제어

참고문헌 (30)

이 논문을 인용한 문헌

관련 콘텐츠

원문 보기

원문 URL 링크

오픈액세스(OA) 유형

이 논문과 함께 이용한 콘텐츠

AI-Helper ※ AI-Helper는 오픈소스 모델을 사용합니다.

선택된 텍스트