ETRI Journal, vol. 46, no. 1, 2024, pp. 71-81
Kiyoung Park, Changhan Oh, and Sunghee Dong (Superintelligence Creative Research Laboratory, Electronics and Telecommunications Research Institute)
Recent advances in deep learning for speech and visual recognition have accelerated the development of multimodal speech recognition, yielding many innovative results. We introduce a Korean audiovisual speech recognition corpus. This dataset comprises approximately 150 h of manually transcribed and ...
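Audio-visual corpora like the one described here are typically consumed by fusing acoustic features with lip-region visual features. The paper's abstract does not specify a fusion recipe, so the sketch below shows only a generic, hypothetical preprocessing step: upsampling video-rate features to the audio frame rate by nearest-frame repetition and concatenating the two streams. The frame rates and feature dimensions are illustrative assumptions, not values from the corpus.

```python
import numpy as np

# Assumed, illustrative rates: 10 ms filterbank hop and a 30 fps camera.
AUDIO_FPS = 100
VIDEO_FPS = 30

def fuse_features(audio_feats: np.ndarray, video_feats: np.ndarray) -> np.ndarray:
    """Naive late fusion: repeat each video frame up to the audio frame
    rate, then concatenate along the feature axis."""
    n_audio = audio_feats.shape[0]
    # Map every audio frame index to the nearest earlier video frame index,
    # clipped so rounding never runs past the last video frame.
    idx = np.minimum(
        np.arange(n_audio) * VIDEO_FPS // AUDIO_FPS,
        video_feats.shape[0] - 1,
    )
    return np.concatenate([audio_feats, video_feats[idx]], axis=1)

audio = np.random.randn(300, 80)   # 3 s of 80-dim filterbank frames
video = np.random.randn(90, 512)   # 3 s of 512-dim lip embeddings
fused = fuse_features(audio, video)
print(fused.shape)  # (300, 592)
```

Real AVSR systems usually learn the alignment (e.g., with cross-modal attention) rather than repeating frames, but rate-matching of this kind is a common first step when the two streams are recorded at different frame rates.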