[Paper] Speech Reconstruction With Reminiscent Sound Via Visual Voice Memory

Hong, Joanna; Kim, Minsu; Park, Se Jin; Ro, Yong Man

DOI: 10.1109/TASLP.2021.3126925

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, 2021, pp. 3654-3667

Hong, Joanna (Korea Advanced Institute of Science and Technology (KAIST), Image and Video Systems Laboratory, School of Electrical Engineering, Daejeon, Korea); Kim, Minsu (Korea Advanced Institute of Science and Technology (KAIST), Image and Video Systems Laboratory, School of Electrical Engineering, Daejeon, Korea); Park, Se Jin (Korea Advanced Institute of Science and Technology (KAIST), Image and Video Systems Laboratory, School of Electrical Engineering, Daejeon, Korea); Ro, Yong Man (Korea Advanced Institute of Science and Technology (KAIST), Image and Video Systems Laboratory, School of Electrical Engineering, Daejeon, Korea)

Abstract

The goal of this work is to reconstruct speech from silent video in both speaker-dependent and speaker-independent settings. Unlike previous works, which have mostly been restricted to a speaker-dependent setting, we propose the Visual Voice memory to restore essential auditory information and generate proper speech for different speakers, including unseen speakers. The proposed memory takes additional auditory information that corresponds to the input face movements and stores auditory contexts that can be recalled by the given input visual features. Specifically, the Visual Voice memory contains value and key memory slots: the value memory slots save the audio features, and the key memory slots store the visual features at the same locations as the saved audio features. By guiding each memory to save its features properly, the model can produce adequate speech. Hence, our method uses both video and audio information during training but does not require any additional auditory input during inference. Our key contributions are: (1) proposing the Visual Voice memory, which brings rich audio information that complements the visual features and thus produces high-quality speech from silent video, and (2) enabling multi-speaker and unseen-speaker training by memorizing auditory features and the corresponding visual features. We validate the proposed framework on the GRID and Lip2Wav datasets and show that our method surpasses the performance of previous works in both multi-speaker and speaker-independent settings. We also demonstrate that the Visual Voice memory contains meaningful information for reconstructing speech.
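The key/value recall described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the addressing rule, so cosine similarity followed by a scaled softmax is assumed here, and the function names (`cosine`, `softmax`, `recall_audio`), the `scale` parameter, and the toy slot values are all hypothetical. The sketch only shows how a visual query could retrieve a stored audio feature without any audio input at inference time.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def softmax(scores, scale=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(scale * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def recall_audio(visual_query, key_slots, value_slots, scale=10.0):
    """Read an audio feature from memory using only a visual query.

    Assumption (not from the paper's abstract): slots are addressed by
    cosine similarity between the query and each key slot, sharpened by
    a softmax; the recalled feature is the weighted sum of value slots.
    """
    weights = softmax([cosine(visual_query, k) for k in key_slots], scale)
    dim = len(value_slots[0])
    recalled = [
        sum(w * v[d] for w, v in zip(weights, value_slots)) for d in range(dim)
    ]
    return recalled, weights


# Toy memory: key slot i (visual) shares its location with value slot i (audio).
key_slots = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
value_slots = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]]

# A visual query close to the first key mostly recalls the first audio slot.
recalled, weights = recall_audio([0.9, 0.1], key_slots, value_slots)
```

In this toy setup the keys and values are fixed by hand; in a trained model both memories would be learned, with audio features available only during training as the abstract describes.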

