[Paper] The usefulness of the depth images in image-based speech synthesis

Ki-Seung Lee (이기승)

doi:10.7776/ask.2023.42.1.067


The Journal of the Acoustical Society of Korea (한국음향학회지), vol. 42, no. 1, 2023, pp. 67-74

Ki-Seung Lee (Department of Electrical and Electronics Engineering, Konkuk University)

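Abstract (translated from the Korean)

Images captured around the mouth during speech show patterns specific to the sound being uttered. Building on this, methods have been proposed that recognize or synthesize the uttered speech from images of the lower part of the speaker's face. This study proposes an image-based speech synthesis technique that uses depth images cooperatively. Because depth images provide depth information that cannot be observed in optical images, they can be used to complement flat optical images. This paper evaluates the usefulness of depth images from the perspective of speech synthesis. Validation experiments on 60 Korean isolated words showed that, in both objective and subjective evaluations, depth images achieved performance close to that of optical images, and that combining the two images outperformed using either image alone.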

Abstract

Images acquired from the speaker's mouth region reveal unique patterns that depend on the sound being uttered. Based on this principle, several methods have been proposed in which speech signals are recognized or synthesized from images acquired at the speaker's lower face. In this study, an image-based speech synthesis method is proposed in which depth images are used cooperatively. Since depth images yield depth information that cannot be acquired from optical images, they can be used to supplement flat optical images. In this paper, the usefulness of depth images is evaluated from the perspective of speech synthesis. A validation experiment was carried out on 60 Korean isolated words, and it was confirmed that the performance, in terms of both subjective and objective evaluation, was comparable to that of the optical image-based method. When the two images were used in combination, performance improved compared with using each image alone.
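As a rough illustration of how optical and depth images might be combined into one feature vector, the sketch below takes low-frequency 2-D DCT coefficients from each image and concatenates them. It is only a sketch under stated assumptions: this page names DCT and PCA as image features and mentions a "percentage of depth image", but it does not describe the actual pipeline, so the function names (`dct_2d`, `low_freq_features`, `fused_features`), the low-frequency ordering, and the reading of `depth_share` as the fraction of the feature budget given to the depth image are all hypothetical.

```python
import math


def dct_2d(image):
    """Orthonormal 2-D DCT-II of a rectangular list-of-lists image."""
    rows, cols = len(image), len(image[0])

    def dct_1d(vec):
        n = len(vec)
        out = []
        for k in range(n):
            s = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, v in enumerate(vec))
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            out.append(scale * s)
        return out

    # Transform every row, then every column of the row-transformed image.
    row_pass = [dct_1d(r) for r in image]
    col_pass = [dct_1d([row_pass[r][c] for r in range(rows)])
                for c in range(cols)]
    # col_pass is indexed [column][frequency]; transpose back to [u][v].
    return [[col_pass[c][u] for c in range(cols)] for u in range(rows)]


def low_freq_features(image, count):
    """Return `count` DCT coefficients, lowest spatial frequencies first.

    Ordering by u + v is an illustrative choice, not the paper's.
    """
    coeffs = dct_2d(image)
    order = sorted(
        ((u, v) for u in range(len(coeffs)) for v in range(len(coeffs[0]))),
        key=lambda p: (p[0] + p[1], p[0]),
    )
    return [coeffs[u][v] for u, v in order[:count]]


def fused_features(optical, depth, total, depth_share):
    """Concatenate optical and depth features within a fixed budget.

    `depth_share` (0.0 to 1.0) is a hypothetical parameter: the fraction
    of the `total` features taken from the depth image.
    """
    n_depth = round(total * depth_share)
    n_optical = total - n_depth
    return (low_freq_features(optical, n_optical)
            + low_freq_features(depth, n_depth))


if __name__ == "__main__":
    # Tiny synthetic 4x4 "images" standing in for mouth-region crops.
    optical = [[float(r + c) for c in range(4)] for r in range(4)]
    depth = [[float(r * c) for c in range(4)] for r in range(4)]
    features = fused_features(optical, depth, total=8, depth_share=0.25)
    print(len(features), "features per frame")
```

Setting `depth_share` to 0.0 or 1.0 reduces the sketch to the optical-only or depth-only case, which is the kind of comparison the abstract describes.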


Tables/Figures (6)

Fig. 1. Examples of the acquired optical (top) and corresponding depth (bottom) images when vocalizing the vowels "ah" (left) and "uh" (right).
Fig. 2. Block diagram of the proposed image-based speech estimation scheme.
Fig. 3. (Color available online) Average PESQs for the optical image (top) and the depth image (bottom) for different numbers of image features and frames.
Fig. 4. (Color available online) Average PESQ (top) and RMSE (bottom) according to the weights of perceptual disturbance, for optical/depth images when DCT and PCA were adopted as image features.
Fig. 5. (Color available online) Average PESQ (top) and RMSE (bottom) according to the percentage of depth image, when DCT and PCA were adopted as image features.
Fig. 6. (Color available online) Subjective listening test results.

