[Paper] Improved Bimodal Speech Recognition Study Based on Product Hidden Markov Model

Xi, Su Mei; Cho, Young Im

doi:10.5391/ijfis.2013.13.3.164

International Journal of Fuzzy Logic and Intelligent Systems (IJFIS), vol. 13, no. 3, 2013, pp. 164-170

Xi, Su Mei (College of Information Technology, The University of Suwon); Cho, Young Im (College of Information Technology, The University of Suwon)

Abstract

Recent years have seen growing demand for automatic speech recognition (ASR) systems that operate robustly in acoustically noisy environments. This paper proposes an improved product hidden Markov model (HMM) for bimodal speech recognition. A two-dimensional training model is built from a dependently trained audio HMM and visual HMM, reflecting the asynchronous characteristics of the audio and video streams. A weight coefficient is introduced to adjust the weights of the video and audio streams automatically according to differences in the noise environment. Experimental results show that this approach achieves better speech recognition performance than other bimodal speech recognition approaches.

AI Summary of Main Text

* The sentences below were identified automatically by AI and may include unsuitable ones; use them with care.

Problem Definition

Therefore, asynchrony can be permitted in the auditory and visual training model. This paper proposes an improved product HMM based on the product HMM method. For Chinese word recognition, where a word usually corresponds to five or six states, an offset of only one state is allowed between the audio and video streams to account for this asynchrony.

Proposed Method

Considering the asynchronous nature of the speech and video signals, we put forward an improved product hidden Markov model (HMM) for bimodal speech recognition of Chinese words, formulated as a multi-stream HMM. Moreover, we control the stream weights of the audio-visual HMM with the generalized probabilistic descent (GPD) algorithm [1, 3] in order to adaptively optimize the audio-visual ASR.
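To make the role of the stream weights concrete, the sketch below combines per-stream log-likelihoods as a weighted sum. It rests on assumptions not stated in the text: the two weights sum to 1, the weight is fixed by hand instead of being estimated by the GPD algorithm, and the function names and scores are hypothetical.

```python
def combined_log_likelihood(log_audio, log_video, audio_weight):
    """Weighted sum of audio and video log-likelihoods.

    Assumption: the video weight is 1 - audio_weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * log_audio + (1.0 - audio_weight) * log_video


def recognize(word_scores, audio_weight):
    """Return the word with the highest combined score.

    word_scores maps each word to (audio log-likelihood, video log-likelihood).
    """
    return max(
        word_scores,
        key=lambda w: combined_log_likelihood(*word_scores[w], audio_weight),
    )


# Made-up scores for two candidate words.
scores = {"word_a": (-120.0, -80.0), "word_b": (-100.0, -95.0)}
clean_choice = recognize(scores, audio_weight=0.9)  # trust audio in quiet conditions
noisy_choice = recognize(scores, audio_weight=0.2)  # trust video in heavy noise
```

With these illustrative numbers the decision flips as the audio weight drops, which is the behavior an adaptive weighting scheme is meant to exploit in noise.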
In this paper, with the aim of achieving effective speech recognition in noisy environments, we propose a product HMM-based bimodal speech model that allows a one-step state offset to accommodate the asynchrony between the video and audio signals.
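The effect of the one-step offset on the model size can be sketched by enumerating the composite states. This is an illustration under assumptions: both streams use the same number of states, the topology is left-to-right without skips, and the helper names are hypothetical.

```python
def product_states(n_states, max_offset=1):
    """Composite (audio_state, video_state) pairs whose indices differ by at most max_offset."""
    return [
        (a, v)
        for a in range(n_states)
        for v in range(n_states)
        if abs(a - v) <= max_offset
    ]


def allowed_transitions(n_states, max_offset=1):
    """Left-to-right transitions: each stream either stays or advances by one state."""
    states = set(product_states(n_states, max_offset))
    transitions = {}
    for a, v in sorted(states):
        successors = [(a + da, v + dv) for da in (0, 1) for dv in (0, 1)]
        transitions[(a, v)] = [s for s in successors if s in states]
    return transitions


states = product_states(5)          # 13 composite states instead of 5 * 5 = 25
transitions = allowed_transitions(5)
```

Restricting the offset keeps the composite state space close to the diagonal, so the model tolerates a small lag between lips and sound without the cost of a full two-dimensional grid.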
The normalized energy, Mel-frequency cepstral coefficients (MFCC), and linear predictive cepstral coefficients (LPCC) of speech describe its prosodic, timbre, and perceptual features, respectively, so they are selected as the audio feature parameters in this paper.
A recently developed method for overcoming model mismatch is to use a reverberant speech database for training target models [9]. This method was tested on an adaptive-GMM (AGMM)-based SVR system [10] with reverberant speech, with various values of reverberation time (RT). Matching of RT between training and testing data was reported to reduce the equal-error rate (EER) from 16.44% to 9.9%, on average.
Based on the recognition results under different SNR conditions (Table 1), we selected the joint MFCC-LPCC feature to train the model and recognize speech.

Data

We constructed a bimodal corpus recorded from seven speakers (five male and two female). The corpus contains 50 Chinese words, with 750 words in total across the seven speakers, of which 550 are used for training and the remaining 200 for recognition. Where needed, we added noise of different intensities to the speech used for recognition.
To keep the video and audio streams synchronized after feature extraction, we interpolated the video features and fed these feature parameters into the improved product HMM, as shown in Figure 4. The video features were the lip parameter v_t and the dynamic parameter v_t, totaling 14 dimensions. To determine the final scheme of the audio features, we preselected three sets of features as follows:
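The interpolation step mentioned above can be sketched as follows. The text does not say which interpolation is used, so linear interpolation is an assumption here, as are the integer upsampling factor, the function name, and the toy two-dimensional frames (the actual video features have 14 dimensions).

```python
def upsample_linear(frames, factor):
    """Insert factor - 1 linearly interpolated frames between consecutive video frames."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    if not frames:
        return []
    out = []
    for prev, nxt in zip(frames, frames[1:]):
        for k in range(factor):
            t = k / factor  # fractional position between the two frames
            out.append([p + t * (n - p) for p, n in zip(prev, nxt)])
    out.append(list(frames[-1]))  # keep the final original frame
    return out


# Two toy video frames upsampled by 4, e.g. a hypothetical 25 fps video
# stream brought to a 100 frames-per-second audio feature rate.
video = [[0.0, 10.0], [4.0, 18.0]]
aligned = upsample_linear(video, 4)
```

After upsampling, every audio feature frame has a video frame at the same time index, so the two streams can be fed to the model frame by frame.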

Performance/Effects

This method was tested on an adaptive-GMM (AGMM)-based SVR system [10] with reverberant speech, with various values of reverberation time (RT). Matching of RT between training and testing data was reported to reduce the equal-error rate (EER) from 16.44% to 9.9%, on average, when using both Z-norm and T-norm score normalizations. However, the study in [9] did not investigate the effect of GMM order on SVR performance under reverberation conditions.
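For readers unfamiliar with the metric, the sketch below shows one simple way to estimate an equal-error rate from verification scores. It is not the evaluation code of [9]: the threshold sweep is a coarse approximation, the function name is hypothetical, and the scores are made up.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER: sweep thresholds and take the point where
    false acceptance and false rejection rates are closest."""
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, eer = None, None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


genuine = [0.9, 0.8, 0.7, 0.4]    # made-up scores for true-speaker trials
impostor = [0.6, 0.3, 0.2, 0.1]   # made-up scores for impostor trials
eer = equal_error_rate(genuine, impostor)
```

A lower EER means the genuine and impostor score distributions overlap less, which is what matching the reverberation time between training and testing data is reported to achieve.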

Future Work

Also, more complicated interactions between the modalities can be modeled by using cross-modal associations and influences, where the proposed integration method can still be used for adaptive robustness. With these considerations, further investigation of applying the proposed system to complex tasks such as multiword or continuous speech recognition is in progress.

References (19)

[1] B. V. Dasarathy, "Sensor fusion potential exploitation: innovative architectures and illustrative applications," Proceedings of the IEEE, vol. 85, no. 1, pp. 24-38, Jan. 1997. http://dx.doi.org/10.1109/5.554206
[2] D. W. Massaro, Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry, Hillsdale, NJ: Lawrence Erlbaum, 1987.
[3] J. S. Lee and C. H. Park, "Training hidden Markov models by hybrid simulated annealing for visual speech recognition," in Proceedings of 2006 IEEE International Conference on Systems, Man and Cybernetics, Taipei, Oct. 2006, pp. 198-202.
[4] K. Kumatani, S. Nakamura, and K. Shikano, "An adaptive integration based on product HMM for audio-visual speech recognition," in Proceedings of 2001 IEEE International Conference on Multimedia and Expo, Tokyo, 2001, pp. 813-816. http://dx.doi.org/10.1109/ICME.2001.1237846
[5] J. S. Lee and C. H. Park, "Robust audio-visual speech recognition based on late integration," IEEE Transactions on Multimedia, vol. 10, no. 5, pp. 767-779, Aug. 2008. http://dx.doi.org/10.1109/TMM.2008.922789
[6] S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech recognition," IEEE Transactions on Multimedia, vol. 2, no. 3, pp. 141-151, Sep. 2000. http://dx.doi.org/10.1109/6046.865479
[7] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357-366, Aug. 1980. http://dx.doi.org/10.1109/TASSP.1980.1163420
[8] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1, pp. 19-41, Jan. 2000.
[9] I. Peer, B. Rafaely, and Y. Zigel, "Reverberation matching for speaker recognition," in Proceedings of 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, NV, 2008, pp. 4829-4832. http://dx.doi.org/10.1109/ICASSP.2008.4518738
[10] F. Bimbot, J. F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacretaz, and D. A. Reynolds, "A tutorial on text-independent speaker verification," EURASIP Journal on Advances in Signal Processing, vol. 2004, no. 4, pp. 430-451, Apr. 2004. http://dx.doi.org/10.1155/S1110865704310024
[11] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou, "Audio visual speech recognition," in Final Workshop 2000 Report, Center for Language and Speech Processing, Baltimore, 2000.
[12] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Englewood Cliffs, NJ: PTR Prentice Hall, 1993.
[13] J. Luettin, G. Potamianos, and C. Neti, "Asynchronous stream modeling for large vocabulary audio-visual speech recognition," in Proceedings of 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, UT, 2001, pp. 169-172. http://dx.doi.org/10.1109/ICASSP.2001.940794
[14] H. Zhao, C. Tang, and T. Yu, "Fast thresholding segmentation for image with high noise," in Proceedings of 2008 International Conference on Information and Automation, Changsha, 2008, pp. 290-295. http://dx.doi.org/10.1109/ICINFA.2008.4608013
[15] L. Xie and D. Jiang, "Audio-visual synthesis and synchronous asynchronous experimental research for bimodal speech recognition," Journal of Northwestern Polytechnical University, vol. 22, no. 2, pp. 171-175, 2004.
[16] H. Zhao, Y. Gu, and C. Tang, "Research of relationship between weight coefficient of product HMM and instantaneous SNR in bimodal speech recognition," Journal of Computer Application, vol. 29, pp. 279-285, 2009.
[17] A. Adjoudani and C. Benoît, "On the integration of auditory and visual parameters in an HMM-based ASR," in Proceedings NATO ASI Conference on Speechreading by Man and Machine: Models, Systems and Applications, 1995, pp. 461-471.
[18] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989. http://dx.doi.org/10.1109/5.18626
[19] C. Bregler and S. M. Omohundro, "Nonlinear manifold learning for visual speech recognition," in Proceedings of 1995 5th International Conference on Computer Vision, Cambridge, MA, 1995, pp. 494-499. http://dx.doi.org/10.1109/ICCV.1995.466899

Open Access (OA) Type

BRONZE

A paper made available on the site of the publisher or academic society, either through a special time-limited promotion or after a set period has elapsed.
