[Paper] A study on deep neural speech enhancement in drone noise environment

김지민; 정재희; 여찬은; 김우일

doi:10.7776/ask.2022.41.3.342


The Journal of the Acoustical Society of Korea (한국음향학회지), vol. 41, no. 3, 2022, pp. 342-350

김지민, 정재희, 여찬은, 김우일 (all authors: Department of Computer Science and Engineering, Incheon National University)

Abstract

In this paper, we collect actual drone noise to build a noise-corrupted speech database for speech processing in settings such as disaster environments, and we evaluate speech enhancement performance using spectral subtraction and a mask-based technique that relies on a deep neural network (DNN). To improve VoiceFilter (VF), an existing DNN-based speech enhancement model, we apply a Self-Attention operation and use estimated noise information as input to the Attention model. Compared to the existing VF model, the proposed method improves Source to Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI) by 3.77 %, 1.66 %, and 0.32 %, respectively. When 75 % of the training data is noise-corrupted speech built from drone sounds collected from the Internet, the average relative performance drop compared to training with actual drone noise only is 3.18 %, 2.79 %, and 0.96 % for SDR, PESQ, and STOI, respectively. This confirms that, where real data is difficult to obtain, data similar to the real data can be collected and used effectively to train speech enhancement models.
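The two families of techniques named in the abstract can be illustrated on magnitude spectra. The following is a minimal sketch, not the paper's implementation. It assumes the magnitude spectrogram is already available as a list of frames, and it assumes the leading frames contain noise only so that they can be averaged into a noise estimate. The function names and the `floor` parameter are hypothetical. In the mask-based approach the mask would be predicted by a DNN; here it is simply passed in as data.

```python
from typing import List

Frame = List[float]  # magnitudes of one STFT frame, one value per frequency bin


def average_noise(frames: List[Frame], n_noise_frames: int) -> Frame:
    """Estimate the per-bin noise magnitude as the mean of the leading frames.

    Assumption: the first n_noise_frames frames contain noise only.
    """
    head = frames[:n_noise_frames]
    return [sum(column) / len(head) for column in zip(*head)]


def spectral_subtraction(frames: List[Frame], noise: Frame,
                         floor: float = 0.01) -> List[Frame]:
    """Subtract the noise estimate from every frame, bin by bin.

    Where the difference falls below floor * noisy magnitude, the floored
    value is used instead, so no bin becomes negative.
    """
    return [[max(m - n, floor * m) for m, n in zip(frame, noise)]
            for frame in frames]


def apply_mask(frames: List[Frame], mask: List[Frame]) -> List[Frame]:
    """Mask-based enhancement: scale each bin by a gain in [0, 1]."""
    return [[m * g for m, g in zip(frame, gains)]
            for frame, gains in zip(frames, mask)]


# Toy input: two noise-only frames followed by one frame that contains speech.
noisy = [[1.0, 1.0], [1.0, 1.0], [3.0, 1.5]]
noise = average_noise(noisy, n_noise_frames=2)
enhanced = spectral_subtraction(noisy, noise)
print(noise)        # [1.0, 1.0]
print(enhanced[2])  # [2.0, 0.5]
```

Spectral subtraction needs only a noise estimate, whereas the mask-based approach moves the difficulty into predicting good gains, which is the part a trained network would supply.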

Tables/Figures (11)

Fig. 1. Photos of the drone and the sound recorder.
Fig. 2. Layout diagram of the drone and the recorder.
Fig. 3. Process of spectral subtraction method.
Fig. 4. Training process of mask-based speech enhancement method.
Fig. 5. Architecture of the proposed Self-Attention model.
Fig. 6. Block diagram of the proposed speech enhancement system with Self-Attention model.
Table 1. Speech enhancement evaluation results in SDR.
Table 2. Mask-based speech enhancement evaluation results with various mixing rates of real drone sound and Internet-scraped sound for training data.
Table 3. Speech enhancement evaluation results using Self-Attention model with different types of noise estimation for query of the attention model.
Table 4. Speech enhancement evaluation results using average method for the query noise of the Self-Attention model with various mixing rates of real drone sound and Internet-scraped sound for training data.
Table 5. Speech enhancement evaluation results using minimum statistics method for the query noise of the Self-Attention model with various mixing rates of real drone sound and Internet-scraped sound for training data.
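
Tables 3 to 5 refer to two ways of estimating the noise that serves as the query of the attention model: an average method and a minimum statistics method. This record does not describe the architecture, so the sketch below is only a generic illustration under stated assumptions. It uses plain scaled dot-product attention (the operation underlying Self-Attention) without learned projections, takes the query from a noise estimate, and uses the noisy frames as keys and values. The per-bin minimum is a crude stand-in for minimum statistics, not the full method. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def noise_by_average(frames: List[Vector]) -> Vector:
    """Per-bin mean over all frames (simplified 'average' estimate)."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def noise_by_minimum(frames: List[Vector]) -> Vector:
    """Per-bin minimum over all frames (simplified stand-in for minimum statistics)."""
    return [min(column) for column in zip(*frames)]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax: shift by the maximum before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector],
           values: List[Vector]) -> Tuple[Vector, Vector]:
    """Scaled dot-product attention for a single query vector.

    Returns the attention weights over the frames and the weighted sum
    of the value vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    context = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
    return weights, context


# Toy magnitude frames: 3 frames, 2 frequency bins each.
frames = [[1.0, 2.0], [3.0, 0.5], [2.0, 2.0]]
for estimate in (noise_by_average, noise_by_minimum):
    query = estimate(frames)
    weights, context = attend(query, frames, frames)
    print(estimate.__name__, query, [round(w, 3) for w in weights])
```

Swapping the estimator changes only the query, which makes it easy to compare how each noise estimate redistributes attention over the same frames.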

