Performance comparison evaluation of real and complex networks for deep neural network-based speech enhancement in the frequency domain

황서림; 박성욱; 박영철

doi:10.7776/ask.2022.41.1.030

Abstract

This paper compares the performance of Deep Neural Network (DNN)-based speech enhancement models trained in the frequency domain from two perspectives: the learning target and the network structure. Spectral mapping and Time-Frequency (T-F) masking were used as learning targets, and a real network and a complex network were used as network structures. Model performance was evaluated at several dataset scales with two objective metrics, Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI). The experiments show that the appropriate amount of training data differs depending on the type of network and the type of dataset. They also show that, depending on the data size and the learning target, the real network can achieve relatively higher performance than the complex network; when the total number of parameters is taken into account, a real network may therefore be the more practical choice in some cases.
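To make the two learning targets named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes an additive noise model in the STFT domain, uses random complex values in place of real spectrograms, and picks one common form of the ideal ratio mask (IRM) and the complex ideal ratio mask (cIRM); the abstract does not say which mask definitions the paper used. All variable and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T = 5, 4  # toy sizes: F frequency bins, T time frames


def random_spectrum(shape):
    """Stand-in for an STFT: complex Gaussian values (illustrative only)."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)


clean = random_spectrum((F, T))
noise = 0.5 * random_spectrum((F, T))
noisy = clean + noise  # assumed additive noise model in the STFT domain

# Spectral mapping: the network is trained to output the clean spectrum itself.
mapping_target_real = np.abs(clean)                          # magnitude only, shape (F, T)
mapping_target_complex = np.stack([clean.real, clean.imag])  # real and imaginary parts, shape (2, F, T)

# T-F masking: the network is trained to output a mask applied to the noisy input.
eps = 1e-12  # guards against division by zero
irm = np.sqrt(np.abs(clean) ** 2 / (np.abs(clean) ** 2 + np.abs(noise) ** 2 + eps))
cirm = clean * np.conj(noisy) / (np.abs(noisy) ** 2 + eps)  # equals clean / noisy up to eps

# Applying the masks: the real mask scales the magnitude and reuses the noisy phase,
# while the complex mask corrects magnitude and phase together.
enhanced_from_irm = irm * np.abs(noisy) * np.exp(1j * np.angle(noisy))
enhanced_from_cirm = cirm * noisy
```

In this toy setting the oracle complex mask recovers the clean spectrum up to `eps`, whereas the real mask leaves the noisy phase untouched. That difference is one reason phase-aware complex networks are studied alongside real ones.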


Tables/Figures (8)

Fig. 1. Architecture of the baseline network: (a) the complex network [7]; (b) the real network corresponding to (a).
Table 1. CRN architecture. Here F denotes the number of frequency bins, and T denotes the number of time frames.
Table 2. Performance evaluation of various network types using the spectral mapping method on dataset-1.
Table 3. Performance evaluation of various network types using the spectral mapping method on dataset-2.
Table 4. Performance evaluation of various network types using the T-F masking method on dataset-1.
Fig. 2. (Color available online) Average PESQ results according to the scale of dataset-1.
Fig. 3. (Color available online) Average PESQ results according to the scale of dataset-2.
Table 5. Performance evaluation of various network types using the T-F masking method on dataset-2.
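
Fig. 1 pairs a complex network with a real counterpart, and the abstract weighs the two by total parameter count. As a rough illustration of where that cost comes from, the sketch below implements a single complex-valued linear layer using only real matrix products. It is not the paper's CRN; `complex_linear`, `n_in`, and `n_out` are hypothetical names, and the real layer it is compared against is assumed to have the same input and output width.

```python
import numpy as np


def complex_linear(x, w_real, w_imag):
    """Apply the complex weight W = w_real + j*w_imag to complex x with real matmuls only.

    (W_r + jW_i)(x_r + jx_i) = (W_r x_r - W_i x_i) + j(W_r x_i + W_i x_r)
    """
    out_real = w_real @ x.real - w_imag @ x.imag
    out_imag = w_real @ x.imag + w_imag @ x.real
    return out_real + 1j * out_imag


rng = np.random.default_rng(0)
n_in, n_out = 6, 3  # toy layer widths (assumed for illustration)
w_real = rng.standard_normal((n_out, n_in))
w_imag = rng.standard_normal((n_out, n_in))
x = rng.standard_normal(n_in) + 1j * rng.standard_normal(n_in)

y = complex_linear(x, w_real, w_imag)
reference = (w_real + 1j * w_imag) @ x  # NumPy's native complex product, for comparison

real_params = n_out * n_in                   # real layer of the same width
complex_params = w_real.size + w_imag.size   # two real weight matrices
```

Under these assumptions the complex layer stores two real weight matrices, so it holds twice the weights of a real layer of the same width. The parameter counts of the paper's actual networks are not given in this record.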

References (15)

[1] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," Proc. IEEE ICASSP. 7092-7096 (2013).
[2] T. Gerkmann, M. Krawczyk-Becker, and J. Le Roux, "Phase processing for single-channel speech enhancement: History and recent advances," IEEE Signal Process. Mag. 32, 55-66 (2015).
[3] H.-S. Choi, J.-H. Kim, J. Huh, A. Kim, J.-W. Ha, and K. Lee, "Phase-aware speech enhancement with deep complex U-Net," Proc. ICLR (2019).
[4] S. A. Nossier, J. Wall, M. Moniri, C. Glackin, and N. Cannings, "Mapping and masking targets comparison using different deep learning based speech enhancement architectures," Proc. IJCNN. 1-8 (2020).
[5] K. Paliwal, K. Wojcicki, and B. Shannon, "The importance of phase in speech enhancement," Speech Commun. 53, 465-494 (2011).
[6] K. Tan and D. Wang, "Complex spectral mapping with a convolutional recurrent network for monaural speech enhancement," Proc. IEEE ICASSP. 6865-6869 (2019).
[7] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie, "DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement," Proc. Interspeech. 2472-2476 (2020).
[8] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, "How does batch normalization help optimization?," Proc. NeurIPS. 1-11 (2018).
[9] C. K. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, and J. Gehrke, "The INTERSPEECH 2020 deep noise suppression challenge: Dataset, subjective testing framework, and challenge results," arXiv preprint arXiv:2005.13981 (2020).
[10] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, "TIMIT acoustic-phonetic continuous speech corpus," Linguistic Data Consortium (1993).
[11] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Commun. 12, 247-251 (1993).
[12] E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta, and M. Matassoni, "The second 'CHiME' speech separation and recognition challenge: Datasets, tasks and baselines," Proc. IEEE ICASSP. 126-130 (2013).
[13] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," Proc. ASRU. 504-511 (2015).
[14] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ)-A new method for speech quality assessment of telephone networks and coders," Proc. IEEE ICASSP. 749-752 (2001).
[15] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Trans. on Audio, Speech, and Lang. Process. 19, 2125-2136 (2011).

