[Paper] A study on skip-connection with time-frequency self-attention for improving speech enhancement based on complex-valued spectrum

정재희; 김우일

doi:10.7776/ask.2023.42.2.094

A study on skip-connection with time-frequency self-attention for improving speech enhancement based on complex-valued spectrum

The Journal of the Acoustical Society of Korea (한국음향학회지), v.42 no.2, 2023, pp.94-101

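Abstract (translated from Korean)

Deep neural network models composed of an encoder and a decoder, such as U-Net, are widely used in speech enhancement and pass encoder features to the decoder through skip-connections. A skip-connection helps the decoder reconstruct the enhanced spectrum and can compensate for information lost in the encoder. However, the encoder features and decoder features joined by a skip-connection carry different meanings. To improve complex-spectrum-based speech enhancement, this paper studies applying Self-Attention (SA) to the skip-connection so that the encoder features passed to the decoder are transformed to be closer in meaning to the decoder features. SA is a technique for sequence-to-sequence problems that, when generating the output sequence, uses a weighted arithmetic mean of the input sequence to focus on its decisive parts; prior studies have shown that applying it to speech enhancement is effective for improving performance. Three methods that use the encoder and decoder features to apply SA to the skip-connection are studied. In speech enhancement experiments on the TIMIT database, the proposed methods improved on every evaluation metric compared with Deep Complex U-Net (DCUNET) connected only by conventional skip-connections.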
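
The abstract describes SA as taking a weighted arithmetic mean of the input sequence. A minimal sketch of that idea follows, using generic scaled dot-product attention on small real-valued vectors. This is an illustration only, not the paper's time-frequency SA on complex-valued features; the function names and toy values are assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Each output vector is a weighted average of `values`; the weights
    reflect how strongly the matching query agrees with each key.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy sequence of three 2-dimensional feature vectors (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(x, x, x)  # self-attention: Q, K and V all come from x
```

Because every row of weights sums to 1, each output stays inside the range spanned by the input vectors, which is what makes it a weighted average rather than an arbitrary linear map.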

Abstract

Deep neural networks composed of an encoder and a decoder, such as U-Net, are widely used for speech enhancement and concatenate encoder features to the decoder through skip-connections. The skip-connection helps reconstruct the enhanced spectrum and complements information lost in the encoder. However, the encoder and decoder features joined by the skip-connection are not directly compatible with each other. In this paper, for complex-valued spectrum based speech enhancement, Self-Attention (SA) is applied to the skip-connection to transform the encoder features so that they are compatible with the decoder features. SA is a technique that, when generating an output sequence in a sequence-to-sequence task, uses a weighted average of the input to attend to the most relevant subsets of the input; it has been shown to remove noise effectively when applied to speech enhancement. Three models that use encoder and decoder features to apply SA to the skip-connection are studied. In experiments on the TIMIT database, the proposed methods improve on all evaluation metrics compared with the Deep Complex U-Net (DCUNET) using plain skip-connections only.
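
The caption of Fig. 1 in this record describes the overall processing as mask-based speech enhancement using a complex-valued spectrum. The sketch below shows only the masking step, assuming the mask is applied by element-wise complex multiplication on each time-frequency bin. The helper name `apply_complex_mask` and the toy values are hypothetical, and the oracle mask stands in for what a trained network would estimate.

```python
def apply_complex_mask(noisy_spec, mask):
    """Multiply each time-frequency bin of the noisy spectrum by a complex mask value.

    A complex mask can rescale the magnitude and rotate the phase of a bin,
    whereas a real-valued mask can only rescale the magnitude.
    """
    return [[m * y for m, y in zip(mask_row, spec_row)]
            for mask_row, spec_row in zip(mask, noisy_spec)]

# Toy 2x2 "spectrogram" (frames x frequency bins); values are illustrative.
clean = [[1 + 1j, 2 - 1j], [0.5j, -1 + 0j]]
noise = [[0.2 - 0.1j, -0.3 + 0.4j], [0.1 + 0.1j, 0.2 - 0.2j]]
noisy = [[c + n for c, n in zip(cr, nr)] for cr, nr in zip(clean, noise)]

# Oracle mask, computable only when the clean signal is known; a network
# would instead have to estimate a mask from the noisy input alone.
ideal_mask = [[c / y for c, y in zip(cr, yr)] for cr, yr in zip(clean, noisy)]

enhanced = apply_complex_mask(noisy, ideal_mask)
```

With the oracle mask the clean spectrum is recovered exactly, phase included, which is the motivation for working with the complex spectrum rather than the magnitude alone.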

Tables/Figures (8)

Fig. 1. Process of mask-based speech enhancement using complex-valued spectrum.
Fig. 2. A structure of DCUNET model.
Fig. 3. (Color available online) A structure of TFSA.
Fig. 4. A structure of DCUNET applied SkipConv.
Fig. 5. (Color available online) A structure of DCUNET applied TFSA. The input of TFSA uses only encoder features (Skip TFSA Enc).
Fig. 6. (Color available online) A structure of DCUNET applied TFSA. The key and value in TFSA use encoder features, and the query uses decoder features (Skip TFSA DE).
Fig. 7. (Color available online) A structure of DCUNET applied TFSA. The input of TFSA uses the sum of encoder and decoder features (Skip TFSA Sum).
Table 1. The result of speech enhancement.
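
The captions of Figs. 5 to 7 distinguish the three variants by where the attention inputs come from. A rough sketch of that routing is given below, using generic scaled dot-product attention on real-valued vectors. It ignores the time-frequency structure and the complex-valued features of the actual TFSA block, and `skip_with_attention` with its `mode` strings is a hypothetical helper introduced only for this illustration.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention; returns one weighted average of `values` per query."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

def skip_with_attention(enc, dec, mode):
    """Hypothetical helper: choose the attention inputs for one skip-connection.

    mode "enc": query, key and value all come from the encoder feature.
    mode "de":  key and value come from the encoder, query from the decoder.
    mode "sum": the attention input is the element-wise sum of both features.
    """
    if mode == "enc":
        return attention(enc, enc, enc)
    if mode == "de":
        return attention(dec, enc, enc)
    if mode == "sum":
        mixed = [[e + d for e, d in zip(ev, dv)] for ev, dv in zip(enc, dec)]
        return attention(mixed, mixed, mixed)
    raise ValueError(f"unknown mode: {mode}")

# Toy encoder and decoder features: two positions, two channels each.
enc = [[1.0, 0.0], [0.0, 1.0]]
dec = [[0.5, 0.5], [1.0, -1.0]]
results = {mode: skip_with_attention(enc, dec, mode) for mode in ("enc", "de", "sum")}
```

In the "de" mode the decoder decides which encoder positions to read from, while the returned content is still built from encoder features only; in the "sum" mode decoder information also enters the content itself.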

