[Paper] Comparison of Korean Real-time Text-to-Speech Technology Based on Deep Learning (딥러닝 기반 한국어 실시간 TTS 기술 비교)

권철홍

doi:10.17703/jcct.2021.7.1.640

Abstract

A deep learning based end-to-end TTS system consists of two stages: a Text2Mel stage that generates a spectrogram from text, and a vocoder that synthesizes the speech signal from the spectrogram. With the recent application of deep learning to TTS, the intelligibility and naturalness of synthesized speech have improved to a level close to human speech. However, inference is very slow compared with conventional methods. The recently proposed non-autoregressive approach generates speech samples in parallel, without depending on previously generated samples, and can therefore improve synthesis speed. This paper introduces FastSpeech, FastSpeech 2, and FastPitch as non-autoregressive Text2Mel techniques, and Parallel WaveGAN, Multi-band MelGAN, and WaveGlow as non-autoregressive vocoder techniques, and implements them to verify whether they run in real time. The real-time factors (RTF) obtained in the experiments show that all of the presented methods are fully capable of real-time processing. In addition, except for WaveGlow, the trained models are tens to hundreds of megabytes in size, so they can be applied to embedded environments with limited memory.

Tables/Figures (4)

Table 1. Real-time factor (RTF) of three Text2Mel methods
Table 2. Real-time factor (RTF) of three vocoders
Table 3. Real-time factor (RTF) of three TTS systems
Table 4. Model sizes of TTS methods

AI summary of the main text

* The sentences below were identified automatically, so some may be unsuitable; use them with care.

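Problem definition

This paper introduces TTS techniques that use the non-autoregressive approach, combines them to build Korean TTS systems, compares their performance, and verifies whether the presented techniques can run in real time.
To design a real-time Korean TTS system, this paper introduces those state-of-the-art Text2Mel and vocoder techniques that are capable of real-time processing, and implements them to verify whether they run in real time. FastSpeech, FastSpeech 2, and FastPitch are presented as Text2Mel techniques, which generate a spectrogram from text, and Parallel WaveGAN, Multi-band MelGAN, and WaveGlow as vocoder techniques, which generate synthesized speech from a spectrogram.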

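Proposed method

In Text2Mel training, the model is trained on text together with the ground-truth mel-spectrograms extracted from the corresponding speech DB. In inference, the trained model takes arbitrary text as input and outputs a generated mel-spectrogram.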
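... is as follows. First, a Hamming window of size 1,024 is applied to the speech signal to split it into frames; the window is shifted every 256 samples and a 1,024-point discrete Fourier transform is computed. Taking the absolute value of the Fourier transform gives the linear magnitude spectrogram, from which an 80-dimensional mel-spectrogram is computed and then converted to log values.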
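The feature extraction steps above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the window size (1,024), hop (256), DFT size (1,024), and 80 mel bands come from the text, while the 22,050 Hz sampling rate, the HTK-style mel formula, the natural logarithm, the lack of padding, and the small floor `eps` are assumptions that the excerpt does not specify.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (an assumption; the source does not name a formula).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale, shape (n_mels, n_fft//2+1)."""
    n_bins = n_fft // 2 + 1
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_bins)
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, ctr, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr=22050, n_fft=1024, hop=256, n_mels=80, eps=1e-10):
    """Hamming-windowed frames -> |DFT| -> 80 mel bands -> log."""
    window = np.hamming(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # linear magnitude
    mel = magnitude @ mel_filterbank(sr, n_fft, n_mels).T     # (n_frames, n_mels)
    return np.log(np.maximum(mel, eps))                       # floor avoids log(0)

# Usage: one second of a 440 Hz tone at the assumed sampling rate.
tone = np.sin(2.0 * np.pi * 440.0 * np.arange(22050) / 22050.0)
features = log_mel_spectrogram(tone)
print(features.shape)  # (83, 80)
```

Each row of the result is one frame, so consecutive rows are 256 samples apart in the original signal.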
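In vocoder training, the model is trained on speech signals together with the corresponding ground-truth mel-spectrograms. In inference, the vocoder takes the mel-spectrogram generated by Text2Mel inference as input and outputs the synthesized speech signal. Training and inference for both Text2Mel and the vocoder use the same speech DB described in the previous section.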
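The two-stage inference flow described here (text to mel-spectrogram, then mel-spectrogram to waveform) can be sketched as a pair of interfaces. All class and function names below are hypothetical and the "models" are silent stand-ins; only the 80 mel bands and the 256-sample hop come from the text.

```python
from typing import List, Protocol

MelSpectrogram = List[List[float]]  # frames x mel bands
Waveform = List[float]

class Text2Mel(Protocol):
    def infer(self, text: str) -> MelSpectrogram: ...

class Vocoder(Protocol):
    def infer(self, mel: MelSpectrogram) -> Waveform: ...

class DummyText2Mel:
    """Hypothetical stand-in: emits a fixed number of empty frames per character."""
    def __init__(self, n_mels: int = 80, frames_per_char: int = 5):
        self.n_mels = n_mels
        self.frames_per_char = frames_per_char

    def infer(self, text: str) -> MelSpectrogram:
        n_frames = len(text) * self.frames_per_char
        return [[0.0] * self.n_mels for _ in range(n_frames)]

class DummyVocoder:
    """Hypothetical stand-in: emits `hop` silent samples per input frame."""
    def __init__(self, hop: int = 256):
        self.hop = hop

    def infer(self, mel: MelSpectrogram) -> Waveform:
        return [0.0] * (len(mel) * self.hop)

def synthesize(text: str, text2mel: Text2Mel, vocoder: Vocoder) -> Waveform:
    mel = text2mel.infer(text)   # stage 1: text -> mel-spectrogram
    return vocoder.infer(mel)    # stage 2: mel-spectrogram -> waveform

waveform = synthesize("abcd", DummyText2Mel(), DummyVocoder())
print(len(waveform))  # 5120 samples: 4 chars * 5 frames * 256 samples
```

Keeping the two stages behind separate interfaces is what allows any of the three Text2Mel models to be paired with any of the three vocoders when building the TTS systems compared in the paper.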
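This section introduces FastSpeech, FastSpeech 2, and FastPitch, which are Text2Mel techniques that generate a spectrogram from text in a non-autoregressive manner, and Parallel WaveGAN, Multi-band MelGAN, and WaveGlow, which are vocoder techniques that likewise generate synthesized speech from a spectrogram in a non-autoregressive manner.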
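The difference between autoregressive and non-autoregressive generation can be illustrated with a toy sketch. The functions below are hypothetical and are not any of the named models; they only show that an autoregressive generator must run step by step because each output depends on the previous one, while a non-autoregressive generator computes every output from the conditioning input alone and can therefore be dispatched in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def autoregressive_generate(conditioning, step, initial=0.0):
    """Each output depends on the previous output, so the loop is sequential."""
    outputs, previous = [], initial
    for c in conditioning:
        previous = step(previous, c)
        outputs.append(previous)
    return outputs

def non_autoregressive_generate(conditioning, frame_fn, workers=4):
    """Each output depends only on its own conditioning value,
    so all positions can be computed independently (here, via a thread pool)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(frame_fn, conditioning))

cond = [1.0, 2.0, 3.0]
ar = autoregressive_generate(cond, lambda prev, c: 0.5 * prev + c)
nar = non_autoregressive_generate(cond, lambda c: 2.0 * c)
print(ar)   # [1.0, 2.5, 4.25]
print(nar)  # [2.0, 4.0, 6.0]
```

In a real model the parallelism comes from batched tensor operations on a GPU rather than threads; the thread pool here only stands in for "independent positions".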

Data

Of the 13,000 utterances, 12,950 are used as training data and 50 as validation data. For testing, 25 sentences that are not among the recorded sentences are composed separately and used for performance evaluation.
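A split with these counts can be sketched as below. The source gives only the sizes (12,950 training and 50 validation utterances out of 13,000); the seeded random shuffle and the function name are assumptions, since the excerpt does not say how the validation utterances were chosen.

```python
import random

def split_utterances(utterance_ids, n_valid=50, seed=0):
    """Return (train, valid) lists; selection by seeded shuffle is an assumption."""
    ids = list(utterance_ids)
    random.Random(seed).shuffle(ids)
    return ids[n_valid:], ids[:n_valid]

train, valid = split_utterances(range(13000))
print(len(train), len(valid))  # 12950 50
```

The 25 test sentences are outside the recorded corpus, so they would be kept in a separate list rather than drawn from this split.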

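Performance/Results

FastSpeech, FastSpeech 2, and FastPitch are presented as Text2Mel techniques, which generate a spectrogram from text, and Parallel WaveGAN, Multi-band MelGAN, and WaveGlow as vocoder techniques, which generate synthesized speech from a spectrogram. The RTF was measured to check whether real-time processing is possible, and the resulting values show that all of the presented methods can comfortably run in real time. As for trained model size, all models except WaveGlow are tens to hundreds of megabytes, so they can be applied to embedded environments with limited memory.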
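The real-time factor (RTF) is commonly defined as the time taken to synthesize an utterance divided by the duration of the audio produced, so a value below 1 means synthesis is faster than playback. The sketch below assumes that common definition; the paper's exact measurement protocol is not given in this excerpt, and `measure_rtf` and the stand-in synthesizer are hypothetical.

```python
import time

def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    """RTF = processing time / duration of the generated audio."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds

def measure_rtf(synthesize, text, sample_rate):
    """Time one call to `synthesize(text)`, which must return a sequence of samples."""
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform), sample_rate)

# Worked example: 0.5 s of compute for 44,100 samples at 22,050 Hz (2 s of audio).
print(real_time_factor(0.5, 44100, 22050))  # 0.25

# Usage with a stand-in synthesizer that returns one second of silence.
rtf = measure_rtf(lambda text: [0.0] * 22050, "hello", 22050)
print(rtf < 1.0)
```

For a full TTS system the timed call would cover both the Text2Mel stage and the vocoder, which matches how Table 3 reports RTF per system rather than per module.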
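Taken together, the listening evaluation of the synthesized speech from the three TTS systems listed in Table 3 shows that, although the quality of the synthesized speech differs slightly from that of the recorded speech, all three systems generate speech that is as intelligible and natural as human speech.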

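Future work

The real-time results in this paper were obtained by simulation in a computer environment. In future work, we plan to study how to implement the presented TTS systems in an embedded environment and how to verify their performance there.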

