Text-to-speech with linear spectrogram prediction for quality and speed improvement

윤혜빈 (Department of English Language and Literature, Korea University)

doi:10.13064/ksss.2021.13.3.071

말소리와 음성과학 = Phonetics and speech sciences, v.13 no.3, 2021, pp.71 - 78

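Abstract (translated from the Korean)

Most neural-network-based speech synthesis models use a vocoder model to generate high-quality, natural-sounding speech. The vocoder model is combined with a mel spectrogram prediction model and converts the mel spectrogram into speech. However, using a vocoder model requires a large amount of computer memory and training time, and speech synthesis takes a long time in real service environments where no GPU is available. Existing linear spectrogram prediction models do not use a vocoder model and so avoid this problem, but they fail to generate high-quality speech. This paper presents a Tacotron 2 and Transformer-based linear spectrogram prediction model that generates good-quality speech without using a neural vocoder. In experiments measuring the model's performance and speed, it showed a slight advantage over vocoder-based models on both counts, and it is therefore expected to serve as a stepping stone for research on speech synthesis models that generate high-quality speech quickly.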

Abstract

Most neural-network-based speech synthesis models use neural vocoders to convert mel-scaled spectrograms into high-quality, human-like speech. However, neural vocoders combined with mel-scaled spectrogram prediction models demand considerable memory and time during training and suffer from slow inference in environments without a GPU. Linear spectrogram prediction models avoid this problem because they do not use neural vocoders, but they suffer from low voice quality. As a solution, this paper proposes a Tacotron 2 and Transformer-based linear spectrogram prediction model that produces high-quality speech without a neural vocoder. Experiments suggest that this model can serve as the foundation of a high-quality text-to-speech model with fast inference speed.
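
As a rough illustration of the trade-off the abstract describes, the sketch below compares the size of one linear spectrogram frame with one mel spectrogram frame. The sample rate, FFT size, number of mel bands, and the HTK-style mel formula are all assumptions chosen for illustration; the abstract does not state the paper's actual settings or how its predicted linear spectrograms are converted to audio. The point is only that a mel frame merges many linear frequency bins into each band, which is one reason mel-based systems typically rely on a learned vocoder to recover the waveform.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def linear_bins(n_fft: int) -> int:
    """Number of non-redundant frequency bins in an n_fft-point STFT frame."""
    return n_fft // 2 + 1


def mel_band_edges(n_mels: int, sample_rate: int) -> list:
    """Edge frequencies (Hz) of n_mels triangular bands spaced evenly in mels.

    Returns n_mels + 2 values running from 0 Hz to the Nyquist frequency.
    """
    lo = hz_to_mel(0.0)
    hi = hz_to_mel(sample_rate / 2)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


# Hypothetical analysis settings, not taken from the paper.
SAMPLE_RATE = 22050
N_FFT = 1024
N_MELS = 80

n_linear = linear_bins(N_FFT)
edges = mel_band_edges(N_MELS, SAMPLE_RATE)

# Width of one linear STFT bin in Hz, and how many such bins the
# highest mel band (a triangle spanning edges[-3]..edges[-1]) covers.
bin_hz = SAMPLE_RATE / N_FFT
bins_in_top_band = (edges[-1] - edges[-3]) / bin_hz

print(f"linear bins per frame: {n_linear}")
print(f"mel bands per frame:   {N_MELS}")
print(f"linear bins merged into the top mel band: {bins_in_top_band:.1f}")
```

Under these assumed settings, the highest mel band spans many linear bins, so the fine frequency detail in that region cannot be read back from the mel frame alone. A model that predicts the linear spectrogram directly keeps that detail, at the cost of a larger output per frame.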
