[논문]End-to-end 비자기회귀식 가속 음성합성기

김위백; 남호성

doi:10.13064/ksss.2021.13.4.047

End-to-end 비자기회귀식 가속 음성합성기
End-to-end non-autoregressive fast text-to-speech 원문보기

말소리와 음성과학 = Phonetics and speech sciences, v.13 no.4, 2021년, pp.47 - 53

초록
AI-Helper

Autoregressive한 TTS 모델은 불안정성과 속도 저하라는 본질적인 문제를 안고 있다. 모델이 time step t의 데이터를 잘못 예측했을 때, 그 뒤의 데이터도 모두 잘못 예측하는 것이 불안정성 문제이다. 음성 출력 속도 저하 문제는 모델이 time step t의 데이터를 예측하려면 time step 1부터 t-1까지의 예측이 선행해야 한다는 조건에서 발생한다. 본 연구는 autoregression이 야기하는 문제의 대안으로 end-to-end non-autoregressive 가속 TTS 모델을 제안한다. 본 연구의 모델은 Tacotron 2 - WaveNet 모델과 근사한 MOS, 더 높은 안정성 및 출력 속도를 보였다. 본 연구는 제안한 모델을 토대로 non-autoregressive한 TTS 모델 개선에 시사점을 제공하고자 한다.

Abstract ▼ AI-Helper

Autoregressive Text-to-Speech (TTS) models suffer from inference instability and slow inference speed. Inference instability occurs when a poorly predicted sample at time step t affects all the subsequent predictions. Slow inference speed arises from a model structure that forces the predicted samples from time steps 1 to t-1 to predict the sample at time step t. In this study, an end-to-end non-autoregressive fast text-to-speech model is suggested as a solution to these problems. The results of this study show that this model's Mean Opinion Score (MOS) is close to that of Tacotron 2 - WaveNet, while this model's inference speed and stability are higher than those of Tacotron 2 - WaveNet. Further, this study aims to offer insight into the improvement of non-autoregressive models.

주제어

참고문헌 (18)

Arik, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., ... Shoeybi, M. (2017). Deep voice: Real-time neural text-to-speech. Retrieved from https://arxiv.org/abs/1702.07825
Cho, K. (2013). Boltzmann machines and denoising autoencoders for image denoising. Retrieved from https://arxiv.org/abs/1301.3468
Dvorak, J. L. (2011). Moving wearables into the mainstream: Taming the Borg. New York, NY: Springer.
Griffin, D., & Lim, J. (1983, April). Signal estimation from modified short-time Fourier transform. Proceedings of the 8th International Conference on Acoustics, Speech, and Signal Processing (pp. 804-807). Boston, MA.
Holmes, J., & Holmes, W. (2002). Speech synthesis and recognition. London, UK: CRC Press.
Kumar, K., Kumar, R., de Boissiere, T. Gestin, L., Teoh, W. Z., Sotelo, J., de Brebisson, A., ... Courville, A. (2019). MelGAN: Generative adversarial networks for conditional waveform synthesis. Retrieved from https://arxiv.org/abs/1910.06711
Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2019). FastSpeech: Fast, robust and controllable text to speech. Retrieved from https://arxiv.org/abs/1905.09263
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., ... Wu, Y. (2017). Natural TTS synthesis by conditioning Wavenet on mel spectrogram predictions. Retrieved from https://arxiv.org/abs/1712.05884
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Retrieved from https://arxiv.org/abs/1409.3215
Tachibana, H., Uenoyama, K., & Aihara, S. (2017). Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. Retrieved from https://arxiv.org/abs/1710.08969
Valle, R., Shih, K., Prenger, R., & Catanzaro, B. (2020). Flowtron: An autoregressive flow-based generative network for text-to-speech synthesis. Retrieved from https://arxiv.org/abs/2005.05957
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Retrieved from https://arxiv.org/abs/1706.03762
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., ... Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. Retrieved from https://arxiv.org/abs/1609.03499
van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., ... Hassabis, D. (2017). Parallel WaveNet: Fast high-fidelity speech synthesis. Retrieved from https://arxiv.org/abs/1711.10433
Wang, T., Liu, X., Tao, J., Yi, J., Fu, R., & Wen, Z. (2020, October). Non-autoregressive end-to-end TTS with coarse-to-fine decoding. Proceedings of the 21st Annual Conference of the International Speech Communication Association (pp. 3984-3988). Shanghai, China.
Wang, Y., Skerry-Ryan, RJ., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., ... Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. Retrieved from https://arxiv.org/abs/1703.10135
Yamamoto, R., Song, E., & Kim, J. M. (2019). Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. Retrieved from https://arxiv.org/abs/1910.11480
Yarrington, D. (2007). Synthesizing speech for communication devices. In K. Greenebaum, & R. Barzel (Eds.), Audio anecdotes: Tools, tips and techniques for digital audio (Vol. 3, pp. 143-155). Wellesley, MA: AK Peters.

내보내기 구분	파일저장 인쇄 메일전송
구성항목	기본정보 상세정보 관리번호, 논문명, 저널/프로시딩명, 저자 , 발행년, 권, 호, 시작페이지, 끝페이지, 발행기관 관리번호, 논문명, 대등논문명, 저자 , 저널/프로시딩명, 발행기관, 발행년, 발행언어, 권, 호, 시작페이지, 끝페이지, ISBN, ISSN, 주제분야, 키워드, 초록(한글), 초록(영문), 저자(소속기관)
저장형식	Text(ASCII format) Excel format RefWorks Direct Export RIS format (for Reference Manager, ProCite, EndNote), Scholar's Aids, Mendeley
메일정보	받는사람 (필수) @ 보내는사람 (선택) @ 제목 내용 KISTI 검색결과 이메일 서비스
안내	총 건의 자료가 검색되었습니다. 다운받으실 자료의 인덱스를 입력하세요. (1-10,000) 검색결과의 순서대로 최대 10,000건 까지 다운로드가 가능합니다. 데이타가 많을 경우 속도가 느려질 수 있습니다.(최대 2~3분 소요) 다운로드 파일은 UTF-8 형태로 저장됩니다. 파일의 내용이 제대로 보이지 않을실 때는 웹브라우저 상단의 보기 -> 인코딩 -> 자동선택 여부를 확인하십시오. ~ Text(ASCII format) Excel format

연합인증

End-to-end 비자기회귀식 가속 음성합성기
End-to-end non-autoregressive fast text-to-speech 원문보기

초록
AI-Helper

Abstract ▼ AI-Helper

주제어

참고문헌 (18)

이 논문을 인용한 문헌

저자의 다른 논문 :

관련 콘텐츠

원문 보기

원문 URL 링크

오픈액세스(OA) 유형

연관된 기능

이 논문과 함께 이용한 콘텐츠

AI-Helper ※ AI-Helper는 오픈소스 모델을 사용합니다.

선택된 텍스트

연합인증

End-to-end 비자기회귀식 가속 음성합성기 End-to-end non-autoregressive fast text-to-speech 원문보기

초록 용어보기논문에서 용어와 풀이말을 자동 추출한 결과로, 시범 서비스 중입니다. AI-Helper

Abstract ▼ AI-Helper

주제어

참고문헌 (18)

이 논문을 인용한 문헌

저자의 다른 논문 :

남호성 (4)

관련 콘텐츠

원문 보기

원문 URL 링크

오픈액세스(OA) 유형

연관된 기능

이 논문과 함께 이용한 콘텐츠

AI-Helper ※ AI-Helper는 오픈소스 모델을 사용합니다.

선택된 텍스트

End-to-end 비자기회귀식 가속 음성합성기
End-to-end non-autoregressive fast text-to-speech 원문보기

초록
AI-Helper