[논문]한국어 text-to-speech(TTS) 시스템을 위한 엔드투엔드 합성 방식 연구

최연주; 정영문; 김영관; 서영주; 김회린

doi:10.13064/ksss.2018.10.1.039

한국어 text-to-speech(TTS) 시스템을 위한 엔드투엔드 합성 방식 연구
An end-to-end synthesis method for Korean text-to-speech systems 원문보기

말소리와 음성과학 = Phonetics and speech sciences, v.10 no.1, 2018년, pp.39 - 48

최연주 (한국과학기술원 전기및전자공학부) , 정영문 (한국과학기술원 전기및전자공학부) , 김영관 (한국과학기술원 전기및전자공학부) , 서영주 (한국과학기술원 전기및전자공학부) , 김회린 (한국과학기술원)

Abstract ▼ AI-Helper

A typical statistical parametric speech synthesis (text-to-speech, TTS) system consists of separate modules, such as a text analysis module, an acoustic modeling module, and a speech synthesis module. This causes two problems: 1) expert knowledge of each module is required, and 2) errors generated in each module accumulate passing through each module. An end-to-end TTS system could avoid such problems by synthesizing voice signals directly from an input string. In this study, we implemented an end-to-end Korean TTS system using Google's Tacotron, which is an end-to-end TTS system based on a sequence-to-sequence model with attention mechanism. We used 4392 utterances spoken by a Korean female speaker, an amount that corresponds to 37% of the dataset Google used for training Tacotron. Our system obtained mean opinion score (MOS) 2.98 and degradation mean opinion score (DMOS) 3.25. We will discuss the factors which affected training of the system. Experiments demonstrate that the post-processing network needs to be designed considering output language and input characters and that according to the amount of training data, the maximum value of n for n-grams modeled by the encoder should be small enough.

주제어

질의응답

핵심어	질문	논문에서 추출한 답변
	디코더의 역할은?	디코더는 특정 time step 프레임의 스펙트로그램을 입력으로 받고, 다음 time step 프레임의 스펙트로그램을 출력한다. 본 연구에서는 Bahdanau et al.
	음편 선정(unit selection) 방식은 많은 양의 데이터를 사용함에도 불구하고, 연결한 두 음편 사이의 경계가 부자연스럽다는 문제, 주어진 문장에 대해 항상 똑같은 발화만이 가능하다는 문제 등이 존재한다. 이러한 한계점들을 극복하고자 제안된 것은?	이러한 한계점들을 극복하고자 통계적 파라미터 방식 음성 합성(statistical parametric speech synthesis) 시스템이 제안되었다.대표적인 예로 은닉 마르코프 모델(HMM, hidden Markov model) 기반 TTS 시스템(HTS, HMM-based speech synthesis)이 있다.
	Text-to-speech(TTS) 시스템이란?	Text-to-speech(TTS) 시스템이란 텍스트가 입력되어서 그에 대응하는 음성으로 변환되어 출력되는 시스템으로, 음성 합성 시스템이라고도 불린다. 여기서 중요한 점은 출력되는 합성음이 실제 사람이 말하는 것처럼 충분히 자연스러워야 한다는 점이다.

참고문헌 (27)

Arik, S., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., Sengupta, S., & Shoeybi, M. (2017a). Deep Voice: Real-time neural text-to-speech. Proceedings of the 34th International Conference on Machine Learning (pp. 195-204). Sydney, AU. 6-11 August, 2017.
Arik, S., Diamos, G., Gibiansky, A., Miller, J., Peng, K., Ping, W., Raiman, J., & Zhou, Y. (2017b). Deep Voice 2: Multi-speaker neural text-to-speech. Advances in Neural Information Processing Systems 30 (pp. 2966-2974). Long Beach, CA. 4-9 December, 2017.
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. Retrieved from http://arxiv.org/abs/1409.0473 [Computing Research Repository] on January 9, 2018.
Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. Proceedings of the 26th Annual International Conference on Machine Learning (pp. 41-48). 14-18 June, 2009.
Cho, K., Van Mrrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Retrieved from http://arxiv.org/abs/1406.1078 [Computing Research Repository] on January 9, 2018.
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. Retrieved from http://arxiv.org/abs/1412.3555 [Computing Research Repository] on January 9, 2018.
Collins, J., Sohl-Dickstein, J., & Sussillo, D. (2017). Capacity and trainability in recurrent neural networks. Proceedings of the 5th International Conference on Learning Representations. Retrieved from https://openreview.net/forum?idBydARw9ex on January 9, 2018.
Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), 236-243.

상세보기
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778). 26 June-1 July, 2016.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.

상세보기
Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 373-376). 7-10 May, 1996.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning (pp. 448-456). 2 Mar, 2015.
Kawahara, H. (1997). Speech representation and transformation using adaptive interpolation of weighted spectrum: Vocoder revisited. Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 1303-1306). 21-24 April, 1997.
Lee, J., Cho, K., & Hoffman, T. (2016). Fully character-level neural machine translation without explicit segmentation. Retrieved from http://arxiv.org/abs/1610.03017 [Computing Research Repository] on January 9, 2018.
Morise, M., Yokomori, F., & Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7), 1877-1884.
Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. Retrieved from http://arxiv.org/abs/1609.03499 [Computing Research Repository] on January 9, 2018.
Ping, W., Peng, K., Gibiansky, A., Arik, S., Kannan, A., Narang, S., Raiman, J., & Miller, J. (2017). Deep voice 3: Scaling text-to-speech with convolutional sequence learning. Retrieved from http://arxiv.org/abs/1710.07654 [Computing Research Repository] on January 9, 2018.
Rabiner, L., & Schafer, R. (2011). Theory and applications of digital speech processing. New Jersey: Pearson.
Raffel, C., Luong, M.-T., Liu, P., Weiss, R., & Eck, D. (2017). Online and linear-time attention by enforcing monotonic alignments. Proceedings of the 34th International Conference on Machine Learning (pp. 2837-2846). 6-11 August, 2017.
Shen, J., Pang, R., Weiss, R., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., Saurous, R., Agiomyrgiannakis, Y., & Wu, Y. (2017). Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions. Retrieved from http://arxiv.org/abs/1712.05884 [Computing Research Repository] on March 1, 2018.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.
Srivastava, R., Greef, K., & Schmidhuber, J. (2015). Highway networks. Retrieved from http://arxiv.org/abs/1505.00387 [Computing Research Repository] on January 9, 2018.
Sutskever, I., Vinyals, O., & Le, Q. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 27 (pp. 3104-3112). 8-13 December, 2014.
Tokuda, K., Nankaku, Y., Toda, T., Zen, H., Yamagishi, J., & Oura, K. (2013). Speech synthesis based on hidden markov models. Proceedings of IEEE, 101(5), 1234-1252.
Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., & Hinton, G. (2015). Grammar as a foreign language. Advances in Neural Information Processing Systems 28 (pp. 2773-2781). 7-12 December, 2015.
Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. (2017). Tacotron: Towards end-to-end speech synthesis. Retrieved from http://arxiv.org/abs/1703.10135 [Computing Research Repository] on January 9, 2018.
Wu, Z., Watts, O., & King, S. (2016). Merlin: An open source neural network speech synthesis system. Proceedings of the 9th ISCA Speech Synthesis Workshop (pp. 218-223). Sunnyvale, CA. 13-15 September, 2016.

저자의 다른 논문 :

LOADING...

내보내기 구분	파일저장 인쇄 메일전송
구성항목	기본정보 상세정보 관리번호, 논문명, 저널/프로시딩명, 저자 , 발행년, 권, 호, 시작페이지, 끝페이지, 발행기관 관리번호, 논문명, 대등논문명, 저자 , 저널/프로시딩명, 발행기관, 발행년, 발행언어, 권, 호, 시작페이지, 끝페이지, ISBN, ISSN, 주제분야, 키워드, 초록(한글), 초록(영문), 저자(소속기관)
저장형식	Text(ASCII format) Excel format RefWorks Direct Export RIS format (for Reference Manager, ProCite, EndNote), Scholar's Aids, Mendeley
메일정보	받는사람 (필수) @ 보내는사람 (선택) @ 제목 내용 KISTI 검색결과 이메일 서비스
안내	총 건의 자료가 검색되었습니다. 다운받으실 자료의 인덱스를 입력하세요. (1-10,000) 검색결과의 순서대로 최대 10,000건 까지 다운로드가 가능합니다. 데이타가 많을 경우 속도가 느려질 수 있습니다.(최대 2~3분 소요) 다운로드 파일은 UTF-8 형태로 저장됩니다. 파일의 내용이 제대로 보이지 않을실 때는 웹브라우저 상단의 보기 -> 인코딩 -> 자동선택 여부를 확인하십시오. ~ Text(ASCII format) Excel format

연합인증

한국어 text-to-speech(TTS) 시스템을 위한 엔드투엔드 합성 방식 연구
An end-to-end synthesis method for Korean text-to-speech systems 원문보기

Abstract ▼ AI-Helper

주제어

질의응답

참고문헌 (27)

이 논문을 인용한 문헌

저자의 다른 논문 :

연구과제 타임라인

관련 콘텐츠

원문 보기

원문 URL 링크

오픈액세스(OA) 유형

이 논문과 함께 이용한 콘텐츠

AI-Helper ※ AI-Helper는 오픈소스 모델을 사용합니다.

선택된 텍스트

연합인증

한국어 text-to-speech(TTS) 시스템을 위한 엔드투엔드 합성 방식 연구 An end-to-end synthesis method for Korean text-to-speech systems 원문보기

Abstract ▼ AI-Helper

주제어

질의응답

참고문헌 (27)

이 논문을 인용한 문헌

저자의 다른 논문 :

김영관 (3) 서영주 (3)

연구과제 타임라인

전체(0) 논문(0) 특허(0) 보고서(0)

전체(0) 논문(0) 특허(0) 보고서(0)

관련 콘텐츠

원문 보기

원문 URL 링크

오픈액세스(OA) 유형

이 논문과 함께 이용한 콘텐츠

AI-Helper ※ AI-Helper는 오픈소스 모델을 사용합니다.

선택된 텍스트

한국어 text-to-speech(TTS) 시스템을 위한 엔드투엔드 합성 방식 연구
An end-to-end synthesis method for Korean text-to-speech systems 원문보기