말소리와 음성과학 = Phonetics and Speech Sciences, v.10 no.1, 2018, pp.39-48
최연주 (School of Electrical Engineering, KAIST), 정영문 (School of Electrical Engineering, KAIST), 김영관 (School of Electrical Engineering, KAIST), 서영주 (School of Electrical Engineering, KAIST), 김회린 (KAIST)
A typical statistical parametric speech synthesis (text-to-speech, TTS) system consists of separate modules, such as a text analysis module, an acoustic modeling module, and a speech synthesis module. This causes two problems: 1) expert knowledge of each module is required, and 2) errors generated i...
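Question | Answer extracted from the paper |
---|---|
What is the role of the decoder? | The decoder takes the spectrogram frame at a given time step as input and outputs the spectrogram frame at the next time step. In this study, Bahdanau et al. |
Unit selection synthesis uses a large amount of data, yet the boundary between two concatenated units sounds unnatural and a given sentence can only ever be rendered as the same utterance. What was proposed to overcome these limitations? | Statistical parametric speech synthesis systems were proposed to overcome these limitations. A representative example is the hidden Markov model (HMM)-based TTS system (HTS, HMM-based speech synthesis). |
What is a text-to-speech (TTS) system? | A text-to-speech (TTS) system takes text as input and converts it into the corresponding speech as output; it is also called a speech synthesis system. The key requirement is that the synthesized speech be natural enough to sound like a real person speaking. |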
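
The decoder behavior described in the table (one spectrogram frame in, the next frame out) amounts to an autoregressive loop. The following is a minimal sketch of that loop only, not the paper's implementation: `autoregressive_decode`, `toy_step`, the all-zero starting frame, and the fixed number of steps are all assumptions made for illustration, and the sketch leaves out any conditioning on the input text.

```python
from typing import Callable, List

Frame = List[float]  # one spectrogram frame: a value per frequency bin


def autoregressive_decode(
    step_fn: Callable[[Frame], Frame], n_bins: int, n_steps: int
) -> List[Frame]:
    """Generate n_steps frames, feeding each output back in as the next input."""
    prev = [0.0] * n_bins  # assumed all-zero starting frame
    frames: List[Frame] = []
    for _ in range(n_steps):
        nxt = step_fn(prev)  # frame at time t -> frame at time t + 1
        frames.append(nxt)
        prev = nxt
    return frames


def toy_step(frame: Frame) -> Frame:
    # Hypothetical stand-in for a trained decoder: adds 1.0 to every bin.
    return [x + 1.0 for x in frame]


frames = autoregressive_decode(toy_step, n_bins=3, n_steps=4)
```

In a real system `step_fn` would be the trained network, and the loop would normally end on a learned or heuristic stop condition rather than a fixed step count.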
Arik, S., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., Sengupta, S., & Shoeybi, M. (2017a). Deep Voice: Real-time neural text-to-speech. Proceedings of the 34th International Conference on Machine Learning (pp. 195-204). Sydney, AU. 6-11 August, 2017.
Arik, S., Diamos, G., Gibiansky, A., Miller, J., Peng, K., Ping, W., Raiman, J., & Zhou, Y. (2017b). Deep Voice 2: Multi-speaker neural text-to-speech. Advances in Neural Information Processing Systems 30 (pp. 2966-2974). Long Beach, CA. 4-9 December, 2017.
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. Retrieved from http://arxiv.org/abs/1409.0473 [Computing Research Repository] on January 9, 2018.
Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. Proceedings of the 26th Annual International Conference on Machine Learning (pp. 41-48). 14-18 June, 2009.
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Retrieved from http://arxiv.org/abs/1406.1078 [Computing Research Repository] on January 9, 2018.
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. Retrieved from http://arxiv.org/abs/1412.3555 [Computing Research Repository] on January 9, 2018.
Collins, J., Sohl-Dickstein, J., & Sussillo, D. (2017). Capacity and trainability in recurrent neural networks. Proceedings of the 5th International Conference on Learning Representations. Retrieved from https://openreview.net/forum?id=BydARw9ex on January 9, 2018.
Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), 236-243.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778). 26 June-1 July, 2016.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
Hunt, A. J., & Black, A. W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 373-376). 7-10 May, 1996.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning (pp. 448-456). 2 Mar, 2015.
Kawahara, H. (1997). Speech representation and transformation using adaptive interpolation of weighted spectrum: Vocoder revisited. Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 1303-1306). 21-24 April, 1997.
Lee, J., Cho, K., & Hoffman, T. (2016). Fully character-level neural machine translation without explicit segmentation. Retrieved from http://arxiv.org/abs/1610.03017 [Computing Research Repository] on January 9, 2018.
Morise, M., Yokomori, F., & Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7), 1877-1884.
Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. Retrieved from http://arxiv.org/abs/1609.03499 [Computing Research Repository] on January 9, 2018.
Ping, W., Peng, K., Gibiansky, A., Arik, S., Kannan, A., Narang, S., Raiman, J., & Miller, J. (2017). Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. Retrieved from http://arxiv.org/abs/1710.07654 [Computing Research Repository] on January 9, 2018.
Rabiner, L., & Schafer, R. (2011). Theory and applications of digital speech processing. New Jersey: Pearson.
Raffel, C., Luong, M.-T., Liu, P., Weiss, R., & Eck, D. (2017). Online and linear-time attention by enforcing monotonic alignments. Proceedings of the 34th International Conference on Machine Learning (pp. 2837-2846). 6-11 August, 2017.
Shen, J., Pang, R., Weiss, R., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., Saurous, R., Agiomyrgiannakis, Y., & Wu, Y. (2017). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. Retrieved from http://arxiv.org/abs/1712.05884 [Computing Research Repository] on March 1, 2018.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.
Srivastava, R., Greff, K., & Schmidhuber, J. (2015). Highway networks. Retrieved from http://arxiv.org/abs/1505.00387 [Computing Research Repository] on January 9, 2018.
Sutskever, I., Vinyals, O., & Le, Q. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 27 (pp. 3104-3112). 8-13 December, 2014.
Tokuda, K., Nankaku, Y., Toda, T., Zen, H., Yamagishi, J., & Oura, K. (2013). Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5), 1234-1252.
Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., & Hinton, G. (2015). Grammar as a foreign language. Advances in Neural Information Processing Systems 28 (pp. 2773-2781). 7-12 December, 2015.
Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., & Saurous, R. (2017). Tacotron: Towards end-to-end speech synthesis. Retrieved from http://arxiv.org/abs/1703.10135 [Computing Research Repository] on January 9, 2018.
Wu, Z., Watts, O., & King, S. (2016). Merlin: An open source neural network speech synthesis system. Proceedings of the 9th ISCA Speech Synthesis Workshop (pp. 218-223). Sunnyvale, CA. 13-15 September, 2016.