[Paper] 한국어 음소 단위 LSTM 언어모델을 이용한 문장 생성

안성만; 정여진; 이재준; 양지헌

doi:10.13088/jiis.2017.23.2.071

Korean Sentence Generation Using Phoneme-Level LSTM Language Model

Journal of Intelligence and Information Systems (지능정보연구), v.23 no.2, 2017, pp.71-88

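안성만 (School of Business Administration, Kookmin University), 정여진 (School of Business Administration, Kookmin University), 이재준 (Department of Data Science, Kookmin University), 양지헌 (Department of Data Science, Kookmin University)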

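Abstract (translated from the Korean)

A language model predicts the next word or character from sequentially entered data and is used in language processing and speech recognition. With recent advances in deep learning algorithms, recurrent neural network models, which can effectively capture dependencies between input items, and the long short-term memory (LSTM) models that extend them are being used for language modeling. To feed data into such models, a sentence is usually decomposed into words or morphemes, and a word-level or morpheme-level model is then used. However, because a text generally contains a very large number of words or morphemes, the dictionary becomes large and model complexity increases accordingly; such models also cannot generate vocabulary outside the dictionary. In particular, for a language with rich morphological inflection such as Korean, decomposition by a morphological analyzer can introduce additional errors. To address this, this paper proposes a phoneme-level LSTM language model that decomposes sentences into phonemes (consonants and vowels) and uses them as input data. The models used contain three or four LSTM layers. To optimize the models, the stochastic gradient algorithm and various algorithms that improve on it are used, and their performance is compared. Experiments were run on the text of the Old Testament, and all experiments were carried out with the Keras package running on Theano. For quantitative comparison, the validation loss and the perplexity on the test set were computed. The stochastic gradient algorithm showed relatively large validation loss and perplexity, while the other optimization algorithms showed similar values and a comparable level of model complexity. The four-layer model took on average about 69% longer to train than the three-layer model, yet the quantitative measures did not improve much, or even worsened under certain conditions. Nevertheless, the four-layer model generated more complete sentences than the three-layer model. Under none of the simulation conditions considered did the models generate a character combination that is not used in Hangul, and the generated sentences were of considerably high quality in terms of noun-particle combinations, verb conjugation, and subject-verb agreement. The results are expected to be widely applicable to Korean processing in language processing and speech recognition, the fields that underlie the artificial intelligence systems now emerging.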

Abstract

Language models were originally developed for speech recognition and language processing. Given a set of example sentences, a language model predicts the next word or character from sequential input data. N-gram models have been widely used, but they cannot model the correlation between input units efficiently because they are probabilistic models based on the frequency of each unit in the training set. Recently, with the development of deep learning algorithms, recurrent neural network (RNN) models and long short-term memory (LSTM) models have been widely used for neural language modeling (Ahn, 2016; Kim et al., 2016; Lee et al., 2016). These models can reflect dependencies between the objects that are entered sequentially into the model (Gers and Schmidhuber, 2001; Mikolov et al., 2010; Sundermeyer et al., 2012). To train a neural language model, texts need to be decomposed into words or morphemes. However, since a training set of sentences generally includes a huge number of words or morphemes, the dictionary is very large, which increases model complexity. In addition, word-level or morpheme-level models can generate only the vocabulary contained in the training set. Furthermore, for morphologically rich languages such as Turkish, Hungarian, Russian, Finnish, or Korean, morpheme analyzers are more likely to introduce errors in the decomposition process (Lankinen et al., 2016). Therefore, this paper proposes a phoneme-level language model for Korean based on LSTM models. A phoneme, such as a vowel or a consonant, is the smallest unit that makes up Korean text. We construct the language model using three or four LSTM layers. Each model was trained using the stochastic gradient algorithm and more advanced optimization algorithms: Adagrad, RMSprop, Adadelta, Adam, Adamax, and Nadam. The simulation study was done with Old Testament texts using the deep learning package Keras, based on Theano.
After pre-processing, the dataset included 74 unique characters, including vowels, consonants, and punctuation marks. We then constructed each input vector from 20 consecutive characters and each output from the following 21st character. In total, 1,023,411 input-output pairs were included in the dataset, and we divided them into training, validation, and test sets in the proportion 70:15:15. All simulations were conducted on a system equipped with an Intel Xeon CPU (16 cores) and an NVIDIA GeForce GTX 1080 GPU. We compared the loss function evaluated on the validation set, the perplexity evaluated on the test set, and the time taken to train each model. All the optimization algorithms except the stochastic gradient algorithm showed similar validation loss and perplexity, which were clearly superior to those of the stochastic gradient algorithm. The stochastic gradient algorithm also took the longest to train for both the 3- and 4-LSTM-layer models. On average, the 4-LSTM-layer model took 69% longer to train than the 3-LSTM-layer model. However, the validation loss and perplexity did not improve significantly, or even became worse under specific conditions. On the other hand, when the automatically generated sentences were compared, the 4-LSTM-layer model tended to generate sentences closer to natural language than the 3-LSTM-layer model. Although there were slight differences between the models in the completeness of the generated sentences, the sentence generation performance was quite satisfactory under all simulation conditions: the models generated only legitimate Korean letters, and the use of postpositions and the conjugation of verbs were almost perfect grammatically. The results of this study are expected to be widely used for processing Korean in language processing and speech recognition, the fields that form the basis of artificial intelligence systems.

AI summary of the main text

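Problem definition

This paper splits text into phonemes (consonants and vowels), a unit smaller than the character, builds a dictionary from them, and constructs a sentence generation system on that basis. Because the dictionary becomes markedly smaller, the step of mapping the data through an embedding layer into a space of lower dimension than the dictionary size can be omitted, and the drawbacks mentioned earlier, such as increased model complexity and limited vocabulary diversity, can be mitigated.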
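As a minimal sketch of what skipping the embedding layer can look like, the snippet below builds a dictionary of unique symbols and encodes each symbol as a one-hot vector whose length equals the dictionary size. One-hot encoding is an assumption here (the summary only says the embedding step can be omitted), and `build_dictionary`, `one_hot`, and the toy symbol sequence are hypothetical names and data chosen for illustration.

```python
def build_dictionary(symbols):
    """Map each unique symbol to an integer index (sorted for reproducibility)."""
    return {s: i for i, s in enumerate(sorted(set(symbols)))}


def one_hot(symbol, dictionary):
    """Encode one symbol as a 0/1 vector with a single 1 at the symbol's index."""
    vec = [0] * len(dictionary)
    vec[dictionary[symbol]] = 1
    return vec


# Toy phoneme sequence (illustrative only, not taken from the paper's data).
symbols = list("ㅎㅏㄴㄱㅡㄹ ㅎㅏㄴㄱㅜㄱ")
dictionary = build_dictionary(symbols)
encoded = [one_hot(s, dictionary) for s in symbols]
```

With a dictionary of only a few dozen symbols, vectors of this kind stay small enough to feed to the network directly.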

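Proposed method

This paper proposed a phoneme-level sentence generation model for Korean text. Model performance was compared across various optimization algorithms and numbers of LSTM layers. With the SGD algorithm, validation loss and perplexity were markedly larger than with the other algorithms, indicating high predictive uncertainty, while the remaining algorithms showed broadly similar uncertainty.
In this study, models are built with three or four LSTM layers and trained with various optimization algorithms.
This paper builds a model that generates sentences by decomposing an incomplete sentence into phonemes, feeding them in, and predicting the phoneme that follows. The sentences in the experimental data are split into phoneme-level characters to build a dictionary of unique characters; given an input vector of n characters, c = (c_1, c_2, ..., c_n)^T, the model computes the probability that each character in the dictionary is c_{n+1} and outputs the character with the highest probability.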
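The two steps described above, splitting text into phonemes and picking the most probable next symbol, can be sketched as follows. The decomposition uses the standard Unicode arithmetic for precomposed Hangul syllables; whether the paper splits compound final consonants further, and which exact symbol set it uses, is not stated here, so treat those choices as assumptions. `to_phonemes` and `most_likely_next` are hypothetical helper names.

```python
# Unicode composes each precomposed Hangul syllable as
#   code = 0xAC00 + (lead * 21 + vowel) * 28 + tail
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def to_phonemes(text):
    """Split Hangul syllables into jamo; pass every other character through."""
    out = []
    for ch in text:
        offset = ord(ch) - 0xAC00
        if 0 <= offset < 19 * 21 * 28:
            lead, rest = divmod(offset, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(LEADS[lead])
            out.append(VOWELS[vowel])
            if TAILS[tail]:  # index 0 means "no final consonant"
                out.append(TAILS[tail])
        else:
            out.append(ch)
    return out


def most_likely_next(probabilities, dictionary):
    """Return the dictionary symbol with the highest predicted probability."""
    best = max(range(len(dictionary)), key=lambda i: probabilities[i])
    return dictionary[best]


phonemes = to_phonemes("한글 1")
# phonemes -> ['ㅎ', 'ㅏ', 'ㄴ', 'ㄱ', 'ㅡ', 'ㄹ', ' ', '1']
```

In the setting described above, `probabilities` would come from the trained network's output for an input window; here it is just a list of numbers aligned with `dictionary`.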

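Data

The data used in the experiments is the text of the Old Testament. Pre-processing split the entire dataset into phonemes, yielding data made up of 74 characters, including consonants, vowels, digits, parentheses, and quotation marks.
A total of 1,023,411 input-output pairs were constructed, each with 20 consecutive phonemes as input and the following 21st phoneme as output, and these were divided into training, validation, and test sets in a 70:15:15 ratio. All experiments were run on a PC with one Intel Xeon CPU (16 cores) and one NVIDIA GeForce GTX 1080 GPU, using the Keras package (Chollet, 2015) with Theano (Theano Development Team, 2016) set as the backend.
Taking the algorithm with the smallest validation loss and perplexity as the best model, the results obtained with the Nadam algorithm were selected as final for the three-layer model and those obtained with the Adam algorithm for the four-layer model. To assess the trained models qualitatively, ‘여호와여 주는 나의’ ("O LORD, you are my") was used as the seed input and 1,000 phonemes were generated after it. From the generated output, the portion from the first appearance of a chapter reference (for example, 창1:1, i.e. Genesis 1:1) to the end of that verse was excerpted and included in Table 4.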
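The input-output construction described in this section can be sketched as a sliding window over the phoneme sequence. The window length of 20 and the 70:15:15 ratio come from the text; splitting in sequence order (rather than shuffling first) is an assumption, and `make_windows` and `split_70_15_15` are hypothetical names.

```python
def make_windows(symbols, window=20):
    """Pair every run of `window` consecutive symbols with the symbol after it."""
    inputs, targets = [], []
    for start in range(len(symbols) - window):
        inputs.append(symbols[start:start + window])
        targets.append(symbols[start + window])
    return inputs, targets


def split_70_15_15(items):
    """Split a list into training, validation, and test parts (70:15:15)."""
    n_train = len(items) * 70 // 100
    n_val = len(items) * 15 // 100
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


# Toy sequence of 120 symbols -> 100 (input, output) pairs.
toy = list(range(120))
inputs, targets = make_windows(toy)
train, val, test = split_70_15_15(list(zip(inputs, targets)))
```

A sequence of N symbols yields N - 20 pairs, so the pair count tracks the length of the pre-processed text.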

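Theory/Model

The final number of epochs is shown in Tables 1 and 2. Models containing three and four LSTM layers were each trained with the seven different optimization algorithms introduced earlier. Each LSTM layer was configured with 512 output units, and to prevent overfitting (Srivastava et al.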
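To give a rough sense of how a fourth LSTM layer changes model size, the sketch below counts trainable parameters for a stack of 512-unit LSTM layers. It assumes the standard LSTM parameterization (four gate blocks, each with input weights, recurrent weights, and a bias), a 74-dimensional one-hot input, and a final dense softmax layer over the 74 symbols; the input encoding and the output layer are assumptions rather than details confirmed by this summary, and the function names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Parameters of one standard LSTM layer: 4 gate blocks of
    input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, output_dim):
    """Parameters of a fully connected layer with a bias."""
    return input_dim * output_dim + output_dim


def total_params(num_layers, vocab_size=74, units=512):
    """Stacked LSTM layers followed by a dense layer over the vocabulary."""
    total = lstm_params(vocab_size, units)                    # first layer sees the input
    total += (num_layers - 1) * lstm_params(units, units)     # deeper layers see 512 units
    total += dense_params(units, vocab_size)                  # assumed softmax output layer
    return total


three_layer = total_params(3)   # 5,438,538 under these assumptions
four_layer = total_params(4)    # 7,537,738 under these assumptions
```

Under these assumptions each additional LSTM layer adds 2,099,200 parameters; dropout layers add none.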

Performance/Effects

Comparing the three-layer and four-layer models, training the four-layer model took about 69% longer on average, while quantitative measures such as validation loss and perplexity barely changed, or even worsened when the SGD algorithm was used. However, when the sentences actually generated by the models from the Old Testament text were compared, the four-layer model produced relatively more complete results than the three-layer model.
Although the differences were slight, Nadam, Adam, and Adamax, which improve both the learning rate and the gradient computation during optimization, outperformed the algorithms that improve only the learning rate.
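For readers unfamiliar with the two metrics, the sketch below shows one common way to relate them: perplexity as the exponential of the average cross-entropy (in nats) over the true next symbols. The exact definition used in the paper is not given in this summary, so the natural-log base is an assumption, and `cross_entropy` and `perplexity` are hypothetical names.

```python
import math


def cross_entropy(true_symbol_probs):
    """Average negative log-probability assigned to the true next symbols."""
    return -sum(math.log(p) for p in true_symbol_probs) / len(true_symbol_probs)


def perplexity(true_symbol_probs):
    """Exponential of the average cross-entropy (natural-log convention)."""
    return math.exp(cross_entropy(true_symbol_probs))


# A model that spreads probability uniformly over 74 symbols has perplexity
# of about 74; a model that is always certain and correct has perplexity 1.
uniform = perplexity([1 / 74] * 10)
certain = perplexity([1.0] * 5)
```

Lower perplexity means the model is, on average, less uncertain about the next symbol, which is why it is read alongside validation loss.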

Future work

Such automatic sentence generation models are a foundational technology for speech recognition and language processing, which underlie the various AI systems now emerging, and are expected to be useful for building Korean-language systems, where research still lags behind that for English. Developing the proposed model further with more advanced techniques, for example a model that uses word-to-character and character-to-word stages to combine the strengths of phoneme-level and word-level models, should make it possible to build models that generate sentences even closer to natural language.

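Q&A

Keyword	Question	Answer extracted from the paper
	What is a language model?	A language model predicts the next word or character from sequentially entered data and is used in language processing and speech recognition. With recent advances in deep learning algorithms, recurrent neural network models, which can effectively capture dependencies between input items, and the long short-term memory (LSTM) models that extend them are being used for language modeling.
	In which fields are language models used?	A language model predicts the next word or character from sequentially entered data and is used in language processing and speech recognition. With recent advances in deep learning algorithms, recurrent neural network models, which can effectively capture dependencies between input items, and the long short-term memory (LSTM) models that extend them are being used for language modeling.
	In which fields is the proposed phoneme-level LSTM language model expected to be used?	Under none of the simulation conditions considered did the models generate a character combination that is not used in Hangul, and the generated sentences were of considerably high quality in terms of noun-particle combinations, verb conjugation, and subject-verb agreement. The results are expected to be widely applicable to Korean processing in language processing and speech recognition, the fields that underlie the artificial intelligence systems now emerging.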

References (33)

Ahn, S. "Deep Learning Architectures and Applications." Journal of Intelligence and Information Systems, Vol. 22, No. 2 (2016), 127-142.
Bojanowski, P., Joulin, A., and Mikolov, T. "Alternative Structures for Character-Level RNNs." arXiv:1511.06303 (2015).
Cauchy, A. "Methode generale pour la resolution des systemes d'equations simultanees." Comp. Rend. Sci. Paris, Vol.25 (1847), 536-538.
Chollet, F. "Keras." Available at https://github.com/fchollet/keras (downloaded 1 December, 2016).
Chung, J., Cho, K., and Bengio, Y. "A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation." arXiv:1603.06147 (2016).
Olah, Christopher. "Understanding LSTM Networks." Colah's Blog. Available at http://colah.github.io/posts/2015-08-Understanding-LSTMs/ (downloaded 1 December, 2016).
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. "Large Scale Distributed Deep Networks." In Advances in neural information processing systems, (2012), 1223-1231.
Dozat, T. "Incorporating Nesterov Momentum into Adam." Technical report, Stanford University, Available at http://cs229.stanford.edu/proj2015/054report.pdf (2015).
Duchi, J., Hazan, E., and Singer, Y. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." Journal of Machine Learning Research, Vol. 12 (2011), 2121-2159.
Gers, F. A. and Schmidhuber, E. "LSTM Recurrent Networks Learn Simple Context-Free and Context-Sensitive Languages." IEEE Transactions on Neural Networks, Vol. 12, No. 6 (2001), 1333-1340.
Goodfellow, I., Bengio, Y., and Courville, A. "Deep Learning." MIT Press, Massachusetts, 2016.
Hinton, G., Srivastava, N., and Swersky, K. "Neural networks for machine learning." Coursera, video lectures, Available at https://www.coursera.org/learn/neural-networks (downloaded 1 December, 2016).
Hochreiter, S. and Schmidhuber, J. "Long Short-Term Memory." Neural Computation, Vol. 9, No. 8 (1997), 1735-1780.
Hutter, M. "The Human Knowledge Compression Prize." Available at http://prize.hutter1.net/ (2006).
Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. "Exploring the Limits of Language Modeling." arXiv:1602.02410 (2016).
Kim, Y., Jernite, Y., Sontag, D., and Rush, A. M. "Character-Aware Neural Language Models." arXiv:1508.06615 (2015).
Kim, Y.-h., Hwang, Y.-k., Kang, T.-g., and Jung, K.-m. "LSTM Language Model Based Korean Sentence Generation." The Journal of Korean Institute of Communications and Information Sciences, Vol. 41, No. 5 (2016), 592-601.
Kingma, D. and Ba, J. "Adam: A Method for Stochastic Optimization." arXiv:1412.6980 (2014).
Lankinen, M., Heikinheimo, H., Takala, P., and Raiko, T. "A Character-Word Compositional Neural Language Model for Finnish." arXiv:1612.03266 (2016).
Lee, D., Oh, Kh., and Choi, H.-J. "Measuring the Syntactic Similarity between Korean Sentences Using RNN." In Proceedings of Korea Computer Congress (2016a), 792-794.
Lee, J., Cho, K., and Hofmann, T. "Fully Character-Level Neural Machine Translation without Explicit Segmentation." arXiv:1610.03017 (2016b).
Ling, W., Luis, T., Marujo, L., Astudillo, R. F., Amir, S., Dyer, C., Black, A. W., and Trancoso, I. "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation." arXiv:1508.02096 (2015).
Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., and Khudanpur, S. "Recurrent Neural Network Based Language Model." In Proceedings of Interspeech (2010), 1045-1048.
Mikolov, T. and Zweig, G. "Context Dependent Recurrent Neural Network Language Model." SLT (2012), 234-239.
Polyak, B. T. "Some Methods of Speeding Up the Convergence of Iteration Methods." USSR Computational Mathematics and Mathematical Physics, Vol. 4, No. 5 (1964), 1-17.
Rissanen, J. and Langdon, G. G. "Arithmetic Coding." IBM Journal of Research and Development, Vol. 23, No. 2 (1979), 149-162.
Socher, R. and Mundra, R. S. "CS 224D: Deep Learning for NLP1." Available at http://cs224d.stanford.edu/ (downloaded 1 December, 2016).
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research, Vol. 15, No. 1 (2014), 1929-1958.
Sundermeyer, M., Schlüter, R., and Ney, H. "LSTM Neural Networks for Language Modeling." In Proceedings of Interspeech (2012), 194-197.
Sutskever, I. and Martens, J. "Generating Text with Recurrent Neural Networks." In Proceedings of the 28th International Conference on Machine Learning (2011), 1017-1024.
Theano Development Team. "Theano: A Python Framework for Fast Computation of Mathematical Expressions." arXiv:1605.02688 (2016).
Ward, D. J., Blackwell, A. F., and MacKay, D. J. "Dasher-a Data Entry Interface Using Continuous Gestures and Language Models." In Proceedings of the 13th annual ACM symposium on User interface software and technology (2000), 129-137.
Zeiler, M. D. "ADADELTA: An Adaptive Learning Rate Method." arXiv:1212.5701 (2012).
