[Paper] LSTM based sequence-to-sequence Model for Korean Automatic Word-spacing

이태석; 강승식

doi:10.30693/smj.2018.7.4.17

Abstract (Korean)

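We present an LSTM (Long Short-Term Memory Neural Networks)-based RNN model that can effectively handle the characteristics of automatic word spacing, and analyze the results of applying it. To address the difficulty of training a neural network when sentences are long or contain some noise, we define an input data format and a decoding data format, and propose improving performance by applying dropout, bidirectional multi-layer LSTM cells, layer normalization, and an attention mechanism during training. The Sejong corpus was used as training data; although the training data contained partially incomplete spacing, patterns of Korean word spacing were learned meaningfully from the large volume of data. This is because dropout kept the model from overfitting, which made it robust to noise. In the experiments, the LSTM sequence-to-sequence model achieved an F1 score, a measure that combines recall and precision, of 0.94, outperforming both the rule-based method and the deep-learning GRU-CRF model.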

Abstract (English)

We propose an LSTM-based RNN model that effectively handles the characteristics of automatic word spacing. For long or noisy sentences, which are known to be difficult to handle in neural network training, we define a proper input data format and decoding data format, and add dropout, bidirectional multi-layer LSTM, layer normalization, and an attention mechanism to improve performance. Although the Sejong corpus contains some spacing errors, the model, kept from overfitting by dropout and therefore robust to noise, learned meaningful patterns of Korean word spacing. The experimental results show that the LSTM sequence-to-sequence model achieves an F1-measure of 0.94, which is better than both the rule-based method and the deep-learning GRU-CRF method.
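The F1-measure mentioned in the abstract combines precision and recall as F1 = 2PR / (P + R). The abstract does not state the unit over which the authors count matches, so the following is only a minimal sketch that assumes scoring on space positions: a predicted space counts as correct when the reference has a space after the same character. The function names `space_positions` and `spacing_f1` are hypothetical and not taken from the paper.

```python
def space_positions(sentence: str) -> set[int]:
    """Return indices (counted over non-space characters) that are followed by a space."""
    positions = set()
    index = -1  # index of the most recent non-space character
    for ch in sentence:
        if ch == " ":
            if index >= 0:
                positions.add(index)
        else:
            index += 1
    return positions


def spacing_f1(reference: str, predicted: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over space positions (an assumed scoring unit)."""
    ref = space_positions(reference)
    pred = space_positions(predicted)
    true_positive = len(ref & pred)
    precision = true_positive / len(pred) if pred else 0.0
    recall = true_positive / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # The prediction inserts one extra space, so recall stays 1.0 but precision drops.
    print(spacing_f1("나는 학교에 간다", "나는 학교 에 간다"))
```

With this assumed scoring unit, a spurious space lowers precision and a missed space lowers recall, which is why a single combined score is convenient when comparing spacing models.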

Q&A

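Keyword	Question	Answer extracted from the paper
	What kind of task is automatic word spacing?	Automatic word spacing (or word segmentation) is a basic task that must be performed for natural language processing in languages that are written without spaces, such as Chinese and Japanese. Consequently, much research on separating the words in a sentence has been carried out for these languages.
	What approach is used for Chinese word segmentation?	Chinese word segmentation is treated as a sequence labeling problem and is processed at the character level. Starting from an approach that moves a reference position through the sentence and combines words from fixed-length context windows, using a gated recursive neural network over a binary tree structure, LSTM neural networks (Long Short-Term Memory Neural
	Why is it necessary to introduce an automatic word spacing step that removes spacing errors from real sentences?	For Korean, which is written with spaces between eojeol (spacing units), the task is relatively less important, but in character recognition and speech recognition, errors in which spaces go unrecognized occur frequently because of noise and other causes. In such cases, how well the spacing of an unspaced sentence is corrected has a large effect on natural language processing performance. An automatic word spacing step that removes spacing errors from real sentences is therefore needed [1, 2].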
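The character-level sequence labeling view described in the answers above can be sketched as follows. This is an illustrative assumption, not the input or decoding format defined in the paper: each non-space character receives label 1 if a space follows it and 0 otherwise, and the helper names `to_labels` and `apply_labels` are hypothetical.

```python
def to_labels(spaced: str) -> tuple[str, list[int]]:
    """Split a spaced sentence into its unspaced text and per-character labels.

    Label 1 means "a space follows this character"; label 0 means no space.
    """
    chars: list[str] = []
    labels: list[int] = []
    for ch in spaced.strip():
        if ch == " ":
            if labels:
                labels[-1] = 1  # mark the preceding character
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels


def apply_labels(unspaced: str, labels: list[int]) -> str:
    """Rebuild a spaced sentence from unspaced text and predicted labels."""
    if len(unspaced) != len(labels):
        raise ValueError("text and labels must have the same length")
    pieces: list[str] = []
    for ch, label in zip(unspaced, labels):
        pieces.append(ch)
        if label == 1:
            pieces.append(" ")
    return "".join(pieces).rstrip()


if __name__ == "__main__":
    text, labels = to_labels("나는 학교에 간다")
    print(text, labels)
    print(apply_labels(text, labels))
```

Under this framing, a spaced corpus yields training pairs automatically: the unspaced text is the model input and the label sequence is the target, so any sequence model that predicts one label per character can restore the spacing.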

References (17)

[1] Van Khien Phan, Soo-Hyung Kim, Hyung-Jeong Yang, and Guee-Sang Lee, "Text Detection based on Edge Enhanced Contrast Extremal Region and Tensor Voting in Natural Scene Images," Smartmedia Journal, vol. 6, no. 4, pp. 32-40, Dec. 2017.
[2] Abhijeet Boragule and Guee Sang Lee, "Text Line Segmentation of Handwritten Documents by Area Mapping," Smartmedia Journal, vol. 4, no. 3, pp. 44-49, Sep. 2015.
[3] Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang, "Gated recursive neural network for Chinese word segmentation," In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pp. 1744-1753, Jul. 2015.
[4] Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang, "Long short-term memory neural networks for Chinese word segmentation," In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1197-1206, Sep. 2015.
[5] Deng Cai and Hai Zhao, "Neural Word Segmentation Learning for Chinese," In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 409-420, Aug. 2016.
[6] Peilu Wang, Yao Qian, Hai Zhao, Frank K. Soong, Lei He, and Ke Wu, "Learning distributed word representations for bidirectional LSTM recurrent neural network," In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 527-533, Jun. 2016.
[7] 강승식, "Automatic correction of word spacing errors using syllable bigrams," Speech Sciences (음성과학), vol. 8, no. 2, pp. 83-90, Jun. 2001.
[8] 심광섭, "Korean automatic word spacing using CRF," Cognitive Science (인지과학), vol. 22, no. 2, pp. 217-233, Jun. 2011.
[9] 이창기, 김현기, "Korean automatic word spacing using Structural SVM," Proceedings of the Korean Institute of Information Scientists and Engineers Korea Computer Congress 2012, vol. 39, no. 1(B), pp. 270-272, Jun. 2012.
[10] 황현선, 이창기, "Korean automatic word spacing using deep learning," Korea Computer Congress, pp. 738-740, Jun. 2016.
[11] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to Sequence Learning with Neural Networks," arXiv preprint, arXiv:1409.3215, Dec. 2014.
[12] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton, "Grammar as a Foreign Language," arXiv preprint, arXiv:1412.7449, Jun. 2015.
[13] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint, arXiv:1409.0473, May 2016.
[14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, pp. 1929-1958, Jan. 2014.
[15] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, "Layer Normalization," arXiv preprint, arXiv:1607.06450, Jul. 2016.
[16] Matthew D. Zeiler, "ADADELTA: An Adaptive Learning Rate Method," arXiv preprint, arXiv:1212.5701, Dec. 2012.
[17] Chin-Yew Lin, "ROUGE: A Package for Automatic Evaluation of Summaries," In Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL 2004, Jul. 2004.
