Korean Morphological Analysis Method Based on BERT-Fused Transformer Model

이창재 (Division of Software, Yonsei University), 나동열 (Division of Software, Yonsei University)

doi:10.3745/ktsde.2022.11.4.169

KIPS Transactions on Software and Data Engineering, v.11 no.4, 2022, pp.169-178

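Abstract (translated from the Korean)

A morpheme is the smallest unit of language: dividing it any further destroys its original meaning. In Korean, a sentence is a combination of eojeols (words) delimited by spaces. Morphological analysis takes a sentence in eojeol units as input, uses contextual information to split it into morphemes, and produces a result in which each morpheme is tagged with an appropriate part-of-speech symbol. Morphological analysis is the most central task in Korean natural language processing, and improvements in its performance translate directly into improvements in Korean natural language processing tasks. Recently, morphological analysis has been studied mainly from the perspective of machine translation. Machine translation converts a sequence (sentence) in one domain into a sequence (sentence) in another domain using, for example, a neural network model. Viewed as machine translation, morphological analysis converts an input sequence belonging to the eojeol domain into a sequence in the morpheme domain. This paper proposes a deep learning model for Korean morphological analysis. The model used in this study is based on the BERT-fused model, which has recorded high performance in machine translation. The BERT-fused model uses the Transformer, a representative model in machine translation, together with BERT, a language model that brought a breakthrough in performance to natural language processing. In experiments, the model achieved a morpheme-level F1-score of 98.24.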

Abstract

Morphemes are the most primitive units of a language: segmenting them into smaller parts destroys their original meaning. In Korean, a sentence is a sequence of eojeols (words) separated by spaces, and each eojeol comprises one or more morphemes. Korean morphological analysis (KMA) divides the eojeols of a given Korean sentence into morpheme units and assigns an appropriate part-of-speech (POS) tag to each resulting morpheme. KMA is one of the most important tasks in Korean natural language processing (NLP), and improving KMA performance is closely tied to improving the performance of Korean NLP tasks. Recent research on KMA has begun to adopt machine translation (MT) models. MT converts a sequence (sentence) of units in one domain into a sequence (sentence) of units in another domain; neural machine translation (NMT) refers to MT approaches that use neural network models. From the MT perspective, KMA transforms an input sequence of units in the eojeol domain into a sequence of units in the morpheme domain. In this paper, we propose a deep learning model for KMA. The backbone of our model is the BERT-fused model, which has been shown to achieve high performance on NMT. The BERT-fused model combines the Transformer, a representative NMT model, with BERT, a language representation model that has enabled significant advances in NLP. Experimental results show that our model achieves a morpheme-level F1-score of 98.24.
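
The MT framing described in the abstract, and the kind of morpheme-level F1 metric it reports, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the example sentence, the Sejong-style POS tags, the `morpheme/TAG` target-token format, the `<sp>` eojeol-boundary token, and the bag-of-tokens F1 definition (the abstract does not state the exact scoring procedure).

```python
from collections import Counter

# Hypothetical eojeol-boundary marker in the target sequence (not from the paper).
EOJEOL_SEP = "<sp>"

# Illustrative sentence and gold analysis: one list of (morpheme, POS) pairs per eojeol.
# Tags are Sejong-style and used only as an example.
sentence = "나는 학교에 갔다"
gold = [
    [("나", "NP"), ("는", "JX")],
    [("학교", "NNG"), ("에", "JKB")],
    [("가", "VV"), ("았", "EP"), ("다", "EF")],
]
# A hypothetical system output that fails to split the last eojeol correctly.
pred = [
    [("나", "NP"), ("는", "JX")],
    [("학교", "NNG"), ("에", "JKB")],
    [("갔", "VV"), ("다", "EF")],
]


def to_source(text):
    """Source side of the 'translation': the space-delimited eojeols."""
    return text.split()


def to_target(analysis):
    """Target side: morpheme/TAG tokens, with a separator between eojeols."""
    tokens = []
    for i, eojeol in enumerate(analysis):
        if i > 0:
            tokens.append(EOJEOL_SEP)
        tokens.extend(f"{morph}/{tag}" for morph, tag in eojeol)
    return tokens


def morpheme_f1(gold_tokens, pred_tokens):
    """Bag-of-tokens F1 over morpheme/TAG units (a simplification:
    token positions are ignored and separators are excluded)."""
    gold_counts = Counter(t for t in gold_tokens if t != EOJEOL_SEP)
    pred_counts = Counter(t for t in pred_tokens if t != EOJEOL_SEP)
    overlap = sum((gold_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


source_tokens = to_source(sentence)
gold_tokens = to_target(gold)
pred_tokens = to_target(pred)

print(source_tokens)
print(gold_tokens)
print(round(morpheme_f1(gold_tokens, pred_tokens), 4))
```

In this sketch a sequence-to-sequence model would be trained to map `source_tokens` to `gold_tokens`. A morpheme counts as correct only when both its surface form and its tag match, so the single segmentation error in the last eojeol lowers both precision and recall.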

Tables/Figures (13)

Fig. 1. Architecture of Korean Morphological Analysis Model
Fig. 2. Example of Korean Morphological Analysis (Prediction)
Fig. 3. Korean Morphological Analysis Model (Prediction)
Fig. 4. Example of Input to BERT of Korean Morphological Analysis Model
Table 1. Parameter Values on Training with frozen BERT
Table 2. Parameter Values on Fine-tuning with BERT
Table 3. Number of Sentences/Eojeols/Morphemes
Table 4. Transformer+frozen BERT Loss/Validation F1-Score
Table 5. Transformer+BERT Loss/Validation F1-Score
Table 6. Validation F1-Score on Each Method of Search
Table 7. Examples of Morphological Analysis Errors
Table 8. Datasets for Each Model
Table 9. Test F1-Score of Each Model

