[Thesis] A Proposal for a New Korean Named Entity Recognition Model Using Word2vec Learning Features

한남기


한남기 (Yonsei University Graduate School, Department of Library and Information Science, master's thesis)

Abstract (Korean)

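Since 1991, automatic named entity recognition (NER) by machine has been regarded as an essential component of information retrieval, and much research has been conducted. NER in Korean-language documents has been no exception, but existing Korean NER studies share a limitation: they gave little consideration to learning features. To overcome this limitation, this study attempts to improve the performance of a Korean NER model by using word representation features, which have recently been studied actively abroad.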
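Previous studies of Korean NER models attempted recognition through rule-based and statistical methodologies. Rule-based methods, however, require manual work by the researcher and lose accuracy depending on the subject domain, while the statistical studies differ only in their machine learning algorithms and show little difference in their learning features.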
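Meanwhile, research abroad on learning features usable in supervised learning has been active, and word representations produced by unsupervised learning have been reported to improve the performance of a variety of supervised learning tasks. In particular, prior studies have shown that features learned with the word2vec model can improve NER performance across various domains and languages.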
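This study takes the learning features used in earlier studies as the control group and compares its performance against a treatment group that adds word2vec features, in order to verify whether word2vec learning features actually improve a Korean NER model. It also uses the training parameters offered by word2vec to build several word2vec feature sets and compares how they affect the performance of the Korean NER model.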
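The experiment proceeded as follows. First, the modern written corpus provided by the 21st Century Sejong Project was used as the machine learning data, and the lexical and part-of-speech features, which are supplied pre-analyzed in the corpus, were extracted. Named entity tagging was performed mechanically using the electronic dictionaries of the 21st Century Sejong Project, and Korean Wikipedia data was used for the dictionary feature. These steps produced the training data for the control group.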
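The dictionary-based tagging step could be sketched roughly as follows. This is a minimal illustration, not the thesis's actual procedure: the greedy longest-match strategy, the BIO label scheme, the entity types, and the dictionary entries are all assumptions introduced here.

```python
def tag_with_dictionary(tokens, entity_dict, max_len=3):
    """Assign BIO labels by greedy longest-match lookup (an assumed strategy)."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest span first so multi-token names win over their parts.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in entity_dict:
                entity_type = entity_dict[span]
                labels[i] = "B-" + entity_type
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + entity_type
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


# Hypothetical dictionary entries and tokens, for illustration only.
ENTITY_DICT = {"연세 대학교": "ORG", "서울": "LOC"}
TOKENS = ["연세", "대학교", "는", "서울", "에", "있", "다"]
LABELS = tag_with_dictionary(TOKENS, ENTITY_DICT)
```

Tagging from a dictionary avoids manual annotation, but the labels are only as good as the dictionary's coverage, which is worth keeping in mind when reading the reported scores.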
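Next, vector-based learning features for the treatment group were computed with a publicly available word2vec library. Using the parameters configurable in word2vec, features were generated with 20, 50, and 100 dimensions and with window sizes of 3, 5, and 7. These were used to build multiple treatment groups and to measure how word2vec-trained features affect the performance of the Korean NER model. The resulting training data were used to train Korean NER models with the conditional random field (CRF) algorithm, and performance was evaluated with 10-fold cross-validation.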
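One way to picture the treatment groups is as a grid of word2vec settings plus a per-token feature map that optionally carries the vector. The sketch below assumes that every dimension is paired with every window size and that each vector component becomes one real-valued feature; the abstract states neither, and the function and feature names are hypothetical.

```python
from itertools import product

DIMENSIONS = [20, 50, 100]  # vector sizes named in the abstract
WINDOWS = [3, 5, 7]         # window sizes named in the abstract


def parameter_grid():
    """Enumerate word2vec settings (full cross-product is an assumption)."""
    return [{"dim": d, "window": w} for d, w in product(DIMENSIONS, WINDOWS)]


def token_features(word, pos, in_dictionary, vectors=None):
    """Build a feature map for one token.

    With vectors=None this yields control-group features only; passing a
    word-to-vector mapping adds one feature per vector component.
    """
    feats = {"word": word, "pos": pos, "in_dict": in_dictionary}
    if vectors is not None:
        vec = vectors.get(word)
        if vec is not None:
            for k, value in enumerate(vec):
                feats["w2v_%d" % k] = value
    return feats


# Hypothetical two-dimensional vector, for illustration only.
CONTROL = token_features("서울", "NNP", True)
TREATMENT = token_features("서울", "NNP", True, vectors={"서울": [0.1, -0.2]})
```

Keeping the control features identical in both groups means any difference in score can be attributed to the added vector features.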
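The results showed that the treatment groups using word2vec features recognized named entities better than the control group without them, with performance improving by up to 24.47% in F-measure. In addition, the more dimensions the learned features had, the better the NER model performed, whereas the optimal window size varied with the entity type handled by the model. These results were broadly in line with the trends reported in earlier studies using word2vec.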
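For readers unfamiliar with the F-measure, the following sketch computes the balanced F1 score from entity-level counts. The abstract does not say which F variant was used or whether the 24.47% figure is an absolute or a relative difference, so the balanced form and the example counts here are assumptions.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not results from the thesis).
EXAMPLE_F = f_measure(true_positives=80, false_positives=20, false_negatives=20)
```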
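In conclusion, using word2vec learning features can improve the performance of a Korean NER model, and varying the parameter settings can raise the improvement further. The significance of this study lies in drawing renewed attention to the importance of learning features, which earlier Korean NER studies had rarely addressed, and in proposing a way to improve Korean NER performance. Its value also lies in the fact that the proposed methodology can contribute to the development of Korean NER models usable in actual information retrieval research.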

Abstract (English)

Since 1991, Named Entity Recognition (NER) has been recognized as an essential element of Information Retrieval, and many studies have been conducted. NER in Korean literature is no exception, but existing studies are limited in that they did not consider machine-learning features. The purpose of this study is to use word representation features to improve the performance of a Korean NER model.
Previous studies of Korean NER models have approached the task through rule-based and statistics-based methodologies. However, a Korean NER model built with a rule-based methodology requires entity recognition rules to be written manually, and the resulting model cannot handle multiple subject areas. Models built with a statistics-based methodology have considered their machine-learning algorithms, but not the machine-learning features that affect model performance.
Regarding machine-learning features, there is a large body of research abroad that tries to improve the performance of supervised learning, especially in Natural Language Processing (NLP). Previous studies have shown that a feature computed by unsupervised learning can be an effective feature for supervised learning in NLP. In particular, vector-based features produced by the word2vec algorithm have been reported to improve NER models in a variety of fields and languages.
In this study, the performance of a control group, which uses only the machine-learning features of previous studies, is compared with that of a treatment group, which uses the control group's features plus a vector-based feature from word2vec. This experiment is conducted to determine the usefulness of the word2vec feature in Korean NER. In addition, word2vec provides various parameters, and the study tests whether those parameters affect the performance of the Korean NER model.
The experiment is conducted as follows. The 21st Century Sejong Modern Korean Corpora are used for machine learning. The vocabulary and part-of-speech tag features are extracted from this corpus. Named entity tagging is done automatically using the electronic dictionaries provided with the Sejong Corpora. In addition, a dictionary feature is built from Korean Wikipedia data. The Korean NER model serving as the control group uses all of the above features.
Next, the vector-based feature is computed with a word2vec library in Java. The dimension and window parameters are varied to train multiple NER models in order to investigate how they affect the performance of the Korean NER model. In addition to the features used in the control group, the treatment group uses the vector-based feature. All groups are trained with the Conditional Random Field (CRF) algorithm, and their performance is evaluated by 10-fold cross-validation.
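The 10-fold evaluation can be illustrated with a small index splitter. This is only a sketch: the abstract does not say whether folds were formed over sentences or documents, or whether the data were shuffled, so the unshuffled stride-based assignment below is an assumption, and `k_fold_splits` is a name introduced here.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Items are dealt into folds round-robin, so fold sizes differ by at most one.
    """
    indices = list(range(n_items))
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test


# Hypothetical corpus of 25 sentences, for illustration only.
SPLITS = list(k_fold_splits(25, k=10))
```

Each item is held out exactly once, and the reported score would be an aggregate over the ten held-out folds (for example, their mean).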
As a result, the treatment group performed better than the control group in all respects. In particular, the difference in performance is up to 24.47% in terms of F-measure. In addition, as the dimension parameter increases, the performance of the Korean NER model improves. For the window parameter, however, performance is neither directly nor inversely proportional to the window size.
This study shows that using machine-learning features produced by word2vec can improve the performance of a Korean NER model. It refocuses attention on the significance of machine-learning features, which were not covered in previous Korean NER research. In addition, it contributes to improving Korean NER performance and to developing Korean NER models that can be used in actual information retrieval research.

Thesis information

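Author	한남기
Degree-granting institution	Yonsei University Graduate School
Degree	Master's (domestic)
Department	Department of Library and Information Science
Advisor	송민
Year of publication	2016
Total pages	vii, 57 leaves
Keywords	Korean named entity recognition (한글 개체명 인식); named entity recognition (개체명 인식); word representation (단어 표상); word2vec
Language	kor
Full-text URL	http://www.riss.kr/link?id=T14003936&outLink=K
Source	Korea Education and Research Information Service (KERIS)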
