[Report] Development and Application of a Semi-supervised Learning-based Korean Natural Language Processing Engine Based on Big Data

강필성

Korean title: 빅데이터를 활용한 준지도학습 기반의 한글 자연어처리 엔진 개발 및 응용
English title: Development and Application of a Semi-supervised Learning-based Korean Natural Language Processing Engine Based on Big Data

Report Information
Lead research institution	Korea University (고려대학교)
Principal investigator	강필성
Report type	Final report
Country of publication	Republic of Korea
Language	Korean
Publication date	2017-05
Project start year	2016
Supervising ministry	Ministry of Science and ICT (과학기술정보통신부)
Registration number	TRKO201800004840
Project ID	1711036527
Program	Individual Research Support (개인연구지원)
Database entry date	2018-05-05
Keywords	Text Mining; Korean Natural Language Processing; Sentiment Analysis; Big Data; Semi-supervised Learning; Data Mining; Machine Learning; Pattern Recognition
DOI	https://doi.org/10.23000/TRKO201800004840

Abstract (Korean summary, translated)

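Purpose and Contents
This study aims to develop a semi-supervised learning-based Korean natural language processing engine and to verify its effectiveness. To this end, we collect Korean data in various forms to build corpora and, on that basis, develop a semi-supervised learning-based word recognizer and named entity recognizer. We also develop a highly accurate method for computing domain-specific lexical sentiment scores and explore a range of applications built on the developed methods. Finally, we share the developed methods and corpora to help stimulate research in related fields.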

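Results
This project produced the following outcomes. First, regarding data collection, we gathered about 70 million documents in total from sources such as news, user reviews, and social networks, organized them by domain and purpose, and are in the process of releasing them. Second, regarding the semi-supervised natural language processing engine, we developed two tokenizers for unsupervised word extraction (L-Tokenizer and Max Score Tokenizer) and confirmed that they outperform existing supervised word recognizers. We also developed an unsupervised named entity recognizer based on word embeddings, which achieves higher accuracy than an existing CRF-based named entity recognizer. Third, regarding lexical sentiment scoring, we developed a statistics-based scoring method that uses review ratings and a scoring method that uses word embeddings and graph-based semi-supervised learning. Fourth, regarding applications of Korean text analysis, we identified meaningful use cases including trend analysis with a keyword extractor, box office prediction from social network data, measurement of the relevance and uniqueness of news articles, and an attention model over news articles for stock price movements. Finally, we released the source code of the developed natural language processing engine and the corpora on GitHub so that other researchers can use the results freely.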

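Plans for Using the Results
The results of this project will be released as a Python package so that researchers in related fields can use them easily. We also plan to extend the developed sentiment scoring method to various domains and use it actively to build domain-specific sentiment lexicons. In addition, we expect the developed natural language processing engine to be usable in building spam filtering systems that detect spam content in text data such as news articles, e-mail, and social networks.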

(Source: Korean summary, p. 4)

Abstract (English summary)

Purpose and Contents
This research project aims to develop semi-supervised learning-based natural language processing engines for Korean and to verify their effectiveness by applying them to various real-world problems. To do so, we first constructed large corpora for different domains such as news articles, review comments, and social network service (SNS) posts. Based on these corpora, we developed semi-supervised learning-based word recognition algorithms and named entity recognition algorithms. In addition, we developed domain-specific sentiment lexicon evaluation methods and applied them to news article polarity classification and box office forecasting. Finally, we aimed to share the corpora and the source code of the developed algorithms in a public repository to help researchers who are interested in Korean text mining.

Results
The outputs of this research project can be summarized as follows. First, regarding data collection, we constructed five Korean text corpora by collecting more than 70 million documents from diverse sources such as news articles, user reviews of movies and cell phones, and social network services. We also preprocessed the collected data to make it public through GitHub, a public platform for sharing code and data. Second, we developed two semi-supervised learning-based word recognition algorithms, L-Tokenizer and Max Score Tokenizer, which showed better recognition performance than supervised learning-based benchmark methods. In addition, we developed a named entity recognition algorithm based on word embeddings that achieves higher recognition precision than the conventional CRF-based method. Third, we developed two word sentiment evaluation methods, one based on a purely statistical method and the other on graph-based semi-supervised learning. Fourth, to identify successful applications of Korean text mining, we applied the developed algorithms to box office forecasting, evaluation of the relevance and uniqueness of news articles, and text attention for stock price movements. Finally, we made all source code and corpus data public so that anyone interested in Korean text mining can freely use the results of this research project.
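As a rough illustration of the statistics-based sentiment evaluation, the sketch below scores each word by how far the average rating of the reviews containing it deviates from the overall average rating. The Korean summary states only that review ratings are used; the scoring formula, the function name `rating_based_sentiment_scores`, the `min_count` threshold, and the toy reviews are all assumptions made for illustration, not the project's published method.

```python
from collections import defaultdict


def rating_based_sentiment_scores(reviews, min_count=2):
    """Score words by (mean rating of reviews containing the word) - (global mean rating).

    reviews: iterable of (tokens, rating) pairs, where tokens is a list of strings.
    min_count: hypothetical threshold; words seen in fewer reviews are dropped.
    """
    reviews = list(reviews)
    if not reviews:
        return {}

    global_mean = sum(rating for _, rating in reviews) / len(reviews)

    rating_sum = defaultdict(float)
    review_count = defaultdict(int)
    for tokens, rating in reviews:
        # Count each word once per review so long reviews do not dominate.
        for token in set(tokens):
            rating_sum[token] += rating
            review_count[token] += 1

    return {
        token: rating_sum[token] / review_count[token] - global_mean
        for token in rating_sum
        if review_count[token] >= min_count
    }


# Toy data (invented for this sketch): tokenized reviews with 1-5 star ratings.
toy_reviews = [
    (["good", "movie"], 5),
    (["good", "fun"], 4),
    (["boring", "movie"], 1),
    (["boring", "bad"], 2),
]
scores = rating_based_sentiment_scores(toy_reviews)
for word, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{word}: {score:+.2f}")
```

Positive scores mark words that co-occur with above-average ratings and negative scores mark words that co-occur with below-average ratings. Because the scores come from the ratings of one review collection, running the same procedure on a different domain would give a different, domain-specific lexicon.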

Expected Contribution
The results of this research project are wrapped as a Python package to help people who want to use the developed algorithms. We also plan to extend the sentiment score evaluation algorithms to more domains in order to construct various domain-specific Korean sentiment dictionaries. In addition, we will develop a spam filtering system based on the developed algorithms to identify spam texts that are prevalent in news articles, e-mail contents, social network services, and similar sources.

(Source: SUMMARY, p. 5)

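Contents

Cover ... 1
Table of contents ... 2
Research plan summary ... 3
Research results summary ... 4
Korean summary ... 4
SUMMARY ... 5
Research contents and results ... 6
1. Overview of the R&D project ... 6
2. Status of domestic and international technology development ... 8
3. Research activities and results ... 9
4. Goal attainment and contribution to related fields ... 20
5. Plans for using the research results ... 21
6. Overseas science and technology information collected during the research ... 21
7. Representative research achievements of the principal investigator ... 22
8. References ... 23
9. Research outcomes ... 24
10. Research facilities and equipment registered with the National Science & Technology Information Service ... 31
11. Record of laboratory safety measures implemented during the R&D project ... 31
12. Other matters ... 31
Appendix 1: Representative research outcomes ... 32
Appendix 2: Supporting evidence for detailed objectives ... 48
Last page ... 53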
