[Thesis] A Method for Detecting and Correcting Spelling Errors in Large-Scale Korean Corpora (한국어 대용량 말뭉치의 철자 오류 탐지 및 교정 방법)

원혜진

원혜진 (Kookmin University Graduate School, Department of Computer Engineering, Computer Engineering major; master's thesis)

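Abstract (translated from the Korean)

Recent work in natural language processing increasingly builds on large training corpora and large models. Because a large training corpus is needed to improve the generalization performance of NLP models, the need and demand for training corpora and data grow by the day. However, studies have shown that, alongside corpus size, data quality problems such as mislabeling or typos can also substantially affect trained models, which calls for research on the quality of training data.
This thesis proposes a multi-pass method for detecting and correcting Korean spelling errors, and compares and analyzes the performance of models that combine several correction methods. Unlike existing correction methods, the proposed method first detects spelling errors in the input sentence and then applies syllable-level, word-level (eojeol), and context-level correction to the detected errors. Detection uses a method based on Korean character encoding together with low-frequency syllable detection. For syllable-, word-, and context-level correction, a trigram language model, Word2Vec, and a Transformer with a copy mechanism are used, respectively, and the models at each level are combined into composite correction models.
Because no public Korean spelling-correction corpus is available for training or validation, 1,000 sentences suspected of containing spelling errors were extracted from KCC150 to build a training corpus, which was used for training and validation in order to evaluate the single and composite correction models. A confusion matrix was used to compare and analyze the single and composite models, and performance was measured with precision, recall, and F1 score.
The comparison showed that combining correction models can cause interference between the models, and that the context-level correction model generally helps improve the performance of spelling error correction models.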
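
The detection step named in the abstract (an encoding-based check plus low-frequency syllable detection) can be illustrated with a minimal sketch. The abstract does not say how the encoding is used, so this sketch assumes it means testing whether a Hangul syllable belongs to the two-byte KS X 1001 (EUC-KR) syllable set. The function names, the `min_count` threshold, and the toy corpus are all hypothetical and are not taken from the thesis.

```python
from collections import Counter


def is_hangul_syllable(ch: str) -> bool:
    """True for precomposed Hangul syllables (U+AC00..U+D7A3)."""
    return "\uac00" <= ch <= "\ud7a3"


def in_ksx1001(ch: str) -> bool:
    """Assumed encoding check: the character has a plain 2-byte EUC-KR code.

    Depending on the codec, a syllable outside KS X 1001 either fails to
    encode or encodes to a longer sequence; both cases return False.
    """
    try:
        return len(ch.encode("euc_kr")) == 2
    except UnicodeEncodeError:
        return False


def syllable_counts(corpus):
    """Count Hangul syllables over an iterable of sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(ch for ch in sentence if is_hangul_syllable(ch))
    return counts


def detect_suspects(sentence, counts, min_count=1):
    """Return (index, syllable) pairs flagged by either heuristic."""
    suspects = []
    for i, ch in enumerate(sentence):
        if not is_hangul_syllable(ch):
            continue  # only Hangul syllables are checked in this sketch
        if not in_ksx1001(ch) or counts[ch] < min_count:
            suspects.append((i, ch))
    return suspects


# Toy corpus, invented for illustration only.
corpus = ["학교에 간다", "학교에 갔다", "학교는 크다"]
counts = syllable_counts(corpus)
suspects = detect_suspects("학교에 똠다", counts)
```

In a real setting the counts would come from the full corpus and `min_count` would be tuned; flagged positions would then be handed to the syllable-, word-, and context-level correctors.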

Abstract (English)

Recently, various studies in natural language processing have been proposed that build on large-scale training corpora and large-capacity models. The need and demand for training corpora and data are increasing day by day because a large training corpus is required to improve the generalization performance of natural language processing models. However, as studies have reported that data quality problems such as mislabeling or typos, and not only the size of the training corpus, can significantly affect the trained model, research on the quality of training data is needed.
In this paper, we propose a multi-pass Korean spelling error detection and correction method and compare the performance of models that combine various correction methods. Unlike conventional spelling correction methods, the proposed method first detects spelling errors in the input sentence and then applies syllable-, word-, and context-level correction to the detected errors. A Korean encoding-based method and a statistical low-frequency syllable method were used for spelling error detection. In addition, a trigram language model, Word2Vec, and a Transformer with a copy mechanism were used for spelling correction at the syllable, word, and context levels, respectively, and spelling correction was performed with composite correction models formed by combining the models at each level.
Because no open Korean spelling correction corpus exists for training or validation, we extracted 1,000 sentences suspected of containing spelling errors from the KCC150 corpus and used them for training and validation in order to evaluate the single and composite models. In addition, a confusion matrix was used to compare the performance of the single and composite correction models, and performance was measured using precision, recall, and F1 score.
The comparison of single and composite models confirmed that combining models can cause interference between them. However, the context-level model was found to help improve the overall performance of spelling error correction models.
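
The evaluation measures mentioned in the abstract follow directly from confusion-matrix counts. The sketch below shows the standard definitions of precision, recall, and F1; the function name and the counts in the example are made up for illustration and are not results from the thesis.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    tp: errors corrected properly, fp: wrong or unnecessary corrections,
    fn: errors left uncorrected. This mapping of counts to correction
    outcomes is an assumption of this sketch.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, not taken from the thesis.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

With these invented counts, precision is 0.8 and recall is 0.5, so F1 (their harmonic mean) sits between the two and closer to the lower value.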

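Thesis information

Author	원혜진
Degree-granting institution	Kookmin University Graduate School (국민대학교 일반대학원)
Degree	Master's (domestic)
Department	Department of Computer Engineering, Computer Engineering major
Advisor	강승식
Year of publication	2021
Total pages	vi, 63
Language	kor
Full-text URL	http://www.riss.kr/link?id=T16066473&outLink=K
Source	Korea Education and Research Information Service (한국교육학술정보원)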
