[Thesis] A Study on Training and Evaluating Deep Learning Models for Patent Document Classification

김경호


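김경호 (Konyang University Graduate School, Department of Medical Artificial Intelligence, Medical Artificial Intelligence major; domestic master's thesis)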

Abstract (Korean)

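Cooperative Patent Classification (CPC) is a scheme for classifying patent documents. Depending on the granularity of classification, its levels are Section, Class, Subclass, Main group, and Subgroup. Major patent offices around the world, including Korea's, now classify patent documents according to the CPC.
The problem is that as the number of patent applications to be examined and classified grows, the associated costs grow as well. To address this, earlier studies have also investigated classifying patent documents automatically by CPC unit.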
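Earlier studies applied a multi-class, multi-label approach in which a single model is trained to classify many kinds of patent documents. This works when the dataset is well constructed, but as the number of CPC codes to be classified increases, the dataset is more likely to become imbalanced, and the classification model's performance degrades as a result.
To improve on the classification performance of earlier studies, this study applies a multi-model binary classification approach instead of a single-model multi-class, multi-label approach. That is, rather than training one classification model to assign many labels, each classification model is trained and configured to decide only whether a single classification target is true or false.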
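The classifiers are based on KoBERT and KoELECTRA, deep learning language models whose basic vocabularies are pre-trained on Korean. The data for further training these language models was collected through the Open API provided by KIPRIS Plus, and from that data we built the vocabulary used for pre-training and the dataset used for fine-tuning. The vocabulary was constructed with the Mecab morphological analyzer.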
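As a result, the binary classification models reached an accuracy of up to 98.26% and an F1 score of 96.81, confirming higher accuracy than in earlier studies.
In follow-up work, the vocabulary will be reinforced for the models with lower accuracy. Once all classification models reach a reasonably good level of performance, the vocabularies will be merged and a single-model multi-class, multi-label approach will be applied, as in earlier studies.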

Abstract (English)

Cooperative Patent Classification (CPC) is a scheme for classifying patent documents. Its classification levels, in order of increasing granularity, are Section, Class, Subclass, Main group, and Subgroup. Currently, not only Korea but also major patent offices around the world classify patent documents based on the CPC.
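To make the five levels concrete, the following is a minimal sketch of splitting a full CPC symbol into its hierarchical prefixes. The function name `split_cpc`, the `CpcLevels` type, and the example symbol are illustrative assumptions and do not come from the thesis.

```python
import re
from typing import NamedTuple


class CpcLevels(NamedTuple):
    """Hierarchical prefixes of one CPC symbol (names are illustrative)."""
    section: str
    cpc_class: str
    subclass: str
    main_group: str
    subgroup: str


# Section letter (A-H or Y), two-digit class, subclass letter,
# main group number, "/", subgroup number.
_CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def split_cpc(symbol: str) -> CpcLevels:
    """Split a full CPC symbol such as 'G06F 16/35' into its five levels."""
    match = _CPC_PATTERN.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"not a full CPC symbol: {symbol!r}")
    section, cls, subcls, group, sub = match.groups()
    subclass = section + cls + subcls
    return CpcLevels(
        section=section,
        cpc_class=section + cls,
        subclass=subclass,
        # Formally a main group is written with "/00"; only the prefix is kept here.
        main_group=f"{subclass} {group}",
        subgroup=f"{subclass} {group}/{sub}",
    )


levels = split_cpc("G06F 16/35")
print(levels)
```

A classifier can be trained at any of these levels; the deeper the level, the more labels there are and the fewer documents each label tends to have.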
The problem is that as the number of patent applications examined and classified increases, the related costs also increase. To address this problem, existing studies have investigated automatically classifying patent documents by CPC unit.
Existing studies have applied a multi-class multi-label methodology in which a single model is trained to classify many kinds of patent documents. If the dataset is well constructed, there is no problem, but as the number of CPC codes to be classified increases, the dataset is more likely to become imbalanced, and the performance of the classification model degrades accordingly.
In this work, we apply a multi-model binary classification methodology rather than a single-model multi-class multi-label methodology to improve the performance of classification models over existing studies. In other words, instead of training one classification model to assign multiple labels, each classification model is trained and configured to judge only whether one classification target is true or false.
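The multi-model binary setup can be sketched as follows: a multi-label corpus is relabeled once per CPC code into a true/false dataset, and at prediction time every per-code classifier is asked independently. This is a sketch under stated assumptions, not the thesis's implementation. The toy corpus, the CPC codes, and the keyword rules standing in for fine-tuned KoBERT or KoELECTRA models are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Document = Tuple[str, Set[str]]  # (text, set of CPC labels)


def make_binary_dataset(docs: List[Document], target: str) -> List[Tuple[str, int]]:
    """Relabel a multi-label corpus as 1 (has the target code) or 0 (does not)."""
    return [(text, int(target in labels)) for text, labels in docs]


def predict_labels(text: str, classifiers: Dict[str, Callable[[str], bool]]) -> Set[str]:
    """Query each per-code binary classifier and collect the codes judged true."""
    return {label for label, clf in classifiers.items() if clf(text)}


# Hypothetical toy corpus; the texts and codes are illustrative only.
docs: List[Document] = [
    ("text classification with a language model", {"G06F", "G06N"}),
    ("neural network training device", {"G06N"}),
    ("blood pressure measuring apparatus", {"A61B"}),
]

# One binary training set per target code.
binary_g06n = make_binary_dataset(docs, "G06N")

# Stand-in classifiers: keyword rules used only so the sketch runs
# without a trained model. A real system would wrap one fine-tuned model per code.
classifiers: Dict[str, Callable[[str], bool]] = {
    "G06N": lambda t: "neural" in t or "model" in t,
    "A61B": lambda t: "blood" in t,
}

predicted = predict_labels("a neural model for blood analysis", classifiers)
print(binary_g06n, predicted)
```

Because each classifier sees only its own true/false labels, a code can be added or retrained without touching the others, at the cost of running one model per code at prediction time.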
KoBERT and KoELECTRA, deep learning language models with basic vocabularies pre-trained in Korean, were selected as the classifiers. The data for further training the language models was collected using the Open API provided by KIPRIS Plus, and from that data a vocabulary for pre-training and a dataset for fine-tuning were constructed. The vocabulary was built using the Mecab morphological analyzer.
As a result, the binary classification models achieved an accuracy of up to 98.26% and an F1 score of 96.81, confirming that their accuracy was higher than that of previous studies.
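For reference, accuracy and F1 for a single binary classifier can be computed from its predictions as in the sketch below. The label vectors are made-up illustrations, not the thesis's data, and the function name `binary_metrics` is an assumption. The abstract's F1 of 96.81 appears to be on a percentage scale, whereas this sketch returns values between 0 and 1.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("no samples to evaluate")
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative labels only: 3 positives and 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting F1 alongside accuracy matters in the one-classifier-per-code setting, since negatives usually outnumber positives and accuracy alone can look high even when few positives are found.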
In future work, the vocabulary will be reinforced for models with poor accuracy, and once all classification models reach a reasonably good level of performance, the vocabularies will be combined to apply a single-model multi-class multi-label methodology as in previous studies.

Thesis information

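Author	김경호
Degree-granting institution	Konyang University Graduate School
Degree type	Master's (domestic)
Department	Medical Artificial Intelligence
Advisor	박종욱
Year of publication	2023
Keywords	Patent documents; CPC; Binary classification; Deep learning; KoBERT; KoELECTRA
Language	Korean (kor)
Full-text URL	http://www.riss.kr/link?id=T16857585&outLink=K
Source	Korea Education and Research Information Service (KERIS)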
