[Thesis] Efficient Classification Method of Extreme Multi-label Patents using Enhanced Feature Extraction
(Korean title: 향상된 특징 추출 기반의 극한 다중 레이블 특허 분류 기법)

김민상 (Soongsil University Graduate School, Convergence Software (융합소프트웨어학, 일원), master's thesis)

Abstract (translated from the Korean original)

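The number of patents filed through the Patent Cooperation Treaty (PCT) grows every year. Each country develops and adapts its own classification scheme in line with industrial development while working toward international standardization. Classification codes are assigned manually by people who analyze the patent documents, and as the number of patents grows, deep learning is being applied to the patent classification problem. However, patent data are distributed unevenly across classes, which makes classification difficult.
Before BERT, models combining word embeddings with convolutional neural networks (CNNs) were proposed. Word embeddings have the drawback that they cannot distinguish polysemous words and homonyms by context. After BERT, which overcomes this limitation, appeared, two BERT-based models were studied: PatentBERT, which fine-tunes BERT, and Label-aware Attention BERT, which applies the method proposed in LAHA to BERT. Neither achieved satisfactory performance, because BERT's maximum input length prevents the model from covering enough of a patent's technical content. Patent claims consist of independent claims, which state the technical content, and dependent claims, which describe it in more detail. PatentBERT used both, but because dependent claims repeat part of the independent claims, it misses much of the technical content present in the patent.
This thesis proposes two modules (Extract Module 1 and Extract Module 2) that extract two features from a patent's abstract and independent claims, and a module (Ensemble Module) that merges the two extracted features into a core feature.
Experiments on four datasets (I&T1430, I&T1409, KSIC564, KNSCC188) show that the proposed model is superior on every metric on the imbalanced datasets I&T1430 and I&T1409, while on KSIC564 and KNSCC188 all models perform similarly. This confirms that the proposed model performs well on imbalanced datasets.
An ablation study confirms that using both feature extraction modules gives the best performance and that the proposed model performs well on extreme multi-label classification.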

Abstract (English)

Every year, the number of patents filed through the Patent Cooperation Treaty (PCT) increases. Each country develops and adapts its own classification system in step with industrial development, with international standardization in mind. Patent classification is done by hand: people analyze patent documents and assign the labels. As the number of patents grows, deep learning is being used to solve the patent classification problem. However, patent data have an imbalanced distribution, which makes classification difficult.
Prior to the advent of BERT, models using word embeddings and convolutional neural networks (CNNs) emerged. Word embeddings had the disadvantage of not being able to distinguish polysemous words and homonyms according to context. After BERT, a model that overcame this, emerged, PatentBERT, which fine-tunes BERT, and Label-aware Attention BERT, which applies the method proposed by LAHA to BERT, were studied. However, neither model produced satisfactory performance, because the maximum input length of BERT keeps them from covering enough of a patent's technical content. The claims of a patent consist of independent claims, which state the technical content, and dependent claims, which describe it in more detail. PatentBERT used both independent and dependent claims, but dependent claims also repeat part of the independent claims, so much of the technical content in the patent goes unseen.
This thesis proposes two modules (Extract Module 1 and Extract Module 2) that extract two features from a patent's abstract and independent claims, and an Ensemble Module that combines the two extracted features into key features.
In experiments with four datasets (I&T1430, I&T1409, KSIC564, and KNSCC188), the proposed model showed superior performance on all metrics on the imbalanced datasets I&T1430 and I&T1409, while all models performed similarly on KSIC564 and KNSCC188. Accordingly, it was confirmed that the proposed model performs well on imbalanced datasets.
An ablation study confirms that using both feature extraction modules performs best and that the proposed model performs well on extreme multi-label classification problems.
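
The input-preparation idea in the abstract (use the abstract and the independent claims, and keep each within BERT's length limit) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names, the regular expression that treats any claim referring to "claim N" as dependent, and the whitespace tokenization standing in for a subword tokenizer are all hypothetical. The record does not say how the thesis identifies independent claims.

```python
import re

# Assumption: a dependent claim refers back to another claim,
# e.g. "The method of claim 1, ..." or "according to claims 2 to 4".
_CLAIM_REF = re.compile(r"\bclaims?\s+\d+", re.IGNORECASE)

MAX_TOKENS = 512  # usual maximum sequence length of BERT-base


def independent_claims(claims):
    """Keep only claims that do not refer back to another claim."""
    return [c for c in claims if not _CLAIM_REF.search(c)]


def truncate(text, max_tokens=MAX_TOKENS):
    """Whitespace-token truncation, a stand-in for a real subword tokenizer."""
    return " ".join(text.split()[:max_tokens])


def build_inputs(abstract, claims, max_tokens=MAX_TOKENS):
    """Return two separate, length-limited inputs, one per extraction branch."""
    return {
        "abstract": truncate(abstract, max_tokens),
        "independent_claims": truncate(
            " ".join(independent_claims(claims)), max_tokens
        ),
    }


# Made-up example claims, for illustration only.
claims = [
    "A battery pack comprising a housing and a plurality of cells.",
    "The battery pack of claim 1, wherein the housing is aluminum.",
    "A method of cooling a battery pack, comprising circulating a coolant.",
    "The method according to claim 3, wherein the coolant is water.",
]
inputs = build_inputs("A battery pack with improved cooling.", claims, max_tokens=16)
print(inputs["independent_claims"])
```

Dropping dependent claims before truncation means the limited token budget is spent on text that is not already restated elsewhere, which is the motivation the abstract gives for preferring independent claims.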

Thesis Information

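Author	김민상
Degree-granting institution	Soongsil University Graduate School
Degree	Master's (domestic)
Department	Convergence Software (융합소프트웨어학, 일원)
Advisor	이상준
Publication year	2023
Total pages	44
Keywords	patent classification, extreme multi-label classification
Language	kor
Full-text URL	http://www.riss.kr/link?id=T16784793&outLink=K
Source	Korea Education and Research Information Service (한국교육학술정보원)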

