[Paper] Comparison of resampling methods for dealing with imbalanced data in binary classification problem

박근우; 정인경

doi:10.5351/kjas.2019.32.3.349

Abstract (translated from Korean)

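In binary classification, results can be poor when the data are severely imbalanced. To address this problem, approaches such as transforming the training data are being actively studied. In this study, we compared resampling methods, one family of approaches for handling imbalance in binary classification. Our aim was to find a method that detects the minority class more effectively. Through simulation, we compared a total of 20 methods: several over-sampling methods, under-sampling methods, and combinations of over- and under-sampling. Logistic regression, support vector machine, and random forest models, which are widely used in classification problems, served as classifiers. In the simulations, the resampling method with the highest sensitivity while keeping accuracy at or above 0.5 was random under sampling (RUS). The method with the next highest sensitivity was the over-sampling method ADASYN (adaptive synthetic sampling approach). These results indicate that RUS was suitable for finding minority class values. Applying the methods to several real data sets gave results similar to those of the simulation.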

Abstract (English original)

A class imbalance problem arises when one class greatly outnumbers the other in binary data. Approaches such as transforming the training data have been studied to address this problem. In this study, we compared resampling methods, one family of approaches for dealing with imbalance in classification, to find a way to detect the minority class more effectively. Through simulation, a total of 20 over-sampling, under-sampling, and combined over- and under-sampling methods were compared. Logistic regression, support vector machine, and random forest models, which are commonly used in classification problems, were used as classifiers. The simulation results showed that random under sampling (RUS) had the highest sensitivity among methods with an accuracy of at least 0.5. The method with the next highest sensitivity was the over-sampling method ADASYN (adaptive synthetic sampling approach). This indicates that RUS was suitable for finding minority class values. Results on several real data sets were similar to those of the simulation.
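As a rough illustration of the random under sampling (RUS) idea named in the abstract, the sketch below randomly discards majority-class rows until the two classes are the same size. The function name `random_under_sample`, the 1:1 target ratio, and the toy data are assumptions made for this example; the abstract does not state the ratio or the software the authors used.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    Assumption: the target is a 1:1 class ratio (not stated in the abstract).
    """
    rng = random.Random(seed)
    counts = Counter(y)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    n_min = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Drawing n_min rows from the minority class keeps all of them;
        # drawing n_min rows from the majority class discards the rest.
        keep.extend(rng.sample(idx, n_min))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (made up): 95 majority rows (label 0) and 5 minority rows (label 1).
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5
X_bal, y_bal = random_under_sample(X, y)
print(Counter(y_bal))
```

Because rows are chosen purely at random, the sketch needs no distance measure or distributional assumption about the features, at the cost of throwing away most of the majority-class information.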

Tables/Figures (11)

Figure 2.1. Changes in data set after applying various over-sampling methods.
Figure 2.2. Changes in data set after applying various CNN-based under-sampling methods.
Figure 2.3. Changes in data set after applying various under-sampling methods.
Figure 2.4. Changes in data set after applying two combined methods.
Table 2.1. Misclassification table
Figure 3.1. Sensitivity, accuracy, AUC, and F1-score of logistic regression for simulation 3.
Figure 3.2. Sensitivity, accuracy, AUC, and F1-score of SVM for simulation 3.
Figure 3.3. Sensitivity, accuracy, AUC, and F1-score of random forest for simulation 3.
Figure 3.4. An example of original data set and changed data set after applying the NM2 method when the rare class values were distributed in two extremes.
Figure 4.1. Sensitivity, accuracy, AUC, and F1-score of logistic regression, SVM, and random forest for solar flare m0 data.
Table 4.1. Example data sets
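Several of the figures listed above report sensitivity, accuracy, and F1-score, all of which can be read off a 2x2 misclassification table. The sketch below computes them using the standard textbook definitions; the function name `classification_metrics`, the choice of label 1 as the minority (positive) class, and the toy predictions are assumptions for illustration, not values from the paper. AUC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Sensitivity, accuracy and F1-score from the 2x2 misclassification table."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(pairs)
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"sensitivity": sensitivity, "accuracy": accuracy, "f1": f1}


# Made-up labels: 95 majority (0) and 5 minority (1) observations.
y_true = [0] * 95 + [1] * 5

# A classifier that always predicts the majority class:
# high accuracy, but it never detects the minority class.
always_majority = classification_metrics(y_true, [0] * 100)

# A classifier that flags many more cases as minority:
# lower accuracy, but most minority cases are detected.
flag_often = classification_metrics(y_true, [0] * 60 + [1] * 35 + [1] * 4 + [0])

print(always_majority)
print(flag_often)
```

The contrast between the two toy classifiers shows why accuracy alone is a weak criterion on imbalanced data and why the comparison looks at sensitivity subject to a floor on accuracy.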

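Q&A

Keyword	Question	Answer extracted from the paper
	What is the advantage of the RUS method?	RUS can therefore be regarded as the method that raises sensitivity, which is what the paper set out to find. Because it selects data at random, it has the advantage of requiring no special assumptions.
	What is a resampling method?	Building new training data from the original data works by removing values from the original data or generating synthetic values. This is called a resampling method.
	Which analysis methods are commonly used in classification problems?	Through simulation, a total of 20 methods were compared: several over-sampling methods, under-sampling methods, and combinations of over- and under-sampling. Logistic regression, support vector machine, and random forest models, which are widely used in classification problems, served as classifiers. The simulation results showed that, with an accuracy of 0.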
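The Q&A above describes resampling as either removing original values or generating synthetic ones. To make the second half concrete, the sketch below creates synthetic minority points by interpolating between a minority point and one of its nearest minority neighbours. This is a simplified SMOTE-style scheme chosen for illustration; it is not ADASYN and not necessarily any of the 20 methods compared in the paper. The function name `synthetic_over_sample`, the parameter `k`, and the toy points are hypothetical.

```python
import math
import random


def synthetic_over_sample(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority points.

    Simplified SMOTE-style sketch (an assumption for illustration):
    pick a minority point, pick one of its k nearest minority neighbours,
    and place a new point at a random position on the segment between them.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, ordered by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class (made up): four points on the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = synthetic_over_sample(minority, n_new=6)
print(len(new_points))
```

Every synthetic point lies on a line segment between two existing minority points, so the method implicitly assumes that the region between neighbouring minority observations also belongs to the minority class.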

