[Thesis] Performance comparison of classification methods based on the random forest in class imbalanced data

문선영 (Korea University, Interdisciplinary Program in Medical Statistics, master's thesis)

Abstract (Korean)

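Objective: In many fields the event of interest occurs very rarely, so the classes of the response variable are highly imbalanced (a minority class and a majority class). General classification algorithms, however, assume balanced classes and aim to minimize the overall misclassification rate. Applying them to class-imbalanced data, where the minority class is the class of interest, therefore yields very poor classification accuracy for the minority class. Several methods have been proposed to address this class imbalance problem. The purpose of this study is to compare the classification performance of several such methods across degrees of class imbalance (IR).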
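The point about the overall misclassification rate can be illustrated with a minimal sketch. It assumes IR is the number of majority-class cases divided by the number of minority-class cases, a common convention that the abstract itself does not spell out. The data and function names are hypothetical. Under that assumption, a rule that always predicts the majority class reaches accuracy IR / (1 + IR) while finding no minority case at all.

```python
def imbalance_ratio(labels, minority=1):
    """Majority count divided by minority count (assumed definition of IR)."""
    n_min = sum(1 for y in labels if y == minority)
    n_maj = len(labels) - n_min
    return n_maj / n_min


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positive-class cases that were predicted positive."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total


# Hypothetical data: 990 majority cases (0) and 10 minority cases (1).
y_true = [0] * 990 + [1] * 10
always_majority = [0] * len(y_true)  # trivial rule: always predict class 0

ir = imbalance_ratio(y_true)             # 99.0
acc = accuracy(y_true, always_majority)  # 0.99, equal to IR / (1 + IR)
rec = recall(y_true, always_majority)    # 0.0, no minority case is found
```

The trivial rule looks excellent on accuracy yet is useless for the class of interest, which is why a measure focused on the minority class is needed.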

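Methods: This study uses random forest as the base classification algorithm and, among the many methods for addressing class imbalance, those built on random forest. As data-level approaches, which adjust the distribution of the sample, random forest was applied to samples modified by ROS (random over-sampling), RUS (random under-sampling), and SMOTE (synthetic minority over-sampling technique). As algorithm-level approaches, which modify the algorithm itself, Weighted Random Forest, Balanced Random Forest, and Isolation Forest were used. Because the classification of class-imbalanced data is concerned with classifying the minority class more accurately, accuracy, the usual measure of model performance, cannot be used. This study therefore uses the F1 measure and AU-ROC as performance measures.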
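The three data-level approaches can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the thesis's implementation: the helper names, the toy Gaussian data, the choice to resample until both classes are the same size, and the simplified SMOTE loop (pick a random minority row, then interpolate toward one of its k nearest minority neighbours) are all assumptions made here.

```python
import random


def split_by_class(X, y, minority=1):
    """Split rows of a binary 0/1 problem into (minority rows, majority rows)."""
    mino = [x for x, label in zip(X, y) if label == minority]
    majo = [x for x, label in zip(X, y) if label != minority]
    return mino, majo


def relabel(majo, mino, minority=1):
    """Stack majority rows first, then minority rows, and rebuild the labels."""
    return majo + mino, [1 - minority] * len(majo) + [minority] * len(mino)


def random_over_sample(X, y, rng, minority=1):
    """ROS: duplicate randomly chosen minority rows until the classes match in size."""
    mino, majo = split_by_class(X, y, minority)
    extra = [rng.choice(mino) for _ in range(len(majo) - len(mino))]
    return relabel(majo, mino + extra, minority)


def random_under_sample(X, y, rng, minority=1):
    """RUS: keep a random majority subset the same size as the minority class."""
    mino, majo = split_by_class(X, y, minority)
    return relabel(rng.sample(majo, len(mino)), mino, minority)


def smote(X, y, rng, k=5, minority=1):
    """Simplified SMOTE: interpolate between a minority row and one of its
    k nearest minority neighbours until the classes match in size.
    Needs at least two minority rows."""
    mino, majo = split_by_class(X, y, minority)
    synthetic = []
    for _ in range(len(majo) - len(mino)):
        base = rng.choice(mino)
        others = [p for p in mino if p is not base]
        # Nearest minority neighbours by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return relabel(majo, mino + synthetic, minority)


# Hypothetical toy data: 90 majority rows and 10 minority rows in two dimensions.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(90)] + \
    [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(10)]
y = [0] * 90 + [1] * 10

X_ros, y_ros = random_over_sample(X, y, rng)    # 90 vs 90, minority rows repeated
X_rus, y_rus = random_under_sample(X, y, rng)   # 10 vs 10, majority rows discarded
X_smote, y_smote = smote(X, y, rng)             # 90 vs 90, 80 synthetic minority rows
```

In practice the resampling would be applied to the training split only, so that the evaluation data keep the original class distribution.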

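Results: In the simulations, Isolation Forest, an anomaly detection algorithm based on space partitioning, performed well on both the F1 measure and AU-ROC at every IR. Increasing the sample size raised both indicators slightly, confirming that sample size also affects the classification of class-imbalanced data. Looking at the overall trend, the size of the decrease differed between methods, but both indicators fell for every method as the imbalance became more severe.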

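Conclusion: When the distribution of a response variable taking the value 0 or 1 is highly imbalanced, general classification algorithms classify the minority class somewhat less well. However, several factors influence class-imbalanced data, such as imbalance within a single class and sample size, so it is important to choose a method suited to the structure of the data.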

Abstract (English)

Objectives: Because events of interest occur very rarely in many fields, the classes of the response variable show a highly imbalanced distribution (minority class, majority class). However, general classification algorithms assume a balanced class distribution and aim to minimize the overall misclassification rate. Thus, using these general classification algorithms on class-imbalanced data leads to very poor classification accuracy for the minority class, which is the class of interest. Several methods have been suggested to overcome this class imbalance problem. The purpose of this study is to compare the classification performance of various methods that can correct the class imbalance problem according to the degree of class imbalance (IR).

Methods: In this study, random forest is used as the base classification algorithm, and random-forest-based algorithms are chosen from among the various methods for addressing the class imbalance problem. Random forest was applied to samples modified by ROS, RUS, and SMOTE, which are data-level approaches that adjust the distribution of the sample. Weighted Random Forest, Balanced Random Forest, and Isolation Forest were used as algorithm-level approaches. In addition, because the classification of class-imbalanced data is concerned with classifying the minority class more accurately than the majority class, accuracy, the usual model performance measure, cannot be used to compare models. Therefore, the F1 measure and AU-ROC were used as the performance evaluation measures.
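The two evaluation measures named above can be computed from first principles, as the following sketch shows. It treats the minority class (label 1) as the positive class and uses the rank interpretation of AU-ROC: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The labels, scores, and the 0.5 threshold are hypothetical.

```python
def f1_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no positive case was found, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def au_roc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores: 8 majority cases, 2 minority cases.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.5, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold chosen for illustration

f1 = f1_measure(y_true, y_pred)             # 0.8: precision 2/3, recall 1
auc = au_roc(y_true, scores)                # 0.9375: 15 of 16 pairs ranked correctly
f1_trivial = f1_measure(y_true, [0] * 10)   # 0.0, although accuracy would be 0.8
```

The F1 measure depends on the chosen threshold, whereas AU-ROC depends only on how the scores rank the cases, so the two measures can disagree about which method is better.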

Results: The simulation results show that Isolation Forest, an anomaly detection algorithm based on spatial partitioning, performs best on both the F1 measure and AU-ROC at all IRs. In addition, both indicators increase slightly as the sample size increases, indicating that sample size also affects the classification of class-imbalanced data. Overall, although the size of the decrease differed between methods, the values of both indicators decreased for all methods as the degree of imbalance increased.

Conclusion: When the distribution of a response variable taking the value 0 or 1 is highly imbalanced, the proportion of minority-class cases that general classification algorithms classify correctly is somewhat lower. However, many factors affect the class imbalance problem, such as within-class imbalance and sample size, so it is important to select a method appropriate to the structure of the data.

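Thesis information

Author	문선영
Degree-granting institution	고려대학교 (Korea University)
Degree	Master's (domestic)
Department	의학통계학협동과정 (Interdisciplinary Program in Medical Statistics)
Advisor	안형진
Year of publication	2018
Total pages	vi, 48 leaves
Language	kor
Full-text URL	http://www.riss.kr/link?id=T14704686&outLink=K
Source	한국교육학술정보원 (Korea Education and Research Information Service, KERIS)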
