[Thesis] Performance comparison of imputation methods using machine learning techniques for ordinal missing data
(Korean title: 기계학습방법을 이용한 순서형 결측자료 대체의 성능 비교)

손세림 (Korea University, Interdisciplinary Program in Medical Statistics [의학통계학협동과정], master's thesis)

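Abstract (translated from the Korean 초록)

Objectives: Missing values arise for a variety of reasons during data collection. If data containing missing values are analyzed as they are, bias may result. The usual remedy is imputation, which fills in the missing values. An important consideration in imputation is the type of the variable. For ordinal variables, the representative approach is the ordinal logistic method. Because it is parametric, however, it has limitations, and a nonparametric method that can overcome them is needed. This thesis therefore imputed missing data using the ordinal logistic method and nonparametric machine learning methods, and then compared and evaluated the performance of the methods.

Methods: Three methods for imputing ordinal missing data were compared. The parametric method was the ordinal logistic method. Among machine learning methods, missing data were imputed with an ordinal decision tree, which takes the ordering of the categories into account when classifying. In addition, a random forest was used after discarding the ordering and treating the variable as nominal. The results were compared on seven criteria: empirical bias, empirical mean squared error, Somers' D statistic, linear weighted kappa agreement, quadratic weighted kappa agreement, prediction rate, and category-wise proportions.

Results: With five categories, the machine learning methods performed better on all criteria when the intercept had a small effect on the model. When the intercept had a large effect, random forest performed better in terms of predictive performance, but the ordinal logistic method performed better in terms of empirical bias and empirical mean squared error. The results with three categories were similar. When the effect of the intercept was small, the machine learning methods performed better; when it was large, the results differed by missing rate. At a relatively low missing rate, random forest was better on all criteria, whereas at a high missing rate the ordinal logistic method was better on empirical bias and empirical mean squared error.

Conclusion: The appropriate method for imputing ordinal missing data can differ depending on the number of categories, the missing rate, and the magnitude of the parameters in the model. When the effect of the intercept is small, machine learning methods are recommended regardless of the number of categories and the missing rate. The choice between the ordinal decision tree and random forest can then be made according to whether the evaluation focuses on empirical bias and empirical mean squared error or on predictive performance. When the effect of the intercept is large, the recommended method depends on the evaluation criterion: the ordinal logistic method if the focus is on empirical bias or empirical mean squared error, and random forest if the focus is on predictive performance.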

Abstract (author's English version)

Objectives: Missing values arise in a variety of ways during data collection. If data with missing values are analyzed as they are, bias may occur. The most commonly used remedy is to impute the missing values. An important consideration in imputation is the type of the variable. For ordinal response variables, the standard choice is the ordinal (cumulative) logistic method. However, because this method is parametric it has limitations, and a nonparametric method is needed to overcome them. In this thesis, we therefore impute missing data using the ordinal logistic method and nonparametric machine learning techniques, and compare and evaluate the performance of the methods.

Methods: This thesis compares three methods for imputing ordinal missing data. First, we use the ordinal logistic method, which is the most commonly used method for ordinal variables. Among machine learning techniques, an ordinal decision tree is used to impute the missing data. In addition, a random forest is used after discarding the ordering and treating the variable as nominal. The results are compared on seven criteria: empirical bias, empirical mean squared error, Somers' D statistic, linear weighted kappa agreement, quadratic weighted kappa agreement, prediction rate (accuracy), and category-wise proportions.
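Three of the criteria listed above (linear weighted kappa, quadratic weighted kappa, and accuracy) can be computed directly from the true and imputed categories. The following is a minimal sketch, not the thesis's code: the function names `weighted_kappa` and `accuracy` are illustrative, categories are assumed to be coded as integers 0 to K-1, and the toy data are made up purely for demonstration.

```python
from collections import Counter


def weighted_kappa(truth, imputed, n_categories, weighting="linear"):
    """Cohen's weighted kappa for ordinal labels coded 0..n_categories-1.

    Disagreement weights are |i - j| / (K - 1) for "linear" and the
    square of that quantity for "quadratic".
    """
    if len(truth) != len(imputed) or not truth:
        raise ValueError("truth and imputed must be non-empty and equally long")
    power = {"linear": 1, "quadratic": 2}[weighting]
    n = len(truth)
    k = n_categories
    observed = Counter(zip(truth, imputed))  # joint counts of (true, imputed)
    row = Counter(truth)                     # marginal counts of true labels
    col = Counter(imputed)                   # marginal counts of imputed labels

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power
            observed_disagreement += w * observed[(i, j)] / n
            expected_disagreement += w * (row[i] / n) * (col[j] / n)

    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when only one category occurs")
    return 1.0 - observed_disagreement / expected_disagreement


def accuracy(truth, imputed):
    """Share of imputed values that exactly match the true category."""
    return sum(t == p for t, p in zip(truth, imputed)) / len(truth)


# Toy example with three ordered categories (hypothetical data).
truth = [0, 0, 1, 1, 2, 2]
imputed = [0, 1, 1, 1, 2, 0]
print(weighted_kappa(truth, imputed, 3, "linear"))
print(weighted_kappa(truth, imputed, 3, "quadratic"))
print(accuracy(truth, imputed))
```

Unlike accuracy, the weighted kappas penalize an imputed value more the further it lies from the true category, which is why they are natural companions to accuracy for ordinal outcomes.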

Results: With five categories, the machine learning techniques performed better on all criteria when the intercept had a small effect on the model. When the intercept had a large effect on the model, random forest was better in accuracy, but the ordinal logistic method was better in terms of empirical bias and empirical mean squared error. Similar results were obtained with three categories. The machine learning techniques performed better when the effect of the intercept was small; when it was large, the results depended on the missing rate. Random forest was better on all criteria at a low missing rate, but the ordinal logistic method was better in empirical bias and empirical mean squared error at a high missing rate.
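The "intercept" in these results presumably refers to the category-specific intercepts (cutpoints) of the cumulative logit model. The abstract does not state the parameterization used in the thesis; one common form, given here as an assumption with symbols that do not appear in the source, is:

```latex
\[
\operatorname{logit} P(Y \le j \mid \mathbf{x})
  = \log \frac{P(Y \le j \mid \mathbf{x})}{1 - P(Y \le j \mid \mathbf{x})}
  = \alpha_j + \mathbf{x}^{\top}\boldsymbol{\beta},
\qquad j = 1, \dots, J-1,
\qquad \alpha_1 \le \alpha_2 \le \dots \le \alpha_{J-1}.
\]
```

Under this form, five categories (J = 5) give four intercepts and three categories give two, while the slope vector is shared across categories (the proportional odds assumption). Some software writes the linear predictor with the opposite sign on the slope term, so the convention should be checked before comparing coefficients.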

Conclusion: The appropriate method for imputing ordinal missing data depends on the number of categories, the missing rate, and the effect of the parameters in the model. If the effect of the intercept is small, machine learning techniques are recommended irrespective of the number of categories and the missing rate. The choice between the ordinal decision tree and random forest can be made according to whether the evaluation focuses on empirical bias and empirical mean squared error or on accuracy. If the effect of the intercept is large, the method to use depends on the evaluation criteria: the ordinal logistic method is recommended when the focus is on empirical bias and empirical mean squared error, and random forest is recommended when the focus is on accuracy.

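Thesis information

Author	손세림
Degree-granting institution	고려대학교 (Korea University)
Degree	Master's (domestic)
Department	의학통계학협동과정 (Interdisciplinary Program in Medical Statistics)
Advisor	안형진
Year of publication	2019
Total pages	vi, 49 leaves
Language	Korean
Full-text URL	http://www.riss.kr/link?id=T15063102&outLink=K
Source	한국교육학술정보원 (Korea Education and Research Information Service)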


