[Thesis] Predicting the occurrence of patients with conjunctivitis in Korea using Google Trends
(한국에서 구글 트렌드를 활용한 결막염 환자 수 발생 예측)

Soo Jin Kim (김수진), Department of Health Policy and Hospital Management, Korea University (domestic master's thesis)

Abstract (Korean)

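Objectives: To verify whether big data can be used as medical data, this study predicts the number of patients with conjunctivitis using Google Trends data. In doing so, it presents a preprocessing and analysis methodology for Google Trends data and discusses how machine learning can be applied to find patterns in nonlinear data. The aim is to confirm the feasibility of adopting big data as medical data.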

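Methods: The dependent variable was the cumulative monthly patient count for conjunctivitis (disease code H10), by inpatient and outpatient setting, taken from the Health Insurance Review and Assessment Service (HIRA) health care big data open system for January 2010 through May 2017. Google Trends data were collected to match the criteria of that system. To predict the dependent variable, 34 Google Trends series were collected for keywords extracted through a literature review and used as independent variables. Multicollinearity among these variables was checked with the VIF and correlation coefficients, and variables were merged or removed to select the final set. The five final independent variables were allergy, air pollution, yellow dust, fine dust, and swimming pool; each was computed as a representative value combining several variables so that it would be representative of its keyword. These variables were analyzed with linear regression, support vector regression, decision tree regression, random forest regression, and gradient boosting regression. Model fit was measured with R² and RMSE. Statistical analysis was carried out with IBM SPSS Statistics 21 and Python.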
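The multicollinearity check mentioned in the methods can be sketched as follows. This is a minimal illustration, not the thesis's code: the `vif` helper, the synthetic series, and the rule-of-thumb cutoff of 10 are all assumptions, since the abstract does not state which threshold was applied or how variables were merged.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X.

    VIF_j = 1 / (1 - R2_j), where R2_j is obtained by regressing
    column j on all the other columns (with an intercept).
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        factors.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return factors

# Synthetic stand-ins for keyword series (NOT the thesis data).
rng = np.random.default_rng(0)
n_months = 89                                  # Jan 2010 to May 2017
a = rng.normal(size=n_months)
b = a + 0.05 * rng.normal(size=n_months)       # nearly a copy of a
c = rng.normal(size=n_months)                  # unrelated series
vifs = vif(np.column_stack([a, b, c]))

# Assumed rule of thumb: flag columns with VIF above 10 for merging or removal.
flagged = [j for j, v in enumerate(vifs) if v > 10]
print(vifs, flagged)
```

In this toy setup the two near-duplicate series are flagged while the unrelated one is not, which mirrors the idea of merging or dropping overlapping keywords before fitting.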

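Results: On the full training data, random forest performed best, with R² = 0.64 and RMSE = 59444.48, followed by gradient boosting (R² = 0.51, RMSE = 64833.54) and decision tree (R² = 0.43, RMSE = 69757.59). The support vector machine was excluded from the analysis because it could not explain the data (R² < 0). Beyond this analysis, the same ordering (random forest, gradient boosting, decision tree) held in the analyses by training-data length and in most test-data splits. To check the model's predictive accuracy, predicted values for a one-year test period were compared with the actual patient counts and showed a correlation of about 0.822. In addition, the random forest feature importances indicated that, among the independent variables, swimming pool and yellow dust had the greatest influence on predicting conjunctivitis patient counts.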
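A negative R² may look surprising, so here is a small sketch of the two fit measures using their standard definitions: R² = 1 - SS_res / SS_tot and RMSE = sqrt(mean squared error). The function names and the four-point example are hypothetical and unrelated to the thesis data. R² drops below zero whenever a model's squared error exceeds that of simply predicting the mean.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical values for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
mean_only = [250.0] * 4                     # always predict the mean
poor = [400.0, 300.0, 200.0, 100.0]         # predicts the opposite trend

print(r2(actual, mean_only))   # 0.0: no better than the mean
print(r2(actual, poor))        # -3.0: worse than the mean
print(rmse(actual, poor))
```

Because RMSE is expressed in patient counts, values in the tens of thousands are only meaningful relative to the scale of the monthly totals, which the abstract does not report.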

결론 : 본 연구에서는 구글 트렌드로 국내 결막염 환자 수를 예측하고, 그 값이 2차 자료로써의 대체 가능성에 대해 확인하였다. 또한 빅데이터 특성을 활용해서 질병발생 예측력을 더 높일 수 있는 방안에 대해 고려하였다. 값이 0.64로 높은 정확도로 예측할 수 있었다. 의료 통계 데이터 없이 구글 트렌드와 같은 빅데이터의 활용과 머신러닝 기법을 통해 환자 수의 예측과 같은 비선형의 의료 통계 문제를 해결 할 수 있음을 확인하였다. 따라서, 지속적인 빅데이터의 특성을 활용한 연구를 수행하기 위해서는 여러 정책적인 보완점 및 전략이 필요할 것이다.

Abstract

Predicting the occurrence of patients with conjunctivitis in Korea using Google Trends

Soo Jin Kim

Department of Health Policy and Hospital Management
Graduate School of Public Health, Korea University
(Supervising Professor: Hyeong Sik Ahn, MD, PhD)

Objectives: To verify whether big data can be used as medical data, we conducted a study predicting the number of patients with conjunctivitis using Google Trends. We propose a preprocessing and analysis method for Google Trends data and discuss how machine learning can be applied to find patterns in nonlinear data. The aim is to confirm the feasibility of introducing big data as medical data.

Methods: The dependent variable was the cumulative monthly number of conjunctivitis patients (disease code H10), by inpatient and outpatient setting, from January 2010 to May 2017, taken from the Health Insurance Review and Assessment Service (HIRA) health care big data open system. Google Trends data were collected for the same period. As independent variables, we collected 34 Google Trends series for keywords extracted from a literature review. Multicollinearity among the independent variables was checked with the VIF and correlation coefficients, and variables were merged or removed accordingly. The final independent variables were allergy, air pollution, yellow dust, fine dust, and swimming pools; each was calculated as a representative value of several variables so that it would represent the relevant keywords. These variables were then analyzed with linear regression, Support Vector Machine regression, Decision Tree regression, Random Forest regression, and Gradient Boosting regression. R² and RMSE were used to measure the fit of the estimated models. All data were analyzed using IBM SPSS Statistics 21 and Python.
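The model comparison described above could be set up roughly as follows. This is a sketch under stated assumptions, not the thesis's pipeline: it assumes scikit-learn (the abstract only says Python), uses synthetic data in place of the five keyword indices and the HIRA counts, keeps default hyperparameters, and scores a held-out final 12 months, whereas the abstract's headline figures are reported for the training data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the thesis data): 89 months, 5 keyword indices.
rng = np.random.default_rng(0)
n_months = 89
X = rng.uniform(0, 100, size=(n_months, 5))
y = (300_000 + 2_000 * X[:, 4] + 30 * X[:, 2] ** 2
     + rng.normal(0, 20_000, n_months))

# Chronological split: hold out the last 12 months as the test period.
X_train, X_test = X[:-12], X[-12:]
y_train, y_test = y[:-12], y[-12:]

models = {
    "linear": LinearRegression(),
    "svr": SVR(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "r2": float(r2_score(y_test, pred)),
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
    }

for name, s in scores.items():
    print(f"{name:10s} R2={s['r2']:.2f} RMSE={s['rmse']:.1f}")

# Feature importances of the fitted random forest (they sum to 1).
importances = models["forest"].feature_importances_
```

The split is chronological rather than random because the data are a monthly time series; shuffling would let the model see months that come after the ones it is asked to predict.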

Results: On the full training data, Random Forest performed best (R² = 0.64, RMSE = 59444.48), followed by Gradient Boosting (R² = 0.51, RMSE = 64833.54) and Decision Tree (R² = 0.43, RMSE = 69757.59). The Support Vector Machine was excluded from the analysis because it could not explain the data (R² < 0). Beyond this analysis, the same ordering (Random Forest, Gradient Boosting, Decision Tree) held in the analyses by training-data length and in most test-data splits. To confirm the prediction accuracy of the model, the predicted values for a 1-year test period were compared with the actual number of patients, showing a correlation of about 0.822. In addition, the Random Forest feature importances indicated that swimming pools and yellow dust had the greatest influence on predicting the occurrence of conjunctivitis patients.
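The comparison of predicted and actual counts over the test year amounts to a correlation between two 12-value series. The sketch below assumes a Pearson correlation (the abstract does not name the coefficient) and uses made-up monthly values, so it does not reproduce the reported 0.822.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical 12-month series (thousands of patients), for illustration only.
actual = [310, 295, 330, 380, 420, 410, 450, 520, 480, 400, 340, 320]
predicted = [330, 320, 325, 360, 400, 430, 440, 470, 460, 410, 370, 335]

r = pearson(actual, predicted)
print(round(r, 3))
```

Correlation is unaffected by a constant offset or rescaling of the predictions, so a high value shows that the model tracks the seasonal shape but says little about the size of the errors; that is what RMSE captures.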

Conclusion: In this study, we predicted the number of conjunctivitis patients in Korea using Google Trends and examined whether the predicted values could substitute for secondary data. We also considered how the characteristics of big data could be used to further improve the prediction of disease occurrence. The best model reached R² = 0.64, which we regard as high predictive accuracy. The results indicate that nonlinear medical statistics problems, such as predicting patient numbers, can be addressed with big data such as Google Trends and machine learning, without medical statistical data. Sustained research that exploits the characteristics of big data will therefore require a range of policy supports and strategies.

Key words: Big Data, Machine Learning, Conjunctivitis, Google Trends, Prediction

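Thesis information

Author	Soo Jin Kim (김수진)
Degree-granting institution	Korea University
Degree	Master's (domestic)
Department	Department of Health Policy and Hospital Management
Advisor	Hyeong Sik Ahn (안형식)
Year of publication	2018
Total pages	vi, 64 leaves
Language	Korean (kor)
Full-text URL	http://www.riss.kr/link?id=T14924207&outLink=K
Source	Korea Education and Research Information Service (한국교육학술정보원)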

