[Thesis] Cluster Analysis of Categorical Data Using Word Embedding Techniques: Focusing on High-Dimensional Data

조현

워드 임베딩 기법을 활용한 범주형 데이터의 군집분석 : 고차원 데이터를 중심으로
Cluster Analysis on Categorical Data using Word Embedding Method : Focused on a High Dimensional Data

조현 (Kookmin University Graduate School, Data Science major, domestic master's thesis)

Abstract (Korean)

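Cluster analysis, one of the representative unsupervised learning methods in machine learning, groups similar data together and is used in a variety of fields such as marketing, engineering, and medicine (Wilson et al., 2011). Research on cluster analysis continues steadily today, but most of it addresses numerical data, and research on clustering categorical data has not been active (Mingoti et al., 2012).
This study therefore proposes a method for clustering categorical data: convert the data to numerical form using word embedding techniques, which are mainly used in text analysis, and then apply a clustering method designed for numerical data.
For the word embedding step, we applied each of the most widely used techniques, Word2vec, FastText, and GloVe, and compared how performance differs among them.
We also verified the proposed model by comparing its performance with existing clustering models for categorical data, using the best-known methods such as K-mode and ROCK as baselines.
To understand how model performance changes with the structure of the data, we generated data under various conditions through simulation and compared model performance for each condition. Furthermore, to assess whether the models also cluster well in practice, we evaluated their performance on real data.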
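The pipeline described above (categorical records, then embedding vectors, then a numerical clustering method) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not say how records are tokenized or how token vectors are combined, so the `column=value` tokens and the per-row mean are assumptions. A simple co-occurrence count vector stands in for Word2vec, FastText, or GloVe so that the sketch needs only the standard library. All function names and the toy records are hypothetical.

```python
from itertools import combinations


def rows_to_sentences(rows, columns):
    """Assumption: each record becomes a 'sentence' of column=value tokens."""
    return [[f"{c}={v}" for c, v in zip(columns, row)] for row in rows]


def cooccurrence_embedding(sentences):
    """Stand-in embedding (not Word2vec/FastText/GloVe): a token's vector
    holds its within-record co-occurrence counts with every vocabulary token."""
    vocab = sorted({tok for sent in sentences for tok in sent})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = {tok: [0.0] * len(vocab) for tok in vocab}
    for sent in sentences:
        for a, b in combinations(sent, 2):
            vectors[a][index[b]] += 1.0
            vectors[b][index[a]] += 1.0
    return vectors


def row_vectors(sentences, vectors):
    """Assumption: a record is represented by the mean of its token vectors."""
    out = []
    for sent in sentences:
        dim = len(vectors[sent[0]])
        out.append([sum(vectors[t][d] for t in sent) / len(sent) for d in range(dim)])
    return out


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy records invented for illustration only.
columns = ["dept", "ward", "payer"]
rows = [
    ("surgery", "A", "private"),
    ("surgery", "A", "private"),
    ("surgery", "A", "public"),
    ("pediatrics", "B", "public"),
    ("pediatrics", "B", "public"),
    ("pediatrics", "B", "private"),
]
sentences = rows_to_sentences(rows, columns)
points = row_vectors(sentences, cooccurrence_embedding(sentences))
labels = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the stand-in embedding would be replaced by vectors from a trained Word2vec, FastText, or GloVe model; the tokenization and clustering steps stay the same.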

Abstract

Clustering algorithms are techniques for grouping similar data and have been used in a variety of fields such as engineering, medicine, and marketing. There are many studies on cluster analysis, but the majority concern algorithms for numerical data.
In this study, we propose a method that transforms categorical data into numerical data using word embedding. We used three word embedding models (Skip-gram, FastText, GloVe) and compared their performance with algorithms for categorical data (K-mode, ROCK).
To determine how model performance depends on the structure of the data, we generated data under different conditions and, furthermore, evaluated model performance on real data. We used the Silhouette score and the Adjusted Rand score for performance evaluation.
In the simulation, we compared model performance by the number of categories and the number of data points. As a result, embedding with GloVe showed the best performance except where the number of categories is high and the number of data points is low.
We also compared model performance on real hospital care data, where performance ranked, from best to worst, as K-means using GloVe embedding, K-mode, K-means using word2vec, and K-means using FastText.
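For reference, the Adjusted Rand score named above can be computed from the contingency table of two labelings with the standard pair-counting formula. The sketch below is a from-scratch illustration, not the code used in the thesis (which the abstract does not describe). The function name `adjusted_rand_index` is hypothetical, and returning 1.0 for degenerate inputs is an assumed convention.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # assumed convention: no pairs to disagree on

    # Contingency table cells and its row/column marginals.
    cell_counts = Counter(zip(labels_true, labels_pred))
    row_counts = Counter(labels_true)
    col_counts = Counter(labels_pred)

    # Number of item pairs placed together in each cell / row / column.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = 0.5 * (sum_rows + sum_cols)        # largest attainable agreement
    if maximum == expected:
        return 1.0  # assumed convention for trivial partitions
    return (sum_cells - expected) / (maximum - expected)


# Identical partitions score 1.0 even when the label names differ.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

The score is 1.0 for identical partitions, close to 0 for agreement at chance level, and negative when agreement is worse than chance, which is why it suits comparing clusterings against known group labels in simulated data.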

Thesis information

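Author	조현
Degree-granting institution	Kookmin University Graduate School
Degree type	Domestic master's
Department	Data Science major
Advisor	정여진
Year of publication	2019
Total pages	vi, 32 p.
Keywords	cluster analysis, word embedding, word2vec, FastText, Glove, K-means, K-mode
Language	kor
Full-text URL	http://www.riss.kr/link?id=T15372496&outLink=K
Source	Korea Education and Research Information Service (KERIS)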
