[Thesis] 텍스트 마이닝을 활용한 「토끼전」 이본 분석

Analysis of Tokkijeon Versions Using Text-Mining

이정림 (Master's thesis, Department of Library and Information Science, Graduate School, Pusan National University)

Abstract (from the Korean)

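In the era of the Fourth Industrial Revolution, vast amounts of data are generated and circulated. This has driven the development of data mining techniques for managing and analyzing big data quickly. Existing research on the version lineages of classical novels has relied on qualitative methods in which the researcher analyzes the texts directly, which demands considerable time and effort. This study proposes a new approach to bibliographic research by combining the scholarly foundation of traditional textual bibliography with the methodology of text mining.
The purpose of this study is to use text mining to analyze the words occurring in the versions of 「토끼전」 (Tokkijeon) and to attempt a classification of those versions, thereby testing a new possibility for version classification research. If comparison with existing classification research confirms the validity of this study, text-mining-based version analysis could serve as a new methodology for bibliographic research.
Text mining analysis was carried out in R on the 64 versions collected in 『토끼전 전집』 (Series of Tokkijeon). The analysis comprised word frequency analysis of the versions, analysis of occurring words by named-entity type, and hierarchical cluster analysis based on content similarity measured as Euclidean distance. Finally, the text-mining-based classification was compared with an existing researcher's classification to examine whether it can be as reliable and valid as the established qualitative methods.
Among the 100 most frequent words in the Tokkijeon versions, many refer to persons and places. Person nouns accounted for the largest share of the top 100 words, at 44.47%, followed by office-related nouns and place nouns. In the versions that are pansori librettos (창본), words tied to features peculiar to pansori (거동보소, names of jangdan rhythmic patterns) appear frequently. Word frequency analysis also reveals the script format of a version: Hangul, classical Chinese, or mixed Korean and Chinese script.
The hierarchical cluster analysis grouped the 64 Tokkijeon versions into seven clusters. To check the validity of this result, it was compared with Kim Dong-Gun's (김동건) study of version lineages, which is similar in method and in the texts covered. Of Kim's lineages, the Sugungga (수궁가), Sim Jung-Soon libretto (심정순 창본), Sin Jae-Hyo (신재효본), Sugungrok (수궁록), Tobyulsansurok (토별산수록), Gyeongpan (경판본), Toguijeon (토긔젼), and Tosaengjeon (토생전) lineages all matched the cluster analysis and formed distinct clusters. Kim's Jungsanmangwoljeon (중산망월전) and Byeoltoga (별토가) lineages were not gathered into a single cluster but were scattered across two clusters. Versions whose lineages are too intermixed to be assigned to any one lineage were either absorbed into a single cluster (cluster 3-4) or formed an independent cluster (cluster 4). The text-mining cluster analysis therefore agrees in large part with the existing researcher's findings.
Version analysis with text mining should make more objective lineage research possible. It should be especially helpful for works that are hard to assign to a particular lineage under existing classifications, or that different researchers classify differently. Word frequency analysis can also reveal the distributional characteristics of individual versions or of the corpus as a whole. Presenting the analysis with visualization tools such as word clouds and dendrograms is expected to improve users' access to information. Finally, the same techniques could be applied to the large accumulated body of other classical novel versions.
This study is significant in that it proposes a new methodology for bibliographic research by applying a quantitative, text-mining-based method to version classification. However, few text mining studies of Korean literature exist, so there was little prior work to draw on. In addition, because Korean is an agglutinative language, the analysis relies on extracting nouns, and whether it is valid to infer a work's content and classify its lineage from nouns alone needs further discussion. More accurate research will become possible as follow-up text mining studies of Korean literature continue and higher-performance packages for Korean text processing are developed. Named entity recognition research on old documents, using rule-based or machine learning methods, would also aid the development of digital content from old documents.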

Abstract

In the era of the Fourth Industrial Revolution, a huge amount of data is being created and distributed. As a result, data mining technology for managing and analyzing big data has developed. Previous studies on the classification of versions of classical novels have used qualitative methods in which the researcher analyzes the texts directly, which requires a great deal of time and effort. This study proposes a new approach to bibliographic research by combining text mining with the academic foundation of traditional textual bibliography.
The purpose of this study is to propose a new methodology for textual bibliography by using text mining to analyze the words in the various versions of 「Tokkijeon」 and to cluster the versions with hierarchical cluster analysis.
Text mining analysis was performed in R on the 64 texts in 『Series of Tokkijeon』. Word frequency analysis, named entity analysis, and hierarchical cluster analysis based on content similarity (Euclidean distance) were performed. In addition, the reliability and validity of the text-mining-based version classification were examined by comparing it with previous studies.
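The clustering step described above can be illustrated with a minimal sketch. The thesis ran its analysis in R; this sketch uses Python with only the standard library, and the toy token lists, the version labels A to D, and the choice of average linkage are all assumptions made for illustration (the abstract does not state which linkage was used).

```python
import math
from collections import Counter

def term_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in one version."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def euclidean(a, b):
    """Euclidean distance between two term-frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(vectors, n_clusters):
    """Average-linkage agglomerative clustering (linkage is an assumption).

    Starts with one cluster per vector and repeatedly merges the two
    closest clusters until n_clusters remain. Returns lists of indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical toy corpus: nouns already extracted from four versions.
docs = {
    "A": ["tokki", "jara", "yongwang", "tokki"],
    "B": ["tokki", "jara", "yongwang", "tokki", "jara"],
    "C": ["sugung", "yongwang", "sugung", "sugung"],
    "D": ["sugung", "yongwang", "sugung"],
}
names = sorted(docs)
vocab = sorted({w for toks in docs.values() for w in toks})
vectors = [term_vector(docs[n], vocab) for n in names]
labelled = [[names[i] for i in c] for c in agglomerate(vectors, 2)]
print(labelled)
```

Raw counts make longer texts look more distant from shorter ones, so a real analysis might normalize the vectors first; whether the thesis did so is not stated in the abstract.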
The list of the 100 most frequent words shows that many words refer to people and places. Among the top 100 words, person nouns had the highest proportion at 44.47%, followed by office-related nouns and place nouns. In the pansori versions, words related to pansori's narrative characteristics appear frequently. Furthermore, the script format of a text can be identified through word frequency analysis.
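One plausible way to arrive at a figure such as the share of person nouns is to sum occurrences per named-entity category over the top-N words. The sketch below assumes that reading; the abstract does not say whether the 44.47% figure counts occurrences or distinct words. The token list and the category lexicon are hypothetical toy data, and the function name `category_shares` is introduced only for illustration.

```python
from collections import Counter

def category_shares(tokens, lexicon, top_n=100):
    """Percentage of occurrences per category among the top_n frequent words.

    Words missing from the lexicon are counted under "other".
    """
    top = Counter(tokens).most_common(top_n)
    total = sum(count for _, count in top)
    per_category = Counter()
    for word, count in top:
        per_category[lexicon.get(word, "other")] += count
    return {cat: round(100 * c / total, 2) for cat, c in per_category.items()}

# Hypothetical extracted nouns and a hand-made category lexicon.
tokens = (["tokki"] * 5 + ["jara"] * 3 + ["yongwang"] * 2
          + ["sugung"] * 4 + ["gan"] * 6)
lexicon = {
    "tokki": "person",
    "jara": "person",
    "yongwang": "office",
    "sugung": "place",
}
shares = category_shares(tokens, lexicon)
print(shares)
```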
The hierarchical cluster analysis grouped the 64 「Tokkijeon」 versions into seven clusters. A comparison with Kim Dong-Gun's version classification study found that the Sugungga versions, the Sim Jung-Soon libretto versions, the Sin Jae-Hyo versions, the Sugungrok versions, the Tobyulsansurok versions, the Seoul woodblock-printed (Gyeongpan) versions, the Toguijeon versions, and the Tosaengjeon versions were consistent with the cluster analysis and formed distinct clusters. The Jungsanmangwoljeon and Byeoltoga versions in Kim Dong-Gun's study were not grouped into one cluster but were spread across two clusters. Versions that cannot be assigned to a lineage because of severe mixing between lineages were either included in one cluster (cluster 3-4) or formed an independent cluster (cluster 4). The comparison shows that the text-mining cluster analysis of the 「Tokkijeon」 versions is substantially consistent with the previous study.
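The comparison above asks, for each lineage in the reference classification, whether its versions land in one cluster or are spread over several. A minimal sketch of that check follows; the version identifiers, cluster numbers, and lineage assignments are hypothetical toy values, not data from the thesis.

```python
from collections import defaultdict

def lineage_spread(reference, clusters):
    """Map each reference lineage to the sorted cluster ids its versions fall in."""
    spread = defaultdict(set)
    for version, lineage in reference.items():
        spread[lineage].add(clusters[version])
    return {lineage: sorted(ids) for lineage, ids in spread.items()}

# Hypothetical reference classification and cluster assignment.
reference = {
    "v1": "Sugungga",
    "v2": "Sugungga",
    "v3": "Byeoltoga",
    "v4": "Byeoltoga",
}
clusters = {"v1": 1, "v2": 1, "v3": 2, "v4": 3}

spread = lineage_spread(reference, clusters)
# Lineages whose versions all share a single cluster.
intact = [lineage for lineage, ids in spread.items() if len(ids) == 1]
print(spread, intact)
```

A lineage that stays within one cluster may still share that cluster with other lineages, so a fuller comparison would also inspect each cluster's composition.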
Version analysis using text mining will make more objective classification research possible. In particular, it will help with works that are difficult to assign to a specific lineage or that are classified differently by different researchers. Word frequency analysis will also make it possible to grasp the distributional characteristics of the texts. In addition, presenting the analysis with visualization tools such as word clouds and dendrograms will improve users' access to information. Lastly, text mining techniques can be applied to other classical novel texts.
This study is meaningful in that it proposes a new methodology for bibliographic research by applying a quantitative research method using text mining. However, there are few text mining studies on Korean classical literature, so few studies could be referenced. Also, because Korean is an agglutinative language, the analysis relies on extracting nouns, so it is necessary to discuss whether it is appropriate to infer the content of a text and classify its versions using only the nouns that appear in it. If follow-up text mining studies on Korean literature continue and high-performance packages for Korean text processing are developed, more accurate research will be possible. In addition, if named entity recognition studies using rule-based or machine learning methods are conducted on classical literature, they will help in the development of digital content for classical literature.

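Thesis information

Author	이정림
Degree-granting institution	Pusan National University Graduate School
Degree	Master's (domestic)
Department	Library and Information Science
Advisor	송정숙
Year	2022
Pages	vi, 80 leaves
Keywords	text mining, cluster analysis, Tokkijeon (토끼전), versions (이본)
Language	kor
Full-text URL	http://www.riss.kr/link?id=T16465071&outLink=K
Source	Korea Education and Research Information Service (한국교육학술정보원)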
