[Paper] A Study on Unstructured Text Data Post-processing Methodology Using a Stopword Thesaurus

이원조

doi:10.17703/jcct.2023.9.6.935

A Study on Unstructured Text Data Post-processing Methodology Using Stopword Thesaurus

Journal of the Convergence on Culture Technology (JCCT) = 문화기술의 융합, v.9 no.6, 2023, pp.935-940

이원조 (Department of Smart Manufacturing Engineering, Ulsan College)

Abstract

Most text data collected through web scraping for artificial intelligence and big data analysis is large and unstructured, so it must be refined (cleansed) before it can be analyzed. The data becomes analyzable, structured data by passing through a heuristic pre-processing refining step and a machine post-processing refining step. In the machine post-processing step, a Korean dictionary and a stopword dictionary are used to extract the vocabulary for the frequency analysis that underlies word cloud analysis, but some stopwords survive this step. This study proposes a methodology for applying a "user-defined stopword thesaurus" to remove those remaining stopwords efficiently. Using R's word cloud technique, a case analysis applies the proposed "user-defined stopword thesaurus" technique to complement the shortcomings of the existing "stopword dictionary" method. The advantages and disadvantages of the refining methods are compared and verified, and the usefulness of the proposed methodology for practical application is suggested.
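The two-stage stopword removal described in the abstract can be illustrated with a minimal sketch. The paper works in R; the example below uses Python purely for illustration and is not the author's code. The function names (`remove_stopwords`, `top_words`), the sample tokens, and both stopword sets are hypothetical, and the thesaurus is assumed to be a simple set of user-chosen words.

```python
from collections import Counter


def remove_stopwords(tokens, stopwords):
    """Return the tokens that do not appear in the given stopword set."""
    return [t for t in tokens if t not in stopwords]


def top_words(tokens, general_stopwords, user_thesaurus, n=30):
    """Apply both removal stages, then return the n most frequent words."""
    # Stage 1: general-purpose stopword dictionary.
    stage1 = remove_stopwords(tokens, general_stopwords)
    # Stage 2: user-defined thesaurus removes stopwords that survived stage 1.
    stage2 = remove_stopwords(stage1, user_thesaurus)
    return Counter(stage2).most_common(n)


# Hypothetical tokens, standing in for vocabulary already extracted from a speech.
tokens = ["people", "we", "freedom", "people", "thing",
          "freedom", "everyone", "people", "we"]
general_stopwords = {"thing"}          # assumed general-purpose dictionary
user_thesaurus = {"we", "everyone"}    # assumed user-defined additions

result = top_words(tokens, general_stopwords, user_thesaurus, n=30)
print(result)  # [('people', 3), ('freedom', 2)]
```

Keeping the user-defined thesaurus separate from the general dictionary means the domain-specific list can be extended for each corpus without editing the shared dictionary or the source code.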

Tables/Figures (11)

Figure 1. Proposed data purification model for text data
Figure 2. Basic environment settings for R word cloud analysis
Figure 3. Vocabulary extraction using the Korean dictionary
Figure 4. Removal using the in-source-code stopword entry method
Figure 5. Removal using the general-purpose stopword dictionary method
Figure 6. Removal using the user-defined stopword thesaurus method
Figure 7. Extraction of the 30 most frequent words
Table 1. Comparison of the advantages and disadvantages of the refining methods
Figure 8. User-defined stopword thesaurus dataset
Figure 9. Visualization analysis results for the inauguration speech
Figure 10. Visualization analysis results for the Liberation Day commemorative speech

References (12)

W. Lee, A Study on the Use of Stopword Corpus for Cleansing Unstructured Text Data, JCCT, Vol. 8, No. 6, pp. 891-897, 2022. DOI: 10.17703/JCCT.2022.8.6.891
W. Lee, A Study on Data Cleansing Techniques for Word Cloud Analysis of Text Data, JCCT, Vol. 7, No. 4, pp. 745-750, 2021. DOI: 10.17703/JCCT.2021.7.4.745
W. Lee, A Study on Word Cloud Techniques for Analysis of Unstructured Text Data, JCCT, Vol. 6, No. 3, pp. 337-341, 2020. DOI: 10.17703/JCCT.2020.6.4.715
Kumar, P. Thakur, K. Gupta, and A. Pal, Text mining approach to analyse the relation between obesity and breast cancer data, ILNS, 2015.
M. Han, Y. Kim, C. Lee, Analysis of News Regarding New Southeastern Airport Using Text Mining Techniques, Smart Media Journal, Vol. 6, No. 1, 2017.
J. Lee, D. Yun, S. O, C. Lee, A Big Data Analysis of Civil Complaint Texts Using R Language, KIICE, 2020.
Insun Lee and 1 others, Unstructured data analysis and visualization, Korean Psychology Association, 2018.
Jongyong Lee, A Study on Tourism Analysis in Uijeongbu Region Using Big Data, JCCT, Vol. 6, No. 1, pp. 413-419, 2020.
Sunghuk Moon, Big data environment analysis and research on ways to secure global competitiveness, JCCT, Vol. 5, No. 2, pp. 361-367.
Giseop Noh, An Analysis on Internet Information using Real Time Search Words, JCCT, Vol. 4, No. 4, pp. 337-341, 2018.
I. Chun, D. Park, Y. Kang, Python and data science, Saengneun Publishing, pp. 222-233, 2019.
M. Chi, S. Lin, S. Chen, C. Lin, T. Lee, Morphable Word Clouds for Time-Varying Text Data Visualization, IEEE, 2015.
