[Paper] A Study on Data Cleansing Techniques for Word Cloud Analysis of Text Data

이원조

doi:10.17703/jcct.2021.7.4.745


Journal of the Convergence on Culture Technology (JCCT), v.7 no.4, 2021, pp.745-750

이원조 (Department of Industrial Management Engineering, Ulsan College)

Abstract

In big data visualization analysis of unstructured text, raw data is usually large and unstructured, so analysis techniques cannot be applied until it has been cleansed. The collected raw data therefore first passes through a heuristic cleansing step that removes unnecessary data, and then through a machine cleansing step that removes stopwords. Word frequencies are then computed and visualized as a word cloud, from which key issues are extracted and analyzed. This study proposes a new stopword cleansing technique that uses an external stopword set (DB) with the Python word cloud library, and identifies the technique's problems and usefulness through a practical case analysis. Based on this verification, it presents the practical utility of word cloud analysis that applies the proposed cleansing technique.
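
The pipeline the abstract describes (heuristic cleansing, then stopword removal, then frequency counting) might look roughly like the following. This is a minimal sketch using only the Python standard library, not the paper's implementation: the function names, the regular expressions, the tiny `EXTERNAL_STOPWORDS` set, and the English-only tokenization are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for an external stopword set; the paper's actual
# Set(DB) is not reproduced here.
EXTERNAL_STOPWORDS = {"the", "a", "of", "and", "to", "is", "we", "our", "will"}


def heuristic_cleanse(raw_text):
    """Stage 1 (assumed rules): strip tags, URLs, digits and punctuation."""
    text = re.sub(r"<[^>]+>", " ", raw_text)    # drop HTML-like tags
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # keep ASCII letters and whitespace
    return text.lower()


def machine_cleanse(text, stopwords):
    """Stage 2: split on whitespace and remove stopwords."""
    return [token for token in text.split() if token not in stopwords]


def word_frequencies(raw_text, stopwords=EXTERNAL_STOPWORDS):
    """Run both cleansing stages and count the remaining words."""
    tokens = machine_cleanse(heuristic_cleanse(raw_text), stopwords)
    return Counter(tokens)


if __name__ == "__main__":
    sample = "<p>We will make America strong. America will win, and the people will win!</p>"
    print(word_frequencies(sample).most_common(3))
```

The resulting counts are a plain word-to-frequency mapping, which is the kind of input that `WordCloud.generate_from_frequencies` in the `wordcloud` package accepts, should the visualization step be added on top.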

Tables/Figures (11)

Figure 1. Existing unstructured data cleansing process diagram
Figure 2. Proposed unstructured data cleansing process diagram
Figure 3. Executable code passing an argument to the generate function
Figure 4. Executable code for image creation using Matplotlib
Figure 5. How to manually create a set of stopwords
Figure 6. How to use an external stopword text dictionary
Figure 7. External stopword Set(DB)
Figure 8. Visualization of President Trump's inaugural speech (before application)
Figure 9. Visualization of President Biden's inaugural speech (before application)
Figure 10. Visualization of President Trump's inaugural speech (after application)
Figure 11. Visualization of President Biden's inaugural speech (after application)
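
The captions for Figures 5 and 6 contrast a manually built stopword set with one read from an external text dictionary. The figures themselves are not available here, so the following is only a guess at what such code could look like. The file format (one word per line, `#` for comments), the helper names, and the sample words are assumptions, not the paper's.

```python
import os
import tempfile

# Manually curated stopwords, in the spirit of Figure 5; the words are illustrative.
MANUAL_STOPWORDS = {"will", "one", "us"}


def load_external_stopwords(path, encoding="utf-8"):
    """Read one stopword per line, skipping blank lines and '#' comments."""
    words = set()
    with open(path, encoding=encoding) as handle:
        for line in handle:
            word = line.strip().lower()
            if word and not word.startswith("#"):
                words.add(word)
    return words


def build_stopword_set(path, manual=MANUAL_STOPWORDS):
    """Return the union of the manual set and the external dictionary."""
    return set(manual) | load_external_stopwords(path)


if __name__ == "__main__":
    # Write a small stand-in dictionary file so the sketch runs on its own.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("# sample stopword dictionary\nThe\nand\n\nof\n")
    try:
        print(sorted(build_stopword_set(tmp.name)))
    finally:
        os.remove(tmp.name)
```

Keeping the dictionary in a separate file means it can be extended without touching the analysis code. A set built this way could then be supplied through the `stopwords` parameter of `WordCloud` in the `wordcloud` package.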

