[Paper] A Study on the Use of Stopword Corpus for Cleansing Unstructured Text Data

Won-Jo Lee (이원조)

doi:10.17703/jcct.2022.8.6.891

A Study on the Use of Stopword Corpus for Cleansing Unstructured Text Data
(Korean title: 비정형 텍스트 데이터 정제를 위한 불용어 코퍼스의 활용에 관한 연구)

Journal of the Convergence on Culture Technology (JCCT) = 문화기술의 융합, v.8 no.6, 2022, pp. 891-897

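Won-Jo Lee (Department of Industrial Management Engineering, Ulsan College (Department of Computer Science, University of Ulsan; School of Computer and IT, Ulsan College))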

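Abstract (Korean, translated)

In big data analysis, raw text data mostly exists in a variety of unstructured forms, so it must pass through heuristic pre-processing cleansing and computer-based post-processing cleansing before it becomes structured data that can be analyzed. Accordingly, in order to apply the word cloud of the R program, one of the text data analysis techniques, this study cleanses unnecessary elements from the collected raw data in pre-processing and removes stopwords in post-processing. It then presents a case study of word cloud analysis, which counts how often each word occurs and displays the high-frequency words as key issues. To improve on the problems of the "nested stopword source code" method, the existing way of handling stopwords in R word clouds, the study proposes the use of a "general stopword corpus" and a "user-defined stopword corpus". Through case analysis it compares and verifies the advantages and disadvantages of the proposed "unstructured data cleansing process model", and it presents the practical usefulness of word cloud visualization analysis based on the "proposed external corpus cleansing technique".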

Abstract

In big data analysis, raw text data mostly exists in various unstructured forms, so it becomes structured, analyzable data only after heuristic pre-processing cleansing and computer-based post-processing cleansing. In this study, in order to apply the word cloud of the R program, one of the text data analysis techniques, unnecessary elements are cleansed from the collected raw data in pre-processing and stopwords are removed in post-processing. A case study of word cloud analysis is then conducted, which calculates how often each word occurs and presents the high-frequency words as key issues. To improve on the problems of the "nested stopword source code" method, the existing stopword processing method for R word clouds, the study proposes the use of a "general stopword corpus" and a "user-defined stopword corpus". Through case analysis, the advantages and disadvantages of the proposed "unstructured data cleansing process model" are compared and verified, and the practical usefulness of word cloud visualization analysis using the "proposed external corpus cleansing technique" is presented.
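The cleansing flow described in the abstract can be sketched roughly as follows. This is a minimal illustration in Python, not the R workflow used in the paper, and it rests on several assumptions: "nested stopword source code" is read here as stopwords written directly into the analysis script, the two corpus file names (`general_stopwords.txt`, `user_stopwords.txt`) and all function names are hypothetical, and the regular-expression tokenizer handles English text only (Korean text would need a morphological analyzer instead).

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def load_stopwords(path):
    """Read one stopword per line from an external corpus file.

    Blank lines and lines starting with '#' are skipped.
    """
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words


def tokenize(text):
    """Pre-processing: lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text, stopwords):
    """Post-processing: drop stopwords, then count the remaining words."""
    return Counter(tok for tok in tokenize(text) if tok not in stopwords)


def demo():
    """Build two small hypothetical corpora on disk and run the pipeline."""
    raw = ("The data is big. The data is unstructured, "
           "and the text data needs cleansing!")
    with tempfile.TemporaryDirectory() as tmp:
        general = Path(tmp) / "general_stopwords.txt"   # hypothetical name
        user = Path(tmp) / "user_stopwords.txt"         # hypothetical name
        general.write_text("the\nis\nand\n", encoding="utf-8")
        user.write_text("# domain-specific terms\nneeds\n", encoding="utf-8")
        # The general and user-defined corpora are merged into one set.
        stopwords = load_stopwords(general) | load_stopwords(user)
    return word_frequencies(raw, stopwords)


if __name__ == "__main__":
    # The frequency table is what a word cloud renderer would consume.
    print(demo().most_common(3))
```

Keeping the stopwords in external files means the word list can be extended without editing the analysis code, which appears to be the motivation behind the external corpus approach the abstract proposes.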

