[Paper] WCTT: Web Crawling System based on HTML Document Formalization

J. H. Kim (김진환); E. G. Kim (김은경)

doi:10.6109/jkiice.2022.26.4.495

Journal of the Korea Institute of Information and Communication Engineering (한국정보통신학회논문지), vol. 26, no. 4, 2022, pp. 495-502

J. H. Kim (Department of Computer Science & Engineering, Korea University of Technology and Education), E. G. Kim (School of Computer Science & Engineering, Korea University of Technology and Education)

Abstract

Web crawlers commonly used today to collect body text from the web are hard to maintain and extend, because researchers must analyze the tags and styles of HTML documents themselves and then implement separate collection logic for each collection channel. To solve this problem, a web crawler should be able to formalize HTML documents with different structures into a single common structure before collecting the body text. In this paper, we design and implement WCTT (Web Crawling system based on Tag path and Text appearance frequency), a web crawling system that formalizes HTML documents based on tag paths and text appearance frequency and collects body text with a single collection logic. Because WCTT uses the same logic for every collection channel, it is easy to maintain and new collection channels are easy to add. It also provides a preprocessing function that removes stopwords and extracts only nouns, for uses such as keyword network analysis.
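
To make the idea of a single, channel-independent collection logic more concrete, here is a minimal Python sketch. It is not the WCTT algorithm: the abstract does not give the scoring details, so the heuristic here (group visible text by its tag path, then pick the path that carries the most characters of text) is an assumption chosen for illustration. The names `TagPathCollector` and `extract_main_text` are hypothetical, and only the Python standard library is used.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the path.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text content is not visible body text.
SKIP_TAGS = {"script", "style"}


class TagPathCollector(HTMLParser):
    """Groups visible text by the tag path (e.g. 'html/body/div/p') leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []                        # currently open tags, outermost first
        self.text_by_path = defaultdict(list)  # tag path -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched end tags are ignored.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP_TAGS.intersection(self.stack):
            self.text_by_path["/".join(self.stack)].append(text)


def extract_main_text(html):
    """Return (tag_path, text) for the path with the most text (assumed heuristic)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    groups = collector.text_by_path
    if not groups:
        return "", ""
    best_path = max(groups, key=lambda path: sum(len(t) for t in groups[path]))
    return best_path, "\n".join(groups[best_path])


if __name__ == "__main__":
    sample = (
        "<html><body>"
        "<div><a>Home</a><a>Login</a></div>"
        "<div><p>First paragraph of the article body.</p>"
        "<p>Second paragraph with more text.</p></div>"
        "<script>var ignored = 'script text is not body text';</script>"
        "</body></html>"
    )
    path, text = extract_main_text(sample)
    print(path)
    print(text)
```

Because the function only looks at tag paths and text volume, the same call can be applied to pages from different sites without per-site selectors, which is the maintenance benefit the abstract describes. A real system would need a more robust score than raw character counts, for example to handle long navigation menus or comment sections.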

Tables/Figures (9)

Fig. 1 Example of web page converted to structured data
Fig. 2 WCTT structure
Table 1 Main attributes of DCI
Table 2 Implementation environment of WCTT
Fig. 3 User interface to register a collection task
Fig. 4 User interface to view a list of collection tasks
Fig. 5 User interface for a detailed view of a collection task
Fig. 6 User interface to preprocess text
Fig. 7 Example of preprocessed text
