[논문]아파치 스파크 기반 검색엔진의 설계 및 구현

박기성; 최재현; 김종배; 박제원

doi:10.6109/jkiice.2017.21.1.17

초록
AI-Helper

최근 데이터의 활용가치가 높아지면서 데이터에 관한 연구가 활발히 진행되고 있다. 데이터의 수집, 저장, 활용을 위한 대표적인 프로그램으로 웹 크롤러, 데이터베이스, 분산처리 등이 있으며, 최근에는 웹 크롤러가 다양한 분야에 활용할 수 있는 유용성으로 인해 크게 각광받고 있는 실정이다. 웹 크롤러란 자동화된 방법으로 웹서버를 순회하여 웹 페이지를 분석하고 URL을 수집하는 도구라고 정의할 수 있다. 인터넷 사용량의 증가로 매일 대량으로 생성되는 웹 페이지의 처리를 위해 하둡의 맵리듀스를 기반으로 하는 분산 웹 크롤러가 많이 사용되고 있다. 그러나 맵리듀스는 사용이 어렵고 성능에 제약이 있는 단점이 있다. 이러한 맵리듀스의 한계를 보완하여 제시된 인메모리 기반 연산 플랫폼인 아파치 스파크가 그 대안이 되고 있다. 웹 크롤러의 주요용도 중 하나인 검색엔진은 웹 크롤러로 수집한 정보 중 특정 검색어에 맞는 결과를 보여준다. 검색엔진을 기존 맵리듀스 기반의 웹 크롤러 대신 스파크 기반 웹 크롤러로 구현할 경우 더욱 빠른 데이터 수집이 가능할 것이다.

Abstract ▼ AI-Helper

Recently, a study on data has been actively conducted because the value of the data has become more useful. Web crawler that is program of data collection recently spotlighted because it can take advantage of the various fields. Web crawler can be defined as a tool to analyze the web pages and colle...

Recently, a study on data has been actively conducted because the value of the data has become more useful. Web crawler that is program of data collection recently spotlighted because it can take advantage of the various fields. Web crawler can be defined as a tool to analyze the web pages and collects the URL by traversing the web server in an automated manner. For the treatment of Big-data, distributed Web crawler is widely used which is based on the Hadoop MapReduce. But, it is difficult to use and has constraints on the performance. Apache spark that is the In-memory computing platform is an alternative to MapReduce. The search engine which is one of the main purposes of web crawler displays the information you search by keyword gathered by web crawler. If search engines implement a spark-based web crawler instead of traditional MapReduce-based web crawler, it would be a more rapid data collection.

주제어

질의응답

핵심어	질문	논문에서 추출한 답변
	하둡이란?	빅데이터를 다루는 가장 대표적인 도구는 아파치 하둡(Apache hadoop)의 맵리듀스(MapReduce)이다. 하둡은 클러스터 컴퓨팅 프레임워크로써, 저렴한 서버를 여러 대 연결하여 데이터 처리 성능을 높이는 기술이다. 맵리듀스는 하둡의 병렬 데이터 처리 프로그래밍 모델이다.
	셀레늄이 지원하는 것에는 어떤 것들이 있는가?	셀레늄은 자바스크립트 엔진을 탑재하고 있어 웹페이지 내에 있는 자바스크립트를 연산할 수 있다. 기본적으로 파이어폭스의 플러그인으로 제공되지만 IE, 크롬, 사파리, 오페라 등 다른 브라우저도 지원한다. 또한 파이썬 뿐만 아니라 다른 개발언어들도 지원한다. 셀레늄은 테스트 코드의 실행으로 브라우저에서의 액션을테스트 할 수 있는 도구로써, 이를 이용하면 자바 스크립트 페이지를 HTML 페이지로 변환하는 것이 가능하다.
	맵리듀스의 문제점은?	현재 많은 분산 처리 시스템이 맵리듀스를 기반으로 개발되어 사용되고 있다. 그러나 맵리듀스는 코드가 복잡하여 사용이 어렵고, 고정된 단일 데이터 흐름을 가지기 때문에 복잡한 연산이 어렵다. 또한 배치 방식으로 데이터를 처리하기 때문에 용도에 제약이 많으며, 이후 나온 기술들에 비해 성능이 낮고 속도가 느리다.

참고문헌 (17)

Wikipedia. Web Crawler [Internet] Available: https://en.wikipedia.org/wiki/Web_crawler.
H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark, 1st ed. Sebastopol, CA: O'Reilly Media, pp.1-9, 2015.
F. Pant, P. Srinivasn, F. Menczer, "Crawling the Web" in Web Dynamics, 1st ed. Berlin, Germany: Springer-Verlag, pp.153-177, 2003.
M. S. Ahuja , J. Singh, and B. Varnica. "Web Crawler: Extracting the Web Data," International Journal of Computer Trends and Technology(IJCTT), vol. 13, no. 3, pp.132-137, Jul. 2014.
Pycon. Web Scraper in 30 Minutes [Online]. Available: https://www.pycon.kr/2014/program/15.
H. C. Kim and S. H. Chae. "Design and Implementation of a High Performance Web Crawler," Journal of Digital Contents Society, vol. 4, no. 2, pp.127-137, Dec. 2003.
D. M. Seo and H. M. Jung. "Intelligent Web Crawler for Supporting Big Data Analysis Services," The Journal of the Korea Contents Association, vol. 13, no. 12, pp.575-584, Dec. 2013.

원문보기 상세보기
K. Y. Kim, W. G. Lee, H. M. Yoon, S. H. Shin, and M. H. Lee. "Development of Web Crawler for Archiving Web Resources," The Journal of the Korea Contents Association, vol. 11, no. 9, pp.9-16, Sep. 2011.
S. H. Hong, "An Implementation of Smart Price Tracker System Using Web Crawling," M.S. Thesis, Seoul National University of Science and Technology, Seoul, Korea, 2015.
V. K. Vavilapalli, et al., "Apache hadoop yarn: Yet another resource negotiator," in ACM Proceedings of the 4th annual Symposium on Cloud Computing, Santa Clara: CA, 2013.
Y. K. Lee, "The Comparison Between Hadoop MapReduce and Spark Device's Machine Learning Performance," M.S. Thesis, Soongsil University, Seoul, Korea, 2015.
B. S. Kim, "Performance Evaluation of HDFS Based SQLOn-Hadoop," M.S. Thesis, Chungbuk National University, Cheongju, Korea, 2015.
J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp.107-113, Jan. 2008.

상세보기
USCDataScience. Sparkler [Internet]. Available: https://github.com/USCDataScience/sparkler.
H. O. Song, A. Y. Kim, and H. K. Jung. "Implement on Search Machine using Open Source Framework," Journal of the Korea Institute of Information and Communication Engineering, vol. 19, no. 3, pp.552-557, Mar. 2015.

원문보기 상세보기
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. "Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX conference on networked systems design and implementation, San Jose: CA, pp.2-2, 2012.
C. Klaussne, J. Nioch. (2013, September). Nutch fight! 1.7 vs 2.2.1 [Internet]. Available: http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html.

내보내기 구분	파일저장 인쇄 메일전송
구성항목	기본정보 상세정보 관리번호, 논문명, 저널/프로시딩명, 저자 , 발행년, 권, 호, 시작페이지, 끝페이지, 발행기관 관리번호, 논문명, 대등논문명, 저자 , 저널/프로시딩명, 발행기관, 발행년, 발행언어, 권, 호, 시작페이지, 끝페이지, ISBN, ISSN, 주제분야, 키워드, 초록(한글), 초록(영문), 저자(소속기관)
저장형식	Text(ASCII format) Excel format RefWorks Direct Export RIS format (for Reference Manager, ProCite, EndNote), Scholar's Aids, Mendeley
메일정보	받는사람 (필수) @ 보내는사람 (선택) @ 제목 내용 KISTI 검색결과 이메일 서비스
안내	총 건의 자료가 검색되었습니다. 다운받으실 자료의 인덱스를 입력하세요. (1-10,000) 검색결과의 순서대로 최대 10,000건 까지 다운로드가 가능합니다. 데이타가 많을 경우 속도가 느려질 수 있습니다.(최대 2~3분 소요) 다운로드 파일은 UTF-8 형태로 저장됩니다. 파일의 내용이 제대로 보이지 않을실 때는 웹브라우저 상단의 보기 -> 인코딩 -> 자동선택 여부를 확인하십시오. ~ Text(ASCII format) Excel format

연합인증

아파치 스파크 기반 검색엔진의 설계 및 구현
Design and Implementation of a Search Engine based on Apache Spark 원문보기

초록
AI-Helper

Abstract ▼ AI-Helper

주제어

질의응답

참고문헌 (17)

이 논문을 인용한 문헌

저자의 다른 논문 :

관련 콘텐츠

원문 보기

원문 URL 링크

오픈액세스(OA) 유형

이 논문과 함께 이용한 콘텐츠

AI-Helper ※ AI-Helper는 오픈소스 모델을 사용합니다.

선택된 텍스트

연합인증

아파치 스파크 기반 검색엔진의 설계 및 구현 Design and Implementation of a Search Engine based on Apache Spark 원문보기

초록 용어보기논문에서 용어와 풀이말을 자동 추출한 결과로, 시범 서비스 중입니다. AI-Helper

Abstract ▼ AI-Helper

주제어

질의응답

참고문헌 (17)

이 논문을 인용한 문헌

저자의 다른 논문 :

최재현 (24) 김종배 (30) 박제원 (21)

관련 콘텐츠

원문 보기

원문 URL 링크

오픈액세스(OA) 유형

이 논문과 함께 이용한 콘텐츠

AI-Helper ※ AI-Helper는 오픈소스 모델을 사용합니다.

선택된 텍스트

아파치 스파크 기반 검색엔진의 설계 및 구현
Design and Implementation of a Search Engine based on Apache Spark 원문보기

초록
AI-Helper