Yin, Fulian (Communication University of China, College of Information Engineering, Beijing, China)
He, Xiating (Communication University of China, College of Information Engineering, Beijing, China)
Liu, Zhixin (Communication University of China, College of Information Engineering, Beijing, China)
This paper addresses two problems: the semi-structured information on video website pages is complex and underused, and single-machine crawlers collect data inefficiently. To solve them, the paper proposes a Scrapy-based distributed crawler system for crawling semi-structured information at high speed. The system extends the traditional single-machine crawler with a distributed scheme: the Scrapy-Redis distributed component and a Redis database are introduced into the Scrapy framework, and a strategy for crawling semi-structured information and storing it in a standardized form is defined. The system was verified by crawling TV drama information from the video sites Youku, SOHU, Tencent, and iQIYI. The experimental results show that the crawling speed of the distributed crawler increased by 84.53%, 88.95%, 93.05%, and 100%, respectively, compared with that of the single-machine crawler.
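The core idea of the distributed scheme, several crawler nodes drawing from one shared request queue with shared duplicate filtering, can be illustrated with a minimal sketch. This is not the authors' implementation: the in-memory `SharedScheduler` only stands in for the Redis-backed queue and fingerprint set, and the class name, node names, and the `SITE` link graph are hypothetical.

```python
import hashlib
from collections import deque


class SharedScheduler:
    """In-memory stand-in (assumption) for a Redis-backed queue and dedup set."""

    def __init__(self):
        self.queue = deque()  # plays the role of a Redis list of pending URLs
        self.seen = set()     # plays the role of a Redis set of fingerprints

    @staticmethod
    def fingerprint(url):
        # Hash the URL so that the dedup set stores fixed-size keys.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def push(self, url):
        """Enqueue url unless its fingerprint was already seen."""
        fp = self.fingerprint(url)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.queue.append(url)
        return True

    def pop(self):
        """Return the next pending URL, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None


def run_workers(scheduler, worker_names, fetch_links):
    """Let workers take turns on the shared queue; returns {worker: [urls]}."""
    crawled = {name: [] for name in worker_names}
    turn = 0
    while True:
        url = scheduler.pop()
        if url is None:
            break
        name = worker_names[turn % len(worker_names)]
        crawled[name].append(url)
        # Links discovered by any worker go back into the shared queue.
        for link in fetch_links(url):
            scheduler.push(link)
        turn += 1
    return crawled


# Hypothetical link graph standing in for a video site's pages.
SITE = {
    "/list": ["/drama/1", "/drama/2", "/drama/3"],
    "/drama/1": ["/list", "/drama/2"],
    "/drama/2": ["/drama/3"],
    "/drama/3": [],
}

scheduler = SharedScheduler()
scheduler.push("/list")
result = run_workers(scheduler, ["node-a", "node-b"], lambda u: SITE.get(u, []))
print(result)
```

Because every node consults the same fingerprint set, each page is crawled once no matter which node discovers it. In an actual Scrapy project, this sharing is typically enabled through Scrapy-Redis settings such as `SCHEDULER = "scrapy_redis.scheduler.Scheduler"` and `DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"`; whether the paper uses exactly this configuration is not stated in the abstract.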