Adaptive web crawling using a statistical model
원문보기
IPC분류정보
국가/구분
United States(US) Patent
등록
국제특허분류(IPC7판)
G06F-007/00
G06F-015/16
G06F-017/00
출원번호
US-0022054
(2004-12-22)
등록번호
US-7328401
(2008-02-05)
발명자
/ 주소
Obata,Kenji C
Meyerzon,Dmitriy
출원인 / 주소
Microsoft Corporation
대리인 / 주소
Christensen O'Connor Johnson Kindness PLLC
인용정보
피인용 횟수 :
69인용 특허 :
18
초록▼
A computer based system and method of retrieving information pertaining to documents on a computer network is disclosed. The method includes selecting a set of documents to be accessed during a Web crawl by utilizing a statistical model to determine which previously retrieved documents are most like
A computer based system and method of retrieving information pertaining to documents on a computer network is disclosed. The method includes selecting a set of documents to be accessed during a Web crawl by utilizing a statistical model to determine which previously retrieved documents are most likely to have changed since last accessed. The statistical model is continuously improving its accuracy by training internal probability distributions to reflect the actual experience with change rate patterns of the documents accessed. The decision made whether to access the document is based on the probability of change compared against a desired synchronization level, random selections, maximum limits on the amount of time since the document was last accessed, and other criterion. Once the decision to access is made, the document is checked for changes and this information is used to train the statistical model.
대표청구항▼
The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows: 1. A computer-implemented method for selectively accessing a document during a current crawl of a server computer, the document being identified by a document address specification, the d
The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows: 1. A computer-implemented method for selectively accessing a document during a current crawl of a server computer, the document being identified by a document address specification, the document having been retrieved during a previous crawl, the method comprising: (a) determining whether to access the document during the current crawl with the aid of a probabilistic model that is based on the probability that the document has changed since the previous crawl, wherein determining whether to access the document with the aid of a probabilistic model comprises computing a probability that the document has changed since the document was retrieved during the previous crawl, and wherein computing the probability that the document has changed comprises: (i) calculating, based on the experience with the document during a plurality of previous crawls, a discrete random variable distribution that includes a plurality of training probabilities, wherein the training probabilities are calculated using a Poisson process, the Poisson process including a Poisson equation (e^(-r*dt)) and a complementary Poisson equation (1-e^(r*dt)); (ii) selecting an active probability indicative of a proportion of documents in a plurality of documents that are changing at various change rates, the plurality of documents including the document; (iii) training the active probability to reflect experience with the document during the plurality of previous crawls; and (iv) using the trained active probability to compute the probability that the document has changed; and b) accessing the document if the determination produces an instruction indicative that the document at the document address specification should be accessed during the current crawl. 2. The method of claim 1, further comprising: selecting the probability that the document has changed from the previous crawl as the active probability in the current crawl; and repeating the method of claim 1 for the current crawl. 3. The method of claim 1, wherein training the active probability includes multiplying the active probability indicative of a change in the document by a training probability calculated using the probabilistic model. 4. The method of claim 1, further comprising: training a document probability distribution corresponding to the document address specification to reflect experience with the document during the plurality of previous crawls, the document probability distribution including a plurality of probabilities; determining from the document probability distribution a probability that the document has changed; and making a determination of whether to access the document in the current crawl based on the probability that the document has changed. 5. The method of claim 4, further comprising: calculating, based on the experience with the document during a plurality of previous crawls, a discrete random variable distribution that includes a plurality of training probabilities; and multiplying each probability in the document probability distribution by a corresponding training probability from the discrete random variable distribution. 6. The method of claim 1, wherein the experience with the document during the plurality of previous crawls is derived from historical information associated with the document address specification. 7. A computer-readable medium having computer-executable instructions for retrieving one document in a plurality of documents from a remote server, which when executed comprise: maintaining historical information associated with changes to the one document; initiating a crawl procedure for retrieving particular documents in the plurality of documents; and determining whether to access the one document from the remote server based on a probabilistic analysis of the historical information associated with the changes to the one document, said probabilistic analysis of the historical information being based on the probability that the one document has changed since a previous crawl, wherein the probabilistic analysis comprises computing a probability that the one document has changed since the one document was last retrieved from the remote server, and wherein computing the probability that the one document has changed since the one document was last retrieved from the remote server comprises, beginning with a probability that a pre-defined proportion of documents in the plurality of documents has changed, training the probability that the pre-defined proportion of documents has changed using the historical information associated with the one document to achieve the probability that the one document has changed, wherein computing the probability that the one document has changed also comprises: calculating, based on the experience with the document during a plurality of previous crawls, a discrete random variable distribution that includes a plurality of training probabilities, wherein the training probabilities are calculated using a Poisson process, the Poisson process including a Poisson equation (e^(-r*dt)) and a complementary Poisson equation (1-e^(-r*dt)). 8. The computer-readable medium of claim 7, further comprising making a random decision to retrieve the one document wherein the random decision is biased by the probability that the one document has changed. 9. The computer-readable medium of claim 8, wherein the random decision is further biased by a synchronization level configured to influence the random decision based on a predetermined degree of tolerance for not retrieving the one document if the document is likely to have changed. 10. The computer-readable medium of claim 8, wherein the random decision is made by a software routine adapted to simulate a flip of a coin. 11. The computer-readable medium of claim 7, wherein: the historical information associated with changes to the one document includes a time stamp for the one document, the time stamp being indicative of the time that the one document was last modified when the one document was last retrieved from the remote server; and the probabilistic analysis includes a comparison of the time stamp included in the historical information with another time stamp associated with the one document stored on the remote server. 12. The computer-readable medium of claim 11, further comprising: if the time stamp included in the historical information does not match the other time stamp associated with the one document stored on the remote server, identifying the one document for retrieval during the crawl procedure. 13. The computer-readable medium of claim 7, wherein: the historical information associated with changes to the one document includes a hash value associated with the one document, the hash value being a representation of the one document; and the probabilistic analysis includes a comparison of the hash value included in the historical information with another hash value calculated from information retrieved from the one document stored on the remote server. 14. The computer-readable medium of claim 13, if the hash value included in the historical information does not match the other hash value associated with the one document stored on the remote server, identifying the one document for retrieval during the crawl procedure.
연구과제 타임라인
LOADING...
LOADING...
LOADING...
LOADING...
LOADING...
이 특허에 인용된 특허 (18)
Peterson, Leonard J.; Freedman, Steven J.; Partovi, Hadi; Endres, Raymond E.; D'Souza, David J.; Ellerman, Erik Castedo; Jiggins, Julian P., Client-side system for scheduling delivery of web content and locally managing the web content.
Eichstaedt Matthias ; Ford Daniel Alexander ; Lehman Tobin Jon ; Lu Qi ; Teng Shang-Hua, Collaborative team crawling:Large scale information gathering over the internet.
Douglass R. Judd ; Paul Gauthier ; J. Eric Baldeschwieler, Method and apparatus for retrieving documents based on information other than document content.
Monier Louis M., System for adding a new entry to a web page table upon receiving a web page including a link to another web page not having a corresponding entry in the web page table.
Najork Marc Alexander ; Heydon Clark Allan ; Wiener Janet Lynn, Web crawler system using plurality of parallel priority level queues having distinct associated download priority levels for prioritizing document downloading and maintaining document freshness.
Cao, Pei; Eiron, Nadav; Mazumdar, Soham; Patterson, Anna L.; Power, Russell; Zunger, Yonatan, Index server architecture using tiered and sharded phrase posting lists.
Cao, Pei; Eiron, Nadav; Mazumdar, Soham; Patterson, Anna L.; Power, Russell; Zunger, Yonatan, Index server architecture using tiered and sharded phrase posting lists.
Cao, Pei; Eiron, Nadav; Mazumdar, Soham; Patterson, Anna L.; Power, Russell; Zunger, Yonatan, Index server architecture using tiered and sharded phrase posting lists.
Cao, Pei; Eiron, Nadav; Mazumdar, Soham; Patterson, Anna L.; Power, Russell; Zunger, Yonatan, Index server architecture using tiered and sharded phrase posting lists.
Cao, Pei; Eiron, Nadav; Mazumdar, Soham; Patterson, Anna L.; Power, Russell; Zunger, Yonatan, Index server architecture using tiered and sharded phrase posting lists.
Dengler, Patrick M.; Krishnan, Arvind K.; Singh, Jagdish; Sanchez, Lawrence M.; Shankar, Sai; Chittamuru, Satish Kumar; Pekic, Zoltan; Mondal, Nabarun; Kumar, Namendra; i Dalfó, Ricard Roma, Metadata driven user interface.
Boyan, Justin; McDonald, Glenn; Benthall, Margaret; Molnar, Ray, Methods and systems to train models to extract and integrate information from data sources.
Fredricksen, Eric Russell; Schneider, Fritz John; Dean, Jeffrey Adgate; Ghemawat, Sanjay; Provos, Niels; Harik, Georges, System and method of accessing a document efficiently through multi-tier web caching.
Fredricksen, Eric Russell; Schneider, Fritz John; Dean, Jeffrey Adgate; Ghemawat, Sanjay; Provos, Niels; Harik, Georges, System and method of accessing a document efficiently through multi-tier web caching.
Fredrickson, Eric Russell; Feng, Hanping; Kataru, Naga Sridhar; Harik, Georges, System and method of accessing a document efficiently through multi-tier web caching.
Bar Yossef, Ziv; Kanungo, Tapas; Krauthgamer, Robert, System, method, and service for using a focused random walk to produce samples on a topic from a collection of hyper-linked pages.
※ AI-Helper는 부적절한 답변을 할 수 있습니다.