[Patent] Adaptive web crawling using a statistical model

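IPC classification information
Country / Type	United States (US) Patent, Granted
International Patent Classification (IPC 7th edition)	G06F-007/00 G06F-015/16 G06F-017/00
Application number	US-0022054 (2004-12-22)
Registration number	US-7328401 (2008-02-05)
Inventor / Address	Obata, Kenji C; Meyerzon, Dmitriy
Applicant / Address	Microsoft Corporation
Agent / Address	Christensen O'Connor Johnson Kindness PLLC
Citation information	Cited by: 69; Patents cited: 18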

Abstract

A computer-based system and method of retrieving information pertaining to documents on a computer network is disclosed. The method includes selecting a set of documents to be accessed during a Web crawl by utilizing a statistical model to determine which previously retrieved documents are most likely to have changed since they were last accessed. The statistical model continuously improves its accuracy by training internal probability distributions to reflect the actual experience with change rate patterns of the documents accessed. The decision whether to access a document is based on the probability of change compared against a desired synchronization level, random selections, maximum limits on the amount of time since the document was last accessed, and other criteria. Once the decision to access is made, the document is checked for changes, and this information is used to train the statistical model.
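The access decision described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the class `CrawlPolicy`, the function `should_access`, the parameter names, the default values, and the order in which the criteria are checked are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class CrawlPolicy:
    """Hypothetical tuning knobs; names and defaults are illustrative only."""
    sync_level: float = 0.5      # assumed threshold on the probability of change
    max_age_days: float = 30.0   # assumed hard limit on time since last access
    explore_rate: float = 0.05   # assumed chance of a purely random selection


def should_access(p_changed: float, age_days: float,
                  policy: CrawlPolicy, rng: random.Random) -> bool:
    """Decide whether to re-access a previously retrieved document."""
    if age_days >= policy.max_age_days:
        return True   # maximum limit on time since last access reached
    if p_changed >= policy.sync_level:
        return True   # probability of change meets the synchronization level
    # Otherwise fall back to an occasional random selection.
    return rng.random() < policy.explore_rate


if __name__ == "__main__":
    rng = random.Random(0)
    print(should_access(0.9, 1.0, CrawlPolicy(), rng))    # likely changed
    print(should_access(0.1, 45.0, CrawlPolicy(), rng))   # too old to skip
```

Whatever the outcome, the abstract states that the result of each access (changed or not) is fed back to train the statistical model.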

Representative claims

The embodiments of the invention in which an exclusive property or privilege is claimed are defined as follows:

1. A computer-implemented method for selectively accessing a document during a current crawl of a server computer, the document being identified by a document address specification, the document having been retrieved during a previous crawl, the method comprising:
   (a) determining whether to access the document during the current crawl with the aid of a probabilistic model that is based on the probability that the document has changed since the previous crawl, wherein determining whether to access the document with the aid of a probabilistic model comprises computing a probability that the document has changed since the document was retrieved during the previous crawl, and wherein computing the probability that the document has changed comprises:
      (i) calculating, based on the experience with the document during a plurality of previous crawls, a discrete random variable distribution that includes a plurality of training probabilities, wherein the training probabilities are calculated using a Poisson process, the Poisson process including a Poisson equation (e^(-r*dt)) and a complementary Poisson equation (1-e^(-r*dt));
      (ii) selecting an active probability indicative of a proportion of documents in a plurality of documents that are changing at various change rates, the plurality of documents including the document;
      (iii) training the active probability to reflect experience with the document during the plurality of previous crawls; and
      (iv) using the trained active probability to compute the probability that the document has changed; and
   (b) accessing the document if the determination produces an instruction indicative that the document at the document address specification should be accessed during the current crawl.

2. The method of claim 1, further comprising: selecting the probability that the document has changed from the previous crawl as the active probability in the current crawl; and repeating the method of claim 1 for the current crawl.

3. The method of claim 1, wherein training the active probability includes multiplying the active probability indicative of a change in the document by a training probability calculated using the probabilistic model.

4. The method of claim 1, further comprising: training a document probability distribution corresponding to the document address specification to reflect experience with the document during the plurality of previous crawls, the document probability distribution including a plurality of probabilities; determining from the document probability distribution a probability that the document has changed; and making a determination of whether to access the document in the current crawl based on the probability that the document has changed.

5. The method of claim 4, further comprising: calculating, based on the experience with the document during a plurality of previous crawls, a discrete random variable distribution that includes a plurality of training probabilities; and multiplying each probability in the document probability distribution by a corresponding training probability from the discrete random variable distribution.

6. The method of claim 1, wherein the experience with the document during the plurality of previous crawls is derived from historical information associated with the document address specification.

7. A computer-readable medium having computer-executable instructions for retrieving one document in a plurality of documents from a remote server, which when executed comprise:
   maintaining historical information associated with changes to the one document;
   initiating a crawl procedure for retrieving particular documents in the plurality of documents; and
   determining whether to access the one document from the remote server based on a probabilistic analysis of the historical information associated with the changes to the one document, said probabilistic analysis of the historical information being based on the probability that the one document has changed since a previous crawl, wherein the probabilistic analysis comprises computing a probability that the one document has changed since the one document was last retrieved from the remote server, and wherein computing the probability that the one document has changed since the one document was last retrieved from the remote server comprises, beginning with a probability that a pre-defined proportion of documents in the plurality of documents has changed, training the probability that the pre-defined proportion of documents has changed using the historical information associated with the one document to achieve the probability that the one document has changed, wherein computing the probability that the one document has changed also comprises: calculating, based on the experience with the document during a plurality of previous crawls, a discrete random variable distribution that includes a plurality of training probabilities, wherein the training probabilities are calculated using a Poisson process, the Poisson process including a Poisson equation (e^(-r*dt)) and a complementary Poisson equation (1-e^(-r*dt)).

8. The computer-readable medium of claim 7, further comprising making a random decision to retrieve the one document wherein the random decision is biased by the probability that the one document has changed.

9. The computer-readable medium of claim 8, wherein the random decision is further biased by a synchronization level configured to influence the random decision based on a predetermined degree of tolerance for not retrieving the one document if the document is likely to have changed.

10. The computer-readable medium of claim 8, wherein the random decision is made by a software routine adapted to simulate a flip of a coin.

11. The computer-readable medium of claim 7, wherein: the historical information associated with changes to the one document includes a time stamp for the one document, the time stamp being indicative of the time that the one document was last modified when the one document was last retrieved from the remote server; and the probabilistic analysis includes a comparison of the time stamp included in the historical information with another time stamp associated with the one document stored on the remote server.

12. The computer-readable medium of claim 11, further comprising: if the time stamp included in the historical information does not match the other time stamp associated with the one document stored on the remote server, identifying the one document for retrieval during the crawl procedure.

13. The computer-readable medium of claim 7, wherein: the historical information associated with changes to the one document includes a hash value associated with the one document, the hash value being a representation of the one document; and the probabilistic analysis includes a comparison of the hash value included in the historical information with another hash value calculated from information retrieved from the one document stored on the remote server.

14. The computer-readable medium of claim 13, further comprising: if the hash value included in the historical information does not match the other hash value associated with the one document stored on the remote server, identifying the one document for retrieval during the crawl procedure.
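The training step recited in claims 1, 3, 5 and 7 (multiplying each probability in a discrete distribution over change rates by a training probability derived from the Poisson equations) can be sketched as below. This is only an illustration under stated assumptions: the rate grid `RATES`, the function names `train` and `prob_changed`, the uniform starting distribution, and the renormalization step are not taken from the claims.

```python
import math

# Hypothetical grid of candidate change rates r (changes per day).
RATES = [0.01, 0.1, 1.0]


def train(dist, dt_days, changed):
    """Multiply each probability by its training probability, then renormalize.

    The training probability is e^(-r*dt) when no change was observed and
    1 - e^(-r*dt) when a change was observed. Renormalizing so the result
    sums to 1 is an assumption of this sketch.
    """
    weighted = []
    for rate, p in zip(RATES, dist):
        no_change = math.exp(-rate * dt_days)   # Poisson equation
        training_p = (1.0 - no_change) if changed else no_change
        weighted.append(p * training_p)
    total = sum(weighted)
    if total == 0.0:                            # e.g. dt_days == 0 with changed=True
        return list(dist)
    return [w / total for w in weighted]


def prob_changed(dist, dt_days):
    """Probability that the document changed at least once within dt_days."""
    return sum(p * (1.0 - math.exp(-r * dt_days)) for r, p in zip(RATES, dist))


if __name__ == "__main__":
    dist = [1.0 / len(RATES)] * len(RATES)      # assumed starting distribution
    for observed_change in (True, True, False):
        dist = train(dist, dt_days=1.0, changed=observed_change)
    print(dist, prob_changed(dist, dt_days=1.0))
```

Each observed change shifts weight toward the higher rates and each unchanged visit shifts it toward the lower ones, so the computed probability of change follows the document's crawl history.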

Patents cited by this patent (18)

Peterson, Leonard J.; Freedman, Steven J.; Partovi, Hadi; Endres, Raymond E.; D'Souza, David J.; Ellerman, Erik Castedo; Jiggins, Julian P., Client-side system for scheduling delivery of web content and locally managing the web content.
Eichstaedt Matthias; Ford Daniel Alexander; Lehman Tobin Jon; Lu Qi; Teng Shang-Hua, Collaborative team crawling: Large scale information gathering over the internet.
Narendran Balakrishnan; Rangarajan Sampath; Yajnik Shalini, Data distribution techniques for load-balanced fault-tolerant web access.
Houser Peter B. (Poway CA); Adler James M. (Ocean Beach CA), Electronic document verification system and method.
Douglass R. Judd; Paul Gauthier; J. Eric Baldeschwieler, Method and apparatus for retrieving documents based on information other than document content.
Katariya, Sanjeev; Jones, William P., Method and system for calculating phrase-document importance.
Meyerzon, Dmitriy; Shoroff, Srikanth; Terek, F. Soner; Norin, Scott, Method and system for detecting duplicate documents in web crawls.
Sanu Sankrant; Meyerzon Dmitriy, Method of web crawling utilizing address mapping.
Meyerzon, Dmitriy; Sanu, Sankrant, Method of web crawling utilizing crawl numbers.
Pirolli Peter L.; Pitkow James E., Prefetching and caching documents according to probability ranked need S list.
Marc Alexander Najork; Clark Allan Heydon, System and method for associating an extensible set of data with documents downloaded by a web crawler.
Soumen Chakrabarti; Byron Edward Dom; Martin Henk van den Berg, System and method for focussed web crawling.
Douglas M. Dillon, System and method for multicasting multimedia content.
Sundaresan, Neelakantan; Yi, Jeonghee, System and method for the automatic mining of new relationships.
Monier Louis M., System for adding a new entry to a web page table upon receiving a web page including a link to another web page not having a corresponding entry in the web page table.
Liddy Elizabeth D.; Yu Edmund Szu-Li, System for retrieving multimedia information from the internet using multiple evolving intelligent agents.
Najork Marc Alexander; Heydon Clark Allan; Wiener Janet Lynn, Web crawler system using plurality of parallel priority level queues having distinct associated download priority levels for prioritizing document downloading and maintaining document freshness.
Wiener, Janet L.; Stata, Raymond P.; Burrows, Michael, Web page connectivity server.

Patents citing this patent (69)

Sun, Walter; Li, Yipeng; Zhang, Xiao; Ahmed, Junaid, Adaptive crawl rates based on publication frequency.
Kumar, Mani; Kothari, Pankaj; Sahni, Saurabh, Adaptive weighted crawling of user activity feeds.
Milner, Marius C., Automatic proxy setting modification.
Milner, Marius C., Automatic proxy setting modification.
Petriuc, Mihai, Click distance determination.
Patterson, Anna Lynn, Detecting spam documents in a phrase based information retrieval system.
Tankovich, Vladimir; Meyerzon, Dmitriy; Poznanski, Victor, Detection of junk in search result ranking.
Kumar, Mani; Kothari, Pankaj; Sahni, Saurabh, Determining related keywords based on lifestream feeds.
Kumar, Mani; Kothari, Pankaj; Sahni, Saurabh, Determining related keywords based on lifestream feeds.
Tankovich, Vladimir; Meyerzon, Dmitriy; Taylor, Michael James, Document length as a static relevance feature for ranking search results.
Meyerzon, Dmitriy; Shnitko, Yauhen; Burges, Chris J. C.; Taylor, Michael James, Enterprise relevancy ranking using a neural network.
Liu, Jie; Nath, Suman; Lin, Xiaozhu, Executing a fast crawl over a computer-executable application.
Robertson, Stephen; Zaragoza, Hugo; Taylor, Michael; Larimore, Stefan Isbein; Petriuc, Mihai, Field weighting in text searching.
Kenig, Batya; Radchenko, Constantin; Shapiro, Eitan, Incremental crawling of multiple content providers using aggregation.
Kenig, Batya; Radchenko, Constantin; Shapiro, Eitan, Incremental crawling of multiple content providers using aggregation.
Cao, Pei; Eiron, Nadav; Mazumdar, Soham; Patterson, Anna L.; Power, Russell; Zunger, Yonatan, Index server architecture using tiered and sharded phrase posting lists.
Cao, Pei; Eiron, Nadav; Mazumdar, Soham; Patterson, Anna L.; Power, Russell; Zunger, Yonatan, Index server architecture using tiered and sharded phrase posting lists.
Cao, Pei; Eiron, Nadav; Mazumdar, Soham; Patterson, Anna L.; Power, Russell; Zunger, Yonatan, Index server architecture using tiered and sharded phrase posting lists.
Cao, Pei; Eiron, Nadav; Mazumdar, Soham; Patterson, Anna L.; Power, Russell; Zunger, Yonatan, Index server architecture using tiered and sharded phrase posting lists.
Cao, Pei; Eiron, Nadav; Mazumdar, Soham; Patterson, Anna L.; Power, Russell; Zunger, Yonatan, Index server architecture using tiered and sharded phrase posting lists.
Cao, Pei; Eiron, Nadav; Mazumdar, Soham; Patterson, Anna; Power, Russell; Zunger, Yonatan, Index server architecture using tiered and sharded phrase posting lists.
Fontoura, Marcus; Meredith, Daniel N.; Rohde, Douglas Lee Taylor; Palekar, Mahesh S.; Shankar, Asim; Baylor, Denis Murray; Rasscevskis, Zigmars; Csomai, Andras, Indexing system.
Fontoura, Marcus; Meredith, Daniel N.; Rohde, Douglas Lee Taylor; Palekar, Mahesh S.; Shankar, Asim; Baylor, Denis Murray; Rasscevskis, Zigmars; Csomai, Andras, Indexing system.
Patterson, Anna L, Information retrieval system for archiving multiple document versions.
Patterson, Anna L, Information retrieval system for archiving multiple document versions.
Patterson, Anna L., Information retrieval system for archiving multiple document versions.
Patterson, Anna Lynn, Information retrieval system for archiving multiple document versions.
Patterson, Anna L., Integrated external related phrase information into a phrase-based indexing information retrieval system.
Patterson, Anna Lynn, Integrating external related phrase information into a phrase-based indexing information retrieval system.
Alpert, Jesse L.; Tammana, Praveen K.; Kurzion, Yair, Managing URLs.
Alpert, Jesse L.; Tammana, Praveen K.; Kurzion, Yair, Managing URLs.
Alpert, Jesse L., Managing items in crawl schedule.
Dengler, Patrick M.; Krishnan, Arvind K.; Singh, Jagdish; Sanchez, Lawrence M.; Shankar, Sai; Chittamuru, Satish Kumar; Pekic, Zoltan; Mondal, Nabarun; Kumar, Namendra; i Dalfó, Ricard Roma, Metadata driven user interface.
Villadsen, Peter; Chen, Zhaoqi; Gottumukkala, Ramakanthachary S.; Calderon, Marcos, Metadata-based eventing supporting operations on data.
Boyan, Justin; McDonald, Glenn; Benthall, Margaret; Molnar, Ray, Methods and systems to train models to extract and integrate information from data sources.
Morris, Robert P., Methods, systems, and computer program products for characterizing links to resources not activated.
Patterson, Anna L., Multiple index based information retrieval system.
Patterson, Anna L., Multiple index based information retrieval system.
Patterson, Anna Lynn, Multiple index based information retrieval system.
Patterson, Anna L., Phase-based personalization of searches in an information retrieval system.
Mazumdar, Soham; Przebinda, Viktor; Zunger, Yonatan, Phrase extraction using subphrase scoring.
Mazumdar, Soham; Przebinda, Viktor; Zunger, Yonatan, Phrase extraction using subphrase scoring.
Patterson, Anna L., Phrase-based detection of duplicate documents in an information retrieval system.
Patterson, Anna L., Phrase-based detection of duplicate documents in an information retrieval system.
Patterson, Anna Lynn, Phrase-based detection of duplicate documents in an information retrieval system.
Patterson, Anna L., Phrase-based searching in an information retrieval system.
Patterson, Anna L., Phrase-based searching in an information retrieval system.
Kirshenbaum, Evan R.; Suermondt, Henri J.; Lillibridge, Mark David; Yuasa, Kei; Eshghi, Kave; Forman, George, Policy applicability determination.
Fredricksen, Eric Russell; Feng, Hanping; Kataru, Naga Sridhar; Harik, Georges, Prioritized preloading of documents to client.
Obata, Kenji; Meyerzon, Dmitriy, Proxy server using a statistical model.
Cao, Pei; Mazumdar, Soham, Query phrasification.
Cao, Pei; Mazumdar, Sohem, Query phrasification.
Meyerzon, Dmitriy; Zaragoza, Hugo, Ranking search results using biased click distance.
Meyerzon, Dmitriy; Li, Hang, Ranking search results using feature extraction.
Meyerzon, Dmitriy; Zaragoza, Hugo, Ranking search results using language types.
Poznanski, Victor; Wang, Oivind; Holm, Fredrik; Bodd, Nicolai; Tankovich, Vladimir; Meyerzon, Dmitriy, Re-ranking search results.
Blum, Stephen; Greene, Todd, Real-time distribution of messages via a network with multi-region replication in a hosted service environment.
Fredricksen, Eric Russell; Feng, Hanping; Kataru, Naga Sridhar; Harik, Georges, Refreshing cached documents and storing differential document content.
Auerbach, David B.; Alpert, Jesse L., Scheduling a recrawl.
Tankovich, Vladimir; Li, Hang; Meyerzon, Dmitriy; Xu, Jun, Search results ranking using editing distance and document information.
Meyerzon, Dmitriy; Zaragoza, Hugo, System and method for ranking search results using click distance.
Merrigan, Chadd Creighton; Peltonen, Kyle G.; Meyerzon, Dmitriy; Lee, David J., System and method for scoping searches using index keys.
Fredricksen, Eric Russell; Schneider, Fritz John; Dean, Jeffrey Adgate; Ghemawat, Sanjay; Provos, Niels; Harik, Georges, System and method of accessing a document efficiently through multi-tier web caching.
Fredricksen, Eric Russell; Schneider, Fritz John; Dean, Jeffrey Adgate; Ghemawat, Sanjay; Provos, Niels; Harik, Georges, System and method of accessing a document efficiently through multi-tier web caching.
Fredrickson, Eric Russell; Feng, Hanping; Kataru, Naga Sridhar; Harik, Georges, System and method of accessing a document efficiently through multi-tier web caching.
Bar Yossef, Ziv; Kanungo, Tapas; Krauthgamer, Robert, System, method, and service for using a focused random walk to produce samples on a topic from a collection of hyper-linked pages.
Eriksen, Bjorn Marius Aamodt; Laraki, Othman, Systems and methods for cache optimization.
Eriksen, Bjorn Marius Aamodt; Rennie, Jeffrey Glenn; Laraki, Othman, Systems and methods for client authentication.
Eriksen, Bjorn Marius Aamodt; Rennie, Jeffrey Glen; Laraki, Othman, Systems and methods for client cache awareness.
