Method and system for detecting duplicate documents in web crawls

IPC classification
Country / Type: United States (US) Patent, Granted
International Patent Classification (IPC, 7th edition)
  • G06F-017/00
  • G06F-017/30
Application number: US-0343511 (1999-06-30)
Inventors / Address
  • Meyerzon, Dmitriy
  • Shoroff, Srikanth
  • Terek, F. Soner
  • Norin, Scott
Applicant / Address
  • Microsoft Corporation
Agent / Address
    Woodcock Washburn LLP
Citation information: Cited by: 149; Patents cited: 3

Abstract

A Web crawler application takes advantage of a document store's ability to provide a content identifier (CID) having a value that is a unique function of the physical storage location of a data object or document, such as a Web page. In operation, the crawler first tries to fetch the CID for a docum
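The abstract is cut off after the CID fetch, so the following is only a minimal sketch of how a crawler might use such an identifier to skip duplicates, not the patented method. The names `crawl_with_cid_dedup`, `fetch_cid`, and `fetch_content` are hypothetical, and the choice to skip the content fetch when a CID has already been seen is an assumption made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def crawl_with_cid_dedup(
    urls: Iterable[str],
    fetch_cid: Callable[[str], Optional[str]],   # hypothetical: asks the store for a CID
    fetch_content: Callable[[str], str],         # hypothetical: downloads the document body
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Fetch each URL once per distinct CID.

    Returns (fetched, duplicates), where fetched maps URL -> content and
    duplicates lists (duplicate_url, first_url_with_same_cid) pairs.
    """
    first_url_by_cid: Dict[str, str] = {}
    fetched: Dict[str, str] = {}
    duplicates: List[Tuple[str, str]] = []

    for url in urls:
        cid = fetch_cid(url)  # try the cheap identifier lookup first
        if cid is not None and cid in first_url_by_cid:
            # Assumption: an already-seen CID means the same stored object,
            # so the content is not fetched again.
            duplicates.append((url, first_url_by_cid[cid]))
            continue
        if cid is not None:
            first_url_by_cid[cid] = url
        # No CID available, or a CID not seen before: fetch the document.
        fetched[url] = fetch_content(url)

    return fetched, duplicates
```

In this sketch a document whose store cannot supply a CID is always fetched, since there is nothing to compare against.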

Patents cited by this patent (3)

  1. Brown, Eric William; Prager, John Martin, Identifying duplicate documents from search results without comparing document content.
  2. Benson, Max L.; Shakib, Darren A., Single instance storage of information.
  3. Najork, Marc Alexander; Heydon, Clark Allan, System and method for associating an extensible set of data with documents downloaded by a web crawler.

Patents citing this patent (149)

  1. Obata, Kenji C; Meyerzon, Dmitriy, Adaptive web crawling using a statistical model.
  2. Suponau, Dzmitry; Girotto, Jay; Wu, Qiang; Wad, Rohit Vishwas; Liu, Yue, Aggregating data from difference sources.
  3. Suponau, Dzmitry; Girotto, Jay; Wu, Qiang; Wad, Rohit Vishwas; Liu, Yue, Aggregating data from different sources.
  4. Kopelman, Yaniv; Arad, Carmi; Bishara, Nafea, Apparatus and method for high speed flow classification.
  5. Gupta, Bhupesh, Apparatus and method of highlighting categorized web pages on a web server.
  6. Fontoura, Marcus Felipe; Neumann, Andreas; Rajagopalan, Sridhar; Shekita, Eugene J.; Zien, Jason Yeong, Architecture for an indexer.
  7. Fontoura, Marcus F.; Neumann, Andreas; Rajagopalan, Sridhar; Shekita, Eugene J.; Zien, Jason Yeong, Architecture for an indexer with fixed width sort and variable width sort.
  8. Zhu, Huican; Acharya, Anurag, Assigning document identification tags.
  9. Zhu, Huican; Acharya, Anurag, Assigning document identification tags.
  10. Shuster, Gary Stephen, Avoiding masked web page content indexing errors for search engines.
  11. Wideman, Roderick B., Classifying data for deduplication and storage.
  12. Petriuc, Mihai, Click distance determination.
  13. Diamond, Michael B.; White, Jonathan B., Content keys for authorizing access to content.
  14. Diamond, Michael B.; White, Jonathan B., Content server and method of storing content.
  15. Sokolan, Patrick; Doherty, Dennis; Duguay, Claude; Radcliffe, William; Bourassa, Virgil, Data collector.
  16. Rudary, Matthew R., Date-based web page annotation.
  17. Sugawara, Yu; Kato, Yoshikiyo; Imaizumi, Ryoichi; Fukushima, Ken′ichi, Deduplication in search results.
  18. Lloyd, Matthew, Detecting common prefixes and suffixes in a list of strings.
  19. Gomes, Benedict A.; Smith, Benjamin T., Detecting query-specific duplicate documents.
  20. Gomes, Benedict A.; Smith, Benjamin T., Detecting query-specific duplicate documents.
  21. Gomes, Benedict Anthony; Smith, Benjamin Thomas, Detecting query-specific duplicate documents.
  22. Tankovich, Vladimir; Meyerzon, Dmitriy; Poznanski, Victor, Detection of junk in search result ranking.
  23. Halevy, Alon Y.; Madhavan, Jayant; Ko, David H., Determining a geographic location relevant to a web page.
  24. Schwerk, Uwe, Document broadcasting utilizing hashcodes.
  25. Hewett, Jeffrey R.; Hewett, Michael S.; Hewett, Daria K., Document de-duplication and modification detection.
  26. Tankovich, Vladimir; Meyerzon, Dmitriy; Taylor, Michael James, Document length as a static relevance feature for ranking search results.
  27. Murthy, Ravi; Chandrasekar, Sivasankaran; Sedlar, Eric; Agarwal, Nipun, Document level indexes for efficient processing in multiple tiers of a computer system.
  28. Zhu, Huican; Acharya, Anurag; Ibel, Max; Gobioff, Howard B., Document reuse in a search engine crawler.
  29. Zhu, Huican; Acharya, Anurag; Ibel, Max; Gobioff, Howard Bradley, Document reuse in a search engine crawler.
  30. Zhu, Huican; Ibel, Maximilian; Acharya, Anurag; Gobioff, Howard Bradley, Document reuse in a search engine crawler.
  31. Dulitz, Daniel; Verstak, Alexandre A.; Ghemawat, Sanjay; Dean, Jeffrey A., Duplicate document detection in a web crawler system.
  32. Kato, Kyoko, Duplicate file detection device, duplicate file detection method, and computer-readable storage medium.
  33. Dean, Jeffrey A.; Ghemawat, Sanjay; Thambidorai, Gautham, Efficient indexing of documents with similar content.
  34. Lempel, Ronny; Leyba, Todd; McPherson, Jr., John A.; Perez, Justo Luis, Enforcing native access control to indexed documents.
  35. Meyerzon, Dmitriy; Shnitko, Yauhen; Burges, Chris J. C.; Taylor, Michael James, Enterprise relevancy ranking using a neural network.
  36. Krishnaprasad, Muralidhar; Liao, Ciya; Chang, Thomas H.; Bhavsar, Meeten, Extensible mechanism for detecting duplicate search items.
  37. Chang, Thomas H.; Bhavsar, Meeten; Krishnaprasad, Muralidhar, Extensible mechanism for grouping search results.
  38. Robertson, Stephen; Zaragoza, Hugo; Taylor, Michael; Larimore, Stefan Isbein; Petriuc, Mihai, Field weighting in text searching.
  39. Diamond, Theodore George; Hendrick, Daniel Allen; Rehm, Eric Carl; Riesland, Melissa Anne, Full-text relevancy ranking.
  40. Prince, John, Fuzzy database retrieval.
  41. Broder, Andrei Z.; Fontoura, Marcus Felipe; Herscovici, Michael; Lempel, Ronny; McPherson, Jr., John Ai; Neumann, Andreas; Qi, Runping; Shekita, Eugene Jon, Generic architecture for indexing document groups in an inverted text index.
  42. Abajian, Aram Christian, Grouping multimedia and streaming media search results.
  43. Abajian, Aram Christian, Grouping multimedia and streaming media search results.
  44. Beynon, Margaret Ann Ruth; Flegg, Andrew James, Guaranteeing hypertext link integrity.
  45. Beynon, Margaret Ann Ruth; Flegg, Andrew James, Guaranteeing hypertext link integrity.
  46. Bender, Michael; Nachman, David E.; Shute, Michael P.; Walker, Keith R., Identifying webpages accessible by unauthorized users via URL guessing or network sniffing.
  47. Krishnamurthy, Sanjay M., Indexing XML documents efficiently.
  48. Uetabira, Shigeki; Uetabira, Mitsugu, Information search provision apparatus and information search provision system.
  49. Jensen-Grey, Sean S., Internet crawl seeding.
  50. Abajian, Aram Christian; Alexander, Robin Andrew; Lee, Scott Chao-Chueh; Dahl, Austin David; Derosa, John Anthony; Porter, Charles A.; Rehm, Eric Carl; Kolar, Jennifer Lynn; Sudanagunta, Srinivasan, Internet streaming media workflow architecture.
  51. Alpert, Jesse L.; Tammana, Praveen K.; Kurzion, Yair, Managing URLs.
  52. Alpert, Jesse L.; Tammana, Praveen K.; Kurzion, Yair, Managing URLs.
  53. Alpert, Jesse L., Managing items in crawl schedule.
  54. Rassool, Reza; Worzel, William P.; Baker, Brian, Media tracking system and method.
  55. Dengler, Patrick M.; Krishnan, Arvind K.; Singh, Jagdish; Sanchez, Lawrence M.; Shankar, Sai; Chittamuru, Satish Kumar; Pekic, Zoltan; Mondal, Nabarun; Kumar, Namendra; i Dalfó, Ricard Roma, Metadata driven user interface.
  56. Villadsen, Peter; Chen, Zhaoqi; Gottumukkala, Ramakanthachary S.; Calderon, Marcos, Metadata-based eventing supporting operations on data.
  57. Davis, Mark W.; Murphy, John; Carter, Paul Grant, Method and apparatus for duplicate detection.
  58. Kantrowitz, Mark, Method and apparatus for efficient identification of duplicate and near-duplicate documents and text spans using high-discriminability text fragments.
  59. Hendricks, John S.; Bonner, Alfred E.; McCoskey, John S.; Asmussen, Michael L., Method and apparatus for targeting of interactive virtual objects.
  60. Hendricks, John S.; Bonner, Alfred E.; McCoskey, John S.; Asmussen, Michael L., Method and apparatus for targeting of interactive virtual objects.
  61. Diamond, Michael B.; White, Jonathan B., Method and system for accessing content on demand.
  62. Lee, ChangHee, Method and system for detecting original document of web document, method and system for providing history information of web document for the same.
  63. Von Weihe, Daniel, Method and system for document retrieval with selective document comparison.
  64. Safa, John, Method and system for shared document approval.
  65. Brei, James; Holt, Mark; Olson, Ken; Potharaju, Sri, Method for creating durable web-enabled uniform resource locator links.
  66. Pollastro, Paul J., Method for generating increased numbers of leads via the internet.
  67. Kraft, Reiner; Neumann, Andreas, Method for handling anchor text.
  68. Fontoura, Marcus Felipe; Lempel, Ronny; Qi, Runping; Zien, Jason Yeong, Method for searching documents for ranges of numeric values.
  69. Hayward, Monte Duane, Method of disseminating advertisements using an embedded media player page.
  70. Hayward, Monte Duane, Method of disseminating advertisements using an embedded media player page.
  71. Hayward, Monte Duane, Method of disseminating advertisements using an embedded media player page.
  72. Hayward, Monte Duane, Method of sizing an embedded media player page.
  73. Hobson, Stephen J.; Todd, Stephen J., Method, apparatus and computer program for retrieving data.
  74. Shi, Teng, Method, apparatus, and communication system for transmitting graphic information.
  75. Fontoura, Marcus F.; Neumann, Andreas; Qi, Runping; Shekita, Eugene J., Method, system, and program for handling redirects in a search engine.
  76. Glover, Robin Wallace, Methods and systems for comparing presentation slide decks.
  77. Abajian, Aram Christian; Alexander, Robin Andrew; Lee, Scott Chao-Chueh; Dahl, Austin David; Derosa, John Anthony; Porter, Charles A.; Rehm, Eric Carl; Kolar, Jennifer Lynn; Sudanagunta, Srinivasan, Methods and systems for enhancing metadata.
  78. Abajian, Aram Christian; Alexander, Robin Andrew; Lee, Scott Chao-Chueh; Dahl, Austin David; Derosa, John Anthony; Porter, Charles A.; Rehm, Eric Carl; Kolar, Jennifer Lynn; Sudanagunta, Srinivasan, Methods and systems for enhancing metadata.
  79. More, Scott; Beyer, Ilya, Methods and systems for exact data match filtering.
  80. Abajian, Aram Christian, Methods and systems for grouping uniform resource locators based on masks.
  81. More, Scott, Methods and systems for image fingerprinting.
  82. Mulder, Samuel Peter Matthew, Methods and systems for monitoring documents exchanged over email applications.
  83. More, Scott, Methods and systems for preventing unauthorized disclosure of secure information using image fingerprinting.
  84. More, Scott; Beyer, Ilya; Sweeting, Daniel Christopher John, Methods and systems for protect agents using distributed lightweight fingerprints.
  85. More, Scott; Beyer, Ilya, Methods and systems to fingerprint textual information using word runs.
  86. More, Scott; Beyer, Ilya; Sweeting, Daniel Christopher John, Methods and systems to implement fingerprint lookups across remote agents.
  87. More, Scott; Beyer, Ilya; Sweeting, Daniel Christopher John, Methods and systems to implement fingerprint lookups across remote agents.
  88. Carver, Anton P. T., Minimizing visibility of stale content in web searching including revising web crawl intervals of documents.
  89. Carver, Anton P. T., Minimizing visibility of stale content in web searching including revising web crawl intervals of documents.
  90. Carver, Anton P. T., Minimizing visibility of stale content in web searching including revising web crawl intervals of documents.
  91. Idicula, Sam; Agarwal, Nipun; Murthy, Ravi; Sedlar, Eric, Path-caching mechanism to improve performance of path-related operations in a repository.
  92. Idicula, Sam; Agarwal, Nipun; Murthy, Ravi; Sedlar, Eric, Path-caching mechanism to improve performance of path-related operations in a repository.
  93. Fontoura, Marcus Felipe; Kraft, Reiner; Leung, Tony Kai-Chi; McPherson, Jr., John A.; Neumann, Andreas; Qi, Runping; Rajagopalan, Sridhar; Shekita, Eugene J.; Zien, Jason Yeong, Pipelined architecture for global analysis and index building.
  94. Fontoura, Marcus F.; Kraft, Reiner; Leung, Tony K.; McPherson, Jr., John Ai; Neumann, Andreas; Qi, Runping; Rajagopalan, Sridhar; Shekita, Eugene J.; Zien, Jason Yeong, Pipelined architecture for global analysis and index building.
  95. Redpath, Richard J., Plug-in parsers for configuring search engine crawler.
  96. Obata, Kenji; Meyerzon, Dmitriy, Proxy server using a statistical model.
  97. Obata, Kenji; Meyerzon, Dmitriy, Proxy server using a statistical model.
  98. Brandenberger, Sarah M., Published web page version tracking.
  99. Meyerzon, Dmitriy; Zaragoza, Hugo, Ranking search results using biased click distance.
  100. Meyerzon, Dmitriy; Li, Hang, Ranking search results using feature extraction.
  101. Meyerzon, Dmitriy; Zaragoza, Hugo, Ranking search results using language types.
  102. Poznanski, Victor; Wang, Oivind; Holm, Fredrik; Bodd, Nicolai; Tankovich, Vladimir; Meyerzon, Dmitriy, Re-ranking search results.
  103. Dulitz, Daniel; Verstak, Alexandre A.; Ghemawat, Sanjay; Dean, Jeffrey A., Representative document selection for a set of duplicate documents.
  104. Dulitz, Daniel; Verstak, Alexandre A.; Ghemawat, Sanjay; Dean, Jeffrey A., Representative document selection for sets of duplicate documents in a web crawler system.
  105. Dulitz, Daniel; Verstak, Alexandre A.; Ghemawat, Sanjay; Dean, Jeffrey A., Representative document selection for sets of duplicate documents in a web crawler system.
  106. Randall, Keith H., Scheduler for search engine crawler.
  107. Randall, Keith H., Scheduler for search engine crawler.
  108. Randall, Keith H., Scheduler for search engine crawler.
  109. Zhu, Huican; Ibel, Maximilian; Acharya, Anurag; Gobioff, Howard Bradley, Scheduler for search engine crawler.
  110. Zhu, Huican; Ibel, Maximilian; Acharya, Anurag; Gobioff, Howard Bradley, Scheduler for search engine crawler.
  111. Auerbach, David B.; Alpert, Jesse L., Scheduling a recrawl.
  112. Tankovich, Vladimir; Li, Hang; Meyerzon, Dmitriy; Xu, Jun, Search results ranking using editing distance and document information.
  113. Fontoura, Marcus F.; Lempel, Ronny; Qi, Runping; Zien, Jason Y., Searching documents for ranges of numeric values.
  114. Fontoura, Marcus Felipe; Lempel, Ronny; Qi, Runping; Zien, Jason Yeong, Searching documents for ranges of numeric values.
  115. Fontoura, Marcus Felipe; Lempel, Ronny; Qi, Runping; Zien, Jason Yeong, Searching documents for ranges of numeric values.
  116. Halevy, Alon Y.; Madhavan, Jayant; Ko, David H., Searching through content which is accessible through web-based forms.
  117. Benton, James R.; Kalach, Ran; Oltean, Paul Adrian; Matev, Georgi M., Storage reports duplicate file detection.
  118. Iitsuka, Takayoshi, Storage system.
  119. Iitsuka, Takayoshi, Storage system with improved de-duplication arrangement.
  120. Krishnamurthy, Sanjay M., Storing XML documents efficiently in an RDBMS.
  121. Huang, Anita Wai-Ling; Sundaresan, Neelakantan, System and method for classifying electronically posted documents.
  122. Huang, Anita Wai Ling; Sundaresan, Neelakantan, System and method for classifying electronically posted documents.
  123. Black, Cameron; Schmidt, Ross A.; Brockway, Sean M.; Craig, Robert M.; Partington, Todd, System and method for data management.
  124. Glover, Robin, System and method for determining document version geneology.
  125. Najork, Marc Alexander; Heydon, Clark Allan, System and method for efficient filtering of data set addresses in a web crawler.
  126. Meyerzon, Dmitriy; Zaragoza, Hugo, System and method for ranking search results using click distance.
  127. Merrigan, Chadd Creighton; Peltonen, Kyle G.; Meyerzon, Dmitriy; Lee, David J., System and method for scoping searches using index keys.
  128. Mulder, Matthew, System and method for securing documents prior to transmission.
  129. Connaughton, Chris, System and method of analyzing an HTML document for changes such that the changed areas can be displayed with the original formatting intact.
  130. Knauft, Christopher L.; Franklin, Martin, System and method of dynamically generating index information.
  131. Kraft, Reiner; Neumann, Andreas, System and program for handling anchor text.
  132. Frieder, Ophir; Chowdhury, Abdur R., System for similar document detection.
  133. Frieder, Ophir; Chowdhury, Abdur R., System for similar document detection.
  134. Frieder, Ophir; Chowdhury, Abdur R., System for similar document detection.
  135. Brill, Eric D.; Meek, Christopher A., Systems and methods for client-based web crawling.
  136. Hayward, Monte Duane, Systems and methods for rendering content.
  137. Howe, Karen N.; Kolar, Jennifer L.; Sudanagunta, Srinivasan, Targeted advertising for playlists based upon search queries.
  138. Lloyd, Matthew; Bergan, Thomas, Uniform resource locator canonicalization.
  139. Merrells, John; Natkovich, Olga; Good, Gordon; Smith, Mark C., UniqueID-based addressing in a directory server.
  140. Kupke, Joachim; Cox, Jeff, Updating search engine document index based on calculated age of changed portions in a document.
  141. Kupke, Joachim; Cox, Jeff, Updating search engine document index based on calculated age of changed portions in a document.
  142. McCoskey, John S.; Swart, William D.; Asmussen, Michael L., Video and digital multimedia aggregator.
  143. McCoskey, John S.; Swart, William D.; Asmussen, Michael L., Video and digital multimedia aggregator.
  144. McCoskey, John S.; Swart, William D.; Asmussen, Michael L., Video and digital multimedia aggregator.
  145. McCoskey, John S.; Swart, William D.; Asmussen, Michael L., Video and digital multimedia aggregator.
  146. Swart, William D.; McCoskey, John S.; Asmussen, Michael L., Video and digital multimedia aggregator content coding and formatting.
  147. Asmussen, Michael L.; Mccoskey, John S.; Swart, William D., Video and digital multimedia aggregator content suggestion engine.
  148. Swart, William D.; Asmussen, Michael L.; McCoskey, John S., Video and digital multimedia aggregator remote content crawler.
  149. Shi, Bin; Xu, Gu; Ma, Wei Ying, Web forum crawler.