Provided is a method and system for indexing documents in a collection of linked documents. A link log, including one or more pairings of source documents and target documents, is accessed. A sorted anchor map, containing one or more target document to source document pairings, is generated. The pairings in the sorted anchor map are ordered based on target document identifiers.
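The inversion described in the abstract can be illustrated with a minimal sketch. It assumes link records are simple `(source_id, [target_ids])` tuples with integer identifiers; the function name `build_sorted_anchor_map` and the record layout are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def build_sorted_anchor_map(link_log):
    """Invert link records into anchor records ordered by target identifier.

    link_log: iterable of (source_id, [target_ids]) tuples (assumed layout).
    Returns a list of (target_id, [source_ids]) tuples sorted by target_id.
    """
    inbound = defaultdict(set)
    for source_id, target_ids in link_log:
        for target_id in target_ids:
            # Record that source_id contains a link pointing at target_id.
            inbound[target_id].add(source_id)
    # Order anchor records by target identifier; sort each inbound list too.
    return [(target_id, sorted(sources))
            for target_id, sources in sorted(inbound.items())]


# Hypothetical link log: document 10 links to 30 and 20, and so on.
link_log = [(10, [30, 20]), (20, [30]), (40, [10, 30])]
anchor_map = build_sorted_anchor_map(link_log)
```

In this sketch each anchor record lists the inbound sources for one target, so a lookup by target identifier can use binary search over the sorted list.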
Representative claim
What is claimed is:
1. A method of processing information related to documents in a collection of linked documents, the method comprising: accessing a link log, the link log that comprises a plurality of link records, each link record identifying a source document and a list of one or more target documents pointed to by one or more outbound links in the source document; the link record including a source document identifier for the identified source document and one or more target document identifiers for the identified list of target documents; wherein the link records are based, at least in part on information extracted from crawled documents in the collection of linked documents; and outputting a sorted anchor map that corresponds to the link log and that comprises a plurality of anchor records, each anchor record identifying a respective target document and a list of inbound links, the list of inbound links identifying source documents that contain links to the respective target document; the anchor record including a respective target document identifier; wherein the plurality of anchor records are ordered in the sorted anchor map based, at least in part, on their respective target document identifiers; and wherein each respective target document identifier in the plurality of anchor records corresponds to one of the one or more target document identifiers in the link log.
2. The method of claim 1, wherein each anchor record in the sorted anchor map further comprises a respective list of annotations.
3. The method of claim 2, wherein each annotation included in the respective list of annotations for a respective anchor record corresponds to a respective inbound link identifying a respective source document that contains a link to the respective target document.
4. The method of claim 2, wherein at least one entry in the respective list of annotations of an anchor record in the sorted anchor map includes a text passage and a list of attributes of the text passage.
5. The method of claim 4, wherein the text passage is determined from text within a predetermined distance of an anchor tag in a respective source document in the source documents of the anchor record.
6. The method of claim 1, further including repeating the accessing and outputting so as to produce a layered set of sorted anchor maps.
7. The method of claim 6, further including, when a merge condition has been satisfied, merging a subset of the layered set of sorted anchor maps to produce a merged anchor map; wherein the merged anchor map includes a plurality of merged anchor records, each merged anchor record corresponding to at least one anchor record from the subset of the layered set of sorted anchor maps, wherein the merged anchor records are ordered in the merged anchor map based on their respective target document identifiers.
8. The method of claim 1, further including outputting a sorted link map, the sorted link map comprising a plurality of link map records, each link map record comprising the source document identifier and the list of target document identifiers in an associated link record.
9. The method of claim 8, further including repeating the accessing, the outputting of the sorted anchor map, and the outputting of the sorted link map, so as to produce a layered set of sorted anchor maps and a layered set of sorted link maps.
10. The method of claim 9, further including when a merge condition has been satisfied, merging a subset of the layered set of sorted link maps to produce a merged link map; wherein the merged link map includes a plurality of merged link records, each merged link record corresponding to at least one link record from the subset of the layered set of sorted link maps, wherein the merged link records are ordered in the merged link map based on their respective source document identifiers.
11. The method of claim 10, wherein merging a subset of the layered set of sorted link maps further includes: searching, within the subset of sorted link maps, for link map records containing a particular source document identifier; and when a first link map record and a second link map record each contain the particular source document identifier, if a particular target document identifier is contained in the list of target document identifiers in the first link map record and the particular target document identifier is not contained in the list of target document identifiers in the second link map record, generating a delete entry in a record, the record comprising the particular source document identifier and the particular target document identifier.
12. The method of claim 11, further including when an anchor merge condition has been satisfied, merging a subset of the layered set of sorted anchor maps to produce a merged anchor map; where the merged anchor map includes a plurality of merged anchor records, each merged anchor map record corresponding to at least one anchor map record from the subset of the layered set of sorted anchor maps, wherein the merged anchor records are ordered in the merged anchor map based on their respective target document identifiers.
13. The method of claim 12, wherein, if the delete entry has been generated, the merged anchor map record containing the particular target document identifier in the delete record does not contain the particular source document identifier in the delete record.
14. The method of claim 1, wherein the collection of linked documents reside on a plurality of computers interconnected by the Internet.
15. The method of claim 1, wherein the collection of linked documents includes a first document and a second document, wherein a document address of the first document contains information about a first host on an intranet; wherein a document address of the second document contains information about a second host on an intranet; and wherein the first host and the second host are distinct computer systems connected to one another.
16. The method of claim 1, wherein the respective target document identifiers are monotonically ordered.
17. The method of claim 1, wherein the respective target document identifiers increase monotonically and the anchor records are ordered in the sorted anchor map in accordance with the monotonically increasing target document identifiers in the anchor records.
18. The method of claim 1, wherein the respective target document identifiers decrease monotonically and the anchor records are ordered in the sorted anchor map in accordance with the monotonically increasing target document identifiers in the anchor records.
19. A method of indexing annotations associated with links between documents in a collection of linked documents, the method comprising: crawling at least a subset of the documents in the collection of linked documents and extracting from the crawled documents information concerning outbound links between documents in the collection of linked documents; generating, based on the extracted information, a link log that comprises a plurality of link records, each link record identifying a respective source document and a list of one or more target documents pointed to by outbound links in the respective source document; and generating an anchor map that corresponds to the link log and that comprises a plurality of anchor records, each anchor record identifying a respective target document, a list of inbound links, the list of inbound links identifying source documents that contain links to the respective target document, and a list of annotations associated with the links in the source documents that point to the respective target document; wherein each respective target document identified in the plurality of anchor records corresponds to a target document identified in the link log; and processing at least a plurality of the anchor records, including, for each anchor record in the plurality of anchor records, adding to a document index entries for terms in the list of annotations in the anchor record, wherein the entries are associated with the target document identified by the anchor record.
20. The method of claim 19, wherein the list of annotations in a respective anchor record comprises a subset of the text in the source documents that contain links to the respective target document.
21. The method of claim 20, wherein the list of annotations in a respective anchor record includes a list of attributes for at least a portion of the subset of the text in the source documents in the list respective anchor record.
22. The method of claim 19, wherein the document index is searchable for documents matching specified search queries.
23. The method of claim 19, wherein at least one entry in the list of annotations of an anchor record in the anchor map includes a text passage and a list of attributes of the text passage.
24. The method of claim 23, wherein the text passage is determined from text within a predetermined distance of an anchor tag in a particular source document that contains links to the respective target document.
25. The method of claim 19, wherein the respective target documents are monotonically ordered.
26. The method of claim 19, wherein the respective target documents have monotonically increasing target document identifiers and the anchor records are ordered in the anchor map in accordance with the monotonically increasing target document identifiers of the target documents identified in the anchor records.
27. The method of claim 19, wherein the respective target documents have monotonically decreasing target document identifiers and the anchor records are ordered in the anchor map in accordance with the monotonically decreasing target document identifiers of the target documents identified in the anchor records.
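The layered merge with delete entries recited in claims 10 through 13 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: layers are passed oldest first, the newest record for a source identifier is treated as authoritative, anchor layers are combined by union, and the names `merge_link_maps` and `merge_anchor_maps` are hypothetical.

```python
def merge_link_maps(layers):
    """Merge layered link maps (oldest first) and collect delete entries.

    Each layer is a list of (source_id, [target_ids]) records (assumed layout).
    A delete entry (source_id, target_id) is produced when an older record
    lists a target that a newer record for the same source no longer lists.
    """
    latest = {}
    dropped = set()
    for layer in layers:
        for source_id, target_ids in layer:
            new_targets = set(target_ids)
            old_targets = latest.get(source_id, set())
            # Targets present in the older record but absent from the newer one.
            dropped |= {(source_id, t) for t in old_targets - new_targets}
            # Assumption: a link that reappears later cancels its delete entry.
            dropped -= {(source_id, t) for t in new_targets}
            latest[source_id] = new_targets
    merged = [(s, sorted(ts)) for s, ts in sorted(latest.items())]
    return merged, sorted(dropped)


def merge_anchor_maps(layers, delete_entries):
    """Union layered anchor maps, then drop inbound links named in delete entries."""
    deleted = set(delete_entries)
    inbound = {}
    for layer in layers:
        for target_id, source_ids in layer:
            inbound.setdefault(target_id, set()).update(source_ids)
    merged = []
    for target_id in sorted(inbound):
        sources = sorted(s for s in inbound[target_id]
                         if (s, target_id) not in deleted)
        if sources:  # Assumption: records with no remaining inbound links are omitted.
            merged.append((target_id, sources))
    return merged


# Hypothetical layers: document 10 originally linked to 20 and 30, later only to 30.
old_links = [(10, [20, 30]), (40, [30])]
new_links = [(10, [30])]
merged_links, delete_entries = merge_link_maps([old_links, new_links])

old_anchors = [(20, [10]), (30, [10, 40])]
new_anchors = [(30, [10])]
merged_anchors = merge_anchor_maps([old_anchors, new_anchors], delete_entries)
```

Here the delete entry for the pair (10, 20) removes source 10 from the merged anchor record for target 20, which matches the behavior described in claim 13 for this small example.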
Patents cited in this patent (7)
Blumer, Thomas P.; Mauceri, Jr., Robert J., Computer system and computer-implemented process for presenting document connectivity.
Shimizu, Takeshi (Kanagawa JPX); Saito, Takahiro (Kanagawa JPX); Nakamura, Osamu (Kanagawa JPX), System for managing hypertext node information and link information.
Wong, Sandy; Huynh, Yet L.; Natarajan, Ramakrishnan; Kim, Joon Young; Thogersen, Michael D.; Yao, Tong, Calculating a downloading priority for the uniform resource locator in response to the domain density score, the anchor text score, the URL string score, the category need score, and the link proximity score for targeted web crawling.
Gandhi, Amar S.; Praitis, Edward J.; Kim, Jane T.; Lyndersay, Sean O.; von Koch, Walter V.; Gould, William; Morgan, Bruce A.; Kwan, Cindy, Content syndication platform.
Clark, Timothy Pressler; Garbow, Zachary Adam; Theis, Richard Michael; Wallenfelt, Brian Paul, Document-based information and uniform resource locator (URL) management.
Palliyil, Sudarshan; Venkateshamurthy, Shivakumara; Vijayaraghavan, Srinivas Belur; Aswathanarayana, Tejasvi, Hash-based access to resources in a data processing network.
Dom, Byron Edward; Popescul, Alexandrin; Zhang, Tong, System and method for determining web page quality using collective inference based on local and global information.
Coffman, Daniel M.; Munson, Jonathan P.; Narayanaswami, Chandrasekhar; Soroker, Danny; Wang, Jingtao, Tool and method for annotating an event map, and collaborating using the annotated event map.
Coffman, Daniel M.; Munson, Jonathan P.; Narayanaswami, Chandrasekhar; Soroker, Danny; Wang, Jingtao, Tool and method for mapping and viewing an event.