Mintz, Levin, Cohn, Ferris, Glovsky and Popeo, P.C.
인용정보
피인용 횟수 :
44인용 특허 :
4
초록▼
Indexing documents is accomplished by generating an inverted index for a collection of one or more documents. The inverted index includes an inverted list for an index term appearing in one or more of the documents in the collection, and one or more postings. A posting includes a document identifier
Indexing documents is accomplished by generating an inverted index for a collection of one or more documents. The inverted index includes an inverted list for an index term appearing in one or more of the documents in the collection, and one or more postings. A posting includes a document identifier identifying a document in the collection of documents, a position identifier identifying a position of the index term in the document, and proximity information specifying whether the index term is positioned in a predefined proximal relationship between the index term and another a second index term in the document.
대표청구항▼
What is claimed is: 1. A computer implemented method comprising: parsing, by at least one of one or more processors, each document in a collection of documents to create a vocabulary comprising a plurality of index terms that occur in the collection of documents, the plurality of index terms compri
What is claimed is: 1. A computer implemented method comprising: parsing, by at least one of one or more processors, each document in a collection of documents to create a vocabulary comprising a plurality of index terms that occur in the collection of documents, the plurality of index terms comprising commonly occurring terms and infrequently occurring terms, the commonly occurring terms differing from the infrequently occurring terms; identifying, by at least one of the one or more processors, the commonly occurring terms and assigning a unique common term identifier value to each of the commonly occurring terms; generating, by at least one of the one or more processors, a plurality of expanded inverted lists, each expanded inverted list of the plurality of expanded inverted lists corresponding to one of the plurality of index terms and comprising a posting for each document in the collection of documents that includes the corresponding one of the plurality of index terms, each posting comprising: an identification of the document of the collection of documents in which the corresponding one of the plurality of index terms appears, an indication of whether the corresponding one of the plurality of index terms in the one of the collection of documents is one of the infrequently occurring terms and whether the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms, a number of times the corresponding one of the plurality of index terms occurs in the one of the collection of documents, a location offset value specifying where in the document the corresponding one of the plurality of index terms occurs relative to a reference location in the document, and the unique common term identifier assigned to the commonly occurring one of the plurality of index terms if the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms; creating, by at least one of the one or more processors, an expanded inverted index that comprises the vocabulary and the expanded inverted list corresponding to each of the plurality of index terms; parsing, by at least one of the one or more processors, a search query that comprises a phrase query of two or more search terms that must appear in a specified order, the parsing comprising identifying a sequence of the search terms that includes a first one of the commonly occurring terms that is immediately followed or immediately preceded by one of the infrequently occurring terms and also identifying a second one of the search terms that is a second of the infrequently occurring terms and that is not immediately adjacent to one of the commonly occurring terms; retrieving, by at least one of the one or more processors, two or more of the plurality of expanded inverted lists from the expanded inverted index, the two or more of the plurality of expanded inverted lists comprising expanded inverted lists corresponding to the first and the second infrequently occurring index terms identified in the parsing; and returning, by at least one of the one or more processors, one or more documents of the collections of documents that appear in all of the retrieved two or more of the plurality of expanded inverted lists after evaluating the search query using only the retrieved two or more of the plurality of expanded inverted lists. 2. A method as in claim 1, wherein the indication of whether the corresponding one of the plurality of index terms in the one of the collection of documents is one of the infrequently occurring terms and that the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms indicates that the corresponding one of the plurality of index terms occurs immediately after one of the commonly occurring terms. 3. A method as in claim 1, wherein the indication of whether the corresponding one of the plurality of index terms in the one of the collection of documents is one of the infrequently occurring terms and that the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms indicates that the corresponding one of the plurality of index terms occurs immediately before one of the commonly occurring terms. 4. A method as in claim 1, wherein each posting further comprises information about whether the corresponding one of the plurality of index terms in the one of the collection of documents is one of the infrequently occurring terms and that the corresponding one of the plurality of index terms occurs both immediately after one of the commonly occurring terms and immediately before another one of the commonly occurring terms; and wherein the parsing further comprises identifying an additional sequence of the search terms that includes one of the commonly occurring terms that is both immediately followed and immediately preceded by one of the infrequently occurring terms. 5. A method as in claim 1, wherein the documents in the collections of documents comprise one or more of source code, binary files, tables of genetic code, text documents, structured documents, and unstructured documents. 6. A method as in claim 1, further comprising compressing the expanded inverted index using an integer compression scheme. 7. A method as in claim 1, wherein the vocabulary comprises all index terms that occur in the collection of documents. 8. A method as in claim 1, wherein the reference location is a beginning of the document and wherein the location offset value indicates how many terms from the beginning of the document the corresponding term is. 9. A computer-implemented method comprising: parsing, by at least one of one or more processors, a search query that comprises a phrase query of two or more search terms that must appear in a specified order, the parsing comprising identifying a sequence of the search terms that includes a first one of a plurality of commonly occurring terms that is immediately adjacent to one of a plurality of infrequently occurring terms and also identifying a second of the search terms that is a second one of the plurality of infrequently occurring terms and that is not immediately adjacent to any of the commonly occurring terms; retrieving, by at least one of the one or more processors, two or more expanded inverted lists of a plurality of expanded inverted lists that comprise an expanded inverted index, the expanded inverted index comprising a vocabulary that comprises a plurality of index terms that occur in a collection of documents that are searched in response to the search query, each of the plurality of expanded inverted lists corresponding to one of the plurality of index terms, the plurality of index terms comprising commonly occurring terms and infrequently occurring terms, the commonly occurring terms and infrequently occurring terms differing form each other, each expanded inverted list comprising a posting for each document in the collection of documents that includes the corresponding index term, each posting in each expanded inverted list comprising: an identification of the document of the collection of documents in which the corresponding index term appears, an indication of whether the corresponding index term in the one of the collection of documents is one of the infrequently occurring terms and whether the corresponding index term occurs immediately adjacent to one of the commonly occurring terms, a number of times the corresponding index term occurs in the one of the collection of documents, a location offset value specifying where in the document the corresponding term occurs relative to a reference location in the document, and a unique common term identifier assigned to the commonly occurring term if the corresponding index term occurs immediately adjacent to one of the commonly occurring terms, the retrieved two or more of the plurality of expanded inverted lists corresponding to the first and the second infrequently occurring index terms identified in the parsing; returning, by at least one of the one or more processors, one or more documents of the collections of documents that appear in the all of the two or more of the plurality of expanded inverted lists after evaluating the search query using only the retrieved two or more of the plurality of expanded inverted lists. 10. An apparatus comprising: one or more processors that perform functions comprising: parsing each document in a collection of documents to create a vocabulary comprising a plurality of index terms that occur in the collection of documents, the plurality of index terms comprising commonly occurring terms and infrequently occurring terms, the commonly occurring terms differing from the infrequently occurring terms; identifying the commonly occurring terms and assigning a unique common term identifier value to each of the commonly occurring terms; generating a plurality of expanded inverted lists, each expanded inverted list of the plurality of expanded inverted lists corresponding to one of the plurality of index terms and comprising a posting for each document in the collection of documents that includes the corresponding one of the plurality of index terms, each posting comprising: an identification of the document of the collection of documents in which the corresponding one of the plurality of index terms appears, an indication of whether the corresponding one of the plurality of index terms in the one of the collection of documents is one of the infrequently occurring terms and whether the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms, a number of times the corresponding one of the plurality of index terms occurs in the one of the collection of documents, a location offset value specifying where in the document the corresponding one of the plurality of index terms occurs relative to a reference location in the document, and the unique common term identifier assigned to the commonly occurring one of the plurality of index terms if the corresponding one of the plurality of index terms occurs immediately adjacent to one of the commonly occurring terms; creating an expanded inverted index that comprises the vocabulary and the expanded inverted list corresponding to each of the plurality of index terms; parsing a search query that comprises a phrase query of two or more search terms that must appear in a specified order, the parsing comprising identifying a sequence of the search terms that includes a first one of the commonly occurring terms that is immediately followed or immediately preceded by one of the infrequently occurring terms and also identifying a second one of the search terms that is a second of the infrequently occurring terms and that is not immediately adjacent to one of the commonly occurring terms; retrieving of the plurality of expanded inverted lists from the expanded inverted index, the two or more of the plurality of expanded inverted lists comprising expanded inverted lists corresponding to the first and the second infrequently occurring index terms identified in the parsing; and returning one or more documents of the collections of documents that appear in all of the retrieved two or more of the plurality of expanded inverted lists after evaluating the search query using only the retrieved two or more of the plurality of expanded inverted lists.
연구과제 타임라인
LOADING...
LOADING...
LOADING...
LOADING...
LOADING...
이 특허에 인용된 특허 (4)
Rubin Bradley Scott, Object oriented information retrieval framework mechanism.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Adding information or functionality to a rendered document via association with an electronic counterpart.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Aggregate analysis of text captures performed by multiple users from rendered documents.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Association of a portable scanner with input/output and storage devices.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Association of a portable scanner with input/output and storage devices.
King, Martin T.; Stephens, Redwood; Mannby, Claes-Fredrik; Peterson, Jesse; Sanvitale, Mark; Smith, Michael J., Automatically capturing information, such as capturing information using a document-aware device.
King, Martin T.; Stephens, Redwood; Mannby, Claes-Fredrik; Peterson, Jesse; Sanvitale, Mark; Smith, Michael J.; Daley-Watson, Christopher J., Automatically providing content associated with captured information, such as information captured in real-time.
King, Martin Towle; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Quentin, Capturing text from rendered documents using supplement information.
King, Martin Towle; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Quentin, Capturing text from rendered documents using supplemental information.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Determining actions involving captured information and electronic content associated with rendered documents.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Handheld device for capturing text from both a document printed on paper and a document displayed on a dynamic display device.
King, Martin T.; Stephens, Redwood; Mannby, Claes-Fredrik; Peterson, Jesse; Sanvitale, Mark; Smith, Michael J., Identifying a document by performing spectral analysis on the contents of the document.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Methods and systems for initiating application processes by data capture from rendered documents.
King, Martin T.; Stephens, Redwood; Mannby, Claes-Fredrik; Peterson, Jesse; Sanvitale, Mark; Smith, Michael J., Performing actions based on capturing information from rendered documents, such as documents under copyright.
King, Martin T.; Stephens, Redwood; Mannby, Claes-Fredrik; Peterson, Jesse; Sanvitale, Mark; Smith, Michael J., Performing actions based on capturing information from rendered documents, such as documents under copyright.
King, Martin Towle; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Quentin, Processing techniques for text capture from a rendered document.
King, Martin T.; Kushler, Clifford A.; Stafford-Fraser, James Q.; Grover, Dale L., Processing techniques for visual capture data from a rendered document.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Search engines and systems with handheld document data capture devices.
King, Martin Towle; Stafford-Fraser, James Quentin; Kushler, Clifford A.; Grover, Dale L., System and method for information gathering utilizing form identifiers.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Triggering actions in response to optically or acoustically capturing keywords from a rendered document.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Triggering actions in response to optically or acoustically capturing keywords from a rendered document.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Triggering actions in response to optically or acoustically capturing keywords from a rendered document.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Triggering actions in response to optically or acoustically capturing keywords from a rendered document.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Triggering actions in response to optically or acoustically capturing keywords from a rendered document.
※ AI-Helper는 부적절한 답변을 할 수 있습니다.