Method and system for compression indexing and efficient proximity search of text data
원문보기
IPC분류정보
국가/구분
United States(US) Patent
등록
국제특허분류(IPC7판)
G06F-017/30
G06F-017/00
출원번호
US-0795721
(2004-03-08)
등록번호
US-7433893
(2008-10-07)
발명자
/ 주소
Lowry,Douglas B.
출원인 / 주소
Marpex Inc.
대리인 / 주소
McDowell,Brouse
인용정보
피인용 횟수 :
98인용 특허 :
5
초록▼
A system and method of compression indexing and efficient proximity search of text data permits high speed search featuring ranking the relevance of search results according to closeness of desired terms within each portion of text found. The system includes (a) preparing target text, (b) creating a
A system and method of compression indexing and efficient proximity search of text data permits high speed search featuring ranking the relevance of search results according to closeness of desired terms within each portion of text found. The system includes (a) preparing target text, (b) creating a "compression index ebook", (c) browsing in a compression index ebook, and (d) searching in a compression index ebook. To create the compression index, the method includes the steps of selecting target text, identifying tokens, such as words and punctuation strings, wherein each of the tokens has a frequency. The frequencies of each token are counted. Tokens are ranked from highest frequency to lowest frequency. The frequencies are compressed. The next step is assigning positions to each token frequency and compressing the positions to form a compression index ebook, which is stored in random access memory to eliminate disk seeks during browsing and searching.
대표청구항▼
Having thus described the invention, it is now claimed: 1. A computer implemented method of compression indexing, comprising the steps of: selecting at least one data file, said data file having target text; identifying each and every unique token, each of the unique tokens having a frequency; coun
Having thus described the invention, it is now claimed: 1. A computer implemented method of compression indexing, comprising the steps of: selecting at least one data file, said data file having target text; identifying each and every unique token, each of the unique tokens having a frequency; counting the frequency of each unique token; calculating parameters; ranking the tokens from highest frequency to lowest frequency; compressing the frequencies; assigning a position to each instance of each and every unique token; compressing the positions; aggregating tokens, frequencies, parameters, and positions to form a compression index, said compression index being exhaustive of all said tokens, such that said target text is compressed 100%; reconstituting a portion of the data file; displaying the portion of the data file on a screen, wherein compressed positions point to a compressed text random access memory file, wherein the step of reconstituting the data file further comprising the steps of: a) creating the compressed text RAM file; b) selecting a domain to display, the domain being a portion of the data file, the domain having a starting point and an ending point; c) decompressing successive integers; d) determining positions of the tokens in the token list; e) extracting the tokens from the token list; and f) writing the tokens to the screen; repeating steps c-f until the ending point of the domain is reached; wherein the selected domain is part of a domains list, the domains list having a plurality of domains, the domains list having the starting point and the ending point of each domain. 2. The method of claim 1, wherein the tokens include every word, every number, every string of punctuation characters, and every markup tag without exception. 3. The method of claim 1, wherein the token is a string of punctuation characters. 4. The method of claim 1, wherein the token is a markup tag with no blanks. 5. The method of claim 1, further comprising the step of: searching the compression index. 6. The method of claim 5, wherein searching occurs in random access memory. 7. The method of claim 6, wherein the step of searching the compression index is performed without disk seeks. 8. The method of claim 1, wherein the step of compressing the positions to form a compression index, the position of the first instance of each token is absolute, and the position of each subsequent instance of the same token is relative to the preceding position. 9. The method of claim 1, wherein the compression index is structured to derive closeness-of-fit measures, the compression index also structured to reproduce any portion of original text for display. 10. The method of claim 5, further comprising the step of: ranking search results by relevance, wherein relevance is determined by the closeness of fit of search terms, headings, and frequencies of search terms. 11. The method of claim 1, further comprising the step of: browsing the compression index, wherein an associated user is able to sequentially read content created from the compression index starting at any point. 12. A computer implemented method for using a compression index, comprising the steps of: a) creating the compression index having the steps of: (i) providing target text, the target text being at least one data file, the target text having tokens, the tokens having frequencies; (ii) accumulating parameters; (iii) building a list of all unique tokens represented in the target text, together with their respective frequencies; (iv) sorting the list in order of declining token frequencies; (v) accumulating positions data of each instance of each token; and (vi) combining steps i-v into the compression index, the compression index being exhaustive of all tokens, such that the target text is 100% compressed; b) browsing and searching the compression index, wherein the step of accumulating parameters comprises: accumulating parameters and token frequencies on a single pass through the at least one data file, the data file having lightly marked up text, wherein the single pass through of the at least one data file of lightly marked up text comprises input for the compression index, wherein the step of building a list of all tokens represented in the target text, together with their respective frequencies further comprises the steps of: sorting all tokens in order of declining frequency; creating a temporary parameters file to facilitate passing parameters between successive states of the method to create the compression index; parsing the target text as a means for creating the token list, each of the tokens having a flag byte preceding the token and a null byte following the token; and compressing and outputting the frequencies of tokens; wherein the step of accumulating positions data further comprises the steps of: reserving a block of random access memory to accumulate positions data; reparsing the entire data file; recording the position of all tokens in the random access memory block; and compressing and outputting the positions data, wherein the position of the first instance of each token is absolute, and the position of each subsequent instance of the same token is relative to the preceding position; wherein the step of combining steps (i)-(v) into one compression index, further comprises the steps of: outputting public parameters as plain text at a beginning of the compression index; outputting compressed private parameters; and appending the tokens, frequencies and positions to complete the compression index; wherein numeric values assigned to successive tokens from the data file are compressed and successively appended to create a compressed text random access memory file, wherein the random access memory file is computationally equivalent to the target text such that the positions within this compressed text random access memory file are used as the position values of successive tokens. 13. The method of claim 12, wherein browsing and searching occurs on a personal computer. 14. The method of claim 12 wherein, browsing and searching is server based over an Internet. 15. The method of claim 12, wherein positions are compressed, compressed positions of the data file point to a compressed text random access memory file, the method further comprising the step of reconstituting the data file to browse and search the compression index comprising the steps of: creating the compressed text RAM file; selecting a domain to display, the domain being a portion of the data file, the domain having a starting point and an ending point, wherein the selected domain is part of a domains list, the domains list having a plurality of domains, the domains list having the starting point and the ending point of each domain; decompressing successive integers; determining positions of the tokens in the token list; extracting the tokens from the token list; writing the tokens to the screen; and repeating steps c-f until the ending point of the domain is reached. 16. The method of claim 12, wherein after searching the compression index, search results are ranked by relevance. 17. The method of claim 16, wherein searching the compression index comprises the step of: scoring for closeness of fit of search terms by measuring the number of non-requested words that intervene between the first and last term found within a domain, and subtracting the count of non-requested words from the maximum allowable value. 18. The method of claim 17, wherein searching the compression index comprises the step of: scoring for headings. 19. The method of claim 18, wherein searching the compression index comprises the step of: scoring for frequencies of search terms. 20. The method of claim 18, further comprising the step of: ranking search results from the domain with the highest score to lowest score, wherein scores equal the sum of closeness of fit, frequency of search terms in heading and frequency of terms in the domain, wherein search results are displayed as a list of hits arranged in order of closeness of fit and with frequency of occurrence in headings and body controlling the ranking when the number of intervening non-requested words is the same for multiple domains. 21. A computer readable medium containing instructions for controlling a computer system to perform a method, the method comprising the steps of: selecting at least one file having target text; identifying each and every unique token, each of the unique tokens having a frequency, the tokens including every word, every number, every string of punctuation characters, and every markup tag without exception; counting the frequency of each unique token; calculating parameters; ranking the tokens from highest frequency to lowest frequency; compressing the frequencies; assigning a position to each instance of each token; compressing the positions; aggregating tokens, frequencies, parameters, and positions to form a compression index, such that the target text is compressed 100%; and browsing and searching the compression index; reconstituting a portion of the at least one file; displaying the portion of the at least one file on a screen; wherein compressed positions point to a compressed text random access memory file, the step of reconstituting the at least one file further comprising the steps of: a) creating the compressed text RAM file; b) selecting a domain to display, the domain being a portion of the at least one file, the domain having a starting point and an ending point; c) decompressing successive integers; d) determining positions of the tokens in the token list; e) extracting the tokens from the token list; and f) writing the tokens to the screen; repeating steps c-f until the ending point of the domain is reached; wherein the selected domain is part of a domains list, the domains list having a plurality of domains, the domains list having the starting point and the ending point of each domain. 22. An apparatus, comprising: means for selecting at least one file, the at least one file having target text; means for identifying each and every unique token, each of the unique tokens having a frequency, the tokens including every word, every number, every string of punctuation characters, and every markup tag without exception; means for counting the frequency of each unique token; means for calculating parameters; means for ranking the tokens from highest frequency to lowest frequency; means for compressing the frequencies; means for assigning a position to each instance of each token; means for compressing the positions; and means for aggregating tokens, frequencies, parameters, and positions to form a compression index, such that the target text is compressed 100%; and means for browsing and searching text derived from the compression index; wherein the compression index is formed by accumulating parameters and token frequencies on a single pass through the at least one data file, the at least one data file having lightly marked up text, wherein the single pass through of the at least one data file of lightly marked up text comprises input for the compression index; wherein the tokens are all sorted in order of declining frequency; wherein the compression index further comprises a temporary parameters file to facilitate passing parameters between successive states of the method to create the compression index; wherein the target text is parsed as a means for creating the token list, each of the tokens having a flag byte preceding the token and a null byte following the token; and wherein the frequencies of tokens are compressed and outputted; wherein a block of random access memory is reserved to accumulate positions data; wherein the entire data file is reparsed; wherein the position of all tokens are recorded in the random access memory block; and wherein the compressed and outputted positions data includes the position of the first instance of each token being absolute, and the position of each subsequent instance of the same token being relative to the preceding position; wherein public parameters are outputted as plain text at a beginning of the compression index; compressed private parameters are outputted; and the tokens, frequencies and positions are appended to complete the compression index; wherein numeric values assigned to successive tokens from the data file are compressed and successively appended to create a compressed text random access memory file, wherein the random access memory file is computationally equivalent to the target text such that the positions within this compressed text random access memory file are used as the position values of successive tokens.
연구과제 타임라인
LOADING...
LOADING...
LOADING...
LOADING...
LOADING...
이 특허에 인용된 특허 (5)
Kaplan Ronald M. ; Mullins Atty T., Compact encoding of multi-lingual translation dictionaries.
de Hita Carolina Rubio,BEX ; Akker David van den,BEX ; Govaers Erik C. E.,BEX ; Platteau Frank M. J.,BEX ; Deun Kurt Van,BEX ; Macpherson Melissa ; de Bie Peter,BEX ; Laviolette Sophie,BEX, Natural language information retrieval system and method.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Adding information or functionality to a rendered document via association with an electronic counterpart.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Aggregate analysis of text captures performed by multiple users from rendered documents.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Association of a portable scanner with input/output and storage devices.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Association of a portable scanner with input/output and storage devices.
King, Martin T.; Stephens, Redwood; Mannby, Claes-Fredrik; Peterson, Jesse; Sanvitale, Mark; Smith, Michael J., Automatically capturing information, such as capturing information using a document-aware device.
King, Martin T.; Stephens, Redwood; Mannby, Claes-Fredrik; Peterson, Jesse; Sanvitale, Mark; Smith, Michael J.; Daley-Watson, Christopher J., Automatically providing content associated with captured information, such as information captured in real-time.
King, Martin Towle; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Quentin, Capturing text from rendered documents using supplement information.
King, Martin Towle; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Quentin, Capturing text from rendered documents using supplemental information.
Knight, William C.; Nussbaum, Nicholas I., Computer-implemented system and method for inclusion-based electronically stored information item cluster visual representation.
Knight, William C.; Nussbaum, Nicholas I., Computer-implemented system and method for visually suggesting classification for inclusion-based cluster spines.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Determining actions involving captured information and electronic content associated with rendered documents.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Handheld device for capturing text from both a document printed on paper and a document displayed on a dynamic display device.
King, Martin T.; Stephens, Redwood; Mannby, Claes-Fredrik; Peterson, Jesse; Sanvitale, Mark; Smith, Michael J., Identifying a document by performing spectral analysis on the contents of the document.
Risvik, Knut Magne; Hopcroft, Michael; Bennett, John G.; Kalyanaraman, Karthik; Chilimbi, Trishul; Walters, Chad P.; Pedersen, Jan Otto, Matching funnel for large document index.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Methods and systems for initiating application processes by data capture from rendered documents.
King, Martin T.; Stephens, Redwood; Mannby, Claes-Fredrik; Peterson, Jesse; Sanvitale, Mark; Smith, Michael J., Performing actions based on capturing information from rendered documents, such as documents under copyright.
King, Martin T.; Stephens, Redwood; Mannby, Claes-Fredrik; Peterson, Jesse; Sanvitale, Mark; Smith, Michael J., Performing actions based on capturing information from rendered documents, such as documents under copyright.
King, Martin Towle; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Quentin, Processing techniques for text capture from a rendered document.
King, Martin T.; Kushler, Clifford A.; Stafford-Fraser, James Q.; Grover, Dale L., Processing techniques for visual capture data from a rendered document.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Search engines and systems with handheld document data capture devices.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Search engines and systems with handheld document data capture devices.
Knight, William C.; Nussbaum, Nicholas I.; Conwell, John W., System and method for displaying relationships between concepts to provide classification suggestions via inclusion.
Knight, William C.; Nussbaum, Nicholas I.; Conwell, John W., System and method for displaying relationships between concepts to provide classification suggestions via injection.
Knight, William C.; Nussbaum, Nicholas I.; Conwell, John W., System and method for displaying relationships between concepts to provide classification suggestions via nearest neighbor.
Knight, William C.; Nussbaum, Nicholas I., System and method for displaying relationships between electronically stored information to provide classification suggestions via inclusion.
Knight, William C.; Nussbaum, Nicholas I., System and method for displaying relationships between electronically stored information to provide classification suggestions via injection.
Knight, William C.; Nussbaum, Nicholas I., System and method for displaying relationships between electronically stored information to provide classification suggestions via nearest neighbor.
King, Martin Towle; Stafford-Fraser, James Quentin; Kushler, Clifford A.; Grover, Dale L., System and method for information gathering utilizing form identifiers.
Borchardt, Jonathan M.; Walter, Edward L., System and method for providing a dynamic user interface for a dense three-dimensional scene with a plurality of compasses.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Triggering actions in response to optically or acoustically capturing keywords from a rendered document.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Triggering actions in response to optically or acoustically capturing keywords from a rendered document.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Triggering actions in response to optically or acoustically capturing keywords from a rendered document.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Triggering actions in response to optically or acoustically capturing keywords from a rendered document.
King, Martin T.; Grover, Dale L.; Kushler, Clifford A.; Stafford-Fraser, James Q., Triggering actions in response to optically or acoustically capturing keywords from a rendered document.
※ AI-Helper는 부적절한 답변을 할 수 있습니다.