Embodiments are directed to ranking search results using a junk profile. For a given corpus of documents, one or more junk profiles may be created and maintained. The junk profile provides reference metrics to represent known junk documents. For example, a junk profile may comprise a dictionary of document data that is automatically inserted into documents created using a particular system or template. A junk profile may also comprise one or more representations (e.g., histograms) of a distribution of a particular junk variable for known junk documents. The junk profile provides a usable representation of known junk documents, and the present systems and methods employ the junk profile to predict the likelihood that documents in the corpus are junk. In embodiments, junk scores are calculated and used to rank such documents higher or lower in response to a search query.
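As a rough illustration of the histogram comparison described above, the following Python sketch builds a normalized histogram of one junk variable for a known junk document and for two candidate documents, then compares them. It is only a sketch under stated assumptions, not the patented implementation: the junk variable is taken to be chunk size, a "chunk" is assumed to be a non-empty line measured in words, the bin settings are arbitrary, and histogram intersection is a swapped-in similarity measure that the source does not name. All function and variable names are hypothetical.

```python
from collections import Counter


def chunk_size_histogram(text, bin_width=5, num_bins=10):
    """Return a normalized histogram of chunk sizes for one document.

    Assumption: a chunk is a non-empty line and its size is its word count.
    Sizes beyond the last bin are clamped into the last bin.
    """
    sizes = [len(line.split()) for line in text.splitlines() if line.strip()]
    counts = Counter(min(size // bin_width, num_bins - 1) for size in sizes)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_bins
    return [counts.get(b, 0) / total for b in range(num_bins)]


def histogram_intersection(candidate, reference):
    """Similarity in [0, 1]; 1.0 means the normalized histograms are identical."""
    return sum(min(c, r) for c, r in zip(candidate, reference))


# Hypothetical documents: a known junk form, a near-empty form, a real report.
known_junk = "Title\nOwner\nStatus\nNotes\n"
candidate_a = "Name\nDate\nStatus\nComments\n"
candidate_b = (
    "This report describes the quarterly results for the search team in detail.\n"
    "It covers ranking quality, index freshness, and query latency across regions.\n"
)

reference_hist = chunk_size_histogram(known_junk)
score_a = histogram_intersection(chunk_size_histogram(candidate_a), reference_hist)
score_b = histogram_intersection(chunk_size_histogram(candidate_b), reference_hist)
```

Under these assumptions, the near-empty form has the same chunk-size distribution as the known junk document and receives the higher similarity, while the prose report shares no bins with it.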
Representative Claim
1. A computer-implemented method for ranking candidate documents in response to a search query, comprising steps of: creating, by at least a first processor, an index of a plurality of documents in a corpus; calculating a junk score for at least a first document in the corpus, wherein calculating the junk score comprises: using a first candidate histogram for the first document in the corpus, wherein the first candidate histogram is specific to the first document; and using a junk profile, wherein the junk profile comprises: a first reference histogram for a first known junk document, wherein the first reference histogram is specific to the first known junk document and is based on a first junk variable; and comparing the first candidate histogram to the first reference histogram; receiving a search query; identifying, based on the search query and the index, candidate documents from the plurality of documents in the corpus, wherein the candidate documents include at least the first document; ranking the candidate documents.
2. The computer-implemented method of claim 1, wherein ranking the candidate documents comprises ranking the candidate documents based at least in part on the junk score for the first document, and wherein the ranking of the first document is decreased where the first document is more similar to the first known junk document.
3. The computer-implemented method of claim 1, wherein calculating the junk score further comprises determining a first similarity metric.
4. The computer-implemented method of claim 3, wherein the junk profile comprises a second reference histogram for the first junk variable of a second known junk document, and wherein calculating the junk score comprises comparing the candidate histogram to the second reference histogram to determine a second similarity metric.
5. The computer-implemented method of claim 4, wherein calculating the junk score comprises at least one of: calculating a maximum of the first and second similarity metrics and calculating an average of the first and second similarity metrics.
6. The computer-implemented method of claim 1, further comprising the step of displaying the ranked candidate documents and displaying a junk status for at least the first document.
7. The computer-implemented method of claim 1, wherein the first junk variable comprises chunk size.
8. The computer-implemented method of claim 1, wherein: the junk profile comprises a dictionary of automatically generated data, and wherein creating the index comprises ignoring document data that matches the automatically generated data.
9. The computer-implemented method of claim 1, wherein: the junk profile comprises a dictionary of automatically generated data; calculating the junk score further comprises comparing document data from the plurality of documents in the corpus to the dictionary of automatically generated data; and creating the index comprises delineating in the index document data that matches the automatically generated data.
10. The computer-implemented method of claim 9, wherein identifying the candidate documents includes comparing the search query to document data in the index, and wherein ranking the candidate documents includes determining whether document data matching the search query has been delineated as matching the automatically generated data.
11. The computer-implemented method of claim 9, wherein calculating the junk score for the first document comprises determining a similarity metric between document data in the first document and the automatically generated data.
12. The computer-implemented method of claim 9, further comprising: creating the junk profile, comprising creating the dictionary of automatically generated data by: creating a blank template containing automatically generated data; and extracting the automatically generated data from the blank template.
13. The computer-implemented method of claim 1, wherein the step of calculating includes calculating a junk score for a second document in the corpus and wherein the step of identifying comprises excluding the second document from the candidate documents when the junk score for the second document exceeds a predetermined threshold.
14. The computer-implemented method of claim 1, wherein the step of calculating occurs after the step of identifying and wherein the step of calculating comprises calculating a junk score for a plurality of the candidate documents.
15. The computer-implemented method of claim 1, wherein the corpus is an intranet, the plurality of documents is created using a particular template, and the junk profile is specific to the particular template.
16. The computer-implemented method of claim 1, wherein the search query comprises a query for documents in the corpus that have a junk score that exceeds a predetermined threshold.
17. The computer-implemented method of claim 1, wherein the junk score for the first document exceeds a predetermined threshold, further comprising: sending to an administrator a message identifying the first document as junk.
18. A system for ranking candidate documents in response to a search query, comprising: at least one processor; a memory, operatively connected to the at least one processor and containing instructions that, when executed by the at least one processor, perform a method comprising: creating an index of a plurality of documents in a corpus; calculating a junk score for at least a first document in the corpus, wherein calculating the junk score comprises: using a first candidate histogram for the first document in the corpus, wherein the first candidate histogram is specific to the first document; and using a junk profile, wherein the junk profile comprises: a first reference histogram for a first known junk document, wherein the first reference histogram is specific to the first known junk document and is based on a first junk variable; and comparing the first candidate histogram to the first reference histogram; receiving a search query; identifying, based on the search query and the index, candidate documents from the plurality of documents in the corpus, wherein the candidate documents include at least the first document; ranking the candidate documents based at least in part on the junk score for the first document; wherein creating the index comprises separately delineating document data from the plurality of documents if the document data matches the junk profile.
19. The system of claim 18, wherein the method further comprises: creating, for at least the first document, a candidate histogram for at least a first junk variable; wherein calculating the junk score comprises comparing the candidate histogram to the first reference histogram to determine a first similarity metric; wherein the junk profile comprises a dictionary of automatically generated data; wherein calculating the junk score further comprises comparing document data from the plurality of documents in the corpus to the dictionary of automatically generated data; and wherein creating the index comprises delineating in the index document data that matches the automatically generated data.
20. A computer storage medium including computer-executable instructions that, when executed by at least one processor, perform a method comprising: creating an index of a plurality of documents in a corpus; creating, for at least a first document of the plurality of documents, a candidate histogram specific to the first document for at least a first junk variable; calculating a junk score for at least the first document using a junk profile, wherein: the junk profile comprises: a first reference histogram for a first known junk document, wherein the first reference histogram is specific to at least the first known junk document and is based on the first junk variable, and a dictionary of automatically generated data; and calculating a junk score comprises at least (a) comparing the candidate histogram to the first reference histogram to determine a first similarity metric and (b) determining a second similarity metric between document data in the first document and the dictionary of automatically generated data; receiving a search query; identifying, based on the search query and the index, candidate documents from the plurality of documents in the corpus, wherein the candidate documents include at least the first document; ranking the candidate documents based at least in part on the junk score for the first document.
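The scoring and ranking steps recited in the claims above can be sketched in Python as follows. This is a hedged illustration only. The claims name the maximum and the average of the similarity metrics, a dictionary-based similarity metric, a threshold for exclusion, and a lower rank for documents that resemble known junk. Everything else here is an assumption made for the example: histogram intersection as the similarity measure, term overlap as the dictionary similarity, the equal 0.5/0.5 weighting, the multiplicative demotion of relevance, and all function names and sample data.

```python
from statistics import mean


def histogram_intersection(candidate, reference):
    """Similarity in [0, 1] between two normalized histograms."""
    return sum(min(c, r) for c, r in zip(candidate, reference))


def dictionary_similarity(document_terms, generated_terms):
    """Fraction of the document's distinct terms found in the dictionary
    of automatically generated data (an assumed overlap measure)."""
    terms = set(document_terms)
    if not terms:
        return 0.0
    return len(terms & set(generated_terms)) / len(terms)


def junk_score(candidate_hist, reference_hists, document_terms,
               generated_terms, combine=max):
    """Combine per-reference similarities with `combine` (max or mean),
    then blend with the dictionary similarity. Equal weights are assumed."""
    similarities = [histogram_intersection(candidate_hist, ref)
                    for ref in reference_hists]
    histogram_part = combine(similarities) if similarities else 0.0
    dictionary_part = dictionary_similarity(document_terms, generated_terms)
    return 0.5 * histogram_part + 0.5 * dictionary_part


def rank(candidates, junk_scores, exclude_above=None):
    """Order (doc_id, relevance) pairs so junk-like documents rank lower.

    Documents whose junk score exceeds `exclude_above` are dropped.
    Demotion by relevance * (1 - junk score) is an assumed scheme.
    """
    kept = [(doc_id, relevance) for doc_id, relevance in candidates
            if exclude_above is None or junk_scores[doc_id] <= exclude_above]
    kept.sort(key=lambda item: item[1] * (1.0 - junk_scores[item[0]]),
              reverse=True)
    return [doc_id for doc_id, _ in kept]


# Hypothetical junk profile: two reference histograms and a small dictionary.
reference_hists = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
generated_terms = {"title", "owner", "status"}

documents = {
    "blank_form": ([1.0, 0.0, 0.0], ["title", "owner", "status"]),
    "report": ([0.0, 0.0, 1.0], ["quarterly", "results", "status", "latency"]),
}
scores_max = {doc_id: junk_score(hist, reference_hists, terms, generated_terms)
              for doc_id, (hist, terms) in documents.items()}
score_avg = junk_score(*documents["blank_form"][:1], reference_hists,
                       documents["blank_form"][1], generated_terms, combine=mean)

candidates = [("blank_form", 0.9), ("report", 0.6)]
ranked = rank(candidates, scores_max)
filtered = rank(candidates, scores_max, exclude_above=0.9)
```

In this sample, the blank form matches the junk profile closely, so it drops below the report despite its higher raw relevance, and it is removed entirely when the exclusion threshold is set to 0.9.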