Machine-learned approach to determining document relevance for search over large electronic collections of documents
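IPC classification information
Country / Type: United States (US) Patent, Granted
International Patent Classification (IPC, 7th edition): G06F-015/18; G06F-017/00
Application number: US-0754159 (2004-01-09)
Registration number: US-7287012 (2007-10-23)
Inventors / Address: Corston, Simon H.; Chandrasekar, Raman; Chen, Harr
Applicant / Address: Microsoft Corporation
Agent / Address: Amin, Turocy & Calvin, LLP
Citation information
Times cited: 22
Patents cited: 42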
Abstract
The present invention relates to a system and methodology that applies automated learning procedures for determining document relevance and assisting information retrieval activities. A system is provided that facilitates a machine-learned approach to determine document relevance. The system includes a storage component that receives a set of human selected items to be employed as positive test cases of highly relevant documents. A training component trains at least one classifier with the human selected items as positive test cases and one or more other items as negative test cases in order to provide a query-independent model, wherein the other items can be selected by a statistical search, for example. Also, the trained classifier can be employed to aid an individual in identifying and selecting new positive cases or utilized to filter or re-rank results from a statistical-based search.
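The training and re-ranking flow described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: it uses a small hand-written multinomial Naive Bayes model (one of the learning techniques the claims list), and the class name `NaiveBayesRelevance`, the helper `rerank`, the whitespace tokenizer, and the sample documents are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


class NaiveBayesRelevance:
    """Two-class multinomial Naive Bayes with add-one smoothing (hypothetical sketch)."""

    def fit(self, positives, negatives):
        # Label 1 = human-selected positive cases, label 0 = negative cases
        # (for example, other items returned by a statistical search).
        self.counts = {1: Counter(), 0: Counter()}
        for label, docs in ((1, positives), (0, negatives)):
            for doc in docs:
                self.counts[label].update(tokenize(doc))
        self.vocab = set(self.counts[1]) | set(self.counts[0])
        self.totals = {c: sum(self.counts[c].values()) for c in (0, 1)}
        n = len(positives) + len(negatives)
        self.log_prior = {1: math.log(len(positives) / n),
                          0: math.log(len(negatives) / n)}
        return self

    def prob_relevant(self, doc):
        # The score depends only on the document, not on any query,
        # so the model is query-independent.
        scores = {}
        for c in (0, 1):
            s = self.log_prior[c]
            for tok in tokenize(doc):
                if tok in self.vocab:  # ignore tokens never seen in training
                    s += math.log((self.counts[c][tok] + 1)
                                  / (self.totals[c] + len(self.vocab)))
            scores[c] = s
        m = max(scores.values())  # subtract the max for numerical stability
        exp = {c: math.exp(scores[c] - m) for c in scores}
        return exp[1] / (exp[0] + exp[1])


def rerank(model, results):
    # Present items with a higher probability of being relevant first.
    return sorted(results, key=model.prob_relevant, reverse=True)


# Hypothetical training data and search results.
positives = ["official product download page", "official support home page"]
negatives = ["forum thread random chatter", "old archived press release"]
model = NaiveBayesRelevance().fit(positives, negatives)
results = ["random forum chatter", "official download page"]
reranked = rerank(model, results)
```

Here `reranked` places "official download page" ahead of "random forum chatter", because its tokens occur only in the positive training cases. Filtering, rather than re-ranking, would amount to keeping only results whose `prob_relevant` value exceeds a chosen threshold.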
Representative claims
What is claimed is:
1. A computer-implemented system that facilitates a machine-learned approach to determine document relevance, comprising: a storage component that receives a set of human or machine selected items to be employed as positive test cases; and a training component that trains at least one classifier with the human or machine selected items as positive test cases and one or more other items as negative test cases in order to provide a query-independent model, the trained classifier is employed to filter documents obtained from statistical-based or probabilistic-based searches.
2. The system of claim 1, the negative test cases selected by a statistical search.
3. The system of claim 1, the trained classifier is employed to aid an individual in selecting new positive cases.
4. The system of claim 1, outputs of the filter are ranked such that positive cases are ranked before negative cases.
5. The system of claim 1, the outputs are ranked according to a probability they are a positive case.
6. The system of claim 1, the storage component includes logs of relevant sites of interest for users, documents, or data items.
7. The system of claim 6, the storage component includes information for a centralized store or from divergent sources such as web sites, document collections, encyclopedias, local data sources and remote data sources.
8. The system of claim 1, the classifier is employed to automatically analyze data in the storage component in order to assist one or more tools that can interact with a user interface.
9. The system of claim 8, the tools include at least one of an administrative tool, an editing tool, and a ranking tool.
10. The system of claim 8, the tools are employed in at least one of an online and an offline manner.
11. The system of claim 1, the classifiers are trained according to positive and negative test data in order to determine an item's relevance such as from documents or links that suggest other sites of useful information.
12. The system of claim 11, further comprising a set of manually selected documents or items to train a machine-learned classifier.
13. The system of claim 11, the classifier is applied to new terms to identify best bet or relevant documents.
14. The system of claim 11, further comprising bootstrapping new models over various training iterations to facilitate a growing model of learned expressions that are employed for more accurate information retrieval activities.
15. The system of claim 14, further comprising best bets that are hand-selected by an editor.
16. The system of claim 15, further comprising a component to maximize a likelihood of displaying types of documents or items that users are likely to think are interesting enough to view or retrieve.
17. The system of claim 1, the classifier includes at least one of the following learning techniques: Support Vector Machines (SVM), a Naive Bayes, a Bayes Net, a decision tree, similarity-based, a vector-based, a Hidden Markov Model, or other learning technique.
18. The system of claim 1, further comprising a component to perform post-processing of information to determine a document or site's relevance to a user or administrator.
19. The system of claim 18, the post-processing includes ranking in accordance with predetermined probability thresholds, items having a higher probability of being relevant are presented before items of lower probability.
20. The system of claim 18, further comprising explicit annotations that are added to displayed items to indicate a document or site's relevance or importance.
21. A computer readable medium having computer readable instructions stored thereon for implementing the training component and the storage component of claim 1.
22. A computer-based information retrieval system, comprising: means for determining a training set for data terms; means for automatically classifying the training set; means for determining new items from the classified training set; and means for presenting the new items in accordance with an information retrieval request.
23. The system of claim 22, further comprising means for testing the classified training set.
24. A computer-implemented method to facilitate automated information retrieval, comprising: processing n queries from a data log, n being an integer; identifying relevant candidates from the n queries; and training classifiers to identify other relevant candidates for subsequent search activities.
25. The method of claim 24, further comprising forwarding an analysis to an editor that determines whether or not a piece of information is desirable to be presented for a given query or topic.
26. The method of claim 24, further comprising extracting relevant candidates from a list of potential documents or sites and automatically placing the best bets before other statistical rankings.
27. The method of claim 24, further comprising re-ranking results by a probability that a document is relevant, respective documents are downloaded, and terms are extracted and looked-up for terms appearing in the document.
28. The method of claim 24, further comprising determining at least one category to be classified.
29. The method of claim 28, further comprising employing a subset of a training data set to test the classified categories.
30. A computer readable medium having a data structure stored thereon, comprising: a first data field related to a training data set for a relevance category; a second data field that relates to a new set of data items pertaining to the relevance category; and a third data field that relates to a probability ranking for the new set of data items.
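As a rough illustration of the three-field data structure in claim 30, combined with the threshold-based ordering of claim 19, one possible in-memory layout is sketched below. The claims do not specify any concrete representation, so the class name `RelevanceRecord`, its field names, the `ranked` method, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RelevanceRecord:
    """Hypothetical container mirroring the three data fields of claim 30."""
    category: str                                          # the relevance category
    training_set: List[str]                                # first field: training data
    new_items: List[str] = field(default_factory=list)     # second field: new items
    probabilities: Dict[str, float] = field(default_factory=dict)  # third field

    def ranked(self, threshold: float = 0.0) -> List[str]:
        # Keep items whose probability meets the threshold, then present
        # higher-probability items before lower-probability ones.
        # Items without a stored probability are treated as 0.0 (an assumption).
        kept = [item for item in self.new_items
                if self.probabilities.get(item, 0.0) >= threshold]
        return sorted(kept,
                      key=lambda item: self.probabilities.get(item, 0.0),
                      reverse=True)


# Hypothetical example values.
record = RelevanceRecord(
    category="best bets",
    training_set=["hand-picked page"],
    new_items=["x", "y", "z"],
    probabilities={"x": 0.2, "y": 0.9, "z": 0.6},
)
above_half = record.ranked(threshold=0.5)
everything = record.ranked()
```

With these sample values, `above_half` is `["y", "z"]` and `everything` is `["y", "z", "x"]`.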
Patents cited by this patent (42)
Braden-Harder Lisa ; Corston Simon H. ; Dolan William B. ; Vanderwende Lucy H., Apparatus and methods for an information retrieval system that employs natural language processing of search results to.
Lee Shih-Jong J. ; Wilhelm Paul S. ; Bannister Wendy R. ; Kuan Chih-Chau L. ; Oh Seho ; Meyer Michael G., Apparatus for the identification of free-lying cells.
Lee Shih-Jong J. ; Wilhelm Paul S. ; Bannister Wendy R. ; Kuan Chih-Chau L. ; Oh Seho ; Meyer Michael G., Apparatus for the identification of free-lying cells.
Lee Shih-Jong J. ; Wilhelm Paul S. ; Bannister Wendy R. ; Kuan Chih-Chau L. ; Oh Seho ; Meyer Michael G., Apparatus for the identification of free-lying cells.
Bolle,Rudolf M.; Haas,Norman; Oles,Frank J.; Zhang,Tong, Business method and apparatus for employing induced multimedia classifiers based on unified representation of features reflecting disparate modalities.
Amado Carlos Armando (444 Brickell Avenue #51-111 Miami FL 33131-2400), Method and apparatus for applying if-then-else rules to data sets in a relational data base and generating from the resu.
Bolle, Rudolf M.; Haas, Norman; Oles, Frank J.; Zhang, Tong, Method and apparatus for inducing classifiers for multimedia based on unified representation of features reflecting disparate modalities.
Errico James H. ; Labun Nicholas M. ; Loda John J. ; Murdock Michael C. ; Wang Shay-Ping T., Method and system using meta-classes and polynomial discriminant functions for handwriting recognition.
Hong,Se June; Hosking,Jonathan R.; Natarajan,Ramesh, Method for ensemble predictive modeling by multiplicative adjustment of class probability: APM (adjusted probability model).
Barry G. Becker ; Ron Kohavi ; Daniel A. Sommerfield ; Joel D. Tesler, Method system and computer program product for visualizing an evidence classifier.
Barry Glenn Becker ; Roger A. Crawfis, Method, system and computer program product for visually approximating scattered data using color to represent values of a categorical variable.
Becker Barry G. ; Kohavi Ron ; Sommerfield Daniel A. ; Tesler Joel D., Method, system, and computer program product for visualizing an evidence classifier.
Chandrasekar, Raman; Finger, II, James Charles; Salas, Sally K.; Watson, Eric Benjamin, System and method for performing a search and a browse on a query.
Chandrasekar,Raman; Finger, II,James C.; Watson,Eric B., System and method for query refinement to enable improved searching based on identifying and utilizing popular concepts related to users' queries.
Corston, Simon H.; Dolan, William B.; Vanderwende, Lucy H.; Braden-Harder, Lisa, System for processing textual inputs using natural language processing techniques.
Reis James J. (La Palma CA) Luk Anthony L. (Rancho Palos Verdes CA) Lucero Antonio B. (Anaheim CA) Garber David D. (Cypress CA), Target acquisition and tracking system.
Horvitz Eric ; Heckerman David E. ; Dumais Susan T. ; Sahami Mehran ; Platt John C., Technique which utilizes a probabilistic classifier to detect "junk" e-mail by automatically updating a training and re-training the classifier based on the updated training set.
Rao, Arjun Kumar; Kumar, Karthik; Dhakshinamoorthy, Nagadhilipan, System and method for generating a report in real-time from a resource management system.