Method and apparatus for measuring similarity among electronic documents
IPC classification information
Country / Type
United States (US) Patent
Granted
International Patent Classification (IPC 7th edition)
G06F-015/00
G06F-017/00
G06F-017/21
Application number
US-0333121
(1999-06-14)
Inventor / Address
Palmer,Michael E.
Sun,Gordon G.
Zha,Hongyuan
Applicant / Address
Yahoo! Inc.
Agent / Address
Hickman Palermo Truong &
Citation information
Times cited: 179
Cited patents: 23
Abstract
A method and apparatus are provided for determining when electronic documents stored in a large collection of documents are similar to one another. A plurality of similarity information is derived from the documents. The similarity information may be based on a variety of factors, including hyperlinks in the documents, text similarity, user click-through information, similarity in the titles of the documents or their location identifiers, and patterns of user viewing. The similarity information is fed to a combination function that synthesizes the various measures of similarity information into combined similarity information. Using the combined similarity information, an objective function is iteratively maximized in order to yield a generalized similarity value that expresses the similarity of particular pairs of documents. In an embodiment, the generalized similarity value is used to determine the proper category, among a taxonomy of categories in an index, cache or search system, into which certain documents belong.
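The combination step described in the abstract can be illustrated with a minimal sketch. It assumes only two of the listed measures (text similarity as a word-count cosine, and a direct-hyperlink indicator) and a weighted sum as the combination function. The function names, the weights `w_text` and `w_link`, and the weighted-sum form are illustrative assumptions, not details taken from the patent, and the iterative maximization of the objective function is not shown.

```python
import math
from collections import Counter


def cosine_text_similarity(text_a, text_b):
    """Cosine similarity between simple word-count vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def link_similarity(doc_a, doc_b, links):
    """1.0 if either document links directly to the other, else 0.0."""
    if doc_b in links.get(doc_a, set()) or doc_a in links.get(doc_b, set()):
        return 1.0
    return 0.0


def combined_similarity(docs, links, w_text=0.5, w_link=0.5):
    """Hypothetical combination function: weighted sum of the two measures.

    docs  maps document id -> text
    links maps document id -> set of document ids it links to
    Returns a dict keyed by ordered pairs (a, b) with a != b.
    """
    ids = sorted(docs)
    sim = {}
    for a in ids:
        for b in ids:
            if a != b:
                sim[(a, b)] = (w_text * cosine_text_similarity(docs[a], docs[b])
                               + w_link * link_similarity(a, b, links))
    return sim
```

Because both measures are symmetric, the resulting values are symmetric in the two documents; other measures named in the abstract (click-through, URL, viewing patterns) would enter the sum the same way under this assumption.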
Representative claims
What is claimed is:
1. A computer implemented method of categorizing a plurality of new electronic documents into a set of categories, comprising the steps of: establishing a plurality of training sets, wherein each training set is associated with a category and includes training documents that have been classified as belonging to said associated category; determining how strongly each document of said plurality of documents corresponds to each of said plurality of categories by determining similarity between said each document and the training documents that belong to the training set of said category; and wherein the step of determining similarity is performed using a matrix representing document similarity that is derived by combining two or more measures of document similarity.
2. A method as recited in claim 1, wherein the measures of document similarity include hyperlink similarity.
3. A method as recited in claim 2, in which two documents among the plurality of documents are considered similar to each other when there is a link from one to the other, or when the two documents link to, or are linked to by, a set of other associated documents.
4. A method as recited in claim 3, in which certain hyperlinks have greater or lesser similarity weight than other hyperlinks, based on other features of the links or their source or destination documents.
5. A method as recited in claim 1, wherein the measures of document similarity include a similarity of text of the documents.
6. A method as recited in claim 5, wherein two documents are considered similar based on a comparison of word vectors derived from the text of each of the two documents.
7. A method as recited in claim 5, wherein text similarity is determined in part based upon weight values assigned to words of the text, and wherein certain words have greater or lesser weight than other words.
8. A method as recited in claim 1, wherein the measures of document similarity include user click-through similarity.
9. A method as recited in claim 8, wherein two documents are considered similar based on user click-through similarity when the documents are associated with similar patterns of user click behavior, selected from among frequency of clicks, click context, duration of viewing, proximity in time to other clicks, or proximity in context to other clicks.
10. A method as recited in claim 1, wherein the measures of document similarity are derived from patterns detected in user viewing of the documents.
11. A method as recited in claim 10, wherein the user viewing information is monitored by a web caching system and stored in a log.
12. A method as recited in claim 10, wherein two documents are considered similar based on patterns of user viewing behavior, including frequency of viewing, viewing context, duration of viewing, proximity in time to other documents viewed by the same user, or similarity of patterns of viewing by all users.
13. A method as recited in claim 1, wherein the measures of document similarity include URL similarity.
14. A method as recited in claim 13, wherein two documents are considered similar if a URL of each document contains similar URL sub-components.
15. A method as recited in claim 1, wherein the measures of document similarity include multimedia similarity.
16. A method as recited in claim 15, wherein two documents are considered similar based on features derived from multimedia components linked to or contained by the documents.
17. A method as recited in claim 1, wherein the combination of two or more measures of document similarity is achieved by taking the union of each of a plurality of graphs, each graph describing one of the measures of document similarity, to compute a combined graph that describes the combined document similarity.
18. A method as recited in claim 1, wherein the combination of two or more measures of document similarity is achieved by taking the intersection of each of a plurality of graphs, each graph describing one of the measures of document similarity, to compute a combined graph that describes the combined document similarity.
19. A method as recited in claim 1, further comprising the step of extracting similarity information from the similarity matrix to obtain new documents supported by the set of training documents for each category.
20. A method as recited in claim 19, wherein the similarity information is obtained by optimizing an objective function.
21. A method as recited in claim 19, wherein the similarity information is obtained by only approximately optimizing an objective function.
22. A method as recited in claim 21, wherein approximately optimizing the objective function comprises repeated application of a growth transformation.
23. A method as recited in claim 19, further comprising the step of creating and storing a second matrix that represents an interim score for each document in each category.
24. A method as recited in claim 19, further comprising the steps of, periodically as the matrix is being computed, normalizing rows of the matrix by normalizing within each document, across all categories, whereby the score for one document in a particular category will depend on the scores for that document in all other categories.
25. A method as recited in claim 19, further comprising the steps of, periodically as the matrix is being computed, normalizing columns of the matrix by normalizing within each category, across all documents, whereby the score for one document in a particular category depends on the scores for all other documents in that category.
26. A method as recited in claim 1, in which the categories come from a manually defined taxonomy.
27. A method as recited in claim 1, wherein the categories are derived from logs of user queries.
28. A method as recited in claim 1, further comprising the steps of creating and storing a second matrix using columns representing documents and rows representing user sessions, and wherein values of elements of the second matrix represent interest in a document shown by a particular user in a particular session.
29. A method as recited in claim 1, further comprising the steps of creating and storing a matrix using columns representing user sessions and rows representing documents, and wherein values of elements of the second matrix represent interest in a document shown by a particular user in a particular session.
30. A method as recited in claim 28, wherein the element values are computed as a function of a time that a user has spent viewing a document associated with each element.
31. A method as recited in claim 28, further comprising the steps of creating and storing a second matrix representing a similarity between pairs of documents i and j, wherein the second matrix is derived by comparing pairs of column vectors or row vectors, respectively i and j of the first matrix.
32. A method as recited in claim 28, further comprising the steps of creating and storing a second matrix representing a similarity between pairs of documents i and j, by finding pairs of documents i and j which have high interest values for a particular user in a particular session or period of time.
33. The method recited in claim 1, further comprising the steps of: identifying a category of a classification taxonomy of the hypertext system in which a first electronic document is presently classified; and if a second electronic document is found to be highly similar, storing information that classifies the second electronic document into the category.
34. A computer-readable recording medium carrying one or more sequences of instructions, wherein execution of the one or more sequences of instructions by one or more processors causes the one or more processors to perform the steps of: establishing a plurality of training sets, wherein each training set is associated with a category and includes training documents that have been classified as belonging to said associated category; determining how strongly each document of said plurality of documents corresponds to each of said plurality of categories by determining similarity between said each document and the documents that belong to the training set of said category; and wherein the step of determining similarity is performed using a matrix representing document similarity that is derived by combining two or more measures of document similarity.
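As a rough illustration of how claims 1, 17, 18 and 24 fit together, the sketch below combines per-measure similarity graphs by union or intersection, scores each new document against each category's training set, and normalizes each document's scores across categories. Several choices here are assumptions made for illustration and are not stated in the claims: graphs are represented as dictionaries from document pairs to weights, the union keeps the larger weight and the intersection the smaller, a category score is the mean similarity to the training documents, and normalization is done once rather than periodically. The averaging is a simple stand-in and is not the objective-function optimization or growth transformation of claims 20 to 22.

```python
def union_graph(*graphs):
    """Union of similarity graphs: keep every edge, with its largest weight.

    Each graph maps a pair of document ids, e.g. ("d1", "d2"), to a weight.
    Edges are treated as undirected (an assumption of this sketch).
    """
    combined = {}
    for graph in graphs:
        for edge, weight in graph.items():
            key = frozenset(edge)
            combined[key] = max(combined.get(key, 0.0), weight)
    return combined


def intersection_graph(first, *others):
    """Intersection: keep only edges present in every graph, smallest weight."""
    keyed = [{frozenset(e): w for e, w in g.items()} for g in (first, *others)]
    common = set(keyed[0]).intersection(*keyed[1:])
    return {edge: min(g[edge] for g in keyed) for edge in common}


def category_scores(new_docs, training_sets, graph):
    """Score each new document against each category, then normalize per row.

    training_sets maps category name -> list of training document ids.
    The score is the mean similarity to the category's training documents;
    each document's scores are then scaled to sum to 1 across categories.
    """
    scores = {}
    for doc in new_docs:
        row = {}
        for category, training_docs in training_sets.items():
            total = sum(graph.get(frozenset((doc, t)), 0.0) for t in training_docs)
            row[category] = total / len(training_docs) if training_docs else 0.0
        row_sum = sum(row.values())
        scores[doc] = {c: (s / row_sum if row_sum else 0.0) for c, s in row.items()}
    return scores


def categorize(scores):
    """Assign each document to its highest-scoring category."""
    return {doc: max(row, key=row.get) for doc, row in scores.items()}
```

With the row normalization, a document's score in one category depends on its scores in all other categories, which is the dependence claim 24 describes; normalizing over documents within a category instead would correspond to claim 25.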
Patents cited by this patent (23)
Hekmatpour Amir, Adaptive hypermedia presentation method and system.
Pitkow James E. ; Pirolli Peter L., Method and apparatus for finding related documents in a collection of linked documents using a bibliographic coupling link analysis.
Huberman Bernardo A. ; Pitkow James E. ; Pirolli Peter L., Method and apparatus for predicting document access in a collection of linked documents featuring link probabilities and spreading activation.
Douglass R. Judd ; Paul Gauthier ; J. Eric Baldeschwieler, Method and apparatus for retrieving documents based on information other than document content.
Bharat Krishna Asur ; Henzinger Monika R., Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis.
Bengio Yoshua,CAX ; Bottou Leon ; LeCun Yann Andre, Module for constructing trainable modular network in which each module inputs and outputs data structured as a graph.
Davies Nicholas John,GBX ; Weeks Richard,GBX, Software agent for comparing locally accessible keywords with meta-information and having pointers associated with dist.
Pirolli Peter L. ; Pitkow James E. ; Rao Ramana B., System for predicting documents relevant to focus documents by spreading activation through network representations of a.
Kucharewski, Valerie; Traylor, Michael; Buonomo, Michael Anthony; Panzer, John; Mazzeo, Jack, Central people lists accessible by multiple applications.
Kimura,Yasuhiro; Koba,Yuichi; Yoshii,Kenichiro; Shono,Atsushi; Sato,Hideaki; Seki,Toshibumi, Data transfer scheme using re-direct response message for reducing network load.
Appelman, Barry; Buonviri, Terry Christian; Buonviri, Joseph Paul; Erickson, Andrew Ivar; Jarmolowski, Thomas; Weltman, Robert Eugene, Dynamic identification of other users to an online user.
Odell, James A.; Bergstrom, Raine; Appelman, Barry; Wick, Andrew L.; Keister, Alan; Yin, Xiaoyan; McNally, Barbara; Hullfish, Keith C., Enhanced buddy list using mobile device identifiers.
Matsubayashi,Tadataka; Sugaya,Natsuko; Iijima,Michio; Ogawa,Yuichi; Watanabe,Yuuki; Yamamoto,Shinya; Sudou,Tsuyoshi, Method and apparatus for calculating similarity among documents.
Gaudet, Teresa Ruth; Gaudet, Gordon James; Pollreis, Gary Kenneth; Natale, Sandra Joyce, Method and process for performing category-based analysis, evaluation, and prescriptive practice creation upon stenographically written and voice-written text files.
Zhang, Benyu; Zeng, Hua Jun; Ma, Wei Ying; Chen, Zheng; Liu, Ning; Yan, Jun, Method and system for determining similarity of items based on similarity objects and their features.
Scian, Anthony F.; Yach, David P.; Zinn, R. Scotte; Klassen, Gerhard D., Method, system and computer software product for pre-selecting a folder for a message.
Gaussier, Eric; Renders, Jean Michel; Dejean, Herve; Goutte, Cyril; Matveeva, Irina, Methods and apparatuses for identifying bilingual lexicons in comparable corpora using geometric processing.
Heikes, Brian D.; Krantz, Kristine Amber; Matthews, Kelly Monroe; Medeiros, Russell Scott; Ramanathan, Venkatesh; Robinson, Richard W.; Roman, Perry E.; Sears, Edward L.; Wick, Andrew L.; Yurow, Deborah R., Methods and systems for capturing and managing instant messages.
Heikes, Brian Dean; Krantz, Kristine Amber; Mathews, Kelly Monroe; Medeiros, Russell Scott; Ramanathan, Venkatesh; Robinson, Jr., Richard W.; Roman, Perry E. Miranda; Sears, Edward L.; Wick, Andrew L.; Yurow, Deborah Ruth, Methods for capturing electronic messages based on capture rules relating to user actions regarding received electronic messages.
Heikes, Brian Dean; Krantz, Kristine Amber; Mathews, Kelly Monroe; Medeiros, Russell Scott; Ramanathan, Venkatesh; Robinson, Jr., Richard W.; Roman, Perry E. Miranda; Sears, Edward L.; Wick, Andrew L.; Yurow, Deborah Ruth, Methods for controlling display of electronic messages captured based on community rankings.
Dom, Byron Edward; Popescul, Alexandrin; Zhang, Tong, System and method for determining web page quality using collective inference based on local and global information.
Hao, Ming C.; Dayal, Umeshwar; Hsu, Meichun; Holenstein, Thomas; Gross, Markus, System and method for visualization of objects using energy minimization of customized potential functions.
Parker, Charles T.; Lyons, Catherine M.; Roston, Gerald P.; Garrity, George M., Systems and methods for automatically identifying and linking names in digital resources.
Rao, Venkatesh Guru; Silverstein, Jesse; Reid, James Walter; Vandervort, David Russell, Trail-based data content discovery, organization, and processing.