IPC Classification Information
Country / Type |
United States (US) Patent
Granted
|
International Patent Classification (IPC 7th edition) |
|
Application number |
US-0910169
(2001-07-20)
|
Inventor / Address |
- Decary, Michel
- Stern, Jonathan
- Karadimitriou, Kosmas
- Rothman Shore, Jeremy W.
|
Applicant / Address |
|
Agent / Address |
Hamilton, Brook, Smith &
|
Citation information |
Times cited: 82
Patents cited: 44 |
Abstract
Computer method and apparatus for extracting information from a Web page is disclosed. The invention apparatus is formed of an extractor coupled to receive Web pages from a source. The extractor uses natural language processing to extract desired information from the Web page. A storage subsystem receives from the extractor the extracted desired information and stores the extracted desired information in a database. The invention method for extracting data from a Web page includes the computer implemented steps of (i) using natural language processing, finding possible formal names on a given Web page, (ii) using pattern matching, searching the given Web page for formal names not found by the natural language processing, and (iii) refining a combined set of the found formal names to produce a working set of people and organization names extracted from the given Web page. The refining includes determining aliases of respective people and organization names, so as to effectively reduce duplicate names.
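The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the honorific rule standing in for natural language processing, the "Name, Title" line pattern, the alias rule, and all function and variable names are assumptions made for illustration, and the sketch handles people names only.

```python
import re

# Step (i) stand-in: a real system would use an NLP parser. Here a name is
# any capitalized word run that follows an honorific (hypothetical rule).
HONORIFIC_NAME = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\. ([A-Z][a-z]+(?: [A-Z][a-z]+)+)")

# Step (ii): pattern for a recurring "Name, Title" line layout (hypothetical).
LIST_LINE_NAME = re.compile(r"^([A-Z][a-z]*\.?(?: [A-Z][a-z]+)+), ", re.MULTILINE)


def is_alias(candidate, kept):
    """Crude alias rule: same last word, and first word is a prefix or initial."""
    c, k = candidate.split(), kept.split()
    return c[-1] == k[-1] and k[0].startswith(c[0].rstrip("."))


def extract_names(page_text):
    # (i) first found set, from the NLP stand-in
    first_set = set(HONORIFIC_NAME.findall(page_text))
    # (ii) second set: pattern matches not already found in step (i)
    second_set = set(LIST_LINE_NAME.findall(page_text)) - first_set
    # (iii) refine the combined set: keep the longest form of each name
    working = []
    for name in sorted(first_set | second_set, key=len, reverse=True):
        if not any(is_alias(name, kept) for kept in working):
            working.append(name)
    return sorted(working)


page = (
    "Dr. Jane Doe joined the company in 1999.\n"
    "John Smith, Chief Executive Officer\n"
    "J. Smith, CEO\n"
)
print(extract_names(page))  # ['Jane Doe', 'John Smith']
```

Processing the longest names first means a short variant such as "J. Smith" is compared against the full form already kept, so the duplicate is dropped rather than the full name.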
Representative claims
What is claimed is:
1. A method for extracting data from a Web page comprising the computer-implemented steps of: using natural language processing, finding possible formal names on a given Web page, the step of finding producing a first found set of formal names; searching the given Web page for formal names not found by the natural language processing step of finding, said searching using pattern matching techniques and producing a second set of formal names; and refining a combined set of formal names formed of the first found set and the second set, said refining producing a working set of people and organization names extracted from the given Web page.
2. A method as claimed in claim 1 wherein the step of refining includes rejecting predefined formal names as not being people names of interest.
3. A method as claimed in claim 1 wherein the step of refining includes determining aliases of respective people and organization names in the combined set, so as to reduce effective duplicate names.
4. A method as claimed in claim 1 wherein the step of finding further finds professional titles and determines organization for which a person named on the given Web page holds that title.
5. A method as claimed in claim 4 wherein the step of finding includes employing rules to extract at least title and formal names.
6. A method as claimed in claim 1 wherein the step of finding further includes determining educational background of a person named on the given Web page, the educational background including at least one of name of institution, degree earned from the institution and date of graduation from the institution.
7. A method as claimed in claim 1 wherein the step of finding further includes determining biographical information relating to a person named on the given Web page.
8. A method as claimed in claim 7 wherein the step of determining biographical information includes determining current and previous employment history of the named person.
9. A method as claimed in claim 1 further comprising the steps of: determining type of the given Web page; and from the determined type, defining contents of different portions of the Web page, such that the steps of finding and searching are performed as a function of the defined contents.
10. A method as claimed in claim 9 wherein the step of determining type of the given Web page includes determining structure or arrangement of contents of the Web page.
11. A method as claimed in claim 10 further comprising the step of using the determined type, deducing additional information regarding a named person or organization on the given Web page, the additional information supplementing information found on another Web page of a same Web site as the given Web page.
12. A method as claimed in claim 1 wherein the step of finding further includes determining at least one of addresses, telephone number, and email address relating to a person or organization named on the given Web page.
13. A method for extracting information from a Web page document comprising the computer implemented steps of: performing a lexical analysis on a given Web page document to identify elements of interest, the elements of interest producing formal names; detecting a regular recurrence of a certain type of element throughout the given Web page document, the detecting producing additional formal names; resolving aliases of the produced formal names and additional formal names to form a working set of names of people and/or organizations named in the given Web page document.
14. A method as claimed in claim 13, further comprising the step of transforming the given Web page document into a standardized form, the step of transforming including identifying page structure of the Web page document.
15. A method as claimed in claim 13, further comprising the step of assigning a type to each line in the given Web page document, the step of assigning a type indicating purpose of each line in the given Web page document.
16. A method as claimed in claim 15 wherein the step of performing a lexical analysis further identifies elements of interest on lines of certain assigned types.
17. A method as claimed in claim 15 wherein the step of detecting includes using pattern matching, detecting a regular recurrence of a certain type of line, to produce additional formal names.
18. A method as claimed in claim 13 wherein the step of performing a lexical analysis includes syntactically and grammatically identifying elements of interest.
19. A method as claimed in claim 18 wherein the step of identifying elements of interest identifies noun phrases that correspond to a person or organization named in the given Web page document.
20. A method as claimed in claim 18 wherein the step of performing a lexical analysis includes using natural language processing.
21. A method as claimed in claim 18 wherein the step of performing a lexical analysis includes utilizing rules describing composition of a name.
22. A method as claimed in claim 13 wherein the step of resolving aliases includes employing rules for determining variant versions of a person's name or an organization's name.
23. A method as claimed in claim 13 wherein the step of aliasing includes rejecting names containing predefined forms of common known phrases.
24. A method as claimed in claim 13 further comprising the steps of: grouping subsets of lines together to form respective text units; and extracting from the formed text units desired information relating to the people or organizations named in the given Web page document wherein the step of grouping identifies boundaries where information about a person or organization is to be found.
25. A method as claimed in claim 24 wherein the step of grouping recognizes elements of information that span across more than one line.
26. A method as claimed in claim 24 wherein the step of extracting includes: determining type of Web page document; and from the determined type, defining contents of different portions of the Web page document such that extraction is performed as a function of the defined contents.
27. A method as claimed in claim 26 wherein the step of determining type of Web page document includes determining structure and organization of contents of the document.
28. A method as claimed in claim 26 wherein the step of extracting includes determining whether the given Web page document is a press release, and if so, identifying organization mentioned in the press release.
29. A method as claimed in claim 24 wherein the step of extracting includes using a parser to recognize the relationship between elements of information.
30. A method as claimed in claim 29 wherein the step of extracting further includes utilizing predefined semantic frames for determining (i) sentences that express a relationship between a person and organization named in the given Web page document and (ii) sentences that express that a person has a certain level of education.
31. A method as claimed in claim 24 wherein the step of extracting includes associating a person or organization with an element of information if said element appears in a non-sentence within a formed text unit for that person or organization.
32. A method as claimed in claim 24 wherein the step of extracting further divides a line that contains multiple names.
33. A method as claimed in claim 24 wherein the step of extracting is rules based.
34. A method as claimed in claim 13 further comprising the step of post-processing to extract further names of organizations and relationships to people named in the given Web page document.
35. A method as claimed in claim 34 wherein the step of post-processing includes: extracting organization names from professional titles held by a named person; associating a named person with an organization whose Web site is hosting the given Web page document; and deducing organization names from biographical text of a named person.
36. Computer apparatus for extracting information from a Web page comprising: a source of Web pages of interest; an extractor coupled to receive Web pages from the source, the extractor being computer implemented and using natural language processing to extract desired information from the Web pages; and a storage subsystem coupled to the extractor for storing the extracted desired information in a data store; wherein the extractor extracts desired information from a given Web page by: using natural language processing, finding possible formal names on a given Web page, the step of finding producing a first found set of formal names; using pattern matching, searching the given Web page for formal names not found by the natural language processing step of finding, said searching producing a second set of formal names; and refining a combined set of formal names formed of the first found set and the second set, said refining producing a working set of people and organization names extracted from the given Web page.
37. Computer apparatus as claimed in claim 36 wherein the extractor further determines aliases of respective people and organization names in the combined set so as to reduce effectively duplicate names.
38. Computer apparatus as claimed in claim 36 wherein the extractor further finds professional titles and determines organization for which a person named on the given Web page holds that title.
39. Computer apparatus as claimed in claim 36 wherein the extractor further determines educational background of a person including at least one of name of institution, degree earned from the institution and date of graduation from the institution.
40. Computer apparatus as claimed in claim 36 wherein the extractor further determines employment history of a person named on the given Web page.
41. Computer apparatus as claimed in claim 36 wherein the extractor is rules based.
42. Computer apparatus as claimed in claim 36 wherein the extractor further determines type of the given Web page, and from the determined type defines contents of different portions of the Web page, such that extraction of desired information is performed as a function of the defined contents.
43. Computer apparatus as claimed in claim 42 wherein the extractor further using the determined type, deduces additional information regarding a named person on the given Web page, the additional information supplementing information found on another Web page of the same Web site as the given Web page.
44. Computer apparatus as claimed in claim 36 wherein the extracted desired information includes names of people or organizations named on the given Web page, addresses, telephone numbers and email addresses relating to the named person or organization.
45. Computer apparatus as claimed in claim 36 wherein the storage subsystem is formed of a loader responsive to the extracted desired information, the loader post-processing the extracted desired information to refine the extracted desired information for storage in the data store.
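Claims 15 and 24 describe assigning a type to each line and grouping lines into text units, one per person or organization. The following Python sketch shows one way such a step might look. The line types, the title-word list, the regular expressions, and the function names are all hypothetical; the claims do not specify them.

```python
import re

# Hypothetical rules; the claims do not define the actual line types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\d{3}[-. ]\d{3}[-. ]\d{4}")
NAME = re.compile(r"^[A-Z][a-z]+(?: [A-Z]\.)?(?: [A-Z][a-z]+)+$")
TITLE_WORDS = {"Chief", "President", "Officer", "Director", "Manager"}


def line_type(line):
    """Assign a type indicating the purpose of one line."""
    s = line.strip()
    if not s:
        return "blank"
    if EMAIL.search(s) or PHONE.search(s):
        return "contact"
    # Titles are checked before names because "Chief Executive Officer"
    # would otherwise look like a capitalized personal name.
    if set(s.replace(",", " ").split()) & TITLE_WORDS:
        return "title"
    if NAME.match(s):
        return "name"
    return "text"


def group_text_units(lines):
    """Group typed lines into text units; each name line opens a new unit."""
    units = []
    for line in lines:
        kind = line_type(line)
        if kind == "blank":
            continue
        if kind == "name" or not units:
            units.append([])
        units[-1].append((kind, line.strip()))
    return units


lines = [
    "Jane Doe",
    "Vice President of Sales",
    "jane@example.com",
    "",
    "John Smith",
    "Chief Executive Officer",
]
for unit in group_text_units(lines):
    print(unit)
```

Treating each name line as a unit boundary keeps the title and contact lines attached to the person they follow, which is the kind of boundary detection claim 24 refers to.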