Abstract
Named entities (NEs) refer to a range of concepts such as the names of people, locations, organizations, and products. As large quantities of new (or emerging) named entities appear every day in newspapers, on web sites, and in TV programs, NE analysis is becoming increasingly important in the data mining and information retrieval communities. Information on NEs can be extracted from (a) structured sources such as databases and tables, (b) semi-structured sources such as knowledge bases (also called ontologies), or (c) unstructured sources such as text corpora. Among the many research topics related to NE analysis, such as ontology integration, named entity linking, and named entity translation, this dissertation addresses the problem of mining NE translations from comparable corpora, specifically English-Chinese NE translations. I observe that existing approaches use one or more of the following NE similarity metrics: entity name similarity, entity context similarity, and entity relationship similarity. Motivated by this observation, this dissertation proposes a new holistic approach that (1) combines all of these similarity types and (2) additionally considers a new similarity measure, relationship context similarity between pairs of NEs, which is the missing quadrant in the taxonomy of similarity metrics. I abstract the NE translation problem as the matching of two NE graphs extracted from the comparable corpora. Specifically, two monolingual NE graphs are first constructed from the comparable corpora to capture the relationships between NEs. Entity name similarity and entity context similarity are then calculated for every pair of bilingual NEs to obtain the initial pairwise NE similarity. A reinforcement method is then applied to reflect relationship similarity and relationship context similarity between NEs.
I also discover corpus “latent” features that are lost in the graph extraction process and integrate them into the proposed framework, and I improve the relationship-based similarities by overcoming the asymmetry of comparable corpora and by considering other types of NEs. According to the experimental results, the proposed holistic graph-based approach and its enhancements are highly effective, and the proposed framework significantly outperforms previous state-of-the-art approaches.
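The graph-matching idea described above can be illustrated with a minimal sketch. It is not the dissertation's actual formulation: the update rule (a damped average over neighbor pairs), the `damping` parameter, the function names, and the toy entities and scores are all assumptions chosen for illustration. The initial scores stand in for the combined entity name and entity context similarity, and the iteration stands in for the reinforcement step that propagates relationship evidence.

```python
# Illustrative sketch only: the update rule and all values are assumptions,
# not the method or data from the dissertation.

def reinforce(src_graph, tgt_graph, init, damping=0.5, iterations=20):
    """Propagate similarity between two monolingual NE graphs.

    src_graph / tgt_graph: dict mapping each entity to its set of neighbors.
    init: dict mapping (source entity, target entity) to an initial score.
    """
    scores = dict(init)
    for _ in range(iterations):
        updated = {}
        for (s, t), base in init.items():
            s_nb = src_graph.get(s, set())
            t_nb = tgt_graph.get(t, set())
            if s_nb and t_nb:
                # Average current score over all pairs of neighbors.
                total = sum(scores[(u, v)] for u in s_nb for v in t_nb)
                neighbor_term = total / (len(s_nb) * len(t_nb))
            else:
                neighbor_term = 0.0
            # Blend relationship evidence with the initial similarity.
            updated[(s, t)] = damping * neighbor_term + (1 - damping) * base
        scores = updated
    return scores


def best_translation(scores, source_entity):
    """Return the target entity with the highest score for source_entity."""
    candidates = {t: v for (s, t), v in scores.items() if s == source_entity}
    return max(candidates, key=candidates.get)


# Hypothetical toy graphs: e1 and e2 co-occur, as do c1 and c2.
src_graph = {"e1": {"e2"}, "e2": {"e1"}, "e3": set()}
tgt_graph = {"c1": {"c2"}, "c2": {"c1"}, "c3": set()}

# Hypothetical initial scores: e1 is equally close to c1 and c3.
init = {
    ("e1", "c1"): 0.5, ("e1", "c2"): 0.1, ("e1", "c3"): 0.5,
    ("e2", "c1"): 0.1, ("e2", "c2"): 0.9, ("e2", "c3"): 0.1,
    ("e3", "c1"): 0.1, ("e3", "c2"): 0.1, ("e3", "c3"): 0.1,
}

scores = reinforce(src_graph, tgt_graph, init)
print(best_translation(scores, "e1"))
```

In this toy setup, the initial scores alone cannot separate the two candidates for e1. The strong e2-c2 pair, which is a neighbor pair of e1-c1, raises that candidate above the isolated one, which is the kind of disambiguation that relationship-based similarity is meant to provide.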
Keywords
data analysis; named entity translation; graph methodology