[Paper] A Collaborative Framework for Discovering the Organizational Structure of Social Networks Using NER Based on NLP

Frank Elijorde; Hyun-Ho Yang; Jae-Wan Lee

doi:10.7472/jksii.2012.13.2.99


Journal of Korean Society for Internet Information, v.13 no.2, 2012, pp.99-108

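Frank Elijorde (School of Electronic and Information Engineering, Kunsan National University), Hyun-Ho Yang (Department of Information and Communication Engineering, Kunsan National University), Jae-Wan Lee (Department of Information and Communication Engineering, Kunsan National University)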

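Abstract (translated from Korean)

Many methods have been developed to improve the accuracy of information extraction from vast amounts of data. In this paper, text is analyzed by integrating several natural language processing tasks such as NER (named entity recognition), sentence extraction, and part-of-speech tagging. The data consist of text collected from the web using a domain-specific data extraction agent, and a framework was developed that uses these natural language processing tasks to extract information from unstructured data. The performance of the work was analyzed through simulation in terms of text extraction and analysis for discovering organizational structure; the simulation results showed better information extraction performance than other NER classifiers such as MUC and CoNLL.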

Abstract

Many methods have been developed to improve the accuracy of extracting information from vast amounts of data. This paper combines a number of natural language processing methods, such as NER (named entity recognition), sentence extraction, and part-of-speech tagging, to carry out text analysis. The data source consists of texts obtained from the web using a domain-specific data extraction agent. A framework for extracting information from unstructured data was developed using the aforementioned natural language processing methods. We simulated the performance of our work in the extraction and analysis of texts for the detection of organizational structures. Simulation shows that our approach outperforms other NER classifiers, such as those trained for MUC and CoNLL, on information extraction.

AI Summary of the Main Text

* The sentences below were identified automatically by AI and some may not be suitable; please use them with care.

Problem Definition

Our results suggest that combining the capabilities of the aforementioned processes would greatly improve the accuracy and reliability of text analysis in knowledge discovery. In this paper, we focus on discovering organizational structures from a given corpus of text.
The information extraction agent is assigned to extract text from a given domain. In this work, we are interested in extracting documents that might contain information leading to the discovery of possible criminal organizations. Pages with relevant content are crawled and their text is extracted.
The significant contribution of this work is improved accuracy of information extraction, achieved by applying natural language processing techniques to the data. As shown by the results, the accuracy of text analysis is remarkably enhanced.

Proposed Method

The actual count of entities was set to 41, which serves as the basis for evaluating the actual performance of the system. For the purpose of evaluation, the text was fed into NER using three different classifier models included with Stanford NER. Performance is then measured in terms of Precision, Recall, and F-measure, as shown by the following equations:
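The equations themselves do not survive in this excerpt. As a hedged reconstruction, the standard definitions are given below. The symbols TP, FP, and FN are assumptions introduced here: TP is the number of entities retrieved with the correct label, FP the number retrieved but mislabeled, and FN the number of actual entities not retrieved with the correct label, so that TP + FN equals the actual entity count (41 in this evaluation).

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```

The F-measure shown is the balanced (F1) form; whether the paper weights precision and recall differently cannot be determined from this excerpt.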
The authors of [13] proposed a method of social tension detection and intention recognition based on natural language analysis of social networks, forums, blogs, and news comments. The approach combines natural language syntax and semantics analysis with statistical processing to identify possible indicators of social tension. The work in [14] adopted network analysis tools to carry out a quantitative analysis of a terrorist social network.
The proposed architecture follows a layered approach, as shown in Figure 1. The main components of the architecture are distributed among three layers: the Information Extraction Layer, the Data Processing Layer, and the Data Presentation Layer.
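To make the three-layer split concrete, here is a minimal sketch of how such layers could be composed. It is not the paper's implementation: the `Pipeline` class, the stub functions, the in-memory corpus, and the capitalized-token heuristic standing in for NER are all hypothetical, chosen only so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Entity = Tuple[str, str]  # (surface text, label); hypothetical representation


@dataclass
class Pipeline:
    """Chains three layers; every interface here is an assumption."""
    extract: Callable[[str], Iterable[str]]            # Information Extraction Layer
    process: Callable[[Iterable[str]], List[Entity]]   # Data Processing Layer
    present: Callable[[List[Entity]], str]             # Data Presentation Layer

    def run(self, domain: str) -> str:
        return self.present(self.process(self.extract(domain)))


def stub_extract(domain: str) -> List[str]:
    # Stand-in for the web extraction agent: looks up an in-memory corpus.
    corpus = {"demo": ["Alice met Bob in Paris."]}
    return corpus.get(domain, [])


def stub_process(texts: Iterable[str]) -> List[Entity]:
    # Stand-in for NER: treats capitalized tokens as unlabeled entities.
    entities = []
    for text in texts:
        for token in text.rstrip(".").split():
            if token[0].isupper():
                entities.append((token, "UNLABELED"))
    return entities


def stub_present(entities: List[Entity]) -> str:
    # Stand-in for visualization: one tab-separated entity per line.
    return "\n".join(f"{name}\t{label}" for name, label in entities)


pipeline = Pipeline(stub_extract, stub_process, stub_present)
print(pipeline.run("demo"))
```

Keeping each layer behind a plain callable means a real crawler, NER model, or graph renderer could replace a stub without touching the other two layers.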

Target Data

The three NER classifiers were tested and their performance was measured. The first model used is a seven-class model trained for MUC. It has seven classes: Time, Location, Organization, Person, Money, Percent, and Date.
In this case, it retrieved 21 entities, 5 of which were mislabeled. The second model used is a four-class model trained for CoNLL. It has four classes: Location, Person, Organization, and Misc.
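As a worked illustration of the measures named above, the sketch below applies the standard precision, recall, and F-measure definitions to the counts quoted in the text (41 actual entities, 21 retrieved, 5 mislabeled). It assumes these counts describe the seven-class MUC model and that every mislabeled entity counts as an error. The function name and parameters are hypothetical, and the figures reported in the paper itself may differ.

```python
def precision_recall_f1(retrieved: int, mislabeled: int, actual: int):
    """Return (precision, recall, F1) from raw entity counts."""
    correct = retrieved - mislabeled  # entities retrieved with the right label
    precision = correct / retrieved if retrieved else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the text; their attribution to the MUC model is an assumption.
p, r, f = precision_recall_f1(retrieved=21, mislabeled=5, actual=41)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# prints: precision=0.762 recall=0.390 f1=0.516
```

Under these assumptions the model is fairly precise about what it retrieves but finds well under half of the 41 entities, which is why the F-measure sits much closer to recall than to precision.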

Theory/Model

Part-of-speech tagging (PoST) is then used to mark words with their corresponding tags from the Penn Treebank tag set. The PoST process was implemented using the Stanford Log-linear Part-of-Speech Tagger [23], a tagger written in Java that reads text and assigns a part of speech to each word and other token.
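To show what Penn Treebank tags look like in practice, the sketch below parses a line of tagged text and pulls out the proper nouns (tags NNP and NNPS), which are the tokens most likely to name people or organizations. The sample line is hand-written for illustration rather than produced by running the tagger, the underscore-separated `word_TAG` layout is an assumption about the tagger's output configuration, and `parse_tagged` is a hypothetical helper.

```python
from typing import List, Tuple


def parse_tagged(line: str, sep: str = "_") -> List[Tuple[str, str]]:
    """Split 'word_TAG' items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the last separator, so words that
        # themselves contain the separator keep their full text.
        word, _, tag = item.rpartition(sep)
        pairs.append((word, tag))
    return pairs


# Hand-written sample using Penn Treebank tags (not real tagger output).
sample = "John_NNP Smith_NNP leads_VBZ the_DT group_NN ._."
pairs = parse_tagged(sample)

proper_nouns = [word for word, tag in pairs if tag in {"NNP", "NNPS"}]
print(proper_nouns)
# prints: ['John', 'Smith']
```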
The visualization of the network structure was implemented using JGraph. It is a powerful, lightweight, feature-rich, and thoroughly documented open-source graph component available for Java [24].
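Before a graph component can draw a network, the extracted entities have to be turned into nodes and edges. This excerpt does not say how the paper derives edges, so the sketch below uses sentence-level co-occurrence purely as an assumed stand-in: two entities are linked whenever they appear in the same sentence, and the edge weight counts how often that happens. The function name and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def build_cooccurrence_graph(
    sentences_entities: Iterable[List[str]],
) -> Dict[Tuple[str, str], int]:
    """Map each unordered entity pair to its co-occurrence count."""
    weights = defaultdict(int)
    for entities in sentences_entities:
        # Sort and deduplicate so (a, b) and (b, a) share one key and
        # repeated mentions within a sentence count once.
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical per-sentence entity lists.
graph = build_cooccurrence_graph([
    ["Alice", "Bob"],
    ["Bob", "Alice", "Carol"],
    ["Carol"],
])
for (a, b), weight in sorted(graph.items()):
    print(f"{a} -- {b} (weight {weight})")
```

The resulting weighted edge list is the kind of structure a graph library could render, with heavier edges suggesting closer ties between the named entities.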

References (24)

[1] H. Lauw, E. Lim, T. Tan, and H. Pang: Mining Social Network from Spatio-Temporal Events, Proceedings of SIAM Data Mining Conference (2005)
[2] J.J. Xu and H. Chen: Crimenet Explorer: A Framework for Criminal Network Knowledge Discovery, ACM Transactions on Information Systems, pp. 201-226 (2005)
[3] J. Diesner and K.M. Carley: Using Network Text Analysis to Detect the Organizational Structure of Covert Networks, Proceedings of the North American Association for Computational Social and Organizational Science (NAACSOS) Conference, Pittsburgh, PA (2004)
[4] Named Entity Recognition, http://en.wikipedia.org/wiki/Named_entity_recognition
[5] L. Zhang, Y. Pan, and T. Zhang: Focused Named Entity Recognition using Machine Learning, SIGIR'04 (2004)
[6] J. Zhu, A. L. Goncalves, and V. Uren: Adaptive Named Entity Recognition for Social Network Analysis and Domain Ontology Maintenance, Tech Report kmi-04-30 (2005)
[7] W. Murnane: Improving Accuracy of Named Entity Recognition on Social Media Data, Thesis, Graduate School, University of Maryland (2010)
[8] K. Knight and D. Marcu: Summarization beyond sentence extraction: A probabilistic approach to sentence compression, Artificial Intelligence, Volume 139, Issue 1, pp. 91-107 (2002)
[9] Part of Speech Tagging, http://en.wikipedia.org/wiki/Part-of-speech_tagging
[10] D. Rusu, L. Dali, B. Fortuna, M. Grobelnik, and D. Mladenić: Triplet Extraction from Sentences, Proceedings of the 10th International Multiconference "Information Society--IS 2007", Vol. A, pp. 218-222 (2007)
[11] L. Dali and B. Fortuna: Triplet extraction from sentences using svm, SiKDD (2008)
[12] Karmakar and Z. Ying: Mining collaboration through textual semantic interpretation, Intelligent Systems (HIS), 2011 11th International Conference on, pp. 728-733, 5-8 Dec. 2011
[13] O. Vybornova, I. Smirnov, I. Sochenkov, A. Kiselyov, I. Tikhomirov, N. Chudova, Y. Kuznetsova, and G. Osipov: Social Tension Detection and Intention Recognition Using Natural Language Semantic Analysis: On the Material of Russian-Speaking Social Networks and Web Forums, and Security Informatics Conference (EISIC), 2011 European, pp. 277-281, 12-14 Sept. 2011
[14] Sun Duo-Yong, Guo Shu-Quan, Zhang Hai, and Li Ben-Xian: Study on covert networks of terroristic organizations based on text analysis, Intelligence and Security Informatics (ISI), 2011 IEEE International Conference on, pp. 373-378, 10-12 July 2011
[15] Automap by CASOS, http://www.casos.cs.cmu.edu/projects/automap/
[16] ORA by CASOS, http://www.casos.cs.cmu.edu/projects/ora/
[17] H. Cunningham: Information Extraction-A User Guide, Research memo CS-97-02 (1997)
[18] D. Nadeau and S. Sekine: A survey of named entity recognition and classification, Lingvisticae Investigationes, Volume 30, Issue 1, pp. 3-26 (2007)
[19] D. Nadeau and S. Sekine: A survey of named entity recognition and classification, Lingvisticae Investigationes, Volume 30, Issue 1, pp. 3-26 (2007)
[20] Doing Named Entity Recognition? Don't optimize for F1, http://nlpers.blogspot.com/2006/08/doing-namedentity-recognition-dont.html
[21] Aperture Framework, http://aperture.sourceforge.net/
[22] Stanford Named Entity Recognizer, http://nlp.stanford.edu/software/CRF-NER.shtml
[23] Stanford Log-linear Part-Of-Speech Tagger, http://nlp.stanford.edu/software/tagger.shtml
[24] JGraph, http://sourceforge.net/projects/jgraph
