[Paper] A Study on Research Trends of Graph-Based Text Representations for Text Mining

장재영

doi:10.7236/jiibc.2013.13.5.37

A Study on Research Trends of Graph-Based Text Representations for Text Mining

The Journal of the Institute of Internet, Broadcasting and Communication, Vol. 13, No. 5, 2013, pp. 37-47

Abstract

Text mining is the field of analyzing unstructured text to extract the high-level information hidden in it, such as patterns, trends, and distributions. Because text mining assumes unstructured data, the text must first be represented in a simplified model before it can be analyzed. The most widely used model to date is the vector space model (VSM), which represents a text as a simple bag of words. Recently, however, many studies have adopted graph-based text representation models in order to capture the semantic relationships between words as well. This paper surveys existing text mining research and describes the methods used in graph-based text representation models and their characteristics. It also discusses future directions for graph-based text mining.
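To make the contrast between the two representations concrete, the following is a minimal sketch rather than a method taken from the paper. It assumes whitespace tokenization and a graph whose directed edges link consecutive words; the function names `bag_of_words` and `adjacency_edges` are hypothetical.

```python
from collections import Counter


def bag_of_words(tokens):
    """Vector-space style view: word counts only, word order is discarded."""
    return Counter(tokens)


def adjacency_edges(tokens):
    """Graph style view (assumed here): directed edges between consecutive words."""
    return set(zip(tokens, tokens[1:]))


a = "dog bites man".split()
b = "man bites dog".split()

# Both sentences contain the same words with the same counts.
same_bag = bag_of_words(a) == bag_of_words(b)

# The edge sets differ because the word order differs.
same_graph = adjacency_edges(a) == adjacency_edges(b)
```

In this sketch `same_bag` is true while `same_graph` is false: the bag-of-words view cannot tell the two sentences apart, whereas the edge sets keep the relation between neighboring words.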

AI Summary of the Full Text

* The sentences below were identified automatically by AI and may include unsuitable ones; please use them with care.

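Problem Definition

This paper analyzes research trends in graph-based text mining as proposed across a range of studies. It first reviews the baseline vector space model and then classifies the graph-based text models proposed so far according to their characteristics. It also examines representative algorithms and representation models proposed for graph-based text mining and briefly summarizes related research in Korea. Finally, based on their strengths and weaknesses, it discusses the future direction and outlook of graph-based text models.
The paper presents the methods and types of previously proposed graph-based text representation models, beginning by dividing the kinds of nodes and edges in a graph representation according to their characteristics.
A basic advantage of graph-based text representation models is that they can use the various analysis techniques proposed in graph mining. The paper summarizes the graph mining analysis techniques applied in the research to date.
In addition, the paper introduces several techniques for searching subgraphs in graph-based text representation models. Most of these techniques were originally proposed in graph mining research.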

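Hypothesis Setting

The difference, however, is that TextRank basically assumes an undirected graph and also takes graphs with weighted edges into account in order to compute similarity. For a graph with weighted edges, for example, the weight of a node is computed by the following formula.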
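The formula announced above is not reproduced in this excerpt. As a hedged sketch, assuming it is the weighted score published in the TextRank paper (Mihalcea and Tarau, 2004), each node V_i receives WS(V_i) = (1 - d) + d * sum over neighbors V_j of [ w_ji / (sum of all edge weights at V_j) ] * WS(V_j). The code below iterates that update on an undirected weighted graph. The damping factor `d = 0.85`, the tolerance, the iteration cap, and the function name `weighted_textrank` are assumptions made for illustration, and all edge weights are assumed to be positive.

```python
def weighted_textrank(edges, d=0.85, tol=1e-6, max_iter=100):
    """Iterate the weighted TextRank score on an undirected graph.

    edges: iterable of (u, v, weight) triples with positive weights.
    Returns a dict mapping each node to its score.
    """
    # Undirected graph: store every edge in both directions.
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    # Total edge weight at each node (the denominator in the update).
    strength = {n: sum(nb.values()) for n, nb in adj.items()}

    score = {n: 1.0 for n in adj}
    for _ in range(max_iter):
        new = {}
        for i in adj:
            # Each neighbor j passes on a share of its score proportional to w_ji.
            incoming = sum(w / strength[j] * score[j] for j, w in adj[i].items())
            new[i] = (1 - d) + d * incoming
        delta = max(abs(new[n] - score[n]) for n in adj)
        score = new
        if delta < tol:
            break
    return score


# A three-node path a - b - c with unit weights.
scores = weighted_textrank([("a", "b", 1.0), ("b", "c", 1.0)])
```

On this small path graph the middle node `b` ends up with the highest score and the two end nodes receive equal scores, which matches the intuition that a node connected to more of the graph is ranked higher.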

Proposed Method

For such a structure, various forms of graph definition are possible depending on how the node set V and the edge set E vary. The paper classifies them in detail according to how nodes are represented and how edges are represented.
For nodes, the paper distinguishes homogeneous from heterogeneous representations according to the diversity of the objects that a node represents, and weighted from unweighted according to whether nodes carry weights. Since most studies assume weighted nodes, however, the latter distinction is of little practical significance.
Edges are classified as directed or undirected, weighted or unweighted, and labeled or unlabeled. By the content of the graph, the models are further divided into co-occurrence or similarity representation models, syntactic-relation representation models, and semantic-relation representation models.
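The classification above can be sketched as a small data structure. This is an illustrative assumption, not a model defined in the paper: the class `TextGraph`, the enum `Content`, their field names, and the helper `cooccurrence_graph` are all hypothetical, and the example builds only the simplest case, an undirected, edge-weighted co-occurrence graph over words with term frequency as node weight.

```python
from dataclasses import dataclass, field
from enum import Enum


class Content(Enum):
    """What an edge stands for, following the three content classes in the text."""
    COOCCURRENCE_OR_SIMILARITY = "co-occurrence or similarity"
    SYNTACTIC = "syntactic relation"
    SEMANTIC = "semantic relation"


@dataclass
class TextGraph:
    """Hypothetical container; field names are illustrative assumptions."""
    content: Content
    directed: bool = False        # directed vs. undirected edges
    heterogeneous: bool = False   # one kind of node (e.g. words) vs. mixed kinds
    node_weight: dict = field(default_factory=dict)  # node -> weight (empty = unweighted)
    edge_weight: dict = field(default_factory=dict)  # (u, v) -> weight
    edge_label: dict = field(default_factory=dict)   # (u, v) -> label (empty = unlabeled)

    def key(self, u, v):
        # Undirected graphs store each edge once under a sorted key.
        return (u, v) if self.directed else tuple(sorted((u, v)))

    def add_edge(self, u, v, weight=1.0, label=None):
        k = self.key(u, v)
        self.edge_weight[k] = self.edge_weight.get(k, 0.0) + weight
        if label is not None:
            self.edge_label[k] = label


def cooccurrence_graph(sentences, window=1):
    """Link words that appear within `window` tokens of each other."""
    g = TextGraph(content=Content.COOCCURRENCE_OR_SIMILARITY)
    for tokens in sentences:
        for i, u in enumerate(tokens):
            # Term frequency serves as the node weight in this sketch.
            g.node_weight[u] = g.node_weight.get(u, 0) + 1
            for v in tokens[i + 1 : i + 1 + window]:
                if u != v:
                    g.add_edge(u, v)
    return g


g = cooccurrence_graph([
    ["graph", "based", "text", "mining"],
    ["text", "mining", "graph"],
])
```

Switching `directed`, filling `edge_label`, or mixing node kinds would move the same container to other cells of the classification; syntactic and semantic graphs would additionally need a parser or a knowledge source, which this sketch does not attempt.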

Future Work

For example, a graph model proposed for document classification is difficult to apply to methods for clustering, summarization, or retrieval. Future research therefore needs to develop a systematic graph model for document representation. Once such a model is developed, it is expected to serve as a basis for a wide range of existing document analysis techniques, including document classification, clustering, summarization, and retrieval.

Q&A

Keyword	Question	Answer extracted from the paper
	What are the advantages of graph-based text representation models?	To address this problem, research on graph-based text mining has been active since the 2000s. Graph-based text representation models can represent the characteristics of a document more precisely by using co-occurrence or other relation information among the words (terms), sentences, paragraphs, and concepts present in the text. The expressive power of the document representation therefore increases, which can raise the accuracy of text analysis.
	What are the disadvantages of graph-based text representation models?	Compared with the vector space model, however, they require more computation and consume more resources. These problems are gradually being alleviated by recent rapid advances in hardware.
	What is text mining?	Text mining is a branch of data mining that targets unstructured documents and searches for the high-level knowledge hidden in them through tasks such as document classification, clustering, indexing, retrieval, and summarization. With the arrival of the big data era, interest in techniques for analyzing large volumes of text has grown, and the importance of text mining as a core technology in this area is being emphasized further.

