[Paper] Document recommendation method using large citation data

Minwoo Chae; Minsoo Kang; Yongdai Kim

doi:10.7465/jkdi.2013.24.5.999

Documents recommendation using large citation data

Journal of the Korean Data & Information Science Society (한국데이터정보과학회지), v.24 no.5, 2013, pp.999-1011

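Minwoo Chae (Department of Statistics, Seoul National University), Minsoo Kang (Gwanggaeto Research Institute, 광개토연구소), Yongdai Kim (Department of Statistics, Seoul National University)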

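Abstract (translated from the Korean)

This study proposes a method that uses citation information among documents such as papers and patents to recommend patents that are both highly related and important. The key is to tune how strongly two kinds of information are reflected in the Neumann kernel, which combines the co-citation count, a measure of relatedness between documents, with HITS, a measure of importance, in a suitable form. The proposed method minimizes the prediction error for future citations, which allows the tuning parameter ${\gamma}$ of the Neumann kernel to be chosen appropriately. We also discuss in detail the computational techniques needed to analyze large citation data. Finally, we carry out an empirical analysis of 4 million registered US patents.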
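
For orientation, the Neumann kernel named above is usually defined as follows in the kernel literature (for example Shimbo and Ito, 2006, in the reference list). This is a sketch under that assumption; the symbols $A$, $C$ and $\lambda_{\max}$ are introduced here for illustration and are not taken from the paper's text. Let $A$ be the citation matrix with $A_{ij}=1$ when document $i$ cites document $j$, so that $C = A^{\top}A$ counts co-citations.

$$
N_{\gamma}(C) = C\sum_{n=0}^{\infty}(\gamma C)^{n} = C\,(I-\gamma C)^{-1}, \qquad 0 \le \gamma < \frac{1}{\lambda_{\max}(C)}
$$

At ${\gamma}=0$ the kernel reduces to the co-citation matrix $C$ (pure relatedness). As ${\gamma}$ approaches $1/\lambda_{\max}(C)$ the high-order terms dominate and the ranking approaches the one given by the principal eigenvector of $C$, which is the HITS authority vector (pure importance).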

Abstract (English original)

In this research, we propose a document recommendation method that uses citation information to find documents that are both related to a given document and important. The key idea is parameter tuning in the Neumann kernel, which interpolates between a measure of importance (HITS) and a measure of relatedness (co-citation). Our method selects the tuning parameter ${\gamma}$ of the Neumann kernel by minimizing the prediction error for future citations. We also discuss some computational issues that arise when analysing large citation data. Finally, we report the results of analysing patent data from the US Patent Office.
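
The interpolation between co-citation and HITS can be sketched in code. The block below is a minimal illustration, not the authors' implementation: it assumes the common form C (I - gamma C)^{-1} with C = A^T A, stores the citation graph as adjacency lists so the large matrix C is never formed, and approximates one column of the kernel by a truncated series. The function name `neumann_scores`, the `(citing, cited)` input format and the `ratio` parameter (gamma expressed as a fraction of its upper bound) are all hypothetical.

```python
def neumann_scores(citations, n_docs, query, ratio, n_terms=200):
    """Approximate column `query` of C (I - gamma C)^{-1}, where C = A^T A.

    citations: iterable of (citing, cited) index pairs (assumed input format).
    ratio: gamma * lambda_max(C); must lie in [0, 1) for the series to converge.
    """
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")

    cites = [[] for _ in range(n_docs)]     # row i of A: documents cited by i
    cited_by = [[] for _ in range(n_docs)]  # column j of A: documents citing j
    for i, j in citations:
        cites[i].append(j)
        cited_by[j].append(i)

    def apply_c(v):
        # C v = A^T (A v), using only the sparse adjacency lists.
        av = [sum(v[j] for j in row) for row in cites]
        return [sum(av[i] for i in col) for col in cited_by]

    # Power iteration for the largest eigenvalue of C. The estimate approaches
    # lambda_max from below, so keep `ratio` safely away from 1.
    v = [n_docs ** -0.5] * n_docs
    lam = 0.0
    for _ in range(200):
        w = apply_c(v)
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        lam = norm
        v = [x / norm for x in w]

    gamma = ratio / lam if lam > 0 else 0.0

    # Truncated series: C e_q + gamma C^2 e_q + gamma^2 C^3 e_q + ...
    e_q = [0.0] * n_docs
    e_q[query] = 1.0
    term = apply_c(e_q)
    total = list(term)
    for _ in range(n_terms):
        term = [gamma * x for x in apply_c(term)]
        total = [t + x for t, x in zip(total, term)]
    return total


# Toy graph: document 3 cites 0 and 1; document 4 cites 0, 1 and 2.
pairs = [(3, 0), (3, 1), (4, 0), (4, 1), (4, 2)]
relatedness_only = neumann_scores(pairs, 5, query=0, ratio=0.0)
more_importance = neumann_scores(pairs, 5, query=0, ratio=0.5)
```

With `ratio=0.0` the scores are the plain co-citation counts of document 0 with every other document; raising `ratio` gives more weight to documents that are important in the whole graph. How the paper scores predictions of future citations to pick ${\gamma}$ is not shown in this record, so that step is left out of the sketch.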

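Questions and answers

Keyword	Question	Answer extracted from the paper
	In what language is SPARSKIT written?	Golub and Van Loan (2012) is a book on matrix computations that covers in detail the theory and algorithms for large sparse matrices. Among the libraries currently available, the most representative is SPARSKIT (Saad, 1990), written in Fortran. The program used in this study is the Matrix library of the statistical package R.
	Why can it be said that using only citation information within the same technology field does not greatly depart from the purpose of this study?	This is left as future work, and in this study we process only as much data as can easily be handled on an ordinary PC. Moreover, because about 60% of all citations occur within the same IPC1 technology field, citation information yields few documents worth recommending from other technology fields. Therefore, using only citation information within the same technology field does not greatly depart from the purpose of this study.
	What characterizes the LDA model?	Modeling language, and in particular a text in which words are organically connected to one another, is very difficult, so it is much more effective to convert a document into a simple form that is easy to handle statistically and then model that form. For example, the LDA (latent Dirichlet allocation) model proposed by Blei et al. (2003) converts a document into a plain collection of words (bag of words), that is, a vector of word frequencies, and then models the document with a mixture of multinomial distributions.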
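
The bag-of-words conversion mentioned in the LDA answer above can be illustrated with a few lines of code. This is only a sketch of the representation, not of LDA itself; the function name `bag_of_words`, the whitespace tokenization and the fixed vocabulary are assumptions made for the example.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return the word-frequency vector of `text` over a fixed vocabulary.

    Word order is discarded; only counts are kept.
    """
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["patent", "cites", "kernel"]
vector = bag_of_words("Patent cites patent", vocabulary)
```

Each document becomes one such count vector, and a topic model like LDA then works on these vectors instead of on the original text.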

References (27)

Blei, D. M. and Lafferty, J. D. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1, 17-35.
Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual (web) search engine. Computer Networks and ISDN Systems, 30, 107-117.
Cook, D. J. and Holder, L. B. (2006). Mining graph data, John Wiley & Sons, New Jersey.
Garfield, E. and Merton, R. K. (1979). Citation indexing: Its theory and application in science, technology, and humanities, Wiley, New York.
Golub, G. H. and Van Loan, C. F. (2012). Matrix computations, Johns Hopkins University Press, Baltimore.
He, Q., Pei, J., Kifer, D., Mitra, P. and Giles, C. L. (2010). Context-aware citation recommendation. Proceedings of the 19th International Conference on World Wide Web, 421-430.
Hofmann, T. (2004). Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22, 89-115.
Jannach, D., Zanker, M., Felfernig, A. and Friedrich, G. (2010). Recommender systems: An introduction, Cambridge University Press, New York.
Kandola, J., Shawe-Taylor, J. and Cristianini, N. (2003). Learning semantic similarity. In Neural Information Processing Systems, 673-680.
Kessler, M. M. (1963). Bibliographic coupling between scientific papers. American Documentation, 14, 10-25.
Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46, 604-632.
Lam, C. (2010). Hadoop in action, Manning Publications Company, Stamford.
Lehoucq, R. B., Sorensen, D. C. and Yang, C. (1998). ARPACK users’ guide: Solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods, 6, SIAM, Philadelphia.
Li, W. and McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. Proceedings of the 23rd International Conference on Machine Learning, 577-584.
Liben-Nowell, D. and Kleinberg, J. (2007). The link prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58, 1019-1031.
McNee, S. M., Albert, I., Cosley, D., Gopalkrishnan, P., Lam, S. K., Rashid, A. M., Konstan, J. A. and Riedl, J. (2002). On the recommending of citations for research papers. Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work, 116-125.
Page, L. and Brin, S. (1999). The PageRank citation ranking: Bringing order to the web, Stanford InfoLab, California.
Shimbo, M. and Ito, T. (2006). Kernels as link analysis measures, John Wiley & Sons, New Jersey, 283-310.
Saad, Y. (1990). SPARSKIT: A basic toolkit for sparse matrix computations, Research Institute for Advanced Computer Science, NASA Ames Research Center, Moffett Field, CA.
Sanders, J. and Kandrot, E. (2010). CUDA by example: An introduction to general-purpose GPU programming, Addison-Wesley Professional, Boston.
Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24, 265-269.
Strohman, T., Croft, W. and Jensen, D. (2007). Recommending citations for academic papers. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 705-706.
Tang, J. and Zhang, J. (2009). A discriminative approach to topic-based citation recommendation. Advances in Knowledge Discovery and Data Mining, 572-579.
Teh, Y. W., Jordan, M. I., Beal, M. J. and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101, 1566-1581.
Wei, X. and Croft, W. B. (2006). LDA-based document models for ad-hoc retrieval. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 178-185.
White, S. and Smyth, P. (2003). Algorithms for estimating relative importance in networks. Proceedings of the KDD’03, 266-275.
