[Paper] A Survey on Detecting Duplicate Documents in World Wide Web Environment
(Original title: WWW 환경에서 중복문서의 검출 기법에 대한 고찰)

이순행; 이상철; 김상욱; 김학진

데이타베이스 연구 = Database Research, v.25 no.1, 2009, pp. 1-17

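이순행 (Department of Electronics, Computer, and Communication Engineering, Hanyang University), 이상철 (Department of Electronics, Computer, and Communication Engineering, Hanyang University), 김상욱 (Department of Electronics, Computer, and Communication Engineering, Hanyang University), 김학진 (School of Business, Yonsei University)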

Abstract

As the number of documents on the World Wide Web (WWW) grows, detecting duplicate documents is becoming increasingly important. This article surveys previous research on detecting duplicate documents in the WWW environment. First, we introduce methods for determining whether two given documents are duplicates of each other. Second, we discuss methods for efficiently detecting duplicate documents in a large document database. Finally, we suggest directions for future research.
