Database Research (데이타베이스 연구), v.25 no.1, 2009, pp. 1-17
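이순행 (Department of Electronics, Computer, and Communication Engineering, Hanyang University), 이상철 (Department of Electronics, Computer, and Communication Engineering, Hanyang University), 김상욱 (Department of Electronics, Computer, and Communication Engineering, Hanyang University), 김학진 (School of Business, Yonsei University)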
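With the recent growth in the number of web documents, detecting duplicate documents has become increasingly important. This paper surveys existing research on techniques for detecting duplicate documents in the WWW environment. First, we introduce techniques that determine whether two given documents are duplicates. Second, we discuss techniques that efficiently detect duplicate documents in a large document database. Finally, we suggest directions for future research.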
Recently, as the number of documents on the WWW (World Wide Web) has increased, handling duplicate documents has become crucial. In this article, we survey previous research results related to handling duplicate documents in the WWW environment. First, we introduce a variety of methods for determining whethe...
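As an illustration of what determining whether two documents are duplicates can look like in practice, the following is a minimal sketch of one widely used approach: comparing the sets of word shingles of the two documents with the Jaccard coefficient. It is not taken from the article and is not presented as the authors' method. The function names, the shingle length `k=3`, and the `threshold=0.9` cutoff are all assumptions chosen for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word sequences) in text.

    k=3 is an illustrative assumption, not a value from the article.
    """
    words = text.lower().split()
    if len(words) < k:
        # Documents shorter than k words become a single shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(doc_a, doc_b, k=3):
    """Jaccard coefficient of the two documents' shingle sets, in [0, 1]."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    if not a and not b:
        # Two empty documents are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, k=3, threshold=0.9):
    """Pairwise duplicate decision; the 0.9 threshold is a hypothetical cutoff."""
    return resemblance(doc_a, doc_b, k) >= threshold
```

A pairwise test like this requires comparing every pair of documents, which is why large document databases call for more efficient detection techniques than exhaustive comparison.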