응용통계연구 = The Korean Journal of Applied Statistics, v.34 no.3, 2021, pp.507-521
Ahn Jaehyung (Department of Applied Statistics, Konkuk University), Kwon Sunghoon (Department of Applied Statistics, Konkuk University)
Detecting outliers in high-dimensional data poses the challenging problem of screening variables, since relevant information is often contained in only a few of them. When a number of irrelevant variables are included in the data, the distances between all observations t...
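The effect the abstract alludes to is the well-known distance-concentration phenomenon: as irrelevant dimensions are added, pairwise distances between observations become nearly indistinguishable, so distance-based outlier scores lose contrast. A minimal sketch of this effect (not code from the paper; the sample sizes and dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_points: int, dim: int) -> float:
    """Relative spread (max - min) / mean of pairwise Euclidean
    distances among random Gaussian points; this ratio shrinks
    toward zero as the dimension grows."""
    x = rng.standard_normal((n_points, dim))
    # full pairwise distance matrix via broadcasting
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    # keep only the upper triangle (each pair once, no self-distances)
    d = d[np.triu_indices(n_points, k=1)]
    return (d.max() - d.min()) / d.mean()

low = distance_contrast(100, 2)      # relevant, low-dimensional case
high = distance_contrast(100, 1000)  # many irrelevant dimensions
print(low, high)
```

With these settings the contrast in 2 dimensions is markedly larger than in 1,000 dimensions, which is why the subspace-selection methods surveyed by the paper screen out irrelevant variables before scoring outliers.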