[Paper] An Interpretable Log Anomaly System Using Bayesian Probability and Closed Sequence Pattern Mining

윤지영; 신건윤; 김동욱; 김상수; 한명묵

doi:10.7472/jksii.2021.22.2.77


Journal of Internet Computing and Services (인터넷정보학회논문지), v.22 no.2, 2021, pp.77-87

윤지영 (Department of Software, Gachon University) , 신건윤 (Department of Computer Engineering, Gachon University) , 김동욱 (Department of Computer Engineering, Gachon University) , 김상수 (Agency for Defense Development) , 한명묵 (Department of Software, Gachon University)

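Abstract (translated from Korean)

With the development of the Internet and personal computers, diverse and complex attacks began to appear. As attacks grew more complex, they became hard to detect with the signature-based methods previously in use, and research on log anomaly detection for behavior-based detection began to attract attention as a remedy. Recent research on log anomaly detection uses deep learning to learn event order and shows good performance. Despite that performance, however, it cannot provide the grounds for its decisions. When no grounds or explanation are provided, data contamination or defects in the model itself are hard to discover, and the model ultimately loses its users' trust. To address this, this study proposes an explainable log anomaly detection system. The study first preprocesses the logs by log parsing. It then performs Bayesian-probability-based sequential rule extraction on the preprocessed logs. The result is a rule set of the form "If condition then result, posterior probability (θ)"; a sample that matches the rule set is judged normal, and a sample that does not is judged anomalous. The experiments used the HDFS log dataset and yielded an F1 score of 92.7%.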

Abstract

With the development of the Internet and personal computers, various complex attacks have begun to emerge. As attacks become more complex, signature-based detection has become difficult, which has led to research on behavior-based log anomaly detection. Recent work uses deep learning to learn the order of log events and shows good performance. Despite this performance, it provides no explanation for its predictions. Without an explanation, it is difficult to detect data contamination or flaws in the model itself, and as a result users lose trust in the model. To address this problem, this work proposes an explainable log anomaly detection system. Logs are first parsed. Sequential rules are then extracted using Bayesian posterior probability, yielding a rule set of the form "If condition then result, posterior probability". A sample that matches the rule set is classified as normal; otherwise, it is classified as an anomaly. We use the HDFS dataset for the experiment and obtain an F1 score of 92.7% on the test set.

Tables/Figures (13)

Figure 1. Framework of the proposed model
Figure 2. Result of parsing using Drain
Figure 3. Example of an extracted rule
Figure 4. Part of the extracted rules
Figure 5. Confusion matrix of the proposed model
Table 1. Parameters used
Table 2. Performance of the proposed model
Figure 6. Performance comparison between the proposed model and models presented in previous studies
Figure 7. Explanation of sample "21,4,10,8,10,8,10,8,25,25"
Figure 8. Explanation of sample "8,25,10,8,25,1,1,3,3,2"
Figure 9. Explanation of sample "25,25,10,8,2,3,2,17,24,4"
Figure 10. Explanation of sample "10,8,10,8,10,8,25,25,25,22"
Figure 11. Explanation of sample "4,4,8,10,8,10,8,25,25,25"

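AI summary of the main text

Problem definition

To address this trust problem, this study proposes an accurate yet explainable log anomaly detection system that uses Bayesian-probability-based closed sequential pattern mining. The proposed model ultimately extracts rules of the form "If condition then result, posterior probability (θ)".
By generating a rule set that includes probabilities, the proposed model aims to provide explainability while maintaining good performance through accurate rules. A sample that matches a rule is judged normal; a sample that matches no rule is judged anomalous.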
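
The matching step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Rule` class, the helper names, the example rules, and their θ values are all hypothetical, and the sketch assumes that "matching" means the rule's condition followed by its result occurs as an ordered (not necessarily contiguous) subsequence of the log-key window. The summary does not specify the exact matching semantics.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for one "If condition then result, theta" rule."""
    antecedent: Tuple[int, ...]  # "If" part: log keys in order
    consequent: Tuple[int, ...]  # "then" part: log keys in order
    posterior: float             # theta attached to the rule


def is_subsequence(pattern: Sequence[int], window: Sequence[int]) -> bool:
    """True if pattern occurs in window in order (gaps allowed)."""
    it = iter(window)
    # `event in it` advances the iterator, so order is respected.
    return all(event in it for event in pattern)


def matching_rule(window: Sequence[int], rules: List[Rule]) -> Optional[Rule]:
    """Return the first rule whose condition and result both occur, in order."""
    for rule in rules:
        if is_subsequence(rule.antecedent + rule.consequent, window):
            return rule
    return None


def classify(window: Sequence[int], rules: List[Rule]) -> str:
    """Assumed decision: a matched window is normal, an unmatched one is an anomaly."""
    return "normal" if matching_rule(window, rules) is not None else "anomaly"


# Illustrative rules only; the keys and posteriors are made up.
RULES = [
    Rule(antecedent=(10, 8), consequent=(25,), posterior=0.97),
    Rule(antecedent=(21, 4), consequent=(10, 8), posterior=0.93),
]

print(classify([5, 10, 8, 25], RULES))  # matched by the first rule
print(classify([25, 8, 10], RULES))     # same keys, wrong order
```

Returning the matched rule, rather than only a label, is what makes the decision explainable in this sketch: the rule and its posterior can be shown to the user as the reason a window was accepted.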

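Data processing

Figure 5 shows the confusion matrix of the proposed approach, and Table 2 shows the model's performance. Performance is evaluated by accuracy, F1 score, TPR (True Positive Rate), and FPR (False Positive Rate). In Figure 5, 0 denotes normal and 1 denotes anomalous behavior, and each number is the count of test samples that fall into that cell.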
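
For reference, the four metrics named above follow directly from the cells of a binary confusion matrix. The sketch below uses the standard definitions with class 1 (anomaly) as the positive class; the function name and the counts in the usage example are hypothetical and are not the values from Figure 5 or Table 2.

```python
from typing import Dict


def evaluate(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, F1, TPR and FPR from confusion-matrix counts.

    Positive class = anomaly (label 1), negative class = normal (label 0).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # recall / sensitivity
    fpr = fp / (fp + tn) if (fp + tn) else 0.0        # false alarm rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
    return {"accuracy": accuracy, "f1": f1, "tpr": tpr, "fpr": fpr}


# Made-up counts for illustration only.
metrics = evaluate(tp=90, fp=10, tn=890, fn=10)
print(metrics)
```

TPR is the same quantity as recall, so a model can have the highest recall among compared methods while still trailing them on F1 if its precision is lower.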

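Performance/Effects

The study uses five key parameters: the n of the n-length sequence, the support threshold, the confidence threshold, the extracted pattern length, and the Bayesian probability threshold. Choosing appropriate values for these five parameters makes it possible to generate discriminative, non-redundant rules. The n of the n-length sequence is the length of the sequences used for training, and the support and confidence thresholds are the cutoffs applied to the support and confidence measures.
The extracted pattern length is the length of the extracted rules themselves: when rules are long, generalized rules are hard to find, and when they are short, anomalies are hard to detect. Finally, the Bayesian probability threshold is the cutoff applied to the rules themselves: if it is too high, the false positive rate rises, and if it is too low, the miss (false negative) rate rises. Table 1 summarizes these parameters.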
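
To make the parameters concrete, the sketch below cuts a log-key stream into n-length windows and computes support and confidence for a candidate rule using the usual sequential-pattern definitions. It is an illustration under stated assumptions, not the paper's procedure: all function names are hypothetical, subsequence containment with gaps is assumed, and the `posterior_mean` helper assumes a Beta prior with a binomial likelihood, which the summary does not confirm as the authors' Bayesian model.

```python
from typing import List, Sequence, Tuple

Window = Tuple[int, ...]


def sliding_windows(events: Sequence[int], n: int) -> List[Window]:
    """Cut a log-key stream into overlapping n-length sequences."""
    return [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]


def contains(pattern: Sequence[int], window: Sequence[int]) -> bool:
    """True if pattern occurs in window in order (gaps allowed)."""
    it = iter(window)
    return all(event in it for event in pattern)


def support(pattern: Window, windows: List[Window]) -> float:
    """Fraction of windows that contain the pattern."""
    if not windows:
        return 0.0
    return sum(contains(pattern, w) for w in windows) / len(windows)


def confidence(antecedent: Window, consequent: Window, windows: List[Window]) -> float:
    """Among windows containing the antecedent, the fraction that also continue with the consequent."""
    base = sum(contains(antecedent, w) for w in windows)
    if base == 0:
        return 0.0
    both = sum(contains(antecedent + consequent, w) for w in windows)
    return both / base


def posterior_mean(hits: int, trials: int, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior mean of theta under an ASSUMED Beta(alpha, beta) prior."""
    return (alpha + hits) / (alpha + beta + trials)


# Made-up log-key stream; n = 4 is an arbitrary choice for the example.
stream = [10, 8, 10, 8, 25, 25]
windows = sliding_windows(stream, n=4)
print(windows)
print(support((10, 8), windows))
print(confidence((10, 8), (25,), windows))
print(posterior_mean(hits=2, trials=3))
```

In a pipeline like the one summarized above, candidate rules whose support, confidence, or posterior fall below the respective thresholds would be discarded, which is how the threshold choices trade false positives against misses.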
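The proposed approach performed below the deep learning models in the comparison but recorded the highest recall. Its major advantages are that it does not require many resources and that it is explainable.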

