[Paper] Deobfuscation Processing and Deep Learning-Based Detection Method for PowerShell-Based Malware

정호진; 유효곤; 조규환; 이상근

doi:10.13089/jkiisc.2022.32.3.501

Deobfuscation Processing and Deep Learning-Based Detection Method for PowerShell-Based Malware (Korean title: 파워쉘 기반 악성코드에 대한 역난독화 처리와 딥러닝 기반 탐지 방법)

Journal of the Korea Institute of Information Security and Cryptology (情報保護學會論文誌), vol. 32, no. 3, 2022, pp. 501-511

정호진 (Korea University), 유효곤 (Korea University), 조규환 (Korea University), 이상근 (Korea University)

Abstract

In 2021, in the wake of COVID-19, ransomware attacks became widespread, and their number is rising sharply every year. PowerShell is one of the main techniques used in ransomware, so the need for PowerShell-based malware detection keeps growing. Existing detection techniques, however, either fail to detect obfuscated scripts or take a long time to deobfuscate them. This paper proposes a simple and fast deobfuscation process together with a deep learning-based classification model, composed of Word2Vec and a convolutional neural network (CNN), that learns the meaning of a script and extracts features to decide whether it is malicious. The proposed model was tested on 1,400 malicious and 8,600 benign scripts provided by the AI-based PowerShell malicious script detection track of the 2021 Cybersecurity AI/Big Data Utilization Contest. It achieved deobfuscation 5.04 times faster than existing methods, a 100% deobfuscation success rate, a false positive rate (FPR) of 0.01, and a true positive rate (TPR) of 0.965.

Tables/Figures (11)

Fig. 1. Structure of CNN
Fig. 2. Structure of Word2Vec
Fig. 3. Overall Structure of Malware Detection Method
Table 1. Python Style Pseudocode of Deobfuscation
Fig. 4. Overall Structure of Classifier
Fig. 5. Comparison of using first 2048 tokens (top) and using all tokens (bottom). Using first 2048 tokens forms better cluster.
Table 2. Experiment Environment Specification
Table 3. Obfuscated PowerShell Script (top) and Separated Script Blocks (bottom)
Table 4. Success Rate and Execution Time by Deobfuscation Method
Fig. 6. ROC curves of models on test set
Table 5. Performance comparison for detection. Standard deviations are less than 0.012 on TPR, 0.0011 on AUC, and 0.0004 on FPR

AI Summary of the Main Text

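Proposed Method

First, we propose a simple and efficient deobfuscation method that parses the argument passed to 'iex' or 'invoke-expression' in each script block and runs it with the 'echo' command instead, and we evaluate the method's deobfuscation success rate and speed. Second, we present a deep learning-based classification model that uses Word2Vec and a CNN to learn the meaning of a script and decide whether it is malicious, and we analyze how well it separates a dataset of 1,600 malicious and 8,400 benign scripts in terms of TPR, FPR, AUC (area under the curve), and throughput.
The deep learning-based malware detection method proposed in this paper consists of three main parts, as shown in Fig. 3: deobfuscation, preprocessing, and a classifier.
Accordingly, this paper proposed a simple and efficient deobfuscation method, a deep learning-based classification model that learns the meaning of a script to decide whether it is malicious, and, by giving the model an explanation through data analysis, a model that detects malware more effectively than existing methods for detecting malicious PowerShell scripts.
Many obfuscated PowerShell scripts first go through a deobfuscation step and then execute the result through 'iex' or 'invoke-expression'. Building on this observation, we parsed each obfuscated PowerShell script into script blocks (the smallest unit that the PowerShell interpreter executes) and deobfuscated each block by running it with the echo command.
However, the hooking-based approach incurs overhead while PowerShell runs, which increases deobfuscation time, and the log-retrieval approach takes too long because it has to actually execute every malicious script in full. The proposed method incurs no such overhead and never executes the malicious script itself, so it can deobfuscate quickly.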
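The substitution at the heart of this idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pseudocode from Table 1: the function name `neutralize_iex` and the regular expression are hypothetical, and the sketch only rewrites the command name, leaving the step of running the rewritten block in a PowerShell interpreter out of scope.

```python
import re

# Matches `iex` or `Invoke-Expression` as a standalone command name.
# PowerShell command names are case-insensitive, hence re.IGNORECASE.
# The lookarounds keep us from touching longer identifiers such as `$iexplorer`.
IEX_PATTERN = re.compile(
    r"(?<![\w-])(?:iex|invoke-expression)(?![\w-])", re.IGNORECASE
)

def neutralize_iex(script_block: str) -> str:
    """Replace iex / Invoke-Expression with echo (hypothetical helper).

    After the rewrite, running the block would print the decoded payload
    instead of executing it.
    """
    return IEX_PATTERN.sub("echo", script_block)

print(neutralize_iex("IEX ('wri'+'te-host hi')"))
print(neutralize_iex("$cmd | Invoke-Expression"))
```

A plain regular expression has obvious blind spots: it would also rewrite the text `iex` inside a string literal, and it misses invocations that are themselves obfuscated, such as `&('i'+'ex')`. A parser-based implementation would be needed to handle those cases.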

Target Data

The dataset used for the experiments is the preliminary-round dataset provided in Track A, AI-based PowerShell malicious script detection, of the 2021 Cybersecurity AI/Big Data Utilization Contest, organized by the Korea Internet & Security Agency (KISA) and hosted by the Ministry of Science and ICT [22]. The dataset consists of 8,600 benign scripts and 1,400 malicious scripts.

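Data Processing

To estimate classifier performance with little bias and without leaning too heavily on the distribution of the dataset, we separated the data into training and test sets. The data were split randomly at a 7:3 ratio, and 3-fold cross-validation on the training set was used to find the best hyperparameters. The hyperparameters tuned include the Word2Vec embedding dimension, the number of CNN filters, and the input sequence length.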
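The split described above can be sketched with the standard library alone. The helper names, the fixed seed, and the use of plain (non-stratified) random shuffling are assumptions made for illustration; the source only states a random 7:3 split followed by 3-fold cross-validation on the training portion.

```python
import random

def train_test_split_indices(n, test_ratio=0.3, seed=0):
    """Shuffle indices 0..n-1 and split them into (train, test) lists."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_ratio)
    return indices[n_test:], indices[:n_test]

def k_fold(indices, k=3):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation

# 10,000 scripts in total, as in the dataset description.
train_idx, test_idx = train_test_split_indices(10000)
folds = list(k_fold(train_idx, k=3))
print(len(train_idx), len(test_idx), [len(v) for _, v in folds])
```

Each candidate hyperparameter setting would be trained on the `train` part of every fold and scored on the matching `validation` part, with the held-out test indices touched only once at the end.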
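We compare existing deobfuscation methods with the method proposed in this paper on the 10,000 given scripts. The comparison is made in the same environment, using the per-script deobfuscation time and the success rate as criteria.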
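A comparison of this kind could be driven by a small harness like the one below. It is a sketch under assumptions: `benchmark` is a hypothetical helper, and "success" is defined here as the deobfuscator returning a non-`None` result without raising, which may differ from the criterion the authors used.

```python
import time

def benchmark(deobfuscate, scripts):
    """Return (success_rate, mean_seconds_per_script) for one deobfuscator."""
    successes = 0
    start = time.perf_counter()
    for script in scripts:
        try:
            if deobfuscate(script) is not None:
                successes += 1
        except Exception:
            pass  # a crash is counted as a failed deobfuscation
    elapsed = time.perf_counter() - start
    return successes / len(scripts), elapsed / len(scripts)

# Toy stand-in for a real deobfuscator: fails on empty input.
rate, seconds = benchmark(lambda s: s.upper() if s else None, ["a", "", "b", "c"])
print(rate, seconds)
```

Running every method through the same function on the same machine keeps the timing and success figures comparable across methods.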

Theory/Model

Before the CNN layers and the embedding layer are trained together, we use an embedding pretrained [17] with Word2Vec so that the meaning of each token is learned in advance. Training uses the CBOW [3] scheme with negative sampling [16].
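To make the training signal concrete, the sketch below generates the examples that CBOW with negative sampling learns from: for each target token, the surrounding context tokens and a handful of randomly drawn "negative" tokens. The window size, the number of negatives, and the function name are illustrative assumptions; the 3/4-power unigram distribution follows the negative sampling paper [16]. The actual weight updates are omitted.

```python
import random
from collections import Counter

def cbow_examples(tokens, window=2, negatives=3, seed=0):
    """Yield (context, target, negative_samples) triples for CBOW training."""
    counts = Counter(tokens)
    vocab = sorted(counts)
    if len(vocab) < 2:
        return  # no negatives can be drawn from a one-word vocabulary
    rng = random.Random(seed)
    # Negatives are drawn from the unigram distribution raised to the 3/4 power.
    weights = [counts[word] ** 0.75 for word in vocab]
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if not context:
            continue
        negs = []
        while len(negs) < negatives:
            candidate = rng.choices(vocab, weights=weights, k=1)[0]
            if candidate != target:
                negs.append(candidate)
        yield context, target, negs

tokens = ["new-object", "net.webclient", "downloadstring", "iex", "echo"]
examples = list(cbow_examples(tokens))
print(examples[0])
```

The model is then trained to score the true target highly given the averaged context vectors and to score the negatives low. With a library such as gensim, the same configuration would correspond to `Word2Vec(..., sg=0, negative=k)`; the concrete values used in the paper are not given here.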
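In a real deployment, even a slightly elevated FPR means many benign programs are flagged as malicious, which is fatal for a detection system. This paper therefore uses the maximum TPR at an FPR of 1% or less, together with the AUC, as evaluation metrics.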
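Both metrics can be computed directly from classifier scores. The following is a minimal sketch, assuming labels of 1 for malicious and 0 for benign and a higher score meaning "more likely malicious"; the function names and the toy data are made up for illustration.

```python
def roc_points(labels, scores):
    """Return ROC points (fpr, tpr), one per distinct score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def tpr_at_fpr(points, max_fpr=0.01):
    """Largest TPR among operating points whose FPR does not exceed max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
print(tpr_at_fpr(points), auc(points))
```

On this toy data the third-ranked sample is benign, so only the two top-scoring malicious samples can be caught before the first false positive, which is what bounds the TPR at the strict FPR limit.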
The classifier was trained on an RTX 2080 Ti. We used ADAM [15], an optimizer in the SGD (stochastic gradient descent) family, and binary cross-entropy as the objective function. The binary cross-entropy function is given in Eq. (2).
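For reference, the standard definition of binary cross-entropy is sketched below; the paper's Eq. (2) is not reproduced in this summary, so averaging over samples and the clamping constant `eps` are assumptions of this sketch.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

print(binary_cross_entropy([1, 0], [0.5, 0.5]))
```

The loss is near zero when the predicted probability matches the label and grows without bound as a confident prediction turns out wrong, which is why the clamp is needed in practice.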
Therefore, this part of the paper gives a qualitative explanation of the benefit gained by limiting the script length to n. We applied the TF-IDF (term frequency-inverse document frequency) algorithm to the deobfuscated scripts to embed them, then applied UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) to reduce the dimensionality and plot the scripts in two dimensions.
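The embedding half of this analysis can be sketched without third-party libraries. The function below builds TF-IDF vectors from only the first `max_tokens` tokens of each script, with the default of 2048 taken from the Fig. 5 caption. The exact TF-IDF variant (raw term frequency normalized by length, unsmoothed logarithmic IDF) is an assumption, and the UMAP projection is omitted; with the umap-learn package it would look roughly like `UMAP(n_components=2).fit_transform(vectors)`.

```python
import math
from collections import Counter

def tfidf(documents, max_tokens=2048):
    """Return (vocabulary, vectors) using only the first max_tokens tokens."""
    docs = [doc[:max_tokens] for doc in documents]
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in docs for token in set(doc))
    idf = {token: math.log(n_docs / df[token]) for token in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            [(tf[token] / len(doc) * idf[token]) if doc else 0.0 for token in vocab]
        )
    return vocab, vectors

vocab, vectors = tfidf([["iex", "a"], ["a", "b"]])
print(vocab, vectors)
```

Tokens that appear in every script receive an IDF of zero and therefore contribute nothing, so the resulting plot is driven by tokens that distinguish scripts from one another.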
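The CNN layer consists of a one-dimensional convolution layer followed by a max-pooling layer. The convolution layer includes an activation function, for which this paper uses ReLU [14].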
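The forward pass of such a layer is small enough to write out by hand. The sketch below assumes global (max-over-time) pooling, stride 1, and no padding, none of which is specified in this summary; a real model would use a deep learning framework rather than nested Python lists.

```python
def conv1d_relu_maxpool(sequence, filters, bias=None):
    """Apply 1-D convolution, ReLU, and global max pooling.

    sequence: list of T embedding vectors, each of length D.
    filters:  list of F filters, each a list of K weight vectors of length D.
    Returns one pooled feature per filter.
    """
    features = []
    for f_idx, kernel in enumerate(filters):
        k = len(kernel)
        b = bias[f_idx] if bias else 0.0
        best = 0.0  # ReLU output is never negative, so 0 is a safe floor
        for start in range(len(sequence) - k + 1):
            window = sequence[start:start + k]
            z = b + sum(w * x
                        for w_vec, x_vec in zip(kernel, window)
                        for w, x in zip(w_vec, x_vec))
            best = max(best, max(0.0, z))  # ReLU, then running maximum
        features.append(best)
    return features

sequence = [[1.0], [2.0], [-3.0], [4.0]]          # T=4 tokens, D=1
filters = [[[1.0], [1.0]], [[1.0], [-1.0]]]       # F=2 filters, K=2
print(conv1d_relu_maxpool(sequence, filters))
```

Each filter slides over consecutive token embeddings and keeps only its strongest response, so the output size depends on the number of filters and not on the script length.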
When parsing PowerShell scripts, we split them into script blocks following the parsing rules in the official Microsoft documentation [18] and then executed each block. Doing so reveals the deobfuscated script for each script block, except when the result is itself 'iex' or 'invoke-expression'.
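A rough approximation of the splitting step is shown below. It is a simplified sketch, not an implementation of the full PowerShell parsing rules: it treats `;` and newlines as separators only when they occur outside quotes and outside `()`, `{}`, and `[]`, and it ignores here-strings, comments, and line continuations. The function name `split_statements` is hypothetical.

```python
def split_statements(script):
    """Split a script on top-level ';' and newline separators."""
    statements, current = [], []
    quote = None   # the quote character we are currently inside, if any
    depth = 0      # nesting depth of (), {}, []
    i = 0
    while i < len(script):
        ch = script[i]
        if quote:
            current.append(ch)
            if ch == "`" and quote == '"' and i + 1 < len(script):
                current.append(script[i + 1])  # backtick escapes next char
                i += 1
            elif ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
            current.append(ch)
        elif ch in "({[":
            depth += 1
            current.append(ch)
        elif ch in ")}]":
            depth = max(0, depth - 1)
            current.append(ch)
        elif ch in ";\n" and depth == 0:
            text = "".join(current).strip()
            if text:
                statements.append(text)
            current = []
        else:
            current.append(ch)
        i += 1
    text = "".join(current).strip()
    if text:
        statements.append(text)
    return statements

blocks = split_statements("$a = 'x;y'; iex ($a + \"z\")\nwrite-host {1;2}")
print(blocks)
```

Keeping quoted and bracketed text intact matters here because obfuscated scripts routinely hide separators inside string literals that are only assembled at run time.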

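Performance/Effect

These results show that the model proposed in this paper generalizes better.
When the proposed model was trained, validated, and tested on the given dataset of malicious and benign PowerShell scripts, it outperformed the existing models on the test set.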

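Future Work

We expect this paper to help companies and institutions build countermeasures against PowerShell-based malware, and to serve as a guide for research that applies deep learning techniques to malware classification.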

References (23)

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, pp. 1106-1114, Dec. 2012
[2] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu, "Relational inductive biases, deep learning, and graph networks," arXiv preprint arXiv:1806.01261v3, Oct. 2018
[3] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv preprint arXiv:1301.3781v3, Sep. 2013
[4] D. Hendler, S. Kels, and A. Rubin, "Detecting malicious powershell commands using deep neural networks," Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pp. 187-197, Jun. 2018
[5] G. Rusak, A. Al-Dujaili, and U.-M. O'Reilly, "AST-Based Deep Learning for Detecting Malicious PowerShell," Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 2276-2278, Oct. 2018
[6] D. Hendler, S. Kels, and A. Rubin, "AMSI-Based Detection of Malicious PowerShell Code Using Contextual Embeddings," Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, pp. 679-693, Oct. 2020
[7] Coveware, "Ransomware attackers down shift to 'Mid-Game' hunting in Q3 2021," https://www.coveware.com/blog/2021/10/20/ransomware-attacks-continue-as-pressure-mounts, Apr. 01, 2022
[8] KrCERT, "2021 Ransomware Special Report," https://www.krcert.or.kr/filedownload.do?attach_file_seq3278&attach_file_idEpF3278.pdf, Apr. 01, 2022
[9] McAfee, "McAfee Labs COVID-19 Threats Report," https://www.mcafee.com/enterprise/en-us/assets/reports/rp-quarterly-threats-july-2020.pdf, Apr. 01, 2022
[10] Microsoft, "Antimalware scan interface," https://docs.microsoft.com/en-us/windows/desktop/AMSI/antimalware-scan-interface-portal, Apr. 01, 2022
[11] C. Liu, B. Xia, M. Yu, and Y. Liu, "PSDEM: A Feasible De-Obfuscation Method for Malicious PowerShell Detection," 2018 IEEE Symposium on Computers and Communications, pp. 825-831, Jun. 2018
[12] M. AbdelKhalek and A. Shosha, "JSDES: An Automated De-Obfuscation System for Malicious JavaScript," Proceedings of the 12th International Conference on Availability, Reliability and Security, pp. 1-13, Aug. 2017
[13] GitHub, "PSDecode," https://github.com/R3MRUM/PSDecode, Apr. 01, 2022
[14] V. Nair and G.E. Hinton, "Rectified linear units improve restricted boltzmann machines," Proceedings of the 27th international conference on machine learning, pp. 807-814, Jun. 2010
[15] D.P. Kingma and J.L. Ba, "Adam: a method for stochastic optimization," arXiv preprint arXiv:1412.6980v9, Jan. 2017
[16] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean, "Distributed Representations of Words and Phrases and their Compositionality," Advances in Neural Information Processing Systems, pp. 3111-3119, Dec. 2013
[17] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio, "Why Does Unsupervised Pre-training Help Deep Learning?," Journal of Machine Learning Research, pp. 625-660, Mar. 2010
[18] Microsoft, "about_Parsing," https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_parsing?viewpowershell-7.2, Apr. 01, 2022
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, Jun. 2015
[20] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa, "Natural language processing (almost) from scratch," Journal of Machine Learning Research, vol. 12, no. 76, pp. 2493-2537, Jun. 2011
[21] Y. Kim, "Convolutional neural networks for sentence classification," Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1746-1751, Oct. 2014
[22] KISA, "Cyber security AI/bigdata challenge 2021," https://aibigdatasec.kr/main, Apr. 01, 2022
[23] McAfee, "McAfee Labs Threats Report," https://www.mcafee.com/enterprise/en-us/assets/reports/rp-quarterly-threats-apr-2021.pdf, Apr. 01, 2022
