[Thesis] Data generation and classification framework for resolving class imbalances in network data sets

이우호

네트워크 데이터 셋의 클래스 불균형 해결을 위한 데이터 생성 및 분류 프레임워크
Data generation and classification framework for resolving class imbalances in network data sets

이우호 (Chonnam National University Graduate School, Interdisciplinary Program of Information Security, doctoral thesis)

Abstract (translated from the Korean)

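Governments and companies are pursuing a wide range of research that uses artificial intelligence (AI). In security, AI has been applied to malware detection, intrusion detection, and related problems. For network intrusion detection in particular, research covers both traditional machine learning algorithms and deep learning. Deep-learning-based network traffic classification can be divided into port-based approaches, deep-inspection-based approaches, and others. Classification with deep learning models has moved beyond text- and image-oriented neural networks to work that uses generative adversarial networks (GANs). Institutions in Korea and abroad are studying many ways to improve traffic classification performance. Despite advances in deep learning models, AI-based security technology remains limited by high false positive rates, difficulty in acquiring data, class imbalance, algorithm implementation, reduction of large volumes of information, and the difficulty of real-time detection. This study proposes a method for resolving class imbalance and a classification model based on the characteristics of network traffic.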

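Training datasets, the most important ingredient in research on deep-learning-based network traffic classification, are shared in many forms. Most public datasets are provided as synthetic table data. According to a survey conducted by the data science platform Kaggle, table data is the data type most often encountered in deep-learning-based security research and the second most common format in academia. Most of the representative shared datasets have a synthetic table structure, and researchers publish them in that form. A synthetic table dataset here means a dataset in which a researcher has extracted the needed features from the raw data and organized the required information. Table data refers to data held in Structured Query Language (SQL) databases or in comma-separated values (CSV) files. For network traffic, features suited to the detection goal are extracted through feature engineering, preprocessing appropriate to each data type is applied after data processing and analysis, and the synthetic table dataset is generated.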
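As an illustration of what "extracting features into a table" can look like, the following minimal Python sketch aggregates packet records into one row per flow and writes the rows as CSV. The record fields (src, dst, dport, size) and the derived features (packet_count, total_bytes, mean_size) are hypothetical choices made for this example and are not the feature set used in the thesis.

```python
import csv
import io
from collections import defaultdict


def packets_to_table(packets):
    """Group packets by (src, dst, dport) and derive simple per-flow features."""
    flows = defaultdict(list)
    for p in packets:
        flows[(p["src"], p["dst"], p["dport"])].append(p["size"])

    rows = []
    for (src, dst, dport), sizes in sorted(flows.items()):
        rows.append({
            "src": src,
            "dst": dst,
            "dport": dport,
            "packet_count": len(sizes),
            "total_bytes": sum(sizes),
            "mean_size": sum(sizes) / len(sizes),
        })
    return rows


def to_csv(rows):
    """Serialize the feature rows as CSV text (one header line, one line per flow)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical packet records, for illustration only.
packets = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "dport": 22, "size": 100},
    {"src": "10.0.0.1", "dst": "10.0.0.2", "dport": 22, "size": 300},
    {"src": "10.0.0.3", "dst": "10.0.0.2", "dport": 80, "size": 60},
]
print(to_csv(packets_to_table(packets)))
```

Each output row corresponds to one flow, which is the tabular shape that the generation and classification steps described in this abstract operate on.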

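First, among the limitations of deep-learning-based attack traffic classification research, this study proposes a solution to class imbalance in shared table datasets. Shared synthetic datasets contain too few attack samples relative to the data that arises in real network environments, and their features depend on the characteristics of each attack. Because abnormal traffic is far rarer than normal traffic in a network, the class imbalance of the dataset becomes severe and the model learns incorrectly. Class imbalance is a problem not only in security but in AI model research generally. Current deep-learning-based intrusion detection research concentrates on improving classification performance, and work on class imbalance is scarce. This study proposes a method for generating similar traffic with TSGAN (Table Synthetic Generative Adversarial Network), a similar-traffic generation model for resolving class imbalance in synthetic table datasets of network traffic.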
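To make the imbalance problem concrete, the sketch below measures the ratio between the largest and smallest class and computes how many synthetic samples each minority class would need to match the majority class. The label names and counts are made up for illustration, and balancing every class up to the majority size is an assumption of this example, not a rule stated in the thesis.

```python
from collections import Counter


def oversampling_plan(labels):
    """Return (imbalance_ratio, {class: number_of_samples_to_generate})."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    # Every class smaller than the majority class is topped up to its size.
    plan = {cls: majority - n for cls, n in counts.items() if n < majority}
    return majority / minority, plan


# Hypothetical label column of a traffic table.
labels = ["normal"] * 90 + ["botnet"] * 6 + ["ssh"] * 4
ratio, plan = oversampling_plan(labels)
print(ratio)  # 22.5
print(plan)   # {'botnet': 84, 'ssh': 86}
```

A generator such as the TSGAN described here would be the component asked to produce the samples listed in the plan.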

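Class imbalance has conventionally been addressed with statistical methods such as the Synthetic Minority Over-sampling Technique (SMOTE). Statistical methods, however, make it difficult to analyze and track the correlations between the similar traffic and each feature, and this study aims to address that limitation. The proposed TSGAN analyzes the distribution of each feature of the synthetic table data according to its data type, applies preprocessing, and generates network traffic similar to the original data. Features in a synthetic table dataset follow different distributions depending on their data type, and some have a multinomial distribution or a skewed distribution, so a Gaussian mixture model (GMM) is used for data processing and clustering. Based on this analysis, preprocessing was applied per data type using feature engineering, and data was generated for the minority classes of the synthetic table dataset. The generated data was analyzed by measuring its similarity to the original data with normalized mutual information (NMI). A comparison of generation precision against existing table-structured generative adversarial networks showed the superiority of the proposed model.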
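The abstract does not say exactly how NMI is applied to the original and generated tables. As a hedged sketch, the following computes NMI between two discrete columns, normalizing mutual information by the arithmetic mean of the two entropies (one of several common normalizations, chosen here as an assumption). One plausible use is to compute NMI for the same pair of features in the real table and in the generated table and compare the two values.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (natural log) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def normalized_mutual_information(x, y):
    """NMI of two equal-length discrete sequences, in [0, 1]."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # I(X;Y) = sum over (a, b) of p(a,b) * log(p(a,b) / (p(a) * p(b)))
    mi = sum(
        (c / n) * math.log((c * n) / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )
    denom = (entropy(x) + entropy(y)) / 2
    # Two constant columns carry no information; treat them as identical.
    return mi / denom if denom > 0 else 1.0


# Hypothetical discretized feature columns.
proto = ["tcp", "tcp", "udp", "udp"]
label = ["attack", "attack", "normal", "normal"]
print(normalized_mutual_information(proto, label))
```

In the toy columns above the protocol fully determines the label, so the score is at the top of the range; unrelated columns score near zero.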

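To evaluate TSGAN, generation precision was compared and the similarity of the generated data was measured. The generated table dataset was also applied to a classification model to analyze how adding minority-class data relieves class imbalance and improves classification performance. Classification performance was compared using the F1-score. Classification accuracy improved by 12% over the original dataset, and the root mean square error and mean absolute error changed by 0.1 to 0.4. These results confirm that precision and similarity improve substantially relative to existing table-based generative adversarial network models.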
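For reference, the three metrics named in this paragraph can be computed as in the short sketch below. The inputs are toy values chosen for illustration and have no connection to the results reported in the thesis.

```python
import math


def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rmse(real, generated):
    """Root mean square error between two equal-length numeric sequences."""
    return math.sqrt(sum((r - g) ** 2 for r, g in zip(real, generated)) / len(real))


def mae(real, generated):
    """Mean absolute error between two equal-length numeric sequences."""
    return sum(abs(r - g) for r, g in zip(real, generated)) / len(real)


y_true = ["attack", "attack", "normal", "normal"]
y_pred = ["attack", "normal", "normal", "attack"]
print(f1_score(y_true, y_pred, positive="attack"))  # 0.5
print(rmse([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]))  # 2.0
print(mae([0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]))   # 2.0
```

F1 is computed per class here, which is the form that exposes minority-class performance; RMSE and MAE compare numeric feature values of real and generated rows.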

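Second, this study proposes a classification model for synthetic table datasets that overcomes the limitations of existing classification models designed for text strings and images. Existing classifiers use convolutional neural networks (CNNs), recurrent neural networks (RNNs), and similar architectures. These two model families are optimized for classifying and generating text and images and analyze the meaning of the data as a whole without considering individual features, so they are poorly suited to classifying synthetic table data such as network traffic. They cannot take the correlations between features into account when classifying, and the resulting false positives degrade performance. In addition, the relevant features of network traffic change with the characteristics of the target to be classified. This study addresses these limitations with a payload-based classification method that assigns a higher weight to the data area of the packet. To this end, it proposes DPIHAN (Deep Packet Inspection Hierarchical Attention Network), a deep-packet-inspection-based model for network traffic.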

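DPIHAN adapts the hierarchical attention network, an algorithm used for document classification, so that it can learn from synthetic table data of network traffic. To capture the correlations between network traffic features, which existing classification models miss, a long short-term memory (LSTM) model is used to analyze each relationship. Because the LSTM is used to relate features to one another, this study does not consider the analysis of time-series changes. To demonstrate the advantage of the deep packet inspection (DPI) based classification model, training speed and classification performance were compared with existing classification algorithms on the similar dataset generated by TSGAN. Three experiments were carried out. The first was a classification experiment analyzing how class imbalance affects the trained model. The second compared the performance (precision, loss, and elapsed time) of text- and image-oriented neural network classifiers on the synthetic table dataset. The third compared the training speed and loss of the classification model built on DPIHAN, which introduces a hierarchical structure into the deep-packet-inspection-based classifier. These experiments showed the superior training performance of the data-area-based classification model built on deep packet inspection. Accuracy improved by 12% over existing classification models; classification accuracy for the majority class fell, but overall classification performance improved as minority-class precision rose. The experiments showed that precision and the false positive rate decrease for attacks that alter the packet data area (Botnet, SSH, etc.), and that for attacks with large changes in the data area, detection accuracy increased by 5%, the false positive rate improved by 7%, and recall improved by 8%.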
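The abstract does not give the DPIHAN architecture in detail. The sketch below only illustrates the general idea of hierarchical attention pooling: attention weights are computed over field vectors to form a packet vector, then over packet vectors to form a flow vector. It is a simplification under stated assumptions: the projection layer of the standard hierarchical attention network is replaced by an element-wise tanh, the input vectors are assumed to come from an encoder such as an LSTM (not implemented here), and all numbers and context vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, context):
    """Weight each hidden state by its similarity to a context vector and sum.

    hidden_states: list of equal-length vectors (assumed encoder outputs).
    context: vector of the same length (learned in a real model, fixed here).
    """
    scores = [
        sum(math.tanh(h_d) * c_d for h_d, c_d in zip(h, context))
        for h in hidden_states
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(len(context))
    ]
    return weights, pooled


# Hypothetical flow: two packets, each with two field vectors.
packets = [
    [[0.1, 0.9], [0.4, 0.2]],
    [[0.8, 0.3], [0.5, 0.5]],
]
field_context = [1.0, 0.0]   # assumed field-level context vector
packet_context = [0.0, 1.0]  # assumed packet-level context vector

# Level 1: pool fields into one vector per packet.
packet_vectors = [attention_pool(fields, field_context)[1] for fields in packets]
# Level 2: pool packet vectors into one flow vector.
packet_weights, flow_vector = attention_pool(packet_vectors, packet_context)
print(packet_weights, flow_vector)
```

In a trained model the context vectors are learned, so the weights indicate which fields and packets the classifier relies on; a payload-weighted design like the one described above would correspond to larger weights on data-area fields.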
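The results of this study were implemented as modules for use in a deep-packet-inspection-based traffic classification system operating on table-structured synthetic data. The study also developed a traffic generator for resolving class imbalance in shared synthetic datasets and a deep-packet-inspection-based classification model. Together these contribute to advancing deep-learning-based intrusion detection systems built on shared synthetic table datasets. Future work will address model improvements for detecting encrypted traffic, real-time synchronization of data generated by TSGAN, and automation technology for optimally handling class imbalance, based on an improved data analyzer.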

Abstract

Governments and corporations have recently been conducting a wide variety of research using artificial intelligence. Security research using artificial intelligence spans malware detection, intrusion detection, and related areas. Within security, network intrusion detection is being studied with deep learning as well as with traditional machine learning algorithms. Network traffic classification using deep learning can be divided into port-based and deep-inspection-based approaches. Classification with deep learning models is now being studied with the Generative Adversarial Network (GAN), going beyond the existing text- and image-based neural networks. Institutions at home and abroad are researching many ways to improve the performance of traffic classification. Despite advances in deep learning models, AI-based security technology is limited by high false positive rates, data acquisition, class imbalance, algorithm implementation, reduction of large amounts of information, and the difficulty of real-time detection. In this paper, we propose a method for resolving class imbalance and a classification model based on network traffic characteristics.

The training datasets that matter most for studying deep-learning-based network traffic classification models are widely shared. Most published datasets are provided as synthetic table data. According to a survey conducted by the data science platform Kaggle, table data is the most common data type in deep-learning-based security research and the second most common format in academia. In particular, most of the representative shared datasets have a synthetic table structure, and researchers provide them as synthetic table datasets. A synthetic table dataset is one in which the researcher extracts the required features from the original data and arranges the necessary information. Table data refers to Structured Query Language (SQL) data and table-based Comma-Separated Values (CSV) files. For network traffic data, features are extracted through feature engineering according to the purpose of detection, a preprocessing method suited to each data type is applied through data processing and analysis, and the synthetic table dataset is generated.

First, among the limitations of deep-learning-based attack traffic classification studies, we propose a solution for class imbalance in shared table datasets. Shared synthetic datasets contain less attack data than is generated in a real network environment, and their features depend on the characteristics of each attack. Compared with the large volume of normal traffic generated in a network, abnormal traffic is very rare, so the class imbalance of the dataset is severe and the model is trained incorrectly. Class imbalance is a problem in artificial intelligence model research generally, not only in the security domain. Current deep-learning-based intrusion detection studies focus on improving classification performance, and research on class imbalance is insufficient. In this study, we propose a similar-traffic generation method using TSGAN (Table Synthetic Generative Adversarial Network), a model that generates similar traffic to resolve class imbalance in synthetic table datasets of network traffic.

The class imbalance problem has conventionally been addressed with statistical methods such as the Synthetic Minority Over-sampling Technique (SMOTE). However, statistical methods are limited in that it is difficult to track the correlation between similar traffic and each feature. The proposed TSGAN analyzes the distribution of each feature of the synthetic table data according to its data type and generates network traffic similar to the original data by applying data preprocessing. In synthetic table datasets, features follow different distributions depending on their data type, some multinomial and some biased, so data processing and clustering are performed using a Gaussian Mixture Model (GMM). Based on this analysis, preprocessing was performed according to the data type of each feature using feature engineering, and data was generated for the minority classes of the synthetic table dataset. The generated data was analyzed by measuring its similarity to the original data using normalized mutual information (NMI). In addition, a comparison of generation precision with existing table-structured generative adversarial networks showed the superiority of the proposed model.
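For contrast with the GAN-based approach, the following is a simplified sketch of the interpolation idea behind SMOTE, the statistical baseline named above: a synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. It is not the reference SMOTE implementation; the function name smote_like, the parameter k, and the sample points are assumptions made for this example.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    minority: list of numeric feature vectors (at least two samples).
    k: number of nearest neighbours considered for each base sample.
    """
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the base sample, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist2(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class samples with two numeric features.
minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Because each feature is interpolated independently along a line segment, categorical columns and multi-peaked numeric columns are handled poorly, which is consistent with the limitation the abstract attributes to statistical methods.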

To evaluate TSGAN, we compared generation precision and measured the similarity of the generated data. In addition, we applied the generated table dataset to a classification model to analyze how increasing the data for minority classes resolves the class imbalance and improves classification performance. Classification performance was compared using the F1-score, and classification accuracy improved by 12% compared with the original dataset. These results confirm that precision and similarity are greatly improved compared with existing table-based generative adversarial network models.

Second, we propose a classification model for synthetic table datasets that addresses the limitations of existing classification models built for character strings and images. Existing classification models use the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), and similar architectures. However, these two models are optimized for classifying and generating character strings and images, and they analyze the meaning of the data as a whole without considering each feature, so they are not suitable for classifying synthetic table data such as network traffic. They cannot classify with the correlations between features taken into account, and the resulting false positives cause performance degradation. In addition, the features of network traffic vary depending on the characteristics of the target to be classified. This study addresses these limitations with a payload-based classification method that sets a high weight on the data area of the packet. To this end, it proposes the Deep Packet Inspection Hierarchical Attention Network (DPIHAN), a model based on deep packet inspection of network traffic.

DPIHAN is a modification of an algorithm used for document classification, based on the hierarchical attention network structure, that allows it to learn synthetic table data of network traffic. To apply the correlation between network traffic features and so address the limitations of existing classification models, we analyze each relationship using a long short-term memory (LSTM) model. Accordingly, this study does not consider the analysis of time-series changes. To demonstrate the advantage of the deep-packet-inspection-based classification model, we compared its learning speed and classification performance with existing classification algorithms using the similar dataset created with TSGAN. Three experiments were conducted. The first was a classification experiment analyzing the effect of class imbalance on the trained model. The second compared the performance (precision, loss, and time required) of text- and image-based neural network classification models on the synthetic table dataset. The third was a comparative evaluation of the learning speed and loss of the classification model using DPIHAN, which introduces a hierarchical structure into the deep-packet-inspection-based classification model. Through these experiments, we show the superior learning performance of the data-area-based classification model built on deep packet inspection. Compared with existing classification models, accuracy improved by 12%; classification accuracy for the majority class fell, but classification performance improved as the precision for minority classes increased. The experiments showed that precision and the false positive rate decrease for attacks that change the data area of network traffic (Botnet, SSH, etc.), and that the more an attack changes the data area, the greater the improvement in detection accuracy and false positive rate.

The results of this study were implemented in modular form for use in deep-packet-inspection-based traffic classification systems that operate on table-structured synthetic data. In addition, we studied a traffic generator that resolves the class imbalance of shared synthetic datasets and a classification model based on deep packet inspection. This work contributes to the technical advancement of deep-learning-based intrusion detection systems built on shared synthetic table datasets. Future work will study model improvements for detecting encrypted traffic, real-time synchronization technology for data generated using TSGAN, and automation technology for optimally handling class imbalance, based on an improved data analyzer.

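Thesis information

Author	이우호
Degree-granting institution	Chonnam National University Graduate School
Degree	Doctorate (domestic)
Department	Interdisciplinary Program of Information Security
Advisor	노봉남
Year of publication	2020
Total pages	141
Keywords	artificial intelligence, intrusion detection system, data imbalance, artificial neural network, generative adversarial network, hierarchical attention network
Language	kor
Full-text URL	http://www.riss.kr/link?id=T15509437&outLink=K
Source	Korea Education and Research Information Service (KERIS)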

