[Paper] Class Imbalance Resolution Method and Classification Algorithm Recommendation Based on Dataset Type Segmentation

김정훈; 곽기영

doi:10.13088/jiis.2022.28.3.023

Class Imbalance Resolution Method and Classification Algorithm Suggesting Based on Dataset Type Segmentation

Journal of Intelligence and Information Systems (지능정보연구), v.28 no.3, 2022, pp.23-43

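김정훈 (Kookmin University Graduate School of Business IT, BK21 Phase 4 Education and Research Team), 곽기영 (Kookmin University College of Business Administration)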

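Abstract (translated from Korean)

Interest in algorithm selection is growing as AI (Artificial Intelligence) is adopted across industries. In most cases, algorithm selection is determined by the experience of the data scientist. Less experienced data scientists, however, select algorithms through meta-learning based on dataset characteristics. Because the selection process in existing algorithm recommendation is a black box, the basis for a recommendation could not be known. This study therefore uses k-means cluster analysis to divide datasets into types according to their characteristics and explores the classification algorithms and class imbalance resolution methods suited to each type. The study derived four types and recommended a suitable class imbalance resolution method and classification algorithm for each dataset type.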

Abstract

Interest in algorithm selection is increasing as AI (Artificial Intelligence) is applied in various industries. Algorithm selection is largely determined by the experience of the data scientist. An inexperienced data scientist, however, selects an algorithm through meta-learning based on dataset characteristics. Because the selection process in existing algorithm recommendation is a black box, it was not possible to know on what basis a recommendation was derived. Accordingly, this study uses k-means cluster analysis to classify datasets into types according to their characteristics and to explore the classification algorithms and class imbalance resolution methods suitable for each type. Four types were derived, and an appropriate class imbalance resolution method and classification algorithm were recommended for each dataset type.
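The clustering step described in the abstract can be illustrated with a minimal sketch. It assumes each dataset has already been summarized as a numeric meta-feature vector; the two toy features, their values, the deterministic farthest-first seeding, and the use of the silhouette score to pick the number of clusters are all assumptions made for illustration, not details confirmed by this record.

```python
import math

def farthest_first(points, k):
    """Deterministic seeding (an illustrative choice): start at points[0],
    then repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, iters=100):
    """Lloyd's k-means; returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    labels = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            updated.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else centroids[c]
            )
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        def mean_dist(c):
            ds = [math.dist(p, q) for j, q in enumerate(points)
                  if labels[j] == c and j != i]
            return sum(ds) / len(ds) if ds else None
        a = mean_dist(labels[i])
        if a is None:
            scores.append(0.0)
            continue
        b = min(mean_dist(c) for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical meta-feature vectors, one per dataset (toy values only).
meta_features = [
    (0, 0), (0.5, 0), (0, 0.5),
    (10, 0), (10.5, 0), (10, 0.5),
    (0, 10), (0.5, 10), (0, 10.5),
    (10, 10), (10.5, 10), (10, 10.5),
]

# Try several cluster counts and keep the one with the best silhouette.
best_k, best_score, best_labels = None, -1.0, None
for k in range(2, 7):
    _, labels = kmeans(meta_features, k)
    score = silhouette(meta_features, labels)
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels

print(best_k, round(best_score, 3))
```

On this toy data the four well-separated groups give the highest silhouette score at k = 4. In practice a library implementation with randomized restarts would be the more usual choice; the hand-written version here only keeps the sketch dependency-free.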


Tables/Figures (16)

Figure: Research Model
Table: Characteristics of Dataset Used
Figure: Meta Dataset Collection Process
Figure: Example of Meta Dataset
Figure: Example of Summarized Meta Dataset
Figure: Final Summarized Meta Dataset
Figure: Frequency of Best Class Imbalance Resolution Methods. Abbreviations: RF = Random Forest, LR = Logistic Regression, knn = k-Nearest Neighbor, SVM = Support Vector Machine, ANN = Artificial Neural Network, NB = Naïve Bayes, Adasyn = Adaptive Synthetic Sampling, CNN = Condensed Nearest Neighbor, ENN = Edited Nearest Neighbor, SMOTE = Synthetic Minority Over-sampling Technique, NCR = Neighborhood Cleaning Rule, Tomek = Tomek Link, ROS = Random Over Sampling, RUS = Random Under Sampling, origin = no sampling method
Figure: Frequency of Best Classification Algorithms
Figure: Boxplot of Classification Performance
Figure: Optimal Number of Clusters
Figure: Silhouette Score Plot
Table: Result of ANOVA (F1)
Table: Results of ANOVA (HHI)
Table: Cluster Characteristics
Table: Paired t-test (F1-Score)
Figure: Results of Dataset Personality Segmentation
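
The list above names HHI (the Herfindahl-Hirschman Index) and ROS (Random Over Sampling) without defining how they are computed. The sketch below is one common reading, not the paper's confirmed procedure: it assumes HHI is applied to the class distribution as the sum of squared class shares, and that ROS duplicates randomly chosen minority rows until every class matches the largest one. The function names and toy labels are hypothetical.

```python
import random
from collections import Counter

def hhi(labels):
    """Sum of squared class shares: 1/k for k balanced classes, near 1 when one class dominates."""
    counts = Counter(labels)
    n = len(labels)
    return sum((c / n) ** 2 for c in counts.values())

def random_oversample(rows, labels, seed=0):
    """ROS sketch: duplicate random rows of each smaller class until it matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Toy imbalanced dataset: 90 negative rows and 10 positive rows.
rows = list(range(100))
labels = ["neg"] * 90 + ["pos"] * 10

hhi_before = hhi(labels)
balanced_rows, balanced_labels = random_oversample(rows, labels)
hhi_after = hhi(balanced_labels)

print(round(hhi_before, 2), round(hhi_after, 2))
```

Under these assumptions, the index falls from 0.82 for the 90/10 split to 0.5 once the two classes are balanced, which is the minimum for a binary problem.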
