[Paper] A Recidivism Prediction Model Based on XGBoost Considering Asymmetric Error Costs

원하람; 심재승; 안현철

doi:10.13088/jiis.2019.25.1.127


Journal of Intelligence and Information Systems (지능정보연구), v.25 no.1, 2019, pp.127-137

원하람 (Graduate School of Business IT, Kookmin University), 심재승 (Graduate School of Business IT, Kookmin University), 안현철 (Graduate School of Business IT, Kookmin University)

Abstract (translated from Korean)

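Recidivism prediction has been studied steadily by experts since before the 1970s, but its importance has grown as crimes committed by repeat offenders have steadily increased. Research on recidivism prediction became especially active from the 1990s, when the United States and Canada adopted recidivism risk assessment reports as a decisive criterion in trials and parole hearings, and empirical research on recidivism factors began in Korea around the same time. Most recidivism prediction studies to date have concentrated on analyzing recidivism factors or on improving prediction accuracy. However, because recidivism prediction has an asymmetric error cost structure, research that minimizes the misclassification cost while maximizing prediction accuracy is also important in some cases. In general, the cost of misclassifying a person who will not reoffend as one who will is lower than the cost of misclassifying a person who will reoffend as one who will not. The former only increases monitoring costs, whereas the latter causes enormous social and economic costs from the crime that occurs. Reflecting this cost asymmetry, this study proposes an XGBoost-based recidivism prediction model that considers asymmetric error costs. In the first step of the model, XGBoost, an ensemble technique that has recently drawn attention for its high performance in data mining, was applied, and its results were compared with various prediction techniques such as logistic regression analysis, decision trees, artificial neural networks, and support vector machines. In the next step, the threshold is optimized to minimize the total misclassification cost, which is the weighted average of FNE (False Negative Error) and FPE (False Positive Error). Finally, to verify the usefulness of the model, it was applied to a real recidivism prediction dataset, confirming that the XGBoost model not only shows better prediction accuracy than the comparison models but also lowers the misclassification cost most effectively.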

Abstract

Recidivism prediction has been a subject of continuous research by experts since the early 1970s, and it has become more important as crimes committed by recidivists steadily increase. In particular, research on recidivism prediction became more active in the 1990s, after the US and Canada adopted the 'Recidivism Risk Assessment Report' as a decisive criterion in trials and parole screening. In the same period, empirical studies on recidivism factors began in Korea as well. Although most recidivism prediction studies have so far focused on the factors of recidivism or on prediction accuracy, it is also important to minimize the misclassification cost, because recidivism prediction has an asymmetric error cost structure. In general, the cost of misclassifying people who will not reoffend as likely recidivists is lower than the cost of misclassifying people who will reoffend as non-recidivists: the former only adds monitoring costs, while the latter incurs substantial social and economic costs. Therefore, in this paper, we propose an XGBoost (eXtreme Gradient Boosting; XGB) based recidivism prediction model that considers the asymmetric error cost. In the first step of the model, XGB, which is recognized as a high-performance ensemble method in the field of data mining, was applied, and its results were compared with various prediction models such as LOGIT (logistic regression analysis), DT (decision trees), ANN (artificial neural networks), and SVM (support vector machines). In the next step, the threshold is optimized to minimize the total misclassification cost, which is the weighted average of FNE (False Negative Error) and FPE (False Positive Error). To verify the usefulness of the model, it was applied to a real recidivism prediction dataset. As a result, it was confirmed that the XGB model not only showed better prediction accuracy than the other prediction models but also reduced the misclassification cost most effectively.

Tables/Figures (7)

Figure: Flow Chart of the Research Model
Table: Candidate Independent Variables
Table: Selected Independent Variables Applied to the Model
Table: Experimental Results for Each Classification Method
Table: Two-Sample Test for Proportions (Z-values)
Table: Comparison of Results of Fixed and Optimized Classification Threshold
Figure: Comparison of Total Social Cost Using Fixed and Optimized Threshold

AI summary of the full text

* The sentences below were identified automatically by AI and may include unsuitable ones; please use them with care.

Proposed method

After comparing the results of XGB, an experiment is conducted to reflect the asymmetric error cost in the model. The experimental method is as follows.
Bagging, a method developed by Breiman, creates multiple copies of a learning algorithm, trains each of them, and combines the results (Breiman, 1996). Boosting constructs a committee of weak learners that lowers the error rate in classification and the prediction error in regression. It works by iteratively constructing weak learners whose training sets are conditioned on the performance of the previous members of the ensemble (Sharkey, 1999).
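The bagging idea described above can be illustrated with a minimal sketch. It is not taken from the paper: the one-dimensional threshold "stump" learner, the function names (`fit_stump`, `bagging_fit`, `bagging_predict`), and the toy data are all assumptions made for illustration. Each copy of the learner is trained on a bootstrap resample, and the copies vote on the final class.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Hypothetical weak learner: pick the threshold t that minimizes the
    training errors of the rule 'predict 1 if x >= t, else 0'."""
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        err = sum((x >= t) != bool(y) for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds


def bagging_predict(thresholds, x):
    """Combine the copies by majority vote (n_models is odd, so no ties)."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Toy data (an assumption, not the paper's dataset): class 1 iff x >= 10.
xs = list(range(20))
ys = [int(x >= 10) for x in xs]
ensemble = bagging_fit(xs, ys)
print(bagging_predict(ensemble, 15), bagging_predict(ensemble, 3))
```

Boosting differs in that the learners are built sequentially rather than independently, with each new learner depending on how the earlier ones performed.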
From a theoretical point of view, this study contributes by reflecting the asymmetric error cost in recidivism prediction and by applying XGB, a recent classification method, to recidivism prediction so that the social cost is taken into account. From a practical point of view, the proposed model can serve as a reference for criminal sentencing or parole review, making it possible to respond proactively to the potential problem of recidivism.
In the field of data analysis, crime prediction is a very interesting subject in that it takes a scientific approach to a wide variety of data. Crime prediction spans various research fields, but this study focuses on predicting criminal recidivism. There is no consistent definition of recidivism, but it is generally defined as "reengaging in criminal behavior after receiving a sanction or intervention" (King and Elderbroom, 2014).
In that sense, proactive prevention is much more efficient and effective than reacting after the fact. Therefore, this paper investigated crime prediction, a proactive response based on data analysis, and focused on recidivism prediction in particular. In addition, in view of the social cost of recidivism, the study focused on the asymmetric error cost, a concept mainly used in research on detection models.
This study proposed a novel recidivism prediction model that considers the asymmetric error cost structure. Using an open dataset from the ICPSR, we applied the XGB model to recidivism prediction and compared it with other statistical and machine learning classification methods to verify that XGB is the best model in terms of prediction accuracy. We then searched for the optimal classification threshold that minimizes the total cost, which is a weighted average of FPE and FNE.
Using the 15 final variables selected in [Table 2], we applied the XGB model and compared the results with the LOGIT, DT, ANN, and SVM models presented above. In [Table 3], which shows the results of the classification models, the validation dataset accuracy of the XGB model was the highest at 69.

Target data

The data used in this study consist of information on prisoners released from North Carolina prisons in the United States from July 1, 1978 to June 30, 1979, and were collected from the ICPSR (Inter-university Consortium for Political and Social Research) website. To build the model, a dataset of 13,002 records was assembled with a 1:1 ratio (6,501:6,501) of recidivists to non-recidivists.
First, a dataset was built with a 1:1 ratio of recidivist to non-recidivist records. After preprocessing, the dataset was split into training and validation datasets, and XGBoost (XGB) was applied. To validate the performance of XGB, its results were compared with those of a statistical model, logistic regression analysis (LOGIT), and of three machine learning models: decision trees (DT), artificial neural networks (ANN), and support vector machines (SVM).

Theory/Model

XGBoost, short for "eXtreme Gradient Boosting", which was used in this study, is a decision-tree algorithm that uses a boosting method to reduce the error by combining several CARTs (classification and regression trees).
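To make the idea of reducing the error by adding trees one after another concrete, here is a minimal sketch of plain gradient boosting with squared loss and two-leaf regression trees (stumps). It is an illustration only, not XGBoost itself, which adds a regularized objective, second-order gradient information, and many engineering refinements. The function names, the learning rate of 0.3, the 50 rounds, and the toy data are all assumptions.

```python
def fit_regression_stump(xs, residuals):
    """Fit a two-leaf tree on one feature by minimizing squared error.
    Assumes xs contains at least two distinct values."""
    best = None
    for t in sorted(set(xs))[1:]:  # both leaves are non-empty for these splits
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1:]  # (threshold, left leaf value, right leaf value)


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Each round fits a stump to the current residuals (the negative
    gradient of the squared loss) and adds a shrunken copy to the model."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_regression_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x < t else rv)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the shrunken outputs of all stumps."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lv if x < t else rv for t, lv, rv in stumps)


# Toy data (an assumption): the target is 1 for x >= 5 and 0 otherwise.
xs = list(range(10))
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
model = boost(xs, ys)
print(round(predict(model, 2), 3), round(predict(model, 7), 3))
```

Each stump on its own is a weak predictor; the ensemble improves because every new stump is fitted to whatever error the previous ones left behind.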

Performance/Effect

After this, a two-sample test for proportions was performed to determine whether the differences in prediction accuracy between XGB and the other methods are statistically significant. The null hypothesis (H0) for this test is P_A = P_B, and the alternative hypothesis (Ha) is P_A > P_B (P_A: the average prediction accuracy of Model A on the validation datasets).
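A one-sided two-sample test for proportions of this kind can be sketched as follows. The pooled-variance form of the Z statistic is an assumption, since the extracted text does not say which variant the authors used, and the function name and the counts in the usage lines are hypothetical rather than the paper's results.

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """One-sided test of H0: P_A = P_B against Ha: P_A > P_B.

    Uses the pooled-proportion standard error (an assumption). Returns the
    Z value and the one-sided p-value. Not defined when the pooled
    proportion is exactly 0 or 1.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of N(0, 1)
    return z, p_value


# Hypothetical counts: model A classifies 700 of 1,000 validation cases
# correctly and model B classifies 650 of 1,000 correctly.
z, p = two_proportion_z_test(700, 1000, 650, 1000)
print(f"z = {z:.3f}, one-sided p = {p:.4f}")
```

H0 is rejected at the 5% level when the one-sided p-value is below 0.05, and at the 1% level when it is below 0.01.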
[Table 4] presents the results of the two-sample test for proportions. As the table shows, the XGB model outperformed LOGIT and SVM at the 5% significance level and DT and ANN at the 1% significance level. The XGB model was therefore confirmed as the best-performing model, and the experiments reflecting the asymmetric error costs were performed with it.
... '0.5' is used as the classification threshold, but this is unlikely to be optimal in terms of total cost, since it does not consider the difference between the costs incurred by FPE and FNE. The optimal classification threshold varies with the relative cost of FPE and FNE, and FNE was weighted more heavily in this experiment because an FNE is usually more costly than an FPE. Thus, 10 scenarios were set up in which the weight of FNE ranges from 1 to 10 times that of FPE, and for the selected classification model, the classification threshold that minimizes the total cost and the corresponding FPE and FNE values were found.
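The scenario experiment can be sketched as a simple grid search over thresholds. Everything concrete here is an assumption for illustration: the cost formula (weighted error counts divided by the number of cases), the 0.01 grid, the tie-break toward 0.5, the function names, and the toy scores, which are not the paper's data.

```python
def total_cost(probs, labels, threshold, fn_weight, fp_weight=1.0):
    """Weighted misclassification cost for the rule 'predict recidivism
    (1) if the predicted probability >= threshold'."""
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return (fn_weight * fn + fp_weight * fp) / len(labels)


def best_threshold(probs, labels, fn_weight, grid=None):
    """Return the grid threshold with the lowest total cost.
    Ties are broken in favor of the threshold closest to 0.5."""
    grid = grid or [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda t: (total_cost(probs, labels, t, fn_weight),
                                    abs(t - 0.5)))


# Toy predicted probabilities and true labels (1 = recidivist).
probs = [0.10, 0.20, 0.30, 0.40, 0.45, 0.55, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Ten scenarios: an FNE costs 1x to 10x as much as an FPE.
for w in range(1, 11):
    t = best_threshold(probs, labels, fn_weight=w)
    fixed = total_cost(probs, labels, 0.5, w)
    tuned = total_cost(probs, labels, t, w)
    print(f"FNE weight {w:2d}: threshold {t:.2f}, "
          f"cost fixed {fixed:.2f} vs optimized {tuned:.2f}")
```

As the weight on FNE grows, the search tends to move the threshold below 0.5, accepting more false positives in exchange for fewer missed recidivists.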

Future research

In addition, although the cost of recidivism may differ by type of crime, our study did not consider this. Future work should therefore consider developing separate recidivism prediction models for each type of crime.

References (13)

Breiman, L., "Bagging Predictors," Machine Learning, Vol.24, No.2(1996), 123-140.
Chen, T., and C. Guestrin, "XGBoost: A scalable tree boosting system," Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (2016).
Joo, D., Hong, T., and I. Han, "The neural network models for IDS based on the asymmetric costs of false negative errors and false positive errors," Expert Systems with Applications, Vol.25(2003), 69-75.
Jung, S., "A Study on the Use of Big data in Criminal Law," Journal of Public Policy Studies, Vol.29, No.2(2012), 161-184.
King, R. S., and B. Elderbroom, Improving recidivism as a performance measure, Washington, DC: Urban Institute, 2014.
Lee, H.-U., and H. Ahn, "An intelligent intrusion detection model based on support vector machines and the classification threshold optimization for considering the asymmetric error cost," Journal of Intelligence and Information Systems, Vol.17, No.4(2011), 157-173.
Nam, S., and S. Park, "Study on recidivism factors of prisoners," Corrections Review, Vol.50(2011), 115-139.
New York Times, Recidivism's high cost and a way to cut it, 2011, Available at https://www.nytimes.com/2011/04/28/opinion/28thu3.html (Accessed 21 January 2019).
Prison Education News, The Cost of Recidivism: Victims, the Economy, and American Prisons, 2014, Available at https://prisoneducation.com/prison-education-news/the-cost-of-recidivism-victims-the-economy-and-american-pris-html (Accessed 21 January 2019).
Schmidt, P., and A. D. Witte, "Predicting criminal recidivism using 'Split Population' survival time models," Journal of Econometrics, Vol.40, No.1(1989), 141-159.
Seong, H. G., "Methods and tasks in the prediction of criminal recidivism," Proceeding of the 2006 Annual Conference of Korean Psychological Association, (2006), 404-405.
Sharkey, A. J. (Ed.), Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, Springer Science & Business Media, 2012.
Turgut, O., "Predicting recidivism through machine learning," Ph.D. dissertation, University of Texas at Dallas, 2017.
