
An Analytical Study on Automatic Classification of Domestic Journal Articles Using Random Forest
(랜덤포레스트를 이용한 국내 학술지 논문의 자동분류에 관한 연구)

정보관리학회지 = Journal of the Korean Society for Information Management, v.36 no.2 (no.112), 2019, pp. 57-77

Kim, Pan Jun (Department of Library and Information Science, Silla University)

Abstract

Random Forest (RF), a representative ensemble technique, was applied to the automatic classification of journal articles in the field of library and information science. In particular, various experiments were conducted on the main factors affecting classification performance when automatically assigning subject categories to domestic journal articles, including the number of trees, feature selection, and training set size. Through these experiments, ways to optimize the performance of RF on the imbalanced datasets found in real-world environments were explored. The results show that, for the automatic classification of domestic journal articles, the best classification performance can be expected when RF uses a tree count in the range 100-1000 (C), a small feature set (10%) selected by the chi-square statistic (CHI), and most of the training set (9-10 years).
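The setup the abstract describes — chi-square feature selection over term features followed by a Random Forest with a tree count in the 100-1000 range — can be sketched as below. This is a minimal illustration using scikit-learn and synthetic data, not the paper's actual corpus or implementation; the specific parameter values are assumptions chosen to match the ranges the abstract reports.

```python
# Hypothetical sketch: chi-square (CHI) feature selection keeping the
# top 10% of features, then a Random Forest classifier. The data is a
# synthetic stand-in for a term-document matrix.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=600, n_features=200,
                           n_informative=30, n_classes=4,
                           n_clusters_per_class=1, random_state=0)
X = X - X.min()  # chi2 requires non-negative feature values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = Pipeline([
    # keep the top 10% of features by chi-square score
    ("select", SelectKBest(chi2, k=X.shape[1] // 10)),
    # tree count within the 100-1000 range the abstract recommends
    ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
])
clf.fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.3f}")
```

On a real corpus, the term-document matrix would come from a vectorizer over the article texts, and the tree count and feature ratio would be tuned over the grids the paper's experiments cover.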


Tables/Figures (11)



Open Access (OA) type: BRONZE

Articles that the publisher or academic society makes freely readable on its own site, either temporarily as a promotion or after a set embargo period.
