
An Analytical Study on Automatic Classification of Domestic Journal Articles Using Random Forest
(랜덤포레스트를 이용한 국내 학술지 논문의 자동분류에 관한 연구)

정보관리학회지 = Journal of the Korean Society for Information Management, v.36 no.2 (no.112), 2019, pp. 57-77

Kim, Pan Jun (Department of Library and Information Science, Silla University)

Abstract

Random Forest (RF), a representative ensemble technique, was applied to the automatic classification of journal articles in the field of library and information science. In particular, various experiments were conducted on the main factors affecting classification performance when automatically assigning subject categories to domestic journal articles, including the number of trees, feature selection, and training set size. Through these experiments, ways to optimize the performance of RF on the imbalanced datasets found in real-world environments were explored. The results show that, for the automatic classification of domestic journal articles, the best classification performance can be expected when RF uses a tree count in the range 100-1000 (C), a small feature set (10%) selected by the chi-square statistic (CHI), and most of the training set (9-10 years).
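The setup the abstract describes — chi-square feature selection over term features followed by a Random Forest with a tree count in the 100-1000 range — can be sketched as below. This is a minimal illustration using scikit-learn and synthetic data, not the paper's actual corpus or implementation; the specific parameter values are assumptions chosen to match the ranges the abstract reports.

```python
# Hypothetical sketch: chi-square (CHI) feature selection keeping the
# top 10% of features, then a Random Forest classifier. The data is a
# synthetic stand-in for a term-document matrix.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=600, n_features=200,
                           n_informative=30, n_classes=4,
                           n_clusters_per_class=1, random_state=0)
X = X - X.min()  # chi2 requires non-negative feature values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = Pipeline([
    # keep the top 10% of features by chi-square score
    ("select", SelectKBest(chi2, k=X.shape[1] // 10)),
    # tree count within the 100-1000 range the abstract recommends
    ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
])
clf.fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.3f}")
```

On a real corpus, the term-document matrix would come from a vectorizer over the article texts, and the tree count and feature ratio would be tuned over the grids the paper's experiments cover.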


Tables/Figures (11)



Open Access (OA) type: BRONZE

Articles that the publisher or academic society makes freely readable on its own site, either temporarily as a promotion or after a set embargo period.
