
자질선정을 통한 국내 학술지 논문의 자동분류에 관한 연구
An Experimental Study on the Automatic Classification of Korean Journal Articles through Feature Selection

정보관리학회지 = Journal of the Korean Society for Information Management, v.39 no.1, 2022, pp. 69-90

Kim, Pan Jun (Department of Library and Information Science, Silla University)

Abstract

This study sought an efficient way to assign standardized subject categories (controlled keywords) to individual journal articles, as basic data for systematically supporting and evaluating R&D activities and for setting current and future research directions by identifying specific trends in domestic academic research. To this end, in the course of automatically assigning classification categories from the National Research Foundation of Korea's Academic Research Classification Scheme to domestic journal articles, multifaceted experiments were conducted on the main factors that affect classification performance, with a focus on feature selection techniques. The results showed that for the automatic classification of domestic journal articles, which form an imbalanced dataset as encountered in real-world settings, a fairly good level of performance can be expected from simpler classifiers and feature selection techniques together with a relatively small training set.
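The feature selection step at the heart of these experiments can be illustrated with a minimal sketch. The snippet below implements chi-square term scoring, one of the standard filter-type feature selection metrics commonly compared in this line of work; the toy corpus, term names, and the choice of chi-square here are illustrative assumptions, not the paper's own experimental setup. It ranks each term by its association with a class in a 2x2 contingency table and keeps the top-k terms.

```python
from collections import Counter

def chi_square_scores(docs, labels):
    """Score each term by its chi-square association with the positive class.

    docs: list of token lists; labels: list of 0/1 class labels.
    Uses the standard 2x2 contingency formula:
    chi2 = N * (AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)).
    """
    n = len(docs)
    n_pos = sum(labels)
    df_pos = Counter()  # number of positive-class docs containing each term
    df_all = Counter()  # number of docs (any class) containing each term
    for tokens, y in zip(docs, labels):
        for t in set(tokens):  # document frequency, so count each term once per doc
            df_all[t] += 1
            if y == 1:
                df_pos[t] += 1
    scores = {}
    for t, dfa in df_all.items():
        a = df_pos[t]               # positive docs with the term
        b = dfa - a                 # negative docs with the term
        c = n_pos - a               # positive docs without the term
        d = (n - n_pos) - b         # negative docs without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[t] = 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom
    return scores

def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square scores."""
    scores = chi_square_scores(docs, labels)
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

# Toy imbalanced corpus: class 1 is the minority class (2 of 6 documents).
docs = [["svm", "kernel", "text"], ["svm", "margin", "text"],
        ["library", "catalog", "text"], ["library", "archive", "text"],
        ["library", "catalog", "index"], ["archive", "index", "text"]]
labels = [1, 1, 0, 0, 0, 0]
top = select_features(docs, labels, 2)  # top == ["svm", "library"]
```

Filter metrics like this are computed once per term, independent of the classifier, which is why they pair naturally with the simpler classifiers and small training sets the study found sufficient; a library such as scikit-learn offers the same idea as `SelectKBest(chi2)`.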


Tables/Figures (18)

References (49)


