[Paper] A Study on the Performance Factors of Automatic Classification Based on Machine Learning

Kim, Pan Jun (김판준)

doi:10.3743/kosim.2016.33.2.033

English title: An Analytical Study on Performance Factors of Automatic Classification based on Machine Learning

Journal of the Korean Society for Information Management (정보관리학회지), v.33 no.2 (no.100), 2016, pp.33-59

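Abstract (translated from the Korean)

This study examined the factors that affect the performance of machine-learning-based automatic classification, using a document collection composed of domestic (Korean) conference papers. In particular, using the Rocchio algorithm, which is easy to implement and fast to compute, it investigated through a range of experiments the characteristics of the key factors (classifier construction method, training set size, weighting scheme, and category assignment method) in terms of the performance of automatically assigning subject categories to papers in the "Proceedings of the Conference of the Korean Society for Information Management". The results show that it is effective to set the parameters (${\beta}$, ${\lambda}$) and the training set size (5 years or more) appropriately for the classification environment and the characteristics of the document collection, and that, at an equivalent level of performance, classification efficiency can be improved by using a simpler single weighting scheme. In addition, because multi-label classification, in which one or more categories are assigned to a given paper, matches the real setting for classifying domestic conference papers, an optimal classification model needs to be developed with this setting in mind, based on the characteristics of the key performance factors.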

Abstract (English)

This study examined the factors affecting the performance of machine-learning-based automatic classification of domestic conference papers. In particular, with respect to the performance of automatically assigning class labels to papers in the Proceedings of the Conference of the Korean Society for Information Management using the Rocchio algorithm, I investigated the characteristics of the key factors (classifier formation methods, training set size, weighting schemes, and label assignment methods) through a range of experiments. The results indicate that it is effective to apply appropriate parameters (${\beta}$, ${\lambda}$) and training set size (more than 5 years) according to the classification environment and the properties of the document set, and that, when performance is equivalent, using simpler methods (single weighting schemes) improves efficiency. Also, because the classification of domestic conference papers corresponds to multi-label classification, in which one or more labels are assigned to an article, it is necessary to develop an optimal classification model based on the characteristics of the key factors with this environment in mind.
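To make the terms in the abstract concrete, the following is a minimal sketch of a Rocchio-style centroid classifier with multi-label assignment. It is not the paper's implementation. This record does not define ${\beta}$ or ${\lambda}$, so the sketch uses the textbook formulation instead: `beta` weights the mean of the positive training vectors and `gamma` weights the mean of the negative ones. The default values 16 and 4, the plain term-frequency weighting, the relative `threshold` rule for assigning more than one label, and the toy documents and labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tf_vector(tokens):
    """L2-normalized term-frequency vector stored as a sparse dict."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: v / norm for t, v in counts.items()} if norm else {}


def mean_vector(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    n = len(vectors)
    return {t: w / n for t, w in total.items()} if n else {}


def train_rocchio(docs, beta=16.0, gamma=4.0):
    """Build one prototype per label from (tokens, set_of_labels) pairs.

    beta and gamma are illustrative defaults, not values from the paper.
    """
    vectors = [(tf_vector(tokens), labels) for tokens, labels in docs]
    all_labels = set().union(*(labels for _, labels in vectors))
    prototypes = {}
    for label in all_labels:
        pos = mean_vector([v for v, ls in vectors if label in ls])
        neg = mean_vector([v for v, ls in vectors if label not in ls])
        proto = {}
        for term in set(pos) | set(neg):
            w = beta * pos.get(term, 0.0) - gamma * neg.get(term, 0.0)
            if w > 0:  # negative weights are clipped to zero
                proto[term] = w
        prototypes[label] = proto
    return prototypes


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def assign_labels(prototypes, tokens, threshold=0.5):
    """Multi-label rule (assumed): keep labels scoring >= threshold * best."""
    vec = tf_vector(tokens)
    scores = {label: cosine(vec, p) for label, p in prototypes.items()}
    best = max(scores.values(), default=0.0)
    if best <= 0.0:
        return []
    return sorted(l for l, s in scores.items() if s >= threshold * best)


# Toy training set; the documents and the labels "IR" and "LIB" are hypothetical.
docs = [
    (["retrieval", "query", "ranking"], {"IR"}),
    (["retrieval", "index", "query"], {"IR"}),
    (["library", "catalog", "metadata"], {"LIB"}),
    (["library", "metadata", "retrieval"], {"LIB", "IR"}),
]
model = train_rocchio(docs)
print(assign_labels(model, ["query", "ranking"]))
print(assign_labels(model, ["library", "metadata", "retrieval"]))
```

Setting `threshold=1.0` reduces the rule to single-label assignment (only the best-scoring category is kept), which is one simple way to compare the single-label and multi-label settings the abstract contrasts.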


