[Paper] A Study on Book Categorization in Social Sciences Using kNN Classifiers and Table of Contents Text

Lee, Yong-Gu

doi:10.3743/kosim.2020.37.1.001


Journal of the Korean Society for Information Management (정보관리학회지), v.37 no.1, 2020, pp.1-21

Lee, Yong-Gu (Department of Library and Information Science, Keimyung University)

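Abstract (translated from the Korean)

This study applied automatic classification based on table of contents information to 6,253 social science books drawn from a university library's list of new arrivals. The kNN algorithm was used as the classifier, and the divisions of DDC class 300 that the library had assigned to the books were used as the classification categories. Book titles and tables of contents served as classification features, and the tables of contents were obtained from an Internet bookstore through its Open API. The experiments showed that table of contents features improve both classification recall and classification precision. As a rich source of features, the table of contents was also found to mitigate the overfitting problem of imbalanced data. The study further found that law and education are highly specific within the social sciences, so title features alone yield good classification performance for them.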

Abstract (English)

This study applied automatic classification using table of contents (TOC) text to 6,253 social science books from a university library's list of newly arrived books. The k-nearest neighbors (kNN) algorithm was used as the classifier, and the ten second-level divisions of DDC main class 300, as assigned to the books by the library, were used as classes (labels). The features were keywords extracted from the titles and TOCs of the books. The TOCs were obtained from an Internet bookstore through its Open API. The results showed that TOC features improve both classification recall and precision. With its rich features, the TOC also reduced the overfitting problem of imbalanced data. Law and education have high topic specificity within the social sciences, so title features alone can yield good classification performance in these fields.
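
The classification setup described in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the whitespace tokenizer, the cosine similarity measure, the majority vote, the value k=3, and the toy book records are all assumptions made for illustration. Only the general idea (a kNN classifier over words from titles and TOCs, with DDC divisions such as 340 for law and 370 for education as labels) comes from the abstract.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real keyword extraction."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, query_text, k=3):
    """Return the majority label among the k training books most similar to the query."""
    query = Counter(tokenize(query_text))
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy records: (title + TOC text, DDC division). Not data from the study.
BOOKS = [
    ("contract law cases chapter torts liability courts", "340"),
    ("criminal law procedure evidence courts sentencing", "340"),
    ("constitutional law rights courts judicial review", "340"),
    ("curriculum design classroom teaching assessment students", "370"),
    ("teacher education schools learning students pedagogy", "370"),
    ("higher education policy universities students teaching", "370"),
]
TRAIN = [(Counter(tokenize(text)), label) for text, label in BOOKS]

predicted = knn_predict(TRAIN, "introduction to evidence and liability in the courts", k=3)
print(predicted)
```

Because kNN keeps the training examples and compares each new book against them directly, adding TOC words simply enlarges each vector; no retraining step is needed in this sketch.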

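Q&A

Keyword	Question	Answer extracted from the paper
	What are the advantages of the kNN algorithm?	To examine whether the approach is usable in a library environment, the DDC (Dewey Decimal Classification) class numbers assigned to the books were used as the categories for automatic classification. As the classifier, the study applied the kNN (k-Nearest Neighbor) algorithm, which is widely used for automatic text classification and is relatively easy to understand and simple to implement.
	Why do Internet bookstores build and provide table of contents information?	Many Korean university libraries also provide table of contents information, directly or indirectly, on their websites. Internet bookstores likewise actively build and provide table of contents information, apparently because it helps buyers decide whether to purchase a book.
	Why are appropriate features needed to automatically classify or categorize an object by machine?	Automatically classifying or categorizing an object by machine generally requires appropriate features, because good features directly affect classification performance. For example, when full-text documents are to be classified automatically into predefined categories (subjects), the many words that appear in those texts are generally used as features.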
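
The last answer above notes that the words occurring in a text are generally used as features. A minimal sketch of that idea follows, assuming whitespace tokenization and a simple document-frequency cutoff; the function names, the `min_df` parameter, and the example texts are hypothetical and are not taken from the paper.

```python
from collections import Counter


def build_vocabulary(docs, min_df=1):
    """Collect terms that appear in at least min_df documents, in sorted order."""
    df = Counter()
    for doc in docs:
        # Count each term once per document to get its document frequency.
        df.update(set(doc.lower().split()))
    return sorted(term for term, count in df.items() if count >= min_df)


def to_vector(doc, vocabulary):
    """Map a document to term counts aligned with the vocabulary order."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical example texts, not taken from the study's data.
DOCS = [
    "law courts law",
    "education schools",
    "law education policy",
]
VOCAB = build_vocabulary(DOCS, min_df=2)
VECTORS = [to_vector(doc, VOCAB) for doc in DOCS]
print(VOCAB)
print(VECTORS)
```

Raising `min_df` drops rare terms and shortens every vector, which is one simple way to trade vocabulary size against coverage.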
