[Paper] An Automated Industry and Occupation Coding System using Deep Learning

Jungwoo Lim; Hyeonseok Moon; Chanhee Lee; Chankyun Woo; Heuiseok Lim

doi:10.15207/jkcs.2021.12.4.023

Journal of the Korea Convergence Society (한국융합학회논문지), v.12 no.4, 2021, pp.23-30

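Jungwoo Lim (Department of Computer Science, Korea University), Hyeonseok Moon (Department of Computer Science, Korea University), Chanhee Lee (Department of Computer Science, Korea University), Chankyun Woo (Survey System Management Division, Statistics Korea), Heuiseok Lim (Department of Computer Science, Korea University)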

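Abstract (Korean version)

This automated industry and occupation coding system automatically assigns statistical classification codes to the large volume of natural language data in which survey respondents describe their industry and occupation. Unlike existing information-retrieval-based automated coding systems, this study proposes a deep learning system that requires no index DB and can assign codes regardless of classification level. The proposed model applies KoBERT, a deep learning technique specialized for natural language processing, and achieves Top 10 accuracies of 95.65%, 91.45%, and 97.66% on industry and occupation code classification for the Population and Housing Census and on industry code classification for the Census on Basic Characteristics of Establishments, respectively. Following the experiments, the paper analyzes possible future improvements from the data and modeling perspectives.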

Abstract

An automated industry and occupation coding system assigns statistical classification codes to the enormous amount of natural language data collected from survey respondents who describe their industry and occupation. Unlike previous studies that applied information retrieval, we propose a system that does not need an index database and assigns the proper code regardless of the level of classification. We also show that our model, which uses KoBERT, a deep learning model that achieves high performance on natural language downstream tasks, outperforms the baseline. Our method achieves 95.65%, 91.51%, and 97.66% on Occupation/Industry Code Classification of the Population and Housing Census and on Industry Code Classification of the Census on Basic Characteristics of Establishments. We also identify future improvements through error analysis with respect to data and modeling.
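
The record reports Top 10 accuracy. As a minimal sketch, assuming the usual definition (a prediction counts as correct when the gold code is among the k highest-scored codes), the metric could be computed as below. The function name `top_k_accuracy`, the toy scores, and the class indices are illustrative assumptions, not taken from the paper, and the paper's actual evaluation code is not shown in this record.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   gold: Sequence[int],
                   k: int = 10) -> float:
    """Fraction of examples whose gold class index is among the k highest-scored classes."""
    if len(scores) != len(gold):
        raise ValueError("scores and gold must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    hits = 0
    for row, label in zip(scores, gold):
        # Rank class indices by descending score and keep the first k.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(gold)


# Toy data (hypothetical): 3 survey responses scored over 4 candidate codes.
toy_scores = [
    [0.10, 0.60, 0.20, 0.10],  # gold code 1 is ranked first
    [0.40, 0.30, 0.20, 0.10],  # gold code 2 is ranked third
    [0.05, 0.03, 0.12, 0.80],  # gold code 0 is ranked third
]
toy_gold = [1, 2, 0]

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, top_k_accuracy(toy_scores, toy_gold, k=k))
```

With thousands of candidate codes, a Top 10 metric of this kind measures how often the correct code appears in a short candidate list, which is a looser criterion than exact (Top 1) match.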

Tables/Figures (7)

Fig. 1. Proposed Model for Automated Industry and Occupation Coding System
Table 1. Data Statistics
Table 2. Population and Housing Census - Industry Code
Table 3. Population and Housing Census - Occupation Code
Table 4. Census on Basic Characteristics of Establishments - Industry Code
Table 5. Data Error - Census on Basic Characteristics of Establishments
Table 6. Model Prediction Error - Census on Basic Characteristics of Establishments
