[Paper] A Study on Fine-Tuning and Transfer Learning to Construct a Binary Sentiment Classification Model for Korean Text

김종수

doi:10.9723/jksiis.2023.28.5.015

A Study on Fine-Tuning and Transfer Learning to Construct Binary Sentiment Classification Model in Korean Text

Journal of the Korea Industrial Information Systems Research (한국산업정보학회논문지), v.28 no.5, 2023, pp.15-30

김종수 (Corporate Research Institute, (주)골드브릿지)

Abstract

Generative models built on the Transformer architecture, such as ChatGPT, have recently attracted considerable attention. The Transformer is used in many neural network models, including Google's BERT (Bidirectional Encoder Representations from Transformers) language model. This paper proposes a method for building a binary text classifier that decides whether a comment on a Korean movie review is positive or negative: a publicly released, pre-trained multilingual BERT model is fine-tuned and then transfer-learned on a new Korean training dataset. The starting point is the pre-trained multilingual BERT-Base model, which covers 104 languages and has 12 layers, a hidden size of 768, 12 attention heads, and 110M parameters. To turn it into a positive/negative comment classifier, the input and output layers of the pre-trained BERT-Base model were fine-tuned, which produced a new model with 178M parameters. The fine-tuned model was then transfer-learned with a maximum input length of 128 words, a batch size of 16, and 5 epochs, using 10,000 training examples and 5,000 test examples; the resulting sentence-level binary sentiment classifier reached an accuracy of 0.9582, a loss of 0.1177, and an F1 score of 0.81. Transfer learning with a dataset five times larger yielded a model with an accuracy of 0.9562, a loss of 0.1202, and an F1 score of 0.86.
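As a rough illustration of the configuration stated in the abstract, the sketch below works out how many optimizer steps the reported settings imply (10,000 training examples, batch size 16, 5 epochs, and the same settings with five times the data) and shows how accuracy and F1 are computed from confusion-matrix counts. The function names and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the paper. The paper also does not say whether a final partial batch is dropped, so the sketch assumes it is kept.

```python
import math
from typing import Dict, Tuple


def training_steps(num_examples: int, batch_size: int, epochs: int) -> Tuple[int, int]:
    """Return (steps per epoch, total steps), counting a final partial batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Settings reported in the abstract: batch size 16, 5 epochs.
    print(training_steps(10_000, 16, 5))   # (625, 3125)
    print(training_steps(50_000, 16, 5))   # (3125, 15625)

    # Hypothetical counts for an imbalanced test set (NOT the paper's data).
    counts = dict(tp=40, fp=5, fn=15, tn=940)
    for name, value in binary_metrics(**counts).items():
        print(f"{name}: {value:.4f}")
```

With these made-up counts, accuracy is 0.98 while F1 is 0.80, because F1 ignores true negatives. This is one way an accuracy figure can sit well above an F1 score; the abstract does not say whether that is the cause of the gap between its own accuracy and F1 values.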

Tables and Figures (22)

Fig. 1 Example of an English text sentiment binary classification model
Fig. 2 Graph of learning and validation accuracy (left), training and validation loss (right)
Fig. 3 RNN (recurrent neural network) model for text binary classification
Fig. 4 Structure of a pre-trained BERT-based model for an English text binary classification
Fig. 5 A high-level overview of the self-instruct
Fig. 6 GPT's proximal policy optimization algorithm
Table 1 List of InstructGPT models provided by OpenAI
Table 2 Key features of artificial intelligence for Korean natural language processing
Fig. 7 A fine tuning model of BERT
Fig. 8 Comparison of model input layers before (top) and after (bottom) refinement
Fig. 9 Comparison of model output layers before (top) and after (bottom) refinement
Fig. 10 Structure of a fine-tuned Korean sentence sentiment classification model
Table 3 Example of model training dataset
Fig. 11 An example of token generation
Fig. 12 Example of embedding the dataset
Fig. 13 BERT embedding procedure (① Subword Token Embedding → ② Segment Embedding → ③ Positional Encoding)
Fig. 14 Predicted results of the model (F1 score = 0.81)
Table 4 Performance comparison according to the amount of learning data
Fig. 15 Predicted results of the model (F1 score = 0.86)
Fig. 16 Tensor flow diagram of the re-fine-tuned model for a 3D graph visualization
Table 5 Dataset of news article comments
Fig. 17 Distribution of prediction result from two text sentiment classification models
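
The input-side steps named in the captions of Fig. 11 to Fig. 13 can be sketched in plain Python. The sketch packs token ids into a fixed-length input ([CLS], the tokens, [SEP], then padding) and forms each position's input vector as the sum of a token, a segment, and a position embedding, as in the original BERT design. The special-token ids, the tiny embedding tables, and the function names are hypothetical placeholders for illustration, not values from the paper; real BERT also applies layer normalization and dropout after the sum, which the sketch omits.

```python
from typing import Dict, List

# Assumed special-token ids, for illustration only.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102


def pack_input(token_ids: List[int], max_len: int = 128) -> Dict[str, List[int]]:
    """Build fixed-length input ids, attention mask, and segment ids for one sentence."""
    body = token_ids[: max_len - 2]           # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    pad = max_len - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * pad,
        "attention_mask": [1] * len(ids) + [0] * pad,  # 0 marks padding
        "segment_ids": [0] * max_len,          # single sentence: all segment 0
    }


def embed(
    input_ids: List[int],
    segment_ids: List[int],
    token_table: Dict[int, List[float]],
    segment_table: List[List[float]],
    position_table: List[List[float]],
) -> List[List[float]]:
    """Sum token, segment, and position embeddings element-wise at each position."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(input_ids, segment_ids)):
        vectors.append([
            t + s + p
            for t, s, p in zip(token_table[tok], segment_table[seg], position_table[pos])
        ])
    return vectors


if __name__ == "__main__":
    packed = pack_input([7, 8, 9], max_len=6)
    print(packed["input_ids"])       # [101, 7, 8, 9, 102, 0]
    print(packed["attention_mask"])  # [1, 1, 1, 1, 1, 0]

    # Toy 2-dimensional embedding tables (hypothetical values).
    token_table = {0: [0.0, 0.0], 7: [1.0, 0.0], 8: [0.0, 1.0],
                   9: [1.0, 1.0], 101: [0.5, 0.5], 102: [0.25, 0.25]}
    segment_table = [[0.1, 0.1], [0.2, 0.2]]
    position_table = [[float(i), 0.0] for i in range(6)]
    vectors = embed(packed["input_ids"], packed["segment_ids"],
                    token_table, segment_table, position_table)
    print(len(vectors), len(vectors[0]))  # 6 2
```

The default `max_len=128` mirrors the maximum input length stated in the abstract; longer inputs are truncated so that [CLS] and [SEP] still fit.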

