[Paper] Development of a Regulatory Q&A System for KAERI Utilizing Document Search Algorithms and Large Language Model

김홍비; 유용균

doi:10.9723/jksiis.2023.28.5.031


Journal of the Korea Industrial Information Systems Research (한국산업정보학회논문지), v.28 no.5, 2023, pp. 31-39

김홍비 (Korea Atomic Energy Research Institute), 유용균 (Korea Atomic Energy Research Institute)

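Abstract (translated from Korean)

Recent advances in natural language processing (NLP), particularly in large language models (LLMs) such as ChatGPT, have spurred active research and development of question-answering (QA) systems for specialized domain knowledge. This paper describes the operating principle of a system that uses a large language model and document search algorithms to understand various documents, such as the regulations of the Korea Atomic Energy Research Institute (KAERI), and to answer users' questions. First, a large number of documents are preprocessed so that they are easy to search and analyze, and their content is divided into paragraphs short enough for the language model to process. The content of each paragraph is converted into a vector with an embedding model and stored in a database; the stored vectors are then compared with the vector extracted from the user's question to retrieve the content most relevant to it. The retrieved paragraphs and the question are given as input to a language generation model, which generates the answer. Testing the system with a variety of questions about internal regulations confirmed that it understands the intent of questions about complex regulations and can provide users with fast and accurate answers.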

Abstract

The evolution of Natural Language Processing (NLP) and the rise of large language models (LLMs) such as ChatGPT have paved the way for specialized question-answering (QA) systems tailored to specific domains. This study outlines a system that combines an LLM with document search algorithms to interpret and answer user inquiries using documents from the Korea Atomic Energy Research Institute (KAERI). First, the system preprocesses multiple documents for search and analysis, breaking the content into manageable paragraphs short enough for the language model to process. Each paragraph's content is converted into a vector via an embedding model and stored in a database. Upon receiving a user query, the system compares the vector extracted from the question with the stored vectors and selects the most pertinent content. The selected paragraphs, combined with the user's query, are then passed to the language generation model to formulate a response. Tests covering a range of questions verified that the system discerns question intent, understands diverse documents, and delivers rapid and precise answers.
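The retrieval steps summarized in the abstract (split into paragraphs, embed, store, compare with the question vector, assemble the model input) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, the in-memory list stands in for the database, the prompt wording is invented, and the two sample regulations are hypothetical. The final call to a language generation model is omitted.

```python
import hashlib
import math
import re

DIM = 512  # size of the toy embedding vectors (arbitrary, assumed value)


def split_into_chunks(text, max_words=40):
    """Split text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text):
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents):
    """Return (chunk, vector) pairs; a real system would use a vector database."""
    return [(chunk, toy_embed(chunk))
            for doc in documents
            for chunk in split_into_chunks(doc)]


def retrieve(question, index, top_k=2):
    """Return the top_k chunks most similar to the question."""
    q_vec = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine_similarity(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question, chunks):
    """Combine retrieved chunks and the question into one model input (hypothetical wording)."""
    context = "\n\n".join(chunks)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")


# Hypothetical regulation snippets, invented for illustration only.
documents = [
    "Annual leave requests must be submitted to the department head five days in advance.",
    "Radiation workers must wear a personal dosimeter inside controlled areas.",
]
index = build_index(documents)
question = "When must annual leave requests be submitted?"
print(build_prompt(question, retrieve(question, index, top_k=1)))
```

Swapping `toy_embed` for a learned embedding model is what would let retrieval match paraphrased questions; the hashed stand-in only rewards shared words.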

Tables/Figures (6)

Fig. 1 The process of KAERI Question-Answering System
Fig. 2 The process of calculating embedding and cosine similarity
Fig. 3 The process of categorizing questions and generating answers to each category
Table 1 Prompt
Fig. 4 Service images for the questions and answers in Table 1-1.
Table 2 Example questions and answers

