[Paper] Proposed Quality Evaluation Criteria for Large Language Models from the Perspective of Research Data

한나은; 서수정; 엄정호

doi:10.3743/kosim.2023.40.3.077

A Proposal of Evaluation of Large Language Models Built Based on Research Data

Journal of the Korean Society for Information Management (정보관리학회지), v.40 no.3, 2023, pp. 77-98

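한나은 (Korea Institute of Science and Technology Information), 서수정 (Korea Institute of Science and Technology Information), 엄정호 (Korea Institute of Science and Technology Information)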

Abstract (Korean)

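Among the large language models proposed to date, this study focuses on the data quality of models that use research data as their main pre-training data, such as LLaMA and LLaMA-based models. It analyzes the current evaluation criteria and proposes quality evaluation criteria from the perspective of research data. To this end, it discusses quality evaluation in terms of validity, functionality, and reliability, three of the data quality evaluation factors, and compares the LLaMA, Alpaca, Vicuna, and ChatGPT models to understand the characteristics and limitations of large language models. To analyze the evaluation criteria now in wide use for large language models, it examines the criteria of Holistic Evaluation of Language Models and then discusses their limitations. On this basis, the study presents quality evaluation criteria for large language models that use research data as their main pre-training data and discusses future development directions. Its significance lies in providing a knowledge base for the future development of large language models.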

Abstract

Large Language Models (LLMs) are becoming the major trend in the natural language processing field. These models were built on research data, but information such as the types, limitations, and risks of the research data used is unknown. This study presents how to analyze and evaluate, from the perspective of research data, LLMs that were built with research data: LLaMA and LLaMA-based models such as Stanford's Alpaca and Vicuna from the Large Model Systems Organization, together with OpenAI's ChatGPT. The quality evaluation focuses on validity, functionality, and reliability, three factors of Data Quality Management (DQM). Furthermore, we examined the Holistic Evaluation of Language Models (HELM) to understand its evaluation criteria and then discussed its limitations. This study presents quality evaluation criteria for LLMs built with research data and discusses future development directions.


