[Paper] Automatic Extraction of References for Research Reports using Deep Learning Language Model

한유경; 최원석; 이민철

doi:10.3743/kosim.2023.40.2.115

딥러닝 언어 모델을 이용한 연구보고서의 참고문헌 자동추출 연구
Automatic Extraction of References for Research Reports using Deep Learning Language Model

정보관리학회지 = Journal of the Korean Society for Information Management, v.40 no.2, 2023, pp.115-135

한유경 (Korea Information Society Development Institute, KISDI), 최원석 (Korea Information Society Development Institute, KISDI), 이민철 (Kakao Enterprise)

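Abstract (translated from the Korean original)

This study aims to efficiently build a reference database for research reports, whose references span many publication types such as books, journal articles, and reports, by comparing the automatic reference extraction performance of deep learning language models. Unlike academic journals, research reports follow formats that differ from institution to institution, which makes automatic reference extraction difficult. In addition to metadata extraction, the task most widely used for automatic reference extraction, this study addresses the problem with a text separation task that isolates references in settings where reference and non-reference phrases are mixed. To build the extraction models, we constructed a set of references taken from the research reports of a specific research institute, a set of journal-type references, and a dataset that merges journal references with non-reference phrases. We then trained the deep learning language models RoBERTa+CRF and ChatGPT and measured their performance on metadata extraction, material type classification, and text separation. The best F1-scores were 95.41% for metadata extraction and 98.91% for material type classification and text separation. Based on these results, we propose directions for choosing deep learning language models and dataset types when building reference databases from research reports that contain non-reference phrases.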

Abstract

The purpose of this study is to assess the effectiveness of using deep learning language models to extract references automatically and create a reference database for research reports in an efficient manner. Unlike academic journals, research reports present difficulties in automatically extracting references due to variations in formatting across institutions. In this study, we addressed this issue by introducing the task of separating references from non-reference phrases, in addition to the commonly used metadata extraction task for reference extraction. The study employed datasets that included various types of references, such as those from research reports of a particular institution, academic journals, and a combination of academic journal references and non-reference texts. Two deep learning language models, namely RoBERTa+CRF and ChatGPT, were compared to evaluate their performance in automatic extraction. They were used to extract metadata, categorize data types, and separate original text. The research findings showed that the deep learning language models were highly effective, achieving maximum F1-scores of 95.41% for metadata extraction and 98.91% for categorization of data types and separation of the original text. These results provide valuable insights into the use of deep learning language models and different types of datasets for constructing reference databases for research reports including both reference and non-reference texts.
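Metadata extraction of the kind described in the abstract is commonly framed as token-level sequence labeling with BIO tags, and this record lists a BIO tagging table among the paper's tables. The following is a minimal sketch of how tagged tokens could be grouped back into metadata fields. The label names (`AUTHOR`, `YEAR`, `TITLE`), the whitespace tokenization, and the function name `decode_bio` are assumptions made for illustration; they are not taken from the paper, and this is not the authors' implementation.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) spans.

    Label names are hypothetical; the paper's actual tag set is not shown here.
    """
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one first.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the open span of the same label.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open span;
            # the token itself is treated as non-reference text.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Hypothetical tagged input: a reference fragment followed by a non-reference token.
tokens = ["Chen,", "C.", "(2006).", "CiteSpace", "II", "not-a-reference"]
tags = ["B-AUTHOR", "I-AUTHOR", "B-YEAR", "B-TITLE", "I-TITLE", "O"]
fields = decode_bio(tokens, tags)
```

In this framing, tokens tagged `O` play the role of the non-reference text that the separation task is meant to filter out, although the paper may model that task differently.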

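Tables/Figures (14)

Figure 1. Dataset construction process
Table 1. BIO tagging scheme
Table 2. Types of KISDI publications used in the experiments
Table 3. Number of records in the training/evaluation datasets
Table 4. Composition of reference metadata types
Table 5. Composition of reference material types
Table 6. Example ChatGPT prompt and response
Table 7. Comparison of metadata extraction performance
Table 8. Exact-match F1-score comparison by metadata type
Table 9. Partial-match F1-score comparison by metadata type
Table 10. Comparison of material type classification and sentence separation performance
Table 11. Exact-match F1-score comparison for material type classification and sentence separation
Table 12. Partial-match F1-score comparison for material type classification and sentence separation
Table 13. Automatic extraction error types and representative cases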
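
The table list distinguishes exact-match from partial-match F1-scores (Tables 8, 9, 11 and 12). The sketch below illustrates how the two variants can differ on the same predictions. This record does not give the paper's matching criteria, so the partial-match rule used here (same label and at least one shared whitespace token) and the greedy one-to-one matching are assumptions, as are the names `Field` and `f1_score`.

```python
from typing import List, Tuple

Field = Tuple[str, str]  # (label, text); hypothetical representation


def _exact(pred: Field, gold: Field) -> bool:
    # Exact match: label and text must be identical.
    return pred == gold


def _partial(pred: Field, gold: Field) -> bool:
    # Assumed criterion: same label and at least one shared whitespace token.
    shared = set(pred[1].split()) & set(gold[1].split())
    return pred[0] == gold[0] and bool(shared)


def f1_score(preds: List[Field], golds: List[Field], partial: bool = False) -> float:
    """F1 over extracted fields, using greedy one-to-one matching."""
    match = _partial if partial else _exact
    unmatched = list(golds)
    tp = 0
    for p in preds:
        for g in unmatched:
            if match(p, g):
                unmatched.remove(g)  # each gold field is matched at most once
                tp += 1
                break
    if tp == 0:
        return 0.0
    precision = tp / len(preds)
    recall = tp / len(golds)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the predicted title is truncated.
gold = [
    ("AUTHOR", "Chen, C."),
    ("YEAR", "2006"),
    ("TITLE", "CiteSpace II: Detecting and visualizing emerging trends"),
]
pred = [("AUTHOR", "Chen, C."), ("YEAR", "2006"), ("TITLE", "CiteSpace II")]
exact_f1 = f1_score(pred, gold)
partial_f1 = f1_score(pred, gold, partial=True)
```

Here the truncated title counts as an error under exact matching (two of three fields correct, F1 of about 0.67) but as a hit under the assumed partial rule (F1 of 1.0), which is the kind of gap the paired tables would be expected to expose.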

References (35)

Ji, Seon-yeong & Choi, Sung-pil (2021). A study on recognition of citation metadata using bidirectional GRU-CRF model based on pre-trained language model. Journal of the Korean Society for Information Management, 38(1), 221-242. https://doi.org/10.3743/KOSIM.2021.38.1.221
Lee, Kangsandajeong, Lee, Hyejin, & Hyun, Mihwan (2022). A study on national R&D report reference technological improvement. Journal of the Korea Convergence Society, 13(1), 31-42. https://doi.org/10.15207/JKCS.2022.13.01.031
Besagni, D., Belaid, A., & Benet, N. (2003). A segmentation method for bibliographic references by contextual tagging of fields. Seventh International Conference on Document Analysis and Recognition, 384-388. https://doi.org/10.1109/ICDAR.2003.1227694
Chen, C. (2006). CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the American Society for Information Science and Technology, 57(3), 359-377. https://doi.org/10.1002/asi.20317
Choi, W., Yoon, H. M., Hyun, M. H., Lee, H. J., Seol, J. W., Lee, K. D., Yoon, Y. J., & Kong, H. (2023). Building an annotated corpus for automatic metadata extraction from multilingual journal article references. PloS one, 18(1), e0280637. https://doi.org/10.1371/journal.pone.0280637
Councill, I., Giles, C., & Kan, M. (2008). ParsCit: an Open-source CRF Reference String Parsing Package. LREC, 8, 661-667.
Dai, Z., Wang, X., Ni, P., Li, Y., Li, G., & Bai, X. (2019). Named entity recognition using BERT BiLSTM CRF for Chinese electronic health records. 2019 12th international congress on image and signal processing, biomedical engineering and informatics, 1-5. https://doi.org/10.1109/CISP-BMEI48845.2019.8965823
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. https://doi.org/10.48550/arXiv.1810.04805
Fritzler, A., Logacheva, V., & Kretov, M. (2019). Few-shot classification in named entity recognition task. Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, 993-1000. https://doi.org/10.1145/3297280.3297378
Gonzalez-Gallardo, C., Boros, E., Girdhar, N., Hamdi, A., Moreno, J., & Doucet, A. (2023). Yes but.. Can ChatGPT Identify Entities in Historical Documents? https://doi.org/10.48550/arXiv.2303.17322
Hetzner, E. (2008). A simple method for citation metadata extraction using hidden markov models. Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries, 280-284. https://doi.org/10.1145/1378889.1378937
Hollingsworth, B., Lewin, I., & Tidhar, D. (2005). Retrieving hierarchical text structure from typeset scientific articles: a prerequisite for e-science text mining. Proc. of the 4th UK E-Science All Hands Meeting, 67-273.
Hu, Y., Ameer, I., Zuo, X., Peng, X., Zhou, Y., Li, Z., Li, Y., Li, J., Jiang, X., & Xu, H. (2023). Zero-shot Clinical Entity Recognition using ChatGPT. https://doi.org/10.48550/arXiv.2303.16416
Huang, I., Ho, J., Kao, H., & Lin, W. (2004). Extracting citation metadata from online publication lists using BLAST. Advances in Knowledge Discovery and Data Mining: 8th Pacific-Asia Conference, 539-548. https://doi.org/10.1007/978-3-540-24775-3_64
Kim, J., Choi, N., Lim, S., Kim, J., Chung, S., Woo, H., Song, M., & Choi, J. D. (2021). Analysis of Zero-Shot Crosslingual Learning between English and Korean for Named Entity Recognition. Proceedings of the 1st Workshop on Multilingual Representation Learning, 224-237. https://doi.org/10.18653/v1/2021.mrl-1.19
Korea Institute of Science and Technology Information (2022). DeepData-REFMETA Version 1.0. http://doi.org/10.23057/47
Lauscher, A., Ravishankar, V., Vulic, I., & Glavas, G. (2020). From zero to hero: on the limitations of zero-shot cross-lingual transfer with multilingual transformers. https://doi.org/10.48550/arXiv.2005.00633
Liu, X., Chen, H., & Xia, W. (2022). Overview of named entity recognition. Journal of Contemporary Educational Research, 6(5), 65-68. https://doi.org/10.26689/jcer.v6i5.3958
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. https://doi.org/10.48550/arXiv.1907.11692
Lopez, P. (2009). GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. Research and Advanced Technology for Digital Libraries: 13th European Conference, 473-474. https://doi.org/10.1007/978-3-642-04346-8_62
OpenAI (2022). Introducing ChatGPT. Available: https://openai.com/blog/chatgpt/
Park, S., Moon, J., Kim, S., Cho, W. I., Han, J., Park, J., Song, C., Kim, J., Song, Y., Oh, T., Lee, J., Oh, J., Lyu, S., Jeong, Y., Lee, I., Seo, S., Lee, D., Kim, H., Lee, M., Jang, S., Do, S., Kim, S., Lim, K., Lee, J., Park, K., Shin, J., Kim, S., Park, L., Oh, A., Ha, J., & Cho, K. (2021). Klue: Korean Language Understanding Evaluation. https://doi.org/10.48550/arXiv.2105.09680
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training.
Rodrigues, A. D., Colavizza, G., & Kaplan, F. (2018). Deep reference mining from scholarly literature in the arts and humanities. Frontiers in Research Metrics and Analytics, 21. https://doi.org/10.3389/frma.2018.00021
Segura-Bedmar, I., Martinez Fernandez, P., & Herrero-Zazo, M. (2013). Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (ddiextraction 2013). Association for Computational Linguistics, 341-350.
Souza, F., Nogueira, R., & Lotufo, R. (2019). Portuguese named entity recognition using BERT-CRF. https://doi.org/10.48550/arXiv.1909.10649
Tkaczyk, D., Szostek, P., Fedoryszak, M., Dendek, P. J., & Bolikowski, L. (2015). CERMINE: automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition, 18, 317-335. https://doi.org/10.1007/s10032-015-0249-8
Van Eck, N. & Waltman, L. (2010). Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84(2), 523-538. https://doi.org/10.1007/s11192-009-0146-3
Voskuil, K. & Verberne, S. (2021). Improving reference mining in patents with BERT. https://doi.org/10.48550/arXiv.2101.01039
Wang, S., Sun, X., Li, X., Ouyang, R., Wu, F., Zhang, T., Li, J., & Wang, G. (2023). GPT-NER: Named Entity Recognition via Large Language Models. https://doi.org/10.48550/arXiv.2304.10428
Wei, X., Cui, X., Cheng, N., Wang, X., Zhang, X., Huang, S., Xie, P., Xu, J., Chen, Y., Zhang, M., Jiang, Y., & Han, W. (2023). Zero-shot information extraction via chatting with chatgpt. https://doi.org/10.48550/arXiv.2302.10205
White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., & Schmidt, D. (2023). A prompt pattern catalog to enhance prompt engineering with chatgpt. https://doi.org/10.48550/arXiv.2302.11382
Wu, Y., Huang, J., Xu, C., Zheng, H., Zhang, L., & Wan, J. (2021). Research on named entity recognition of electronic medical records based on roberta and radical-level feature. Wireless Communications and Mobile Computing, 2021, 1-10. https://doi.org/10.1155/2021/2489754
Yang, Y. & Katiyar, A. (2020). Simple and effective few-shot named entity recognition with structured nearest neighbor learning. https://doi.org/10.48550/arXiv.2010.02405
Zhang, X., Zou, J., Le, D. X., & Thoma, G. R. (2011). A structural SVM approach for reference parsing. BMC bioinformatics, 12, 1-7. https://doi.org/10.1186/1471-2105-12-S3-S7

