Topic Model Augmentation and Extension Method using LDA and BERTopic

SeonWook Kim (Department of Library and Information Science, College of Social Sciences, Kyungpook National University), Kiduk Yang (Yeongnam Archive Center for Old Documents, 영남고문헌아카이브센터)

doi:10.3743/kosim.2022.39.3.099

Journal of the Korean Society for Information Management (정보관리학회지), v.39 no.3, 2022, pp.99-132

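Abstract (translated from Korean)

This study proposes Augmented and Extended Topics (AET), a method for synthesizing LDA and BERTopic topic modeling results, and uses it to analyze research topics in library and information science (LIS). To examine how AET works in practice, bibliographic data for 55,442 articles published in 85 LIS journals indexed in Web of Science between January 2001 and October 2021 were analyzed. AET builds a WORD2VEC-based cosine similarity matrix that relates the two sets of topic modeling results. It extracts augmented topics (AT) by repeatedly reordering and segmenting the matrix for as long as the semantic relations within it remain valid. In the remaining region, it determines extended topics (ET) from the harmonic mean of the mean cosine similarity rank and the BERTopic topic size rank. Comparing the AET result with the LDA topic modeling result derived as the optimal baseline showed that AT further concretized and subdivided the LDA topics, while ET uncovered valid topics. AT performed at least as well as LDA, and ET performed at a level similar to LDA in most cases, with a few exceptions.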
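The similarity matrix described in the abstract can be illustrated with a minimal sketch. The abstract does not state how a topic is turned into a single vector, so averaging the vectors of a topic's terms is an assumption here, as are the toy three-dimensional vectors (stand-ins for a trained WORD2VEC model) and the helper names `topic_vector`, `cosine`, and `similarity_matrix`.

```python
import math

# Toy 3-dimensional word vectors standing in for a trained WORD2VEC model.
# These values are hypothetical and exist only for illustration.
TOY_VECTORS = {
    "library": [0.9, 0.1, 0.0],
    "catalog": [0.8, 0.2, 0.1],
    "citation": [0.1, 0.9, 0.1],
    "impact": [0.2, 0.8, 0.0],
    "health": [0.0, 0.1, 0.9],
}

def topic_vector(terms, vectors):
    """Average the vectors of the topic terms that have an embedding.

    Assumption: a topic is represented by the mean of its term vectors,
    and terms without an embedding are skipped.
    """
    known = [vectors[t] for t in terms if t in vectors]
    if not known:
        raise ValueError("no topic term has an embedding")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(lda_topics, bert_topics, vectors):
    """Rows correspond to LDA topics, columns to BERTopic topics."""
    lda_vecs = [topic_vector(t, vectors) for t in lda_topics]
    bert_vecs = [topic_vector(t, vectors) for t in bert_topics]
    return [[cosine(u, v) for v in bert_vecs] for u in lda_vecs]

# Hypothetical topics, each given as a list of topic terms.
lda_topics = [["library", "catalog"], ["citation", "impact"]]
bert_topics = [["catalog"], ["impact", "citation"], ["health"]]
matrix = similarity_matrix(lda_topics, bert_topics, TOY_VECTORS)
```

In this toy setup, each row of `matrix` shows how closely one LDA topic relates to every BERTopic topic. The reordering and segmentation steps of AET would operate on a matrix of this shape, but they are not reproduced here.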

Abstract (English)

This study proposes AET (Augmented and Extended Topics), a novel method for synthesizing LDA and BERTopic results, and applies it experimentally to recently published LIS articles. To this end, 55,442 abstracts from 85 LIS journals in the WoS database, published between January 2001 and October 2021, were analyzed. AET first constructs a WORD2VEC-based cosine similarity matrix between the LDA and BERTopic results. It then extracts AT (Augmented Topics) by repeating the matrix reordering and segmentation procedures for as long as the semantic relations remain valid. Finally, it determines ET (Extended Topics) by removing any LDA-related residual subtopics from the matrix and ordering the remaining ones by F1 (BERTopic topic size rank, inverse cosine similarity rank). Compared with the baseline LDA result, AT effectively concretized the original LDA topic model, and ET discovered new meaningful topics that LDA did not. In the qualitative performance evaluation, AT performed better than LDA, while ET performed similarly to LDA except in a few cases.
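The ET ordering step combines two ranks with a harmonic mean (the F1 form). The sketch below is one possible reading, not the authors' implementation: the abstract does not say how ranks are oriented or scaled, so the rank-point scheme (best candidate gets n points, worst gets 1), the candidate names, and the field names `size` and `mean_similarity` are all assumptions.

```python
def rank_points(values, higher_is_better):
    """Give the best value n points and the worst value 1 point.

    Assumption: ranks are expressed as points so that a larger harmonic
    mean marks a stronger candidate. Ties are broken by input order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not higher_is_better)
    points = [0] * len(values)
    for position, index in enumerate(order, start=1):
        points[index] = position
    return points

def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers (the F1 form)."""
    return 2 * a * b / (a + b)

def rank_extended_candidates(candidates):
    """Order candidate topics by the harmonic mean of two rank scores."""
    sizes = [c["size"] for c in candidates]
    sims = [c["mean_similarity"] for c in candidates]
    # Larger BERTopic topics rank higher on size.
    size_pts = rank_points(sizes, higher_is_better=True)
    # Lower mean similarity to the LDA topics ranks higher (inverse rank).
    novelty_pts = rank_points(sims, higher_is_better=False)
    scored = [(harmonic_mean(s, n), c["name"])
              for s, n, c in zip(size_pts, novelty_pts, candidates)]
    return [name for _, name in sorted(scored, key=lambda x: x[0], reverse=True)]

# Hypothetical residual BERTopic topics left after AT extraction.
candidates = [
    {"name": "bert_7", "size": 900, "mean_similarity": 0.62},
    {"name": "bert_12", "size": 400, "mean_similarity": 0.21},
    {"name": "bert_19", "size": 650, "mean_similarity": 0.35},
]
ranking = rank_extended_candidates(candidates)
```

With these made-up numbers, `bert_19` comes first because it is moderately strong on both criteria, whereas the other two are strong on only one. The harmonic mean rewards that balance more than an arithmetic mean would.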


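Tables/Figures (12)

<Figure 1> Research procedure
<Figure 2> Proposed matrix reordering method
<Figure 3> AET synthesis concept
<Figure 4> Search for optimal LDA hyperparameters
<Table 1> LDA topic model results: topics and topic terms
<Table 2> BERTopic topic model results: top 10 topics and topic terms
<Figure 5> Location of the first loss of semantic relation: 15th segment
<Table 3> First semantic-loss relation within the 15th segment
<Table 4> AET kappa statistic (K)
<Figure 6> Comparison of LDAbow and AT embedding representations
<Figure 7> Comparison of ET embedding representations
<Table 5> Topic terms that could not be embedded with WORD2VEC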

