Topic Model Augmentation and Extension Method using LDA and BERTopic
(Original Korean title: LDA와 BERTopic을 이용한 토픽모델링의 증강과 확장 기법 연구)

Journal of the Korean Society for Information Management (정보관리학회지), v.39 no.3, 2022, pp.99-132

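Kim, SeonWook (Department of Library and Information Science, College of Social Sciences, Kyungpook National University), Yang, Kiduk (Yeongnam Old Documents Archive Center, 영남고문헌아카이브센터)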

Abstract (translated from the Korean)

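The purpose of this study is to propose Augmented and Extended Topics (AET), a method for synthesizing LDA and BERTopic topic modeling results, and to use it to analyze research topics in library and information science (LIS). To examine how AET performs in practice, we analyzed bibliographic data for 55,442 articles published between January 2001 and October 2021 in 85 LIS journals indexed in Web of Science. AET represents the relationship between the two topic modeling results as a Word2Vec-based cosine similarity matrix. It then repeatedly reorders and partitions the matrix, within the range where the semantic relationships in the matrix remain valid, to extract Augmented Topics (AT). In the remaining region, it determines Extended Topics (ET) using the harmonic mean of two ranks: the rank of the mean cosine similarity and the rank of the BERTopic topic size. When the AET results were compared with the LDA topic modeling results adopted as the gold standard, AT made the LDA topics more specific and fine-grained, and ET discovered valid topics. AT performed at least as well as LDA, and ET performed at a level similar to LDA in most cases, with a few exceptions.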
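The two numerical ingredients named in the abstract, a cosine similarity matrix between topics from two models and a harmonic mean of two ranks, can be illustrated with a minimal sketch. This is not the authors' implementation. The topic vectors below are hypothetical toy values standing in for Word2Vec-derived topic representations, the function and variable names are invented for illustration, and the rank direction (rank 1 for the highest mean similarity and for the largest topic, with a lower combined score treated as a stronger candidate) is an assumption that the abstract does not specify. The matrix reordering and partitioning step that extracts AT is omitted.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return the pairwise cosine similarity between rows of a and rows of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def rank_descending(values):
    """Assign rank 1 to the largest value; ties keep their input order."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(-values, kind="stable")
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks


def harmonic_mean(x, y):
    """Element-wise harmonic mean of two arrays of positive numbers."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return 2.0 * x * y / (x + y)


# Hypothetical toy data: one row per topic. In practice each row might be
# an average of Word2Vec vectors of a topic's top terms (an assumption).
lda_topic_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
bertopic_topic_vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
bertopic_topic_sizes = np.array([300, 50, 120])  # documents per topic

# Rows correspond to LDA topics, columns to BERTopic topics.
similarity = cosine_similarity_matrix(lda_topic_vectors, bertopic_topic_vectors)

# Mean similarity of each BERTopic topic to all LDA topics.
mean_similarity = similarity.mean(axis=0)

# Assumed rank direction: rank 1 = highest similarity / largest topic.
similarity_rank = rank_descending(mean_similarity)
size_rank = rank_descending(bertopic_topic_sizes)

# Combine the two ranks; under the assumption above, lower is stronger.
combined_score = harmonic_mean(similarity_rank, size_rank)
best_candidate = int(np.argmin(combined_score))
```

The harmonic mean is pulled toward the smaller of its two inputs, so under the rank direction assumed here a topic that ranks near the top on either criterion receives a low combined score even when its other rank is mediocre.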

Abstract (authors' English version)

The purpose of this study is to propose AET (Augmented and Extended Topics), a novel method of synthesizing both LDA and BERTopic results, and to analyze the recently published LIS articles as an experimental approach. To achieve the purpose of this study, 55,442 abstracts from 85 LIS journals withi...

Open access (OA) type: Gold (article published in an open-access journal)
