[Paper] Research on Generative AI for Korean Multi-Modal Montage App

임정현 (School of AI, Daegu University), 차경애 (Department of AI, Daegu University), 고재필 (Department of Computer Engineering, Kumoh National Institute of Technology), 홍원기 (School of Computer and Information Engineering, Daegu University)

doi:10.18807/jsrs.2024.14.1.013

Journal of Service Research and Studies (서비스연구), v.14 no.1, 2024, pp. 13-26

Abstract

Multi-modal generation is the task of producing results from several kinds of input, such as text, images, and audio. With the rapid advance of AI technology, systems that jointly process multiple types of data are becoming increasingly diverse. This paper presents an AI system that generates a montage (facial composite) image from a description of a person given by speech or text. Whereas existing montage generation technology is based on Western facial appearance, the system developed in this paper trains its model on Korean facial features. It can therefore produce more accurate and effective Korean-style montage images from Korean-language speech and text input. Because the generated image is good enough to serve as a draft montage, the app can substantially reduce the manual work of montage production staff. For training, we used the persona-based virtual person montage data provided through AI-Hub by the National Information Society Agency. AI-Hub is an integrated AI platform that builds the training data needed to develop AI technologies and services and provides it in one place. The image generation system was implemented with VQGAN, a deep learning model for generating high-resolution images, and KoDALLE, a Korean-language-based image generation model. The trained model generates montage images of faces that closely match what was described by voice and text. To verify the practicality of the app, 10 testers used it, and more than 70% responded that they were satisfied. The montage generation app could be used in various fields that involve turning a description of facial features into an image, such as apprehending criminals.
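
The abstract names VQGAN and KoDALLE as the two components of the image generation system. In DALLE-style models, the VQGAN side represents an image as indices into a learned codebook, and the text-conditioned side predicts those indices. The following is a minimal NumPy sketch of the codebook lookup alone. It is not the authors' code: the function names `quantize` and `dequantize`, the 4-entry codebook, and the 2-dimensional features are all assumptions chosen for illustration.

```python
import numpy as np


def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each feature vector, the index of its nearest code vector.

    features: (n, d) array standing in for encoder outputs (hypothetical).
    codebook: (k, d) array standing in for learned code vectors (hypothetical).
    """
    # Pairwise squared Euclidean distances, shape (n, k), via broadcasting.
    diff = features[:, None, :] - codebook[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    return np.argmin(dist, axis=1)


def dequantize(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look up the code vector for each index; a decoder would consume these."""
    return codebook[indices]


# Toy codebook: the four corners of the unit square.
codebook = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

# Toy "encoder outputs", each lying close to one corner.
features = np.array([[0.1, 0.9], [0.9, 0.2], [0.05, 0.1]])

indices = quantize(features, codebook)
reconstructed = dequantize(indices, codebook)
print(indices.tolist())  # [1, 2, 0]
```

In a full system of this kind, the index sequence is what the text-to-image model would generate from the description, and the image decoder would turn the looked-up code vectors back into pixels; both of those stages are omitted here.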

Tables/Figures (19)

Fig. 2-1 DALLE Structure
Tab. 2-1 DALLE vs KoDALLE
Fig. 2-2 VQGAN (Esser et al., 2021)
Fig. 2-3 Existing montage creation program (Park et al., 2014)
Fig. 3-1 Learning Model System Diagram
Fig. 3-2 App UI Module Diagram
Tab. 3-1 Training Dataset
Tab. 3-2 Definition of Data Input Label
Tab. 3-3 Computing Environment for training VQGAN
Tab. 3-4 Learning Parameter for VQGAN
Tab. 3-5 Computing Environment for training KoDALLE
Tab. 3-6 Learning Parameter for KoDALLE
Fig. 3-3 Loss Rate of Learning Outcome
Fig. 3-4 Source Image for VQGAN learning
Fig. 3-5 Reconstructed Image on VQGAN
Tab. 3-7 Montage generated by KoDALLE
Fig. 4-1 Start Up Screen of App
Fig. 4-2 Input Screen for Image Description
Fig. 4-3 Generated Montage

References (18)

AI-Hub K-Fashion (2024), https://aihub.or.kr/aihubdata/data/view.do?currMenu115&topMenu100&aihubDataSerealm&dataSetSn51
AI-Hub Montage (2024), https://www.aihub.or.kr/aihubdata/data/view.do?currMenu115&topMenu100&dataSetSn618
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020), Language models are few-shot learners, Advances in neural information processing systems, 33, pp. 1877-1901
Esser, P., Rombach, R., and Ommer, B. (2021), Taming Transformers for High-Resolution Image Synthesis, In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12873-12883
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2020), Generative adversarial networks, Communications of the ACM, 63(11), pp. 139-144

Joh, H., and Park, B.S. (2018), A Comparative Study of Montage investigation and portrait investigation, 가천법학, 11(3), pp. 235-264
Kingma, D.P., and Welling, M. (2013), Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114
KoDALLE (2024), https://github.com/KR-HappyFace/KoDALLE
KoGPT of Kakao Brain (2024), https://github.com/kakaobrain/kogpt
KoGPT Trinity of SKT (2024), https://github.com/SKT-AI/KoGPT2
Park, B., Nam, S., Chang, H., and Choi, C. (2013), EsFit - A facial composites methodology to help eyewitness, Annual Conference of IEIE, pp. 1393-1396
Park, S., Moon, J., Kim, S., Cho, W. I., Han, J., Park, J., Song, C., Kim, J., Song, Y., Oh, T., Lee, J., Oh, J., Lyu, S., Jeong, Y., Lee, I., Seo, S., Lee, D., Kim, H., Lee, M., Jang, S., Do, S., Kim, S., Lim, K., Lee, J., Park, K., Shin, J., Kim, S., Park, L., Oh, A., Ha, J.-W., and Cho, K. (2021), Klue: Korean language understanding evaluation, arXiv preprint arXiv:2105.09680
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. (2021), Zero-shot text-to-image generation, In International Conference on Machine Learning, pp. 8821-8831
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022), High-resolution image synthesis with latent diffusion models, In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684-10695
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Ayan, B. K., Salimans, T., Ho, J., Fleet, D. J., and Norouzi, M. (2022), Photorealistic text-to-image diffusion models with deep language understanding, Advances in Neural Information Processing Systems, 35, pp. 36479-36494
Van Den Oord, A., and Vinyals, O. (2017), Neural discrete representation learning, Advances in neural information processing systems, 30
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017), Attention is all you need, Advances in neural information processing systems, 30
Weights & Biases (2024), https://wandb.ai/site
