[Paper] A Study on Multimodal Emotion Recognition in Speech and Text Based on Integrated CNN, LSTM, and BERT Models

에드워드 카야디; 한스 나타니엘 하디 수실로; 송미화

doi:10.17703/jcct.2024.10.1.617

Enhancing Multimodal Emotion Recognition in Speech and Text with Integrated CNN, LSTM, and BERT Models

Journal of the Convergence on Culture Technology : JCCT = 문화기술의 융합, v.10 no.1, 2024, pp.617-623

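에드워드 카야디 (School of Information and Communication, Semyung University), 한스 나타니엘 하디 수실로 (Department of Computer Science and Engineering, Sogang University), 송미화 (School of Smart IT, Semyung University)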

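Abstract (translated from the Korean)

Language and emotion are intricately related, and identifying emotions from what we say is recognized as an important challenge. This study aims to address that challenge by using feature engineering to identify emotions in spoken language through a multimodal classification task that includes both speech and text data. Two classifiers, CNN (Convolutional Neural Networks) and LSTM (Long Short-Term Memory), were each integrated with a BERT-based pre-trained model and evaluated. The evaluation covers several performance metrics (accuracy, F-score, precision, and recall) across a range of experimental settings. The results show that both models identify emotions accurately from both text and speech data.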

Abstract

Identifying emotions through speech poses a significant challenge due to the complex relationship between language and emotions. Our paper aims to take on this challenge by employing feature engineering to identify emotions in speech through a multimodal classification task involving both speech and text data. We evaluated two classifiers, Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM), both integrated with a BERT-based pre-trained model. Our assessment covers various performance metrics (accuracy, F-score, precision, and recall) across different experimental setups. The findings highlight the proficiency of the two models in accurately discerning emotions from both text and speech data.
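The four metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the per-class scores are averaged, so macro averaging is an assumption here, and the function name `classification_metrics` and the emotion labels in the example are hypothetical rather than taken from the paper.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1.

    Macro averaging is an assumption; the paper's averaging scheme is not
    stated in the abstract.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical labels and predictions, for illustration only.
y_true = ["angry", "happy", "sad", "neutral", "happy", "sad"]
y_pred = ["angry", "happy", "neutral", "neutral", "sad", "sad"]
scores = classification_metrics(y_true, y_pred)
print(scores)
```

Macro averaging weights every emotion class equally, which is a common choice when classes are imbalanced; a weighted or micro average would give different numbers on the same predictions.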


Tables/Figures (8)

Figure 1. Fine-tuned BERT system architecture
Figure 2. Multimodal SER system architecture
Figure 3. CNN model confusion matrix
Figure 4. LSTM model confusion matrix
Figure 5. BERT model confusion matrix
Table 1. CNN model parameters
Table 2. LSTM model parameters
Table 3. Model comparison

