[Paper] RoutingConvNet: A Light-weight Speech Emotion Recognition Model Based on Bidirectional MFCC

임현택; 김수형; 이귀상; 양형정

doi:10.30693/smj.2023.12.5.28


Smart Media Journal, v.12 no.5, 2023, pp.28-35

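임현택 (Department of Artificial Intelligence Convergence, Chonnam National University), 김수형 (Department of Artificial Intelligence Convergence, Chonnam National University), 이귀상 (Department of Artificial Intelligence Convergence, Chonnam National University), 양형정 (Department of Artificial Intelligence Convergence, Chonnam National University)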

Abstract

In this study, we propose RoutingConvNet (Routing Convolutional Neural Network), a new light-weight model with few parameters, to improve the applicability and practicality of speech emotion recognition. To reduce the number of learnable parameters, the proposed model concatenates bidirectional MFCCs (Mel-Frequency Cepstral Coefficients) along the channel dimension to learn long-term emotional dependencies and extract contextual features. A light-weight deep CNN extracts low-level features, and self-attention captures channel and spatial information in the speech signal. In addition, dynamic routing is applied to improve accuracy and to make the model robust to feature variations. In experiments on three speech emotion datasets (EMO-DB, RAVDESS, and IEMOCAP), the proposed model shows both a reduction in parameters and an improvement in accuracy, achieving 87.86%, 83.44%, and 66.06% accuracy, respectively, with about 156,000 parameters. We also propose a metric that quantifies the trade-off between the number of parameters and accuracy, so that performance can be evaluated relative to model size.
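The channel-wise combination of bidirectional MFCCs can be illustrated with a minimal sketch. The abstract does not say how the backward view is built, so this example assumes it is the same MFCC matrix with its frame (time) axis reversed. The function name `bidirectional_mfcc` and the toy values are hypothetical, and plain lists are used only so the sketch runs without dependencies.

```python
from typing import List

Matrix = List[List[float]]  # shape: (n_mfcc, n_frames)


def bidirectional_mfcc(mfcc: Matrix) -> List[Matrix]:
    """Stack an MFCC matrix and its time-reversed copy as two channels.

    Returns a nested list of shape (2, n_mfcc, n_frames).
    Assumption (not from the paper): the backward view is obtained by
    reversing the frame axis of the forward MFCC matrix.
    """
    forward = [list(row) for row in mfcc]             # channel 0: original order
    backward = [list(reversed(row)) for row in mfcc]  # channel 1: frames reversed
    return [forward, backward]


# Toy input: 3 coefficients x 4 frames (made-up numbers, not real audio).
toy = [
    [0.0, 1.0, 2.0, 3.0],
    [10.0, 11.0, 12.0, 13.0],
    [20.0, 21.0, 22.0, 23.0],
]

stacked = bidirectional_mfcc(toy)
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # prints: 2 3 4
```

With a NumPy array `m` of shape `(n_mfcc, n_frames)`, the same stacking would be `np.stack([m, m[:, ::-1]])`. Because both views share one channel axis, a single 2D convolution over the resulting input sees forward and backward context together without a second set of weights. This is one plausible reading of how the approach saves parameters, not a statement of the authors' implementation.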

