[Paper] Improving transformer-based acoustic model performance using sequence discriminative training

이채원; 장준혁

doi:10.7776/ask.2022.41.3.335


The Journal of the Acoustical Society of Korea (한국음향학회지), v.41 no.3, 2022, pp.335-341

이채원 (Department of Electronic Engineering, Hanyang University), 장준혁 (Department of Electronic Engineering, Hanyang University)

Abstract

In this paper, we use the transformer, which has shown outstanding performance in natural language processing, as the acoustic model (AM) in hybrid speech recognition. The transformer AM uses an attention structure to process sequential data and achieves high performance at low computational cost. We propose a method for improving the transformer AM by applying each of four sequence discriminative training algorithms, a weighted finite-state transducer (wFST)-based form of training used with conventional DNN-HMM models. Compared with conventional cross-entropy (CE) training, the proposed method achieves a 5 % relative reduction in word error rate (WER).
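The relative WER reduction mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual definition of WER as word-level edit distance divided by the number of reference words. The function names, the example sentences, and the WER values 10.0 and 9.5 are hypothetical and are not taken from the paper's tables.

```python
def word_errors(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref_text, hyp_text):
    """WER in percent: word errors divided by reference length."""
    ref = ref_text.split()
    hyp = hyp_text.split()
    return 100.0 * word_errors(ref, hyp) / len(ref)


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction in percent with respect to the baseline WER."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical sentences: one substitution and one deletion in 6 words.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
    # Hypothetical WERs: a CE baseline at 10.0 % and a sequence-trained model at 9.5 %.
    print(relative_wer_reduction(10.0, 9.5))
```

Under this definition, a 5 % relative reduction means the new WER is 0.95 times the baseline WER, not that the WER drops by 5 absolute points.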

Tables/Figures (7)

Fig. 1. (Color available online) The architecture of transformer-based AM.
Table 1. Model architecture for transformer acoustic model (AM).
Table 2. WERs (%) on Librispeech for each lattice-based sequence training method.
Table 3. WERs (%) on HD-100h for each lattice-based sequence training method.
Table 4. WERs (%) on 7 different word count sections of Librispeech.
Table 5. WERs (%) on HD-100h for LF-MMI sequence training method [6].
Table 6. WERs (%) on Librispeech for CE, sMBR and LF-MMI.

References (17)

B. Juang and L. Rabiner, "Hidden Markov models for speech recognition," Technometrics, 33, 251-272 (1991).

A. Senior, H. Sak, and I. Shafran, "Context dependent phone models for LSTM RNN acoustic modelling," Proc. IEEE ICASSP, 4585-4589 (2015).
J. Li, V. Lavrukhin, B. Ginsburg, and R. Leary, "Jasper: An end-to-end convolutional neural acoustic model," arXiv preprint arXiv:1904.03288 (2019).
K. Chen and Q. Huo, "Training deep bidirectional LSTM acoustic model for LVCSR by a context-sensitive-chunk BPTT approach," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24 (2016).

L. Bahl, P. Brown, P. Souza, and R. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," Proc. ICASSP, 49-52 (1986).
D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, "Boosted MMI for model and feature-space discriminative training," Proc. IEEE ICASSP, 4057-4060 (2008).
M. Gibson and T. Hain, "Hypothesis spaces for minimum Bayes risk training in large vocabulary speech recognition," Proc. Interspeech, 2406-2409 (2006).
D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, and V. Manohar, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," Proc. Interspeech, 2751-2755 (2016).
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in neural information processing systems, 30 (2017).
K. Vesely, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," Proc. Interspeech, 2345-2349 (2013).
Y. Wang, A. Mohamed, D. Le, C. Liu, and A. Xiao, "Transformer-based acoustic modeling for hybrid speech recognition," Proc. IEEE ICASSP, 6874-6878 (2020).
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," Proc. IEEE ICASSP, 5206-5210 (2015).
S. Watanabe, T. Hori, S. Karita, and T. Hayashi, "ESPnet: End-to-end speech processing toolkit," arXiv preprint arXiv:1804.00015 (2018).
D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," Proc. ASRU, (2011).
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, and Z. DeVito, "PyTorch: An imperative style, high-performance deep learning library," Advances in neural information processing systems, 32 (2019).
L. Lu, X. Xiao, Z. Chen, and Y. Gong, "PyKaldi2: Yet another speech toolkit based on Kaldi and PyTorch," arXiv preprint arXiv:1907.05955 (2019).
Y. Shao and Y. Wang, "PyChain: A fully parallelized PyTorch implementation of LF-MMI for end-to-end ASR," arXiv preprint arXiv:2005.09824 (2020).
