[Paper] A study on end-to-end speaker diarization system using single-label classification

정재희; 김우일

doi:10.7776/ask.2023.42.6.536


The Journal of the Acoustical Society of Korea, v.42 no.6, 2023, pp.536-543

Abstract

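Speaker diarization labels "who spoke when?" in speech that contains multiple speakers. Deep neural network-based end-to-end methods have been studied in order to label overlapping speech segments and to optimize the diarization model. Most deep neural network-based end-to-end speaker diarization systems treat diarization as a multi-label classification problem that estimates the labels of all speakers active in each frame of speech. The performance of a multi-label diarization system varies considerably depending on the threshold value chosen. This paper studies a speaker diarization system based on single-label classification so that diarization can be performed without a threshold. The proposed system converts the conventional speaker labels into a single-label form and estimates the labels directly from the model output. In training, a Permutation Invariant Training (PIT) loss is combined with a cross-entropy loss to account for speaker label permutations. In addition, residual connections are added to the diarization model for effective training of the deep model. The experiments use simulated noisy data for two speakers generated from the Librispeech database. When the proposed method and the baseline model are compared using the Diarization Error Rate (DER) metric, the proposed method performs diarization without a threshold and improves performance by about 20.7 %.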
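The conversion from multi-label to single-label targets can be illustrated with a minimal sketch. The abstract does not give the exact class ordering, so the bitmask encoding below and the function names `to_single_label`, `to_multi_label`, and `predict_frame` are assumptions for illustration only. For two speakers it yields four classes (silence, speaker 1 only, speaker 2 only, overlap), and the predicted class is taken by argmax, so no threshold is involved.

```python
def to_single_label(frame):
    """Encode a per-speaker 0/1 activity tuple as one class index (bitmask).

    Assumed encoding, not taken from the paper: speaker i contributes 2**i.
    """
    return sum(bit << i for i, bit in enumerate(frame))


def to_multi_label(index, num_speakers):
    """Decode a class index back into a per-speaker 0/1 activity tuple."""
    return tuple((index >> i) & 1 for i in range(num_speakers))


def predict_frame(scores, num_speakers):
    """Pick the highest-scoring class (argmax) and decode it; no threshold."""
    best = max(range(len(scores)), key=lambda c: scores[c])
    return to_multi_label(best, num_speakers)


# Two speakers: silence, speaker 1 only, speaker 2 only, overlap.
multi = [(0, 0), (1, 0), (0, 1), (1, 1)]
single = [to_single_label(f) for f in multi]
print(single)

# Hypothetical model scores over the 4 classes for one frame.
print(predict_frame([0.1, 0.2, 0.1, 0.6], num_speakers=2))
```

With this encoding the number of classes grows as 2 to the power of the number of speakers, which is manageable for the two-speaker setting described in the abstract.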

English abstract

Speaker diarization, which labels "who spoke when?" in speech with multiple speakers, has been studied with deep neural network-based end-to-end methods in order to label overlapping speech and to optimize speaker diarization models. Most deep neural network-based end-to-end speaker diarization systems treat the task as a multi-label classification problem that predicts the labels of all speakers speaking in each frame of speech. However, the performance of a multi-label-based model varies greatly depending on the threshold setting. This paper studies a speaker diarization system that uses single-label classification so that speaker diarization can be performed without thresholds. The proposed model converts speaker labels into a single label and estimates labels directly from the model output. To account for speaker label permutations in training, the proposed model uses a combination of Permutation Invariant Training (PIT) loss and cross-entropy loss. In addition, residual connection structures are added to the model for effective learning of speaker diarization models with deep structures. The experiments use the Librispeech database to generate simulated noisy data for two speakers. When the proposed method is compared with the baseline model using the Diarization Error Rate (DER) metric, the proposed method can label without a threshold and improves performance by about 20.7 %.
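One plausible reading of "a combination of PIT loss and cross-entropy loss" is sketched below: for every permutation of the reference speakers, the reference is re-encoded as single-label targets, the mean cross-entropy against the model's per-frame class probabilities is computed, and the smallest value is kept. This is an assumption about the general idea, not the paper's exact formulation (Fig. 2 of the paper describes the actual process). The names `pit_cross_entropy` and `to_single_label` and the bitmask encoding are hypothetical.

```python
import math
from itertools import permutations


def to_single_label(frame):
    """Assumed bitmask encoding: speaker i contributes 2**i to the class index."""
    return sum(bit << i for i, bit in enumerate(frame))


def cross_entropy(probs, targets):
    """Mean negative log-probability of the target class over all frames."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def pit_cross_entropy(probs, ref_multi):
    """Return (loss, permutation) with the lowest cross-entropy over speaker orders."""
    num_speakers = len(ref_multi[0])
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(num_speakers)):
        # Reorder the reference speakers, then re-encode as single labels.
        targets = [to_single_label(tuple(frame[s] for s in perm))
                   for frame in ref_multi]
        loss = cross_entropy(probs, targets)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the reference says speaker 0 talks, but the model puts its
# probability mass on the class for "speaker 1 only" (class index 2).
probs = [[0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.7, 0.1]]
ref_multi = [(1, 0), (1, 0)]
loss, perm = pit_cross_entropy(probs, ref_multi)
print(loss, perm)
```

In the toy example the swapped speaker order gives the lower loss, so the model is not penalized for assigning the reference speaker to a different output slot. Enumerating all permutations is cheap for two speakers but grows factorially with the speaker count.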


Tables/Figures (5)

Fig. 1. (Color available online) The structure of the SA-EEND model [8].
Fig. 2. (Color available online) The process of combined loss calculation.
Fig. 3. (Color available online) The structure of the proposed SL-Res-SA-EEND model.
Table 1. The DER performances of the SA-EEND model according to threshold (%).
Table 2. The DER performances of the baseline system and the proposed model (%).
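To make the threshold dependence summarized in Table 1 concrete, the following sketch computes a simplified frame-level DER for a multi-label output at several thresholds. It assumes the speaker mapping between reference and hypothesis is already resolved and applies no collar, so it is not the scoring tool used in the paper. The posteriors are invented toy values, not results from the paper, and `frame_der` and `apply_threshold` are hypothetical names.

```python
def frame_der(ref, hyp):
    """Simplified frame-level DER in percent.

    DER = (missed speech + false alarm + speaker confusion) / reference speech,
    counted in speaker-frames. Assumes at least one reference speaker-frame.
    """
    miss = fa = conf = total = 0
    for r, h in zip(ref, hyp):
        n_ref, n_hyp = sum(r), sum(h)
        n_match = sum(1 for a, b in zip(r, h) if a and b)
        miss += max(n_ref - n_hyp, 0)
        fa += max(n_hyp - n_ref, 0)
        conf += min(n_ref, n_hyp) - n_match
        total += n_ref
    return 100.0 * (miss + fa + conf) / total


def apply_threshold(posteriors, threshold):
    """Binarize per-speaker posteriors, as a multi-label system would."""
    return [tuple(int(p > threshold) for p in frame) for frame in posteriors]


# Toy reference and toy per-speaker posteriors for four frames.
ref = [(1, 0), (1, 1), (0, 1), (0, 0)]
posteriors = [(0.9, 0.2), (0.6, 0.4), (0.3, 0.8), (0.1, 0.1)]

for threshold in (0.3, 0.5, 0.7):
    der = frame_der(ref, apply_threshold(posteriors, threshold))
    print(threshold, der)
```

Even in this tiny example the error changes with the threshold because the overlapped frame is binarized differently each time, which is the sensitivity that a single-label output with argmax decoding avoids.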

References (18)

[1] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, "Speaker diarization using deep neural network embeddings," Proc. ICASSP, 4930-4934 (2017).
[2] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, "Speaker diarization with LSTM," Proc. ICASSP, 5239-5243 (2018).
[3] M. Diez, L. Burget, S. Wang, J. Rohdin, and H. Cernocky, "Bayesian HMM based x-Vector clustering for speaker diarization," Proc. Interspeech, 346-350 (2019).
[4] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, A. Laptev, and A. Romanenko, "Target-speaker voice activity detection: a novel approach for multispeaker diarization in a dinner party scenario," Proc. Interspeech, 274-278 (2020).
[5] Y. C. Liu, E. Han, C. Lee, and A. Stolcke, "End-to-end neural diarization: From transformer to conformer," Proc. Interspeech, 3081-3085 (2021).
[6] Z. Du, S. Zhang, S. Zheng, and Z. Yan, "Speaker embedding-aware neural diarization: A novel framework for overlapping speech diarization in the meeting scenario," arXiv preprint arXiv:2203.09767 (2022).
[7] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, "End-to-end neural speaker diarization with permutation-free objectives," Proc. Interspeech, 4300-4304 (2019).
[8] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, "End-to-end neural speaker diarization with self-attention," Proc. ASRU, 296-303 (2019).
[9] Y. Yu, D. Park, and H. K. Kim, "Auxiliary loss of transformer with residual connection for end-to-end speaker diarization," Proc. ICASSP, 8377-8381 (2022).
[10] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," Proc. ICASSP, 5206-5210 (2015).
[11] D. Snyder, G. Chen, and D. Povey, "Musan: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484 (2015).
[12] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," Proc. ICASSP, 5220-5224 (2017).
[13] J. Carletta, "Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus," Lang. Resour. Eval. 41, 181-190 (2007).
[14] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters, "The ICSI meeting corpus," Proc. ICASSP, 364-367 (2003).
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Proc. NIPS, 5998-6008 (2017).
[16] J. G. Fiscus, J. Ajot, and J. S. Garofolo, The Rich Transcription 2007 Meeting Recognition Evaluation (Springer, Maryland, 2007), pp. 373-389.
[17] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M. P. Gill, "Pyannote.audio: Neural building blocks for speaker diarization," Proc. ICASSP, 7124-7128 (2020).
[18] H. Bredin and A. Laurent, "End-to-end speaker segmentation for overlap-aware resegmentation," Proc. Interspeech, 3111-3115 (2021).


Open access (OA) type: GOLD (article published in an open-access journal)
