[Paper] Masked cross self-attentive encoding based speaker embedding for speaker verification

Soonshin Seo; Ji-Hwan Kim

doi:10.7776/ask.2020.39.5.497

Masked cross self-attentive encoding based speaker embedding for speaker verification

The Journal of the Acoustical Society of Korea (한국음향학회지), v.39 no.5, 2020, pp.497-504

Soonshin Seo (Sogang University), Ji-Hwan Kim (Sogang University)

Abstract

Constructing speaker embeddings is an important issue in speaker verification. In general, a self-attention mechanism has been applied for speaker embedding encoding. Previous studies focused on training the self-attention in a high-level layer, such as the last pooling layer. In this case, the effect of low-level layers is not well represented in the speaker embedding encoding. In this study, we propose Masked Cross Self-Attentive Encoding (MCSAE) using ResNet. It focuses on training the features of both high-level and low-level layers. Based on multi-layer aggregation, the output features of each residual layer are used as inputs to the MCSAE. In the MCSAE, the interdependence of the input features is trained by a cross self-attention module. A random masking regularization module is also applied to prevent overfitting. The MCSAE enhances the weight of frames representing the speaker information. The output features are then concatenated and encoded into the speaker embedding. As a result, a more informative speaker embedding is encoded by using the MCSAE. The experimental results showed an equal error rate of 2.63 % on the VoxCeleb1 evaluation dataset, an improvement over previous self-attentive encoding and state-of-the-art methods.

Tables/Figures (6)

Fig. 1. Overview of the proposed network using MCSAE.
Table 1. Proposed model architecture using MCSAE (D: dimension of input feature, L: length of input feature, N: number of speakers, SE: speaker embedding).
Fig. 2. Overview of the proposed MCSAE (dashed box: self-attention module, matmul: matrix multiplication).
Fig. 3. Overview of the proposed random masking regularization module.
Table 2. Experimental results compared with previous encodings including SAP (Dim: dimension of speaker embedding).
Table 3. Experimental results compared with state-of-the-art methods (*These models used the VoxCeleb1 training dataset, which is smaller than the VoxCeleb2 dataset).

AI Summary of the Main Text

* Sentences identified automatically by AI may include unsuitable ones; please use them with care.

Proposed Method

^[28] It does not use additional methods after extracting the speaker embedding, such as those in References [10] and [14]. From the trained model, we extracted a speaker embedding and evaluated it using the cosine similarity metric, reporting performance as the equal error rate (EER, %).
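The scoring step described here, cosine similarity between two embeddings followed by an equal error rate over a list of trials, can be illustrated with a minimal sketch. The function names, the threshold sweep, and the toy scores are assumptions made for illustration; they are not taken from the paper, and official evaluation tools may interpolate the EER differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def equal_error_rate(scores, labels):
    """Approximate EER in percent.

    scores: one similarity per trial; labels: 1 = same speaker, 0 = different.
    Each observed score is tried as a threshold (accept if score >= threshold);
    the result is the mean of the false acceptance and false rejection rates
    at the threshold where the two rates are closest.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        false_accepts = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= threshold)
        false_rejects = sum(1 for s, l in zip(scores, labels) if l == 1 and s < threshold)
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return 100.0 * eer

# Hypothetical toy trials: three same-speaker and three different-speaker pairs.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

The sweep is quadratic in the number of trials, which is fine for a toy list but would be replaced by a sorted, cumulative computation for a full trial list.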
Table 3 shows the results of the comparison with state-of-the-art encodings. Here, we focused on speaker embedding encodings using a CNN-based model with the softmax loss function. These models use various encoding approaches such as TAP,^[26] NetVLAD,^[11] and GhostVLAD.
In this study, we proposed a new SAP-derived method for speaker embedding encoding called MCSAE. The model focuses on training both high-level and low-level layers in the ResNet architecture in order to encode a more informative speaker embedding. In the MCSAE, the cross self-attention module improves the concentration of the speaker information by training the interdependence among the features of each residual layer.
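To make the ingredients concrete, here is a minimal sketch of plain self-attentive pooling with random frame masking, applied to the features of several layers and concatenated. It is not the authors' MCSAE formulation: the excerpt does not specify how the cross-layer interaction is computed, so that part is omitted, and the scoring function, the `context` vectors, and all names are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pool(frames, context, mask_prob=0.0, rng=None):
    """Pool a (frames x dim) sequence into one dim-sized vector.

    context is a hypothetical learned vector, fixed here for illustration.
    With mask_prob > 0, frames are randomly excluded from the attention,
    a simple stand-in for a random masking regularizer used during training.
    """
    if rng is None:
        rng = random.Random()
    scores = [sum(c * math.tanh(x) for c, x in zip(context, frame)) for frame in frames]
    keep = [rng.random() >= mask_prob for _ in frames]
    if not any(keep):
        keep[rng.randrange(len(frames))] = True  # always keep at least one frame
    masked = [s if k else float("-inf") for s, k in zip(scores, keep)]
    weights = softmax(masked)
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    return pooled, weights

def aggregate_layers(layer_features, contexts, mask_prob=0.0, rng=None):
    """Pool each layer's features separately and concatenate the results."""
    embedding = []
    for frames, context in zip(layer_features, contexts):
        pooled, _ = attentive_pool(frames, context, mask_prob, rng)
        embedding.extend(pooled)
    return embedding

# Hypothetical features from a low-level and a high-level layer.
low = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.4]]   # 3 frames, dimension 2
high = [[1.0, 0.0, 0.5], [0.2, 0.3, 0.1]]    # 2 frames, dimension 3
embedding = aggregate_layers([low, high], [[1.0, 1.0], [0.5, 0.5, 0.5]])
```

Masking is applied to the attention scores before the softmax, so the remaining weights still sum to one; at inference time `mask_prob` would be left at 0.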
We evaluated the proposed method using the VoxCeleb1 evaluation dataset, which contains 40 speakers and 37,220 pairs in the official test protocol.^[27]

Target Data

In this study, we trained the proposed model using the VoxCeleb2 dataset,^[26] which contains over 1 million utterances from 5,994 celebrities. These are large-scale text-independent SV datasets collected from YouTube.
1 and Table 1. The proposed model has 4 residual layers, 16 residual blocks, and half the number of channels of a standard ResNet-34. Each residual block consists of convolution layers, batch normalizations, and leaky ReLU activation functions (LReLU).
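As a rough illustration of the stated layout, the sketch below derives a half-width configuration from the standard ResNet-34 design. The per-layer block counts (3, 4, 6, 3) and channel widths (64 to 512) are those of standard ResNet-34; that the proposed model distributes its 16 blocks the same way is an assumption, since the excerpt gives only the totals.

```python
# Standard ResNet-34 layout: blocks and output channels per residual layer.
RESNET34_BLOCKS = [3, 4, 6, 3]
RESNET34_CHANNELS = [64, 128, 256, 512]

def half_width_config(blocks, channels):
    """Return per-layer (num_blocks, channels) with the channel count halved."""
    return [(b, c // 2) for b, c in zip(blocks, channels)]

# Assumed layout of the proposed model: same block counts, half the channels.
config = half_width_config(RESNET34_BLOCKS, RESNET34_CHANNELS)
for index, (num_blocks, width) in enumerate(config, start=1):
    print(f"residual layer {index}: {num_blocks} blocks, {width} channels")
```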

Theory/Model

It is trained using a DNN-based speaker classifier. The activations of the last hidden layer are then encoded as the speaker embedding.
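The idea of reading the embedding off the last hidden layer can be sketched as follows. The toy weights, the ReLU activation, and the averaging over frames are assumptions for illustration; the excerpt states only that last-hidden-layer activations of a trained speaker classifier serve as the embedding.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def last_hidden_activations(frame, hidden_layers):
    """Run one frame through the hidden layers only.

    The speaker-classification output layer used during training is skipped.
    """
    activations = frame
    for weights, biases in hidden_layers:
        activations = relu(dense(activations, weights, biases))
    return activations

def utterance_embedding(frames, hidden_layers):
    """Average the last-hidden-layer activations over all frames (assumed pooling)."""
    per_frame = [last_hidden_activations(f, hidden_layers) for f in frames]
    dim = len(per_frame[0])
    return [sum(a[d] for a in per_frame) / len(per_frame) for d in range(dim)]

# Hypothetical toy network: 2 inputs -> 2 hidden units, one hidden layer.
hidden_layers = [([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0])]
frames = [[1.0, 2.0], [3.0, -4.0]]
embedding = utterance_embedding(frames, hidden_layers)
```

In a trained system the hidden-layer weights come from the classifier; here they are fixed by hand so the computation can be followed.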

References (28)

[1] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, 10, 19-41 (2000).
[2] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, "SVM based speaker verification using a GMM supervector kernel and NAP variability compensation," Proc. ICASSP, 97-100 (2006).
[3] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Factor analysis simplified speaker verification applications," Proc. ICASSP, 637-640 (2005).
[4] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Trans. on Audio, Speech, and Language Processing, 19, 788-798 (2011).
[5] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," Proc. Interspeech, 249-252 (2011).
[6] E. Variani, X. Lei, E. McDermott, I. Lopez-Moreno, and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," Proc. ICASSP, 4052-4056 (2014).
[7] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," Proc. ICASSP, 5329-5333 (2018).
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. CVPR, 770-778 (2016).
[9] W. Cai, J. Chen, and M. Li, "Exploring the encoding layer and loss function in end-to-end speaker and language recognition system," Proc. Odyssey, 74-81 (2018).
[10] W. Cai, J. Chen, and M. Li, "Analysis of length normalization in end-to-end speaker verification system," Proc. Interspeech, 3618-3622 (2018).
[11] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, "Utterance-level aggregation for speaker recognition in the wild," Proc. ICASSP, 5791-5795 (2019).
[12] I. Kim, K. Kim, J. Kim, and C. Choi, "Deep representation using orthogonal decomposition and recombination for speaker verification," Proc. ICASSP, 6126-6130 (2019).
[13] Y. Jung, Y. Kim, H. Lim, Y. Choi, and H. Kim, "Spatial pyramid encoding with convex length normalization for text-independent speaker verification," Proc. Interspeech, 4030-4034 (2019).
[14] S. Seo, D. J. Rim, M. Lim, D. Lee, H. Park, J. Oh, C. Kim, and J. Kim, "Shortcut connections based deep speaker embeddings for end-to-end speaker verification system," Proc. Interspeech, 2928-2932 (2019).
[15] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual attention network for image classification," Proc. CVPR, 3156-3164 (2017).
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Proc. NeurIPS, 5998-6008 (2017).
[17] Z. Lin, M. Feng, C. N. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, "A structured self-attentive sentence embedding," Proc. ICLR (2017).
[18] K. Lee, X. Chen, G. Hua, H. Hu, and X. He, "Stacked cross attention for image-text matching," Proc. ECCV, 201-216 (2018).
[19] L. Bao, B. Ma, H. Chang, and X. Chen, "Masked graph attention network for person re-identification," Proc. CVPR (2019).
[20] S. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, "End-to-end attention based text-dependent speaker verification," Proc. SLT, 171-178 (2016).
[21] G. Bhattacharya, J. Alam, and P. Kenny, "Deep speaker embeddings for short-duration speaker verification," Proc. Interspeech, 1517-1521 (2017).
[22] F. R. Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, "Attention-based models for text-dependent speaker verification," Proc. ICASSP, 5359-5363 (2018).
[23] K. Okabe, T. Koshinaka, and K. Shinoda, "Attentive statistics pooling for deep speaker embedding," Proc. Interspeech, 2252-2256 (2018).
[24] Y. Zhu, T. Ko, D. Snyder, B. Mak, and D. Povey, "Self-attentive speaker embeddings for text-independent speaker verification," Proc. Interspeech, 3573-3577 (2018).
[25] M. India, P. Safari, and J. Hernando, "Self multi-head attention for speaker recognition," Proc. Interspeech, 4305-4309 (2019).
[26] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," Proc. Interspeech, 1086-1090 (2018).
[27] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," Proc. Interspeech, 2616-2620 (2017).
[28] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," Proc. NeurIPS, 8024-8035 (2019).
