[Paper] Comparative study of data augmentation methods for fake audio detection

박관열; 곽일엽

doi:10.5351/kjas.2023.36.2.101


The Korean Journal of Applied Statistics (응용통계연구), v.36 no.2, 2023, pp.101-114

박관열 (Department of Applied Statistics, Chung-Ang University), 곽일엽 (Department of Applied Statistics, Chung-Ang University)

Abstract

Data augmentation exposes a model to varied views of the training set and is widely used to mitigate overfitting. For images, beyond basic transforms such as rotation, cropping, and horizontal or vertical flipping, occlusion-based methods such as Cutmix and Cutout have been proposed. The same occlusion-based methods can be applied to speech models once the 1D speech signal has been converted into a 2D spectrogram; SpecAugment, in particular, is an occlusion-based augmentation method proposed specifically for speech spectrograms. This study compares data augmentation methods that can be used for fake audio detection. Using the datasets of the ASVspoof2017 and ASVspoof2019 challenges, which were held to detect fake audio, we converted the audio into 2D spectrograms, applied the occlusion-based methods Cutout, Cutmix, and SpecAugment, and trained an LCNN model, a lightweight CNN, on the augmented data. All three methods generally improved model performance, although depending on the method, performance sometimes degraded or did not change. Cutmix performed best on ASVspoof2017, Mixup on ASVspoof2019 LA, and SpecAugment on ASVspoof2019 PA. For SpecAugment, increasing the number of masks helped improve performance. In conclusion, the most suitable augmentation method appears to depend on the situation and the data.
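To make the four augmentation methods named in the abstract concrete, the following is a minimal NumPy sketch of how each might act on a 2D spectrogram (frequency bins by time frames). It is an illustration under stated assumptions, not the authors' implementation: the function names, box and mask sizes, the `alpha` value, and the use of zero as the fill value are all hypothetical choices. The SpecAugment part covers only frequency and time masking (time warping is omitted), and the Cutout and Cutmix boxes use a fixed size kept fully inside the spectrogram, whereas the original Cutmix derives the box size from a Beta-distributed mixing ratio.

```python
import numpy as np


def spec_mask(spec, rng, n_freq_masks=1, n_time_masks=1, max_f=8, max_t=16):
    """SpecAugment-style masking: zero random frequency bands and time spans."""
    out = spec.copy()
    n_freq, n_time = out.shape
    for _ in range(n_freq_masks):
        width = int(rng.integers(0, min(max_f, n_freq) + 1))
        start = int(rng.integers(0, n_freq - width + 1))
        out[start:start + width, :] = 0.0
    for _ in range(n_time_masks):
        width = int(rng.integers(0, min(max_t, n_time) + 1))
        start = int(rng.integers(0, n_time - width + 1))
        out[:, start:start + width] = 0.0
    return out


def random_box(shape, rng, box_h, box_w):
    """Pick the top-left corner of a box that lies fully inside the array."""
    n_freq, n_time = shape
    top = int(rng.integers(0, n_freq - box_h + 1))
    left = int(rng.integers(0, n_time - box_w + 1))
    return top, left


def cutout(spec, rng, box_h=16, box_w=16):
    """Cutout: erase one rectangular patch (assumed fill value: zero)."""
    out = spec.copy()
    top, left = random_box(out.shape, rng, box_h, box_w)
    out[top:top + box_h, left:left + box_w] = 0.0
    return out


def cutmix(spec_a, label_a, spec_b, label_b, rng, box_h=16, box_w=16):
    """Cutmix: paste a patch of spec_b into spec_a; mix labels by area."""
    out = spec_a.copy()
    top, left = random_box(out.shape, rng, box_h, box_w)
    out[top:top + box_h, left:left + box_w] = \
        spec_b[top:top + box_h, left:left + box_w]
    lam = 1.0 - (box_h * box_w) / out.size  # share of spec_a that remains
    return out, lam * label_a + (1.0 - lam) * label_b


def mixup(spec_a, label_a, spec_b, label_b, rng, alpha=0.2):
    """Mixup: convex combination of two samples and of their labels."""
    lam = float(rng.beta(alpha, alpha))
    spec = lam * spec_a + (1.0 - lam) * spec_b
    return spec, lam * label_a + (1.0 - lam) * label_b


# Toy data standing in for real spectrograms (values kept strictly positive
# so that zeros in the output can only come from masking).
rng = np.random.default_rng(0)
spec_a = rng.random((64, 100)) + 1.0
spec_b = rng.random((64, 100)) + 1.0
bona_fide = np.array([1.0, 0.0])  # hypothetical one-hot labels
spoof = np.array([0.0, 1.0])

masked = spec_mask(spec_a, rng, n_freq_masks=2, n_time_masks=2)
holed = cutout(spec_a, rng)
mixed, mixed_label = cutmix(spec_a, bona_fide, spec_b, spoof, rng)
blended, blended_label = mixup(spec_a, bona_fide, spec_b, spoof, rng)
```

In this sketch, raising `n_freq_masks` and `n_time_masks` corresponds to the "number of masks" setting that the abstract reports as helpful for SpecAugment. Cutmix and Mixup return soft labels, so a training loop using them would need a loss that accepts label distributions rather than class indices.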


