[Thesis] Transfer Learning of Style Model for Deep Learning-based Speech Synthesis

양진혁

Transfer Learning of Style Model for Deep Learning-based Speech Synthesis

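양진혁 (Handong Global University Graduate School, Department of Information and Communication Engineering, master's thesis, domestic)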

Abstract (Korean version)

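A style model enables a deep learning-based speech synthesis system to generate speech in a specific style. The style model produces a vector representation of the desired speech style. A style vector can be encoded from a reference input speech signal or sampled from the latent space. Modeling speech style raises many issues, including the following. First, it is difficult to obtain a large dataset with ground-truth labels for both text and style. Second, sampling directly from the latent space, without relying on a reference speech, makes it difficult to obtain the desired style.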
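To address the first problem, we propose applying transfer learning to train the style model. To learn a representation of the speech signal in an unsupervised manner, we first train an auto-regressive model to predict the speech signal at the next time step from the signal at previous time steps. We then transfer the knowledge of the auto-regressive model to the reference encoder. After the transfer, the reference encoder can quickly learn to reflect speech style even from a small amount of data. We used a speaker recognizer to evaluate the proposed method quantitatively: the closer the style of the generated speech is to that of the reference speech, the higher the recognition rate. Compared with training from scratch, the proposed method achieved 1.57 times higher accuracy for speakers seen during TTS training and 5.3 times higher accuracy for speakers not seen during TTS training.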
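To address the second problem, we propose a method for sampling style vectors to obtain a desired style. Inspired by PCA, we propose adjusting the style vector along eigenvectors in the latent space. Because eigenvectors with large eigenvalues represent the principal axes of the style distribution of the training data, the proposed method is effective for adjusting the style of the output speech. Experimental results showed that the two eigenvectors with the largest eigenvalues represent speed and pitch, respectively. Style vectors sampled along these eigenvectors led the TTS to produce speech in the corresponding style.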

Abstract (English version)

A style model enables a deep learning-based speech synthesis system to generate speech signals with specific styles. The style model generates a vector representation of the desired speech style. The style vector can be encoded from a reference speech signal or sampled from the latent space. Speech style modeling poses many challenges, including the following. First, it is costly to collect a large speech dataset containing both texts and speech signals with rich style information. Second, it is difficult to sample a style embedding that produces the desired style directly from the latent space, without relying on a reference speech.
To address the first problem, we propose a method that applies transfer learning to train the style model. To learn a representation of the speech signal in an unsupervised manner, the proposed method first trains an auto-regressive model to predict the next part of the speech signal from the preceding part. The method then transfers the knowledge from the auto-regressive model to the reference encoder. Given the transferred knowledge, the reference encoder can rapidly learn the speech style from a small amount of data. We quantitatively evaluated the effect of the proposed method using an independent speaker recognizer: the closer the style of the output speech is to that of the reference speech, the higher the recognition accuracy. Compared with the model trained from scratch, the model trained by the proposed method exhibited 1.57 times higher accuracy on data from speakers seen by the TTS during training, and 5.3 times higher accuracy on data from speakers not seen by the TTS.
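The two-stage procedure described above can be sketched roughly as follows. This is a minimal toy illustration, not the thesis implementation: the tiny tanh encoder, the sine-wave stand-in for unlabeled speech, the DC-offset "style" label, and every name (`ar_enc`, `ref_enc`, `sgd_step`, and so on) are assumptions made for this example only.

```python
import math
import random

random.seed(0)
WINDOW, HIDDEN = 4, 3  # toy sizes, chosen only for illustration

def init_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def encode(enc_w, frames):
    """Encoder shared by both stages: h = tanh(W x) over a window of samples."""
    return [math.tanh(sum(w * x for w, x in zip(row, frames))) for row in enc_w]

def sgd_step(enc_w, head_w, frames, target, lr=0.05):
    """One squared-error SGD step on encoder and head; returns the pre-update loss."""
    h = encode(enc_w, frames)
    err = sum(w * v for w, v in zip(head_w, h)) - target
    for j in range(HIDDEN):
        grad_pre = err * head_w[j] * (1.0 - h[j] ** 2)  # backprop through tanh
        head_w[j] -= lr * err * h[j]
        for i in range(WINDOW):
            enc_w[j][i] -= lr * grad_pre * frames[i]
    return 0.5 * err * err

def mean_loss(enc_w, head_w, pairs):
    """Average squared error without updating any weights."""
    total = 0.0
    for frames, target in pairs:
        h = encode(enc_w, frames)
        total += 0.5 * (sum(w * v for w, v in zip(head_w, h)) - target) ** 2
    return total / len(pairs)

# Stage 1: unsupervised pre-training. Predict sample t from the WINDOW samples before it.
signal = [math.sin(0.3 * t) for t in range(200)]  # stand-in for unlabeled speech
ar_pairs = [(signal[t - WINDOW:t], signal[t]) for t in range(WINDOW, len(signal))]
ar_enc = init_matrix(HIDDEN, WINDOW)
ar_head = init_matrix(1, HIDDEN)[0]
loss_before = mean_loss(ar_enc, ar_head, ar_pairs)
for _ in range(50):
    for frames, target in ar_pairs:
        sgd_step(ar_enc, ar_head, frames, target)
loss_after = mean_loss(ar_enc, ar_head, ar_pairs)

# Stage 2: transfer. The reference encoder starts from a copy of the pre-trained
# encoder weights; only the style head is initialised from scratch.
ref_enc = [row[:] for row in ar_enc]
ref_head = init_matrix(1, HIDDEN)[0]

# Fine-tune on a small labeled set (hypothetical style label: the clip's DC offset).
style_pairs = []
for k in range(8):
    offset = 0.5 if k % 2 else -0.5
    clip = [offset + 0.2 * math.sin(0.3 * (t + k)) for t in range(WINDOW)]
    style_pairs.append((clip, offset))
for _ in range(20):
    for frames, target in style_pairs:
        sgd_step(ref_enc, ref_head, frames, target)

print(f"AR pre-training loss: {loss_before:.4f} -> {loss_after:.4f}")
print(f"style loss after fine-tuning: {mean_loss(ref_enc, ref_head, style_pairs):.4f}")
```

The point of the sketch is the hand-off in stage 2: the encoder weights are copied rather than re-initialised, so the labeled data only has to adapt an encoder that already represents the signal.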
To address the second problem, we propose a method for sampling style vectors that yield the desired style. Inspired by principal component analysis (PCA), we sample the style vector along the eigenvectors of the latent space. The proposed algorithm is effective for controlling the style of the output speech because eigenvectors with large eigenvalues represent prominent axes in the style distribution of the training speech data. The experimental results demonstrated that the two eigenvectors with the largest eigenvalues represent speed and pitch, respectively. The style vectors sampled along these eigenvectors led the TTS to produce output speech in the corresponding style.
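As a rough illustration of sampling along principal axes, the sketch below computes the leading eigenvectors of the covariance of a set of style embeddings and moves away from the mean embedding along one of them. It is a sketch under stated assumptions, not the thesis code: the synthetic 3-dimensional embeddings, the power-iteration eigen-solver, the scaling of the step by the square root of the eigenvalue, and all function names are choices made for this example.

```python
import math
import random

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def covariance(vectors, mean):
    """Sample covariance matrix of the embeddings."""
    d, n = len(mean), len(vectors)
    cov = [[0.0] * d for _ in range(d)]
    for v in vectors:
        c = [x - m for x, m in zip(v, mean)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += c[i] * c[j] / (n - 1)
    return cov

def top_eigenpairs(matrix, k, iters=200):
    """Power iteration with deflation: [(eigenvalue, unit eigenvector)], largest first."""
    d = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [1.0 / math.sqrt(d)] * d  # deterministic start vector
        for _ in range(iters):
            w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(m[i][j] * v[j] for j in range(d)) for i in range(d))
        pairs.append((lam, v))
        for i in range(d):  # deflate: remove the component just found
            for j in range(d):
                m[i][j] -= lam * v[i] * v[j]
    return pairs

def style_along_axis(mean, eigenvalue, eigenvector, alpha):
    """Move alpha standard deviations from the mean along one principal axis.

    Scaling by sqrt(eigenvalue) is an assumption of this sketch.
    """
    step = alpha * math.sqrt(max(eigenvalue, 0.0))
    return [m + step * e for m, e in zip(mean, eigenvector)]

# Synthetic stand-in for style embeddings of the training data.
random.seed(0)
embeddings = [
    [random.gauss(0, 3.0), random.gauss(0, 1.0), random.gauss(0, 0.1)]
    for _ in range(200)
]
mu = mean_vector(embeddings)
pairs = top_eigenpairs(covariance(embeddings, mu), k=2)

# Sweep along the axis with the largest eigenvalue.
for alpha in (-2.0, 0.0, 2.0):
    print(alpha, style_along_axis(mu, *pairs[0], alpha))
```

In a real system each sampled vector would be passed to the TTS in place of a reference-encoder output; which perceptual attribute an axis controls has to be determined by listening or measurement, as the abstract reports for speed and pitch.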

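Thesis information

Author	양진혁
Degree-granting institution	Handong Global University Graduate School
Degree type	Master's (domestic)
Department	Information and Communication Engineering
Advisor	김인중
Year of publication	2019
Total pages	20
Language	eng
Full-text URL	http://www.riss.kr/link?id=T15432135&outLink=K
Source	Korea Education and Research Information Service (KERIS)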
