| Country / Type | United States (US) Patent, Granted |
|---|---|
| International Patent Classification (IPC, 7th ed.) | |
| Application No. | US-0961370 (2015-12-07) |
| Registration No. | US-9697820 (2017-07-04) |
| Inventor / Address | |
| Applicant / Address | |
| Agent / Address | |
| Citation Information | Cited by: 2 / Patents cited: 1970 |
Systems and processes for performing unit-selection text-to-speech synthesis are provided. In one example process, a sequence of target units can represent a spoken pronunciation of text. A set of predicted acoustic model parameters of a second target unit can be determined using a set of acoustic features of a first candidate speech segment of a first target unit and a set of linguistic features of the second target unit. A likelihood score of the second candidate speech segment with respect to the first candidate speech segment can be determined using the set of predicted acoustic model parameters of the second target unit and a set of acoustic features of the second candidate speech segment of the second target unit. The second candidate speech segment can be selected for speech synthesis based on the determined likelihood score. Speech corresponding to the received text can be generated using the selected second candidate speech segment.
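The two core computations in the example process above — predicting acoustic-model parameters for the next target unit from the previous candidate segment's acoustic features plus the next unit's linguistic features, and scoring a candidate segment against those predicted parameters — can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the single linear layer standing in for the statistical model, the feature dimensions, and the diagonal-Gaussian likelihood are all placeholders chosen for demonstration.

```python
import numpy as np

def predict_params(prev_acoustic, cur_linguistic, W, b):
    """Stand-in for the patent's statistical model: maps the previous candidate
    segment's acoustic features and the current target unit's linguistic
    features to predicted acoustic-model parameters (a mean and a variance
    per acoustic feature). A real system would use a trained model, e.g. a
    deep neural network; here a single linear layer is used as a placeholder."""
    x = np.concatenate([prev_acoustic, cur_linguistic])
    out = W @ x + b                 # placeholder model; W, b are assumed weights
    d = out.size // 2
    mean = out[:d]
    var = np.exp(out[d:])           # exponentiate so variances stay positive
    return mean, var

def log_likelihood(acoustic, mean, var):
    """Diagonal-Gaussian log-likelihood of a candidate segment's observed
    acoustic features under the predicted parameters; higher means the
    candidate fits the prediction better."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var)
                               + (acoustic - mean) ** 2 / var))
```

A candidate whose observed acoustic features sit near the predicted mean receives a higher score, which is what lets the search prefer segments that join smoothly without a separate concatenation cost.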
1. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by one or more processors of an electronic device, cause the electronic device to: receive text to be converted to speech; generate a sequence of target units representing a spoken pronunciation of the text; select, from a plurality of speech segments, a first candidate speech segment for a first target unit of the sequence of target units and a second candidate speech segment for a second target unit of the sequence of target units; determine, using a set of acoustic features of the first candidate speech segment and a set of linguistic features of the second target unit, a set of predicted acoustic model parameters of the second target unit; determine, using the set of predicted acoustic model parameters of the second target unit and a set of acoustic features of the second candidate speech segment, a likelihood score of the second candidate speech segment with respect to the first candidate speech segment; select the second candidate speech segment to be used in speech synthesis based on the determined likelihood score; and generate speech corresponding to the received text using the second candidate speech segment.

2. The non-transitory computer-readable storage medium of claim 1, wherein the first target unit precedes the second target unit in the sequence of target units.

3. The non-transitory computer-readable storage medium of claim 1, wherein the predicted acoustic model parameters of the second target unit are determined using a statistical model.

4. The non-transitory computer-readable storage medium of claim 3, wherein the statistical model is generated using recorded speech samples corresponding to a corpus of text.

5. The non-transitory computer-readable storage medium of claim 3, wherein the statistical model is configured to: receive, as inputs, a set of linguistic features of a current target unit and a set of acoustic features of a candidate speech segment of a preceding target unit; and output a set of predicted acoustic model parameters of the current target unit.

6. The non-transitory computer-readable storage medium of claim 5, wherein the statistical model is a deep neural network comprising: an input layer configured to receive as inputs the set of linguistic features of the current target unit and the set of acoustic features of the candidate speech segment of the preceding target unit; an output layer configured to output the set of predicted acoustic model parameters of the current target unit; and at least one hidden layer.

7. The non-transitory computer-readable storage medium of claim 1, wherein the set of predicted acoustic model parameters of the second target unit comprises a set of predicted acoustic features of the second target unit.

8. The non-transitory computer-readable storage medium of claim 1, wherein the set of predicted acoustic model parameters of the second target unit comprises a set of statistical parameters of predicted acoustic features of the second target unit.

9. The non-transitory computer-readable storage medium of claim 8, wherein the set of predicted acoustic model parameters includes a mean of the predicted acoustic features of the second target unit and a variance of the predicted acoustic features of the second target unit.

10. The non-transitory computer-readable storage medium of claim 8, wherein the set of predicted acoustic model parameters includes means of the predicted acoustic features of the second target unit, variances of the predicted acoustic features of the second target unit, and density weights of the predicted acoustic features of the second target unit assuming a model composed by a mixture of probability distributions.

11. The non-transitory computer-readable storage medium of claim 1, wherein the set of predicted acoustic model parameters of the second target unit is determined using only the set of acoustic features of the first candidate speech segment and the set of linguistic features of the second target unit.

12. The non-transitory computer-readable storage medium of claim 1, wherein the one or more programs further comprise instructions that cause the electronic device to: select, from the plurality of speech segments, a third candidate speech segment for a third target unit of the sequence of target units, the third target unit preceding the first target unit in the sequence of target units, wherein the set of predicted acoustic model parameters of the second target unit is further determined using a set of acoustic features of the third candidate speech segment.

13. The non-transitory computer-readable storage medium of claim 1, wherein the likelihood score represents a likelihood of the set of acoustic features of the second candidate speech segment given the set of predicted acoustic model parameters of the second target unit and the set of acoustic features of the first candidate speech segment.

14. The non-transitory computer-readable storage medium of claim 13, wherein the likelihood score is determined by a Gaussian Mixture Model using the set of acoustic features of the second candidate speech segment as an observed set of acoustic features.

15. The non-transitory computer-readable storage medium of claim 1, wherein the likelihood score represents a difference between a set of predicted acoustic features of the second target unit and the set of acoustic features of the second candidate speech segment.

16. The non-transitory computer-readable storage medium of claim 1, wherein the first candidate speech segment and the second candidate speech segment are associated with a maximum accumulated likelihood score, and wherein the maximum accumulated likelihood score is determined based on the likelihood score.

17. The non-transitory computer-readable storage medium of claim 1, wherein the likelihood score is determined using only the set of predicted acoustic model parameters of the second target unit and the set of acoustic features of the second candidate speech segment.

18. The non-transitory computer-readable storage medium of claim 1, wherein the second candidate speech segment is not selected based on a separate concatenation score associated with joining the first candidate speech segment with the second candidate speech segment.

19. The non-transitory computer-readable storage medium of claim 1, wherein the first target unit is associated with a first plurality of candidate speech segments, and wherein the one or more programs further comprise instructions that cause the electronic device to: for each candidate speech segment of the first plurality of candidate speech segments, determine a respective set of predicted acoustic model parameters of the second target unit.

20. The non-transitory computer-readable storage medium of claim 1, wherein the first target unit is associated with a first plurality of candidate speech segments, wherein each candidate speech segment of the first plurality of candidate speech segments is associated with an accumulated likelihood score, and wherein the one or more programs further comprise instructions that cause the electronic device to: for each candidate speech segment in a subset of the first plurality of candidate speech segments, determine a respective set of predicted acoustic model parameters of the second target unit, wherein the subset includes candidate speech segments of the first plurality of candidate speech segments associated with the highest accumulated likelihood scores.

21. The non-transitory computer-readable storage medium of claim 1, wherein the first candidate speech segment and the second candidate speech segment each comprise a segment of recorded speech.

22. The non-transitory computer-readable medium of claim 1, wherein the one or more programs comprising instructions that cause the electronic device to select, from the plurality of speech segments, the first candidate speech segment for the first target unit and the second candidate segment for the second target unit comprises instructions that cause the electronic device to: select the first candidate speech segment for the first target unit based on a degree of matching between a set of linguistic features of the first candidate speech segment and a set of linguistic features of the first target unit; and select the second candidate speech segment for the second target unit based on a degree of matching between a set of linguistic features of the second candidate speech segment and the set of linguistic features of the second target unit.

23. The non-transitory computer-readable medium of claim 1, wherein the one or more programs further comprises instructions that cause the electronic device to: select, from the plurality of speech segments, one or more additional candidate speech segments for the first target unit of the sequence of target units; and select, from the plurality of speech segments, one or more additional candidate speech segments for the second target unit of the sequence of target units.

24. The non-transitory computer-readable medium of claim 23, wherein the one or more programs further comprises instructions that cause the electronic device to: determine, using a set of acoustic features of each of the additional candidate speech segments for the first target unit and the set of linguistic features of the second target unit, a respective set of predicted acoustic model parameters for each of the additional candidate speech segments for the second target unit; and determine, using the respective set of the predicted acoustic model parameters for each of the additional candidate speech segments for the second target unit and a set of acoustic features of a corresponding additional candidate speech segment for the second target unit, a likelihood score of each of the additional candidate speech segments for the second target unit with respect to each of the candidate speech segments for the first target unit.

25. The non-transitory computer-readable medium of claim 24, wherein the one or more programs comprising instructions that cause the electronic device to select the second candidate speech segment to be used in speech synthesis based on the determined likelihood score comprises instructions that cause the electronic device to: determine whether the likelihood score of the second candidate speech segment with respect to the first candidate speech segment maximizes an accumulated likelihood score; and in accordance with a determination that the likelihood score of the second candidate speech segment with respect to the first candidate speech segment maximizes the accumulated likelihood score, select the second candidate speech segment to be used in speech synthesis.

26. A method for performing unit-selection text-to-speech synthesis, comprising: at an electronic device having a processor and memory: receiving text to be converted to speech; generating a sequence of target units representing a spoken pronunciation of the text; selecting, from a plurality of speech segments, a first candidate speech segment for a first target unit of the sequence of target units and a second candidate speech segment for a second target unit of the sequence of target units; determining, using a set of acoustic features of the first candidate speech segment and a set of linguistic features of the second target unit, a set of predicted acoustic model parameters of the second target unit; determining, using the set of predicted acoustic model parameters of the second target unit and a set of acoustic features of the second candidate speech segment, a likelihood score of the second candidate speech segment with respect to the first candidate speech segment; selecting the second candidate speech segment to be used in speech synthesis based on the determined likelihood score; and generating speech corresponding to the received text using the second candidate speech segment.

27. A system for performing unit-selection text-to-speech synthesis, the system comprising: one or more processors; and memory storing one or more programs, wherein the one or more programs include instructions which, when executed by the one or more processors, cause the one or more processors to: receive text to be converted to speech; generate a sequence of target units representing a spoken pronunciation of the text; select, from a plurality of speech segments, a first candidate speech segment for a first target unit of the sequence of target units and a second candidate speech segment for a second target unit of the sequence of target units; determine, using a set of acoustic features of the first candidate speech segment and a set of linguistic features of the second target unit, a set of predicted acoustic model parameters of the second target unit; determine, using the set of predicted acoustic model parameters of the second target unit and a set of acoustic features of the second candidate speech segment, a likelihood score of the second candidate speech segment with respect to the first candidate speech segment; select the second candidate speech segment to be used in speech synthesis based on the determined likelihood score; and generate speech corresponding to the received text using the second candidate speech segment.
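Claims 16, 20, and 25 describe choosing candidates so as to maximize an accumulated likelihood score across the unit sequence, which is naturally implemented as a Viterbi-style dynamic program over the candidate lattice. The sketch below is an assumed illustration of that search, not the patented code: `transition_score(prev, cur)` is a hypothetical callable standing in for the predicted-parameter likelihood described in the claims.

```python
import numpy as np

def select_segments(candidates, transition_score):
    """Viterbi-style search over a candidate lattice.

    candidates: list over target units; each entry is a list of candidate
    speech segments (here any comparable objects) for that unit.
    transition_score(prev, cur): higher-is-better score for following
    candidate `prev` of one unit with candidate `cur` of the next unit.
    Returns the index of the best candidate for each target unit.
    """
    n_units = len(candidates)
    # scores[t][j]: best accumulated score of any path ending at candidate j of unit t
    scores = [np.zeros(len(c)) for c in candidates]
    back = [np.zeros(len(c), dtype=int) for c in candidates]
    for t in range(1, n_units):
        for j, cur in enumerate(candidates[t]):
            step = [scores[t - 1][i] + transition_score(prev, cur)
                    for i, prev in enumerate(candidates[t - 1])]
            back[t][j] = int(np.argmax(step))   # best predecessor for this candidate
            scores[t][j] = max(step)
    # Backtrack from the candidate with the maximum accumulated score.
    path = [int(np.argmax(scores[-1]))]
    for t in range(n_units - 1, 1 - 1, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Pruning the predecessors to the top-scoring subset at each step, as claim 20 describes, would turn this exact search into a beam search; the structure of the recursion is unchanged.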
Copyright KISTI. All Rights Reserved.