Country / Kind | United States (US) Patent, Granted
---|---
IPC (7th edition) |
Application number | US-0266930 (2016-09-15)
Registration number | US-9934775 (2018-04-03)
Inventor / address |
Applicant / address |
Attorney / address |
Citation information | Cited by: 2 patents; cites: 2,099 patents
Systems and processes for performing unit-selection text-to-speech synthesis are provided. In an example process, text to be converted to speech is received. The text is represented as a sequence of target units. A plurality of candidate speech segments corresponding to the sequence of target units are selected. Predicted statistical parameters of acoustic features associated with the sequence of target units are determined. The predicted statistical parameters of acoustic features are used to determine target costs and concatenation costs associated with the plurality of candidate speech segments. Based on a combined cost determined from the target costs and concatenation costs, a subset of candidate speech segments is selected from the plurality of candidate speech segments. Speech corresponding to the received text is generated using the subset of candidate speech segments.
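The selection step described in the abstract is, in essence, a lowest-cost path search over a lattice of candidate segments: each target unit contributes a column of candidates, target costs score each candidate in isolation, and concatenation costs score adjacent pairs. The following is a minimal dynamic-programming sketch of that idea, not the patented implementation; all function names and the toy costs in the usage note are hypothetical.

```python
# Minimal sketch of unit selection as a lowest-combined-cost lattice search.
# Names and cost functions are illustrative, not from the patent.

def select_segments(candidates, target_cost, concat_cost):
    """candidates: one list of candidate segments per target unit.
    Returns the segment sequence minimizing the sum of target costs
    plus concatenation costs between consecutive segments."""
    # best[i][j] = (cost of cheapest path ending at candidates[i][j],
    #               index of the chosen predecessor in column i-1)
    best = [[(target_cost(0, s), None) for s in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for seg in candidates[i]:
            tc = target_cost(i, seg)
            # pick the predecessor minimizing accumulated + join cost
            c, p = min(
                (best[i - 1][k][0] + concat_cost(prev, seg), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((c + tc, p))
        best.append(row)
    # backtrack from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]
```

For example, with numeric stand-ins for segments, `target_cost` as distance to a desired value per unit, and `concat_cost` as the jump between neighbors, the search returns the smooth, on-target path through the lattice.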
1. A system for unit-selection text-to-speech synthesis, the system comprising: one or more processors; and memory storing one or more programs, wherein the one or more programs include instructions which, when executed by the one or more processors, cause the one or more processors to: receive text to be converted to speech; generate a sequence of target units representing a spoken pronunciation of the text; determine, based on a plurality of linguistic features associated with each target unit of the sequence of target units, predicted statistical parameters for each of a plurality of acoustic features associated with each target unit, wherein a second acoustic feature of the plurality of acoustic features represents a change of a first acoustic feature of the plurality of acoustic features across a portion of a respective target unit of the sequence of target units; select, based on the plurality of linguistic features associated with each target unit, a plurality of candidate speech segments corresponding to the sequence of target units; for each candidate speech segment of the plurality of candidate speech segments: determine a target cost based on the predicted statistical parameters of the first acoustic feature associated with a respective target unit of the sequence of target units; and determine a plurality of concatenation costs with respect to a plurality of subsequent candidate speech segments, the plurality of concatenation costs determined based on the predicted statistical parameters of the second acoustic feature associated with the respective target unit of the sequence of target units; select from the plurality of candidate speech segments a subset of candidate speech segments for speech synthesis, the selecting based on a combined cost associated with the subset of candidate speech segments, wherein the combined cost is determined based on the target cost and the plurality of concatenation costs of each candidate speech segment; and generate speech corresponding to the received text using the subset of candidate speech segments.

2. The system of claim 1, wherein the portion of the respective target unit is an end portion of the respective target unit.

3. The system of claim 1, wherein the first acoustic feature comprises a fundamental frequency and the second acoustic feature comprises a change in the fundamental frequency across an end portion of the respective target unit.

4. The system of claim 1, wherein the first acoustic feature comprises a mel-frequency cepstral coefficient and the second acoustic feature comprises a change in the mel-frequency cepstral coefficient across an end portion of the respective target unit.

5. The system of claim 1, wherein the plurality of acoustic features include a fundamental frequency at the portion of the respective target unit and a fundamental frequency at a second portion of the respective target unit.

6. The system of claim 1, wherein the plurality of acoustic features includes a first plurality of mel-frequency cepstral coefficients at the portion of the respective target unit and a second plurality of mel-frequency cepstral coefficients at a second portion of the respective target unit.

7. The system of claim 1, wherein the plurality of acoustic features includes a duration of the respective target unit.

8. The system of claim 1, wherein the predicted statistical parameters of the second acoustic feature is not derived from the predicted statistical parameters of the first acoustic feature.

9. The system of claim 1, wherein the predicted statistical parameters for each of the plurality of acoustic features include a mean parameter for each of the plurality of acoustic features and a variance parameter for each of the plurality of acoustic features.

10. The system of claim 1, wherein the target cost for a respective candidate speech segment is based on a weighted difference between an actual value of the first acoustic feature for the respective candidate speech segment and a first predicted statistical parameter of the predicted statistical parameters of the first acoustic feature for the respective target unit, and wherein the weighted difference is weighted by a second predicted statistical parameter of the predicted statistical parameters of the first acoustic feature for the respective target unit.

11. The system of claim 1, wherein a concatenation cost of the plurality of concatenation costs for a respective candidate speech segment includes a second weighted difference between an actual value of the second acoustic feature for the respective candidate speech segment with respect to a subsequent candidate speech segment of the plurality of subsequent candidate speech segments and a first predicted statistical parameter of the predicted statistical parameters of the second acoustic feature for the respective target unit, and wherein the second weighted difference is weighted by a second predicted statistical parameter of the predicted statistical parameters of the second acoustic feature for the respective target unit.

12. The system of claim 11, wherein the actual value of the second acoustic feature for the respective candidate speech segment with respect to the subsequent candidate speech segment of the plurality of subsequent candidate speech segments comprises a difference between an actual value of the first acoustic feature at an end of the respective candidate speech segment and an actual value of the first acoustic feature at a beginning of the subsequent candidate speech segment.

13. The system of claim 1, wherein the predicted statistical parameters for each of the plurality of acoustic features associated with each target unit are determined using a statistical model.

14. The system of claim 13, wherein the statistical model is composed by a mixture of probability distributions.

15. The system of claim 13, wherein the statistical model is configured to: receive, as inputs, the plurality of linguistic features associated with a respective target unit; and output the predicted statistical parameters for each of the plurality of acoustic features associated with the respective target unit.

16. The system of claim 15, wherein the statistical model is further configured to output one or more density weights for each of the plurality of acoustic features associated with the respective target unit.

17. The system of claim 13, wherein the statistical model is a mixture density network comprising: an input layer configured to receive as inputs the plurality of linguistic features associated with a respective target unit; an output layer configured to output the predicted statistical parameters for each of the plurality of acoustic features associated with the respective target unit; and at least one hidden layer between the input layer and the output layer.

18. The system of claim 13, wherein the statistical model is configured to determine, for each target unit, the predicted statistical parameters of the second acoustic feature independent of the predicted statistical parameters of the first acoustic feature.

19. A method for unit-selection text-to-speech synthesis, comprising: at an electronic device having a processor and memory: receiving text to be converted to speech; generating a sequence of target units representing a spoken pronunciation of the text; determining, based on a plurality of linguistic features associated with each target unit of the sequence of target units, predicted statistical parameters for each of a plurality of acoustic features associated with each target unit, wherein a second acoustic feature of the plurality of acoustic features represents a change of a first acoustic feature of the plurality of acoustic features across a portion of a respective target unit of the sequence of target units; selecting, based on the plurality of linguistic features associated with each target unit, a plurality of candidate speech segments corresponding to the sequence of target units; for each candidate speech segment of the plurality of candidate speech segments: determining a target cost based on the predicted statistical parameters of the first acoustic feature associated with the respective target unit of the sequence of target units; and determining a plurality of concatenation costs with respect to a plurality of subsequent candidate speech segments, the plurality of concatenation costs determined based on the predicted statistical parameters of the second acoustic feature associated with the respective target unit of the sequence of target units; selecting from the plurality of candidate speech segments a subset of candidate speech segments for speech synthesis, the selecting based on a combined cost associated with the subset of candidate speech segments, wherein the combined cost is determined based on the target cost and the plurality of concatenation costs of each candidate speech segment; and generating speech corresponding to the received text using the subset of candidate speech segments.

20. The method of claim 19, wherein the target cost for a respective candidate speech segment is based on a weighted difference between an actual value of the first acoustic feature for the respective candidate speech segment and a first predicted statistical parameter of the predicted statistical parameters of the first acoustic feature for the respective target unit, and wherein the weighted difference is weighted by a second predicted statistical parameter of the predicted statistical parameters of the first acoustic feature for the respective target unit.

21. The method of claim 19, wherein a concatenation cost of the plurality of concatenation costs for a respective candidate speech segment includes a second weighted difference between an actual value of the second acoustic feature for the respective candidate speech segment with respect to a subsequent candidate speech segment of the plurality of subsequent candidate speech segments and a first predicted statistical parameter of the predicted statistical parameters of the second acoustic feature for the respective target unit, and wherein the second weighted difference is weighted by a second predicted statistical parameter of the predicted statistical parameters of the second acoustic feature for the respective target unit.

22. The method of claim 21, wherein the actual value of the second acoustic feature for the respective candidate speech segment with respect to the subsequent candidate speech segment of the plurality of subsequent candidate speech segments comprises a difference between an actual value of the first acoustic feature at an end of the respective candidate speech segment and an actual value of the first acoustic feature at a beginning of the subsequent candidate speech segment.

23. The method of claim 19, wherein the portion of the respective target unit is an end portion of the respective target unit.

24. A non-transitory computer-readable storage medium comprising computer-readable instructions which, when executed by one or more processors, cause the one or more processors to: receive text to be converted to speech; generate a sequence of target units representing a spoken pronunciation of the text; determine, based on a plurality of linguistic features associated with each target unit of the sequence of target units, predicted statistical parameters for each of a plurality of acoustic features associated with each target unit, wherein a second acoustic feature of the plurality of acoustic features represents a change of a first acoustic feature of the plurality of acoustic features across a portion of a respective target unit of the sequence of target units; select, based on the plurality of linguistic features associated with each target unit, a plurality of candidate speech segments corresponding to the sequence of target units; for each candidate speech segment of the plurality of candidate speech segments: determine a target cost based on the predicted statistical parameters of the first acoustic feature associated with the respective target unit of the sequence of target units; and determine a plurality of concatenation costs with respect to a plurality of subsequent candidate speech segments, the plurality of concatenation costs determined based on the predicted statistical parameters of the second acoustic feature associated with the respective target unit of the sequence of target units; select from the plurality of candidate speech segments a subset of candidate speech segments for speech synthesis, the selecting based on a combined cost associated with the subset of candidate speech segments, wherein the combined cost is determined based on the target cost and the plurality of concatenation costs of each candidate speech segment; and generate speech corresponding to the received text using the subset of candidate speech segments.

25. The computer-readable storage medium of claim 24, wherein the portion of the respective target unit is an end portion of the respective target unit.

26. The computer-readable storage medium of claim 24, wherein the target cost for a respective candidate speech segment is based on a weighted difference between an actual value of the first acoustic feature for the respective candidate speech segment and a first predicted statistical parameter of the predicted statistical parameters of the first acoustic feature for the respective target unit, and wherein the weighted difference is weighted by a second predicted statistical parameter of the predicted statistical parameters of the first acoustic feature for the respective target unit.

27. The computer-readable storage medium of claim 24, wherein a concatenation cost of the plurality of concatenation costs for a respective candidate speech segment includes a second weighted difference between an actual value of the second acoustic feature for the respective candidate speech segment with respect to a subsequent candidate speech segment of the plurality of subsequent candidate speech segments and a first predicted statistical parameter of the predicted statistical parameters of the second acoustic feature for the respective target unit, and wherein the second weighted difference is weighted by a second predicted statistical parameter of the predicted statistical parameters of the second acoustic feature for the respective target unit.

28. The computer-readable storage medium of claim 27, wherein the actual value of the second acoustic feature for the respective candidate speech segment with respect to the subsequent candidate speech segment of the plurality of subsequent candidate speech segments comprises a difference between an actual value of the first acoustic feature at an end of the respective candidate speech segment and an actual value of the first acoustic feature at a beginning of the subsequent candidate speech segment.
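Claims 10-12 (and their method and medium counterparts) describe the two costs concretely: the target cost is a difference between a candidate's actual feature value and the predicted mean, weighted by a second predicted parameter (per claim 9, a variance), and the concatenation cost compares the actual feature jump across a join against the predicted delta-feature parameters. The claims only require a "weighted difference"; the sketch below assumes one plausible form (squared error scaled by the predicted variance, so low-variance predictions are trusted more), and all names are illustrative.

```python
# Illustrative variance-weighted costs in the style of claims 10-12.
# The squared-error-over-variance form is an assumption; the claims
# specify only a difference weighted by a predicted parameter.

def target_cost(actual_f0, predicted_mean, predicted_var):
    """Penalize deviation of a candidate segment's feature (e.g. f0)
    from the predicted mean, scaled by the predicted variance."""
    return (actual_f0 - predicted_mean) ** 2 / predicted_var

def concat_cost(f0_end_prev, f0_start_next, delta_mean, delta_var):
    """Per claim 12, the actual delta feature is the feature value at
    the end of one segment minus its value at the start of the next;
    compare that jump to the predicted delta distribution."""
    actual_delta = f0_end_prev - f0_start_next
    return (actual_delta - delta_mean) ** 2 / delta_var
```

Note how this realizes the independence in claims 8 and 18: the delta parameters (`delta_mean`, `delta_var`) are predicted in their own right, not computed from the predicted means of the first feature.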
Copyright KISTI. All Rights Reserved.