| Country / Type | United States (US) Patent, Granted |
|---|---|
| IPC (7th edition) | |
| Application No. (Date) | US-0156161 (2016-05-16) |
| Registration No. (Date) | US-10049668 (2018-08-14) |
| Inventor / Address | |
| Applicant / Address | |
| Agent / Address | |
| Citations | Cited by 2 patents; cites 2,084 patents |
Systems and processes for converting speech-to-text are provided. In one example process, speech input can be received. A sequence of states and arcs of a weighted finite state transducer (WFST) can be traversed. A negating finite state transducer (FST) can be traversed. A virtual FST can be composed using a neural network language model and based on the sequence of states and arcs of the WFST. The one or more virtual states of the virtual FST can be traversed to determine a probability of a candidate word given one or more history candidate words. Text corresponding to the speech input can be determined based on the probability of the candidate word given the one or more history candidate words. An output can be provided based on the text corresponding to the speech input.
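The abstract's core idea is on-the-fly rescoring in the tropical semiring, where WFST weights are negative log probabilities and composition adds them: the "negating" FST cancels the background language-model score baked into the static WFST, and the virtual FST substitutes the neural language model's score. Below is a minimal sketch of that arithmetic, not the patent's implementation; all names and the example probabilities are illustrative.

```python
# Minimal sketch of the rescoring idea: in the tropical semiring, path
# weights are -log probabilities and composition adds them, so "negating"
# the background LM score and adding the neural LM score re-weights a
# hypothesis on the fly.
import math

def rescore_path(acoustic_nll: float, background_lm_nll: float,
                 neural_lm_nll: float) -> float:
    """Return the rescored path weight (negative log likelihood).

    acoustic_nll      -log P(audio | words), from the WFST traversal
    background_lm_nll -log P(words), from the LM baked into the static WFST
    neural_lm_nll     -log P(words), from the neural network language model
    """
    # The negating FST contributes +log P(words) for the background LM,
    # cancelling background_lm_nll; the virtual FST then adds the neural
    # LM's weight in its place.
    return acoustic_nll - background_lm_nll + neural_lm_nll

# Example: a hypothesis the background LM under-scores ends up with a
# lower (better) total weight once the neural LM's score replaces it.
acoustic = -math.log(0.2)
ngram    = -math.log(0.01)   # background LM finds the word sequence unlikely
neural   = -math.log(0.04)   # neural LM finds it more plausible
print(rescore_path(acoustic, ngram, neural))
```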
1. A non-transitory computer-readable medium having instructions stored thereon, the instructions, when executed by one or more processors, cause the one or more processors to: receive speech input; traverse, based on the speech input, a sequence of states and arcs of a weighted finite state transducer (WFST), wherein: the sequence of states and arcs represents one or more history candidate words and a current candidate word; and a first probability of the candidate word given the one or more history candidate words is determined by traversing the sequence of states and arcs of the WFST; traverse a negating finite state transducer (FST), wherein traversing the negating FST negates the first probability of the candidate word given the one or more history candidate words; compose a virtual FST using a neural network language model and based on the sequence of states and arcs of the WFST, wherein one or more virtual states of the virtual FST represent the current candidate word; traverse the one or more virtual states of the virtual FST, wherein a second probability of the candidate word given the one or more history candidate words is determined by traversing the one or more virtual states of the virtual FST; determine, based on the second probability of the candidate word given the one or more history candidate words, text corresponding to the speech input; based on the determined text, perform one or more tasks to obtain a result; and cause the result to be presented in spoken or visual form.

2. The non-transitory computer-readable medium of claim 1, wherein the virtual FST is composed after traversing the sequence of states and arcs of the WFST.

3. The non-transitory computer-readable medium of claim 1, wherein only one arc transitions out of each virtual state of the one or more virtual states of the virtual FST.

4. The non-transitory computer-readable medium of claim 1, wherein the instructions further cause the one or more processors to: determine, using the neural network language model, a third probability of the candidate word given the one or more history candidate words, wherein the virtual FST is composed using the third probability of the candidate word given the one or more history candidate words.

5. The non-transitory computer-readable medium of claim 1, wherein the instructions further cause the one or more processors to: compose a second virtual FST using a second language model and based on the sequence of states and arcs, wherein one or more virtual states of the second virtual FST represents the current candidate word; and traverse the one or more virtual states of the second virtual FST, wherein a fourth probability of the candidate word given the one or more history candidate words is determined by traversing the one or more virtual states of the second virtual FST, and wherein the text corresponding to the speech input is determined based on the fourth probability of the candidate word given the one or more history candidate words.
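Claims 1–3 describe the virtual FST as being composed on demand, after the WFST traversal, with at most one arc leaving each virtual state (a weighted chain). The sketch below illustrates that shape under stated assumptions: the classes, the unit split, the decision to put the whole LM weight on the first arc, and the toy neural LM are all hypothetical, not the patent's data structures.

```python
# Illustrative sketch of "composing a virtual FST" for one candidate word:
# virtual states are created after the WFST traversal, and each has at most
# one outgoing arc, so the structure is a simple weighted chain.
import math
from dataclasses import dataclass, field

@dataclass
class Arc:
    label: str
    weight: float            # -log probability
    next_state: "VirtualState"

@dataclass
class VirtualState:
    arcs: list = field(default_factory=list)  # at most one arc per state

def compose_virtual_chain(history: tuple, candidate_units: list,
                          neural_lm_prob) -> VirtualState:
    """Build a chain of virtual states for one candidate word's units,
    carrying the neural LM's -log P(word | history) on its arcs."""
    word = "".join(candidate_units)
    weight = -math.log(neural_lm_prob(word, history))
    start = VirtualState()
    state = start
    for i, unit in enumerate(candidate_units):
        nxt = VirtualState()
        # Put the whole LM weight on the first arc; later arcs are free.
        state.arcs.append(Arc(unit, weight if i == 0 else 0.0, nxt))
        state = nxt
    return start

# Toy stand-in for the neural network language model.
def toy_neural_lm(word, history):
    return 0.3 if history[-1:] == ("play",) else 0.01

chain = compose_virtual_chain(("please", "play"), ["mu", "sic"], toy_neural_lm)
assert all(len(s.arcs) <= 1 for s in [chain, chain.arcs[0].next_state])
```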
6. The non-transitory computer-readable medium of claim 5, wherein the instructions further cause the one or more processors to: interpolate the second probability of the candidate word given the one or more history candidate words and the fourth probability of the candidate word given the one or more history candidate words, and wherein: a combined probability of the candidate word given the one or more history candidate words is determined by the interpolating; and the text corresponding to the speech input is determined based on the combined probability of the candidate word given the one or more history candidate words.

7. The non-transitory computer-readable medium of claim 5, wherein the second language model is an n-gram language model.

8. The non-transitory computer-readable medium of claim 1, wherein the instructions further cause the one or more processors to: compose the negating FST with the WFST prior to traversing the negating FST.

9. The non-transitory computer-readable medium of claim 8, wherein the virtual FST is composed prior to traversing the one or more virtual states of the virtual FST.

10. The non-transitory computer-readable medium of claim 1, wherein the WFST is a static finite state transducer built prior to receiving the speech input.

11. The non-transitory computer-readable medium of claim 1, wherein the negating FST is a static finite state transducer built prior to receiving the speech input.

12. The non-transitory computer-readable medium of claim 1, wherein the WFST is a single finite state transducer composed from a Hidden Markov Model (HMM) topology, a context dependent phonetic model, a lexicon, and a language model.

13. The non-transitory computer-readable medium of claim 12, wherein the language model is a unigram language model or a bigram language model.

14. A non-transitory computer-readable medium having instructions stored thereon, the instructions, when executed by one or more processors, cause the one or more processors to: receive speech input; traverse, based on the speech input, a sequence of states and arcs of a weighted finite state transducer (WFST), wherein: the sequence of states and arcs represents one or more history candidate words and a non-terminal class; and a first probability of the non-terminal class given the one or more history candidate words is determined by traversing the sequence of states and arcs of the WFST; traverse a negating finite state transducer (FST), wherein traversing the negating FST negates the first probability of the non-terminal class given the one or more history candidate words; compose a virtual FST using a neural network language model and a user-specific language model FST, and based on the sequence of states and arcs of the WFST, wherein one or more virtual states of the virtual FST represent a current candidate word corresponding to the non-terminal class; traverse the one or more virtual states of the virtual FST, wherein a probability of the current candidate word given the one or more history candidate words and the non-terminal class is determined by traversing the one or more virtual states of the virtual FST; determine, based on the probability of the current candidate word given the one or more history candidate words and the non-terminal class, text corresponding to the speech input; based on the determined text, perform one or more tasks to obtain a result; and cause the result to be presented in spoken or visual form.
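Claims 6–7 combine the neural LM's probability with a second (e.g. n-gram) LM's probability by interpolation. A minimal sketch, assuming simple linear interpolation; the mixing weight `lam` is an assumption, since the claims do not fix how the interpolation is weighted.

```python
# Linear interpolation of two language models' estimates of the same
# P(word | history); `lam` is an illustrative mixing weight.
def interpolate(p_neural: float, p_ngram: float, lam: float = 0.5) -> float:
    """Combined P(word | history) from two language models."""
    return lam * p_neural + (1.0 - lam) * p_ngram

# e.g. neural LM says 0.05, 4-gram LM says 0.02:
print(interpolate(0.05, 0.02))   # 0.035
```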
15. The non-transitory computer-readable medium of claim 14, wherein the instructions further cause the one or more processors to: determine, using the neural network language model, a second probability of the non-terminal class given the one or more history candidate words, wherein the virtual FST is composed using the second probability of the non-terminal class given the one or more history candidate words.

16. The non-transitory computer-readable medium of claim 14, wherein the instructions further cause the one or more processors to: traverse the user-specific language model FST, wherein a probability of the current candidate word among a plurality of candidate words represented in the user-specific language model FST is determined by traversing the user-specific language model FST, and wherein the virtual FST is composed using the probability of the current candidate word among the plurality of candidate words represented in the user-specific language model FST.

17. The non-transitory computer-readable medium of claim 14, wherein the one or more virtual states of the virtual FST are composed using phone-word units from the WFST and based on the current candidate word represented in the user-specific language model FST.

18. The non-transitory computer-readable medium of claim 14, wherein the instructions further cause the one or more processors to: prior to receiving the speech input: receive user-specific data; and generate the user-specific language model FST using the user-specific data.

19. A method for performing speech-to-text conversion, the method comprising: at an electronic device having a processor and memory: receiving speech input; traversing, based on the speech input, a sequence of states and arcs of a weighted finite state transducer (WFST), wherein: the sequence of states and arcs represents one or more history candidate words and a current candidate word; and a first probability of the candidate word given the one or more history candidate words is determined by traversing the sequence of states and arcs of the WFST; traversing a negating finite state transducer (FST), wherein traversing the negating FST negates the first probability of the candidate word given the one or more history candidate words; composing a virtual FST using a neural network language model and based on the sequence of states and arcs of the WFST, wherein one or more virtual states of the virtual FST represent the current candidate word; traversing the one or more virtual states of the virtual FST, wherein a second probability of the candidate word given the one or more history candidate words is determined by traversing the one or more virtual states of the virtual FST; determining, based on the second probability of the candidate word given the one or more history candidate words, text corresponding to the speech input; based on the determined text, performing one or more tasks to obtain a result; and causing the result to be presented in spoken or visual form.
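Claim 18 adds an offline step: before any speech arrives, user-specific data is turned into the user-specific language model FST. A minimal sketch of that step, assuming contact names as the user data and a flat unigram distribution; a real system could instead weight entries, e.g. by frequency of use.

```python
# Sketch of the offline step in claim 18: build a tiny unigram LM over
# user-specific words, standing in for a user-specific language model FST.
import math

def build_user_lm_fst(user_words):
    """Return {word: -log P(word)} for a flat unigram over user words."""
    p = 1.0 / len(user_words)
    return {w: -math.log(p) for w in user_words}

# Hypothetical user data (contact names).
user_fst = build_user_lm_fst(["Anna", "Bastian", "Chandrika"])
print(user_fst["Anna"])   # -log(1/3)
```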
20. An electronic device comprising: one or more processors; and memory having instructions stored thereon, the instructions, when executed by the one or more processors, cause the one or more processors to: receive speech input; traverse, based on the speech input, a sequence of states and arcs of a weighted finite state transducer (WFST), wherein: the sequence of states and arcs represents one or more history candidate words and a current candidate word; and a first probability of the candidate word given the one or more history candidate words is determined by traversing the sequence of states and arcs of the WFST; traverse a negating finite state transducer (FST), wherein traversing the negating FST negates the first probability of the candidate word given the one or more history candidate words; compose a virtual FST using a neural network language model and based on the sequence of states and arcs of the WFST, wherein one or more virtual states of the virtual FST represent the current candidate word; traverse the one or more virtual states of the virtual FST, wherein a second probability of the candidate word given the one or more history candidate words is determined by traversing the one or more virtual states of the virtual FST; determine, based on the second probability of the candidate word given the one or more history candidate words, text corresponding to the speech input; based on the determined text, perform one or more tasks to obtain a result; and cause the result to be presented in spoken or visual form.

21. The method of claim 19, wherein the virtual FST is composed after traversing the sequence of states and arcs of the WFST.

22. The method of claim 19, wherein only one arc transitions out of each virtual state of the one or more virtual states of the virtual FST.

23. The method of claim 19, further comprising: composing the negating FST with the WFST prior to traversing the negating FST.

24. The method of claim 19, wherein the WFST is a static finite state transducer built prior to receiving the speech input.

25. The method of claim 19, wherein the WFST is a single finite state transducer composed from a Hidden Markov Model (HMM) topology, a context dependent phonetic model, a lexicon, and a language model.

26. The electronic device of claim 20, wherein the virtual FST is composed after traversing the sequence of states and arcs of the WFST.

27. The electronic device of claim 20, wherein only one arc transitions out of each virtual state of the one or more virtual states of the virtual FST.

28. The electronic device of claim 20, wherein the one or more programs further include instructions that when executed by the one or more processors, cause the one or more processors to: compose the negating FST with the WFST prior to traversing the negating FST.

29. The electronic device of claim 20, wherein the WFST is a static finite state transducer built prior to receiving the speech input.

30. The electronic device of claim 20, wherein the WFST is a single finite state transducer composed from a Hidden Markov Model (HMM) topology, a context dependent phonetic model, a lexicon, and a language model.
31. A method for performing speech-to-text conversion, the method comprising: at an electronic device having a processor and memory: receiving speech input; traversing, based on the speech input, a sequence of states and arcs of a weighted finite state transducer (WFST), wherein: the sequence of states and arcs represents one or more history candidate words and a non-terminal class; and a first probability of the non-terminal class given the one or more history candidate words is determined by traversing the sequence of states and arcs of the WFST; traversing a negating finite state transducer (FST), wherein traversing the negating FST negates the first probability of the non-terminal class given the one or more history candidate words; composing a virtual FST using a neural network language model and based on the sequence of states and arcs of the WFST, wherein one or more virtual states of the virtual FST represent a current candidate word corresponding to the non-terminal class; traversing the one or more virtual states of the virtual FST, wherein a probability of the current candidate word given the one or more history candidate words and the non-terminal class is determined by traversing the one or more virtual states of the virtual FST; determining, based on the probability of the candidate word given the one or more history candidate words and the non-terminal class, text corresponding to the speech input; based on the determined text, performing one or more tasks to obtain a result; and causing the result to be presented in spoken or visual form.

32. The method of claim 31, further comprising: determining, using the neural network language model, a second probability of the non-terminal class given the one or more history candidate words, wherein the virtual FST is composed using the second probability of the non-terminal class given the one or more history candidate words.

33. The method of claim 31, further comprising: traversing the user-specific language model FST, wherein a probability of the current candidate word among a plurality of candidate words represented in the user-specific language model FST is determined by traversing the user-specific language model FST, and wherein the virtual FST is composed using the probability of the current candidate word among the plurality of candidate words represented in the user-specific language model FST.

34. The method of claim 31, wherein the one or more virtual states of the virtual FST are composed using phone-word units from the WFST and based on the current candidate word represented in the user-specific language model FST.

35. The method of claim 31, further comprising: prior to receiving the speech input: receiving user-specific data; and generating the user-specific language model FST using the user-specific data.
36. An electronic device comprising: one or more processors; and memory having instructions stored thereon, the instructions, when executed by the one or more processors, cause the one or more processors to: receive speech input; traverse, based on the speech input, a sequence of states and arcs of a weighted finite state transducer (WFST), wherein: the sequence of states and arcs represents one or more history candidate words and a non-terminal class; and a first probability of the non-terminal class given the one or more history candidate words is determined by traversing the sequence of states and arcs of the WFST; traverse a negating finite state transducer (FST), wherein traversing the negating FST negates the first probability of the non-terminal class given the one or more history candidate words; compose a virtual FST using a neural network language model and based on the sequence of states and arcs of the WFST, wherein one or more virtual states of the virtual FST represent a current candidate word corresponding to the non-terminal class; traverse the one or more virtual states of the virtual FST, wherein a probability of the current candidate word given the one or more history candidate words and the non-terminal class is determined by traversing the one or more virtual states of the virtual FST; determine, based on the probability of the current candidate word given the one or more history candidate words and the non-terminal class, text corresponding to the speech input; based on the determined text, perform one or more tasks to obtain a result; and cause the result to be presented in spoken or visual form.

37. The electronic device of claim 36, wherein the one or more programs further include instructions that when executed by the one or more processors, cause the one or more processors to: determine, using the neural network language model, a second probability of the non-terminal class given the one or more history candidate words, wherein the virtual FST is composed using the second probability of the non-terminal class given the one or more history candidate words.

38. The electronic device of claim 36, wherein the one or more programs further include instructions that when executed by the one or more processors, cause the one or more processors to: traverse the user-specific language model FST, wherein a probability of the current candidate word among a plurality of candidate words represented in the user-specific language model FST is determined by traversing the user-specific language model FST, and wherein the virtual FST is composed using the probability of the current candidate word among the plurality of candidate words represented in the user-specific language model FST.

39. The electronic device of claim 36, wherein the one or more virtual states of the virtual FST are composed using phone-word units from the WFST and based on the current candidate word represented in the user-specific language model FST.

40. The electronic device of claim 36, wherein the one or more programs further include instructions that when executed by the one or more processors, cause the one or more processors to: prior to receiving the speech input: receive user-specific data; and generate the user-specific language model FST using the user-specific data.
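The non-terminal-class claims (14, 31, 36) score in two stages: the WFST and neural LM score the class token given the history, and the user-specific FST scores the concrete word within that class. A minimal sketch of how the two probabilities chain; the "$CONTACT" token and the example numbers are illustrative assumptions.

```python
# Sketch of the two-stage class-based scoring: P(class | history) from the
# neural LM times P(word | class) from the user-specific FST gives the
# probability used to score the concrete word.
def class_word_probability(p_class_given_history: float,
                           p_word_given_class: float) -> float:
    """P(word, class | history) = P(class | history) * P(word | class)."""
    return p_class_given_history * p_word_given_class

# "Call $CONTACT": neural LM gives P($CONTACT | "call") = 0.4, and the
# user-specific FST gives P("Anna" | $CONTACT) = 1/3.
print(class_word_probability(0.4, 1.0 / 3.0))   # ~0.133
```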