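Country / Type | United States (US) Patent, Granted |
---|---|
International Patent Classification (IPC 7th edition) | |
Application number | US-0208222 (2011-08-11) |
Registration number | US-8706472 (2014-04-22) |
Inventor / Address | |
Applicant / Address | |
Agent / Address | |
Citation information | Times cited: 77; Patents cited: 512 |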
Disambiguating multiple readings in language conversion is disclosed, including: receiving an input data to be converted into a set of characters comprising a symbolic representation of the input data in a target symbolic system; and using a language model that distinguishes between a first reading and a second reading of a character of the target symbolic system to determine a probability that the heteronymous character should be used to represent a corresponding portion of the input data.
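The idea in the abstract can be illustrated with a minimal sketch, assuming a toy bigram language model whose tokens are (character, reading) pairs rather than bare characters. The lexicon, the probabilities, and the names `LEXICON`, `UNIGRAM`, `BIGRAM`, `score`, and `convert` are all hypothetical and chosen for illustration; the patent does not specify this particular model or these values.

```python
import math
from itertools import product

# Hypothetical toy lexicon: pinyin syllable -> candidate (character, reading) tokens.
# The character 长 is heteronymous: it can be read "chang" or "zhang".
LEXICON = {
    "chang": [("长", "chang"), ("常", "chang")],
    "zhang": [("长", "zhang"), ("张", "zhang")],
    "da": [("大", "da")],
    "cheng": [("城", "cheng"), ("成", "cheng")],
}

# Made-up probabilities. Each reading of 长 is a separate token,
# so the two readings get separate statistics.
UNIGRAM = {
    ("长", "chang"): 0.06,
    ("长", "zhang"): 0.02,
    ("常", "chang"): 0.05,
    ("张", "zhang"): 0.05,
}
BIGRAM = {
    (("长", "zhang"), ("大", "da")): 0.30,
    (("张", "zhang"), ("大", "da")): 0.02,
    (("长", "chang"), ("城", "cheng")): 0.40,
    (("常", "chang"), ("城", "cheng")): 0.01,
}
FLOOR = 1e-6  # crude stand-in for smoothing of unseen events


def score(tokens):
    """Log-probability of a candidate string under the toy bigram model."""
    logp = math.log(UNIGRAM.get(tokens[0], FLOOR))
    for prev, cur in zip(tokens, tokens[1:]):
        logp += math.log(BIGRAM.get((prev, cur), FLOOR))
    return logp


def convert(syllables):
    """Enumerate candidate strings and return the highest-scoring one."""
    candidates = product(*(LEXICON[s] for s in syllables))
    best = max(candidates, key=score)
    return "".join(ch for ch, _ in best)


print(convert(["zhang", "da"]))
print(convert(["chang", "cheng"]))
```

In this sketch, 张 has a higher unigram probability than the "zhang" reading of 长, yet the context 大 still favors 长, because the bigram statistics are kept per reading instead of being pooled across both readings of the same character.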
1. A method, comprising: at a device having one or more processors and memory: receiving input data to be converted into a symbolic representation of the input data in a target symbolic system, the symbolic representation comprising a set of characters in the target symbolic system; identifying a first candidate character for the symbolic representation based on a first portion of the input data, and a second candidate character for the symbolic representation based on a second portion of the input data, wherein the first candidate character has at least a first pronunciation and a second pronunciation each applicable to a respective usage context; generating a plurality of candidate character strings, including at least a first candidate string comprising at least the first candidate character and the second candidate character; and converting the input data to a selected one of the plurality of candidate character strings, said converting comprising: determining a respective probability that the first candidate character string is a correct symbolic representation of the input data using a language model that individually accounts for a respective usage probability of the first candidate character in a first usage context comprising the second candidate character in combination with the first pronunciation of the first candidate character, and not the second pronunciation of the first candidate character, and wherein the language model is trained on an annotated corpus that associates the first pronunciation with the first candidate character used in respective contexts comprising the second candidate character.
2. The method of claim 1, wherein the input text comprises pinyin.
3. The method of claim 1, wherein the input text is resolved into one or more monosyllabic groups of characters that are each converted to a respective candidate character in the target symbolic system.
4. The method of claim 1, wherein the target symbolic system includes Chinese characters.
5. The method of claim 1, wherein the language model is trained using a corpus that has been annotated to distinguish between the first pronunciation of the first candidate character and the second pronunciation of the first candidate character.
6. The method of claim 5, wherein for at least one of the first pronunciation and second pronunciation of the first candidate character, a corresponding new symbol or encoded representation thereof is created and added to the annotated corpus.
7. The method of claim 1, further comprising: receiving one or more manual input of annotations to a subset of text associated with a corpus, wherein a manual input of annotation indicates for an instance of a heteronymous character an appropriate pronunciation of that heteronymous character based at least in part on a context associated with the instance, wherein an annotation is associated with a symbol associated with that heteronymous character; and automatically annotating at least a portion of the text associated with the corpus that has not been manually annotated based at least in part on the received one or more manual input of annotations.
8. The method of claim 1, wherein the language model is trained to associate a probability corresponding to the first pronunciation of the first candidate character and a probability corresponding to the second reading pronunciation of the first candidate character.
9. The method of claim 1, wherein the language model is trained to associate a probability corresponding to a first sequence of characters including the first pronunciation of the character and a probability corresponding to a second sequence of characters including the second pronunciation of the character, wherein the first and second sequences each includes two or more characters.
10. A non-transitory computer-readable medium having instructions stored thereon, the instructions, when executed by one or more processors, cause the processors to perform operations comprising: receiving input data to be converted into a symbolic representation of the input data in a target symbolic system, the symbolic representation comprising a set of characters in the target symbolic system; identifying a first candidate character for the symbolic representation based on a first portion of the input data, and a second candidate character for the symbolic representation based on a second portion of the input data, wherein the first candidate character has at least a first pronunciation and a second pronunciation each applicable to a respective usage context; generating a plurality of candidate character strings, including at least a first candidate string comprising at least the first candidate character and the second candidate character; and converting the input data to a selected one of the plurality of candidate character strings, said converting comprising: determining a respective probability that the first candidate character string is a correct symbolic representation of the input data using a language model that individually accounts for a respective usage probability of the first candidate character in a first usage context comprising the second candidate character in combination with the first pronunciation of the first candidate character, and not the second pronunciation of the first candidate character, and wherein the language model is trained on an annotated corpus that associates the first pronunciation with the first candidate character used in respective contexts comprising the second candidate character.
11. The computer-readable medium of claim 10, wherein the input text comprises pinyin.
12. The computer-readable medium of claim 10, wherein the input text is resolved into one or more monosyllabic groups of characters that are each converted to a respective candidate character in the target symbolic system.
13. The computer-readable medium of claim 10, wherein the target symbolic system includes Chinese characters.
14. The computer-readable medium of claim 10, wherein the language model is trained using a corpus that has been annotated to distinguish between the first pronunciation and the second pronunciation of the first candidate character.
15. The computer-readable medium of claim 14, wherein for at least one of the first pronunciation and second pronunciation of the first candidate character, a corresponding new symbol or encoded representation thereof is created and added to the annotated corpus.
16. The computer-readable medium of claim 10, wherein the operations further comprise: receiving one or more manual input of annotations to a subset of text associated with a corpus, wherein a manual input of annotation indicates for an instance of a heteronymous character an appropriate pronunciation of that heteronymous character based at least in part on a context associated with the instance, wherein an annotation is associated with a symbol associated with that heteronymous character; and automatically annotating at least a portion of the text associated with the corpus that has not been manually annotated based at least in part on the received one or more manual input of annotations.
17. The computer-readable medium of claim 10, wherein the language model is trained to associate a probability corresponding to the first pronunciation of the first candidate character and a probability corresponding to the second pronunciation of the first candidate character.
18. The computer-readable medium of claim 10, wherein the language model is trained to associate a probability corresponding to a first sequence of characters including the first pronunciation of the first candidate character and a probability corresponding to a second sequence of characters including the second pronunciation of the first candidate character, wherein the first and second sequences each includes two or more characters.
19. A system, comprising: one or more processors; and memory having instructions stored thereon, the instructions, when executed by one or more processors, cause the processors to perform operations comprising: receiving input data to be converted into a symbolic representation of the input data in a target symbolic system, the symbolic representation comprising a set of characters in the target symbolic system; identifying a first candidate character for the symbolic representation based on a first portion of the input data, and a second candidate character for the symbolic representation based on a second portion of the input data, wherein the first candidate character has at least a first pronunciation and a second pronunciation each applicable to a respective usage context; generating a plurality of candidate character strings, including at least a first candidate string comprising at least the first candidate character and the second candidate character; and converting the input data to a selected one of the plurality of candidate character strings, said converting comprising: determining a respective probability that the first candidate character string is a correct symbolic representation of the input data using a language model that individually accounts for a respective usage probability of the first candidate character in a first usage context comprising the second candidate character in combination with the first pronunciation of the first candidate character, and not the second pronunciation of the first candidate character, and wherein the language model is trained on an annotated corpus that associates the first pronunciation with the first candidate character used in respective contexts comprising the second candidate character.
20. The system of claim 19, wherein the input text comprises pinyin.
21. The system of claim 19, wherein the input text is resolved into one or more monosyllabic groups of characters that are each converted to a respective candidate character in the target symbolic system.
22. The system of claim 19, wherein the target symbolic system includes Chinese characters.
23. The system of claim 19, wherein the language model is trained using a corpus that has been annotated to distinguish between the first pronunciation and the second pronunciation of the first candidate character.
24. The system of claim 23, wherein for at least one of the first pronunciation and second pronunciation of the first candidate character, a corresponding new symbol or encoded representation thereof is created and added to the annotated corpus.
25. The system of claim 19, wherein the operations further comprise: receiving one or more manual input of annotations to a subset of text associated with a corpus, wherein a manual input of annotation indicates for an instance of a heteronymous character an appropriate pronunciation of that heteronymous character based at least in part on a context associated with the instance, wherein an annotation is associated with a symbol associated with that heteronymous character; and automatically annotating at least a portion of the text associated with the corpus that has not been manually annotated based at least in part on the received one or more manual input of annotations.
26. The system of claim 19, wherein the language model is trained to associate a probability corresponding to the first pronunciation of the first candidate character and a probability corresponding to the second pronunciation of the first candidate character.
27. The system of claim 19, wherein the language model is trained to associate a probability corresponding to a first sequence of characters including the first pronunciation of the first candidate character and a probability corresponding to a second sequence of characters including the second pronunciation of the first candidate character, wherein the first and second sequences each includes two or more characters.
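The corpus-annotation claims (creating a separate symbol per pronunciation, then extending a small set of manual annotations to the rest of the corpus) can be sketched roughly as follows. This is only one possible reading, under stated assumptions: the `char#reading` symbol format, the next-character context rule used for automatic annotation, the sample sentences, and the names `HETERONYMS`, `learn_context_readings`, `auto_annotate`, and `train_bigrams` are all hypothetical. The claims do not prescribe how the automatic annotation is performed.

```python
from collections import Counter, defaultdict

HETERONYMS = {"长"}  # hypothetical set of characters with more than one reading

# Manually annotated seed sentences (hypothetical). A heteronym is replaced by a
# new reading-specific symbol of the form "char#reading".
manual = [
    ["他", "长#zhang", "大", "了"],
    ["长#chang", "城", "很", "长#chang"],
]
# Raw, unannotated text from the same corpus (hypothetical).
raw = ["我长大了", "去长城"]


def learn_context_readings(annotated):
    """Map (character, following character) -> Counter of readings seen manually."""
    table = defaultdict(Counter)
    for sent in annotated:
        for i, tok in enumerate(sent):
            if "#" in tok:
                ch, reading = tok.split("#")
                nxt = sent[i + 1].split("#")[0] if i + 1 < len(sent) else "</s>"
                table[(ch, nxt)][reading] += 1
    return table


def auto_annotate(sentence, table):
    """Label heteronyms in raw text with the reading most often seen in that context."""
    out = []
    for i, ch in enumerate(sentence):
        if ch in HETERONYMS:
            nxt = sentence[i + 1] if i + 1 < len(sentence) else "</s>"
            readings = table.get((ch, nxt))
            if readings:  # leave the character unlabeled when the context is unseen
                ch = f"{ch}#{readings.most_common(1)[0][0]}"
        out.append(ch)
    return out


def train_bigrams(sentences):
    """Count bigrams over annotated tokens, so each reading has its own statistics."""
    counts = Counter()
    for sent in sentences:
        counts.update(zip(sent, sent[1:]))
    return counts


table = learn_context_readings(manual)
corpus = manual + [auto_annotate(s, table) for s in raw]
bigrams = train_bigrams(corpus)
print(bigrams[("长#zhang", "大")], bigrams[("长#chang", "城")])
```

Because the two readings are distinct symbols in the training data, counts for 长 followed by 大 accumulate only under the "zhang" symbol, which is what lets a model trained this way assign different probabilities to the two pronunciations in context.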