Camera-assisted noise cancellation and speech recognition
IPC Classification
Country / Type: United States (US) Patent, Granted
International Patent Classification (IPC, 7th edition): G10L-015/08; G10L-015/24
Application number: US-0759907 (filed 2010-04-14)
Registration number: US-8635066 (granted 2014-01-21)
Inventor / Address: Morrison, Andrew R.
Applicant / Address: T-Mobile USA, Inc.
Agent / Address: Lee & Hayes, PLLC
Citation information: cited by 7 patents; cites 27 patents
Abstract
Methods, systems, and articles are described herein for receiving an audio input and a facial image sequence for a period of time, in which the audio input includes speech input from multiple speakers. The audio input is processed based on the received facial image sequence to extract a speech input of a particular speaker.
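The abstract (and claim 2 below) describes filtering the audio stream using the facial image sequence, keeping only portions that coincide with visible speech movement. A minimal sketch of that gating idea, assuming a hypothetical per-frame mouth-motion score synchronized with the audio frames (not the patent's actual implementation):

```python
# Hypothetical sketch: gate a sequence of audio frames by a synchronized
# facial-motion signal, keeping only frames where the face indicates
# speech movement. The motion scores and threshold are assumptions for
# illustration, not values from the patent.

def gate_audio_by_face(audio_frames, mouth_motion, threshold=0.2):
    """Return the audio frames whose synchronized mouth-motion score
    meets the threshold; other frames are treated as silence/noise."""
    assert len(audio_frames) == len(mouth_motion)
    return [a for a, m in zip(audio_frames, mouth_motion) if m >= threshold]

# Frames 0 and 3 show little mouth movement, so they are filtered out.
audio = ["f0", "f1", "f2", "f3"]
motion = [0.05, 0.6, 0.8, 0.1]
print(gate_audio_by_face(audio, motion))  # ['f1', 'f2']
```

In a real system the motion score would come from facial-feature tracking (e.g., lip-landmark displacement) and the gating would operate on overlapping audio windows rather than discrete frames.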
Representative Claims
1. A computer-implemented method, comprising: receiving an audio input and a facial image sequence for a period of time at an electronic device, the audio input including a decipherable portion and an indecipherable portion; converting the decipherable portion of the audio input into a first symbol sequence portion, the converting including: detecting separations between spoken words in the decipherable portion based on the facial image sequence, and processing the decipherable portion into the first symbol sequence portion based in part on the detected separations; processing a portion of the facial image sequence that corresponds temporally to the indecipherable portion of the audio input into a second symbol sequence portion; and integrating the first symbol sequence portion and the second symbol sequence portion in temporal order to form a symbol sequence.

2. The computer-implemented method of claim 1, further comprising: identifying portions of the facial image sequence that indicate a particular speaker is silent; and filtering out portions of the audio input that correspond to the portions of the facial image sequence that indicate the particular speaker is silent.

3. The computer-implemented method of claim 2, wherein the identifying includes identifying portions of the facial image sequence that indicate the particular speaker is silent based on facial features of the particular speaker shown in the portions of the facial image sequence.

4. The computer-implemented method of claim 1, further comprising transmitting the symbol sequence to another electronic device or storing the symbol sequence in a data storage on the electronic device.

5. The computer-implemented method of claim 1, wherein the receiving includes receiving the audio input via a microphone of the electronic device and receiving the facial image sequence via a camera of the electronic device.

6. The computer-implemented method of claim 1, further comprising determining that speech input is initiated when the facial image sequence indicates that a particular speaker begins to utter sounds.

7. The computer-implemented method of claim 6, further comprising determining that the speech input is terminated when a facial image of the particular speaker moves out of the view of a camera of the electronic device or when the facial image sequence indicates that the particular speaker has not uttered sounds for a predetermined period of time.

8. A computer-implemented method, comprising: receiving an audio input and a facial image sequence for a period of time at an electronic device, the audio input including speech inputs from multiple speakers; extracting a speech input of a particular speaker from the audio input based on the received facial image sequence, including selecting portions of the audio input that correspond to speech movement indicated by the facial image sequence, at least one selected portion including a decipherable segment and an indecipherable segment; and processing the extracted speech input into a symbol sequence, the processing including: converting the decipherable segment into a first sub symbol sequence portion, the converting including: detecting separations between spoken words in the decipherable segment based on the facial image sequence, and processing the decipherable segment into the first sub symbol sequence portion based in part on the detected separations; processing a portion of the facial image sequence that corresponds temporally to the indecipherable segment into a second sub symbol sequence portion; and integrating the first sub symbol sequence portion and the second sub symbol sequence portion in temporal order to form one of the corresponding symbol sequence portions.

9. The computer-implemented method of claim 8, further comprising converting the symbol sequence into text for display by the electronic device.

10. The computer-implemented method of claim 8, further comprising matching the symbol sequence to a command that causes the electronic device to perform a function.

11. The computer-implemented method of claim 8, wherein the receiving includes receiving the audio input via a microphone of the electronic device and receiving the facial image sequence via a camera of the electronic device.

12. The computer-implemented method of claim 8, wherein the processing includes: converting each of the selected audio input portions into a corresponding symbol sequence portion; and assembling the symbol sequence portions into a symbol sequence.

13. The computer-implemented method of claim 8, wherein the processing includes: obtaining a first symbol sequence portion and a corresponding audio transformation confidence score for an audio input portion; obtaining a second symbol sequence portion and a corresponding visual transformation confidence score for each facial image sequence that corresponds to the audio input portion; comparing the audio transformation confidence score and the visual transformation confidence score of the audio input portion; selecting the first symbol sequence portion for assembly into the symbol sequence when the audio transformation confidence score is higher than the visual transformation confidence score; and selecting the second symbol sequence portion for assembly into the symbol sequence when the visual transformation confidence score is higher than the audio transformation confidence score.

14. The computer-implemented method of claim 13, further comprising selecting the first symbol sequence portion for assembly into the symbol sequence when the audio transformation confidence score is equal to the visual transformation confidence score and the audio transformation confidence score is equal to or higher than a predetermined audio confidence score threshold.

15. The computer-implemented method of claim 13, further comprising selecting the first symbol sequence portion or the second symbol sequence portion for assembly into the symbol sequence when the audio transformation confidence score is equal to the visual transformation confidence score.

16. The computer-implemented method of claim 8, wherein the indecipherable segment is masked by ambient noise or is an audio input portion in which data is missing.

17. The computer-implemented method of claim 8, further comprising determining that a segment of one of the selected portions is an indecipherable segment when a transformation confidence score of a sub symbol sequence portion obtained from the segment is below a predetermined audio confidence score threshold.

18. The computer-implemented method of claim 8, further comprising determining that the audio input is initiated when the facial image sequence indicates that a speaker begins to utter sounds.

19. The computer-implemented method of claim 8, further comprising determining that the audio input is terminated when a facial image of a speaker moves out of the view of a camera of the electronic device or when the facial image sequence indicates that the speaker has not uttered sounds for a predetermined period of time.

20. The computer-implemented method of claim 8, wherein the audio input contains one or more phonemes and the facial image sequence includes one or more visemes.

21. An article of manufacture comprising: a storage medium; and computer-readable programming instructions stored on the storage medium and configured to program a computing device to perform operations including: receiving an audio input and a facial image sequence for a period of time at an electronic device, wherein the audio input includes a decipherable portion and an indecipherable portion; converting the decipherable portion of the audio input into a first symbol sequence portion, the converting including: detecting separations between spoken words in the decipherable portion based on the facial image sequence, and processing the decipherable portion into the first symbol sequence portion based in part on the detected separations; processing a portion of the facial image sequence that corresponds temporally to the indecipherable portion of the audio input into a second symbol sequence portion; and integrating the first symbol sequence portion and the second symbol sequence portion in temporal order to form a symbol sequence.

22. The article of claim 21, wherein the operations further include converting the symbol sequence into text for display by the electronic device.

23. The article of claim 21, wherein the operations further include determining that a portion of the audio input is an indecipherable portion when a transformation confidence score of a symbol sequence obtained from the portion is below a predetermined audio confidence score threshold.

24. The article of claim 21, wherein the operations further include determining that the audio input is initiated when the facial image sequences indicate that a speaker begins to utter sounds, and determining that the audio input is terminated when a facial image of a speaker moves out of the view of a camera of the electronic device or when the facial image sequence indicates that the speaker has not uttered sounds for a predetermined period of time.

25. The article of claim 21, wherein the receiving includes receiving the audio input via a microphone of an electronic device and receiving the facial image sequence via a camera of the electronic device.

26. A device comprising: a microphone to receive an audio input from an environment; a camera to receive a plurality of facial image sequences, the camera configured to automatically track a face of a speaker associated with the facial image sequences to maintain its view of the face; a processor; a memory that stores a plurality of modules that comprise: a visual interpretation module to process a portion of a facial image sequence into a symbol sequence, wherein the facial image sequence corresponds temporally to an indecipherable portion of an audio input; a speech recognition module to convert a decipherable portion of the audio input into another symbol sequence, and to integrate the symbol sequences in temporal order to form an integrated symbol sequence; and a command module to cause the device to perform a function in response at least to the symbol sequence.

27. The device of claim 26, wherein the speech recognition module is to further convert the symbol sequence into text for display on the device.

28. The device of claim 26, wherein the command module is to further cause the device to perform a function in response to the integrated symbol sequence.
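Claims 13-15 describe a per-segment fusion rule: compare the audio and visual transformation confidence scores, take the audio-derived symbols when audio wins, the viseme-derived symbols when visual wins, and (per claims 14-15) fall back to the audio-derived symbols on a tie. A minimal sketch of that selection logic, with invented segment and symbol values purely for illustration:

```python
# Hypothetical sketch of the confidence-based fusion in claims 13-15.
# Each segment carries audio-derived symbols with an audio confidence
# score and viseme-derived symbols with a visual confidence score; the
# higher-confidence symbols are selected, audio winning ties.

def fuse_segments(segments):
    """segments: list of (audio_syms, audio_conf, visual_syms, visual_conf).
    Returns the integrated symbol sequence in temporal order."""
    out = []
    for a_syms, a_conf, v_syms, v_conf in segments:
        # Visual symbols are chosen only on a strictly higher score,
        # so ties default to the audio-derived symbols (claims 14-15).
        out.extend(v_syms if v_conf > a_conf else a_syms)
    return out

segs = [
    (["HH", "EH"], 0.9, ["HH", "AH"], 0.4),  # clean audio: audio wins
    (["??"], 0.1, ["L", "OW"], 0.7),         # noisy audio: visual wins
]
print(fuse_segments(segs))  # ['HH', 'EH', 'L', 'OW']
```

The symbol values here use ARPAbet-style phoneme labels only as placeholders; the patent speaks generically of "symbol sequences" built from phonemes (audio) and visemes (facial images).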
Patents cited by this patent (27)
Asseily, Alexander; Einaudi, Andrew E., Acoustic vibration sensor.
Brown, Deborah W.; Goldberg, Randy G.; Rosinski, Richard R.; Wetzel, William R., Distributed recognition system having multiple prompt-specific and response-specific speech recognizers.
Petajan, Eric D. (25 Cypress St., Millburn, NJ 07041), Electronic facial tracking and detection system and method and apparatus for automated speech recognition.
Prasad, K. Venkatesh (Cupertino, CA); Stork, David G. (Stanford, CA), Facial feature extraction method and apparatus for a neural network acoustic and visual speech recognition system.
Waletzky, Jeremy P. (5039 Lowell St., Washington, DC 20016); Wilk, Peter J. (185 W. End Ave., New York, NY 10023), Medical testing device and associated method.
Basu, Sankar; de Cuetos, Philippe Christian; Maes, Stephane Herman; Neti, Chalapathy Venkata; Senior, Andrew William, Method and apparatus for audio-visual speech detection and recognition.
Basu, Sankar; de Cuetos, Philippe Christian; Maes, Stephane Herman; Neti, Chalapathy Venkata; Senior, Andrew William, Methods and apparatus for audio-visual speech detection and recognition.
Stork, David G. (Stanford, CA); Wolff, Gregory J. (Mountain View, CA); Levine, Earl I. (Dallas, TX), Neural network acoustic and visual speech recognition system.
Stork, David G. (Stanford, CA); Wolff, Gregory J. (Mountain View, CA), Neural network acoustic and visual speech recognition system training method and apparatus.
Bennett, Steven M.; Anderson, Andrew V., Selecting one of multiple speech recognizers in a system based on performance predictions resulting from experience.
Hein, Cheryl M.; Lee, Craig A.; Howard, Michael D.; Lacker, Tamara; Daily, Michael J., System for electronically-mediated collaboration including eye-contact collaboratory.