Camera-assisted noise cancellation and speech recognition
IPC Classification
Country / Type: United States (US) Patent, Granted
International Patent Classification (IPC, 7th edition): G10L-015/08; G10L-015/24
Application number: US-0759907 (filed 2010-04-14)
Registration number: US-8635066 (granted 2014-01-21)
Inventor / Address: Morrison, Andrew R.
Applicant / Address: T-Mobile USA, Inc.
Agent / Address: Lee & Hayes, PLLC
Citation information: cited by 7 patents; cites 27 patents
Abstract
Methods, systems, and articles are described herein for receiving an audio input and a facial image sequence for a period of time, in which the audio input includes speech input from multiple speakers. The audio input is processed based on the received facial image sequence to extract a speech input of a particular speaker.
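The abstract (and claim 2 below) describes filtering the audio stream using the facial image sequence, keeping only portions that coincide with visible speech movement. A minimal sketch of that gating idea, assuming a hypothetical per-frame mouth-motion score synchronized with the audio frames (not the patent's actual implementation):

```python
# Hypothetical sketch: gate a sequence of audio frames by a synchronized
# facial-motion signal, keeping only frames where the face indicates
# speech movement. The motion scores and threshold are assumptions for
# illustration, not values from the patent.

def gate_audio_by_face(audio_frames, mouth_motion, threshold=0.2):
    """Return the audio frames whose synchronized mouth-motion score
    meets the threshold; other frames are treated as silence/noise."""
    assert len(audio_frames) == len(mouth_motion)
    return [a for a, m in zip(audio_frames, mouth_motion) if m >= threshold]

# Frames 0 and 3 show little mouth movement, so they are filtered out.
audio = ["f0", "f1", "f2", "f3"]
motion = [0.05, 0.6, 0.8, 0.1]
print(gate_audio_by_face(audio, motion))  # ['f1', 'f2']
```

In a real system the motion score would come from facial-feature tracking (e.g., lip-landmark displacement) and the gating would operate on overlapping audio windows rather than discrete frames.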
Representative Claims
1. A computer-implemented method, comprising: receiving an audio input and a facial image sequence for a period of time at an electronic device, the audio input including a decipherable portion and an indecipherable portion; converting the decipherable portion of the audio input into a first symbol sequence portion, the converting including: detecting separations between spoken words in the decipherable portion based on the facial image sequence, and processing the decipherable portion into the first symbol sequence portion based in part on the detected separations; processing a portion of the facial image sequence that corresponds temporally to the indecipherable portion of the audio input into a second symbol sequence portion; and integrating the first symbol sequence portion and the second symbol sequence portion in temporal order to form a symbol sequence.

2. The computer-implemented method of claim 1, further comprising: identifying portions of the facial image sequence that indicate a particular speaker is silent; and filtering out portions of the audio input that correspond to the portions of the facial image sequence that indicate the particular speaker is silent.

3. The computer-implemented method of claim 2, wherein the identifying includes identifying portions of the facial image sequence that indicate the particular speaker is silent based on facial features of the particular speaker shown in the portions of the facial image sequence.

4. The computer-implemented method of claim 1, further comprising transmitting the symbol sequence to another electronic device or storing the symbol sequence in a data storage on the electronic device.

5. The computer-implemented method of claim 1, wherein the receiving includes receiving the audio input via a microphone of the electronic device and receiving the facial image sequence via a camera of the electronic device.

6. The computer-implemented method of claim 1, further comprising determining that speech input is initiated when the facial image sequence indicates that a particular speaker begins to utter sounds.

7. The computer-implemented method of claim 6, further comprising determining that the speech input is terminated when a facial image of the particular speaker moves out of the view of a camera of the electronic device or when the facial image sequence indicates that the particular speaker has not uttered sounds for a predetermined period of time.

8. A computer-implemented method, comprising: receiving an audio input and a facial image sequence for a period of time at an electronic device, the audio input including speech inputs from multiple speakers; extracting a speech input of a particular speaker from the audio input based on the received facial image sequence, including selecting portions of the audio input that correspond to speech movement indicated by the facial image sequence, at least one selected portion including a decipherable segment and an indecipherable segment; and processing the extracted speech input into a symbol sequence, the processing including: converting the decipherable segment into a first sub symbol sequence portion, the converting including: detecting separations between spoken words in the decipherable segment based on the facial image sequence, and processing the decipherable segment into the first sub symbol sequence portion based in part on the detected separations; processing a portion of the facial image sequence that corresponds temporally to the indecipherable segment into a second sub symbol sequence portion; and integrating the first sub symbol sequence portion and the second sub symbol sequence portion in temporal order to form one of the corresponding symbol sequence portions.

9. The computer-implemented method of claim 8, further comprising converting the symbol sequence into text for display by the electronic device.

10. The computer-implemented method of claim 8, further comprising matching the symbol sequence to a command that causes the electronic device to perform a function.

11. The computer-implemented method of claim 8, wherein the receiving includes receiving the audio input via a microphone of the electronic device and receiving the facial image sequence via a camera of the electronic device.

12. The computer-implemented method of claim 8, wherein the processing includes: converting each of the selected audio input portions into a corresponding symbol sequence portion; and assembling the symbol sequence portions into a symbol sequence.

13. The computer-implemented method of claim 8, wherein the processing includes: obtaining a first symbol sequence portion and a corresponding audio transformation confidence score for an audio input portion; obtaining a second symbol sequence portion and a corresponding visual transformation confidence score for each facial image sequence that corresponds to the audio input portion; comparing the audio transformation confidence score and the visual transformation confidence score of the audio input portion; selecting the first symbol sequence portion for assembly into the symbol sequence when the audio transformation confidence score is higher than the visual transformation confidence score; and selecting the second symbol sequence portion for assembly into the symbol sequence when the visual transformation confidence score is higher than the audio transformation confidence score.

14. The computer-implemented method of claim 13, further comprising selecting the first symbol sequence portion for assembly into the symbol sequence when the audio transformation confidence score is equal to the visual transformation confidence score and the audio transformation confidence score is equal to or higher than a predetermined audio confidence score threshold.

15. The computer-implemented method of claim 13, further comprising selecting the first symbol sequence portion or the second symbol sequence portion for assembly into the symbol sequence when the audio transformation confidence score is equal to the visual transformation confidence score.

16. The computer-implemented method of claim 8, wherein the indecipherable segment is masked by ambient noise or is an audio input portion in which data is missing.

17. The computer-implemented method of claim 8, further comprising determining that a segment of one of the selected portions is an indecipherable segment when a transformation confidence score of a sub symbol sequence portion obtained from the segment is below a predetermined audio confidence score threshold.

18. The computer-implemented method of claim 8, further comprising determining that the audio input is initiated when the facial image sequence indicates that a speaker begins to utter sounds.

19. The computer-implemented method of claim 8, further comprising determining that the audio input is terminated when a facial image of a speaker moves out of the view of a camera of the electronic device or when the facial image sequence indicates that the speaker has not uttered sounds for a predetermined period of time.

20. The computer-implemented method of claim 8, wherein the audio input contains one or more phonemes and the facial image sequence includes one or more visemes.

21. An article of manufacture comprising: a storage medium; and computer-readable programming instructions stored on the storage medium and configured to program a computing device to perform operations including: receiving an audio input and a facial image sequence for a period of time at an electronic device, wherein the audio input includes a decipherable portion and an indecipherable portion; converting the decipherable portion of the audio input into a first symbol sequence portion, the converting including: detecting separations between spoken words in the decipherable portion based on the facial image sequence, and processing the decipherable portion into the first symbol sequence portion based in part on the detected separations; processing a portion of the facial image sequence that corresponds temporally to the indecipherable portion of the audio input into a second symbol sequence portion; and integrating the first symbol sequence portion and the second symbol sequence portion in temporal order to form a symbol sequence.

22. The article of claim 21, wherein the operations further include converting the symbol sequence into text for display by the electronic device.

23. The article of claim 21, wherein the operations further include determining that a portion of the audio input is an indecipherable portion when a transformation confidence score of a symbol sequence obtained from the portion is below a predetermined audio confidence score threshold.

24. The article of claim 21, wherein the operations further include determining that the audio input is initiated when the facial image sequences indicate that a speaker begins to utter sounds, and determining that the audio input is terminated when a facial image of a speaker moves out of the view of a camera of the electronic device or when the facial image sequence indicates that the speaker has not uttered sounds for a predetermined period of time.

25. The article of claim 21, wherein the receiving includes receiving the audio input via a microphone of an electronic device and receiving the facial image sequence via a camera of the electronic device.

26. A device comprising: a microphone to receive an audio input from an environment; a camera to receive a plurality of facial image sequences, the camera configured to automatically track a face of a speaker associated with the facial image sequences to maintain its view of the face; a processor; a memory that stores a plurality of modules that comprise: a visual interpretation module to process a portion of a facial image sequence into a symbol sequence, wherein the facial image sequence corresponds temporally to an indecipherable portion of an audio input; a speech recognition module to convert a decipherable portion of the audio input into another symbol sequence, and to integrate the symbol sequences in temporal order to form an integrated symbol sequence; and a command module to cause the device to perform a function in response at least to the symbol sequence.

27. The device of claim 26, wherein the speech recognition module is to further convert the symbol sequence into text for display on the device.

28. The device of claim 26, wherein the command module is to further cause the device to perform a function in response to the integrated symbol sequence.
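Claims 13-15 describe a per-segment fusion rule: compare the audio and visual transformation confidence scores, take the audio-derived symbols when audio wins, the viseme-derived symbols when visual wins, and (per claims 14-15) fall back to the audio-derived symbols on a tie. A minimal sketch of that selection logic, with invented segment and symbol values purely for illustration:

```python
# Hypothetical sketch of the confidence-based fusion in claims 13-15.
# Each segment carries audio-derived symbols with an audio confidence
# score and viseme-derived symbols with a visual confidence score; the
# higher-confidence symbols are selected, audio winning ties.

def fuse_segments(segments):
    """segments: list of (audio_syms, audio_conf, visual_syms, visual_conf).
    Returns the integrated symbol sequence in temporal order."""
    out = []
    for a_syms, a_conf, v_syms, v_conf in segments:
        # Visual symbols are chosen only on a strictly higher score,
        # so ties default to the audio-derived symbols (claims 14-15).
        out.extend(v_syms if v_conf > a_conf else a_syms)
    return out

segs = [
    (["HH", "EH"], 0.9, ["HH", "AH"], 0.4),  # clean audio: audio wins
    (["??"], 0.1, ["L", "OW"], 0.7),         # noisy audio: visual wins
]
print(fuse_segments(segs))  # ['HH', 'EH', 'L', 'OW']
```

The symbol values here use ARPAbet-style phoneme labels only as placeholders; the patent speaks generically of "symbol sequences" built from phonemes (audio) and visemes (facial images).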
Patents cited by this patent (27)
Asseily, Alexander; Einaudi, Andrew E., Acoustic vibration sensor.
Brown, Deborah W.; Goldberg, Randy G.; Rosinski, Richard R.; Wetzel, William R., Distributed recognition system having multiple prompt-specific and response-specific speech recognizers.
Petajan, Eric D. (25 Cypress St., Millburn, NJ 07041), Electronic facial tracking and detection system and method and apparatus for automated speech recognition.
Prasad, K. Venkatesh (Cupertino, CA); Stork, David G. (Stanford, CA), Facial feature extraction method and apparatus for a neural network acoustic and visual speech recognition system.
Waletzky, Jeremy P. (5039 Lowell St., Washington, DC 20016); Wilk, Peter J. (185 W. End Ave., New York, NY 10023), Medical testing device and associated method.
Basu, Sankar; de Cuetos, Philippe Christian; Maes, Stephane Herman; Neti, Chalapathy Venkata; Senior, Andrew William, Method and apparatus for audio-visual speech detection and recognition.
Basu, Sankar; de Cuetos, Philippe Christian; Maes, Stephane Herman; Neti, Chalapathy Venkata; Senior, Andrew William, Methods and apparatus for audio-visual speech detection and recognition.
Stork, David G. (Stanford, CA); Wolff, Gregory J. (Mountain View, CA); Levine, Earl I. (Dallas, TX), Neural network acoustic and visual speech recognition system.
Stork, David G. (Stanford, CA); Wolff, Gregory J. (Mountain View, CA), Neural network acoustic and visual speech recognition system training method and apparatus.
Bennett, Steven M.; Anderson, Andrew V., Selecting one of multiple speech recognizers in a system based on performance predictions resulting from experience.
Hein, Cheryl M.; Lee, Craig A.; Howard, Michael D.; Lacker, Tamara; Daily, Michael J., System for electronically-mediated collaboration including eye-contact collaboratory.