IPC Classification Information

Country/Type: United States (US) Patent, Granted
International Patent Classification (IPC, 7th ed.):
Application Number: US-0879981 (2010-09-10)
Registration Number: US-8700392 (2014-04-15)
Inventors:
- Hart, Gregory M.
- Freed, Ian W.
- Zehr, Gregg Elliott
- Bezos, Jeffrey P.
Applicant:
- Amazon Technologies, Inc.
Agent: Novak Druce Connolly Bove + Quigg LLP
Citation Information: Cited by 39 patents / Cites 15 patents
Abstract
A user can provide input to a computing device through various combinations of speech, movement, and/or gestures. A computing device can capture audio data and analyze that data to determine any speech information it contains. The computing device can simultaneously capture image or video information, which can be used to assist in analyzing the audio information. For example, the device can use image information to determine when someone is speaking, and the movement of the person's lips can be analyzed to help determine the words that were spoken. Any gestures or other motions can assist in the determination as well. By combining various types of data to determine user input, the accuracy of a process such as speech recognition can be improved, and the need for lengthy application training processes can be avoided.
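To make the combination concrete, here is a minimal sketch of the fusion idea described in the abstract: a speech hypothesis from audio and a lip-movement hypothesis from video are accepted as user input only when they agree and their combined confidence clears a threshold. The function names, the weighting scheme, and the threshold values are illustrative assumptions, not details taken from the patent.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Hypothesis:
        text: str          # candidate user input, e.g. a spoken command
        confidence: float  # recognizer score in [0, 1]

    def fuse_inputs(audio_hyp: Hypothesis, lip_hyp: Hypothesis,
                    audio_weight: float = 0.7,
                    min_confidence: float = 0.6) -> Optional[str]:
        # Reject outright when the two modalities disagree on the words.
        if audio_hyp.text != lip_hyp.text:
            return None
        # Weighted combination of the per-modality confidence scores.
        combined = (audio_weight * audio_hyp.confidence
                    + (1.0 - audio_weight) * lip_hyp.confidence)
        return audio_hyp.text if combined >= min_confidence else None

    # Both modalities suggest "play music"; the combined score passes the bar.
    print(fuse_inputs(Hypothesis("play music", 0.72), Hypothesis("play music", 0.55)))

Because video only needs to corroborate the audio here, a weak lip-reading score can still confirm a strong audio hypothesis, which is the sense in which image data "assists" the speech analysis.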
Representative Claims
1. A method of determining user input to a computing device, comprising: capturing audio data using at least one audio capture element of the computing device; concurrent with capturing audio data, capturing image data of the user using at least one image capture element of the computing device; and using at least one algorithm executing on a processor of the computing device: detecting a presence of speech information contained in the captured audio data; in the captured image data, detecting mouth movement of the user of the computing device at a time during which the speech information was detected; in response to detecting the presence of speech information and detecting mouth movement, identifying at least one user input based at least in part on a combination of the speech information and the mouth movement for a defined period of time, wherein the identifying of the at least one user input includes comparing the mouth movement in the captured image data to one or more word formation models, the one or more word formation models capable of being personalized for the user over a duration of time based at least in part on the speech information and the mouth movement of the user; and providing the user input for processing if a confidence level of the identified user input exceeds a minimum threshold, wherein the confidence level of the identified user input is relative to the combination of the speech information and the mouth movement, and wherein the confidence level of the identified user input is based at least in part on a metric indicating a level of matching between the identified user input and an input term.

2. The method of claim 1, wherein identifying the at least one user input utilizes at least one speech algorithm and at least one image analysis algorithm.

3. The method of claim 1, wherein the defined period of time utilizes audio data and image data captured for a substantially same period of time.

4. A method of determining input to a computing device, comprising: capturing audio data using at least one audio capture element of the computing device; capturing image data using at least one image capture element of the computing device; analyzing at least one of the audio data and the image data to determine whether a person interacting with the computing device is generating speech that is perceptible by the computing device; and if the speech is perceptible by the computing device, analyzing a combination of the audio data and the image data to determine at least a portion of the content of the speech, wherein the analyzed combination of audio data and image data are for a substantially same period of time, and wherein the analyzing includes, in part, comparing at least a portion of the image data to one or more word formation models that are capable of being personalized for the person over a duration of time based at least in part on the speech generated by the person; and if at least a portion of the content corresponds to an input to the computing device, processing the input, wherein the at least the portion corresponds to the input when the at least the portion matches the input at least at a defined confidence level, the defined confidence level being relative to the audio data and the image data.

5. The method of claim 4, wherein analyzing the image data comprises executing at least one algorithm for determining speech formation or gesturing of the person, the speech formation or gesturing used to determine input provided by the person.
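Claim 1's comparison of mouth movement against "word formation models" can be pictured with a toy example: reduce the observed mouth movement to a sequence of viseme (mouth-shape) labels and score it against per-word templates, with the best match ratio serving as the kind of metric that could feed the confidence level. The viseme labels, the templates, and the naive personalization step below are hypothetical illustrations, not the patent's actual models.

    from difflib import SequenceMatcher

    # word -> expected viseme sequence; the entries are invented
    WORD_FORMATION_MODELS = {
        "play": ["p", "l", "ei"],
        "pause": ["p", "ao", "z"],
    }

    def match_mouth_movement(observed, models=WORD_FORMATION_MODELS):
        """Return the best-matching word and a [0, 1] similarity score."""
        best_word, best_score = None, 0.0
        for word, template in models.items():
            score = SequenceMatcher(None, observed, template).ratio()
            if score > best_score:
                best_word, best_score = word, score
        return best_word, best_score

    def personalize(models, word, observed):
        # Crude stand-in for personalizing a model over time: adopt the
        # user's observed articulation as the new template for that word.
        models[word] = list(observed)

    word, score = match_mouth_movement(["p", "l", "ei"])
    print(word, round(score, 2))  # -> play 1.0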
6. The method of claim 4, further comprising: determining a state of the computing device; and determining a sub-dictionary to be used in identifying content of the speech, the sub-dictionary comprising terms at least partially relevant to the determined state and including fewer terms than a dictionary for general speech determination.

7. The method of claim 6, wherein the state of the device depends at least in part on an interface displayed on the device.

8. The method of claim 6, further comprising: determining an approximate location on the computing device at which the person is looking, the approximate location capable of being used to modify a currently selected sub-dictionary based at least in part upon one or more elements at that approximate location.

9. The method of claim 4, wherein a general speech model is able to be used to determine content of the speech data without the person first undertaking an initial training process.

10. The method of claim 4, wherein capturing image data only occurs when the captured audio data meets at least one triggering criterion.

11. The method of claim 10, wherein the at least one triggering criterion includes at least one of a minimum volume threshold or a frequency pattern matching human speech.

12. The method of claim 4, further comprising: irradiating the person with infrared (IR) radiation; and detecting IR reflected from at least one feature of the person in order to determine at least one aspect of the person with respect to the computing device.

13. The method of claim 12, wherein the reflected IR is analyzed to determine at least one of a gaze direction of the person and whether the user is forming speech.

14. The method of claim 4, wherein capturing image data comprises capturing image data corresponding to the person's mouth, and further comprising: monitoring movement of the person's mouth in order to determine speech being formed by the person's mouth.

15. The method of claim 4, wherein the person is a primary user of the computing device or another person within proximity of the computing device.

16. The method of claim 15, further comprising: determining the person speaking out of a plurality of people within a proximity of the computing device.

17. The method of claim 16, further comprising: accepting input only from an identified person authorized to provide input.

18. The method of claim 16, further comprising: determining a context for the speech based at least in part upon an identity of the person providing the speech.

19. The method of claim 15, wherein the computing device includes at least two image capture elements positioned with respect to the computing device so as to be able to capture image data for a primary user of the device or other person within a proximity of the device.

20. The method of claim 15, wherein the audio capture element includes a plurality of microphones so as to be able to determine a location of the primary user, further comprising capturing audio data from a primary user from the determined location and rejecting noise sources from positions other than the location of the primary user.

21. The method of claim 4, further comprising: associating a gesture made by the person concurrent with the determined speech generated, wherein after an initial period of association, the person is able to provide the input based only on the gesture and without the speech.
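Claims 6 through 8 describe narrowing recognition to a state-dependent sub-dictionary. A hypothetical sketch of that idea, assuming invented device states and vocabularies: the recognizer matches against the small vocabulary tied to the current interface and falls back to a general dictionary otherwise.

    # Invented state-specific vocabularies; a real device would derive these
    # from the interface currently displayed (claim 7) or from the on-screen
    # elements the person is looking at (claim 8).
    SUB_DICTIONARIES = {
        "music_player": {"play", "pause", "next", "previous", "volume up"},
        "email_client": {"compose", "reply", "archive", "delete"},
    }
    GENERAL_DICTIONARY = {"home", "back", "help"}.union(*SUB_DICTIONARIES.values())

    def active_vocabulary(device_state):
        # Prefer the smaller state-specific sub-dictionary when one exists;
        # fewer candidate terms means faster and more accurate matching.
        return SUB_DICTIONARIES.get(device_state, GENERAL_DICTIONARY)

    print(sorted(active_vocabulary("music_player")))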
22. The method of claim 4, further comprising: capturing an image of an object; and determining the input for the speech based in part upon an identification of the object.

23. The method of claim 4, further comprising: determining a location of the computing device or the person; and determining the input for the speech based in part upon the determined location.

24. The method of claim 4, wherein the image capture element includes at least one of a digital camera element and an infrared (IR) radiation detector.

25. A computing device, comprising: a processor; a memory device including instructions operable to be executed by the processor to perform a set of actions, enabling the processor to: capture audio data using at least one audio capture element in communication with the computing device; capture image data using at least one image capture element in communication with the computing device; analyze at least one of the audio data and the image data to determine whether a person interacting with the computing device is generating speech that is perceptible by the computing device; and if the person is generating speech, analyze a combination of the audio data and the image data to determine at least a portion of the content of the speech, wherein the analyzing includes, in part, comparing at least a portion of the image data to one or more word formation models that are capable of being personalized for the person over a duration of time based at least in part on the speech generated by the person; and if the content corresponds to input to the computing device, process the input on the computing device, wherein the content corresponds to the input when the content matches the input at least at a defined confidence level, the defined confidence level being relative to the audio data and the image data.

26. The computing device of claim 25, wherein analyzing the image data comprises executing at least one algorithm for determining speech formation or gesturing of the user, the speech formation or gesturing capable of being used to determine input provided by the person.

27. The computing device of claim 25, wherein the instructions when executed further enable the computing device to: determine a state of the computing device; and determine a sub-dictionary to be used in identifying content of the speech, the sub-dictionary comprising terms at least partially relevant to the determined state and including fewer terms than a dictionary for general speech determination.

28. The computing device of claim 25, further comprising: an infrared (IR) emitter for irradiating the person with IR radiation; and an IR detector for detecting IR reflected from at least one feature of the person in order to determine at least one aspect of the person with respect to the computing device.
29. A non-transitory computer-readable storage medium storing processor-executable instructions for controlling a computing device, comprising: program code for accessing audio data; program code for accessing image data; program code for analyzing at least one of the audio data and the image data to determine whether a person interacting with the computing device is generating speech that is perceptible by the computing device; and program code for, if the person is generating speech, analyzing a combination of the audio data and the image data to determine at least a portion of the content of the speech, wherein the analyzing includes, in part, comparing at least a portion of the image data to one or more word formation models that are capable of being personalized for the person over a duration of time based at least in part on the speech generated by the person; and if the content corresponds to an input to the computing device, providing the input for processing, wherein the content corresponds to the input when the content matches the input at least at a defined confidence level, the defined confidence level being relative to the audio data and the image data.

30. The non-transitory computer-readable storage medium of claim 29, wherein analyzing the image data comprises executing at least one algorithm for determining speech formation or gesturing of the user, the speech formation or gesturing capable of being used to determine input provided by the person.

31. The non-transitory computer-readable storage medium of claim 29, further comprising: program code for determining a state of the computing device; and program code for determining a sub-dictionary to be used in identifying content of the speech, the sub-dictionary comprising terms at least partially relevant to the determined state and including fewer terms than a dictionary for general speech determination.

32. The non-transitory computer-readable storage medium of claim 31, further comprising: program code for determining a region of interest on the computing device at which the person is looking; and program code for modifying a currently selected sub-dictionary based in part upon one or more elements of the determined region of interest.

33. The non-transitory computer-readable storage medium of claim 29, wherein the computing device includes at least two image capture elements, and wherein analyzing a combination of the audio data and the image data to determine at least a portion of the content of the speech further comprises: program code for identifying at least one image capture element generating an output that is greater than or equal to a threshold level; and program code for using the output from each identified image capture element to determine at least a portion of the content of the speech.

34. The non-transitory computer-readable storage medium of claim 29, wherein the computing device includes at least two audio capture elements, and wherein analyzing a combination of the audio data and the image data to determine at least a portion of the content of the speech further comprises: program code for identifying at least one audio capture element generating an output that is greater than or equal to a threshold level; and program code for using the output from each identified audio capture element to determine at least a portion of the content of the speech.
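Claims 33 and 34 select, among multiple image or audio capture elements, only those whose output meets a threshold before combining their data. A minimal sketch of that selection step, with invented element names and levels:

    def select_capture_elements(element_outputs, threshold):
        """element_outputs maps an element id to its current output level;
        only elements at or above the threshold contribute to the analysis."""
        return [eid for eid, level in element_outputs.items() if level >= threshold]

    # e.g. three microphones; the rear one is picking up too little signal
    mics = {"mic_front": 0.82, "mic_rear": 0.10, "mic_left": 0.65}
    print(select_capture_elements(mics, threshold=0.5))  # ['mic_front', 'mic_left']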