Techniques for enhancing an acoustic echo canceller based on visual cues are described herein. The techniques include changing adaptation of a filter of the acoustic echo canceller, calibrating the filter, or reducing background noise from an audio signal processed by the acoustic echo canceller. The changing, calibrating, and reducing are responsive to visual cues that describe acoustic characteristics of a location of a device that includes the acoustic echo canceller. Such visual cues may indicate that no human being is present at the location, that some subject(s) are engaged in speaking or sound generating activities, or that motion associated with an echo path change has occurred at the location.
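As a rough illustration of how the three kinds of visual cue could drive the three responses, the sketch below pairs each cue with the action that the claims in this record associate with it. The enum names, the `CUE_ACTIONS` table, and the `actions_for` helper are hypothetical and not taken from the patent; the pairing itself is an assumption read from the claim language (calibration and noise learning when nobody is present, halting or slowing adaptation while someone speaks, accelerating adaptation after an echo path change).

```python
from enum import Enum, auto


class VisualCue(Enum):
    """Cue types named in the abstract (enum names are illustrative)."""
    NO_PERSON_PRESENT = auto()
    SUBJECT_SPEAKING = auto()
    ECHO_PATH_MOTION = auto()


class AecAction(Enum):
    """Responses named in the abstract and claims (enum names are illustrative)."""
    CALIBRATE_FILTER = auto()
    LEARN_BACKGROUND_NOISE = auto()
    HALT_OR_SLOW_ADAPTATION = auto()
    ACCELERATE_ADAPTATION = auto()


# Hypothetical pairing of cues to responses, assumed from the claim language.
CUE_ACTIONS = {
    VisualCue.NO_PERSON_PRESENT: (
        AecAction.CALIBRATE_FILTER,
        AecAction.LEARN_BACKGROUND_NOISE,
    ),
    VisualCue.SUBJECT_SPEAKING: (AecAction.HALT_OR_SLOW_ADAPTATION,),
    VisualCue.ECHO_PATH_MOTION: (AecAction.ACCELERATE_ADAPTATION,),
}


def actions_for(cues):
    """Return the actions for the observed cues, in order, without duplicates."""
    actions = []
    for cue in cues:
        for action in CUE_ACTIONS[cue]:
            if action not in actions:
                actions.append(action)
    return actions
```

A table lookup keeps the cue-to-action policy in one place, so a different pairing can be tried without touching the dispatch logic.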
Representative claim
1. A computer-implemented method comprising: ascertaining, from one or more images of a location, that a person at the location is speaking; detecting, by at least one of a double-talk detector of an acoustic echo processor or by a voice activity detector of the acoustic echo processor, that an audio signal associated with a voice is generated by a microphone at the location; determining, by the acoustic echo processor, a confidence score indicating a likelihood that the audio signal is associated with the person at the location; adjusting, by the acoustic echo processor, the confidence score based at least in part on the one or more images depicting the person at the location engaged in speaking; determining that the confidence score exceeds a threshold; changing, by the acoustic echo processor, adaptation of a filter of the acoustic echo processor based at least in part on the confidence score exceeding the threshold and the detecting the audio signal associated with the voice; and removing, at least in part, background noise from the audio signal based at least in part on the confidence score exceeding the threshold, wherein an amount of the background noise removed from the audio signal is based at least in part on a known echo path of the location.
2. The method of claim 1, wherein the detecting that the audio signal associated with the voice into the microphone is performed by the double-talk detector and the method further comprises: receiving, by the double-talk detector, an indication that the one or more images of the location depict the person at the location engaged in speaking, and adjusting the confidence score based on the indication.
3. The method of claim 1, wherein the detecting that the audio signal associated with the voice into the microphone is performed by the voice activity detector and the method further comprises: receiving, by the voice activity detector, an indication that the one or more images of the location depict the person at the location engaged in speaking, and adjusting the confidence score based on the indication.
4. The method of claim 1, wherein the method further comprises: determining, based at least in part on the one or more images, that an item at the location has changed position; determining that the change in position of the item is associated with a corresponding change in the known echo path; and accelerating the adaptation of the filter based at least in part on determining that the change in position of the item is associated with the corresponding change in the known echo path.
5. The method of claim 1, wherein the ascertaining comprises determining if the one or more images of the location show movement of lips of the person in a specified time period.
6. The method of claim 1, wherein the method further comprises: determining that the location does not include people; capturing audio at the location with the microphone based at least in part on determining that the location does not include people; and determining the known echo path based at least in part on the audio captured at the location.
7. The method of claim 1, wherein the changing comprises halting or slowing the adaptation of the filter based at least in part on a determination that the confidence score exceeds the threshold.
8. The method of claim 1, wherein the method further comprises: determining that the audio signal associated with the voice is no longer detected by the microphone; and resuming the adaptation of the filter.
9. The method of claim 1, wherein the method further comprises removing, at least in part, an acoustic echo from the audio signal, wherein an amount of the acoustic echo removed from the audio signal is based at least in part on the known echo path.
10. One or more non-transitory computer-readable media having computer-executable instructions stored thereon and configured to program a computing device to perform operations comprising: ascertaining, from one or more images of a location, that a person at the location is speaking; capturing an audio signal by a microphone at the location; detecting that the audio signal is associated with a human voice; determining a first confidence score based at least in part on a first indication that the human voice is associated with the person at the location; determining a second confidence score based at least in part on a second indication that the one or more images depict the person at the location engaged in speaking, the second confidence score being greater than the first confidence score; changing adaptation of a filter of an acoustic echo processor based at least in part on at least one of the first confidence score or the second confidence score and the audio signal; and removing, at least in part, background noise from the audio signal based at least in part on the at least one of the first confidence score or the second confidence score, wherein an amount of the background noise removed from the audio signal is based at least in part on a known echo path of the location.
11. The non-transitory computer-readable media of claim 10, wherein the detecting that the audio signal is the human voice is performed by at least one of a double-talk detector of the acoustic echo processor or by a voice activity detector of the acoustic echo processor.
12. The non-transitory computer-readable media of claim 11, wherein the detecting that the audio signal is the human voice is performed by the double-talk detector and the operations further comprise: receiving, by the double-talk detector, an indication that the one or more images of the location depict the person at the location engaged in speaking, and determining the second confidence score based at least in part on the indication.
13. The non-transitory computer-readable media of claim 11, wherein the detecting that the audio signal is the human voice is performed by the voice activity detector and the operations further comprise: receiving, by the voice activity detector, an indication that the one or more images of the location depict the person at the location engaged in speaking, and adjusting based on the indication.
14. The non-transitory computer-readable media of claim 10, wherein the operations further comprise receiving an indication that the person at the location is engaged in speaking, the receiving performed by the acoustic echo processor.
15. The non-transitory computer-readable media of claim 10, wherein the ascertaining comprises determining whether the one or more images of the location show movement of lips of the person in a specified time period.
16. The non-transitory computer-readable media of claim 10, wherein the changing comprises halting or slowing the adaptation of the filter based at least in part on a determination that the first confidence score or the second confidence score exceeds a threshold.
17. The non-transitory computer-readable media of claim 10, the operations further comprising determining that a subsequent audio signal is not associated with a human voice and resuming the adaptation of the filter.
18. The non-transitory computer-readable media of claim 10, the operations further comprising performing acoustic echo processing on the audio signal to remove, at least in part, an acoustic echo.
19. A system comprising: one or more processors; a camera to capture one or more images of a location; a speaker to output audio in the location; a microphone to capture audio in the location; and one or more non-transitory computer-readable media having computer-executable instructions stored thereon and configured to program the one or more processors to perform operations comprising: ascertaining, from the one or more images of the location captured by the camera, that a person at the location is speaking; capturing an audio signal by the microphone in the location; detecting that the audio signal is associated with a voice; determining a confidence score that the audio signal represents the voice that is associated with the person at the location; adjusting the confidence score based at least in part on the one or more images captured by the camera depicting the person at the location engaged in speaking; determining that the confidence score exceeds a threshold; changing adaptation of a filter of an acoustic echo processor based at least in part on the confidence score exceeding the threshold and the audio signal; and removing, at least in part, background noise from the audio signal based at least in part on the confidence score exceeding the threshold, wherein an amount of the background noise removed from the audio signal is based at least in part on a known echo path of the location.
20. The system of claim 19, wherein the detecting that the audio signal is associated with the voice is performed by a double-talk detector of the acoustic echo processor or by a voice activity detector of the acoustic echo processor.
21. The system of claim 20, wherein the detecting that the audio signal is associated with the voice is performed by the double-talk detector and the operations further comprise: receiving, by the double-talk detector, an indication that the one or more images of the location depict the person at the location engaged in speaking, and adjusting the confidence score based on the indication.
22. The system of claim 20, wherein the detecting that the audio signal is associated with the voice is performed by the voice activity detector and the operations further comprise: receiving, by the voice activity detector, an indication that the one or more images of the location depict the person at the location engaged in speaking, and adjusting the confidence score based on the indication.
23. The system of claim 19, wherein the operations further comprise: determining, from one or more subsequent images of the location captured by the camera, that the location does not include people; playing a calibration sound by the speaker; determining one or more echo paths of the location based on audio captured by the microphone at the location while the calibration sound is playing; calibrating the acoustic echo canceller filter based at least in part on the one or more echo paths; and learning background noise characteristics in the audio captured by the microphone at the location based at least in part on no voice activity being detected.
24. The method of claim 1, wherein determining the confidence score indicating the likelihood that the audio signal is associated with the person at the location further comprises: accessing at least one stored speech characteristic associated with a voice profile corresponding to the person; determining a comparison between the at least one stored speech characteristic and a characteristic of the audio signal; and determining the confidence score based at least in part on the comparison.
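The confidence-gated adaptation recited in claims 1, 7, and 8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the claims do not name a filter algorithm, so a normalized least-mean-squares (NLMS) filter stands in for the echo canceller filter, and the additive `boost`, the `threshold` value, and every function and class name here are hypothetical.

```python
import random


def adjust_confidence(audio_confidence, lips_moving, boost=0.3):
    """Raise the audio-only confidence when images show the person speaking.

    The additive boost is an assumed rule; the claims only say the score is
    adjusted based on the images.
    """
    if lips_moving:
        return min(1.0, audio_confidence + boost)
    return audio_confidence


def adaptation_scale(confidence, threshold=0.7, slow_factor=0.0):
    """Return the step-size multiplier: slow_factor (0.0 halts adaptation)
    when the confidence exceeds the threshold, otherwise 1.0."""
    return slow_factor if confidence > threshold else 1.0


class NlmsEchoFilter:
    """Assumed stand-in for the echo canceller's adaptive filter."""

    def __init__(self, taps=4, step=0.5):
        self.step = step
        self.weights = [0.0] * taps
        self.history = [0.0] * taps  # most recent far-end sample first

    def process(self, far_end, mic, step_scale=1.0):
        """Subtract the estimated echo from one mic sample and adapt."""
        self.history = [far_end] + self.history[:-1]
        echo_estimate = sum(w * x for w, x in zip(self.weights, self.history))
        error = mic - echo_estimate
        if step_scale > 0.0:
            energy = sum(x * x for x in self.history) + 1e-8
            gain = self.step * step_scale * error / energy
            self.weights = [w + gain * x
                            for w, x in zip(self.weights, self.history)]
        return error


def run_far_end_only(filt, samples=200, echo_gain=0.5, seed=0):
    """Drive the filter with a synthetic single-tap echo and no near-end talker."""
    rng = random.Random(seed)
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        filt.process(x, echo_gain * x)
    return filt.weights
```

In use, each frame would compute `scale = adaptation_scale(adjust_confidence(c, lips_moving))` and pass it as `step_scale`; once the voice is no longer detected the confidence falls back below the threshold and adaptation resumes at full step size. Freezing the weights during near-end speech is the usual reason for double-talk detection, since adapting on a mic signal that contains the talker would pull the filter away from the echo path.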
Patents cited by this patent (7)
Chujo Kaoru (JPX); Fujino Naoji (JPX). Echo canceller and method of controlling the same.
Stork David G. (Stanford CA); Wolff Gregory J. (Mountain View CA). Neural network acoustic and visual speech recognition system training method and apparatus.