Voice activity decision base on zero crossing rate and spectral sub-band energy
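IPC classification information
Country / Type: United States (US) Patent, Registered
International Patent Classification (IPC 7th edition): G10L-021/02; G10L-015/20; G10L-017/00
Application number: US-0546572 (2012-07-11)
Registration number: US-8554547 (2013-10-08)
Priority information: CN-2009 1 0206840 (2009-10-15)
Inventor / Address: Wang, Zhe
Applicant / Address: Huawei Technologies Co., Ltd.
Agent / Address: Brinks Hofer Gilson & Lione
Citation information
Times cited: 1
Cited patents: 8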
Abstract
A voice activity detection method and apparatus, and an electronic device are provided. The method includes: obtaining a time domain parameter and a frequency domain parameter from an audio frame; obtaining a first distance between the time domain parameter and a long-term-sliding mean of the time domain parameter in a history background noise frame, and obtaining a second distance between the frequency domain parameter and a long-term-sliding mean of the frequency domain parameter in the history background noise frame; and judging whether the audio frame is a foreground voice frame or a background noise frame according to the first distance, the second distance and a set of decision inequalities based on the first distance and the second distance. The above technical solutions enable the judgment criterion to have an adaptive adjustment capability, thus improving the performance of the voice activity detection.
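The judging step described in the abstract can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the record does not give the form of the decision inequalities, so the linear form `a * d1 + b * d2 > c`, the `Inequality` class, and the `adapt_threshold` rule are all hypothetical. What the record does state is that a frame is judged from the two distances against a set of inequalities, that at least one coefficient varies with features of the input signal, and (claim 9) that a frame is foreground voice if any one inequality is satisfied.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Inequality:
    """One decision inequality, assumed here to be a * d1 + b * d2 > c."""
    a: float
    b: float
    c: float

    def holds(self, d1: float, d2: float) -> bool:
        return self.a * d1 + self.b * d2 > self.c


def adapt_threshold(base_c: float, long_term_snr_db: float) -> float:
    """Hypothetical adaptation rule (not from the patent): scale the
    threshold with the long-term SNR, clamped to [0.5, 2.0] times base_c."""
    scale = min(2.0, max(0.5, long_term_snr_db / 20.0))
    return base_c * scale


def is_voice(d1: float, d2: float, inequalities: Iterable[Inequality]) -> bool:
    """Foreground voice if any inequality holds, background noise if none does."""
    return any(ineq.holds(d1, d2) for ineq in inequalities)


# Example: one inequality on the second distance with an adapted threshold,
# and one on the first distance with a fixed threshold.
rules = [
    Inequality(a=0.0, b=1.0, c=adapt_threshold(10.0, long_term_snr_db=20.0)),
    Inequality(a=1.0, b=0.0, c=0.3),
]
print(is_voice(d1=0.05, d2=12.0, inequalities=rules))
```

Keeping the coefficients as data rather than hard-coded constants is what lets a single threshold be recomputed per frame from signal features, which is the adaptive behaviour the abstract attributes to the method.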
Representative claim
1. A voice activity detection method, comprising: obtaining a time domain parameter and a frequency domain parameter from a current audio frame to be detected; obtaining a first distance between the time domain parameter and a long-term sliding mean of the time domain parameter in a history background noise frame; obtaining a second distance between the frequency domain parameter and a long-term sliding mean of the frequency domain parameter in the history background noise frame; and judging whether the current audio frame is a foreground voice frame or a background noise frame according to the first distance, the second distance, and a set of decision inequalities based on the first distance and the second distance, wherein at least one coefficient in the set of decision inequalities is a variable determined in response to features of an input signal.
2. The method according to claim 1, wherein if the audio frame is judged to be the background noise frame, then the long-term sliding mean of the time domain parameter in the history background noise frame is updated according to the time domain parameter of the audio frame and the long-term sliding mean of the frequency domain parameter in the history background noise frame is updated according to the frequency domain parameter of the audio frame.
3. The method according to claim 1, wherein the time domain parameter is a zero-crossing rate, and wherein the first distance between the time domain parameter and the long-term sliding mean of the time domain parameter in the history background noise frame is a Differential Zero-Crossing rate (DZC).
4. The method according to claim 3, wherein if the audio frame is judged to be the background noise frame, then the long-term sliding mean of the zero-crossing rate in the history background noise frame is updated to $\alpha \cdot \overline{\mathrm{ZCR}} + (1-\alpha) \cdot \mathrm{ZCR}$, and wherein $\alpha$ is an update speed control parameter, $\overline{\mathrm{ZCR}}$ is a current value of the long-term sliding mean of the zero-crossing rate in the history background noise frame, and $\mathrm{ZCR}$ is a zero-crossing rate of the audio frame.
5. The method according to claim 1, wherein the frequency domain parameter indicates spectral sub-band energy, and wherein the second distance between the frequency domain parameter and the long-term sliding mean of the frequency domain parameter in the history background noise frame is a signal-to-noise ratio of the audio frame.
6. The method according to claim 5, wherein if the audio frame is judged to be the background noise frame, then the long-term sliding mean of the spectral sub-band energy in the history background noise frame is updated to $\beta \cdot \overline{E_i} + (1-\beta) \cdot E_i$, and wherein $i = 0, \ldots, N$, $N$ is the number of sub-bands minus one, $\beta$ is an update speed control parameter, $\overline{E_i}$ is a current value of the long-term sliding mean of the spectral sub-band energy in the history background noise frame, and $E_i$ is spectral sub-band energy of the audio frame.
7. The method according to claim 5, wherein obtaining the signal-to-noise ratio of the audio frame comprises: obtaining a signal-to-noise ratio of each sub-band according to a ratio of the spectral sub-band energy to the long-term sliding mean of the spectral sub-band energy in the history background noise frame; performing linear processing or nonlinear processing on the signal-to-noise ratio of each sub-band; and summing the signal-to-noise ratio of each sub-band after the processing to obtain the signal-to-noise ratio of the audio frame.
8. The method according to claim 7, wherein performing the linear processing on the signal-to-noise ratio of each sub-band comprises performing linear processing on the signal-to-noise ratio of each sub-band, and wherein performing the nonlinear processing on the signal-to-noise ratio of each sub-band comprises performing either the same nonlinear processing or different nonlinear processing on the signal-to-noise ratio of each sub-band.
9. The method according to claim 1, wherein judging whether the current audio frame is the foreground voice frame or the background noise frame according to the first distance, the second distance, and the set of decision inequalities based on the first distance and the second distance comprises: judging that the current audio frame is the foreground voice frame if the first distance and the second distance satisfy any one decision inequality in the set of decision inequalities; and judging that the audio frame is the background noise frame if the first distance and the second distance satisfy none of decision inequality in the set of decision inequalities.
10. The method according to claim 1, wherein determining the variable according to the voice activity detection operation mode or the features of the input signal comprises determining the variable according to one or more of: the voice activity detection operation point, the signal long-term signal-to-noise ratio, the background noise fluctuation degree, and the background noise level, and wherein the voice activity detection operation mode comprises a voice activity detection operation point, and the features of the input signal comprise one or more of: a signal long-term signal-to-noise ratio, a background noise fluctuation degree, and a background noise level.
11. A voice activity detection apparatus, comprising: a first obtaining module, configured to obtain a time domain parameter and a frequency domain parameter from a current audio frame to be detected; a second obtaining module, configured to obtain a first distance between the time domain parameter and a long-term sliding mean of the time domain parameter in a history background noise frame, and obtain a second distance between the frequency domain parameter and a long-term sliding mean of the frequency domain parameter in the history background noise frame; and a judging module, configured to judge whether the current audio frame to be detected is a foreground voice frame or a background noise frame according to the first distance, the second distance, and a set of decision inequalities based on the first distance and the second distance, wherein at least one coefficient in the set of decision inequalities is a variable determined in response to features of an input signal.
12. The apparatus according to claim 11, wherein the judging module comprises: a decision inequality sub-module, configured to store the set of decision inequalities, and adjust the variable coefficient in the set of decision inequalities according to at least one of: a voice activity detection operation point, a signal long-term signal-to-noise ratio, a background noise fluctuation degree, and a background noise level; and a judging sub-module, configured to judge whether the audio frame is the foreground voice frame or the background noise frame according to the set of decision inequalities stored in the decision inequality sub-module.
13. The apparatus according to claim 11, wherein the second obtaining module comprises: an updating sub-module, configured to store the long-term sliding mean of the time domain parameter in the history background noise frame and the long-term sliding mean of the frequency domain parameter in the history background noise frame, and if the audio frame is judged as the background noise frame by the judging module, update the stored long-term sliding mean of the time domain parameter in the history background noise frame according to the time domain parameter of the audio frame, and update the stored long-term sliding mean of the frequency domain parameter in the history background noise frame according to the frequency domain parameter of the audio frame; and an obtaining sub-module, configured to obtain the first distance and the second distance according to the long-term sliding mean of the time domain parameter in the history background noise frame, wherein the long-term sliding mean of the frequency domain parameter in the history background noise frame stored in the updating sub-module, and wherein the time domain parameter and the frequency domain parameter are obtained by the first obtaining module.
14. The apparatus according to claim 11, wherein the first obtaining module comprises: a zero-crossing rate obtaining sub-module, configured to obtain a zero-crossing rate from the audio frame; and a spectral sub-band energy obtaining sub-module, configured to obtain spectral sub-band energy from the audio frame, wherein the second obtaining module obtains a signal-to-noise ratio of the audio frame, and wherein the signal-to-noise ratio of the audio frame is the distance between the frequency domain parameter and the long-term sliding mean of the frequency domain parameter in the history background noise frame.
15. The apparatus according to claim 14, wherein the second obtaining module or the obtaining sub-module is configured to obtain a signal-to-noise ratio of each sub-band according to a ratio of the spectral sub-band energy to a long-term sliding mean of the spectral sub-band energy in the history background noise frame, performs linear processing or nonlinear processing on the signal-to-noise ratio of each sub-band, and sums the signal-to-noise ratio of each sub-band after the processing to obtain the signal-to-noise ratio of the audio frame.
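The feature and update steps recited in claims 3 to 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the function names are hypothetical, the DZC is taken here as the plain difference between the frame's zero-crossing rate and its long-term mean, and a logarithm is chosen as one possible "nonlinear processing" of each sub-band ratio. The claims do not fix either choice. Sub-band energies are assumed to be computed elsewhere and passed in as a list.

```python
import math
from typing import List, Sequence


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(frame) - 1)


def differential_zcr(zcr: float, zcr_mean: float) -> float:
    """First distance (claim 3), assumed here to be a plain difference."""
    return zcr - zcr_mean


def frame_snr(energies: Sequence[float], noise_means: Sequence[float],
              floor: float = 1e-12) -> float:
    """Second distance (claim 7): per-band ratio, per-band processing, sum.
    The 10*log10 mapping is an assumed example of nonlinear processing."""
    total = 0.0
    for energy, noise in zip(energies, noise_means):
        band_ratio = energy / max(noise, floor)
        total += 10.0 * math.log10(max(band_ratio, floor))
    return total


def sliding_mean_update(mean: float, value: float, rate: float) -> float:
    """Claims 4 and 6: new mean = rate * mean + (1 - rate) * value."""
    return rate * mean + (1.0 - rate) * value


def update_noise_model(zcr_mean: float, band_means: List[float],
                       zcr: float, energies: Sequence[float],
                       alpha: float, beta: float):
    """Apply the update only after a frame was judged background noise."""
    new_zcr_mean = sliding_mean_update(zcr_mean, zcr, alpha)
    new_band_means = [
        sliding_mean_update(m, e, beta) for m, e in zip(band_means, energies)
    ]
    return new_zcr_mean, new_band_means
```

With update rates close to 1, the long-term means move slowly, so a single misclassified frame changes the background model only slightly.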
Patents cited by this patent (8)
Mozer, Forrest S.; Savoie, Robert E.; Teasley, William T., Audio recognition peripheral system.
Bou Ghazale, Sahar E.; Asadi, Ayman O.; Assaleh, Khaled, System and method for a endpoint detection of speech for improved speech recognition in noisy environments.