Presented herein are systems and methods for processing sound signals for use with electronic speech systems. Sound signals are temporally parsed into frames, and the speech system includes a speech codebook having entries corresponding to frame sequences. The system identifies speech sounds in an audio signal using the speech codebook.
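The temporal parsing step described above (and recited in claims 1 and 23) can be sketched as slicing the waveform into fixed-length, optionally overlapping frames. This is a minimal illustration only; the frame length, hop size, and sampling rate below are assumptions for the example, not values fixed by the patent:

```python
import numpy as np

def parse_into_frames(signal, frame_len=160, hop=80):
    """Temporally parse a sound signal into (possibly overlapping) frames.

    frame_len=160 and hop=80 are illustrative (20 ms frames with 10 ms
    overlap at an assumed 8 kHz sampling rate).
    """
    n = 1 + max(0, len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

# 1 second of synthetic audio at the assumed 8 kHz rate
signal = np.random.default_rng(0).standard_normal(8000)
frames = parse_into_frames(signal)
print(frames.shape)  # (99, 160)
```

Successive frames here overlap by half a frame, matching the overlapping-portions variants in claims 16 and 35; setting `hop = frame_len` instead would produce the temporally adjacent variant of claims 12 and 34.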
Representative Claims
1. A method for processing a signal, comprising the steps of:
receiving an input sound signal including speech and environmental noise;
temporally parsing the input sound signal into input frame sequences of at least three input frames, wherein an input frame represents a segment of a waveform of the input sound signal;
providing a speech codebook including a plurality of entries corresponding to speech spectral trajectories of reference frame sequences that include at least three reference frames,
wherein a reference frame represents a segment of a waveform of a reference sound signal,
wherein the reference frame sequences corresponding to the entries are derived from allowable sequences of at least three reference frames, and
wherein the speech codebook substantially lacks entries corresponding to (1) reference frame sequences that include a single unvoiced frame between a pair of voiced frames, and (2) reference frame sequences that include a single voiced frame between a pair of unvoiced frames;
identifying phones within the speech based on a comparison of an input frame sequence with a plurality of the speech spectral trajectories of reference frame sequences; and
encoding the phones.

2. The method of claim 1, wherein the segment of the waveform represented by an input frame is represented by a spectrum.

3. The method of claim 1, wherein the segment of the waveform represented by a reference frame is represented by a spectrum.

4. The method of claim 1, wherein an input frame includes the segment of the waveform of the input sound signal it represents.

5. The method of claim 1, wherein a reference frame includes the segment of the waveform of the reference sound signal that it represents.

6. The method of claim 1, comprising identifying pitch values of the at least two input frames.

7. The method of claim 6, comprising encoding the identified pitch values.

8. The method of claim 1, comprising:
providing a noise codebook including a plurality of noise codebook entries corresponding to frames of environmental noise;
selecting at least one noise sequence of noise codebook entries; and
identifying phones based on a comparison of at least one of the input frame sequences with the at least one noise sequence.

9. The method of claim 8, wherein the at least one noise sequence comprises a first noise codebook entry and a second noise codebook entry.

10. The method of claim 9, wherein the first noise codebook entry and the second noise codebook entry are the same noise codebook entry.

11. The method of claim 8, wherein selecting comprises:
calculating frame-level discriminant values for the noise codebook entries;
creating a matrix having a plurality of matrix entries including the frame-level discriminant values; and
identifying, in respective columns of the matrix, a matrix entry having the largest frame-level discriminant value.

12. The method of claim 1, wherein the at least two input frames are temporally adjacent portions of the input sound signal.

13. The method of claim 1, comprising determining the set of allowable sequences based on sequences of phones that are formable by the average human vocal tract.

14. The method of claim 1, comprising determining the set of allowable sequences based on sequences of phones that are permissible in a selected language.

15. The method of claim 14, wherein the selected language is English.

16. The method of claim 1, comprising creating the at least two input frames from temporally overlapping portions of the input sound signal.

17. The method of claim 1, comprising creating the reference spectral sequences from frames derived from overlapping portions of a speech signal.

18. The method of claim 1, wherein the parsing comprises parsing the input sound signal into variable length frames.

19. The method of claim 18, wherein at least one of the variable length frames corresponds to a phone.
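The selection step of claim 11 can be sketched as follows: compute a frame-level discriminant value for every noise codebook entry against every input frame, arrange the values in a matrix whose rows index noise codebook entries and whose columns index input frames, and pick the largest entry in each column. The discriminant function itself is not specified by the claim; the negative spectral distance below is an assumption made for illustration:

```python
import numpy as np

def select_noise_sequence(input_spectra, noise_codebook):
    """Sketch of the selection step of claim 11.

    Assumed discriminant: negative Euclidean distance between spectra
    (the claim leaves the discriminant function unspecified).

    input_spectra  : (num_frames, dim) spectra of the input frames
    noise_codebook : (num_entries, dim) spectra of noise codebook entries
    Returns the per-frame sequence of best-matching noise entries.
    """
    # Matrix of frame-level discriminant values:
    # rows index noise codebook entries, columns index input frames.
    d = -np.linalg.norm(noise_codebook[:, None, :] - input_spectra[None, :, :], axis=2)
    # In each column, identify the entry with the largest discriminant value.
    return d.argmax(axis=0)

rng = np.random.default_rng(1)
codebook = rng.standard_normal((4, 8))                       # 4 noise entries
frames = codebook[[2, 2, 0]] + 0.01 * rng.standard_normal((3, 8))
print(select_noise_sequence(frames, codebook))  # [2 2 0]
```

The column-wise maxima form the "at least one noise sequence" of claim 8; repeated indices in the result illustrate claim 10, where consecutive positions may select the same codebook entry.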
20. The method of claim 18, wherein at least one of the variable length frames corresponds to at least one of a phone and a transition between phones.

21. The method of claim 1, wherein the input sound signal is temporally parsed into frame sequences of one of at least 3 frames, at least 5 frames, at least 7 frames, at least 9 frames, and at least 12 frames.

22. The method of claim 1, wherein encoding the phones comprises encoding the identified phones as a digital signal having a bit rate of less than 2500 bits per second.

23. A device comprising:
a receiver for receiving an input sound signal including speech and environmental noise;
a first processor for temporally parsing the input sound signal into input frame sequences of at least three input frames, wherein an input frame represents a segment of a waveform of the input sound signal;
a first memory for storing a plurality of speech codebook entries corresponding to speech spectral trajectories of reference frame sequences that include at least three reference frames,
wherein a reference frame represents a segment of a waveform of a reference sound signal,
wherein the reference frame sequences corresponding to the entries are derived from allowable sequences of at least three reference frames, and
wherein the speech codebook substantially lacks entries corresponding to (1) reference frame sequences that include a single unvoiced frame between a pair of voiced frames, and (2) reference frame sequences that include a single voiced frame between a pair of unvoiced frames;
a second processor for identifying phones within the speech based on a comparison of an input frame sequence with a plurality of the speech spectral trajectories of reference frame sequences; and
a third processor for encoding the phones.

24. The device of claim 23, wherein at least two of the first processor, the second processor, and the third processor are the same processor.

25. The device of claim 23, wherein the segment of the waveform represented by an input frame is represented by a spectrum.

26. The device of claim 23, wherein the segment of the waveform represented by a reference frame is represented by a spectrum.

27. The device of claim 23, wherein an input frame includes the segment of the waveform of the input sound signal it represents.

28. The device of claim 23, wherein a reference frame includes the segment of the waveform of the reference sound signal that it represents.

29. The device of claim 23, comprising:
a second memory for storing a plurality of noise codebook entries corresponding to spectra of environmental noise;
a fourth processor for selecting at least one noise sequence of noise codebook entries; and
wherein the second processor identifies phones within the speech based on a comparison of the spectra corresponding to a frame sequence with the at least one noise sequence.

30. The device of claim 23, comprising a fourth processor for identifying pitch values of the at least two input frames.

31. The device of claim 23, wherein the allowable sequences are based on sequences of phones predetermined to be formable by the average human vocal tract.

32. The device of claim 23, wherein the allowable sequences are based on sequences of phones predetermined to be permissible in a selected language.

33. The device of claim 32, wherein the selected language is English.

34. The device of claim 23, wherein the first processor creates the at least two input frames from temporally adjacent portions of the input sound signal.

35. The device of claim 23, wherein the first processor creates the at least two input frames from temporally overlapping portions of the input sound signal.

36. The device of claim 23, wherein the reference frame sequences are from reference frames created from overlapping portions of a speech signal.

37. The device of claim 23, wherein the first processor parses the input sound signal into variable length input frames.

38. The device of claim 37, wherein at least one of the variable length input frames corresponds to a phone.

39. The device of claim 37, wherein at least one of the variable length input frames corresponds to at least one of a phone and a transition between phones.

40. The device of claim 23, wherein the first processor temporally parses the input sound signal into input frame sequences of one of at least 3 frames, at least 5 frames, at least 7 frames, at least 9 frames, and at least 12 frames.

41. The device of claim 23, wherein the third processor encodes phones as a digital signal having a bit rate of less than 2500 bits per second.

42. The method of claim 1, wherein non-allowable sequences are reference frame sequences that represent a waveform which is not typical of a speech signal.

43. The method of claim 1, wherein the comparison comprises determining a likelihood that the input frame sequence corresponds to one of the plurality of speech spectral trajectories of reference frame sequences.

44. The method of claim 1, further comprising generating a plurality of noise-corrupted versions of the plurality of the speech spectral trajectories of reference frame sequences using noise entries from a noise codebook, and wherein the comparison comprises comparing the input frame sequence with the noise-corrupted versions of the plurality of the speech spectral trajectories of reference frame sequences.

45. The device of claim 23, wherein the comparison comprises determining a likelihood that the input frame sequence corresponds to one of the plurality of speech spectral trajectories of reference frame sequences.

46.
The device of claim 23, further comprising a fourth processor for generating a plurality of noise-corrupted versions of the plurality of the speech spectral trajectories of reference frame sequences using noise entries from a noise codebook, and wherein the comparison comprises comparing the input frame sequence with the noise-corrupted versions of the plurality of the speech spectral trajectories of reference frame sequences.
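The comparison of claims 44 and 46 — scoring the input frame sequence against noise-corrupted versions of the stored speech spectral trajectories — can be sketched under two assumed models: corruption is additive in the spectral domain, and the likelihood of claims 43 and 45 is stood in for by a negative Euclidean distance. Neither choice is fixed by the claims:

```python
import numpy as np

def identify_best_entry(input_traj, speech_codebook, noise_codebook):
    """Sketch of the comparison in claims 44/46.

    Each speech spectral trajectory is corrupted with each noise entry
    (assumed additive model), and the input trajectory is scored against
    every corrupted version; the negative distance stands in for the
    unspecified likelihood of claims 43/45.
    """
    best = None
    for s_idx, traj in enumerate(speech_codebook):      # traj: (frames, dim)
        for n_idx, noise in enumerate(noise_codebook):  # noise: (dim,)
            corrupted = traj + noise                    # assumed additive corruption
            score = -np.linalg.norm(input_traj - corrupted)
            if best is None or score > best[0]:
                best = (score, s_idx, n_idx)
    return best[1], best[2]

rng = np.random.default_rng(2)
speech = rng.uniform(0.5, 2.0, size=(5, 3, 8))  # 5 entries, 3-frame trajectories
noise = rng.uniform(0.0, 0.5, size=(2, 8))      # 2 noise codebook entries
observed = speech[3] + noise[1] + 0.01 * rng.standard_normal((3, 8))
print(identify_best_entry(observed, speech, noise))  # (3, 1)
```

Matching against pre-corrupted trajectories, rather than denoising the input first, is what lets a single comparison jointly recover both the speech entry and the noise condition.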
Patents Cited in This Patent (54)
Porter, Jack E. (San Diego, CA), Apparatus and methods for the selective addition of noise to templates employed in automatic speech recognition systems.
Safdar M. Asghar; Lin Cong, Matrix quantization with vector quantization error compensation and neural network postprocessing for robust speech recognition.
Furui, Sadaoki; Zhang, Zhipeng; Horikoshi, Tsutomu; Sugimura, Toshiaki, Noise adaptation system of speech model, noise adaptation method, and noise adaptation program for speech recognition.
Goldenthal, William D. (Cambridge, MA); Glass, James R. (Arlington, MA), Segment-based apparatus and method for speech recognition by analyzing multiple speech unit frames and modeling both tem.
Rotola-Pukkila, Jani; Mikkola, Hannu; Vainio, Janne, Wideband speech codec using a higher sampling rate in analysis and synthesis filtering than in excitation searching.