[Paper] Multi-band Approach to Deep Learning-Based Artificial Stereo Extension

Jeon, Kwang Myung; Park, Su Yeon; Chun, Chan Jun; Park, Nam In; Kim, Hong Kook

doi:10.4218/etrij.17.0116.0773

ETRI Journal, vol. 39, no. 3, 2017, pp. 398-405

Jeon, Kwang Myung (School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology); Park, Su Yeon (School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology); Chun, Chan Jun (School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology); Park, Nam In (Digital Technology and Biometry Division, National Forensic Service); Kim, Hong Kook (School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology)

Abstract

In this paper, an artificial stereo extension method that creates stereophonic sound from a mono sound source is proposed. The proposed method first trains deep neural networks (DNNs) that model the nonlinear relationship between the dominant and residual signals of the stereo channel. In the training stage, the band-wise log spectral magnitude and unwrapped phase of both the dominant and residual signals are utilized to model the nonlinearities of each sub-band through deep architecture. From that point, stereo extension is conducted by estimating the residual signal that corresponds to the input mono channel signal with the trained DNN model in a sub-band domain. The performance of the proposed method was evaluated using a log spectral distortion (LSD) measure and multiple stimuli with a hidden reference and anchor (MUSHRA) test. The results showed that the proposed method provided a lower LSD and higher MUSHRA score than conventional methods that use hidden Markov models and DNN with full-band processing.

AI Summary of the Full Text

* The sentences below were identified automatically by AI, so some of them may be unsuitable; please use them with care.

Problem Definition

The inter-channel intensity difference (IID) and inter-channel phase difference (IPD) relate to sound localization factors, such as the relative position, while the inter-channel coherence (ICC) characterizes the wideness of the auditory image [3]. The aim of this study was to regenerate stereophonic effects for a given monaural sound, as shown in Fig. 1. Assuming that a sound source moves around a dotted circle, as indicated in Fig. 1, the sound localization parameters, such as the IID and IPD, are unobtainable with a single-channel microphone [4]. Therefore, this study focused on reproducing the wideness of the stereophonic effect.

Proposed Method

After the DNN has been trained, the LSF coefficients of the side signals are estimated from the DNN and are converted into LP coefficients. Next, the estimated side signal # is reconstructed using the residual signals for the mid signals and the estimated LP coefficients as
In this paper, a multi-band DNN approach is proposed for extending a mono audio signal into a stereo one. As previously mentioned, the proposed method is intended to model a DNN for each sub-band for stereo extension, whereas [13] used a single DNN over all the sub-bands. To this end, the proposed method represents the stereo channel as a set of the band-wise log-spectral magnitude and the unwrapped phase of mid/side signals, which comprise the dominant and residual portions of the stereo channel [2].
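The two feature types named above can be illustrated with a minimal sketch. It assumes the complex spectrum of one sub-band frame is already available as a list of Python complex numbers; the function names, the natural-log base, and the magnitude floor are illustrative assumptions, not details taken from the paper.

```python
import cmath
import math


def log_magnitude(spectrum, floor=1e-10):
    """Natural-log magnitude of each complex bin (log base is an assumption).

    `floor` is a hypothetical safeguard that avoids log(0) on silent bins.
    """
    return [math.log(max(abs(c), floor)) for c in spectrum]


def unwrap_phase(spectrum):
    """Phase of each bin with 2*pi jumps between neighbouring bins removed."""
    phases = [cmath.phase(c) for c in spectrum]
    unwrapped = phases[:1]
    for p in phases[1:]:
        delta = p - unwrapped[-1]
        # Remove the multiple of 2*pi that brings the step into [-pi, pi].
        delta -= 2.0 * math.pi * round(delta / (2.0 * math.pi))
        unwrapped.append(unwrapped[-1] + delta)
    return unwrapped
```

Unwrapping removes the artificial jumps that occur when the phase is confined to (-pi, pi], which gives a smoother target for a regression model than the wrapped phase.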
In the proposed stereo extension method, each sub-band DNN was trained using feature vectors that were obtained by applying a 64-point FFT to each channel signal after QMF analysis. This 64-dimensional spectral feature vector was spliced across 11 neighboring frames (S = 5 in Section III), and they were used for the input layer of each DNN.
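The frame splicing described above (S = 5 neighbours on each side, 2S + 1 = 11 frames in total) can be sketched as follows. The function name is hypothetical, and padding the edges by repeating the first and last frame is an assumption, since the text does not say how edge frames are handled.

```python
def splice_frames(frames, s=5):
    """Concatenate each frame with its s left and s right neighbours.

    frames: list of equal-length feature vectors (lists of floats).
    Edge frames are padded by repeating the first/last frame; this padding
    rule is an assumption, not something stated in the source.
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        context = []
        for offset in range(-s, s + 1):
            idx = min(max(t + offset, 0), last)  # clamp to a valid frame index
            context.extend(frames[idx])
        spliced.append(context)
    return spliced
```

With 64-dimensional frames and s = 5, each spliced vector has 64 x 11 = 704 elements.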
In this section, the performance of the proposed stereo extension method is evaluated in terms of both objective and subjective quality by measuring the log spectral distortion (LSD) [24] and conducting a multiple stimuli with a hidden reference and anchor (MUSHRA) test [25].
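As a rough illustration of the objective measure, the sketch below computes one common form of LSD: the root-mean-square difference of the dB magnitude spectra within each frame, averaged over frames. This particular formula, the function name, and the magnitude floor are assumptions; the paper's exact definition may differ.

```python
import math


def log_spectral_distortion(ref_mag, est_mag, floor=1e-10):
    """Mean over frames of the RMS dB difference between magnitude spectra.

    ref_mag, est_mag: lists of frames, each a list of non-negative magnitudes.
    `floor` is a hypothetical safeguard against division by zero and log(0).
    """
    per_frame = []
    for ref_frame, est_frame in zip(ref_mag, est_mag):
        squared_sum = 0.0
        for r, e in zip(ref_frame, est_frame):
            diff_db = 20.0 * math.log10(max(r, floor) / max(e, floor))
            squared_sum += diff_db * diff_db
        per_frame.append(math.sqrt(squared_sum / len(ref_frame)))
    return sum(per_frame) / len(per_frame)
```

A lower value means the estimated spectrum is closer to the reference; identical spectra give zero.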
After the sub-band side signals are merged by QMF synthesis, artificial stereo signals are finally obtained by adding or subtracting the estimated side signals to the mono signal. Respective objective and subjective evaluations were conducted to demonstrate the performance of the proposed method. The results of the LSD and MUSHRA evaluations showed that the proposed stereo extension method significantly outperformed the conventional stereo extension methods.
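The final add/subtract step can be sketched as below, assuming the mono input plays the role of the mid signal and that the estimated side signal has already been merged back to the time domain. Which channel receives the sum and the 0.5 scaling in the forward transform are assumed conventions, not details confirmed by the text.

```python
def mid_side_to_stereo(mid, side):
    """Rebuild two channels by adding/subtracting the side signal.

    Assumed convention: left = mid + side, right = mid - side.
    """
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


def stereo_to_mid_side(left, right):
    """Forward transform matching the convention above (0.5 scaling assumed)."""
    mid = [0.5 * (l + r) for l, r in zip(left, right)]
    side = [0.5 * (l - r) for l, r in zip(left, right)]
    return mid, side
```

Under this convention the two functions are inverses of each other, so a perfectly estimated side signal would reproduce the original stereo pair.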
As shown in the figure, the proposed method estimates side-signals x_s,t(n) using multi-band DNNs, which act as a mapping function between the sub-band of the mid and side signals. Similar to the conventional stereo extension methods based on HMM, as well as the DNN with LSF features described in Section II, the proposed method consists of training and stereo extension stages. Each stage is detailed in the following subsections.
1-channel audio playback system were modeled by using a single DNN model. The input feature vector for the DNN in the model was constructed by concatenating all the sub-band spectral features. In other words, the actually trained DNN model was a full-band approach using sub-band features.
In addition, three hidden layers with 256 nodes each were used. The learning rate and number of iterations were set to 0.008 and 100, respectively, when training the DNNs for DNN-LSF, DNN-AU, and the proposed method.
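The hyperparameters quoted above can be collected in a small configuration sketch. The class and function names are hypothetical. The input size of 64 x 11 = 704 assumes the spliced 64-dimensional feature described elsewhere in this summary, and the output size of 64 is purely an assumption, since this extract does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubBandDnnConfig:
    input_dim: int = 64 * 11      # assumption: 64-dim feature spliced over 11 frames
    hidden_layers: int = 3        # from the text
    hidden_nodes: int = 256       # from the text
    output_dim: int = 64          # assumption: not stated in this extract
    learning_rate: float = 0.008  # from the text
    iterations: int = 100         # from the text


def layer_sizes(cfg):
    """Widths of all layers, from input to output."""
    return [cfg.input_dim] + [cfg.hidden_nodes] * cfg.hidden_layers + [cfg.output_dim]


def parameter_count(cfg):
    """Number of weights plus biases in a fully connected network."""
    sizes = layer_sizes(cfg)
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))
```

Under these assumed dimensions, one such network would hold 328,512 parameters, and a multi-band system would hold one network of this kind per sub-band.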
In this paper, a stereo extension method that applies multi-band DNNs was proposed. The method utilizes QMF analysis to train the DNN of each sub-band to estimate a more realistic side signal for the extension. To extend an input mono signal, its sub-band signals are decoded by the DNNs to estimate the corresponding side signal of each sub-band.
Specifically, a 32-channel QMF [15] is applied in both the DNN training and the stereo extension stages. Unlike the conventional DNN-based method, the proposed method trains multiple DNNs, one for each sub-band, to model the band-wise nonlinearity between the mid and side signals. Once the multi-band DNNs are prepared, the log spectral magnitude and unwrapped phase of each side signal band are estimated via feed-forward decoding at the stereo extension stage.
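The difference between the multi-band and full-band arrangements comes down to how features are routed to models. The sketch below shows only that routing: each sub-band's mid features go to that sub-band's own model. The models here are placeholder callables rather than trained DNNs, and every name is hypothetical.

```python
NUM_BANDS = 32  # number of QMF channels given in the text


def estimate_side_bands(mid_bands, band_models):
    """Apply band_models[b] to mid_bands[b] for every sub-band b.

    mid_bands: one feature vector per sub-band.
    band_models: one callable per sub-band mapping mid features to side features.
    """
    if len(mid_bands) != len(band_models):
        raise ValueError("need exactly one model per sub-band")
    return [model(features) for model, features in zip(band_models, mid_bands)]


def make_placeholder_model(gain):
    """Stand-in for a trained sub-band DNN: scales its input by a fixed gain."""
    return lambda features: [gain * v for v in features]
```

A full-band approach would instead concatenate all sub-band features and pass them through a single model, which is the arrangement the multi-band method is contrasted with.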

Data

The speech databases used in training consisted of 20 min of Sound Quality Assessment Material (SQAM) [25], 50 min of the ETRI SWB Korean speech corpus [26], and 20 min of the TSP speech DB [27]. The music databases used in the training consisted of 40 min of SQAM, 20 min of orchestra, 30 min of popular music, and 30 min of audio-form user-created content (UCC). The total 3.

Theory/Model

The performance of the proposed stereo extension method was compared with those of conventional full-band stereo extension methods, including ICC [5], HMM [7], and DNN with LSF features [9]. Moreover, the proposed method was then compared with a multi-band DNN-based audio upmixing method [13]. In addition, to compare the proposed method with a multi-band HMM method, the full-band HMM-based method in [7] was modified into a multi-band HMM-based method.
In addition, the performance of the proposed stereo extension method is first compared with those of conventional full-band stereo extension methods, including ICC [5], HMM [7], and DNN with LSF features (DNN-LSF) [9]. Then, the proposed method is compared with a multi-band DNN-based audio upmixing (DNN-AU) approach [13]. To compare the proposed method with a multi-band HMM method, the full-band HMM-based method in [7] is modified into a multi-band HMM-based method by replacing the sub-band-DNNs with sub-band HMMs, as shown in Fig.
2) was used to transform the time-domain signal into the frequency domain one. These MDCT coefficients were brought to the HMM-based stereo extension method as feature vectors. To train a DNN for DNN-LSF, the LSF feature extraction method was applied to each audio frame, where the order of LSF was set to M = 30.

Performance/Effects

That is, the average MUSHRA score of the multi-band HMM (MB-HMM) was higher than that of the full-band HMM. The proposed multi-band extension method and DNN-AU produced higher average MUSHRA scores than DNN-LSF (that is, the full-band extension method). However, DNN-AU had an average MUSHRA score similar to MB-HMM. A comparison of the proposed method with other sub-band methods, such as MB-HMM and DNN-AU, showed that the proposed method provided significantly higher average MUSHRA scores.
From the results of the objective and subjective evaluations, it is concluded that the proposed multi-band DNN-based stereo extension method can extend mono audio into stereo with a higher quality than conventional methods, including full-band HMM, sub-band HMM, and full-band DNN methods.

Future Work

Thus, the proposed multi-band stereo extension method is motivated by this frequency-dependent similarity and difference between each channel of stereo audio. The expectation is that modeling a DNN for each sub-band will further improve on the performance of the full-band DNN-based stereo extension method.

References (27)

[1] J. Lapierre and C. Faller, "Spatial Audio Processing," Proc. AES Convention, Paris, France, May 20-23, 2006, Preprint 6804.
[2] E. Schuijers et al., "Low Complexity Parametric Stereo Coding," Proc. AES Convention, Berlin, Germany, May 8-11, 2004, Preprint 6073.
[3] H. Purnhagen et al., "Synthetic Ambience in Parametric Stereo Coding," Proc. AES Convention, Berlin, Germany, May 8-11, 2004, Preprint 6074.
[4] C.J. Chun et al., "Real-Time Conversion of Stereo Audio to 5.1 Channel Audio for Providing Realistic Sounds," Int. J. Signal Process. Image Process. Pattern Recogn., vol. 2, no. 4, Dec. 2009, pp. 85-94.
[5] N.I. Park and H.K. Kim, "Artificial Stereo Extension of Speech Based on Inter-Channel Coherence," Adv. Sci. Technol. Lett., vol. 14, no. 1, Aug. 2012, pp. 168-171.
[6] N.I. Park et al., "Artificial Stereo Extension Based on Gaussian Mixture Model," Proc. AES Convention, Rome, Italy, May 4-7, 2013, Preprint 8877.
[7] N.I. Park et al., "Artificial Stereo Extension Based on Hidden Markov Model for the Incorporation of Non-stationary Energy Trajectory," Proc. AES Convention, New York, USA, Oct. 17-20, 2013, Preprint 8980.
[8] G. Hinton et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition," IEEE Signal Process. Mag., vol. 29, no. 6, Nov. 2012, pp. 82-97.
[9] C.J. Chun et al., "Extension of Monaural to Stereophonic Sound Based on Deep Neural Networks," Proc. AES Convention, New York, USA, Oct. 29-Nov. 1, 2015, Preprint 9400.
[10] J. Herre et al., "MPEG Surround - the ISO/MPEG Standard for Efficient and Compatible Multichannel Audio Coding," J. Audio Eng. Soc., vol. 56, no. 11, Nov. 2008, pp. 932-955.
[11] J. Herre et al., "MPEG-H 3D Audio - the New Standard for Coding of Immersive Spatial Audio," IEEE J. Sel. Topics Signal Process., vol. 9, no. 5, Aug. 2015, pp. 770-779.
[12] K.M. Jeon et al., "An MDCT-Domain Audio Denoising Method with a Block Switching Scheme," IEEE Trans. Consum. Electron., vol. 59, no. 4, Nov. 2013, pp. 818-824.
[13] S.Y. Park, C.J. Chun, and H.K. Kim, "Sub-band-based Upmixing of Stereo to 5.1-Channel Audio Signals Using Deep Neural Networks," Int. Conf. Inform. Commun. Technol. Convergence, Jeju, Rep. of Korea, Oct. 19-21, 2016, pp. 377-380.
[14] G. Kovacs, L. Toth, and T. Grosz, "Robust Multi-band ASR Using Deep Neural Nets and Spectro-Temporal Features," Proc. Int. Conf. Speech Comput. (SPECOM), Novi Sad, Serbia, Oct. 5-9, 2014, pp. 386-393.
[15] ISO/IEC 23008-3:2015, Information Technology - High Efficiency Coding and Media Delivery in Heterogeneous Environments - Part 3: 3D Audio, Oct. 2015.
[16] A. Spanias, T. Painter, and V. Atti, Audio Signal Processing and Coding, Hoboken, NJ, USA: John Wiley & Sons, Inc., Jan. 2007.
[17] X. Mei and S. Sun, "An Efficient Method to Compute LSFs from LPC Coefficients," Int. Conf. Signal Process. Proc., Beijing, China, Aug. 21-25, 2000, pp. 655-658.
[18] Y. Bengio, "Learning Deep Architectures for AI," Found. Trends® Mach. Learn., vol. 2, no. 1, Jan. 2009, pp. 1-127.
[19] Y. Xu et al., "An Experimental Study on Speech Enhancement Based on Deep Neural Networks," IEEE Signal Process. Lett., vol. 21, no. 1, Jan. 2014, pp. 65-68.
[20] G.S. Kendall, "Directional Sound Processing in Stereo Reproduction," Int. Comput. Music Conf., San Jose, CA, Oct. 14-18, 1992, pp. 261-264.
[21] C. Shuixian et al., "Frequency Dependence of Spatial Cues and Its Implication in Spatial Stereo Coding," Proc. Int. Conf. Comput. Sci. Softw. Eng., Wuhan, China, Dec. 12-14, 2008, pp. 1066-1069.
[22] A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Processing, Englewood Cliffs, NJ, USA: Prentice-Hall, 1989.
[23] A.H. Gray and J.D. Markel, "Distance Measures for Speech Processing," IEEE Trans. Acoust. Speech Signal Process., vol. 24, no. 5, Oct. 1976, pp. 380-391.
[24] ITU-R BS.1534-1, Method for the Subjective Assessment of Intermediate Quality Level of Coding Systems, Jan. 2003.
[25] EBU Technical Document 3253, Sound Quality Assessment Material Recordings for Subjective Tests - Users' Handbook for the EBU-SQAM Compact Disc, Apr. 1988.
[26] http://slrdb.etri.re.kr/
[27] P. Kabal, TSP Speech Database, Department of Electrical & Computer Engineering, McGill University, Montreal, Canada, Tech. Rep. TR-2002-09-04, Sept. 2002.
