



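Wiener-Filter Speech Enhancement Algorithm Based on Multitaper Spectrum Estimation
Zhang Zhengwen (张正文); Zhou Hangqi (周航麒)

Abstract: Wiener filtering attenuates speech components excessively under complex background noise. To address this, a speech enhancement method combining multitaper spectrum estimation with Wiener filtering is proposed. The method first computes a multitaper spectrum estimate of the noisy speech and removes the noise term by wavelet thresholding to obtain an approximately clean speech spectrum. This spectrum is then compared with the speech spectrum produced by Wiener filtering, and the final enhanced spectrum is selected according to the type of distortion. Simulation results show that, for different noise types and signal-to-noise ratios, the method suppresses noise and limits speech attenuation better than the mean-square prediction error (MSCEP) and pre-whitened subspace (PSS) methods.

Journal: Journal of Henan Polytechnic University (Natural Science) (《河南理工大学学报(自然科学版)》)
Year (volume), issue: 2015(034)005
Pages: 5 (P686-690)
Keywords: signal-to-noise ratio; Wiener filtering; wavelet threshold; multitaper spectrum estimation
Affiliation: School of Electrical and Electronic Engineering, Hubei University of Technology, Wuhan 430068
Language of text: Chinese
CLC number: TN912.35

0 Introduction
Speech enhancement is a key step in speech coding.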





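The Wiener filter [4] speech enhancement algorithm is based on a statistical model. It is a short-time spectral estimation method that uses the decision-directed approach under the minimum mean-squared error (MMSE) criterion to estimate the a priori SNR of the current frame. The residual noise in the enhanced speech resembles white noise and is greatly reduced. However, Wiener filtering requires the signal being processed to be stationary, and when the speech contains a lot of noise the speech components are attenuated too strongly, so an ideal clean speech signal cannot be recovered.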
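The decision-directed Wiener gain described above can be sketched for a single frequency bin as follows. This is only an illustration under stated assumptions: the smoothing constant `alpha = 0.98`, the floor `xi_min` and the function name `wiener_gains` are not from the source, and the noise power is taken as known and constant.

```python
def wiener_gains(noisy_power, noise_power, alpha=0.98, xi_min=1e-3):
    """Wiener gains for one frequency bin over successive frames.

    noisy_power: per-frame noisy power |Y|^2 for this bin.
    noise_power: noise power estimate for this bin (assumed constant here).
    alpha, xi_min: illustrative values, not taken from the source.
    """
    gains = []
    prev_clean_power = 0.0  # |G * Y|^2 of the previous frame
    for y2 in noisy_power:
        gamma = y2 / noise_power  # a posteriori SNR
        # Decision-directed estimate of the a priori SNR of the current frame
        xi = (alpha * prev_clean_power / noise_power
              + (1.0 - alpha) * max(gamma - 1.0, 0.0))
        xi = max(xi, xi_min)
        gain = xi / (1.0 + xi)  # Wiener gain
        gains.append(gain)
        prev_clean_power = gain * gain * y2
    return gains
```

The gain approaches 1 in frames where the a priori SNR is high and approaches 0 where it is low, which is also how heavy noise leads to the strong attenuation of speech components noted above.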



I.J. Image, Graphics and Signal Processing, 2019, 9, 44-55
Published Online September 2019 in MECS
DOI: 10.5815/ijigsp.2019.09.05

Speech Enhancement based on Wavelet Thresholding the Multitaper Spectrum Combined with Noise Estimation Algorithm

P. Sunitha
Research Scholar, Dept. of ECE, JNTUK, India
Email: Sunitha4949@

Dr. K. Satya Prasad
Retd. Professor, Dept. of ECE, JNTUK, India
Email: sprasad.kodati@

Received: 26 May 2019; Accepted: 26 June 2019; Published: 08 September 2019

Abstract: This paper presents a method to reduce the musical noise encountered with most frequency-domain speech enhancement algorithms. Musical noise arises from random spectral peaks in each speech frame, caused by the large variance and inaccurate estimation of the noisy-speech and noise spectra. To obtain a low-variance spectral estimate, the paper applies wavelet thresholding to the multitaper spectrum and combines it with a noise estimation algorithm that forms the noise spectrum as a weighted average of past and present spectra according to a predetermined weighting factor. Sine multitapers are used, and the spectral coefficients are thresholded in the wavelet domain to obtain a low-variance spectrum. Both scale-dependent and scale-independent thresholds, with soft and hard thresholding and Daubechies wavelets, are evaluated with objective quality measures under eight types of real-world noise at three levels of input SNR.
To predict speech quality in the presence of noise, objective quality measures (segmental SNR, weighted spectral slope distance, log-likelihood ratio, Perceptual Evaluation of Speech Quality (PESQ) and composite measures) are compared against wavelet de-noising techniques, spectral subtraction and multiband spectral subtraction. The proposed method performs consistently across all eight noises in most of the cases considered.

Index Terms: Speech Enhancement, Wavelet thresholding, Multitaper Power Spectrum, Noise power estimation, smoothing parameter, SNR, threshold.

I. INTRODUCTION

Speech is a basic way of communicating ideas from one person to another, and it is degraded by background noise. Numerous speech enhancement algorithms are available to reduce this noise. Among them, spectral subtractive algorithms are popular because they are simple to implement and effective. These algorithms subtract the noise power spectrum from the noisy power spectrum, assuming the noise spectrum is available. They introduce musical noise because the noise estimate is inaccurate, and they work well in stationary noise but fail in non-stationary noise. Because spectral estimation plays a key role in speech enhancement, this led to the use of low-variance spectral estimation methods. To reduce the variance, an average of the estimates can be calculated across all frequencies. To improve speech quality and intelligibility in highly non-stationary noise, a speech enhancement algorithm needs a noise estimation algorithm that updates the noise spectrum continuously, and most speech enhancement applications in non-stationary scenarios use such tracking methods. Researchers now focus on improving speech quality and intelligibility through efficient noise estimation algorithms.
The noise estimate depends strongly on the smoothing parameter. A value that is too large, i.e. close to one, leads to overestimation of the noise level. The smoothing parameter is generally set small during speech activity so that the non-stationarity of the speech can be tracked. This makes it time- and frequency-dependent and tied to the speech presence or absence probability. Numerous noise estimation algorithms are available in the literature. One is the minimum statistics algorithm proposed in [1], which estimates the noise from the instantaneous SNR of speech using a smoothing parameter and a bias correction factor. It tracks the minimum over a fixed window and updates the noise PSD. Tested under non-stationary noise, this method gives large errors and cannot respond to fast increases in noise power. Martin implemented spectral subtraction with minimum statistics and evaluated it with both objective and subjective measures. Compared with spectral subtraction using voice activity detection, it gave improved speech intelligibility measures [2]. Another variant of minimum statistics, suggested in [3], estimates the noise by continuous spectral minimum tracking in sub-bands. It obtains the spectral minimum by smoothing the noisy speech power spectrum continuously with a non-linear smoothing rule. This non-linear tracking smooths the PSD continuously without distinguishing speech-presence from speech-absence segments. Its shortcoming is that the noise estimate rises whenever the noisy power spectrum rises, irrespective of changes in the actual noise power level, and likewise falls whenever the noisy power falls. The result is overestimation of the noise in speech-presence regions, i.e. clipping of speech.
This method was evaluated with objective and subjective quality measures and performed better than the minimum statistics algorithm. Noise affects the speech spectrum non-uniformly: some frequency components are affected more severely than others. This led to a time-recursive noise estimation algorithm that updates the noise spectrum when the effective SNR in a particular band is too small [4]. In this method the noise spectrum is estimated as a weighted average of past and present estimates of the noisy power spectrum, depending on the effective SNR in each frequency bin. The algorithm tracks non-stationary noise well in the case of multitalker babble. Another recursive algorithm uses a fixed smoothing factor and updates the noise spectrum by comparing the estimated a posteriori SNR against a threshold [5]. If the a posteriori SNR is larger than the threshold, speech is taken to be present and the noise spectrum is not updated. Otherwise the segment is treated as speech-absent and the noise spectrum is updated. This method is known as weighted spectral averaging. The threshold has a significant effect on the noise spectrum estimate: if it is too small the noise spectrum is underestimated, and if it is too high the spectrum is overestimated. Improvements to minimum statistics using optimal smoothing for noise power spectral density estimation were suggested in [6]. Cohen proposed a noise estimation algorithm with a time-frequency dependent smoothing factor that is updated continuously according to the speech presence probability in each frequency bin. The speech presence probability is calculated from the ratio of the noisy power spectrum to its local minimum [6]. The local minimum is computed from the smoothed noisy PSD over a fixed window by sample-wise comparison.
A shortcoming is that the estimate may lag behind the true noise PSD when the noise power is rising. To address this, a different approach suggested in [8,9] uses continuous spectral minimum tracking with a frequency-dependent threshold to identify speech-presence segments. In subjective preference tests against other noise estimation algorithms such as MS and MCRA, this method performed better. A further refinement was reported in [10]: noise power spectrum estimation in adverse environments by Improved Minima Controlled Recursive Averaging (IMCRA). This method involves two steps, smoothing and minimum tracking. Minimum tracking provides voice activity detection in each frame, whereas smoothing excludes strong speech components. The speech presence probability is calculated from the a posteriori and a priori SNRs. The method yields lower error values for the different noise types considered.

The paper is structured as follows. Section II provides a literature review, Section III covers multitaper spectral estimation and spectral refinement, Section IV presents noise estimation by weighted spectral averaging, Section V presents the proposed speech enhancement method, Section VI gives results and discussion, and Section VII concludes.

II. LITERATURE REVIEW

This section reviews the literature on spectral subtractive algorithms for single-channel enhancement. Researchers have proposed many speech enhancement methods, most of them based on spectral subtraction (SS), statistical models, subspace algorithms or transform-based methods.
A popular, computationally efficient and low-complexity noise reduction method for single-channel speech enhancement is spectral subtraction, proposed by Boll for both magnitude and power spectra, which itself creates a by-product known as synthetic (musical) noise [17]. Berouti [19] gave a significant improvement, non-linear spectral subtraction, which uses an over-subtraction factor and a spectral floor parameter to reduce the musical noise. Multi Band Spectral Subtraction (MBSS), proposed by S.D. Kamath, uses multiple subtraction factors in non-overlapping frequency bands [18]. Ephraim and Malah proposed spectral subtraction with MMSE using a gain function based on the a priori and a posteriori SNRs [20]. Virag proposed spectral subtraction based on the masking properties of the human auditory system [21]. Extended spectral subtraction by Sovka uses a Wiener filter to estimate the noise spectrum [22]. Selective spectral subtraction, described by He and Zweig, is a two-band spectral subtraction algorithm [23]. Gustafsson et al. presented spectral subtraction with adaptive gain averaging to reduce the overall processing delay [24]. The non-linear spectral subtraction (NSS) method of Lockwood and Boudy is a frequency-dependent spectral subtraction [25]. Spectral subtractive algorithms work well for additive noise but fail for colored noise. To overcome this problem, Hu and Loizou proposed a speech enhancement technique based on wavelet thresholding the multitaper spectrum [11] and evaluated it with objective quality measures.

III. MULTITAPER SPECTRAL ESTIMATION AND SPECTRAL REFINEMENT

Because of its sudden changes and sporadic behavior, the speech signal can be modeled as a non-stationary signal: statistics such as the mean, variance, covariance and higher-order moments change over time.
Spectral analysis plays a major role in speech enhancement because it determines the accuracy of the noise estimate. The FFT is widely used for power spectrum estimation in speech enhancement algorithms, especially spectral subtractive methods. The FFT-based power spectrum estimate is degraded by the variance of the estimate and by energy leakage across frequencies, which creates bias. To limit leakage, the signal is multiplied in the time domain by a window with little energy in its side lobes. The type of window affects the noise estimate, so choosing a window that gives an accurate noise estimate is significant. The Hamming window is generally preferred because of its low side-lobe energy, but it reduces only the leakage, not the variance. In most speech enhancement algorithms the noise estimate is therefore obtained with windows that reduce the bias but not the variance. The variance can be reduced by taking multiple estimates from the same sample, which is achieved with tapers. Hu and Loizou [11] used multitapers to obtain a low-variance spectral estimate, refined the spectrum further with wavelet thresholding, and used the result to improve speech quality in highly non-stationary noise. Their results show superior performance in terms of quality measures, with high correlation between subjective listening tests and objective quality measures. Speech enhancement techniques have a wide range of applications, from hearing aids to personal communication, teleconferencing, Automatic Speech Recognition (ASR), speaker authentication and voice-operated systems.
The multitaper spectrum estimator is given by

$$\hat{S}^{mt}(\omega) = \frac{1}{L} \sum_{p=0}^{L-1} \hat{S}_p^{mt}(\omega) \qquad (1)$$

with

$$\hat{S}_p^{mt}(\omega) = \left| \sum_{m=0}^{N-1} b_p(m)\, x(m)\, e^{-j\omega m} \right|^2 \qquad (2)$$

Here N is the data length and $b_p$ is the p-th sine taper used for the spectral estimate [12], given by

$$b_p(m) = \sqrt{\frac{2}{N+1}} \sin\frac{\pi p (m+1)}{N+1}, \quad m = 0, \ldots, N-1 \qquad (3)$$

The spectrum is refined further by wavelet thresholding. Define

$$v(\omega) = \frac{\hat{S}^{mt}(\omega)}{S(\omega)} \sim \frac{\chi^2_{2L}}{2L}, \quad 0 < \omega < \pi \qquad (4)$$

where $v(\omega)$ is the ratio of the estimated multitaper spectrum to the true power spectrum. Taking logarithms on both sides gives

$$\log \hat{S}^{mt}(\omega) = \log S(\omega) + \log v(\omega) \qquad (5)$$

This equation shows that the log of the multitaper spectrum can be treated as the sum of the true log spectrum and a noise term. If L is at least 5, $\log v(\omega)$ is close to normally distributed, and the random variable $n(\omega)$ is given by

$$n(\omega) = \log v(\omega) - \phi(L) + \log(L) \qquad (6)$$

where $\phi(\cdot)$ is the digamma function. $Z(\omega)$ is defined as

$$Z(\omega) = \log \hat{S}^{mt}(\omega) - \phi(L) + \log(L) \qquad (7)$$

The idea behind multitaper spectral refinement [11] can be summarized as:
1. Obtain the multitaper spectrum of the noisy speech using orthogonal sine tapers, by equation (1).
2. Apply the Daubechies Discrete Wavelet Transform to get the DWT coefficients.
3. Perform the thresholding procedure on the DWT coefficients.
4. Apply the Inverse Discrete Wavelet Transform to get the refined log spectrum.

Fig. 1. Multiple window method for spectrum estimation by individual windows.
Fig. 2. Speech signal (mtlb.wav).
Fig. 3. (a), (b) and (c) Spectrum obtained with N = 1, 2, 3, where N is the number of tapers.
Fig. 4. Final spectrum obtained by averaging.

IV. NOISE ESTIMATION BY WEIGHTED SPECTRAL AVERAGING

Noise estimation algorithms assume that the analysis segment is long enough to contain both low-energy segments and speech pauses, and that the noise in the segment is more stationary than the speech.
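As a concrete illustration of the sine-taper estimator in equations (1) to (3) of Section III, the following sketch averages the direct spectra from several sine tapers using a plain DFT. It is not the authors' implementation. The function names are illustrative, and the tapers are indexed p = 1..L here, which is an assumption made to avoid the all-zero taper that p = 0 would give in equation (3) as printed.

```python
import cmath
import math


def sine_taper(p, n):
    """p-th sine taper of length n, following equation (3)."""
    scale = math.sqrt(2.0 / (n + 1))
    return [scale * math.sin(math.pi * p * (m + 1) / (n + 1)) for m in range(n)]


def multitaper_spectrum(x, n_tapers=5):
    """Average of the tapered periodograms, equations (1) and (2).

    Returns the estimate at the DFT frequencies 0 .. n//2.
    Tapers are indexed 1..n_tapers (assumption, see lead-in).
    """
    n = len(x)
    n_freqs = n // 2 + 1
    spectrum = [0.0] * n_freqs
    for p in range(1, n_tapers + 1):
        taper = sine_taper(p, n)
        for k in range(n_freqs):
            omega = 2.0 * math.pi * k / n
            acc = sum(taper[m] * x[m] * cmath.exp(-1j * omega * m)
                      for m in range(n))
            spectrum[k] += abs(acc) ** 2 / n_tapers
    return spectrum
```

Averaging L nearly uncorrelated tapered estimates lowers the variance by roughly a factor of L, at the cost of a main lobe that widens with the number of tapers. The direct O(N^2) DFT keeps the sketch dependency-free; an FFT would normally be used.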
This paper uses noise estimation based on the variance of the spectrum, as suggested in [12]. The noise spectrum is updated when the magnitude spectrum of the noisy speech falls within a variance-based margin of the noise estimate, i.e. when

$$|\hat{S}^{mt}(\lambda,K)| - \sigma_d(\lambda,K) < \epsilon \sqrt{\mathrm{Var}_d(\lambda,K)} \qquad (8)$$

where $\mathrm{Var}_d(\lambda,K)$ is the instantaneous variance of the noise spectrum, $\epsilon$ is an adjustable parameter, $|\hat{S}^{mt}(\lambda,K)|$ is the multitaper magnitude spectrum and $\sigma_d(\lambda,K)$ is the estimate of the noise PSD. The variance of the noise spectrum is evaluated with the recursion

$$\mathrm{Var}_d(\lambda,K) = \delta\, \mathrm{Var}_d(\lambda-1,K) + (1-\delta)\left[|\hat{S}^{mt}(\lambda,K)| - \sigma_d(\lambda,K)\right]^2 \qquad (9)$$

where $\delta$ is a smoothing parameter, $\lambda$ is the frame index and K is the frequency bin. The noise estimation algorithm can be summarized as follows.

If $|\hat{S}^{mt}(\lambda,K)| - \sigma_d(\lambda-1,K) < \epsilon \sqrt{\mathrm{Var}_d(\lambda-1,K)}$, then

$$\sigma_d(\lambda,K) = \alpha\, \sigma_d(\lambda-1,K) + (1-\alpha)\,|\hat{S}^{mt}(\lambda,K)| \qquad (10)$$

and the variance is updated by equation (9). Else

$$\sigma_d(\lambda,K) = \sigma_d(\lambda-1,K) \qquad (11)$$

This paper uses this weighted spectral averaging method to estimate the noise from the noisy power spectrum, with parameters $\delta = \alpha = 0.9$ and $\epsilon = 2.5$.

V. PROPOSED METHOD

The speech enhancement method is implemented as follows:
1. Obtain the multitaper estimate of the noisy speech using sine tapers, by equation (1).
2. Perform spectral refinement by wavelet thresholding, which involves the Forward Discrete Wavelet Transform (FDWT), thresholding and the Inverse Discrete Wavelet Transform (IDWT). This paper uses Daubechies wavelets at decomposition level 5 with both soft and hard thresholding.
3. Compute $Z(\omega)$ from equation (7), apply the Discrete Wavelet Transform to $Z(\omega)$, and threshold the coefficients to obtain the refined log spectrum.
4. Estimate the noise using the weighted spectral recursive averaging algorithm discussed in Section IV.
5. Perform multitaper spectral subtraction between the refined spectrum of the noisy speech and the noise spectrum to get an estimate of the clean speech spectrum:

$$\hat{S}_x^{mt}(\omega) = \hat{S}_y^{mt}(\omega) - \hat{S}_n^{mt}(\omega) \qquad (12)$$

The subtraction can produce negative values, which are floored as

$$\hat{S}_x^{mt} = \begin{cases} \hat{S}_y^{mt} - \hat{S}_n^{mt}, & \text{if } \hat{S}_y^{mt} > \hat{S}_n^{mt} \\ \beta\, \hat{S}_n^{mt}, & \text{if } \hat{S}_y^{mt} \le \hat{S}_n^{mt} \end{cases} \qquad (13)$$

where $\beta$ is the spectral floor parameter.
6. Finally, reconstruct the enhanced speech signal using the Inverse Discrete Fourier Transform and the overlap-add method.

Fig. 5. Block diagram of proposed method.

VI. RESULTS AND DISCUSSION

Speech enhancement techniques can be assessed with either objective quality measures or subjective listening tests. A subjective listening test is a comparison of the original and processed speech by a group of listeners, based on the human auditory system. It is a complex process, and it is difficult to find listeners with good listening skills. Objective evaluation is a mathematical comparison of the clean and enhanced signals. To calculate the objective measures, the speech signal is first divided into frames of 10-30 ms; the distortion measures of all processed frames are then averaged into a single measure. This section gives the performance analysis of the proposed method using four bands. Simulations were performed in the MATLAB environment. The speech corpus is NOIZEUS, which is available at [15] and used by most researchers. It contains 30 sentences from six speakers, three male and three female, originally sampled at 25 kHz and downsampled to 8 kHz with 16-bit quantization. The clean speech is corrupted by eight real-world noises (babble, airport, station, street, exhibition, restaurant, car and train) at three levels of input SNR (0 dB, 5 dB, 10 dB).
The speech sample used here is from a male speaker, and the English sentence is "we can find joy in the simplest things". The performance evaluation is based on several quality measures: segmental SNR, Weighted Slope Spectral Distance (WSSD) [13], Log Likelihood Ratio, Perceptual Evaluation of Speech Quality (PESQ) [14] and three composite measures [13].

A. Segmental SNR (seg-SNR)

To improve the correlation between the clean and processed speech signals, the SNR can be computed over each frame of the signal and then averaged [13], which gives the segmental SNR. In the time domain the segmental Signal-to-Noise Ratio (seg-SNR) is

$$\mathrm{SNR}_{seg} = \frac{10}{M} \sum_{m=0}^{M-1} \log_{10} \frac{\sum_{n=Nm}^{Nm+N-1} x^2(n)}{\sum_{n=Nm}^{Nm+N-1} \left(x(n) - \hat{x}(n)\right)^2} \qquad (14)$$

Here $x(n)$ is the original speech signal, $\hat{x}(n)$ is the processed speech signal, N is the frame length and M is the number of frames. The seg-SNR is the mean over all frames of the per-frame SNR in dB, i.e. a geometric mean of the per-frame ratios [10], with values limited to the range [-10, 35] dB.

B. Log Likelihood Ratio (LLR)

This measure is based on LPC analysis of the speech signal:

$$\mathrm{LLR}(\vec{a}_x, \vec{a}_{\hat{x}}) = \log\left(\frac{\vec{a}_{\hat{x}} R_x \vec{a}_{\hat{x}}^{T}}{\vec{a}_x R_x \vec{a}_x^{T}}\right) \qquad (15)$$

where $\vec{a}_x$ and $\vec{a}_{\hat{x}}$ are the LPC coefficient vectors of the original and processed signals and $R_x$ is the autocorrelation matrix of the original signal. The denominator is always smaller than the numerator, so the LLR is always positive [13]; LLR values lie in the range 0 to 2.

[Fig. 5 block labels: Noisy Speech; Signal framing using Hamming Window; Multitaper spectral estimation using sine-tapers; Wavelet thresholding the Multitaper spectrum; Noise estimation by weighted spectral averaging; Spectral Subtraction; Inverse DFT & OLA; Enhanced Speech]

C. Weighted Slope Spectral Distance (WSSD)

This measure is the weighted difference between the spectral slopes of the original and processed signals in each band, where the slopes are computed with a first-order difference operation [13]:

$$\mathrm{WSSD} = \frac{1}{M} \sum_{m=0}^{M-1} \frac{\sum_{j=1}^{K} W(j,m)\left(X_x(j,m) - X_{\hat{x}}(j,m)\right)^2}{\sum_{j=1}^{K} W(j,m)} \qquad (16)$$

D. Perceptual Evaluation of Speech Quality (PESQ)

PESQ is an objective quality measure recommended by ITU-T [14] that predicts speech quality accurately but is computationally more complex. It is a linear combination of the average disturbance $D_{ind}$ and the average asymmetrical disturbance $A_{ind}$:

$$\mathrm{PESQ} = 4.754 - 0.186\, D_{ind} - 0.008\, A_{ind} \qquad (17)$$

E. Composite Measures

A linear combination of existing objective quality measures gives a new measure [10], whose weights can be found by linear regression analysis. This paper uses multiple linear regression analysis to obtain the following composite measures [13], which are measured on a five-point scale.

(i) Signal Distortion ($C_{sig}$): a linear combination of the PESQ, LLR and WSSD measures [13]:

$$C_{sig} = 3.093 - 1.029\,\mathrm{LLR} + 0.603\,\mathrm{PESQ} - 0.009\,\mathrm{WSSD} \qquad (18)$$

(ii) Background Intrusiveness ($C_{bak}$): a linear combination of the PESQ, seg-SNR and WSSD measures [13]:

$$C_{bak} = 1.634 + 0.478\,\mathrm{PESQ} - 0.007\,\mathrm{WSSD} + 0.063\,\text{seg-SNR} \qquad (19)$$

(iii) Overall Quality ($C_{ovl}$): a linear combination of the LLR, PESQ and WSSD measures:

$$C_{ovl} = 1.594 + 0.805\,\mathrm{PESQ} - 0.512\,\mathrm{LLR} - 0.007\,\mathrm{WSSD} \qquad (20)$$

The scales of signal distortion, background intrusiveness and overall quality are shown in Tables 1, 2 and 3.

Table 1. Scale of Signal Distortion
Table 2. Scale of Background Intrusiveness
Table 3. Scale of Overall Quality

To obtain the objective quality measures for the proposed method, the multitaper spectrum is first obtained using sine tapers. The spectrum is then refined by wavelet thresholding, and the noise spectrum is estimated using weighted spectral averaging. The results are compared against wavelet de-noising using hard thresholding (WDH) and soft thresholding (WDS) as suggested in [16], Spectral Subtraction (SS) [17] and Multi Band Spectral Subtraction (MBSS) [18].

Table 4. Objective quality measures: Segmental SNR (seg_SNR), Log Likelihood Ratio (LLR), Weighted Slope Spectral Distance (WSSD), Perceptual Evaluation of Speech Quality (PESQ)
Table 5. Composite measures ($C_{sig}$, $C_{bak}$, $C_{ovl}$) for eight different types of noise

Fig. 6. (a) Segmental SNR, (b) Log Likelihood Ratio, (c) Weighted spectral slope distance, (d) PESQ, (e) Signal Distortion ($C_{sig}$), (f) Background intrusiveness ($C_{bak}$), (g) Overall quality ($C_{ovl}$) measures against input SNR.
Fig. 7. Time domain and spectrogram representation of clean speech, noisy speech and speech enhanced by SS [17], MBSS [18], WDH, WDS [16], wavelet thresholding the multitaper spectrum [11] and the proposed method.

VII. CONCLUSION

The results in Table 4 show that the wavelet de-noising techniques perform very poorly on all objective quality measures, i.e. they give lower segmental SNR and PESQ and higher LLR and WSSD than the other techniques in all the cases considered. The proposed method gives higher segmental SNR and PESQ than all the other methods for all noise types at the three input SNR levels. In terms of LLR and WSSD, however, the proposed method performs worse than Multi Band Spectral Subtraction.
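For reference, the composite measures of equations (18) to (20) can be computed from the four basic measures with a few lines of code. This is a sketch under stated assumptions: the function name is illustrative, the WSSD term in C_bak is taken with a negative sign following the cited evaluation paper [13], and clipping the results to the five-point range [1, 5] is an assumption rather than something the text specifies.

```python
def composite_measures(llr, pesq, wssd, seg_snr):
    """Return (C_sig, C_bak, C_ovl) from equations (18) to (20)."""
    c_sig = 3.093 - 1.029 * llr + 0.603 * pesq - 0.009 * wssd
    # Negative WSSD weight assumed, following reference [13]
    c_bak = 1.634 + 0.478 * pesq - 0.007 * wssd + 0.063 * seg_snr
    c_ovl = 1.594 + 0.805 * pesq - 0.512 * llr - 0.007 * wssd

    def clip(value):
        # Keep each score on the five-point scale (assumption)
        return min(5.0, max(1.0, value))

    return clip(c_sig), clip(c_bak), clip(c_ovl)
```

Because PESQ carries the largest positive weight in all three formulas, a method that improves PESQ and segmental SNR can raise the composite scores even when its LLR and WSSD are somewhat worse.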
The composite measures in Table 5 indicate that the proposed method improves on all four comparison methods in all three composite measures. The same results are shown as graphs in Fig. 6 (a) to (g), averaged over all eight noises at the three SNR levels. From these results it can be concluded that the proposed method yields higher segmental SNR, PESQ and composite measures.

Fig. 7 shows the time-domain and frequency-domain representations of noisy speech, noise and enhanced speech for Spectral Subtraction [17], Multi Band Spectral Subtraction [18], wavelet de-noising with both soft and hard thresholding [16], wavelet thresholding the multitaper spectrum [11] and the proposed method. Spectrograms are widely used in speech processing to plot the frequency spectrum as it varies with time. A spectrogram can be computed as a sequence of FFTs over windowed segments of 20 ms. In the time domain, the speech enhanced by the spectral subtractive algorithm contains musical noise. This noise is eliminated by Multi Band Spectral Subtraction, as the spectrograms also show. The wavelet de-noising techniques suppress the noise. The proposed method gives an enhanced signal closer to the original clean speech, and its spectrogram is also closer to that of the clean speech.

ACKNOWLEDGEMENT

I would like to take this opportunity to express my profound gratitude and deep regard to my Research Guide Dr. K. Satya Prasad for his exemplary guidance, valuable feedback and constant encouragement throughout the research. His suggestions were of immense help. Working under him was an extremely enriching experience for me.
I would also like to give my sincere gratitude to the authors Hu and Loizou for inspiring me with their research papers in the field of speech enhancement and objective quality measures.

REFERENCES

[1] R. Martin, "An efficient algorithm to estimate the instantaneous SNR of speech signals," Proceedings of Eurospeech, Berlin, pp. 1093-1096, 1993.
[2] R. Martin, "Spectral subtraction based on minimum statistics," Proceedings of European Signal Processing, U.K., pp. 1182-1185, 1994.
[3] G. Doblinger, "Computationally efficient speech enhancement by spectral minima tracking in sub bands," Proceedings of Eurospeech, Spain, pp. 1513-1516, 1995.
[4] H. Hirsch and C. Ehrlicher, "Noise estimation techniques for robust speech recognition," Proceedings of IEEE International Conference on Acoustic Speech Signal Processing, MI, pp. 153-156, 1995.
[5] R. Martin, "Noise Power Spectral Density Estimation based on Optimal Smoothing and Minimum statistics," IEEE Transactions on Audio, Speech Processing, pp. 504-512, 2001.
[6] I. Cohen, "Noise Estimation by Minima controlled recursive averaging for robust speech enhancement," IEEE Signal Processing Letters, pp. 12-15, 2002.
[7] I. Cohen, "Noise spectrum Estimation in adverse environments: Improved Minima controlled recursive averaging," IEEE Transactions on Audio, Speech Processing, pp. 466-475, 2003.
[8] L. Lin, W. Holmes and E. Ambikairajah, "Adaptive noise estimation algorithm for speech enhancement," Electron. Lett., 754-555, 2003.
[9] Loizou, R. Sundarajan, Y. Hu, "Noise estimation Algorithm with rapid Adaption for highly non-stationary environments," Proceedings of IEEE International Conference on Acoustic Speech Signal Processing, 2004.
[10] Loizou, R. Sundarajan, "A Noise estimation Algorithm for highly non-stationary Environments," Speech Communication, 48, Science Direct, pp. 220-231, 2006.
[11] Yi Hu, P.C. Loizou, "Speech enhancement based on wavelet thresholding the multitaper spectrum," IEEE Transactions on Speech and Audio Processing, pp. 59-67, 2004.
[12] C. Ris and S. Dupont, "Assessing local noise level estimation methods: Applications to noise robust ASR," Speech Communication, pp. 141-158, 2001.
[13] Yi Hu, P.C. Loizou, "Evaluation of objective Quality Measures for Speech Enhancement," IEEE Transactions on Audio, Speech and Language Processing, pp. 229-238, Jan. 2008.
[14] ITU-T Rec., "Perceptual evaluation of speech quality (PESQ), An objective method for end to end speech quality assessment of narrowband telephone networks and speech codecs," International Telecommunications Union, Geneva, Switzerland, February 2001.
[15] A Noisy Speech Corpus for Assessment of Speech Enhancement Algorithms. https: // / Loizou /speech/noizeous.
[16] D.L. Donoho, "De-noising by soft thresholding," IEEE Trans. Inform. Theory, 41(3), 613-627, 1995.
[17] S.F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics Speech and Signal Processing, 27(2), 113-120, 1979.
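The noise update of Section IV and the floored subtraction of equation (13) in the paper above can be sketched per frequency bin as follows. This is a hedged illustration, not the authors' code: the function names are illustrative, the default `beta = 0.002` is an assumed value (the paper does not state it), and holding the variance unchanged when the noise estimate is not updated is also an assumption.

```python
import math


def update_noise(noisy_mag, noise_prev, var_prev,
                 alpha=0.9, delta=0.9, eps=2.5):
    """Weighted spectral averaging update for one bin and one frame.

    alpha, delta and eps default to the values quoted in the paper.
    Returns the new (noise estimate, variance) pair.
    """
    if noisy_mag - noise_prev < eps * math.sqrt(var_prev):
        # Frame looks noise-like: recursive average, then variance update
        noise = alpha * noise_prev + (1.0 - alpha) * noisy_mag
        var = delta * var_prev + (1.0 - delta) * (noisy_mag - noise) ** 2
        return noise, var
    # Speech likely present: keep the previous estimate
    # (variance held as well, which is an assumption)
    return noise_prev, var_prev


def subtract_with_floor(noisy, noise, beta=0.002):
    """Spectral subtraction with a spectral floor, equation (13).

    beta is an illustrative value, not taken from the paper.
    """
    if noisy > noise:
        return noisy - noise
    return beta * noise
```

The floor keeps the enhanced spectrum positive, so isolated bins do not drop to zero between frames, which is one source of musical noise.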



Speech Enhancement Algorithm Based on Wavelet Packet and Adaptive Wiener Filter
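DONG Hu (1,2), XU Yu-ming (1), MA Zhen-zhong (1), LI Lie-wen (1), REN Ke (1)
(1. School of Information Science and Engineering, Changsha Normal University, Changsha 410100, China; 2. School of Physics and Electronics, Hunan Normal University, Changsha 410181, China)

Abstract: Speech enhancement is mainly used to improve the intelligibility and quality of speech corrupted by noise, and its main application is improving mobile communication quality in noisy environments. Traditional speech enhancement methods include spectral subtraction, Wiener filtering and wavelet coefficient methods. In complex noise environments, traditional algorithms give poor enhanced speech quality and leave musical noise. To address this, a speech enhancement algorithm combining the wavelet packet transform with adaptive Wiener filtering is proposed. The role of wavelet packet multiresolution analysis in partitioning the signal spectrum is analyzed. The noisy signal is decomposed at multiple scales by wavelet packets, the wavelet packet coefficients at each scale are filtered with an adaptive Wiener filter, and the filtered coefficients are reconstructed to obtain the enhanced speech signal. Simulation results show that, compared with traditional enhancement algorithms, the algorithm raises the SNR of noisy speech more effectively in non-stationary noise at low SNR and also preserves the spectral characteristics of the speech better, improving the quality of the noisy speech.

Keywords: speech enhancement; wavelet packet; adaptive Wiener filtering; multiresolution analysis; multi-scale decomposition
CLC number: TP301.6    Document code: A    Article ID: 1673-629X(2020)01-0050-04
doi:10.3969/j.issn.1673-629X.2020.01.009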




























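Hard threshold function:

$$\hat{w}_{j,k} = \begin{cases} w_{j,k}, & |w_{j,k}| \ge \lambda \\ 0, & |w_{j,k}| < \lambda \end{cases}$$

Soft threshold function:

$$\hat{w}_{j,k} = \begin{cases} \operatorname{sign}(w_{j,k})\,(|w_{j,k}| - \lambda), & |w_{j,k}| \ge \lambda \\ 0, & |w_{j,k}| < \lambda \end{cases}$$

Denoising must weigh two factors together: removing noise and avoiding distortion.

Five-level wavelet decomposition of the speech signal S: cA5, cD1, cD2, cD3, cD4, cD5.

Definition of the energy element: for a data sequence {x_i}, i = 1, 2, ..., N, the energy element of the sequence is [expression not recoverable].

New threshold function: the surviving fragments show a piecewise form in $w_{j,k}$, $\operatorname{sgn}(w_{j,k})$ and an exponential term in $w_{j,k}$, split on $|w_{j,k}|$ and with a parameter $a$ between 0 and 1; the full expression is not recoverable. The accompanying figure is for λ = 1.

MSE after denoising with the new algorithm: 0.0532, 0.0467, 0.0423, 0.0397.

A new threshold function is proposed. Simulation experiments show that, compared with traditional methods, the new threshold function both reflects the general shape of the original signal well and denoises effectively.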
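The two classical rules that the new threshold function is compared against, hard and soft thresholding of a wavelet coefficient, can be written as the short sketch below. The function names are illustrative, and the new threshold function itself is not reproduced because its expression is incomplete in the text.

```python
import math


def hard_threshold(w, lam):
    """Keep the coefficient unchanged if |w| >= lam, otherwise set it to zero."""
    return w if abs(w) >= lam else 0.0


def soft_threshold(w, lam):
    """Shrink |w| by lam (keeping the sign) if |w| >= lam, otherwise zero."""
    if abs(w) >= lam:
        return math.copysign(abs(w) - lam, w)
    return 0.0
```

Hard thresholding is discontinuous at |w| = lam, while soft thresholding is continuous but biases every retained coefficient by lam. This is the trade-off between noise removal and distortion that motivates intermediate threshold functions.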



































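III. Research Methods and Approach
1. Collect the literature on speech signal enhancement, and analyze and summarize the existing methods.
2. Analyze the basic principles and characteristics of the wavelet transform and its applications in speech signal processing.
3. Design the experimental plan, collect speech data, and analyze and process it with the wavelet transform.
4. Based on the experimental results, optimize and improve key techniques such as the choice of wavelet basis function, the number of decomposition levels and the threshold selection.
5. Verify the optimized algorithm in practical application scenarios and compare it with existing methods.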

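IV. Expected Results and Application Value
The expected results of this project include:
1. An optimized speech signal enhancement algorithm based on the wavelet transform.
2. Verification of the optimized algorithm in practical applications, with a comparison against existing methods.
3. Solutions tailored to speech signal enhancement problems in different scenarios.
4. Published academic papers and related patent applications.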





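First, the noisy speech is decomposed at multiple scales with the wavelet transform. The MMSE algorithm is then applied to the wavelet coefficients at each scale. Finally, the MMSE-processed wavelet coefficients are reconstructed to obtain the enhanced speech.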
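The decompose, process per scale, reconstruct pipeline described above can be sketched as follows. Two substitutions are made for brevity and should be read as assumptions: the Haar wavelet stands in for whatever wavelet basis is actually used, and the per-coefficient `process` callback is a placeholder for the MMSE estimator, which is not implemented here. All function names are illustrative.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_decompose(x, levels):
    """Multi-level Haar DWT. Returns (approx, details), details[0] finest."""
    if len(x) % (2 ** levels) != 0:
        raise ValueError("length must be divisible by 2**levels")
    approx, details = list(x), []
    for _ in range(levels):
        half = len(approx) // 2
        low = [(approx[2 * i] + approx[2 * i + 1]) / SQRT2 for i in range(half)]
        high = [(approx[2 * i] - approx[2 * i + 1]) / SQRT2 for i in range(half)]
        details.append(high)
        approx = low
    return approx, details


def haar_reconstruct(approx, details):
    """Inverse of haar_decompose."""
    signal = list(approx)
    for high in reversed(details):
        merged = []
        for a, d in zip(signal, high):
            merged.extend([(a + d) / SQRT2, (a - d) / SQRT2])
        signal = merged
    return signal


def enhance(x, levels, process):
    """Decompose, apply `process` to every detail coefficient, reconstruct."""
    approx, details = haar_decompose(x, levels)
    details = [[process(c) for c in band] for band in details]
    return haar_reconstruct(approx, details)
```

With `process` set to the identity the input is recovered exactly (up to rounding), which is a useful sanity check before plugging in a real per-scale estimator.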


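A new algorithm combining MMSE with the wavelet transform is put forward to improve the performance of speech recognition. First, the noisy speech is decomposed at multiple scales with the wavelet transform; the wavelet coefficients at each scale are then processed with the MMSE algorithm; finally, the processed coefficients are reconstructed to obtain the enhanced speech. MATLAB experiments comparing the presented algorithm with traditional ones indicate that it performs better in speech enhancement.

Journal: Journal of Ningbo University (Natural Science & Engineering Edition) (《宁波大学学报(理工版)》)
Year (volume), issue: 2016(029)003
Pages: 4 (P68-71)
Keywords: minimum mean-square error; wavelet transform; speech enhancement
Authors: Pan Xiaolong (潘小龙); Zhang Weiqiang (张卫强); Li Yuanhong (郦元宏)
Affiliation: School of Information Science and Engineering, Ningbo University, Ningbo, Zhejiang 315211
Language of text: Chinese
CLC number: TN912.35

Speech recognition performs well in quiet environments, but real environments inevitably contain noise, which affects recognition considerably, so processing noisy speech is particularly important. Current speech enhancement methods include short-time spectral amplitude (STSA) estimation, pitch-period methods, speech parameter model methods and auditory scene analysis.

Among STSA methods, the minimum mean-square error (MMSE) estimator makes more use of prior knowledge about the statistics of speech and noise than traditional methods such as spectral subtraction [1]. It estimates the prior probability of the speech coefficients and then estimates the coefficients under the minimum mean-square error criterion. The MMSE criterion treats the signal as stationary over short intervals. When the signal is non-stationary, its denoising performance drops considerably, and at low SNR the distortion caused by enhancement, including deformation of the speech spectrum and residual noise, interferes strongly with the signal and lowers the accuracy of the recognition system. Wavelet-based speech enhancement algorithms were proposed to solve these problems. Because the wavelet transform denoises well at low SNR and handles non-stationary signals, it can reduce the distortion of MMSE at low SNR. Combining MMSE with the wavelet transform can therefore improve the performance of a speech recognition system.

MMSE can largely avoid the "musical noise" produced by traditional denoising algorithms such as spectral subtraction [2], because it uses more prior knowledge about the statistics of speech and noise. The MMSE principle is introduced below through short-time spectral analysis.

The short-time speech spectrum can be expressed in exponential form as follows [3]: [equations missing]. These formulas divide the signal into frames, where i denotes the i-th frame, X(k,i) and Y(k,i) are the spectra of the clean and noisy speech, and A(k,i) and N(k,i) are the short-time spectral amplitudes of the clean and noisy speech. For a noisy speech signal, the goal is to make the estimated short-time spectral amplitude of the clean speech as close as possible to the true one, i.e. to minimize the signal distortion.

Assuming the spectral components are mutually independent, the MMSE estimate of the short-time speech spectrum can be derived as [equation missing], where a(k) is the spectral amplitude corresponding to one frame of A(k). Assuming the noise spectrum follows a zero-mean Gaussian distribution, [equation missing]. Assuming the speech spectrum follows a Gaussian distribution, the joint distribution of its amplitude and phase is [equation missing], where D(k) is the spectral amplitude of the noise, λX(k) is the expected energy of the clean speech, and λD(k) is the expected noise energy, which can be estimated from silent frames during speech pauses. Substituting into equation (3) gives [equation missing], which involves the hypergeometric function [4], and then [equation missing], which involves the a priori SNR and the a posteriori SNR. Equation (8) can be written in the form [equation missing] with a gain function. These formulas give the estimated spectral amplitude of the clean speech. Attaching the phase of the noisy signal and applying the inverse Fourier transform yields the enhanced speech.

As a frequency-domain analysis method, the Fourier transform maps components of the same frequency at different times onto the same frequency point. It therefore cannot analyze the characteristics of a particular frequency at a particular time, that is, it cannot express the local time-frequency properties of a signal, which are exactly what matters for non-stationary signals. The wavelet transform is a local transform in space and frequency, so it can extract information from the signal effectively and overcomes this drawback.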
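Wavelet denoising first applies a multi-scale wavelet transform, then processes the wavelet coefficients, removing those that belong to the noise and keeping those of the original signal, and finally applies the inverse wavelet transform (wavelet reconstruction) to obtain a denoised approximation of the true signal [5-6]. Suppose the function $\varphi(x)$ is square-integrable, i.e. $\varphi(x) \in L^2(\mathbb{R})$, and satisfies the admissibility condition

$$C_\varphi = \int_{-\infty}^{\infty} \frac{|\Phi(\omega)|^2}{|\omega|}\,d\omega < \infty,$$

where $\Phi(\omega)$ is the Fourier transform of $\varphi(x)$; then $\varphi(x)$ is called a wavelet basis function. For real numbers a and b, a is the scale factor of the wavelet transform and b is the translation factor. For the discrete wavelet transform one usually defines $a = a_0^{\,j}$ and $b = k\,a_0^{\,j}\,b_0$, where $j, k \in \mathbb{Z}$. The discrete wavelet functions generated by the wavelet basis $\varphi(x)$ and depending on (a, b) are [7-8]

$$\varphi_{j,k}(x) = a_0^{-j/2}\,\varphi\!\left(a_0^{-j}x - k\,b_0\right),$$

and the wavelet transform of a signal f(t) with respect to the basis $\varphi(x)$ is

$$W_f(j,k) = \int_{-\infty}^{\infty} f(t)\,\overline{\varphi_{j,k}(t)}\,dt.$$

In practice one usually takes $a_0 = 2$ and $b_0 = 1$. The discrete wavelet transform is what makes wavelet analysis feasible on a computer.

The wavelet transform splits the noisy speech into wavelet coefficients at different scales. The energy of the useful signal is concentrated in the large wavelet coefficients and in specific frequency ranges, whereas the noise is spread over the entire wavelet domain. After decomposition the coefficients of the useful signal therefore have larger magnitudes than those of the noise, and thresholding can be used to extract the useful signal. MMSE processing is then applied to the wavelet coefficients at each scale, and the processed coefficients are finally reconstructed to obtain the enhanced speech signal. The overall framework of the new algorithm is shown in Fig. 1.

The noisy speech is first decomposed with the Mallat algorithm, giving decomposition coefficients at different scales, which correspond in the frequency domain to sub-band signals of different frequencies. Let the frequency space 0 to P be F0. One level of decomposition splits it into the low-frequency subspace F1 (0 to P/2) and the high-frequency space W1 (P/2 to P); the low-frequency space is then decomposed repeatedly, giving Fn, Wn, ..., W2, W1, which are mutually disjoint. The corresponding wavelet coefficients are cd1, cd2, cd3, cd4, cd5 and ca5, where cdn are the high-frequency (detail) coefficients and ca5 are the level-5 low-frequency (approximation) coefficients. MMSE estimation is then applied to the wavelet coefficients of each frequency range and scale to obtain improved coefficients. Finally, the processed coefficients are reconstructed with the Mallat algorithm to obtain the enhanced speech.

Different wavelet basis functions can be chosen for wavelet denoising. The simulation uses the orthogonal Daubechies wavelet of order 4 with 5 decomposition levels. The speech is a recording of the two characters "宁波" (Ningbo) spoken in a noisy environment, sampled at 16 kHz, PCM, 16-bit, mono; MMSE performs the minimum mean-square error estimation of the wavelet coefficients. Fig. 2 compares the noisy signal after wavelet denoising with the signal after the new algorithm. The signal amplitude under the new algorithm is somewhat smaller than under wavelet denoising, and the noise-only segments are smoother, indicating good noise suppression. Figs. 3 to 5 show the wavelet coefficients of the speech signal before denoising, after wavelet denoising, and after the new algorithm. Before filtering, cd1 to cd5 are the level-1 to level-5 high-frequency coefficients, with frequency ranges of 8000 to 16000 Hz, 4000 to 8000 Hz, 2000 to 4000 Hz, 1000 to 2000 Hz and 500 to 1000 Hz; ca5 is the level-5 low-frequency coefficient, with a frequency range of 0 to 500 Hz. After filtering, cd1 to cd5 likewise denote the level-1 to level-5 high-frequency coefficients. Comparing the plots for the different filtering methods shows that the coefficients filtered by the new algorithm contain visibly less noise, so the signal reconstructed from them suppresses noise well.

The speech was recorded with the Cooledit software, using the English pronunciations of the ten digits 1 to 10 as test material. Recording used a 16000 Hz sampling rate, mono, at 16-bit precision. Recording 24 speakers gave 240 utterances, of which 120 were used to train the recognition model and the other 120 to test recognition. The 24th-order MFCCs of each frame were extracted as speech features, and the noise was Gaussian white noise. Recognition rates were obtained for the noisy speech and for speech denoised by MMSE, by the wavelet transform, and by the new method. The data in Table 1 show that the improved method denoises somewhat better than the other two methods.

5 Conclusion

MMSE uses more prior knowledge of the speech and noise statistics than traditional methods such as spectral subtraction, but it presupposes short-time stationarity of the signal, which greatly limits its applicability. The wavelet transform, by contrast, enhances speech well under non-stationary and low-SNR conditions. The proposed method combines the strengths of MMSE and the wavelet transform. The experimental results show that the processed signal suffers little damage and is well denoised, which improves the performance of the speech recognition system relative to traditional denoising methods. The computational complexity of the algorithm is high, however, and it does introduce some waveform distortion; both points remain for future work.

【References】
[1] 宁更新. 抗噪声语音识别新技术的研究[D]. 广州: 华南理工大学, 2006.
[2] 熊燕. 抗噪声语音识别技术研究[J]. 信息科技及现代服务, 2006(7): 204-205.
[3] 方瑜. 语音增强相关问题研究[D]. 北京: 北京邮电大学, 2011.
[4] 丁沛. 语音识别中的抗噪声技术[D]. 北京: 清华大学, 2003.
[5] 王苏敏, 谢小云, 邓茜. 基于小波去噪的语音识别系统[J]. 数字技术与应用, 2012(5): 232.
[6] 毛艳辉. 小波去噪在语音识别预处理中的应用[D]. 上海: 上海交通大学, 2010.
[7] 胡惠英, 吴善培. 小波去噪在语音识别中的应用[J]. 北京邮电大学学报, 1999, 22(3): 31-34.
[8] 崔晓, 张松炜. 基于小波和先验信噪比维纳滤波的语音增强[J]. 河南师范大学学报(自然科学版), 2013, 41(1): 43-46.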
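The decomposition, per-band processing and reconstruction steps described in this paper can be sketched in a few lines of Python. This is a minimal illustration under loudly stated assumptions, not the authors' implementation: it uses the Haar wavelet instead of the order-4 Daubechies wavelet so that it needs no external library, it substitutes a simple Wiener-like gain for the MMSE estimator, it processes only the detail bands, and every function name is hypothetical.

```python
import math

def haar_dwt(x):
    """One level of the orthonormal Haar DWT; len(x) must be even."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def wavedec(x, levels):
    """Mallat-style pyramid: returns ([cd1, ..., cdL], caL)."""
    details, approx = [], list(x)
    for _ in range(levels):
        approx, d = haar_dwt(approx)
        details.append(d)
    return details, approx

def waverec(details, approx):
    """Rebuild the signal from the coarsest level upwards."""
    for d in reversed(details):
        approx = haar_idwt(approx, d)
    return approx

def wiener_like_gain(band, noise_var):
    """Stand-in for the MMSE step (assumption): scale each coefficient by
    xi / (1 + xi), with xi = max(c^2 / noise_var - 1, 0)."""
    out = []
    for c in band:
        xi = max(c * c / noise_var - 1.0, 0.0)
        out.append(c * xi / (1.0 + xi))
    return out

def enhance(x, levels, noise_var):
    """Decompose, attenuate the detail bands, reconstruct."""
    details, approx = wavedec(x, levels)
    details = [wiener_like_gain(d, noise_var) for d in details]
    return waverec(details, approx)

def subband_edges(fs, levels):
    """Nominal (low, high) edges in Hz for cd1..cdL, followed by caL."""
    top = fs / 2.0  # a sampled signal only spans 0..fs/2
    bands = []
    for _ in range(levels):
        bands.append((top / 2.0, top))
        top /= 2.0
    bands.append((0.0, top))
    return bands
```

For a sampling rate of 16000 Hz and five levels, `subband_edges` gives 4000 to 8000 Hz for cd1 down to 0 to 250 Hz for ca5, because a dyadic decomposition can only span frequencies up to fs/2. The band edges quoted in the text are twice these values, which would correspond to a 32 kHz sampling rate.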



Speech Enhancement Algorithm of Multiple Window Spectrum Estimation Under Wavelet Packet Decomposition
查诚;杨平;潘平
【Journal】《计算机工程》
【Year (Volume), Issue】2012 (038) 005
【Abstract】Traditional spectral subtraction is a single-resolution algorithm based on the short-time Fourier transform (STFT), and its spectral estimate has large variance. To address this, a speech enhancement algorithm based on multitaper (multiple-window) spectral estimation under wavelet packet decomposition is proposed. The noisy speech is decomposed into different frequency bands by a wavelet packet, multitaper spectral subtraction is carried out in each band, and the bands are then reconstructed one by one with the wavelet packet to obtain the denoised speech signal. Simulation results show that the algorithm raises the signal-to-noise ratio (SNR) of noisy speech and reduces speech distortion.
【Pages】3 (P291-293)
【Authors】查诚;杨平;潘平
【Affiliation】College of Computer Science and Information, Guizhou University, Guiyang 550025 (all authors)
【Language】Chinese
【CLC number】TN912
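The core idea in this abstract, replacing a single periodogram with an average over several orthogonal tapers before spectral subtraction, can be illustrated for a single frame. The sketch below is an assumption-laden example rather than the paper's method: it uses sine tapers (the paper's taper family is not stated here), a direct O(n²) DFT to avoid dependencies, a generic subtraction rule with hypothetical over-subtraction factor `alpha` and spectral floor `beta`, and it omits the wavelet packet band split entirely.

```python
import cmath
import math

def sine_tapers(n, k):
    """First k orthonormal sine tapers of length n."""
    scale = math.sqrt(2.0 / (n + 1))
    return [[scale * math.sin(math.pi * (j + 1) * (t + 1) / (n + 1))
             for t in range(n)]
            for j in range(k)]

def power_spectrum(frame):
    """Squared magnitude of a direct DFT (fine for short frames)."""
    n = len(frame)
    spec = []
    for f in range(n):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * f * t / n)
                  for t in range(n))
        spec.append(abs(acc) ** 2)
    return spec

def multitaper_psd(frame, k=4):
    """Average the k eigenspectra; averaging is what lowers the variance
    relative to a single periodogram."""
    n = len(frame)
    eigenspectra = [power_spectrum([h * x for h, x in zip(taper, frame)])
                    for taper in sine_tapers(n, k)]
    return [sum(s[f] for s in eigenspectra) / k for f in range(n)]

def spectral_subtract(noisy_psd, noise_psd, alpha=1.0, beta=0.01):
    """Generic power spectral subtraction with a floor at beta * noise
    (alpha and beta are illustrative parameters, not from the paper)."""
    return [max(py - alpha * pn, beta * pn)
            for py, pn in zip(noisy_psd, noise_psd)]
```

In a full system along the lines of the abstract, `multitaper_psd` would be applied to each frame of each wavelet packet sub-band, with `noise_psd` estimated from speech pauses, before the sub-bands are recombined.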


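Computer Engineering and Applications (计算机工程与应用)
◎Database, Signal and Information Processing◎

Research on a Speech Enhancement Algorithm Based on Multitaper Spectral Estimation
彭雨晨, 王忠
PENG Yuchen, WANG Zhong

Abstract: Multi-band spectral entropy reflects not only the same frequency characteristics as spectral entropy but also the distribution of energy, so multi-band spectral entropy estimation is preferred for speech detection. Simulations demonstrate the advantage of multi-band spectral entropy estimation over spectral entropy estimation in detecting non-stationary signals, and a multi-band spectral entropy noise estimation algorithm suited to the tank environment is determined. Combining multi-band spectral entropy estimation, correlation weighting and frame-wise subtraction, an improved speech enhancement algorithm based on multitaper spectral estimation is proposed. Simulation results show that the proposed algorithm not only suppresses background noise and musical noise better, but also preserves the intelligibility and naturalness of the speech well.
Keywords: speech enhancement; multi-band spectral entropy estimation; multitaper spectral estimation; correlation weighting; frame-wise subtraction
Article ID: 1002-8331(2012)19-0114-05  Document code: A  CLC number: TN912.35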
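The contrast drawn in this abstract between spectral entropy and multi-band spectral entropy can be made concrete with a short sketch. It assumes the common definitions, since the paper's exact formulas are not given here: spectral entropy normalizes the power spectrum bins into a probability distribution and takes its Shannon entropy, while the multi-band variant first sums the power inside equal-width bands. The function names and the equal-width band split are hypothetical.

```python
import math

def spectral_entropy(power):
    """Shannon entropy (in nats) of a power spectrum normalized to sum to 1."""
    total = sum(power)
    if total <= 0.0:
        return 0.0
    probs = [p / total for p in power if p > 0.0]
    return -sum(p * math.log(p) for p in probs)

def multiband_spectral_entropy(power, n_bands):
    """Entropy of the band energies: power is summed within n_bands
    equal-width bands first (leftover bins at the end are ignored)."""
    size = len(power) // n_bands
    band_energy = [sum(power[b * size:(b + 1) * size]) for b in range(n_bands)]
    return spectral_entropy(band_energy)
```

A flat, noise-like spectrum gives the maximum entropy (log of the number of bins or bands), while a spectrum dominated by a few strong components gives a value near zero, which is the property a speech/noise detector based on these measures would rely on.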


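The purpose is to extract the original speech, as clean as possible, from the noisy speech signal, raising the signal-to-noise ratio and improving speech quality.

The signal is passed through a filter that removes the noise frequency components. For transient signals, broadband noise, non-stationary signals and the like, however, this approach can sometimes distort the signal itself considerably. Linear filtering also faces a conflict between preserving the local features of the signal and suppressing the noise. Wavelet transform theory, with its localized time-frequency analysis and its flexibility in the choice of wavelet function, provides a powerful tool for resolving this conflict. This paper therefore uses the wavelet transform as a tool to seek a suitable speech enhancement algorithm.

2 Algorithm
2.1 Algorithm steps
(1) Compute the orthogonal wavelet transform of the noise-contaminated signal. Choose a suitable wavelet and number of decomposition levels, and decompose the noise-contaminated signal down to the chosen level.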
Computer Engineering (计算机工程), Vol. 38, No. 5, March 2012
Article ID: 1000-3428(2012)05-0291-03  Document code: A  CLC number: TN912

Speech Enhancement Algorithm of Multiple Window Spectrum Estimation Under Wavelet Packet Decomposition
查诚, 杨平, 潘平
ZHA Cheng, YANG Ping, PAN Ping
(College of Computer Science and Information, Guizhou University, Guiyang 550025, China)

Key words: spectral subtraction; multiple window spectrum; wavelet packet decomposition; noise; Signal Noise Ratio (SNR)
DOI: 10.3969/j.issn.1000-3428.2012.05.090

1 Overview

In recent years, the surge in mobile phones and the rapid development of speech recognition technology have placed higher demands on the denoising of digital speech signals. To make speech communication systems and automatic speech processing systems work better in real environments, researchers have long been working on effective speech enhancement algorithms. Speech enhancement algorithms based on single-channel input are now widely used in these systems. Representative algorithms include spectral subtraction, Kalman filtering, Wiener filtering and wavelet denoising.

The periodogram power spectrum estimate has large variance, so spectral subtraction, while removing background noise, also introduces annoying musical noise. Replacing the periodogram with a multitaper (multiple-window) spectral estimate reduces the variance of the power spectrum estimate, which in turn suppresses the generation of musical noise and improves the quality of the spectral estimate. Spectral subtraction is a single-resolution algorithm based on the STFT, and the variance of its power spectrum estimate is large. To address this problem, this paper proposes a multitaper spectral estimation speech enhancement algorithm under wavelet packet decomposition.