期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Speech enhancement based on undecimated wavelet packet-perceptual filterbanks and MMSE–STSA estimation in various noise environments

Hac&#x; Ergun 《Digital Signal Processing》2008,18(5):797-812

In this paper, we proposed a new speech enhancement system, which integrates a perceptual filterbank and minimum mean square error–short time spectral amplitude (MMSE–STSA) estimation, modified according to speech presence uncertainty. The perceptual filterbank was designed by adjusting undecimated wavelet packet decomposition (UWPD) tree, according to critical bands of psycho-acoustic model of human auditory system. The MMSE–STSA estimation (modified according to speech presence uncertainty) was used for estimation of speech in undecimated wavelet packet domain. The perceptual filterbank provides a good auditory representation (sufficient frequency resolution), good perceptual quality of speech and low computational load. The MMSE–STSA estimator is based on a priori SNR estimation. A priori SNR estimation, which is a key parameter in MMSE–STSA estimator, was performed by using “decision directed method.” The “decision directed method” provides a trade off between noise reduction and signal distortion when correctly tuned. The experiments were conducted for various noise types. The results of proposed method were compared with those of other popular methods, Wiener estimation and MMSE–log spectral amplitude (MMSE–LSA) estimation in frequency domain. To test the performance of the proposed speech enhancement system, three objective quality measurement tests (SNR, segSNR and Itakura–Saito distance (ISd)) were conducted for various noise types and SNRs. Experimental results and objective quality measurement test results proved the performance of proposed speech enhancement system. The proposed speech enhancement system provided sufficient noise reduction and good intelligibility and perceptual quality, without causing considerable signal distortion and musical background noise. 相似文献

2.

基于听觉感知模型的多通道语音增强系统 总被引：1，自引：0，他引：1

孟晓辉肖灵崔杰《计算机工程》2010,36(13):9-12

提出以模拟听觉感知模型的非均匀滤波器组为基础的多通道语音增强系统,与基于均匀滤波器组的语音增强系统相比,该方法达到相同频率分辨率所需的通道数较少。采用Itakura-Saito距离对系统进行客观评价,仿真结果表明,该系统增强后的语音比均匀多通道系统增强后的语音具有更好的改善效果。相似文献

3.

Supervector-based approaches in a discriminative framework for speaker verification in noisy environments

Sourjya Sarkar K. Sreenivasa Rao 《International Journal of Speech Technology》2017,20(2):387-416

This paper explores the robustness of supervector-based speaker modeling approaches for speaker verification (SV) in noisy environments. In this paper speaker modeling is carried out in two different frameworks: (i) Gaussian mixture model-support vector machine (GMM-SVM) combined method and (ii) total variability modeling method. In the GMM-SVM combined method, supervectors obtained by concatenating the mean of an adapted speaker GMMs are used to train speaker-specific SVMs during the training/enrollment phase of SV. During the evaluation/testing phase, noisy test utterances transformed into supervectors are subjected to SVM-based pattern matching and classification. In the total variability modeling method, large size supervectors are reduced to a low dimensional channel robust vector (i-vector) prior to SVM training and subsequent evaluation. Special emphasis has been laid on the significance of a utterance partitioning technique for mitigating data-imbalance and utterance duration mismatches. An adaptive boosting algorithm is proposed in the total variability modeling framework for enhancing the accuracy of SVM classifiers. Experiments performed on the NIST-SRE-2003 database with training and test utterances corrupted with additive noises indicate that the aforementioned modeling methods outperform the standard GMM-universal background model (GMM-UBM) framework for SV. It is observed that the use of utterance partitioning and adaptive boosting in the speaker modeling frameworks result in substantial performance improvements under degraded conditions. 相似文献

4.

Performance of speaker identification using CSM and TM

R.?Visalakshi Email author P.?Dhanalakshmi 《International Journal of Speech Technology》2016,19(3):457-465

The main objective of this paper is to develop the system of speaker identification. Speaker identification is a technology that allows a computer to automatically identify the person who is speaking, based on the information received from speech signal. One of the most difficult problems in speaker recognition is dealing with noises. The performance of speaker recognition using close speaking microphone (CSM) is affected in background noises. To overcome this problem throat microphone (TM) which has a transducer held at the throat resulting in a clean signal and unaffected by background noises is used. Acoustic features namely linear prediction coefficients, linear prediction cepstral coefficients, Mel frequency cepstral coefficients and relative spectral transform-perceptual linear prediction are extracted. These features are classified using RBFNN and AANN and their performance is analyzed. A new method was proposed for identification of speakers in clean and noisy using combined CSM and TM. The identification performance of the combined system is increased than individual system due to complementary nature of CSM and TM. 相似文献

5.

A perceptually motivated stationary wavelet packet filterbank using improved spectral over-subtraction for enhancement of speech in various noise environments

Navneet Upadhyay Abhijit Karmakar 《International Journal of Speech Technology》2014,17(2):117-132

In this paper, we propose a speech enhancement method where the front-end decomposition of the input speech is performed by temporally processing using a filterbank. The proposed method incorporates a perceptually motivated stationary wavelet packet filterbank (PM-SWPFB) and an improved spectral over-subtraction (I-SOS) algorithm for the enhancement of speech in various noise environments. The stationary wavelet packet transform (SWPT) is a shift invariant transform. The PM-SWPFB is obtained by selecting the stationary wavelet packet tree in such a manner that it matches closely the non-linear resolution of the critical band structure of the psychoacoustic model. After the decomposition of the input speech, the I-SOS algorithm is applied in each subband, separately for the estimation of speech. The I-SOS uses a continuous noise estimation approach and estimate noise power from each subband without the need of explicit speech silence detection. The subband noise power is estimated and updated by adaptively smoothing the noisy signal power. The smoothing parameter in each subband is controlled by a function of the estimated signal-to-noise ratio (SNR). The performance of the proposed speech enhancement method is tested on speech signals degraded by various real-world noises. Using objective speech quality measures (SNR, segmental SNR (SegSNR), perceptual evaluation of speech quality (PESQ) score), and spectrograms with informal listening tests, we show that the proposed speech enhancement method outperforms than the spectral subtractive-type algorithms and improves quality and intelligibility of the enhanced speech. 相似文献

6.

采用子带谱减法的语音增强

蔡宇郝程鹏侯朝焕《计算机应用》2014,34(2):567-571

为了抑制语音信号中的环境噪声,提出了一种基于子带谱减法进行噪声抑制的语音增强方法。首先通过滤波器组将时域信号分成若干个频（子）带,然后在每个子带中,独立使用改进的谱减法技术进行语音增强。由于实际环境中的背景噪声绝大多数都不是随频率均匀分布的,因此这种在不同频带内进行噪声估计和频谱相减的方法更具有针对性,且更加准确。在实际语音处理实验中证明,所提方法在达到噪声抑制效果的同时较好地保留了语音的结构,使增强后的语音具有更高的听觉舒适度和可理解度。相似文献

7.

Speaker recognition using pyramid match kernel based support vector machines

A. D. Dileep C. Chandra Sekhar 《International Journal of Speech Technology》2012,15(3):365-379

Gaussian mixture model (GMM) based approaches have been commonly used for speaker recognition tasks. Methods for estimation of parameters of GMMs include the expectation-maximization method which is a non-discriminative learning based method. Discriminative classifier based approaches to speaker recognition include support vector machine (SVM) based classifiers using dynamic kernels such as generalized linear discriminant sequence kernel, probabilistic sequence kernel, GMM supervector kernel, GMM-UBM mean interval kernel (GUMI) and intermediate matching kernel. Recently, the pyramid match kernel (PMK) using grids in the feature space as histogram bins and vocabulary-guided PMK (VGPMK) using clusters in the feature space as histogram bins have been proposed for recognition of objects in an image represented as a set of local feature vectors. In PMK, a set of feature vectors is mapped onto a multi-resolution histogram pyramid. The kernel is computed between a pair of examples by comparing the pyramids using a weighted histogram intersection function at each level of pyramid. We propose to use the PMK-based SVM classifier for speaker identification and verification from the speech signal of an utterance represented as a set of local feature vectors. The main issue in building the PMK-based SVM classifier is construction of a pyramid of histograms. We first propose to form hard clusters, using k-means clustering method, with increasing number of clusters at different levels of pyramid to design the codebook-based PMK (CBPMK). Then we propose the GMM-based PMK (GMMPMK) that uses soft clustering. We compare the performance of the GMM-based approaches, and the PMK and other dynamic kernel SVM-based approaches to speaker identification and verification. The 2002 and 2003 NIST speaker recognition corpora are used in evaluation of different approaches to speaker identification and verification. Results of our studies show that the dynamic kernel SVM-based approaches give a significantly better performance than the state-of-the-art GMM-based approaches. For speaker recognition task, the GMMPMK-based SVM gives a performance that is better than that of SVMs using many other dynamic kernels and comparable to that of SVMs using state-of-the-art dynamic kernel, GUMI kernel. The storage requirements of the GMMPMK-based SVMs are less than that of SVMs using any other dynamic kernel. 相似文献

8.

Speech enhancement based on stationary bionic wavelet transform and maximum a posterior estimator of magnitude-squared spectrum

Talbi Mourad 《International Journal of Speech Technology》2017,20(1):75-88

Numerous efforts have focused on the problem of reducing the impact of noise on the performance of various speech systems such as speech coding, speech recognition and speaker recognition. These approaches consider alternative speech features, improved speech modeling, or alternative training for acoustic speech models. In this paper, we propose a new speech enhancement technique, which integrates a new proposed wavelet transform which we call stationary bionic wavelet transform (SBWT) and the maximum a posterior estimator of magnitude-squared spectrum (MSS-MAP). The SBWT is introduced in order to solve the problem of the perfect reconstruction associated with the bionic wavelet transform. The MSS-MAP estimation was used for estimation of speech in the SBWT domain. The experiments were conducted for various noise types and different speech signals. The results of the proposed technique were compared with those of other popular methods such as Wiener filtering and MSS-MAP estimation in frequency domain. To test the performance of the proposed speech enhancement system, four objective quality measurement tests [signal to noise ratio (SNR), segmental SNR, Itakura–Saito distance and perceptual evaluation of speech quality] were conducted for various noise types and SNRs. Experimental results and objective quality measurement test results proved the performance of the proposed speech enhancement technique. It provided sufficient noise reduction and good intelligibility and perceptual quality, without causing considerable signal distortion and musical background noise. 相似文献

9.

Support vector machines for speaker and language recognition

《Computer Speech and Language》2006,20(2-3):210-229

Support vector machines (SVMs) have proven to be a powerful technique for pattern classification. SVMs map inputs into a high-dimensional space and then separate classes with a hyperplane. A critical aspect of using SVMs successfully is the design of the inner product, the kernel, induced by the high dimensional mapping. We consider the application of SVMs to speaker and language recognition. A key part of our approach is the use of a kernel that compares sequences of feature vectors and produces a measure of similarity. Our sequence kernel is based upon generalized linear discriminants. We show that this strategy has several important properties. First, the kernel uses an explicit expansion into SVM feature space—this property makes it possible to collapse all support vectors into a single model vector and have low computational complexity. Second, the SVM builds upon a simpler mean-squared error classifier to produce a more accurate system. Finally, the system is competitive and complimentary to other approaches, such as Gaussian mixture models (GMMs). We give results for the 2003 NIST speaker and language evaluations of the system and also show fusion with the traditional GMM approach. 相似文献

10.

Combining Auditory Preprocessing and Bayesian Estimation for Robust Formant Tracking

《IEEE transactions on audio, speech, and language processing》2010,18(2):224-236

We present a framework for estimating formant trajectories. Its focus is to achieve high robustness in noisy environments. Our approach combines a preprocessing based on functional principles of the human auditory system and a probabilistic tracking scheme. For enhancing the formant structure in spectrograms we use a Gammatone filterbank, a spectral preemphasis, as well as a spectral filtering using Difference-of-Gaussians (DoG) operators. Finally, a contrast enhancement mimicking a competition between filter responses is applied. The probabilistic tracking scheme adopts the mixture modeling technique for estimating the joint distribution of formants. In conjunction with an algorithm for adaptive frequency range segmentation as well as Bayesian smoothing an efficient framework for estimating formant trajectories is derived. Comprehensive evaluations of our method on the VTR–Formant database emphasize its high precision and robustness. We obtained superior performance compared to existing approaches for clean as well as echoic noisy speech. Finally, an implementation of the framework within the scope of an online system using instantaneous feature-based resynthesis demonstrates its applicability to real-world scenarios. 相似文献

11.

一种模拟听觉感知模型的非均匀滤波器组

下载免费PDF全文

孟晓辉肖灵崔杰《计算机工程与应用》2010,46(29):140-143

结合均匀滤波器组的加权叠接相加法和短时全通变换,设计了一种可以模拟听觉感知模型的非均匀滤波器组。该滤波器组可用于实时信号处理。仿真结果显示,设计的非均匀滤波器组可以模拟Bark频率尺度,且具有比较理想的信号重建功能。相似文献

12.

一种适用于说话人识别的改进Mel滤波器 总被引：1，自引：0，他引：1

项要杰杨俊安李晋徽陆俊《计算机工程》2013,(11):214-217,222

Mel倒谱系数（MFcc）侧重提取语音信号的低频信息,对语音信号的频谱分布特性描述不充分,不能有效区分说话人个性信息。为此,通过分析语音信号各频段所含说话人个性信息的不同,结合Mel滤波器和反Mel滤波器在高低频段的不同特性,提出一种适于说话人识别的改进Mel滤波器。实验结果表明,改进Mel滤波器提取的新特征能够获得比传统Mel倒谱系数以及反Mel倒谱系数（IMFCC）更好的识别效果,并且基本不增加说话人识别系统训练和识别的时间开销。相似文献

13.

基于支撑向量机的说话人确认系统 总被引：2，自引：1，他引：1

何昕刘重庆李介谷《计算机工程与应用》2000,36(12):70-71,91

支撑向量机(SVM)是一种新的统计学习方法,和以往的学习方法不同的是SVM的学习原则是使结构风险(Structural Risk)最小,而经典的学习方法遵循经验风险(Empirical Risk)最小原则,这使得SVM具有较好的总体性能.文章提出一种基于支撑向量机的文本无关的说话人确认系统,实验表明同基于向量量化(VQ)和高斯混合模式(GMM)的经典方法相比,基于SVM的方法具有更高的区分力和更好的总体性能. 相似文献

14.

A study on the effects of using short utterance length development data in the design of GPLDA speaker verification systems

Ahilan Kanagasundaram David Dean Sridha Sridharan Houman Ghaemmaghami Clinton Fookes 《International Journal of Speech Technology》2017,20(2):247-259

This paper studies the performance degradation of Gaussian probabilistic linear discriminant analysis (GPLDA) speaker verification system, when only short-utterance data is used for speaker verification system development. Subsequently, a number of techniques, including utterance partitioning and source-normalised weighted linear discriminant analysis (SN-WLDA) projections are introduced to improve the speaker verification performance in such conditions. Experimental studies have found that when short utterance data is available for speaker verification development, GPLDA system overall achieves best performance with a lower number of universal background model (UBM) components. As a lower number of UBM components significantly reduces the computational complexity of speaker verification system, that is a useful observation. In limited session data conditions, we propose a simple utterance-partitioning technique, which when applied to the LDA-projected GPLDA system shows over 8% relative improvement on EER values over baseline system on NIST 2008 truncated 10–10 s conditions. We conjecture that this improvement arises from the apparent increase in the number of sessions arising from our partitioning technique and this helps to better model the GPLDA parameters. Further, partitioning SN-WLDA-projected GPLDA shows over 16% and 6% relative improvement on EER values over LDA-projected GPLDA systems respectively on NIST 2008 truncated 10–10 s interview-interview, and NIST 2010 truncated 10–10 s interview-interview and telephone-telephone conditions. 相似文献

15.

Auditory inspired methods for localization of multiple concurrent speakers

Tania Habib Harald Romsdorfer 《Computer Speech and Language》2013,27(3):634-659

The use of microphone arrays offers enhancements of speech signals recorded in meeting rooms and office spaces. A common solution for speech enhancement in realistic environments with ambient noise and multi-path propagation is the application of so-called beamforming techniques. Such beamforming algorithms enhance signals at the desired angle using constructive interference while attenuating signals coming from other directions by destructive interference. However, these techniques require as a priori the time difference of arrival information of the source. Therefore, the source localization and tracking algorithms are an integral part of such a system. The conventional localization algorithms deteriorate in realistic scenarios with multiple concurrent speakers. In contrast to conventional methods, the techniques presented in this paper make use of pitch information of speech signals in addition to the location information. This “position–pitch”-based algorithm pre-processes the speech signals by a multiband gammatone filterbank that is inspired from the auditory model of the human inner ear. The role of this gammatone filterbank is analyzed and discussed in details. For a robust localization of multiple concurrent speakers, a frequency-selective criterion is explored that is based on a study of the human neural system's use of correlations between adjacent sub-band frequencies. This frequency-selective criterion leads to improved localization performance. To further improve localization accuracy, an algorithm based on grouping of spectro-temporal regions formed by pitch cues is presented. All proposed speaker localization algorithms are tested using a multichannel database where multiple concurrent speakers are active. The real-world recordings were made with a 24-channel uniform circular microphone array using loudspeakers and human speakers under various acoustic environments including moving concurrent speaker scenarios. The proposed techniques produced a localization performance that was significantly better than the state-of-the-art baseline in the scenarios tested. 相似文献

16.

Modulation Spectral Features for Robust Far-Field Speaker Identification

Falk T.H. Wai-Yip Chan 《IEEE transactions on audio, speech, and language processing》2010,18(1):90-100

In this paper, auditory inspired modulation spectral features are used to improve automatic speaker identification (ASI) performance in the presence of room reverberation. The modulation spectral signal representation is obtained by first filtering the speech signal with a 23-channel gammatone filterbank. An eight-channel modulation filterbank is then applied to the temporal envelope of each gammatone filter output. Features are extracted from modulation frequency bands ranging from 3-15 H z and are shown to be robust to mismatch between training and testing conditions and to increasing reverberation levels. To demonstrate the gains obtained with the proposed features, experiments are performed with clean speech, artificially generated reverberant speech, and reverberant speech recorded in a meeting room. Simulation results show that a Gaussian mixture model based ASI system, trained on the proposed features, consistently outperforms a baseline system trained on mel-frequency cepstral coefficients. For multimicrophone ASI applications, three multichannel score combination and adaptive channel selection techniques are investigated and shown to further improve ASI performance. 相似文献

17.

Processing degraded speech for text dependent speaker verification

Banriskhem K. Khonglah Ramesh K. Bhukya S. R. Mahadeva Prasanna 《International Journal of Speech Technology》2017,20(4):839-850

This work explores the use of speech enhancement for enhancing degraded speech which may be useful for text dependent speaker verification system. The degradation may be due to noise or background speech. The text dependent speaker verification is based on the dynamic time warping (DTW) method. Hence there is a necessity of the end point detection. The end point detection can be performed easily if the speech is clean. However the presence of degradation tends to give errors in the estimation of the end points and this error propagates into the overall accuracy of the speaker verification system. Temporal and spectral enhancement is performed on the degraded speech so that ideally the nature of the enhanced speech will be similar to the clean speech. Results show that the temporal and spectral processing methods do contribute to the task by eliminating the degradation and improved accuracy is obtained for the text dependent speaker verification system using DTW. 相似文献

18.

A Novel Approach to Improve the Speech Intelligibility Using Fractional Delta-amplitude Modulation Spectrogram

Arul Valiyavalappil Haridas Ramalatha Marimuthu Basabi Chakraborty 《控制论与系统》2013,44(7-8):421-451

Abstract

Speech enhancement is an interesting research area that aims at improving the quality and intelligibility of the speech that is affected by the additive noises, such as airport noise, train noise, restaurant noise, and so on. The presence of these background noises degrades the comfort of listening of the end user. This article proposes a speech enhancement method that uses a novel feature extraction which removes the noise spectrum from the noisy speech signal using a novel fractional delta-AMS (amplitude modulation spectrogram) feature extraction and the D-matrix feature extraction method. The fractional delta-AMS feature extraction strategy is the modification of the delta-AMS with the fractional calculus that increases the sharpness of the feature extraction. The extracted features from the frames are used to determine the optimal mask of all the frames of the noisy speech signal and the mask is employed for training the deep belief neural networks (DBN). The two metrics root mean square error (RMSE) and perceptual evaluation of speech quality (PESQ) are used to evaluate the method. The proposed method yields a better value of PESQ at all level of noise and RMSE decreases with increased noise level. 相似文献

19.

基于LPC谱和支持向量机的船舶辐射噪声识别

康春玉章新华《计算机工程与应用》2007,43(12):215-217

船舶辐射噪声是非常复杂的,寻找新的特征是目前水下目标识别中的一项非常迫切而艰巨的任务。基于线性预测编码(LPC)原理提出了一种加权交叠平均的LPC谱估计算法,同时给出了支持向量机解决多类分类问题的一对多方法。利用得到的LPC谱特征矢量用支持向量机分类器和BP神经网络分类器对海上实测的三类目标噪声数据进行了分类识别,并与一般的LPC谱特征进行了对比。结果表明,加权交叠平均的LPC谱特征对三类目标的总体正确识别概率在95.02%以上,并且比一般的LPC谱特征具有更好的分类性能,支持向量机的分类性能也优于BP神经网络的分类性能。相似文献

20.

Robust formant tracking for continuous speech with speaker variability

Mustafa K. Bruce I.C. 《IEEE transactions on audio, speech, and language processing》2006,14(2):435-444

Several algorithms have been developed for tracking formant frequency trajectories of speech signals, however most of these algorithms are either not robust in real-life noise environments or are not suitable for real-time implementation. The algorithm presented in this paper obtains formant frequency estimates from voiced segments of continuous speech by using a time-varying adaptive filterbank to track individual formant frequencies. The formant tracker incorporates an adaptive voicing detector and a gender detector for formant extraction from continuous speech, for both male and female speakers. The algorithm has a low signal delay and provides smooth and accurate estimates for the first four formant frequencies at moderate and high signal-to-noise ratios. Thorough testing of the algorithm has shown that it is robust over a wide range of signal-to-noise ratios for various types of background noises. 相似文献