期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Static and Dynamic Variance Compensation for Recognition of Reverberant Speech With Dereverberation Preprocessing

Delcroix M. Nakatani T. Watanabe S. 《IEEE transactions on audio, speech, and language processing》2009,17(2):324-334

The performance of automatic speech recognition is severely degraded in the presence of noise or reverberation. Much research has been undertaken on noise robustness. In contrast, the problem of the recognition of reverberant speech has received far less attention and remains very challenging. In this paper, we use a dereverberation method to reduce reverberation prior to recognition. Such a preprocessor may remove most reverberation effects. However, it often introduces distortion, causing a dynamic mismatch between speech features and the acoustic model used for recognition. Model adaptation could be used to reduce this mismatch. However, conventional model adaptation techniques assume a static mismatch and may therefore not cope well with a dynamic mismatch arising from dereverberation. This paper proposes a novel adaptation scheme that is capable of managing both static and dynamic mismatches. We introduce a parametric model for variance adaptation that includes static and dynamic components in order to realize an appropriate interconnection between dereverberation and a speech recognizer. The model parameters are optimized using adaptive training implemented with the expectation maximization algorithm. An experiment using the proposed method with reverberant speech for a reverberation time of 0.5 s revealed that it was possible to achieve an 80% reduction in the relative error rate compared with the recognition of dereverberated speech (word error rate of 31%), and the final error rate was 5.4%, which was obtained by combining the proposed variance compensation and MLLR adaptation. 相似文献

2.

Harmonicity-Based Blind Dereverberation for Single-Channel Speech Signals

Tomohiro Nakatani Keisuke Kinoshita Masato Miyoshi 《IEEE transactions on audio, speech, and language processing》2007,15(1):80-95

The distant acquisition of acoustic signals in an enclosed space often produces reverberant artifacts due to the room impulse response. Speech dereverberation is desirable in situations where the distant acquisition of acoustic signals is involved. These situations include hands-free speech recognition, teleconferencing, and meeting recording, to name a few. This paper proposes a processing method, named Harmonicity-based dEReverBeration (HERB), to reduce the amount of reverberation in the signal picked up by a single microphone. The method makes extensive use of harmonicity, a unique characteristic of speech, in the design of a dereverberation filter. In particular, harmonicity enhancement is proposed and demonstrated as an effective way of estimating a filter that approximates an inverse filter corresponding to the room impulse response. Two specific harmonicity enhancement techniques are presented and compared; one based on an average transfer function and the other on the minimization of a mean squared error function. Prototype HERB systems are implemented by introducing several techniques to improve the accuracy of dereverberation filter estimation, including time warping analysis. Experimental results show that the proposed methods can achieve high-quality speech dereverberation, when the reverberation time is between 0.1 and 1.0 s, in terms of reverberation energy decay curves and automatic speech recognition accuracy 相似文献

3.

Combination of bottleneck feature extraction and dereverberation for distant-talking speech recognition

Bo Ren Longbiao Wang Liang Lu Yuma Ueda Atsuhiko Kai 《Multimedia Tools and Applications》2016,75(9):5093-5108

The performance of speech recognition in distant-talking environments is severely degraded by the reverberation that can occur in enclosed spaces (e.g., meeting rooms). To mitigate this degradation, dereverberation techniques such as network structure-based denoising autoencoders and multi-step linear prediction are used to improve the recognition accuracy of reverberant speech. Regardless of the reverberant conditions, a novel discriminative bottleneck feature extraction approach has been demonstrated to be effective for speech recognition under a range of conditions. As bottleneck feature extraction is not primarily designed for dereverberation, we are interested in whether it can compensate for other carefully designed dereverberation approaches. In this paper, we propose three schemes covering both front-end processing (cascaded combination and parallel combination) and back-end processing (system combination). Each of these schemes integrates bottleneck feature extraction with dereverberation. The effectiveness of these schemes is evaluated via a series of experiments using the REVERB challenge dataset. 相似文献

4.

Speech Dereverberation Based on Maximum-Likelihood Estimation With Time-Varying Gaussian Source Model

《IEEE transactions on audio, speech, and language processing》2008,16(8):1512-1527

Distant acquisition of acoustic signals in an enclosed space often produces reverberant components due to acoustic reflections in the room. Speech dereverberation is in general desirable when the signal is acquired through distant microphones in such applications as hands-free speech recognition, teleconferencing, and meeting recording. This paper proposes a new speech dereverberation approach based on a statistical speech model. A time-varying Gaussian source model (TVGSM) is introduced as a model that represents the dynamic short time characteristics of nonreverberant speech segments, including the time and frequency structures of the speech spectrum. With this model, dereverberation of the speech signal is formulated as a maximum-likelihood (ML) problem based on multichannel linear prediction, in which the speech signal is recovered by transforming the observed signal into one that is probabilistically more like nonreverberant speech. We first present a general ML solution based on TVGSM, and derive several dereverberation algorithms based on various source models. Specifically, we present a source model consisting of a finite number of states, each of which is manifested by a short time speech spectrum, defined by a corresponding autocorrelation (AC) vector. The dereverberation algorithm based on this model involves a finite collection of spectral patterns that form a codebook. We confirm experimentally that both the time and frequency characteristics represented in the source models are very important for speech dereverberation, and that the prior knowledge represented by the codebook allows us to further improve the dereverberated speech quality. We also confirm that the quality of reverberant speech signals can be greatly improved in terms of the spectral shape and energy time-pattern distortions from simply a short speech signal using a speaker-independent codebook. 相似文献

5.

Automatic speech recognition performance in different room acoustic environments with and without dereverberation preprocessing

Alexandros Tsilfidis Iosif Mporas John Mourjopoulos Nikos Fakotakis 《Computer Speech and Language》2013,27(1):380-395

The performance of recent dereverberation methods for reverberant speech preprocessing prior to Automatic Speech Recognition (ASR) is compared for an extensive range of room and source-receiver configurations. It is shown that room acoustic parameters such as the clarity (C50) and the definition (D50) correlate well with the ASR results. When available, such room acoustic parameters can provide insight into reverberant speech ASR performance and potential improvement via dereverberation preprocessing. It is also shown that the application of a recent dereverberation method based on perceptual modelling can be used in the above context and achieve significant Phone Recognition (PR) improvement, especially under highly reverberant conditions. 相似文献

6.

Integrated Speech Enhancement Method Using Noise Suppression and Dereverberation

Yoshioka T. Nakatani T. Miyoshi M. 《IEEE transactions on audio, speech, and language processing》2009,17(2):231-246

This paper proposes a method for enhancing speech signals contaminated by room reverberation and additive stationary noise. The following conditions are assumed. 1) Short-time spectral components of speech and noise are statistically independent Gaussian random variables. 2) A room's convolutive system is modeled as an autoregressive system in each frequency band. 3) A short-time power spectral density of speech is modeled as an all-pole spectrum, while that of noise is assumed to be time-invariant and known in advance. Under these conditions, the proposed method estimates the parameters of the convolutive system and those of the all-pole speech model based on the maximum likelihood estimation method. The estimated parameters are then used to calculate the minimum mean square error estimates of the speech spectral components. The proposed method has two significant features. 1) The parameter estimation part performs noise suppression and dereverberation alternately. (2) Noise-free reverberant speech spectrum estimates, which are transferred by the noise suppression process to the dereverberation process, are represented in the form of a probability distribution. This paper reports the experimental results of 1500 trials conducted using 500 different utterances. The reverberation time RT₆₀ was 0.6 s, and the reverberant signal to noise ratio was 20, 15, or 10 dB. The experimental results show the superiority of the proposed method over the sequential performance of the noise suppression and dereverberation processes. 相似文献

7.

Suppression of Late Reverberation Effect on Speech Signal Using Long-Term Multiple-step Linear Prediction

《IEEE transactions on audio, speech, and language processing》2009,17(4):534-545

A speech signal captured by a distant microphone is generally smeared by reverberation, which severely degrades automatic speech recognition (ASR) performance. One way to solve this problem is to dereverberate the observed signal prior to ASR. In this paper, a room impulse response is assumed to consist of three parts: a direct-path response, early reflections and late reverberations. Since late reverberations are known to be a major cause of ASR performance degradation, this paper focuses on dealing with the effect of late reverberations. The proposed method first estimates the late reverberations using long-term multi-step linear prediction, and then reduces the late reverberation effect by employing spectral subtraction. The algorithm provided good dereverberation with training data corresponding to the duration of one speech utterance, in our case, less than 6 s. This paper describes the proposed framework for both single-channel and multichannel scenarios. Experimental results showed substantial improvements in ASR performance with real recordings under severe reverberant conditions. 相似文献

8.

Indeterminacy Free Frequency-Domain Blind Separation of Reverberant Audio Sources

Di Persia L. Milone D. Yanagida M. 《IEEE transactions on audio, speech, and language processing》2009,17(2):299-311

Blind separation of convolutive mixtures is a very complicated task that has applications in many fields of speech and audio processing, such as hearing aids and man-machine interfaces. One of the proposed solutions is the frequency-domain independent component analysis. The main disadvantage of this method is the presence of permutation ambiguities among consecutive frequency bins. Moreover, this problem is worst when reverberation time increases. Presented in this paper is a new frequency-domain method, that uses a simplified mixing model, where the impulse responses from one source to each microphone are expressed as scaled and delayed versions of one of these impulse responses. This assumption, based on the similitude among waveforms of the impulse responses, is valid for a small spacing of the microphones. Under this model, separation is performed without any permutation or amplitude ambiguity among consecutive frequency bins. This new method is aimed mainly to obtain separation, with a small reduction of reverberation. Nevertheless, as the reverberation is included in the model, the new method is capable of performing separation for a wide range of reverberant conditions, with very high speed. The separation quality is evaluated using a perceptually designed objective measure. Also, an automatic speech recognition system is used to test the advantages of the algorithm in a real application. Very good results are obtained for both, artificial and real mixtures. The results are significantly better than those by other standard blind source separation algorithms. 相似文献

9.

Two-microphone separation of speech mixtures. 总被引：1，自引：0，他引：1

Michael Syskind Pedersen DeLiang Wang Jan Larsen Ulrik Kjems 《Neural Networks, IEEE Transactions on》2008,19(3):475-492

Separation of speech mixtures, often referred to as the cocktail party problem, has been studied for decades. In many source separation tasks, the separation method is limited by the assumption of at least as many sensors as sources. Further, many methods require that the number of signals within the recorded mixtures be known in advance. In many real-world applications, these limitations are too restrictive. We propose a novel method for underdetermined blind source separation using an instantaneous mixing model which assumes closely spaced microphones. Two source separation techniques have been combined, independent component analysis (ICA) and binary time - frequency (T-F) masking. By estimating binary masks from the outputs of an ICA algorithm, it is possible in an iterative way to extract basis speech signals from a convolutive mixture. The basis signals are afterwards improved by grouping similar signals. Using two microphones, we can separate, in principle, an arbitrary number of mixed speech signals. We show separation results for mixtures with as many as seven speech signals under instantaneous conditions. We also show that the proposed method is applicable to segregate speech signals under reverberant conditions, and we compare our proposed method to another state-of-the-art algorithm. The number of source signals is not assumed to be known in advance and it is possible to maintain the extracted signals as stereo signals. 相似文献

10.

Dereverberation and Denoising Using Multichannel Linear Prediction

Delcroix M. Hikichi T. Miyoshi M. 《IEEE transactions on audio, speech, and language processing》2007,15(6):1791-1801

Reverberation in a room severely degrades the characteristics and auditory quality of speech captured by distant microphones, thus posing a severe problem for many speech applications. Several dereverberation techniques have been proposed with a view to solving this problem. There are, however, few reports of dereverberation methods working under noisy conditions. In this paper, we propose an extension of a dereverberation algorithm based on multichannel linear prediction that achieves both the dereverberation and noise reduction of speech in an acoustic environment with a colored noise source. The method consists of two steps. First, the speech residual is estimated from the observed signals by employing multichannel linear prediction. When we use a microphone array, and assume, roughly speaking, that one of the microphones is closer to the speaker than the noise source, the speech residual is unaffected by the room reverberation or the noise. However, the residual is degraded because linear prediction removes an average of the speech characteristics. In a second step, the average of the speech characteristics is estimated and used to recover the speech. Simulations were conducted for a reverberation time of 0.5 s and an input signal-to-noise ratio of 0 dB. With the proposed method, the reverberation was suppressed by more than 20 dB and the noise level reduced to -18 dB. 相似文献

11.

A two-stage algorithm for one-microphone reverberant speech enhancement

Mingyang Wu DeLiang Wang 《IEEE transactions on audio, speech, and language processing》2006,14(3):774-784

Under noise-free conditions, the quality of reverberant speech is dependent on two distinct perceptual components: coloration and long-term reverberation. They correspond to two physical variables: signal-to-reverberant energy ratio (SRR) and reverberation time, respectively. Inspired by this observation, we propose a two-stage reverberant speech enhancement algorithm using one microphone. In the first stage, an inverse filter is estimated to reduce coloration effects or increase SRR. The second stage employs spectral subtraction to minimize the influence of long-term reverberation. The proposed algorithm significantly improves the quality of reverberant speech. A comparison with a recent enhancement algorithm is made on a corpus of speech utterances in a number of reverberant conditions, and the results show that our algorithm performs substantially better. 相似文献

12.

A stereophonic acoustic signal extraction scheme for noisy and reverberant environments

Klaus Reindl Yuanhang Zheng Andreas Schwarz Stefan Meier Roland Maas Armin Sehr Walter Kellermann 《Computer Speech and Language》2013,27(3):726-745

In this contribution, a novel two-channel acoustic front-end for robust automatic speech recognition in adverse acoustic environments with nonstationary interference and reverberation is proposed. From a MISO system perspective, a statistically optimum source signal extraction scheme based on the multichannel Wiener filter (MWF) is discussed for application in noisy and underdetermined scenarios. For free-field and diffuse noise conditions, this optimum scheme reduces to a Delay & Sum beamformer followed by a single-channel Wiener postfilter. Scenarios with multiple simultaneously interfering sources and background noise are usually modeled by a diffuse noise field. However, in reality, the free-field assumption is very weak because of the reverberant nature of acoustic environments. Therefore, we propose to estimate this simplified MWF solution in each frequency bin separately to cope with reverberation. We show that this approach can very efficiently be realized by the combination of a blocking matrix based on semi-blind source separation (‘directional BSS’), which provides a continuously updated reference of all undesired noise and interference components separated from the desired source and its reflections, and a single-channel Wiener postfilter. Moreover, it is shown, how the obtained reference signal of all undesired components can efficiently be used to realize the Wiener postfilter, and at the same time, generalizes well-known postfilter realizations. The proposed front-end and its integration into an automatic speech recognition (ASR) system are analyzed and evaluated in noisy living-room-like environments according to the PASCAL CHiME challenge. A comparison to a simplified front-end based on a free-field assumption shows that the introduced system substantially improves the speech quality and the recognition performance under the considered adverse conditions. 相似文献

13.

Joint Dereverberation and Residual Echo Suppression of Speech Signals in Noisy Environments

《IEEE transactions on audio, speech, and language processing》2008,16(8):1433-1451

Hands-free devices are often used in a noisy and reverberant environment. Therefore, the received microphone signal does not only contain the desired near-end speech signal but also interferences such as room reverberation that is caused by the near-end source, background noise and a far-end echo signal that results from the acoustic coupling between the loudspeaker and the microphone. These interferences degrade the fidelity and intelligibility of near-end speech. In the last two decades, postfilters have been developed that can be used in conjunction with a single microphone acoustic echo canceller to enhance the near-end speech. In previous works, spectral enhancement techniques have been used to suppress residual echo and background noise for single microphone acoustic echo cancellers. However, dereverberation of the near-end speech was not addressed in this context. Recently, practically feasible spectral enhancement techniques to suppress reverberation have emerged. In this paper, we derive a novel spectral variance estimator for the late reverberation of the near-end speech. Residual echo will be present at the output of the acoustic echo canceller when the acoustic echo path cannot be completely modeled by the adaptive filter. A spectral variance estimator for the so-called late residual echo that results from the deficient length of the adaptive filter is derived. Both estimators are based on a statistical reverberation model. The model parameters depend on the reverberation time of the room, which can be obtained using the estimated acoustic echo path. A novel postfilter is developed which suppresses late reverberation of the near-end speech, residual echo and background noise, and maintains a constant residual background noise level. Experimental results demonstrate the beneficial use of the developed system for reducing reverberation, residual echo, and background noise. 相似文献

14.

Robust Speech Dereverberation Using Multichannel Blind Deconvolution With Spectral Subtraction

Furuya K. Kataoka A. 《IEEE transactions on audio, speech, and language processing》2007,15(5):1579-1591

A robust dereverberation method is presented for speech enhancement in a situation requiring adaptation where a speaker shifts his/her head under reverberant conditions causing the impulse responses to change frequently. We combine correlation-based blind deconvolution with modified spectral subtraction to improve the quality of inverse-filtered speech degraded by the estimation error of inverse filters obtained in practice. Our method computes inverse filters by using the correlation matrix between input signals that can be observed without measuring room impulse responses. Inverse filtering reduces early reflection, which has most of the power of the reverberation, and then, spectral subtraction suppresses the tail of the inverse-filtered reverberation. The performance of our method in adaptation is demonstrated by experiments using measured room impulse responses. The subjective results indicated that this method provides superior speech quality to each of the individual methods: blind deconvolution and spectral subtraction. 相似文献

15.

Online blind speech separation using multiple acoustic speaker tracking and time–frequency masking

P. Pertilä 《Computer Speech and Language》2013,27(3):683-702

Separating speech signals of multiple simultaneous talkers in a reverberant enclosure is known as the cocktail party problem. In real-time applications online solutions capable of separating the signals as they are observed are required in contrast to separating the signals offline after observation. Often a talker may move, which should also be considered by the separation system. This work proposes an online method for speaker detection, speaker direction tracking, and speech separation. The separation is based on multiple acoustic source tracking (MAST) using Bayesian filtering and time–frequency masking. Measurements from three room environments with varying amounts of reverberation using two different designs of microphone arrays are used to evaluate the capability of the method to separate up to four simultaneously active speakers. Separation of moving talkers is also considered. Results are compared to two reference methods: ideal binary masking (IBM) and oracle tracking (O-T). Simulations are used to evaluate the effect of number of microphones and their spacing. 相似文献

16.

Speedup convergence and reduce noise for enhanced speech separation and recognition

Yunxin Zhao Rong Hu Xiaolong Li 《IEEE transactions on audio, speech, and language processing》2006,14(4):1235-1244

Novel techniques are proposed to enhance time-domain adaptive decorrelation filtering (ADF) for separation and recognition of cochannel speech in reverberant room conditions. The enhancement techniques include whitening filtering on cochannel speech to improve condition of adaptive estimation, block-iterative formulation of ADF to speed up convergence, and integration of multiple ADF outputs through post filtering to reduce reverberation noise. Experimental data were generated by convolving TIMIT speech with acoustic path impulse responses measured in real room environment, with approximately 2 m microphone-source distance and initial target-to-interference ratio of about 0 dB. The proposed techniques significantly improved ADF convergence rate, target-to-interference ratio, and accuracy of phone recognition. 相似文献

17.

Enhancement of speech signals separated from their convolutive mixture by FDICA algorithm

Rajkishore Prasad Hiroshi Saruwatari Kyohiro Shikano 《Digital Signal Processing》2009,19(1):127-133

This paper presents a novel method for the enhancement of independent components of mixed speech signal segregated by the frequency domain independent component analysis (FDICA) algorithm. The enhancement algorithm proposed here is based on maximum a posteriori (MAP) estimation of the speech spectral components using generalized Gaussian distribution (GGD) function as the statistical model for the time–frequency series of speech (TFSS) signal. The proposed MAP estimator has been used and evaluated as the post-processing stage for the separation of convolutive mixture of speech signals by the fixed-point FDICA algorithm. It has been found that the combination of separation algorithm with the proposed enhancement algorithm provides better separation performance under both the reverberant and non-reverberant conditions. 相似文献

18.

Modulation Spectral Features for Robust Far-Field Speaker Identification

Falk T.H. Wai-Yip Chan 《IEEE transactions on audio, speech, and language processing》2010,18(1):90-100

In this paper, auditory inspired modulation spectral features are used to improve automatic speaker identification (ASI) performance in the presence of room reverberation. The modulation spectral signal representation is obtained by first filtering the speech signal with a 23-channel gammatone filterbank. An eight-channel modulation filterbank is then applied to the temporal envelope of each gammatone filter output. Features are extracted from modulation frequency bands ranging from 3-15 H z and are shown to be robust to mismatch between training and testing conditions and to increasing reverberation levels. To demonstrate the gains obtained with the proposed features, experiments are performed with clean speech, artificially generated reverberant speech, and reverberant speech recorded in a meeting room. Simulation results show that a Gaussian mixture model based ASI system, trained on the proposed features, consistently outperforms a baseline system trained on mel-frequency cepstral coefficients. For multimicrophone ASI applications, three multichannel score combination and adaptive channel selection techniques are investigated and shown to further improve ASI performance. 相似文献

19.

独立分量分析在多光谱遥感图像分类中的应用 总被引：6，自引：0，他引：6

曾生根夏德深《计算机工程与应用》2004,40(21):108-110,145

多光谱遥感图像反映了不同地物的光谱特征,其分类是遥感应用的基础。但是在多光谱遥感波段图像中存在不同地物对应着相同的灰度,即异物同谱的问题。独立分量分析算法对未知的源信号的混合信号进行估计,可以获得相互独立的源信号的近似。独立分量分析算法利用了信号的高阶统计信息,对于多光谱遥感图像而言,算法去除了波段图像之间的相关性,获得的波段图像是相互独立的。但是独立分量分析算法有一个缺点,即计算量太大,影响了在多光谱遥感图像分类上的应用。文章对独立分量分析的一种快速算法FastICA进行改进,减少了计算量,提高了算法的有效性。在性能相当的情况下,改进FastICA算法能有效地减少算法的计算量。由于FastICA算法是线性ICA算法,对于非线性混合的光谱信号的估计存在一定误差,因此应用BP神经网络的非线性特性对其进行自动分类。在同原始遥感图像的BP神经网络分类结果进行比较,结果表明独立分量分析算法能提高多光谱遥感图像的分类的正确率。相似文献

20.

Distributed particle filter based speaker tracking in distributed microphone networks under non-Gaussian noise environments

《Digital Signal Processing》2017

Non-Gaussian noise distorts the speech signals and leads to the degradation of speaker tracking performance. In this paper, a distributed particle filter (DPF) based speaker tracking method in distributed microphone networks under non-Gaussian noise and reverberant environments is proposed. A generalized correntropy function is first presented to estimate the time differences of arrival (TDOA) for speech signals at each node in distributed microphone networks. Next, to address spurious TDOA estimations caused by noise and reverberation, a multiple-hypothesis likelihood model is introduced to calculate the local likelihood functions of the DPF. Finally, a DPF fusing local likelihood functions with an average consensus algorithm is employed to estimate a moving speaker's positions. The proposed method can accurately track the speaker under non-Gaussian noise and reverberant environments, and it is scalable and robust against the nodes failure in distributed networks. Simulation experiments show the validation of the proposed speaker tracking method. 相似文献