期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Analysis of lip geometric features for audio-visual speech recognition

Kaynak M.N. Qi Zhi Cheok A.D. Sengupta K. Zhang Jian Ko Chi Chung 《IEEE transactions on systems, man, and cybernetics. Part A, Systems and humans : a publication of the IEEE Systems, Man, and Cybernetics Society》2004,34(4):564-570

Audio-visual speech recognition employing both acoustic and visual speech information is a novel extension of acoustic speech recognition and it significantly improves the recognition accuracy in noisy environments. Although various audio-visual speech-recognition systems have been developed, a rigorous and detailed comparison of the potential geometric visual features from speakers' faces is essential. Thus, in this paper the geometric visual features are compared and analyzed rigorously for their importance in audio-visual speech recognition. Experimental results show that among the geometric visual features analyzed, lip vertical aperture is the most relevant; and the visual feature vector formed by vertical and horizontal lip apertures and the first-order derivative of the lip corner angle leads to the best recognition results. Speech signals are modeled by hidden Markov models (HMMs) and using the optimized HMMs and geometric visual features the accuracy of acoustic-only, visual-only, and audio-visual speech recognition methods are compared. The audio-visual speech recognition scheme has a much improved recognition accuracy compared to acoustic-only and visual-only speech recognition especially at high noise levels. The experimental results showed that a set of as few as three labial geometric features are sufficient to improve the recognition rate by as much as 20% (from 62%, with acoustic-only information, to 82%, with audio-visual information at a signal-to-noise ratio of 0 dB). 相似文献

2.

Audio-visual speech recognition integrating 3D lip information obtained from the Kinect

Jianrong Wang Ju Zhang Kiyoshi Honda Jianguo Wei Jianwu Dang 《Multimedia Systems》2016,22(3):315-323

Audio-visual speech recognition (AVSR) has shown impressive improvements over audio-only speech recognition in the presence of acoustic noise. However, the problems of region-of-interest detection and feature extraction may influence the recognition performance due to the visual speech information obtained typically from planar video data. In this paper, we deviate from the traditional visual speech information and propose an AVSR system integrating 3D lip information. The Microsoft Kinect multi-sensory device was adopted for data collection. The different feature extraction and selection algorithms were applied to planar images and 3D lip information, so as to fuse the planar images and 3D lip feature into the visual-3D lip joint feature. For automatic speech recognition (ASR), the fusion methods were investigated and the audio-visual speech information was integrated into a state-synchronous two stream Hidden Markov Model. The experimental results demonstrated that our AVSR system integrating 3D lip information improved the recognition performance of traditional ASR and AVSR system in acoustic noise environments. 相似文献

3.

Multiple cameras audio visual speech recognition using active appearance model visual features in car environment

Astik Biswas P. K. Sahu Mahesh Chandra 《International Journal of Speech Technology》2016,19(1):159-171

Consideration of visual speech features along with traditional acoustic features have shown decent performance in uncontrolled auditory environment. However, most of the existing audio-visual speech recognition (AVSR) systems have been developed in the laboratory conditions and rarely addressed the visual domain problems. This paper presents an active appearance model (AAM) based multiple-camera AVSR experiment. The shape and appearance information are extracted from jaw and lip region to enhance the performance in vehicle environments. At first, a series of visual speech recognition (VSR) experiments are carried out to study the impact of each camera on multi-stream VSR. Four cameras in car audio-visual corpus is used to perform the experiments. The individual camera stream is fused to have four-stream synchronous hidden Markov model visual speech recognizer. Finally, optimum four-stream VSR is combined with single stream acoustic HMM to build five-stream AVSR. The dual modality AVSR system shows more robustness compared to acoustic speech recognizer across all driving conditions. 相似文献

4.

多模式汉语连续语音识别中视觉特征的提取和应用 总被引：3，自引：0，他引：3

刘鹏王作英《中文信息学报》2004,18(4):80-85

本文对在汉语多模式汉语语音识别系统中利用视觉特征进行了研究,给出了基于多流隐马尔科夫模型(Multi-stream HMM, MSHMM)的听视觉融合方案,并对有关视觉特征的两项关键技术:嘴唇定位和视觉特征提取进行了详细讨论。首先,我们研究了基于模板匹配的嘴唇跟踪方法;然后研究了基于线性变换的低级视觉特征,并与基于动态形状模型的特征作了比较;实验结果表明,引入视觉信息后无噪环境下语音识别声学层首选错误率相对下降36.09%,在噪声环境下的鲁棒性也有明显提高。相似文献

5.

Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial,spectral and temporal modeling of sounds

Marc Delcroix Keisuke Kinoshita Tomohiro Nakatani Shoko Araki Atsunori Ogawa Takaaki Hori Shinji Watanabe Masakiyo Fujimoto Takuya Yoshioka Takanobu Oba Yotaro Kubo Mehrez Souden Seong-Jun Hahm Atsushi Nakamura 《Computer Speech and Language》2013,27(3):851-873

Research on noise robust speech recognition has mainly focused on dealing with relatively stationary noise that may differ from the noise conditions in most living environments. In this paper, we introduce a recognition system that can recognize speech in the presence of multiple rapidly time-varying noise sources as found in a typical family living room. To deal with such severe noise conditions, our recognition system exploits all available information about speech and noise; that is spatial (directional), spectral and temporal information. This is realized with a model-based speech enhancement pre-processor, which consists of two complementary elements, a multi-channel speech–noise separation method that exploits spatial and spectral information, followed by a single channel enhancement algorithm that uses the long-term temporal characteristics of speech obtained from clean speech examples. Moreover, to compensate for any mismatch that may remain between the enhanced speech and the acoustic model, our system employs an adaptation technique that combines conventional maximum likelihood linear regression with the dynamic adaptive compensation of the variance of the Gaussians of the acoustic model. Our proposed system approaches human performance levels by greatly improving the audible quality of speech and substantially improving the keyword recognition accuracy. 相似文献

6.

Extraction of visual features for lipreading 总被引：4，自引：0，他引：4

Matthews I. Cootes T.F. Bangham J.A. Cox S. Harvey R. 《IEEE transactions on pattern analysis and machine intelligence》2002,24(2):198-213

The multimodal nature of speech is often ignored in human-computer interaction, but lip deformations and other body motion, such as those of the head, convey additional information. We integrate speech cues from many sources and this improves intelligibility, especially when the acoustic signal is degraded. The paper shows how this additional, often complementary, visual speech information can be used for speech recognition. Three methods for parameterizing lip image sequences for recognition using hidden Markov models are compared. Two of these are top-down approaches that fit a model of the inner and outer lip contours and derive lipreading features from a principal component analysis of shape or shape and appearance, respectively. The third, bottom-up, method uses a nonlinear scale-space analysis to form features directly from the pixel intensity. All methods are compared on a multitalker visual speech recognition task of isolated letters 相似文献

7.

用于口型识别的实时唇定位方法 总被引：8，自引：0，他引：8

姚鸿勋高文李静梅吕雅娟王瑞《软件学报》2000,11(8):1126-1132

在许多应用于有噪声环境下的语音识别系统中,唇读技术能有效地降低噪声的影响,通过视觉通道来补充仅取决于听觉通道的信息量,从而提高语音识别系统的识别率.该文提出了一种有效和稳健的唇定位跟踪方法,以满足不用特殊标识物和规范性照明就能对信息进行有效提取的应用需求.该方法首先用肤色模型查找脸;然后用迭代算法搜索脸部区域内的眼睛;再根据眼睛的位置来确定脸的大小和位置,并对脸的下半部分采用彩色坐标变换法将唇从肤色中明显地区分出来;最后,用可变模板将上下唇的内外轮廓描述出来. 相似文献

8.

自适应视听信息融合用于抗噪语音识别

梁冰陈德运程慧《控制理论与应用》2011,28(10):1461-1466

为了提高噪音环境中语音识别的准确性和鲁棒性,提出了基于自适应视听信息融合的抗噪语音识别方法,视听信息在识别过程中具有变化的权重,动态的自适应于环境输入的信噪比．根据信噪比和反馈的识别性能,通过学习自动机计算视觉信息的最优权重;根据视听信息的特征向量,利用隐马尔科夫模型进行视听信息的模式匹配,并根据最优权重组合视觉和声音隐马尔科夫模型的决策,获得最终的识别结果．实验结果表明,在各种噪音水平下,自适应权重比不变权重的视听信息融合的语音识别性能更优．相似文献

9.

Recognition of human speech phonemes using a novel fuzzy approach

《Applied Soft Computing》2007,7(3):828-839

相似文献

10.

Super-human multi-talker speech recognition: A graphical modeling approach

John R. Hershey Steven J. Rennie Peder A. Olsen Trausti T. Kristjansson 《Computer Speech and Language》2010,24(1):45-66

We present a system that can separate and recognize the simultaneous speech of two people recorded in a single channel. Applied to the monaural speech separation and recognition challenge, the system out-performed all other participants – including human listeners – with an overall recognition error rate of 21.6%, compared to the human error rate of 22.3%. The system consists of a speaker recognizer, a model-based speech separation module, and a speech recognizer. For the separation models we explored a range of speech models that incorporate different levels of constraints on temporal dynamics to help infer the source speech signals. The system achieves its best performance when the model of temporal dynamics closely captures the grammatical constraints of the task. For inference, we compare a 2-D Viterbi algorithm and two loopy belief-propagation algorithms. We show how belief-propagation reduces the complexity of temporal inference from exponential to linear in the number of sources and the size of the language model. The best belief-propagation method results in nearly the same recognition error rate as exact inference. 相似文献

11.

多噪声环境下的层级语音识别模型

曹晶晶许洁萍邵聖淇《计算机应用》2018,38(6):1790-1794

针对多噪声环境下的语音识别问题,提出了将环境噪声作为语音识别上下文考虑的层级语音识别模型。该模型由含噪语音分类模型和特定噪声环境下的声学模型两层组成,通过含噪语音分类模型降低训练数据与测试数据的差异,消除了特征空间研究对噪声稳定性的限制,并且克服了传统多类型训练在某些噪声环境下识别准确率低的弊端,又通过深度神经网络（DNN）进行声学模型建模,进一步增强声学模型分辨噪声的能力,从而提高模型空间语音识别的噪声鲁棒性。实验中将所提模型与多类型训练得到的基准模型进行对比,结果显示所提层级语音识别模型较该基准模型的词错率（WER）相对降低了20.3%,表明该层级语音识别模型有利于增强语音识别的噪声鲁棒性。相似文献

12.

Robust speech recognition in noisy environments based on subband spectral centroid histograms

Gajic B. Paliwal K.K. 《IEEE transactions on audio, speech, and language processing》2006,14(2):600-608

We investigate how dominant-frequency information can be used in speech feature extraction to increase the robustness of automatic speech recognition against additive background noise. First, we review several earlier proposed auditory-based feature extraction methods and argue that the use of dominant-frequency information might be one of the major reasons for their improved noise robustness. Furthermore, we propose a new feature extraction method, which combines subband power information with dominant subband frequency information in a simple and computationally efficient way. The proposed features are shown to be considerably more robust against additive background noise than standard mel-frequency cepstrum coefficients on two different recognition tasks. The performance improvement increased as we moved from a small-vocabulary isolated-word task to a medium-vocabulary continuous-speech task, where the proposed features also outperformed a computationally expensive auditory-based method. The greatest improvement was obtained for noise types characterized by a relatively flat spectral density. 相似文献

13.

Robust speech recognition method based on discriminative environment feature extraction

下载免费PDF全文

韩纪庆高文《计算机科学技术学报》2001,16(5):458-464

It is an effective approach to learn the influence of environmental parameters,such as additive noise and channel distortions,from training data for robust speech recognition.Most of the previous methods are based on maximum likelihood estimation criterion.However,these methods do not lead to a minimum error rate result.In this paper,a novel discriinative learning method of environmental parameters,which is based on Minimum Classification Error (MCE) criterion,is proposed.In the method,a simple classifier and the Generalized Probabilistic Descent (GPD)algorithm are adopted to iteratively learn the environmental parameters.Consequently,the clean speech features are estimated from the noisy speech features with the estimated environmental parameters,and then the estimations of clean speech features are utilized in the back-end HMM classifier,Experiments show that the best error rate reudction of 32.1% is obtained,tested on a task of 18 isolated confusion Korean words,relative to a conventional HMM system. 相似文献

14.

基于发音特征的声效相关鲁棒语音识别算法 总被引：1，自引：0，他引：1

晁浩宋成彭维平《计算机应用》2015,35(1):257-261

针对声效(VE)相关的语音识别鲁棒性问题,提出了基于多模型框架的语音识别算法.首先,分析了不同声效模式下语音信号的声学特性以及声效变化对语音识别精度的影响;然后,提出了基于高斯混合模型(GMM)的声效模式检测方法;最后,根据声效检测的结果,训练专门的声学模型用于耳语音识别,而将发音特征与传统的谱特征一起用于其余4种声效模式的语音识别.基于孤立词识别的实验结果显示,采用所提方法后语音识别准确率有了明显的提高:与基线系统相比,所提方法5种声效的平均字错误率降低了26.69%;与声学模型混合语料训练方法相比,平均字错误率降低了14.51%;与最大似然线性回归(MLLR)自适应方法相比,平均字错误率降低了15.30%.实验结果表明:与传统谱特征相比发音特征对于声效变化更具鲁棒性,而多模型框架是解决声效相关的语音识别鲁棒性问题的有效方法. 相似文献

15.

Addressing noise and pitch sensitivity of speech recognition system through variational mode decomposition based spectral smoothing

《Digital Signal Processing》2019

In this paper, we propose a novel front-end speech parameterization technique for automatic speech recognition (ASR) that is less sensitive towards ambient noise and pitch variations. First, using variational mode decomposition (VMD), we break up the short-time magnitude spectrum obtained by discrete Fourier transform into several components. In order to suppress the ill-effects of noise and pitch variations, the spectrum is then sufficiently smoothed. The desired spectral smoothing is achieved by discarding the higher-order variational mode functions and reconstructing the spectrum using the first-two modes only. As a result, the smoothed spectrum closely resembles the spectral envelope. Next, the Mel-frequency cepstral coefficients (MFCC) are extracted using the VMD-based smoothed spectra. The proposed front-end acoustic features are observed to be more robust towards ambient noise and pitch variations than the conventional MFCC features as demonstrated by the experimental evaluations presented in this study. For this purpose, we developed an ASR system using speech data from adult speakers collected under relatively clean recording conditions. State-of-the-art acoustic modeling techniques based on deep neural networks (DNN) and long short-term memory recurrent neural networks (LSTM-RNN) were employed. The ASR systems were then evaluated under noisy test conditions for assessing the noise robustness of the proposed features. To assess robustness towards pitch variations, experimental evaluations were performed on another test set consisting of speech data from child speakers. Transcribing children's speech helps in simulating an ASR task where pitch differences between training and test data are significantly large. The signal domain analyses as well as the experimental evaluations presented in this paper support our claims. 相似文献

16.

资源稀缺蒙语语音识别研究 总被引：1，自引：1，他引：0

张爱英倪崇嘉《计算机科学》2017,44(10):318-322

随着语音识别技术的发展,资源稀缺语言的语音识别系统的研究吸引了更广泛的关注。以蒙语为目标识别语言,研究了在资源稀缺的情况下(如仅有10小时的带标注的语音)如何利用其他多语言信息提高识别系统的性能。借助基于多语言深度神经网络的跨语言迁移学习和基于多语言深度Bottleneck神经网络的抽取特征可以获得更具有区分度的声学模型。通过搜索引擎以及网络爬虫的定向抓取获得大量的网页数据,有助于获得文本数据,以增强语言模型的性能。融合多个不同识别结果以进一步提高识别精度。与基线系统相比,多种系统融合的识别绝对错误率减少12%。相似文献

17.

Environment dependent noise tracking for speech enhancement

Nitish Krishnamurthy John H. L. Hansen 《International Journal of Speech Technology》2013,16(3):303-312

Numerous efforts have focused on the problem of reducing the impact of noise on the performance of various speech systems such as speech recognition, speaker recognition, and speech coding. These approaches consider alternative speech features, improved speech modeling, or alternative training for acoustic speech models. This study presents an alternative viewpoint by approaching the same problem from the noise perspective. Here, a framework is developed to analyze and use the noise information available for improving performance of speech systems. The proposed framework focuses on explicitly modeling the noise and its impact on speech system performance in the context of speech enhancement. The framework is then employed for development of a novel noise tracking algorithm for achieving better speech enhancement under highly evolving noise types. The first part of this study employs a noise update rate in conjunction with a target enhancement algorithm to evaluate the need for tracking in many enhancement algorithms. It is shown that noise tracking is more beneficial in some environments than others. This is evaluated using the Log-MMSE enhancement scheme for a corpus of four noise types consisting of Babble (BAB), White Gaussian (WGN), Aircraft Cockpit (ACN), and Highway Car (CAR) using the Itakura-Saito (IS) (Gray et al. in IEEE Trans. Acoust. Speech Signal Process. 28:367–376, 1980) quality measure. A test set of 200 speech utterances from the TIMIT corpus are used for evaluations. The new Environmentally Aware Noise Tracking (EA-NT) method is shown to be superior in comparison with the contemporary noise tracking algorithms. Evaluations are performed for speech degraded using a corpus of four noise types consisting of: Babble (BAB), Machine Gun (MGN), Large Crowd (LCR), and White Gaussian (WGN). Unlike existing approaches, this study provides an effective foundation for addressing noise in speech by emphasizing noise modeling so that available resources can be used to achieve more reliable overall performance in speech systems. 相似文献

18.

Maximum entropy direct models for speech recognition

Hong-Kwang Jeff Kuo Yuqing Gao 《IEEE transactions on audio, speech, and language processing》2006,14(3):873-881

Traditional statistical models for speech recognition have mostly been based on a Bayesian framework using generative models such as hidden Markov models (HMMs). This paper focuses on a new framework for speech recognition using maximum entropy direct modeling, where the probability of a state or word sequence given an observation sequence is computed directly from the model. In contrast to HMMs, features can be asynchronous and overlapping. This model therefore allows for the potential combination of many different types of features, which need not be statistically independent of each other. In this paper, a specific kind of direct model, the maximum entropy Markov model (MEMM), is studied. Even with conventional acoustic features, the approach already shows promising results for phone level decoding. The MEMM significantly outperforms traditional HMMs in word error rate when used as stand-alone acoustic models. Preliminary results combining the MEMM scores with HMM and language model scores show modest improvements over the best HMM speech recognizer. 相似文献

19.

复杂环境下基于自适应深度神经网络的鲁棒语音识别

张开生赵小芬《计算机工程与科学》2022,44(6):1105-1113

在连续语音识别系统中,针对复杂环境（包括说话人及环境噪声的多变性）造成训练数据与测试数据不匹配导致语音识别率低下的问题,提出一种基于自适应深度神经网络的语音识别算法。结合改进正则化自适应准则及特征空间的自适应深度神经网络提高数据匹配度;采用融合说话人身份向量i-vector及噪声感知训练克服说话人及环境噪声变化导致的问题,并改进传统深度神经网络输出层的分类函数,以保证类内紧凑、类间分离的特性。通过在TIMIT英文语音数据集和微软中文语音数据集上叠加多种背景噪声进行测试,实验结果表明,相较于目前流行的GMM-HMM和传统DNN语音声学模型,所提算法的识别词错误率分别下降了5.151%和3.113%,在一定程度上提升了模型的泛化性能和鲁棒性。相似文献

20.

基于时域建模的自动语音识别

王海坤伍大勇刘江王士进胡国平胡郁《计算机工程与应用》2017,53(20):243-248

端到端神经网络能够根据特定的任务自动学习从原始数据到特征的变换,解决人工设计的特征与任务不匹配的问题。以往语音识别的端到端网络采用一层时域卷积网络作为特征提取模型,递归神经网络和全连接前馈深度神经网络作为声学模型的方式,在效果和效率两个方面具有一定的局限性。从特征提取模块的效果以及声学模型的训练效率角度,提出多时间频率分辨率卷积网络与带记忆模块的前馈神经网络相结合的端到端语音识别模型。实验结果表明,所提方法语音识别在真实录制数据集上较传统方法字错误率下降10%,训练时间减少80%。相似文献