Similar Documents
20 similar documents found.
1.
This paper proposes a statistical parametric approach to video-realistic, text-driven talking avatars. We follow the trajectory HMM approach, in which audio and visual speech are jointly modeled by HMMs and continuous audiovisual speech parameter trajectories are synthesized under the maximum likelihood criterion. Previous trajectory HMM approaches focus only on mouth animation, synthesizing simple geometric mouth shapes or video-realistic lip motion. Our approach uses the trajectory HMM to generate visual parameters of the lower face and realizes video-realistic animation of the whole face. Specifically, we use an active appearance model (AAM) to model the visual speech, which offers a convenient and compact statistical model of both the shape and the appearance variations of the face. To achieve high-fidelity video-realistic effects, we use the Poisson image editing technique to stitch the synthesized lower-face image seamlessly into a whole-face image. Objective and subjective experiments show that the proposed approach produces natural facial animation.
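The stitching step can be illustrated concretely. Below is a minimal sketch of Poisson blending using OpenCV's seamlessClone, a standard implementation of Poisson image editing; the file names and mask geometry are illustrative assumptions, not the paper's actual pipeline.

```python
import cv2
import numpy as np

face = cv2.imread("whole_face.png")          # hypothetical background frame
lower = cv2.imread("synth_lower_face.png")   # hypothetical synthesized lower face

# Mask covering the lower-face region to be stitched in (assumed geometry).
mask = np.zeros(lower.shape[:2], dtype=np.uint8)
mask[lower.shape[0] // 2:, :] = 255

# Center of the region in the target image where the patch is placed.
center = (face.shape[1] // 2, int(face.shape[0] * 0.75))

# Solve the Poisson equation so gradients match across the seam.
blended = cv2.seamlessClone(lower, face, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("blended_frame.png", blended)
```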

2.
We present our studies on the application of coupled hidden Markov models (CHMMs) to sports-highlights extraction from broadcast video using both audio and video information. First, we generate audio labels through audio classification with Gaussian mixture models, and video labels by quantizing average motion-vector magnitudes. Then, we model sports highlights using discrete-observation CHMMs over the audio and video labels, trained on a large set of broadcast sports highlights. Our experimental results on unseen golf and soccer content show that CHMMs outperform hidden Markov models (HMMs) trained on audio-only or video-only observations. Next, we study how the coupling between the two single-modality HMMs improves modeling capability by refining the states of the models. We also show that the number of states optimized in this fashion gives better classification results than other choices. We conclude that CHMMs are a promising tool for information-fusion techniques in the sports domain for audio-visual event detection and analysis.
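As a rough illustration of the label-generation stage, the sketch below trains one Gaussian mixture model per audio class and quantizes average motion-vector magnitudes into discrete video labels; the class names, feature shapes, and bin edges are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_audio_gmms(features_by_class, n_components=8):
    """features_by_class: dict class_name -> (n_frames, n_dims) feature array."""
    return {c: GaussianMixture(n_components).fit(X)
            for c, X in features_by_class.items()}

def audio_labels(gmms, frames):
    """Assign each frame the class whose GMM gives the highest log-likelihood."""
    classes = list(gmms)
    ll = np.stack([gmms[c].score_samples(frames) for c in classes])
    return [classes[i] for i in ll.argmax(axis=0)]

def video_labels(avg_motion_magnitudes, bins=(0.5, 2.0, 8.0)):
    """Quantize average motion-vector magnitude into discrete labels (assumed bins)."""
    return np.digitize(avg_motion_magnitudes, bins)  # labels 0..len(bins)
```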

3.
In this paper, we formulate the problem of synthesizing facial animation from an input audio sequence as a dynamic audio-visual mapping. We propose that this mapping be modeled with an input-output hidden Markov model (IOHMM), an HMM whose output and transition probabilities are conditional on the input sequence. We train IOHMMs using the expectation-maximization (EM) algorithm with a novel architecture that explicitly models the relationship between transition probabilities and the input using neural networks. Given an input sequence, the output sequence is synthesized by maximum likelihood estimation. Experimental results demonstrate that IOHMMs can generate natural, good-quality facial animation sequences from input audio.
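The defining feature of an IOHMM, transition probabilities conditioned on the input, can be sketched as follows; the per-state linear-softmax parameterization here is a simplified stand-in for the paper's neural-network architecture, and all weights are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class InputConditionedTransitions:
    def __init__(self, n_states, input_dim, rng=np.random.default_rng(0)):
        # One linear layer per source state, mapping input -> next-state logits.
        self.W = rng.normal(scale=0.1, size=(n_states, n_states, input_dim))
        self.b = np.zeros((n_states, n_states))

    def matrix(self, u):
        """Return the n_states x n_states transition matrix A(u) for input u."""
        return np.stack([softmax(self.W[i] @ u + self.b[i])
                         for i in range(self.W.shape[0])])
```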

4.
This paper presents an articulatory modeling approach for converting acoustic speech into realistic mouth animation. We directly model the movements of articulators, such as the lips, tongue, and teeth, using a dynamic Bayesian network (DBN)-based audio-visual articulatory model (AVAM). A multiple-stream structure with a shared articulator layer is adopted to synchronously associate the two building blocks of speech, audio and video. This model not only describes the synchronization between visual articulatory movements and audio speech, but also reflects the linguistic fact that different articulators evolve asynchronously. We also present a Baum-Welch DBN inversion (DBNI) algorithm to generate optimal facial parameters from audio, given the trained AVAM, under the maximum likelihood (ML) criterion. Extensive objective and subjective evaluations on the JEWEL audio-visual dataset demonstrate that, compared with phonemic HMM approaches, facial parameters estimated by our approach follow the true parameters more accurately, and the synthesized facial animation sequences are lifelike enough that 38% of them are indistinguishable.

5.
A Machine-Learning-Based Method for Speech-Driven Facial Animation
Synchronizing speech with lip movements and facial expressions is one of the difficulties of facial animation. This work combines clustering and machine learning to learn the synchronization relationship between the speech signal and lip/facial motion, and applies it in an MPEG-4-based speech-driven facial animation system. On the basis of a large-scale synchronized audio-visual database, unsupervised clustering discovers basic patterns that effectively characterize facial motion, and a neural network is trained to map prosody-bearing speech features directly to these basic facial-motion patterns. This not only sidesteps the limited robustness of speech recognition, but also yields learned results that can directly drive a face mesh. Finally, quantitative and qualitative evaluation methods for the speech-driven facial animation system are given. Experimental results show that machine-learning-based speech-driven facial animation not only effectively solves the audio-video synchronization problem and enhances the realism and fidelity of the animation, but also, because the MPEG-4-based learning results are independent of any particular face model, can drive a variety of face models, including real video, 2D cartoon characters, and 3D virtual faces.
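A minimal sketch of the two-stage pipeline described above, under assumed feature shapes and hyperparameters: unsupervised clustering discovers basic facial-motion patterns, and a neural network maps prosodic audio features directly to a pattern index.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

def learn_motion_patterns(face_params, n_patterns=16):
    """Cluster per-frame facial parameters into basic motion patterns."""
    km = KMeans(n_clusters=n_patterns, n_init=10).fit(face_params)
    return km  # km.cluster_centers_ are the pattern prototypes

def train_audio_to_pattern(audio_feats, face_params, km):
    """Learn a direct mapping from prosodic audio features to pattern IDs,
    bypassing explicit speech recognition (hyperparameters are assumed)."""
    pattern_ids = km.predict(face_params)
    return MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(
        audio_feats, pattern_ids)
```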

6.
Sign language is a special language that communicates through movement and vision; during sign language expression, head motion carries semantic and emotional information. This paper analyzes the correlation between hand gestures and head movements in sign language expression, models each discrete head-motion representation with hidden Markov models (HMMs), and generates smooth head-motion animation based on a first-order Markov model and an interpolation algorithm.
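The generation step might look like the following sketch: a first-order Markov chain selects the next discrete head-motion unit, and linear interpolation smooths between the boundary poses of consecutive units; the transition matrix and the Euler-angle pose format are assumptions.

```python
import numpy as np

def next_unit(current, transition, rng=np.random.default_rng(0)):
    """Sample the next head-motion unit from a first-order Markov model.
    transition: (n_units, n_units) row-stochastic matrix (assumed given)."""
    return rng.choice(len(transition), p=transition[current])

def interpolate_poses(pose_a, pose_b, n_frames):
    """Linearly interpolate head poses (e.g. pitch/yaw/roll) between units."""
    t = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1 - t) * pose_a + t * pose_b
```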

7.
Advances in Speech Visualization Algorithms for Facial Animation
Starting from the development, characteristics, and applications of facial animation synthesis, this paper introduces five major facial animation synthesis techniques, then compares and analyzes the characteristics of four speech visualization algorithms used in speech-driven facial animation: vector quantization, Gaussian mixture models, neural networks, and hidden Markov models. Prospects for the development and application of speech visualization algorithms are discussed. Keywords: facial animation; speech visualization

8.
9.
Audio/visual mapping with cross-modal hidden Markov models
The audio/visual mapping problem of speech-driven facial animation has intrigued researchers for years. Recent research efforts have demonstrated that hidden Markov model (HMM) techniques, applied successfully to speech recognition, can achieve a similar level of success in audio/visual mapping problems. A number of HMM-based methods have been proposed and shown to be effective by their respective designers, but it has remained unclear how these techniques compare to each other on a common test bed. In this paper, we quantitatively compare three recently proposed cross-modal HMM methods: the remapping HMM (R-HMM), the least-mean-squared HMM (LMS-HMM), and HMM inversion (HMMI). The objective of our comparison is not only to highlight the merits and demerits of the different mapping designs, but also to study the optimality of the acoustic representation and HMM structure for speech-driven facial animation. This paper presents a brief overview of these models, followed by an analysis of their mapping capabilities on a synthetic dataset. An empirical comparison on an experimental audio-visual dataset of 75 TIMIT sentences is finally presented. Our results show that HMMI provides the best performance on both synthetic and experimental audio-visual data.

10.
Based on the MPEG-4 standard, this work implements a method and system for generating facial animation driven jointly by ring-back-tone speech and the emotion it conveys. An HMM is chosen as the classifier and trained to recognize five emotion classes in the speech corpus: annoyance, delight, cuteness, helplessness, and excitement; for each emotion class, a corresponding group of facial animation parameters (FAPs) is defined. A composite expression function is obtained by analyzing speech intensity, and this function is used to fuse expression FAPs with lip-motion FAPs, achieving multi-source synthesis of facial expression information; the resulting composite FAPs drive the face mesh to generate animation. Experimental results show that the emotion recognition rate on ring-back-tone speech reaches 94.44%, and the facial animation generated by the system is highly realistic.
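A minimal sketch of the FAP-fusion step, under an assumed weighting function: a weight derived from short-time speech intensity blends the emotion FAP group with the lip-motion FAP group into a single frame-level FAP vector.

```python
import numpy as np

def intensity_weight(frame_energy, low=0.01, high=0.1):
    """Map short-time speech energy to an expression weight in [0, 1].
    The thresholds low/high are assumed values."""
    return float(np.clip((frame_energy - low) / (high - low), 0.0, 1.0))

def fuse_faps(expression_fap, lip_fap, frame_energy):
    """Blend the emotion FAP group with the lip-motion FAP group."""
    w = intensity_weight(frame_energy)
    return w * expression_fap + (1.0 - w) * lip_fap
```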

11.
Audio-visual speech recognition, which employs both acoustic and visual speech information, is a novel extension of acoustic speech recognition that significantly improves recognition accuracy in noisy environments. Although various audio-visual speech recognition systems have been developed, a rigorous and detailed comparison of the potential geometric visual features from speakers' faces is essential. In this paper, geometric visual features are therefore compared and analyzed rigorously for their importance in audio-visual speech recognition. Experimental results show that, among the geometric visual features analyzed, lip vertical aperture is the most relevant, and the visual feature vector formed by the vertical and horizontal lip apertures and the first-order derivative of the lip corner angle leads to the best recognition results. Speech signals are modeled by hidden Markov models (HMMs), and using the optimized HMMs and geometric visual features, the accuracies of acoustic-only, visual-only, and audio-visual speech recognition are compared. The audio-visual scheme has much better recognition accuracy than acoustic-only and visual-only recognition, especially at high noise levels. The experimental results show that a set of as few as three labial geometric features is sufficient to improve the recognition rate by as much as 20% (from 62% with acoustic-only information to 82% with audio-visual information at a signal-to-noise ratio of 0 dB).
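The best-performing feature vector reported above can be sketched directly from lip landmarks; the landmark selection and the angle definition here are assumptions.

```python
import numpy as np

def lip_features(top, bottom, left, right):
    """Each argument is an (x, y) lip landmark for one frame (assumed layout)."""
    vertical = np.linalg.norm(np.subtract(top, bottom))      # vertical aperture
    horizontal = np.linalg.norm(np.subtract(left, right))    # horizontal aperture
    # Angle at the left lip corner, from the corner toward the upper lip.
    corner_angle = np.arctan2(top[1] - left[1], top[0] - left[0])
    return vertical, horizontal, corner_angle

def feature_vector(frames):
    """Stack apertures with the frame-to-frame derivative of the corner angle."""
    feats = np.array([lip_features(*f) for f in frames])
    d_angle = np.gradient(feats[:, 2])
    return np.column_stack([feats[:, 0], feats[:, 1], d_angle])
```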

12.
Abe, Naoki; Warmuth, Manfred K. Machine Learning, 1992, 9(2-3): 205-260.
We introduce a rigorous performance criterion for training algorithms for probabilistic automata (PAs) and hidden Markov models (HMMs), used extensively for speech recognition, ...

13.
To improve recognition performance in noisy environments, multicondition training is usually applied, in which speech signals corrupted by a variety of noises are used in acoustic model training. Published hidden Markov modeling of speech uses multiple Gaussian distributions to cover the spread of the speech distribution caused by noise, which distracts from modeling the speech event itself and may sacrifice performance on clean speech. In this paper, we propose a novel approach that extends the conventional Gaussian mixture hidden Markov model (GMHMM) by modeling the state emission parameters (mean and variance) as polynomial functions of a continuous environment-dependent variable. At recognition time, a set of HMMs specific to the given value of the environment variable is instantiated and used for recognition. The maximum-likelihood (ML) estimation of the polynomial functions of the proposed variable-parameter GMHMM is given within the expectation-maximization (EM) framework. Experiments on the Aurora 2 database show significant improvements of the variable-parameter Gaussian mixture HMMs over conventional GMHMMs.
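The variable-parameter idea can be sketched as follows: a Gaussian mean expressed as a polynomial of a continuous environment variable (e.g. SNR), evaluated at recognition time to instantiate an environment-specific HMM. The polynomial order and coefficient values below are illustrative assumptions.

```python
import numpy as np

class PolynomialMean:
    def __init__(self, coeffs):
        # coeffs: (order+1, dim) array; row k holds the degree-k coefficients.
        self.coeffs = np.asarray(coeffs)

    def __call__(self, env):
        """Evaluate the mean vector at environment value env (e.g. SNR in dB)."""
        powers = env ** np.arange(self.coeffs.shape[0])
        return powers @ self.coeffs

# Instantiating a state's emission mean for an assumed 10 dB condition:
mean_fn = PolynomialMean([[0.0, 1.0], [0.2, -0.1], [0.01, 0.0]])
mu_10db = mean_fn(10.0)
```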

14.
The performance of an automatic facial expression recognition system can be significantly improved by modeling the reliability of different streams of facial expression information using multistream hidden Markov models (HMMs). In this paper, we present an automatic multistream HMM facial expression recognition system and analyze its performance. The proposed system uses facial animation parameters (FAPs), supported by the MPEG-4 standard, as features for facial expression classification. Specifically, the FAPs describing the movement of the outer-lip contours and eyebrows are used as observations. Experiments are first performed with single-stream HMMs under several scenarios, using outer-lip and eyebrow FAPs individually and jointly. A multistream HMM approach is then proposed that introduces facial-expression- and FAP-group-dependent stream reliability weights. The stream weights are determined from the facial expression recognition results obtained when the FAP streams are used individually. The proposed multistream HMM system, which uses these stream reliability weights, achieves a 44% relative reduction in facial expression recognition error compared to the single-stream HMM system.
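A minimal sketch of stream-weighted scoring in a multistream HMM, with assumed weight values: the per-stream emission log-likelihoods are combined with reliability weights before decoding (the paper derives the weights from single-stream recognition results).

```python
import numpy as np

def combined_loglik(loglik_lips, loglik_brows, w_lips=0.7, w_brows=0.3):
    """Weighted combination of outer-lip and eyebrow stream scores.
    Both arrays have shape (n_frames, n_states); weights are assumed values."""
    return w_lips * np.asarray(loglik_lips) + w_brows * np.asarray(loglik_brows)
```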

15.
Hidden Markov models (HMMs) with Gaussian mixture distributions rely on the assumption that speech features are temporally uncorrelated, and often assume a diagonal covariance matrix, ignoring correlations between feature vectors of adjacent frames. A linear dynamic model (LDM) is a Markovian state-space model that also relies on hidden states, but explicitly models the evolution of these hidden states with an autoregressive process. An LDM can capture higher-order statistics and exploit feature correlations in an efficient and parsimonious manner. In this paper, we present a hybrid LDM/HMM decoder architecture that post-processes segmentations derived from the first pass of HMM-based recognition. This smoothed trajectory model is complementary to existing HMM systems. An expectation-maximization (EM) approach for parameter estimation is presented. We demonstrate a 13% relative WER reduction on the Aurora-4 clean evaluation set and a 13% relative WER reduction on the babble noise condition.
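An LDM's state evolution can be sketched in a few lines: hidden states follow a first-order autoregressive process and emit observations linearly, which is what lets the model capture inter-frame correlation that a diagonal-covariance GMM/HMM ignores. All matrices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])    # state transition (autoregression)
C = np.array([[1.0, 0.0]])                # observation matrix
Q, R = 0.01 * np.eye(2), 0.1 * np.eye(1)  # process / observation noise

x = np.zeros(2)
states, obs = [], []
for _ in range(100):
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)   # x_t = A x_{t-1} + w_t
    y = C @ x + rng.multivariate_normal(np.zeros(1), R)   # y_t = C x_t + v_t
    states.append(x)
    obs.append(y)
```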

16.
We propose a coupled hidden Markov model (CHMM) for the analysis of steel surfaces containing three-dimensional flaws. The aim is to model surface errors that stretch across one or more surface segments because of their strongly varying size. Due to scale on the surface, the reflection properties across the intact surface change and intensity imaging fails, so light sectioning is used to acquire the surface range data. The steel block vibrates on the conveyor during data acquisition, which complicates robust feature extraction. After depth-map recovery and feature extraction, segments of the surface are classified using CHMMs; the CHMM achieves a recognition rate of 98.57%. We compare the CHMM approach to the naïve Bayes classifier, the hidden Markov model, the k-nearest-neighbor classifier, and the support vector machine. Franz Pernkopf received his MSc (Dipl.-Ing.) degree in electrical engineering at Graz University of Technology, Austria, in summer 1999. He earned a PhD degree from the University of Leoben, Austria, in 2002. In 2002 he was awarded the Erwin Schrödinger Fellowship, and in 2004 he was a research associate at the Department of Electrical Engineering at the University of Washington, Seattle. Currently, he is a university assistant at the Signal Processing and Speech Communication Laboratory at Graz University of Technology, Austria. His research interests include graphical models, generative and discriminative learning of Bayesian network classifiers, feature selection, finite mixture models, image processing and vision, and statistical pattern recognition.

17.
An Embedded Speech Recognition System Based on a Hardware Acceleration Module
Based on the principles of CHMM-based speech recognition, a low-cost, high-performance embedded speech recognition system is designed around an MCU and a custom speech-recognition acceleration module (an ASIC module). Together with peripheral circuitry, the system can perform speech recognition independently and delivers a substantial performance improvement, making embedded speech recognition more convenient. An ARM7 serves as the system's control core, while the acceleration module handles the most computation-intensive part of the hidden Markov model recognition algorithm: the Mahalanobis distance calculation. The system features low cost, high performance, high generality, and strong configurability.
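The computation the ASIC module offloads, the Mahalanobis distance between a feature vector and a Gaussian mean, is sketched below; with a diagonal covariance, common in embedded recognizers, it reduces to a variance-normalized squared difference.

```python
import numpy as np

def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    d = np.asarray(x) - np.asarray(mean)
    return float(d @ cov_inv @ d)

def mahalanobis_sq_diag(x, mean, var):
    """Diagonal-covariance special case: sum of variance-normalized squares."""
    d = np.asarray(x) - np.asarray(mean)
    return float(np.sum(d * d / np.asarray(var)))
```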

18.
Acoustic modeling based on hidden Markov models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice for warping the time axis and modeling the temporal phenomena in the speech signal, their conditional-independence properties limit their ability to model spectral phenomena well. In this paper, a new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed. This paradigm addresses some limitations of HMMs while maintaining many of the aspects that have made them successful. In particular, the acoustic modeling problem is reformulated in a data-driven, sparse, augmented space to increase discrimination, and acoustic context modeling is explicitly integrated to handle the sequential phenomena of the speech signal. We present an efficient framework for estimating these models that ensures scalability and generality. On the TIMIT phone recognition task, a phone error rate of 23.0% was recorded on the full test set, a significant improvement over comparable HMM-based systems.

19.
Realistic mouth synthesis based on shape appearance dependence mapping
Mouth images are difficult to synthesize because they vary greatly with illumination, with the size and shape of the mouth opening, and especially with the visibility of the teeth and tongue. Conventional approaches such as manipulating a 3D model or warping images do not produce very realistic animation. To overcome these difficulties, we describe a method for producing large variations in mouth shape and grey-level appearance using a compact parametric appearance model that represents both shape and grey-level appearance. We find a high correlation between shape-model parameters and grey-level-model parameters, and design a shape appearance dependence mapping (SADM) strategy that converts one to the other. Once mouth shape parameters are derived from speech analysis, a proper full-mouth appearance can be reconstructed with SADM. Synthetic results for representative mouth appearances, shown in our experiments, are very close to real mouth images. The proposed technique can be integrated into a speech-driven face animation system. In effect, SADM can synthesize not only the mouth image but also various kinds of dynamic facial texture, such as furrows, dimples, and cheekbone shadows.
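The dependence-mapping idea can be sketched with a simple linear map learned by least squares; using least squares here is an assumption standing in for the paper's actual SADM estimation. The map predicts grey-level (appearance) parameters from shape parameters.

```python
import numpy as np

def fit_sadm(shape_params, appearance_params):
    """shape_params: (n, ds); appearance_params: (n, da).
    Returns a (ds+1, da) linear map including a bias row."""
    X = np.hstack([shape_params, np.ones((len(shape_params), 1))])  # bias term
    M, *_ = np.linalg.lstsq(X, appearance_params, rcond=None)
    return M

def shape_to_appearance(M, s):
    """Predict grey-level parameters from a shape parameter vector s."""
    return np.append(s, 1.0) @ M
```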

20.
Audio-visual speech modeling for continuous speech recognition
This paper describes a speech recognition system that uses both acoustic and visual speech information to improve recognition performance in noisy environments. The system consists of three components: a visual module, an acoustic module, and a sensor fusion module. The visual module locates and tracks the lip movements of a given speaker and extracts relevant speech features; this task is performed with an appearance-based lip model learned from example images. Visual speech features are represented by contour information of the lips and grey-level information of the mouth area. The acoustic module extracts noise-robust features from the audio signal. Finally, the sensor fusion module is responsible for the joint temporal modeling of the acoustic and visual feature streams and is realized using multistream hidden Markov models (HMMs). The multistream method allows the definition of different temporal topologies and levels of stream integration, and hence enables temporal dependencies to be modeled more accurately than with traditional approaches. We present two different methods for learning the asynchrony between the two modalities and for incorporating it into the multistream models. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits. On a recognition task at 15 dB acoustic signal-to-noise ratio (SNR), acoustic perceptual linear prediction (PLP) features lead to a 56% error rate, noise-robust RASTA-PLP (relative spectra) acoustic features to a 7.2% error rate, and combined noise-robust acoustic and visual features to a 2.5% error rate.
