Similar Documents
20 similar documents found (search time: 140 ms)
1.
彭柏  许刚 《电声技术》2007,31(1):39-43
Building on a study of spectrum-shifting methods and an analysis of the properties and variation patterns of speech formants, an algorithm is proposed that adjusts formants by shifting the spectrum, giving effective control over formant trajectories in the synthesized vocal-tract model. The implementation flow of the voice conversion is discussed, and the synthesized source model is applied to conversion between male and female voices. Experimental results and analysis show that the method allows flexible control of the formants and gives the converted speech a higher degree of blending.
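
As a rough illustration of the spectrum-shifting idea only (not the paper's actual algorithm), the sketch below warps the magnitude spectrum of a single frame by a factor `alpha`, moving the formants up or down while keeping the phase; the frame length, FFT size, and warping rule are assumptions.

```python
import numpy as np

def shift_formants(frame, alpha=1.15, n_fft=1024):
    """Warp the magnitude spectrum of one frame by `alpha` (>1 raises the
    formants, <1 lowers them); the phase is kept unchanged. This is only a
    crude stand-in for the paper's spectrum-shifting rule."""
    spec = np.fft.rfft(frame, n_fft)
    mag, phase = np.abs(spec), np.angle(spec)
    bins = np.arange(len(mag))
    # resample the magnitude envelope on a scaled frequency axis
    warped = np.interp(bins / alpha, bins, mag, left=0.0, right=0.0)
    return np.fft.irfft(warped * np.exp(1j * phase), n_fft)[:len(frame)]
```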

2.
Sinusoidal-model feature analysis and auditory identification of Mandarin speech (cited: 1; self-citations: 0, other citations: 0)
张毅楠  肖熙 《电声技术》2011,35(8):38-41
To study the acoustic characteristics of Mandarin speech, the sinusoidal model of the speech signal is applied to feature extraction and analysis. By applying a peak-matching algorithm to the model parameters, a spectrogram based on the sinusoidal model is obtained. This spectrogram shows directly the details and variation patterns of the fundamental frequency and the formants in the speech signal, providing a visualization tool for speech analysis. On this basis, the first two formants of Mandarin single-final syllables are analyzed; with only a few main…
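
A minimal sketch of the per-frame peak-picking step is given below, keeping only the strongest spectral peaks of each FFT frame; the sinusoidal-model parameter estimation and frame-to-frame peak matching of the paper are not reproduced, and the frame and window settings are assumptions.

```python
import numpy as np
from scipy.signal import find_peaks, get_window

def sinusoidal_peaks(x, fs, frame_len=512, hop=128, max_peaks=20):
    """Per-frame spectral peak picking for a sinusoidal-model style spectrogram.
    Returns a list of (frequencies_Hz, amplitudes) per frame; matching peaks
    into tracks across frames is omitted for brevity."""
    win = get_window("hamming", frame_len)
    tracks = []
    for start in range(0, len(x) - frame_len, hop):
        spec = np.abs(np.fft.rfft(x[start:start + frame_len] * win))
        idx, _ = find_peaks(spec)
        idx = idx[np.argsort(spec[idx])[::-1][:max_peaks]]   # strongest peaks
        tracks.append((idx * fs / frame_len, spec[idx]))
    return tracks
```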

3.
Research on speech feature extraction algorithms for speech-driven facial animation (cited: 1; self-citations: 1, other citations: 0)
Speech-driven facial animation is a hot topic in virtual reality, and extraction of speech feature parameters is the prerequisite and key to speech-synchronized animation. To obtain more robust speech features, this work builds on wavelet transform theory, borrows the MFCC extraction procedure, and applies a delta-feature algorithm that characterizes the dynamics of speech, proposing a speech feature extraction method based on the discrete wavelet transform (DWTMFCC) and combining it with prosodic parameters that reflect the emotional characteristics of speech. Speaker recognition with a VQ model built by the LBG algorithm shows that the combined feature parameters achieve a relatively high recognition rate.
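
A hedged sketch of what DWT-based cepstral features with a dynamic (delta) component might look like; the wavelet choice, decomposition depth, and sub-band energy formulation are assumptions rather than the paper's DWTMFCC definition.

```python
import numpy as np
import pywt                      # PyWavelets
from scipy.fft import dct

def dwt_cepstral_features(frames, wavelet="db4", level=5):
    """Rough stand-in for DWT-based MFCC-style features: log sub-band energies
    of a discrete wavelet decomposition, decorrelated with a DCT, plus a
    first-order delta as the dynamic component."""
    static = []
    for frame in frames:
        coeffs = pywt.wavedec(frame, wavelet, level=level)      # level+1 sub-bands
        energies = np.array([np.sum(c ** 2) + 1e-10 for c in coeffs])
        static.append(dct(np.log(energies), norm="ortho"))
    static = np.asarray(static)
    delta = np.vstack([np.zeros(static.shape[1]), np.diff(static, axis=0)])
    return np.hstack([static, delta])    # static + dynamic (delta) features
```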

4.
Conversion of whispered speech to normal speech based on a BP neural network (cited: 1; self-citations: 1, other citations: 0)
A method for converting Mandarin whispered speech into normal speech based on a BP neural network is proposed. First, the formant parameters of normal speech and whispered speech are extracted, and a BP neural network is trained as a model that maps whispered-speech formant parameters to those of normal speech. The model is then used to obtain the normal-speech formant parameters corresponding to the whispered input, and formant synthesis converts the whisper into normal speech. Experimental results show that speech converted by this method scores 80% in DRT and 3.5 in MOS, which is satisfactory in both intelligibility and quality.
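
The mapping step can be illustrated with a small back-propagation network for multi-output regression; the sketch below uses scikit-learn's MLPRegressor with placeholder data and an assumed six-dimensional formant vector, standing in for the paper's trained model rather than reproducing it.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Illustrative sketch only: a small back-propagation network mapping
# whispered-speech formant vectors (e.g. F1-F3 and bandwidths per frame) to
# the corresponding normal-speech formant vectors. Real training data would
# come from parallel whispered/normal recordings.
whisper_formants = np.random.rand(2000, 6)          # placeholder features
normal_formants = np.random.rand(2000, 6)           # placeholder targets

bp_net = MLPRegressor(hidden_layer_sizes=(32,), activation="logistic",
                      solver="adam", max_iter=2000)
bp_net.fit(whisper_formants, normal_formants)

# At conversion time, the predicted formants would drive a formant synthesizer.
predicted = bp_net.predict(whisper_formants[:1])
```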

5.
This paper presents a perturbation-based method for estimating the parameters of a vocal-tract area function model from a finite number of speech formants. Building on Schroeder's perturbation solution of the acoustic horn (tube) equation, it adopts an improved vocal-tract area function model and, to handle the cross-coupling between different resonance modes, uses a cross-sensitivity perturbation matrix relation together with a procedure for measuring that matrix. A recursive perturbation algorithm for estimating the vocal-tract area parameters is designed: a "formant-perturbation" mapping codebook provides the initial area perturbation vector that best matches the speech formants, and multi-stage perturbation targets with adaptive increment control can effectively…

6.
康永国  陶建华  徐波 《信号处理》2005,21(Z1):220-222
An analysis method is proposed that jointly decomposes the speech signal into amplitude-modulation and frequency-modulation components by frequency sub-band. The method first uses a dynamic-programming algorithm to divide the speech spectrum into non-overlapping sub-bands, each containing only one formant and therefore a single-component signal; an energy separation algorithm then performs the AM-FM decomposition within each band, removing the need for the empirical choice of band-pass filter parameters required by earlier multi-component separation methods. Applied to formant estimation, the method not only estimates the formant frequencies accurately but also avoids the complicated formant-trajectory tracking process.
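
The energy-separation step can be illustrated with the classic DESA-1 algorithm built on the Teager-Kaiser energy operator; the dynamic-programming sub-band splitting described in the abstract is not shown, and the numerical safeguards below are assumptions.

```python
import numpy as np

def teager(x):
    """Teager-Kaiser energy operator: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi

def desa1(x, fs, eps=1e-10):
    """DESA-1 energy separation of a single-component (band-limited) signal
    into instantaneous amplitude and frequency. Sketch of the classic
    Maragos-Kaiser-Quatieri algorithm."""
    psi_x = teager(x)
    psi_y = teager(np.diff(x, prepend=x[0]))           # Psi of the backward difference
    g = 1.0 - (psi_y[:-1] + psi_y[1:]) / (4.0 * psi_x[:-1] + eps)
    g = np.clip(g, -1.0, 1.0)
    omega = np.arccos(g)                                # digital frequency (rad/sample)
    amp = np.sqrt(np.abs(psi_x[:-1]) / (1.0 - g ** 2 + eps))
    return amp, omega * fs / (2 * np.pi)                # amplitude, frequency in Hz
```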

7.
吴晓军  鞠光亮 《电子学报》2016,44(9):2141-2147
A markerless facial expression capture method is proposed. First, a uniform face mesh covering 85% of the facial features is generated from ASM (Active Shape Model) feature points. On this face model, an expression capture method is proposed in which optical flow tracks the displacements of the feature points, a particle filter stabilizes the tracking results, and the feature-point displacements drive the overall deformation of the mesh as the initial value of mesh tracking, with a mesh deformation algorithm driving the mesh. Finally, the captured expression data drive different face models, using driving methods appropriate to each model's dimensionality to reproduce the expression animation. Experimental results show that the proposed algorithm captures facial expressions well, and mapping the captured expressions onto both a 2D cartoon face and a 3D virtual face model yields good animation results.
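
A minimal sketch of the feature-point tracking step using OpenCV's pyramidal Lucas-Kanade optical flow; the window size and termination criteria are assumptions, and the particle-filter smoothing and mesh deformation steps of the paper are omitted.

```python
import cv2
import numpy as np

def track_feature_points(prev_gray, next_gray, points):
    """Track ASM-style facial feature points between two grayscale frames with
    pyramidal Lucas-Kanade optical flow. Returns the new positions, the
    displacements that would drive the face mesh, and a validity mask."""
    pts = points.astype(np.float32).reshape(-1, 1, 2)
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, pts, None,
        winSize=(21, 21), maxLevel=3,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
    ok = status.ravel() == 1
    displacement = next_pts.reshape(-1, 2) - points     # drives the mesh deformation
    return next_pts.reshape(-1, 2), displacement, ok
```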

8.
王坤赤  蒋华 《现代电子技术》2007,30(21):168-170
Because it offers the lowest bit rate in theory, the formant vocoder has long been a focus of parametric speech coding research. Its key algorithms are the extraction of speech parameters such as the fundamental frequency and the formants. Based on a high-resolution spectrogram, a simple and effective pitch and formant extraction algorithm is designed using the frequency-domain characteristics of the speech signal, and its accuracy is verified by evaluating the quality of the reconstructed speech. Speech experiments determine that the coded parameters comprise the fundamental frequency and the first four formants, and quantization schemes for each parameter are specified while preserving speech quality. Tests on real speech show that the algorithm delivers good speech quality at a bit rate of 1 400 b/s.

9.
Analysis of LPC-based formant extraction from speech signals (cited: 3; self-citations: 1, other citations: 2)
Research on extracting speech formants with LPC (linear predictive coding) shows that the phase-frequency characteristic can be used to extract formants just as the log magnitude-frequency characteristic can. Compared with the second derivative of the log magnitude-frequency characteristic, the third derivative of the phase-frequency characteristic offers higher frequency resolution, resolves merged formants more effectively, and yields more accurate formant parameters.
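
For reference, a conventional LPC-root formant estimator is sketched below (autocorrelation method, pole angles of the all-pole model); the paper's phase-frequency-derivative refinement is not reproduced, and the pole-selection heuristic is an assumption.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_formants(frame, fs, order=12):
    """Conventional LPC formant estimation: fit an all-pole model by the
    autocorrelation method and take the angles of the complex pole pairs
    as candidate formant frequencies."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, "full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])       # prediction coefficients
    poles = np.roots(np.concatenate(([1.0], -a)))
    poles = poles[(np.imag(poles) > 0) & (np.abs(poles) > 0.7)]
    return np.sort(np.angle(poles)) * fs / (2 * np.pi)  # candidate formants in Hz
```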

10.
Speech recognition means using a computer to recognize the content expressed by a speech signal, with the aim of accurately understanding its meaning. This paper focuses on feature extraction in the speech recognition process. Among the available feature extraction methods, LPC cepstral coefficients are chosen as the feature parameters: they remove the excitation information of the speech production process fairly thoroughly, mainly reflect the vocal-tract model, and only a dozen or so cepstral coefficients describe the formant characteristics of speech well. The LPC cepstral coefficients are extracted after pre-emphasis, framing, windowing, and autocorrelation analysis of the speech signal. A VC program written according to this flow analyzes and processes the speech signal, removing redundant information irrelevant to recognition and retaining the information important for speech recognition.
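
A sketch of the described pipeline, with typical frame and pre-emphasis settings assumed: pre-emphasis, framing, Hamming windowing, autocorrelation, LPC fitting, then the standard LPC-to-cepstrum recursion.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_cepstrum(signal, fs, order=12, frame_len=400, hop=160, pre=0.97):
    """LPC cepstral coefficient (LPCC) extraction following the steps in the
    abstract. Frame length, hop, and pre-emphasis factor are assumptions."""
    x = np.append(signal[0], signal[1:] - pre * signal[:-1])    # pre-emphasis
    win = np.hamming(frame_len)
    feats = []
    for s in range(0, len(x) - frame_len, hop):
        frame = x[s:s + frame_len] * win
        r = np.correlate(frame, frame, "full")[frame_len - 1:]
        a = solve_toeplitz(r[:order], r[1:order + 1])           # LPC coefficients
        c = np.zeros(order)
        for m in range(1, order + 1):                           # LPC -> cepstrum
            c[m - 1] = a[m - 1] + sum((k / m) * c[k - 1] * a[m - k - 1]
                                      for k in range(1, m))
        feats.append(c)
    return np.asarray(feats)
```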

11.
Speech-driven facial animation combines techniques from different disciplines such as image analysis, computer graphics, and speech analysis. Active shape models (ASM) used in image analysis are excellent tools for characterizing lip contour shapes and approximating their motion in image sequences. By controlling the coefficients for an ASM, such a model can also be used for animation. We design a mapping of the articulatory parameters used in phonetics into ASM coefficients that control nonrigid lip motion. The mapping is designed to minimize the approximation error when articulatory parameters measured on training lip contours are taken as input to synthesize the training lip movements. Since articulatory parameters can also be estimated from speech, the proposed technique can form an important component of a speech-driven facial animation system.
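
In the simplest case, a mapping of this kind can be approximated by a linear least-squares fit from articulatory parameters to ASM coefficients; the sketch below is such a minimal stand-in with placeholder data and dimensions, not the authors' actual mapping.

```python
import numpy as np

# Minimal stand-in: fit a linear map (plus bias) from articulatory parameters
# to ASM coefficients by least squares, i.e. minimize the approximation error
# over the training lip contours. Dimensions and data are placeholders.
artic = np.random.rand(500, 5)         # articulatory parameters per frame
asm_coeffs = np.random.rand(500, 10)   # ASM coefficients measured on the same frames

X = np.hstack([artic, np.ones((len(artic), 1))])       # append bias column
W, *_ = np.linalg.lstsq(X, asm_coeffs, rcond=None)     # least-squares mapping

# At synthesis time, articulatory parameters estimated from speech would be
# pushed through the same map to drive the lip model.
predicted_coeffs = X[:1] @ W
```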

12.
Audio-visual integration in multimodal communication (cited: 7; self-citations: 0, other citations: 0)
We review recent research that examines audio-visual integration in multimodal communication. The topics include bimodality in human speech, human and automated lip reading, facial animation, lip synchronization, joint audio-video coding, and bimodal speaker verification. We also study the enabling technologies for these research topics, including automatic facial-feature tracking and audio-to-visual mapping. Recent progress in audio-visual research shows that joint processing of audio and video provides advantages that are not available when the audio and video are processed independently.

13.
Lifelike talking faces for interactive services (cited: 1; self-citations: 0, other citations: 1)
Lifelike talking faces for interactive services are an exciting new modality for man-machine interactions. Recent developments in speech synthesis and computer animation enable the real-time synthesis of faces that look and behave like real people, opening opportunities to make interactions with computers more like face-to-face conversations. This paper focuses on the technologies for creating lifelike talking heads, illustrating the two main approaches: model-based animations and sample-based animations. The traditional model-based approach uses three-dimensional wire-frame models, which can be animated from high-level parameters such as muscle actions, lip postures, and facial expressions. The sample-based approach, on the other hand, concatenates segments of recorded videos, instead of trying to model the dynamics of the animations in detail. Recent advances in image analysis enable the creation of large databases of mouth and eye images, suited for sample-based animations. The sample-based approach tends to generate more natural-looking animations, at the expense of larger size and less flexibility than the model-based animations. Besides lip articulation, a talking head must show appropriate head movements in order to appear natural. We illustrate how such "visual prosody" is analyzed and added to the animations. Finally, we present four applications where the use of face animation in interactive services results in engaging user interfaces and an increased level of trust between user and machine. Using an RTP-based protocol, face animation can be driven with only 800 bits/s in addition to the rate for transmitting audio.

14.
A system capable of producing near video-realistic animation of a speaker given only speech inputs is presented. The audio input is a continuous speech signal that requires no phonetic labelling and is speaker-independent. The system requires only a short video training corpus of a subject speaking a list of viseme-targeted words in order to achieve convincing realistic facial synthesis. The system learns the natural mouth and face dynamics of a speaker to allow new facial poses, unseen in the training video, to be synthesised. To achieve this, the authors have developed a novel approach which utilises a hierarchical and nonlinear principal components analysis (PCA) model which couples speech and appearance. Animation of different facial areas, defined by the hierarchy, is performed separately and merged in post-processing using an algorithm which combines texture and shape PCA data. It is shown that the model is capable of synthesising videos of a speaker using new audio segments from both previously heard and unheard speakers.

15.
This paper describes a new and efficient method for facial expression generation on cloned synthetic head models. The system uses abstract facial muscles called action units (AUs) based on both anatomical muscles and the facial action coding system. The facial expression generation method has real-time performance, is less computationally expensive than physically based models, and has greater anatomical correspondence than rational free-form deformation or spline-based techniques. Automatic cloning of a real human head is done by adapting a generic facial and head mesh to Cyberware laser scanned data. The conformation of the generic head to the individual data and the fitting of texture onto it are based on a fully automatic feature extraction procedure. Individual facial animation parameters are also automatically estimated during the conformation process. The entire animation system is hierarchical; emotions and visemes (the visual mouth shapes that occur during speech) are defined in terms of the AUs, and higher-level gestures are defined in terms of AUs, emotions, and visemes as well as the temporal relationships between them. The main emphasis of the paper is on the abstract muscle model, along with limited discussion on the automatic cloning process and higher-level animation control aspects.

16.
The author's goal is to generate a virtual space close to the real communication environment between network users or between humans and machines. There should be an avatar in cyberspace that projects the features of each user with a realistic texture-mapped face to generate facial expression and action controlled by a multimodal input signal. Users can also get a view in cyberspace through the avatar's eyes, so they can communicate with each other by gaze crossing. The face fitting tool from multi-view camera images is introduced to make a realistic three-dimensional (3-D) face model with texture and geometry very close to the original. This fitting tool is a GUI-based system using easy mouse operation to pick up each feature point on a face contour and the face parts, which enables easy construction of a 3-D personal face model. When an avatar is speaking, the voice signal is essential in determining the mouth shape feature. Therefore, a real-time mouth shape control mechanism is proposed by using a neural network to convert speech parameters to lip shape parameters. This neural network can realize an interpolation between specific mouth shapes given as learning data. The emotional factor can sometimes be captured by speech parameters. This media conversion mechanism is described. For dynamic modeling of facial expression, a muscle structure constraint is introduced for making a facial expression naturally with few parameters. We also tried to obtain muscle parameters automatically from a local motion vector on the face calculated by the optical flow in a video sequence.

17.
Application of MATLAB in computer-aided teaching of speech signal processing (cited: 3; self-citations: 0, other citations: 3)
This paper introduces the MATLAB language into the teaching of speech signal processing. Taking pitch period estimation and endpoint detection as examples, it explains the programming ideas and the writing and implementation of the programs in MATLAB. The speech-processing procedures are developed into executable programs that generate animation files; their parameters can be changed according to the needs of the analysis so that the process can be observed as it changes, which helps make the abstract concepts of speech signal processing concrete and deepens students' understanding of the theory.
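
As a small example of one of the exercises mentioned (endpoint detection), a double-threshold short-time-energy detector can be written in a few lines; the sketch below is in Python rather than MATLAB, and its frame sizes and thresholds are illustrative, not the paper's values.

```python
import numpy as np

def endpoint_detection(x, frame_len=256, hop=128, high=0.1, low=0.02):
    """Minimal double-threshold endpoint detector: frames above the high
    energy threshold are marked as speech, and each detected region is
    extended while the energy stays above the low threshold."""
    starts = range(0, len(x) - frame_len, hop)
    energy = np.array([np.sum(x[s:s + frame_len] ** 2) for s in starts])
    e_max = energy.max()
    speech = energy > high * e_max
    # extend each detected region while energy remains above the low threshold
    for i in range(1, len(speech)):
        if speech[i - 1] and energy[i] > low * e_max:
            speech[i] = True
    for i in range(len(speech) - 2, -1, -1):
        if speech[i + 1] and energy[i] > low * e_max:
            speech[i] = True
    return speech           # boolean mask per frame; True = speech frame
```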

18.
The MPEG-4 standard allows composition of natural or synthetic video with facial animation. Based on this standard, an animated face can be inserted into natural or synthetic video to create new virtual working environments such as virtual meetings or virtual collaborative environments. For these applications, audio-to-visual conversion techniques can be used to generate a talking face that is synchronized with the voice. In this paper, we address audio-to-visual conversion problems by introducing a novel Hidden Markov Model Inversion (HMMI) method. In training audio-visual HMMs, the model parameters {av} can be chosen to optimize some criterion such as maximum likelihood. In inversion of audio-visual HMMs, visual parameters that optimize some criterion can be found based on given speech and model parameters {av}. By using the proposed HMMI technique, an animated talking face can be synchronized with audio and can be driven realistically. The HMMI technique combined with the MPEG-4 standard to create a virtual conference system, named VIRTUAL-FACE, is introduced to show the role of HMMI for applications of MPEG-4 facial animation.

19.
20.
We describe the components of the system used for real-time facial communication using a cloned head. We begin with describing the automatic face cloning using two orthogonal photographs of a person. The steps in this process are the face model matching and texture generation. After an introduction to the MPEG-4 parameters that we are using, we proceed with the explanation of the facial feature tracking using a video camera. The technique requires an initialization step and is further divided into mouth and eye tracking. These steps are explained in detail. We then explain the speech processing techniques used for real-time phoneme extraction and the subsequent speech animation module. We conclude with the results and comments on the integration of the modules towards a complete system.
