Similar Documents
19 similar documents found.
1.
李皓  陈艳艳  唐朝京 《信号处理》2012,28(3):322-328
Chinese is a syllable-based language whose articulation follows a spindle-shaped ("date-pit") pattern. Accordingly, a model of Chinese dynamic visemes is proposed that separately models lip motion within a syllable and between syllables. Within a syllable, lip motion is described by sub-motion models based on initials and finals: lip feature parameters are extracted for the articulation of initials and finals, mouth shapes are grouped by these parameters to obtain a simplified syllable viseme model, and the mouth-shape similarity between the lip sub-motions and the syllable's articulation is computed. Between syllables, a weight function with graded vowel influence simulates coarticulation: the influence of each vowel on the mouth shape of the following consonant is analyzed, and the weight function then controls the actually articulated mouth shape. Experimental results show that, compared with representing Chinese dynamic visemes with monophone or triphone models, the method improves animation efficiency and yields more reasonable, natural Chinese lip-sync animation.

2.
Research on Chinese Monosyllable Recognition Using One-Third Syllables as Matching Units [I]
Based on an analysis of the characteristics of Chinese speech and the relevant theory of digital signal processing, this paper proposes a method for recognizing Chinese monosyllables using one-third syllables as matching units. Following the structure of the Chinese syllable, each monosyllable is divided into three matching units that are recognized separately and then recombined into the monosyllable; the approach lies between phoneme-level and syllable-level recognition. Experiments on a small vocabulary (104 syllables) show that the method largely retains the low computation and storage costs of phoneme-level recognition while preserving the higher recognition rate of syllable-level recognition, making it worth further exploration. This paper mainly describes the principle of the method and the experimental results for the final (rhyme) recognition part.

3.
This paper first introduces a method for testing the intelligibility of bone-conducted Chinese syllables, then reports measurements under a range of conditions: different pickup positions on the body, vibration pickups of different quality, different bandwidths, and different noise levels. The mishearing of initials and finals and the relationship between syllable intelligibility and perceived quality are analyzed, and the paper identifies the characteristics of bone-conducted speech and the maximum intelligibility attainable under optimal conditions.

4.
Segmentation of Chinese Initials and Finals Based on a Fuzzy-Rough Neural Network
A method is proposed for segmenting continuous Chinese speech into initials and finals. Taking extended initials and finals as the recognition units and adopting an overlapping-phoneme segmentation strategy for Chinese syllables, the method segments initials and finals automatically with a fuzzy-rough neural network. Laboratory experiments demonstrate that the method performs syllable segmentation effectively and soundly.

5.
Combining Acoustic Models Built on Different Units in Continuous Chinese Speech Recognition
张辉  杜利民 《电子与信息学报》2006,28(11):2045-2049
This paper studies the combination of acoustic models trained on different acoustic units. In continuous Chinese speech recognition, popular units include context-dependent initial/final units and phoneme units. Experiments show that some Chinese syllables are recognized more accurately under initial/final models, while others fare better under phoneme models. A method is proposed for combining the two kinds of acoustic model: both are used simultaneously during recognition, while whichever model causes a low recognition rate for a given syllable is avoided. Experiments show that with this method the syllable error rate falls by 9.60% and 6.10% relative to the phoneme models and the initial/final models, respectively.

6.
Traditional algorithms are known to have great difficulty segmenting Chinese speech into initials and finals, mainly because they are in essence linear mappings and are therefore limited when applied to the inherently nonlinear initial/final segmentation problem. This paper proposes and implements a segmentation algorithm for Chinese initials and finals based on a BP (back-propagation) artificial neural network. Computer simulations show that, after simple training on only a few typical syllables, the algorithm segments Chinese speech into initials and finals with far greater accuracy than traditional algorithms.
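The abstract gives no implementation details; below is a minimal sketch of the idea under stated assumptions: MFCC-like frame features, binary initial/final frame labels, and scikit-learn's MLPClassifier standing in for the authors' BP network.

```python
# Minimal sketch: frame-wise initial/final classification with a small
# back-propagation network. The feature representation and labels are
# assumptions; the paper does not specify its inputs.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Placeholder data: 1000 frames of 13-dim features, labeled
# 0 = initial (consonant region), 1 = final (vowel region).
X = rng.normal(size=(1000, 13))
y = rng.integers(0, 2, size=1000)

net = MLPClassifier(hidden_layer_sizes=(32,), activation="logistic",
                    solver="sgd", learning_rate_init=0.1, max_iter=500)
net.fit(X, y)

# A boundary is hypothesized wherever the predicted label changes
# between consecutive frames.
labels = net.predict(X)
boundaries = np.flatnonzero(np.diff(labels) != 0) + 1
print(boundaries[:10])
```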

7.
After reviewing the strengths and weaknesses of current speech synthesis approaches, the authors argue that parametric synthesis with syllable-sized units is particularly well suited to Chinese. A model of a Chinese text-to-speech system that synthesizes an unlimited vocabulary from a finite set of syllables is designed, experiments confirm its feasibility, and ways to further improve the naturalness of the synthesized Chinese speech are indicated.

8.
Research on Whole-Syllable Continuous Chinese Speech Recognition
In research on large-vocabulary continuous Chinese speech recognition, we exploit the characteristics of Chinese speech by choosing whole syllables as the recognition unit, combined with a syllable-pair grammar, with the aim of achieving large-vocabulary recognition. Continuous recognition requires no prior syllable segmentation; instead, a frame-synchronous network search algorithm with duration control is used. Evaluated on 180 text-unconstrained continuous sentences composed of arbitrary untrained words, the system achieved a syllable recognition rate of 40.40%.

9.
Based on spectral-envelope parameters, this paper systematically analyzes and compares common methods for frequency-domain segmentation of isolated syllables. On this basis, it proposes a spectral-compression segmentation method based on the discrete Karhunen-Loève transform (KLT) and a clustering segmentation method whose criterion is minimum within-segment scatter and maximum between-segment scatter. Experiments show that both methods considerably improve the segmentation of isolated Chinese syllables, providing new means for phoneme segmentation and feature extraction of Chinese speech.
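As a rough illustration of the KLT-based compression step, the sketch below projects spectral-envelope frames onto the leading eigenvectors of their covariance matrix; the paper's segmentation criteria are not reproduced, and the input data are placeholders.

```python
# Minimal sketch of the KLT (discrete Karhunen-Loeve transform) step:
# compress each spectral-envelope frame to a few coefficients by
# projecting onto the top principal directions of the frame covariance.
import numpy as np

def klt_compress(frames, k=3):
    """frames: (n_frames, n_bins) spectral-envelope matrix."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues ascending
    basis = eigvecs[:, ::-1][:, :k]          # top-k principal directions
    return centered @ basis                  # (n_frames, k) KLT coefficients

frames = np.abs(np.random.randn(200, 64))    # placeholder envelopes
coeffs = klt_compress(frames)
print(coeffs.shape)                          # (200, 3)
```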

10.
Through analysis of the characteristics of Chinese speech and of the DFT spectra of the various phoneme classes, in particular the spectral differences between unvoiced and voiced sounds, this paper identifies two near-optimal dynamic features usable for syllable segmentation of continuous speech. It also proposes a quantitative description of how the minima regions of the dynamic feature curves are distributed: the valley-function description. On this basis, a concrete segmentation algorithm is given. Experiments verify that the method is effective for syllable segmentation of continuous Chinese speech. Finally, the method is applied to spectrogram analysis, achieving for the first time automatic syllable-level segmentation of dynamic spectrograms of continuous speech.
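A minimal sketch of the valley idea, with short-time energy standing in for the paper's two dynamic features and a simple depth score standing in for its valley function; all thresholds and frame parameters are assumptions.

```python
# Minimal sketch: hypothesize syllable boundaries at sufficiently deep
# local-minima regions of a smoothed dynamic feature curve.
import numpy as np
from scipy.signal import argrelmin

def syllable_boundaries(x, sr, frame=0.025, hop=0.010):
    n, h = int(frame * sr), int(hop * sr)
    frames = np.lib.stride_tricks.sliding_window_view(x, n)[::h]
    energy = np.log1p((frames ** 2).sum(axis=1))
    smooth = np.convolve(energy, np.ones(5) / 5, mode="same")
    minima = argrelmin(smooth, order=5)[0]
    # Keep only valleys that lie clearly below the local level.
    depth = smooth.mean() - smooth[minima]
    return minima[depth > 0.5 * depth.std()] * h / sr   # boundary times (s)

sr = 16000
x = np.random.randn(sr * 2)      # placeholder signal
print(syllable_boundaries(x, sr))
```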

11.
Sinusoidal-Model Feature Analysis and Auditory Identification of Chinese Speech
张毅楠  肖熙 《电声技术》2011,35(8):38-41
To study the acoustic features of Chinese speech, a sinusoidal model of the speech signal is applied to feature extraction and analysis. Applying a peak-matching algorithm to the model parameters yields a sinusoidal-model-based spectrogram that directly shows the details and evolution of the fundamental frequency and the formants, providing a visualization tool for speech analysis. On this basis, the first two formants of Chinese single-final (monophthong) syllables are analyzed; controlling the use of only a few main…
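The sketch below illustrates the general peak-matching idea behind a sinusoidal-model spectrogram: pick spectral peaks per STFT frame, then link each to the nearest peak in the next frame. The paper's actual matching algorithm and parameters are not given in the abstract, so all settings here are assumptions.

```python
# Minimal sketch of sinusoidal-model peak matching across STFT frames.
import numpy as np
from scipy.signal import stft, find_peaks

def peak_tracks(x, sr, max_jump=50.0):
    f, t, Z = stft(x, fs=sr, nperseg=512)
    mags = np.abs(Z)
    peaks_per_frame = [f[find_peaks(mags[:, j],
                                    height=mags[:, j].max() * 0.1)[0]]
                       for j in range(mags.shape[1])]
    tracks = []
    for a, b in zip(peaks_per_frame[:-1], peaks_per_frame[1:]):
        pairs = [(fa, b[np.argmin(np.abs(b - fa))]) for fa in a if b.size]
        # Keep links only if the frequency jump is small enough.
        tracks.append([(fa, fb) for fa, fb in pairs if abs(fb - fa) <= max_jump])
    return tracks   # per-frame lists of (freq, matched next-frame freq)

sr = 16000
x = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)   # placeholder tone
print(peak_tracks(x, sr)[1][:3])
```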

12.
This paper describes the implementation of a Speech Understanding System component which tracks the formants of pseudo-syllabic nuclei containing voiced consonants. The nuclei are isolated from continuous speech after a precategorical classification in which feature extraction is carried out by modules organized in a hierarchy of levels. FFT and LPC spectra are the input to the formant tracking system. It works under the control of rules specifying the possible formant evolutions given previously hypothesized phonetic features, and produces fuzzy graphs rather than the usual formant patterns, because formants are not always evident in the spectrogram.

13.
A linear predictive coding (LPC) model based on time-dependent poles, which has yielded promising results on synthetic data, is applied to real speech data. The data are processed pitch-synchronously, using a simple procedure to identify the regions of the data that best fit the model. The maximum-likelihood technique, which has been found to be robust in the presence of noise, is used to estimate the parameters. Resulting formant estimates for several diphthongs are presented. The algorithm tracks the formants well, both in stable regions and in regions of transition. This ability to track formant variation within analysis intervals is a definite advantage over traditional LPC. Results from speech data involving final stop consonants are also presented: rapid changes, particularly in the first and second formants, are detected in the region immediately prior to the stop. Such abrupt transitions are often missed by traditional time-invariant methods.
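For contrast with the paper's time-dependent-pole model, here is a minimal sketch of the traditional frame-wise (time-invariant) LPC formant estimate it improves upon; the LPC order, pole-selection thresholds, and input frame are assumptions.

```python
# Minimal sketch: per-frame formant candidates from the roots of the
# LPC polynomial (the classical time-invariant approach).
import numpy as np
import librosa

def lpc_formants(frame, sr, order=12):
    a = librosa.lpc(frame, order=order)        # [1, a1, ..., a_p]
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]          # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)
    bw = -sr / np.pi * np.log(np.abs(roots))   # approximate 3 dB bandwidths
    keep = (freqs > 90) & (bw < 400)           # plausible formant poles only
    return np.sort(freqs[keep])

sr = 16000
frame = np.random.randn(400) * np.hanning(400)  # placeholder voiced frame
print(lpc_formants(frame, sr))
```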

14.
Two methods for estimating the vocal-tract length, taken as the length of an equivalent homogeneous acoustic tube, are investigated. One method calculates the tract length from the difference between the frequencies of adjacent local spectral maxima above 4 kHz. In the other, the vocal-tract length is calculated from the average frequency of the second formant, determined from the frequencies of the first three formants. In addition, variants of the analysis are discussed both irrespective of context and with the vowels known. The probability that the speaker's gender is correctly recognized via the two methods is about 13%, and this value is almost independent of knowledge of the context. The probabilities that male and female voices are correctly recognized from the spacing of the higher formants are 31% and 25.5%, respectively, regardless of context, and 37% and 31% with allowance for it. The probabilities of correct recognition of male and female voices reach 27% and 21.5%, respectively, when context-independent recognition is performed from the average frequency of the second formant, and 43% and 35.5% with context-dependent recognition and a known vowel type.
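The first method rests on a standard relation: for a homogeneous tube closed at one end, resonances fall at odd multiples of c/(4L), so adjacent spectral maxima are spaced Δf = c/(2L) apart and L = c/(2Δf). A worked example with illustrative peak frequencies (not taken from the paper):

```python
# Worked example: vocal-tract length from the spacing of adjacent
# spectral maxima above 4 kHz, via L = c / (2 * delta_f).
C = 350.0                                     # speed of sound in the tract, m/s

peaks_hz = [4500.0, 5480.0, 6490.0, 7510.0]   # illustrative maxima above 4 kHz
spacings = [b - a for a, b in zip(peaks_hz, peaks_hz[1:])]
delta_f = sum(spacings) / len(spacings)        # ~1003 Hz

length_m = C / (2 * delta_f)
print(f"estimated vocal-tract length: {length_m * 100:.1f} cm")  # ~17.4 cm
```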

15.
In this paper, we describe a group delay-based signal processing technique for the analysis and detection of hypernasal speech. Our preliminary acoustic analysis on nasalized vowels shows that, even though additional resonances are introduced at various frequency locations, the introduction of a new resonance in the low-frequency region (around 250 Hz) is found to be consistent. This observation is further confirmed by a perceptual analysis carried out on vowel sounds that are modified by introducing different nasal resonances, and by an acoustic analysis on hypernasal speech. Based on this, subsequent experiments focus only on the low-frequency region. The additive property of the group delay function can be exploited to resolve two closely spaced formants. However, when the formants are very close and have considerably wider bandwidths, as in hypernasal speech, the group delay function also fails to resolve them. To overcome this, we suggest a band-limited approach to estimate the locations of the formants. Using the band-limited group delay spectrum, we define a new acoustic measure for the detection of hypernasality. Experiments are carried out on the phonemes /a/, /i/, and /u/ uttered by 33 hypernasal speakers and 30 normal speakers. Using the group delay-based acoustic measure, the performance on a hypernasality detection task is found to be 100% for /a/, 88.78% for /i/, and 86.66% for /u/. The effectiveness of this acoustic measure is further cross-verified on speech data collected in an entirely different recording environment.
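A minimal sketch of a band-limited group delay spectrum, using the standard identity τ(k) = (X_R Y_R + X_I Y_I)/|X(k)|² with y[n] = n·x[n]; the band edges and test signal are assumptions, not the paper's acoustic measure.

```python
# Minimal sketch: group delay restricted to the low-frequency band where
# the extra nasal resonance (around 250 Hz) is reported.
import numpy as np

def band_limited_group_delay(x, sr, lo=0.0, hi=500.0, nfft=2048):
    n = np.arange(len(x))
    X = np.fft.rfft(x, nfft)
    Y = np.fft.rfft(n * x, nfft)
    tau = (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)
    freqs = np.fft.rfftfreq(nfft, 1.0 / sr)
    band = (freqs >= lo) & (freqs <= hi)
    return freqs[band], tau[band]

sr = 16000
t = np.arange(0, 0.04, 1 / sr)
x = np.sin(2 * np.pi * 250 * t) + 0.5 * np.sin(2 * np.pi * 700 * t)
freqs, tau = band_limited_group_delay(x * np.hanning(len(x)), sr)
print(freqs[np.argmax(tau)])    # candidate low-frequency resonance location
```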

16.
Huang  X.D. Jack  M.A. Duncan  G. 《Electronics Letters》1987,23(20):1047-1048
An algorithm is introduced which labels formants from the peaks of pole-focused spectra. A clustering procedure is first used to produce line segments of possible formants. These can be considered as anchor traces for the later processing. Rule-based labelling is then applied to provide final formant trace estimates. Experimental results show that the proposed algorithm offers improved formant labelling accuracy.

17.
This paper proposes a new method for extracting the formants of a speech signal. The frequency of the largest local maximum of the LPC magnitude spectrum is found and taken as the angle of the conjugate complex root pair corresponding to one resonant cavity of the vocal tract; the magnitude of this root pair is then obtained by combining the first and third derivatives of the phase-frequency characteristic of the LPC coefficients, which determines the resonator and hence its formant. The LPC polynomial is then divided by the polynomial corresponding to that resonator to obtain new LPC coefficients, and the preceding steps are repeated. In this way the two largest-amplitude formants in the LPC spectrum can be estimated well.
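The sketch below reproduces the iterate-and-deflate structure of the method, with one substitution: direct root-finding (np.roots) stands in for the paper's phase-derivative estimate of the root magnitude. The example LPC coefficients are placeholders.

```python
# Minimal sketch: locate the dominant resonator of the LPC polynomial A(z),
# divide it out, and repeat to obtain the two dominant formants.
import numpy as np

def two_dominant_formants(a, sr):
    """a: LPC coefficients [1, a1, ..., ap] of A(z)."""
    formants = []
    for _ in range(2):
        roots = np.roots(a)
        roots = roots[np.imag(roots) > 0]
        # Dominant pole = closest to the unit circle.
        r = roots[np.argmax(np.abs(roots))]
        formants.append(np.angle(r) * sr / (2 * np.pi))
        # Divide A(z) by the resonator (1 - r z^-1)(1 - conj(r) z^-1).
        resonator = np.array([1.0, -2 * r.real, abs(r) ** 2])
        a, _ = np.polydiv(a, resonator)
    return formants

a = np.array([1.0, -2.76, 3.81, -2.65, 0.92])   # placeholder LPC coefficients
print(two_dominant_formants(a, 8000))
```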

18.
An approach to finding readable patterns corresponding to Arabic numerals is presented. A two-dimensional CRT indicator for depicting the approximate locations of the first and second formants of the spoken numerals is described.

19.
This paper describes a novel end-to-end deep generative model-based speaker recognition system using prosodic features. The usefulness of variational autoencoders (VAE) in learning speaker-specific prosody representations for the speaker recognition task is examined herein for the first time. The speech signal is first automatically segmented into syllable-like units using vowel onset points (VOP) and energy valleys. Prosodic features, such as the dynamics of duration, energy, and fundamental frequency (F0), are then extracted at the syllable level and used to train/adapt a speaker-dependent VAE from a universal VAE. Initial comparative studies on VAEs and traditional autoencoders (AE) suggest that the former learn speaker representations more efficiently. Investigations of the impact of gender information in speaker recognition also show that gender-dependent impostor banks lead to higher accuracies. Finally, evaluation on the NIST SRE 2010 dataset demonstrates the usefulness of the proposed approach for speaker recognition.
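A minimal sketch of the VAE core over syllable-level prosody vectors; the input dimension, architecture, and training details are assumptions, and the universal-to-speaker adaptation and impostor-bank scoring steps are omitted.

```python
# Minimal sketch: a small VAE over syllable-level prosodic feature vectors
# (duration/energy/F0 dynamics), trained with the standard ELBO objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProsodyVAE(nn.Module):
    def __init__(self, in_dim=12, hidden=64, latent=8):
        super().__init__()
        self.enc = nn.Linear(in_dim, hidden)
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.Tanh(),
                                 nn.Linear(hidden, in_dim))

    def forward(self, x):
        h = torch.tanh(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    rec = F.mse_loss(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

model = ProsodyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 12)                  # placeholder prosody vectors
recon, mu, logvar = model(x)
vae_loss(x, recon, mu, logvar).backward()
opt.step()
```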
