Similar Documents (10 results)
1.
Encoding frequency modulation to improve cochlear implant performance in noise   Cited by: 10 (self-citations: 0, others: 10)
Unlike traditional Fourier analysis, a signal can be decomposed into amplitude- and frequency-modulation components. The speech-processing strategy in most modern cochlear implants extracts and encodes only amplitude modulation in a limited number of frequency bands. While amplitude-modulation encoding has allowed cochlear implant users to achieve good speech recognition in quiet, their performance in noise is severely compromised. Here, we propose a novel speech-processing strategy that encodes both amplitude and frequency modulation in order to improve cochlear implant performance in noise. By removing the center frequency from the subband signals and additionally limiting the frequency modulation's range and rate, the present strategy transforms the rapidly varying temporal fine structure into a slowly varying frequency-modulation signal. As a first step, we evaluated the potential contribution of the added frequency modulation to speech recognition in noise via acoustic simulations of the cochlear implant. We found that while amplitude modulation from a limited number of spectral bands is sufficient to support speech recognition in quiet, frequency modulation is needed to support speech recognition in noise. In particular, an improvement of as much as 71 percentage points was observed for sentence recognition in the presence of a competing voice. The present result strongly suggests that frequency modulation be extracted and encoded to improve cochlear implant performance in realistic listening situations. We have proposed several implementation methods to stimulate further investigation. Index terms: amplitude modulation, cochlear implant, fine structure, frequency modulation, signal processing, speech recognition, temporal envelope.
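The AM/FM decomposition described above can be sketched as follows. This is a minimal illustration using an FFT-based analytic signal; `am_fm_decompose` and the test tone are our own constructions, not the authors' implementation:

```python
import numpy as np

def am_fm_decompose(subband, fs, fc):
    """Split a subband signal into an amplitude envelope (AM) and a
    slowly varying frequency-modulation track (FM) around center fc."""
    # Analytic signal via the FFT one-sided-spectrum trick.
    n = len(subband)
    spec = np.fft.fft(subband)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    analytic = np.fft.ifft(spec * h)
    am = np.abs(analytic)                        # temporal envelope (AM)
    phase = np.unwrap(np.angle(analytic))
    inst_freq = np.diff(phase) * fs / (2 * np.pi)
    fm = inst_freq - fc                          # remove the center frequency
    return am, fm

fs, fc = 16000, 1000.0
t = np.arange(4096) / fs
# Test tone: 1 kHz carrier with a slow 5 Hz frequency wobble of +/-50 Hz.
x = np.cos(2 * np.pi * fc * t + 10.0 * np.sin(2 * np.pi * 5.0 * t))
am, fm = am_fm_decompose(x, fs, fc)
```

After removing the 1 kHz center frequency, `fm` is the slow deviation track (here oscillating within about +/-50 Hz), which is the kind of band-limited FM signal the strategy would encode alongside the envelope `am`.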

2.
Current voice-cloning methods based on pretrained speaker encoders can synthesize speech with high timbre similarity for speakers seen during training, but for speakers unseen during training, the cloned speech still differs noticeably in timbre from the real speaker. To address this problem, this paper proposes a timbre-consistent speaker feature extraction method. The method uses the state-of-the-art speaker recognition model TitaNet as the backbone of the speaker encoder and, drawing on the prior knowledge that a speaker's timbre remains constant within a speech segment, introduces a timbre-consistency constraint loss for training the speaker encoder, so as to extract more accurate speaker timbre features and improve the robustness and generalization of the speaker representation. The extracted features are then fed into the end-to-end speech synthesis model VITS for voice cloning. Experimental results show that the proposed method outperforms the baseline system on two public speech datasets, improving the timbre similarity of cloned speech for unseen speakers.
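The timbre-consistency constraint can be illustrated with a toy loss between embeddings of two segments of the same utterance. The function name and the cosine-distance form are our assumptions; the abstract does not give the exact loss:

```python
import numpy as np

def timbre_consistency_loss(emb_a, emb_b):
    """Penalize disagreement between speaker embeddings taken from two
    segments of the same utterance: 1 - cosine similarity."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return 1.0 - float(np.dot(a, b))

rng = np.random.default_rng(0)
e = rng.normal(size=192)                   # embedding of segment 1
same = e + 0.01 * rng.normal(size=192)     # segment 2, same speaker
other = rng.normal(size=192)               # embedding of a different speaker
loss_same = timbre_consistency_loss(e, same)
loss_other = timbre_consistency_loss(e, other)
```

Two segments of one utterance yield a near-zero loss, while an unrelated embedding yields a large one, which is the direction in which such a constraint pushes the encoder during training.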

3.
Minimizing morphological variances of the vocal tract across speakers is a challenge for articulatory analysis and modeling. In order to reduce morphological differences in speech organs among speakers while retaining speakers' speech dynamics, our study proposes a method of normalizing the vocal-tract shapes of Mandarin and Japanese speakers using a thin-plate spline (TPS) method. We apply the properties of TPS in a two-dimensional space to normalize vocal-tract shapes. Furthermore, we use DNN (deep neural network)-based speech recognition for our evaluations. We obtained our template for normalization by measuring three speakers' palate and tongue shapes. Our results show a reduction in variances among subjects. The similar vowel structure of pre- and post-normalization data indicates that our framework retains speaker-specific characteristics. Our results for the articulatory recognition of isolated phonemes show an improvement of 25%. Moreover, our phone error rate on continuous speech was reduced by 5.84%.
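A minimal 2-D thin-plate-spline warp of the kind used for such normalization might look like this. This is our own sketch (the landmark values and `tps_warp` are illustrative, not the paper's code): landmarks on one speaker's vocal tract are mapped onto template landmarks, and the fitted spline then warps any other points.

```python
import numpy as np

def tps_warp(src_pts, dst_pts, query):
    """Fit a 2-D thin-plate spline mapping src_pts onto dst_pts and
    evaluate it at the query points."""
    def U(r2):
        # TPS kernel r^2 log r, written as 0.5 * r^2 * log(r^2); U(0) = 0.
        with np.errstate(divide='ignore', invalid='ignore'):
            out = 0.5 * r2 * np.log(r2)
        return np.nan_to_num(out)

    n = len(src_pts)
    d2 = ((src_pts[:, None, :] - src_pts[None, :, :]) ** 2).sum(-1)
    K = U(d2)
    P = np.hstack([np.ones((n, 1)), src_pts])
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst_pts
    coef = np.linalg.solve(A, b)          # bending weights + affine part
    w, a = coef[:n], coef[n:]
    q2 = ((query[:, None, :] - src_pts[None, :, :]) ** 2).sum(-1)
    return U(q2) @ w + np.hstack([np.ones((len(query), 1)), query]) @ a

src = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [0.5, 0.5]])
dst = src * 1.2                           # template: a uniformly scaled shape
mapped = tps_warp(src, dst, src)
```

The spline interpolates the landmarks exactly, and because this toy target is affine, off-landmark points also follow the same scaling; with real palate/tongue landmarks the non-affine bending term does the speaker normalization.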

4.
A speaker-adaptation algorithm for Mandarin digit speech recognition   Cited by: 4 (self-citations: 0, others: 4)
Speaker adaptation is one of the effective ways to improve the performance of speaker-independent speech recognition. This paper applies the MAP algorithm to Mandarin digit speech recognition and discusses several methods for speeding up adaptation, as well as the effect of adaptation on non-adapted speakers. Experiments show that the MAP algorithm can effectively reduce the digit recognition error rate for the adapted speaker, while having little impact on the performance for non-adapted speakers.
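The MAP mean update at the heart of such adaptation can be sketched for a single Gaussian. This is a simplified illustration; the prior weight `tau` and the toy data are our assumptions:

```python
import numpy as np

def map_adapt_mean(prior_mean, data, tau=10.0):
    """MAP update of a Gaussian mean: with little adaptation data the
    estimate stays near the speaker-independent prior; with more data it
    moves toward the speaker's own sample mean. tau is the prior weight."""
    n = len(data)
    return (tau * prior_mean + data.sum(0)) / (tau + n)

prior = np.zeros(2)                               # speaker-independent mean
rng = np.random.default_rng(1)
obs = rng.normal(loc=[3.0, -2.0], scale=0.1, size=(200, 2))
adapted = map_adapt_mean(prior, obs)              # plenty of adaptation data
little = map_adapt_mean(prior, obs[:2])           # only two frames
```

This interpolation between prior and data is what keeps performance for non-adapted speakers largely intact: with no (or little) speaker data, the model barely moves from the speaker-independent prior.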

5.
Because there are many parameters in the cochlear implant (CI) device that can be optimized for individual patients, it is important to estimate a parameter's effect before patient evaluation. In this paper, Mel-frequency cepstral coefficients (MFCCs) were used to estimate the acoustic vowel space for vowel stimuli processed by CI simulations. The acoustic space was then compared to vowel recognition performance by normal-hearing subjects listening to the same processed speech. Five CI speech-processor parameters were simulated to produce different degrees of spectral resolution, spectral smearing, spectral warping, spectral shifting, and amplitude distortion. The acoustic vowel space was highly correlated with normal-hearing subjects' vowel recognition performance for parameters that affected the number of spectral channels and the degree of spectral smearing. However, the acoustic vowel space was not significantly correlated with perceptual performance for parameters that affected the degree of spectral warping, spectral shifting, and amplitude distortion. In particular, while spectral warping and shifting did not significantly reshape the acoustic space, vowel recognition performance was significantly affected by these parameters. The results of the acoustic analysis suggest that the CI device can preserve phonetic distinctions under conditions of spectral warping and shifting. Auditory training may help CI patients better perceive the speech cues transmitted by their speech processors.
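One simple way to quantify an acoustic vowel space, as a hedged illustration of the idea (the mean-pairwise-distance metric and the toy centroids are ours, not the paper's MFCC pipeline):

```python
import numpy as np

def vowel_space(centroids):
    """Mean pairwise Euclidean distance between vowel centroids in some
    feature space (e.g. MFCC). Spectral smearing pulls the centroids
    together and shrinks this measure."""
    c = np.asarray(centroids)
    d = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=-1)
    n = len(c)
    return d.sum() / (n * (n - 1))        # average over ordered pairs

vowels = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
smeared = 0.4 * vowels                    # smearing collapses the space
full = vowel_space(vowels)
small = vowel_space(smeared)
```

A degradation such as smearing shrinks the measured space, matching the perceptual drop; a pure warp or shift, by contrast, can leave pairwise distinctions (and hence such a metric) largely intact even when listeners are affected.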

6.
The Oregon Graduate Institute Multi-language Telephone Speech Corpus (OGI-TS) was designed specifically for language identification research. It currently consists of spontaneous and fixed-vocabulary utterances in 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. These utterances were produced by 90 native speakers of each language over real telephone lines. Language identification is related to speaker-independent speech recognition and speaker identification in several interesting ways. It is therefore not surprising that many of the recent developments in language identification can be related to developments in those two fields. We review some of the more important recent approaches to language identification against the background of successes in speaker and speech recognition. In particular, we demonstrate how approaches to language identification based on acoustic modeling and language modeling, respectively, are similar to algorithms used in speaker-independent continuous speech recognition. Thereafter, prosodic and duration-based information sources are studied. We then review an approach to language identification that draws heavily on speaker identification. Finally, the performance of some representative algorithms is reported.

7.
The speech signal is decomposed through adapted local trigonometric transforms. The decomposed signal is classified into M uniform sub-bands for each subinterval. The energy of each sub-band is used as a speech feature. This feature is applied to vector quantisation and the hidden Markov model. The new speech feature shows a slightly better recognition rate than the cepstrum for speaker-independent speech recognition. The new speech feature also shows a lower standard deviation across speakers than does the cepstrum.
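The sub-band energy feature can be sketched as follows. Note that we substitute an FFT power spectrum for the adapted local trigonometric transform, so this is only an approximation of the paper's feature:

```python
import numpy as np

def subband_energies(frame, m=8):
    """Split a frame's one-sided power spectrum into m uniform bands and
    return the log energy in each band as an m-dimensional feature."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    bands = np.array_split(spec, m)
    return np.log(np.array([b.sum() for b in bands]) + 1e-12)

fs = 8000
t = np.arange(256) / fs
low = np.sin(2 * np.pi * 300 * t)     # tone in the lowest band (0-500 Hz)
high = np.sin(2 * np.pi * 3700 * t)   # tone near Nyquist (3500-4000 Hz)
```

The resulting m-vector per frame is what would feed the VQ codebook or HMM observation model in place of cepstral coefficients.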

8.
A novel adaptive discriminative vector quantisation technique for speaker identification (ADVQSI) is introduced. In the training mode of ADVQSI, the speech feature vector space is divided into a number of subspaces for each speaker. The feature-space segmentation is based on the difference between the probability distribution of the speech feature vectors from each speaker and that from all speakers in the speaker identification (SI) group. Then, an optimal discriminative weight, which represents the subspace's role in SI, is calculated for each subspace of each speaker by employing adaptive techniques. The largest template differences between speakers in the SI group are achieved by using the optimal discriminative weights. In the testing mode of ADVQSI, discriminatively weighted average vector quantisation (VQ) distortions are used for SI decisions. The performance of ADVQSI is analysed and tested experimentally. The experimental results confirm the performance improvement of the proposed technique in comparison with existing VQ techniques for SI and recently reported discriminative VQ techniques for SI (DVQSI).
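The discriminatively weighted VQ distortion used for the identification decision might be sketched like this (codebooks, weights, and data are toy values of our own; the paper's adaptive weight estimation is not reproduced):

```python
import numpy as np

def weighted_vq_distortion(frames, codebook, weights):
    """Quantise each frame to its nearest codeword and average the
    per-codeword (subspace) distortions under discriminative weights."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(1)                              # nearest codeword per frame
    return float(np.mean(weights[idx] * d[np.arange(len(frames)), idx]))

rng = np.random.default_rng(2)
cb_a = np.array([[0.0, 0.0], [2.0, 2.0]])          # speaker A codebook
cb_b = np.array([[5.0, 5.0], [7.0, 7.0]])          # speaker B codebook
w = np.ones(2)                                     # uniform toy weights
test_frames = rng.normal(loc=[0.1, 0.1], scale=0.2, size=(50, 2))
score_a = weighted_vq_distortion(test_frames, cb_a, w)
score_b = weighted_vq_distortion(test_frames, cb_b, w)
```

The identity decision goes to the speaker with the smallest weighted distortion; in ADVQSI the weights would up-weight subspaces that discriminate well between speakers rather than being uniform.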

9.
In speaker identification, the presence of two or more similar-sounding speakers can lead to misidentification. To improve identification accuracy in this situation, speaker-specific features are found at the phoneme level, and a subset of these features is assembled into a feature set specific to each speaker; speakers are then identified on the basis of these feature sets using GMM and i-vector methods. Speech from 50 speakers was collected in a laboratory environment and tested under different signal-to-noise ratios. Experimental results show that the proposed method improves identification accuracy when similar-sounding speakers are present.
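Selecting speaker-specific features for a confusable pair of speakers could be sketched with a Fisher-style ranking. This selection criterion is our assumption, since the abstract does not specify how the feature subsets are chosen:

```python
import numpy as np

def discriminative_feature_subset(feats_a, feats_b, k=3):
    """Rank feature dimensions by a Fisher-style ratio (squared mean gap
    over pooled variance) between two speakers and keep the top k."""
    mu_gap = (feats_a.mean(0) - feats_b.mean(0)) ** 2
    pooled = feats_a.var(0) + feats_b.var(0) + 1e-12
    score = mu_gap / pooled
    return np.argsort(score)[::-1][:k]

rng = np.random.default_rng(3)
a = rng.normal(size=(100, 6))          # phoneme-level features, speaker A
b = rng.normal(size=(100, 6))          # phoneme-level features, speaker B
a[:, 2] += 4.0                         # dimension 2 separates the speakers
sel = discriminative_feature_subset(a, b, k=1)
```

The selected dimensions would then form the speaker-specific feature set passed to the GMM / i-vector back end.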

10.
Currently, many speaker recognition applications must handle speech corrupted by environmental additive noise without a priori knowledge of the noise characteristics. Some previous work in speaker recognition has used the missing-feature (MF) approach to compensate for noise. In most of those applications, the spectral-reliability decision is made with the signal-to-noise ratio (SNR) criterion, which attempts to measure the relative signal and noise energy at each frequency directly. An alternative approach to spectral data reliability has been used with some success in MF-based speech recognition. Here, we compare this new criterion with the SNR criterion for MF mask estimation in speaker recognition. The new reliability decision is based on extracting and analyzing several spectro-temporal features from across the entire speech frame (but not across time) that highlight the differences between spectral regions dominated by speech and by noise. We call it the feature-classification (FC) criterion. Unlike the SNR criterion, which relies on a single feature (the SNR itself), it uses several spectral features to establish spectrogram reliability. We evaluated our proposal through speaker verification experiments on the Ahumada speech database corrupted by different types of noise at various SNR levels. The experiments demonstrated that the FC criterion achieves considerably better recognition accuracy than the SNR criterion on the speaker verification tasks tested.
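The baseline SNR reliability criterion for missing-feature masks can be sketched as follows (the threshold and noise estimate are illustrative; the paper's FC criterion replaces this single-feature test with several spectro-temporal features):

```python
import numpy as np

def snr_mask(power_spec, noise_est, threshold_db=0.0):
    """Mark a time-frequency cell reliable when its local SNR estimate
    exceeds a threshold; unreliable cells are treated as missing."""
    snr_db = 10 * np.log10(np.maximum(power_spec / noise_est, 1e-12))
    return snr_db > threshold_db

noise = np.full(4, 1.0)                     # flat noise-power estimate
frame = np.array([10.0, 0.1, 5.0, 0.5])     # two speech-dominated cells
mask = snr_mask(frame, noise)
```

Cells flagged False are excluded (or marginalized) in the downstream speaker-verification scoring; the FC criterion's advantage is that its reliability decision does not hinge on this single, noise-estimate-sensitive quantity.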
