Similar Documents
 20 similar documents found (search time: 468 ms)
1.
To give the reading robot JoyT0n a wider variety of reading voices, a voice transformation system based on a single speech corpus was designed. The initial voice synthesized by the robot's TTS (text-to-speech) engine is decomposed into an excitation signal and a vocal-tract filter signal, which are converted to the frequency domain for modification. Speaking rate, pitch, and timbre are transformed by reconstructing the excitation signal from its short-time Fourier magnitude spectrum and by modifying the vocal-tract filter parameters. The modified excitation and filter signals are then recombined into a new voice signal. This voice transformation system lets the reading robot read aloud with rich emotion and intonation without increasing the size of the speech corpus.
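The excitation/vocal-tract decomposition described above can be sketched with classical LPC analysis-synthesis. The following is a toy illustration, not the JoyT0n system itself: a synthetic "voiced frame" is inverse-filtered into an excitation signal, and the vocal-tract filter turns it back into speech. Modifying A(z) between the two steps is where a timbre change would happen.

```python
import numpy as np

rng = np.random.default_rng(0)

def filt(b, a, x):
    """Direct-form filter: y[n] = sum(b*x) - sum(a[1:]*y), with a[0] = 1."""
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = sum(b[k] * x[n - k] for k in range(len(b)) if n >= k)
        y[n] -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n >= k)
    return y

def lpc(x, order):
    """Autocorrelation-method LPC: returns the prediction-error filter A(z)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))

# toy "voiced frame": two harmonics plus a little noise (regularizes the LPC fit)
t = np.arange(800) / 8000.0
frame = np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 400 * t)
frame = frame + 0.01 * rng.standard_normal(len(frame))

A = lpc(frame, order=10)
excitation = filt(A, [1.0], frame)    # inverse filter -> excitation signal
resynth = filt([1.0], A, excitation)  # vocal-tract filter -> reconstructed speech
```

Because the two filters are exact inverses, the reconstruction matches the input; in a variant system the excitation and A(z) would be modified before resynthesis.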

2.
Facial visual information and speech are the most direct and flexible channels in human-computer interaction, so intelligent cross-modal perception of faces and voices has attracted broad attention from researchers at home and abroad. However, because of the heterogeneity of face-voice samples and the semantic gap between them, existing methods handle the more difficult cross-modal face-voice matching tasks poorly. A cross-modal face-voice feature learning framework combining a two-stream network with a bidirectional quintuplet loss is proposed; the features learned by this framework...

3.
This article presents a new voice conversion algorithm that removes the need for a parallel corpus in the training phase and also addresses the problem of an insufficient target-speaker corpus. The proposed approach builds on a voice conversion model that combines the classical LPC analysis-synthesis model with a GMM. Through this algorithm, conversion functions among vowels and demi-syllables are derived. We assume that these functions are much the same for different speakers whose genders, accents, and languages are alike. We are therefore able to produce the demi-syllables with access to only a few sentences from the target speaker, forming the GMM for one of his/her vowels. Evaluation of the proposed method shows that it can effectively reproduce the speech characteristics of the target speaker, and that it yields results comparable to those obtained with parallel-corpus-based approaches.
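The conversion function at the heart of such GMM-based approaches is a responsibility-weighted mixture of per-component linear regressions. A minimal one-dimensional sketch, with all joint-GMM parameters assumed given rather than trained:

```python
import numpy as np

def gmm_convert(x, w, mu_x, mu_y, var_x, cov_xy):
    """E[y | x] under a joint (x, y) GMM: the classical conversion function."""
    # responsibility of each mixture component for the source feature x
    p = w * np.exp(-0.5 * (x - mu_x) ** 2 / var_x) / np.sqrt(2 * np.pi * var_x)
    p = p / p.sum()
    # mix the per-component linear regressions by responsibility
    return np.sum(p * (mu_y + cov_xy / var_x * (x - mu_x)))

# toy joint-GMM parameters (assumed trained elsewhere)
w = np.array([0.5, 0.5])
mu_x, mu_y = np.array([0.0, 5.0]), np.array([2.0, 9.0])
var_x, cov_xy = np.array([1.0, 1.0]), np.array([0.8, 0.8])
converted = gmm_convert(0.0, w, mu_x, mu_y, var_x, cov_xy)  # close to mu_y[0]
```

With a single component the conversion reduces to one linear map; with several, the responsibilities interpolate smoothly between the local regressions.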

4.
An improved cross-lingual model adaptation method for speech synthesis
In statistical parametric speech synthesis, cross-lingual model adaptation is applied when the target speaker's language differs from the source model's: a small amount of the target speaker's speech is used to quickly build a synthesis system in the source-model language that carries the target speaker's voice characteristics. This paper improves the traditional cross-lingual adaptation method based on phoneme mapping and triphone models in two ways: a phoneme mapping combined with data selection improves the reliability of the mapping, and a cross-lingual mapping of prosodic information compensates for the triphone model's weakness in representing prosody. Experiments on a Chinese-English cross-lingual model adaptation system show that the improved system clearly outperforms the traditional method in both the naturalness and the speaker similarity of the synthesized speech.

5.
Automatic speaker verification (ASV) automatically accepts or rejects a claimed identity based on a speech sample. Recently, individual studies have confirmed the vulnerability of state-of-the-art text-independent ASV systems to replay, speech synthesis and voice conversion attacks on various databases. However, the behaviour of text-dependent ASV systems has not been systematically assessed against such spoofing attacks. In this work, we first conduct a systematic analysis of text-dependent ASV systems under replay and voice conversion attacks using the same protocol and database, in particular the RSR2015 database, which represents mobile-device-quality speech. We then analyse the interplay of voice conversion and speaker verification by linking voice conversion objective evaluation measures with speaker verification error rates, examining the vulnerabilities from the perspective of voice conversion.
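The error rate usually quoted for such vulnerability analyses is the equal error rate (EER), the operating point where the false-acceptance and false-rejection rates coincide. A small sketch, not the paper's evaluation code:

```python
import numpy as np

def eer(genuine, impostor):
    """Equal error rate: operating point where false accepts = false rejects."""
    best = 1.0
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)   # impostor trials wrongly accepted
        frr = np.mean(genuine < t)     # genuine trials wrongly rejected
        best = min(best, max(far, frr))
    return best
```

Spoofed trials scored as "genuine-like" impostors push the two curves together and the EER up, which is exactly the degradation such studies report.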

6.
7.
This paper presents the design and development of an unrestricted text-to-speech (TTS) synthesis system for Bengali. An unrestricted TTS system can synthesize good-quality speech across different domains. In this work, syllables are used as the basic units for synthesis, and the Festival framework is used to build the system. Speech collected from a female artist serves as the speech corpus. Initially, speech from five speakers was collected and a prototype TTS system was built from each; the best of the five speakers was selected through subjective and objective evaluation of natural and synthesized waveforms. The unrestricted TTS system was then developed by addressing the issues involved at each stage of producing a good-quality synthesizer. Evaluation was carried out in four stages through objective measures and subjective listening tests on synthesized speech. At the first stage, the TTS system was built with the basic Festival framework; in the following stages, additional features were incorporated and the synthesis quality re-evaluated. The subjective and objective measures indicate that the proposed features and methods improved the quality of the synthesized speech from stage 2 to stage 4.

8.
Speaker recognition authenticates a speaker's identity from detected speech by matching the input utterance against the speaker models in a database. A GMM-based speaker recognition system is designed: Mel-frequency cepstral coefficients extracted from the input speech serve as observation vectors, and the GMM algorithm is used to train the speaker models and perform recognition. An embedded hardware platform based on the TMS320DM3730 DSP is also designed, and the recognition system is implemented on it. For performance testing, a speech corpus was recorded from 38 speakers (19 male, 19 female). Tests show that the designed speaker recognition system achieves a recognition rate above 95% under normal conditions.
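The scoring step of such a system, choosing the speaker whose GMM assigns the highest average log-likelihood to the observed MFCC vectors, can be sketched as follows. The model parameters here are toy values, not trained ones:

```python
import numpy as np

def diag_gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of frames X under a diagonal GMM.

    X: (T, D) feature frames; weights: (K,); means/variances: (K, D).
    """
    diff = X[:, None, :] - means[None, :, :]                   # (T, K, D)
    log_comp = (-0.5 * np.sum(diff ** 2 / variances
                              + np.log(2 * np.pi * variances), axis=2)
                + np.log(weights))                             # (T, K)
    return np.mean(np.logaddexp.reduce(log_comp, axis=1))

def identify(X, speaker_models):
    """Pick the speaker whose GMM best explains the observation vectors."""
    return max(speaker_models,
               key=lambda s: diag_gmm_loglik(X, *speaker_models[s]))

# toy single-component models for two hypothetical speakers
models = {
    "spk_a": (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]])),
    "spk_b": (np.array([1.0]), np.array([[5.0, 5.0]]), np.array([[1.0, 1.0]])),
}
```

On a DSP target the same computation would typically be done in fixed point with precomputed log terms, but the decision rule is this argmax.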

9.
Detecting changing emotions in human speech by machine and humans
The goals of this research were: (1) to develop a system that will automatically measure changes in the emotional state of a speaker by analyzing his/her voice, (2) to validate this system with a controlled experiment and (3) to visualize the results to the speaker in 2-d space. Natural (non-acted) human speech of 77 (Dutch) speakers was collected and manually divided into meaningful speech units. Three recordings per speaker were collected, in which he/she was in a positive, neutral and negative state. For each recording, the speakers rated 16 emotional states on a 10-point Likert Scale. The Random Forest algorithm was applied to 207 speech features that were extracted from recordings to qualify (classification) and quantify (regression) the changes in speaker’s emotional state. Results showed that predicting the direction of change of emotions and predicting the change of intensity, measured by Mean Squared Error, can be done better than the baseline (the most frequent class label and the mean value of change, respectively). Moreover, it turned out that changes in negative emotions are more predictable than changes in positive emotions. A controlled experiment investigated the difference in human and machine performance on judging the emotional states in one’s own voice and that of another. Results showed that humans performed worse than the algorithm in the detection and regression problems. Humans, just like the machine algorithm, were better in detecting changing negative emotions rather than positive ones. Finally, results of applying the Principal Component Analysis (PCA) to our data provided a validation of dimensional emotion theories and they suggest that PCA is a promising technique for visualizing user’s emotional state in the envisioned application.
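The 2-d visualization step the abstract mentions rests on PCA. A minimal numpy sketch of projecting feature vectors onto the top two principal components, illustrative only and not the study's pipeline:

```python
import numpy as np

def pca_2d(X):
    """Project feature vectors onto their top two principal components."""
    Xc = X - X.mean(axis=0)                  # center the features
    cov = Xc.T @ Xc / (len(X) - 1)           # sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    top2 = vecs[:, np.argsort(vals)[::-1][:2]]
    return Xc @ top2

# stand-in for the 207 extracted speech features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = pca_2d(X)
```

Plotting `Y` gives the kind of 2-d emotional-state map the envisioned application would show the speaker.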

10.
关勇, 李鹏, 刘文举, 徐波. Acta Automatica Sinica (自动化学报), 2009, 35(4): 410-416
Traditional noise-robust algorithms cannot solve the robustness problem of automatic speech recognition (ASR) systems against competing-speaker backgrounds. This paper proposes a mixed-speech separation system based on computational auditory scene analysis (CASA) and speaker model information. Within the CASA framework, the system uses speaker model information and factorial-max vector quantization (MAXVQ) to estimate real-valued masks, effectively separating the target speaker's speech from two-speaker mixtures and thereby providing a robust recognition front-end for the ASR system. Evaluation on the Speech Separation Challenge (SSC) dataset shows that the proposed system improves speech recognition accuracy by 15.68% over the baseline. The experimental results also validate the effectiveness of the proposed multi-speaker identification and real-valued mask estimation.
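Real-valued time-frequency masking of the kind estimated here can be illustrated with an ideal ratio mask computed from known target and interferer magnitudes. This is a simplification: it assumes the magnitudes add, and it is not the paper's MAXVQ estimator, which must infer the mask without seeing the clean sources:

```python
import numpy as np

def ideal_ratio_mask(target_mag, masker_mag):
    """Real-valued T-F mask: fraction of mixture energy belonging to the target."""
    return target_mag ** 2 / (target_mag ** 2 + masker_mag ** 2 + 1e-12)

# toy spectrogram magnitudes for the target and the interfering speaker
target = np.array([[3.0, 0.1], [0.2, 4.0]])
masker = np.array([[0.1, 2.0], [3.0, 0.1]])
mask = ideal_ratio_mask(target, masker)
recovered = mask * (target + masker)   # apply the mask to the mixture magnitude
```

Cells dominated by the target get a mask near 1, cells dominated by the interferer a mask near 0, which is what makes the masked mixture a usable ASR front-end.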

11.
Building on research into underdetermined blind speech separation using target-source direction information with nonlinear time-frequency masking, and on BP-network speaker recognition, an effective solution is designed for the multi-speaker conversation scenarios of everyday life, extracting the speech of an arbitrary target speaker located in an arbitrary direction. The scheme has two stages, target-speech search and extraction: the search stage uses BP-network speaker recognition, and the extraction stage uses an improved underdetermined blind separation method that combines potential-function clustering of source direction information with nonlinear time-frequency masking. Experimental results show that the scheme is feasible and can effectively extract the target speaker's speech from the mixed speech stream, with an average SNR gain of 8.68 dB, a similarity coefficient of 85%, a recognition rate of 61%, and a running time of 20.6 s.

12.
Speaker recognition refers to recognizing a speaker from his/her voice or speech samples. Important applications of speaker recognition include customer verification for bank transactions, access to bank accounts by telephone, control of credit card use, and security in the army, navy and air force. This paper is a tutorial review of classifier-based methods for speaker recognition. Both unsupervised and supervised classifiers are described. In addition, practical approaches that exploit diversity, redundancy and fusion strategies are discussed with the aim of improving performance.

13.
Three experiments are reported that use new experimental methods for the evaluation of text-to-speech (TTS) synthesis from the user's perspective. Experiment 1, using sentence stimuli, and Experiment 2, using discrete “call centre” word stimuli, investigated the effect of voice gender and signal quality on the intelligibility of three concatenative TTS synthesis systems. Accuracy and search time were recorded as on-line, implicit indices of intelligibility during phoneme detection tasks. It was found that both voice gender and noise affect intelligibility. Results also indicate interactions of voice gender, signal quality, and TTS synthesis system on accuracy and search time. In Experiment 3 the method of paired comparisons was used to yield ranks of naturalness and preference. As hypothesized, preference and naturalness ranks were influenced by TTS system, signal quality and voice, in isolation and in combination. The pattern of results across the four dependent variables – accuracy, search time, naturalness, preference – was consistent. Natural speech surpassed synthetic speech, and TTS system C elicited relatively high scores across all measures. Intelligibility, judged naturalness and preference are modulated by several factors and there is a need to tailor systems to particular commercial applications and environmental conditions.

14.
This paper reports progress in the synthesis of conversational speech, from the viewpoint of work carried out on the analysis of a very large corpus of expressive speech in normal everyday situations. With recent developments in concatenative techniques, speech synthesis has overcome the barrier of realistically portraying extra-linguistic information by using the actual voice of a recognizable person as a source for units, combined with minimal use of signal processing. However, the technology still faces the problem of expressing paralinguistic information, i.e., the variety in the types of speech and laughter that a person might use in everyday social interactions. Paralinguistic modification of an utterance portrays the speaker's affective states and shows his or her relationships with the listener through variations in the manner of speaking, by means of prosody and voice quality. These inflections are carried on the propositional content of an utterance, and can perhaps be modeled by rule, but they are also expressed through nonverbal utterances, the complexity of which may be beyond the capabilities of many current synthesis methods. We suggest that this problem may be solved by the use of phrase-sized utterance units taken intact from a large corpus.

15.
A text-to-speech (TTS) system, also known as a speech synthesizer, has become an important technology in recent years owing to its expanding range of applications. Much work on speech synthesis has targeted English and French, while many other languages, including Arabic, have been considered only recently. Arabic speech synthesis has not seen sufficient progress and is still at an early stage, with low speech quality. Speech synthesis systems face several problems (e.g. speech quality, the articulatory effect, etc.), and different methods have been proposed to address them, such as the use of large inventories of units of different sizes. That method is mainly implemented within the concatenative approach to improve speech quality, and several works have proved its effectiveness. This paper presents an efficient Arabic TTS system based on the statistical parametric approach and non-uniform-unit speech synthesis. Our system includes a diacritization engine: modern Arabic text is written without vowel marks (diacritics), yet these marks are essential to determine the correct pronunciation of the text, which is why the diacritization engine is incorporated into our system. We propose a simple approach based on deep neural networks, which are trained to directly predict the diacritic marks and to predict the spectral and prosodic parameters. Furthermore, we propose a new, simple stacked-neural-network approach to improve the accuracy of the acoustic models. Experimental results show that our diacritization system generates fully diacritized text with high precision and that our synthesis system produces high-quality speech.
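The stacked-network idea, a second network that refines the first network's prediction while still seeing the original input, can be sketched as a forward pass. Layer sizes and weights below are arbitrary toy values, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, W, b, act=np.tanh):
    """One fully connected layer."""
    return act(x @ W + b)

D_IN, D_HID, D_OUT = 8, 16, 4   # toy layer sizes (assumed, not the paper's)

# stage 1: predicts acoustic/diacritic targets from the input features
W1, b1 = rng.normal(size=(D_IN, D_HID)), np.zeros(D_HID)
W2, b2 = rng.normal(size=(D_HID, D_OUT)), np.zeros(D_OUT)

# stage 2 (the "stacked" network): sees the input AND stage 1's prediction
W3, b3 = rng.normal(size=(D_IN + D_OUT, D_HID)), np.zeros(D_HID)
W4, b4 = rng.normal(size=(D_HID, D_OUT)), np.zeros(D_OUT)

x = rng.normal(size=(1, D_IN))
stage1 = dense(dense(x, W1, b1), W2, b2, act=lambda z: z)
stage2 = dense(dense(np.concatenate([x, stage1], axis=1), W3, b3),
               W4, b4, act=lambda z: z)
```

In training, both stages would be fit to the same targets; the second stage can then correct systematic errors of the first.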

16.
For voiceprint recognition, a recognition algorithm based on the hidden Markov model-universal background model (HMM-UBM) is proposed. To cope with the very small amount of enrollment speech available per person in voiceprint-password applications, a speaker-independent HMM over Chinese syllable initials and finals is first trained on a large amount of other speakers' data as the universal background model; the speaker model is then obtained by adapting this UBM with the enrollment speech under the maximum a posteriori (MAP) criterion. This method solves the problem of insufficient training data in voiceprint password recognition. On the iFLYTEK (讯飞) desktop database II, a system using this algorithm achieves an equal error rate of 6.8%.
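MAP adaptation of the background-model means toward the enrollment data is the core of this scheme. A sketch for diagonal statistics, where `tau` is the relevance factor controlling how strongly the prior (the UBM) resists the data; the shapes and values are illustrative:

```python
import numpy as np

def map_adapt_means(ubm_means, frames, resp, tau=10.0):
    """MAP mean adaptation: interpolate UBM means toward the speaker data.

    ubm_means: (K, D); frames: (T, D); resp: (T, K) component responsibilities.
    """
    n_k = resp.sum(axis=0)                                    # soft counts, (K,)
    ex_k = resp.T @ frames / np.maximum(n_k, 1e-12)[:, None]  # data means, (K, D)
    alpha = (n_k / (n_k + tau))[:, None]                      # adaptation weight
    return alpha * ex_k + (1.0 - alpha) * ubm_means

ubm = np.array([[0.0], [10.0]])
frames = np.full((100, 1), 5.0)                 # enrollment data near 5
resp = np.tile(np.array([[1.0, 0.0]]), (100, 1))
adapted = map_adapt_means(ubm, frames, resp)
```

Components that see plenty of enrollment data move toward the data mean; components with no data keep their UBM mean, which is what makes the method usable with short passwords.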

17.
A LabVIEW-based voice identity authentication system is designed. Using LabVIEW 2009 as the development platform, an improved Mel-frequency cepstral coefficient method extracts the speech features, and a vector quantization model performs the recognition, realizing text- and gender-independent voiceprint recognition. Experimental results show that the system effectively withstands environmental noise and variations in the speaker's voice.
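Vector-quantization speaker modelling of this kind trains one codebook per speaker and identifies a test utterance by the codebook with the lowest average distortion. A compact sketch with a toy k-means, not the LabVIEW implementation:

```python
import numpy as np

def train_codebook(features, k, iters=20, seed=0):
    """Tiny k-means: learn a VQ codebook from one speaker's feature vectors."""
    rng = np.random.default_rng(seed)
    code = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(features[:, None] - code[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave empty cells untouched
                code[j] = features[labels == j].mean(axis=0)
    return code

def distortion(features, code):
    """Average distance from each test vector to its nearest codeword."""
    return np.linalg.norm(features[:, None] - code[None], axis=2).min(axis=1).mean()

# toy "speakers": feature clouds in different regions of feature space
rng = np.random.default_rng(1)
spk_a = rng.normal(size=(60, 2))
spk_b = rng.normal(size=(60, 2)) + 5.0
cb_a, cb_b = train_codebook(spk_a, 4), train_codebook(spk_b, 4)
```

A test utterance is assigned to the speaker whose codebook quantizes it with the least distortion.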

18.
For the problem of tracking the anchor in broadcast news programs, an algorithm that effectively combines speaker segmentation-and-clustering with speaker verification is proposed, and an anchor tracking system is built on it. The system first removes silent segments from the news audio with a voice activity detection algorithm; a speaker segmentation-and-clustering algorithm then divides the multi-speaker audio into single-speaker segments; finally, a GMM-UBM speaker verification algorithm decides whether each single-speaker segment belongs to the target anchor. The effect of T-Norm on system performance is also analyzed. Evaluated on CCTV's Xinwen Lianbo (《新闻联播》) news program, the algorithm performs well: the tracking system achieves a precision of 93.03% and a recall of 84.34%.
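Two of the ingredients above, T-Norm score normalization and the precision/recall metrics, are simple to state in code. A sketch under the assumption that tracked segments are represented as sets of segment ids (a representation chosen here for illustration):

```python
import numpy as np

def t_norm(raw, cohort):
    """T-Norm: scale a verification score by an impostor-cohort distribution."""
    return (raw - np.mean(cohort)) / np.std(cohort)

def precision_recall(found, reference):
    """found / reference: sets of segment ids labelled as the target anchor."""
    tp = len(found & reference)
    return tp / len(found), tp / len(reference)
```

T-Norm makes one decision threshold meaningful across segments by expressing each raw score in standard deviations above the impostor cohort.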

19.
This paper presents an expressive voice conversion model (DeBi-HMM) as the post processing of a text-to-speech (TTS) system for expressive speech synthesis. DeBi-HMM is named for its duration-embedded characteristic of the two HMMs for modeling the source and target speech signals, respectively. Joint estimation of source and target HMMs is exploited for spectrum conversion from neutral to expressive speech. Gamma distribution is embedded as the duration model for each state in source and target HMMs. The expressive style-dependent decision trees achieve prosodic conversion. The STRAIGHT algorithm is adopted for the analysis and synthesis process. A set of small-sized speech databases for each expressive style is designed and collected to train the DeBi-HMM voice conversion models. Several experiments with statistical hypothesis testing are conducted to evaluate the quality of synthetic speech as perceived by human subjects. Compared with previous voice conversion methods, the proposed method exhibits encouraging potential in expressive speech synthesis.

20.
Speaker recognition with fused likelihood scores from multiple subsystems
李恒杰. Journal of Computer Applications (计算机应用), 2008, 28(1): 116-119
To address the insufficient speech data and telephone-channel mismatch in text-independent speaker recognition on short telephone utterances, a speaker-clustering-based strategy for fusing the output likelihood scores of multiple subsystems is proposed. Target speakers are clustered by model similarity under the KLD and GLR measures, and within each speaker cluster a likelihood-score fusion system is built from three subsystems using different feature types: MFCC, LPCC, and SSFE. Overall performance is improved by the complementarity of the subsystems, combining the recognition accuracy of the MFCC and LPCC features with the robustness of SSFE, and by using a different score-fusion network for each speaker cluster. Evaluated on the NIST SRE 05 data, the speaker-clustering fusion strategies under the KLD and GLR measures reduce the equal error rate by 10.3% and 8.7%, respectively, compared with traditional unclustered multi-system score fusion.
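The KLD model-similarity measure used for the speaker clustering can be written down in closed form for diagonal-covariance Gaussians. A sketch of the symmetrized divergence between two single-Gaussian speaker models (a simplification of the full GMM case):

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """KL divergence KL(N1 || N2) between two diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var2 / var1)
                        + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def sym_kl(mu1, var1, mu2, var2):
    """Symmetrized KL, usable as a model-level distance for clustering."""
    return kl_gauss(mu1, var1, mu2, var2) + kl_gauss(mu2, var2, mu1, var1)
```

Speakers whose models lie close under this distance land in the same cluster and share one score-fusion network.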
