
1.

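Zhang Xin, Hu Hangye, Cao Xinyi, Wang Wei. Journal of Data Acquisition and Processing, 2022, 37(4): 909-916

Speech synthesis technology is maturing. To improve the quality of synthesized emotional speech, this paper proposes a method that combines end-to-end emotional speech synthesis with prosody correction: starting from emotional speech synthesized by a Tacotron model, prosodic parameters are modified to increase the emotional expressiveness of the system. First, the Tacotron model is trained on a large neutral corpus and then on a small emotional corpus, so that it synthesizes emotional speech. Next, the Praat acoustic analysis tool is used to analyze the prosodic features of the emotional speech in the corpus and to summarize how the parameters behave in different emotional states. Finally, these patterns are used to correct the fundamental frequency (F0), duration, and energy of the corresponding emotional speech synthesized by Tacotron, making the emotional expression more accurate. Objective emotion recognition experiments and subjective evaluation show that the method synthesizes emotional speech that is fairly natural and more expressive.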
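The final correction step can be pictured as applying per-emotion scaling rules to the prosody of the Tacotron output. This is a minimal sketch, not the authors' implementation: the `ProsodyRule` fields, the values in `RULES`, and the data layout (frame-level F0 in Hz with 0 for unvoiced frames, segment durations in seconds, frame-level energy) are hypothetical stand-ins for the patterns the paper derives with Praat.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProsodyRule:
    """Hypothetical per-emotion correction factors, relative to the synthesized speech."""
    f0_mean_ratio: float   # scales the utterance's mean F0
    f0_range_ratio: float  # scales each voiced frame's deviation from the mean F0
    duration_ratio: float  # scales every segment duration
    energy_ratio: float    # scales every frame's energy


# Illustrative values only; real factors would be fitted from corpus analysis.
RULES = {
    "happy": ProsodyRule(f0_mean_ratio=1.2, f0_range_ratio=1.3, duration_ratio=0.9, energy_ratio=1.2),
    "sad": ProsodyRule(f0_mean_ratio=0.9, f0_range_ratio=0.7, duration_ratio=1.2, energy_ratio=0.8),
}


def apply_rule(f0, durations, energy, rule):
    """Return corrected (f0, durations, energy); unvoiced frames (F0 <= 0) stay 0."""
    voiced = [v for v in f0 if v > 0]
    mean = sum(voiced) / len(voiced) if voiced else 0.0
    # Expand or compress the contour around its mean, then shift the whole register.
    new_f0 = [
        0.0 if v <= 0 else rule.f0_mean_ratio * (mean + rule.f0_range_ratio * (v - mean))
        for v in f0
    ]
    new_durations = [d * rule.duration_ratio for d in durations]
    new_energy = [e * rule.energy_ratio for e in energy]
    return new_f0, new_durations, new_energy
```

For example, `apply_rule([100.0, 0.0, 200.0], [0.1, 0.2], [1.0, 0.5], RULES["happy"])` widens the pitch range, raises the register, shortens segments, and boosts energy. Turning the corrected parameters back into a waveform (for instance with PSOLA-style resynthesis) is a separate step not shown.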

2.

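Sign Language to Emotional Speech Conversion

Wang Weizhe, Guo Weitong, Yang Hongwu. Computer Engineering & Science, 2022, 44(10): 1869-1876

To ease communication between people with speech impairments and people without them, this paper proposes a neural-network-based method for converting sign language into emotional speech. First, a gesture corpus, a facial expression corpus, and an emotional speech corpus are built. Deep convolutional neural networks are then used for gesture recognition and facial expression recognition. With Mandarin initials and finals as synthesis units, a speaker-adaptive deep neural network (DNN) acoustic model and a speaker-adaptive hybrid long short-term memory (LSTM) acoustic model for emotional speech are trained. Finally, the context-dependent labels of the gesture semantics and the emotion labels corresponding to the facial expressions are fed into the emotional speech synthesis model, which synthesizes the corresponding emotional speech. Experiments show gesture and facial expression recognition rates of 95.86% and 92.42%, respectively, and an EMOS score of 4.15 for the synthesized emotional speech, which conveys emotion to a high degree and can support everyday communication between people with speech impairments and people without them.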

3.

A Survey of Emotionally Expressive Visual Speech Synthesis


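Cao Liang, Zhao Hui. Computer Engineering & Science, 2015, 37(4): 813-818

This paper summarizes and analyzes key recent research results and methods in emotional visual speech synthesis. According to the underlying synthesis mechanism, it systematically classifies emotional visual speech synthesis techniques into image-based and model-based approaches and compares their strengths, weaknesses, and performance. The discussion focuses on how, and how far, the visual speech synthesized in each cited work achieves realism and emotional expressiveness. Finally, the paper identifies the issues that matter most when synthesizing emotionally expressive visual speech, pointing out directions for further research.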

4.

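Trainable Speech Synthesis Based on the PAD Emotion Model

Chen Yanxiang, Long Runtian. Pattern Recognition and Artificial Intelligence, 2013, 26(11): 1019-1025

Emotional speech synthesis is an active topic in affective computing and speech signal processing, and accurate emotion analysis of speech is a prerequisite for synthesizing high-quality emotional speech. This paper uses the PAD (Pleasure-Arousal-Dominance) emotion model as the quantitative model for emotion analysis: the speech in an emotional corpus is analyzed and clustered to obtain a PAD parameter model for each emotion. Emotional speech synthesized by an HMM-based synthesis system is then corrected with the PAD model, which makes the emotional parameters of the synthesized speech more accurate and thus improves the quality of emotional speech synthesis. Experiments show that the method noticeably improves the naturalness and emotional clarity of synthesized speech and also performs well across different speakers of the same gender.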
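The analysis-and-clustering step can be illustrated by averaging annotated PAD coordinates per emotion and assigning a new utterance to the nearest centroid. This is a minimal sketch with hypothetical names and data: the abstract does not specify the clustering algorithm, the distance measure, or how the PAD model is turned into acoustic corrections, so the hypothetical `pad_offset` only shows the direction in which a correction would move a synthesized utterance in PAD space.

```python
import math
from collections import defaultdict


def pad_centroids(samples):
    """Average the (P, A, D) coordinates of all utterances that share an emotion label.

    samples: iterable of (label, (pleasure, arousal, dominance)) pairs.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for label, pad in samples:
        for i in range(3):
            sums[label][i] += pad[i]
        counts[label] += 1
    return {label: tuple(s / counts[label] for s in sums[label]) for label in sums}


def nearest_emotion(pad, centroids):
    """Label whose centroid is closest (Euclidean distance) to a PAD point."""
    return min(centroids, key=lambda label: math.dist(pad, centroids[label]))


def pad_offset(pad, target):
    """Vector from a synthesized utterance's PAD point to a target emotion centroid."""
    return tuple(t - p for p, t in zip(pad, target))
```

With hypothetical annotations such as `("angry", (-0.5, 0.6, 0.4))` and `("sad", (-0.6, -0.4, -0.4))`, `nearest_emotion` labels a new point by proximity, and `pad_offset` gives the gap a correction step would try to close.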

5.

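Research on an Emotional Speech Synthesis Algorithm Based on Prosodic Feature Parameters (total citations: 1; self-citations: 0; citations by others: 1)

He Ling, Huang Hua, Liu Xiaoheng. Computer Engineering and Design, 2013, 34(7)

To synthesize more natural emotional speech, this paper proposes an emotional speech synthesis system based on the acoustic prosodic parameters of the speech signal and the time-domain pitch-synchronous overlap-add (TD-PSOLA) algorithm. Prosodic parameters of four emotions in an emotional speech database (anger, boredom, happiness, and sadness) are analyzed to build four emotion templates; waveform concatenation synthesis is then combined with TD-PSOLA to synthesize speech carrying the target emotion. Experiments show that adjusting the prosodic feature parameters of neutral speech with the waveform concatenation algorithm yields satisfactory emotional speech: the target emotions are clearly perceptible, and the accuracy of subjective emotion classification is high.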
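The pitch-modification part of TD-PSOLA can be sketched as follows. This minimal version assumes the pitch marks (one per glottal period) are already known, uses Hann-windowed two-period grains, and changes pitch while keeping the overall length; the function name `td_psola_pitch_shift` and the nearest-mark grain selection are illustrative choices rather than the paper's implementation, and the emotion templates and waveform concatenation are omitted.

```python
import math


def hann(length):
    """Symmetric Hann window of the given length."""
    if length == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (length - 1)) for i in range(length)]


def td_psola_pitch_shift(x, marks, alpha):
    """Scale pitch by alpha (> 1 raises it) while keeping the signal length.

    x: list of samples; marks: sorted analysis pitch-mark indices (at least two).
    """
    # Local pitch period at each analysis mark.
    periods = []
    for i in range(len(marks)):
        if i == 0:
            periods.append(marks[1] - marks[0])
        elif i == len(marks) - 1:
            periods.append(marks[-1] - marks[-2])
        else:
            periods.append((marks[i + 1] - marks[i - 1]) // 2)

    y = [0.0] * len(x)
    t = float(marks[0])  # current synthesis mark position
    while t < marks[-1]:
        # Take the grain from the analysis mark nearest to the synthesis mark.
        k = min(range(len(marks)), key=lambda j: abs(marks[j] - t))
        p = periods[k]
        window = hann(2 * p + 1)
        centre = int(round(t))
        # Overlap-add a two-period windowed grain centred on the synthesis mark.
        for n in range(-p, p + 1):
            src, dst = marks[k] + n, centre + n
            if 0 <= src < len(x) and 0 <= dst < len(y):
                y[dst] += x[src] * window[n + p]
        # Synthesis marks are spaced by the local period divided by alpha.
        t += p / alpha
    return y
```

For a sine with a 100-sample period and marks on its peaks, `alpha=1.25` respaces the grains every 80 samples, so the output period becomes 80 samples while the length is unchanged. Duration changes would instead repeat or skip grains.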

6.

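Design of an English Translation Terminal Based on Speech Recognition

涂琼引成南. Automation & Instrumentation, 2023, (1): 251-256

Traditional English translation systems cannot accurately recognize a speaker's speech and tone. To address this, an English translation system based on speech recognition and tone-aware speech synthesis is designed; its terminal mainly comprises speech recognition, language translation, tone recognition, tone conversion, and tone-aware speech synthesis modules. A CVAE (conditional variational autoencoder) tone speech synthesis model synthesizes speech with the appropriate tone for the English sentences produced by speech recognition and translation, forming the basis of a portable English translation terminal. Experiments show that the error between the F0 contour of speech synthesized by the CVAE-based model and that of the original speech is only 0.02, so the two contours are very close. In subjective evaluation, the model obtains a naturalness MOS of 3.84 with a variance of only 0.004, and an average emotional-tone consistency score of 3.72 with a variance of 0.002. Overall, the model generates speech well, with both diversity and accuracy. In the deployed system, the model improves the speech recognition and tone-aware speech synthesis of the English translation terminal, giving the system strong performance.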
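The abstract does not say how the 0.02 F0 error is computed. One common way to compare two F0 contours, sketched here as an assumption, is the root-mean-square error over frames that are voiced in both contours (F0 in Hz, 0 marking unvoiced frames, contours already time-aligned). The reported figure may use a different metric or normalization.

```python
import math


def f0_rmse(reference, synthesized):
    """RMSE (Hz) between two time-aligned F0 contours, over frames voiced in both."""
    pairs = [(r, s) for r, s in zip(reference, synthesized) if r > 0 and s > 0]
    if not pairs:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum((r - s) ** 2 for r, s in pairs) / len(pairs))
```

Frames where only one contour is voiced are skipped here; a voicing-decision error rate is often reported alongside to account for them.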

7.

A Speech Synthesis Model Based on the Description of Chinese Prosodic Features


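Wu Bingya, Ju Chunhua. Computer Engineering & Science, 2007, 29(10): 128-131

Appropriate use of Chinese prosody lets synthesized speech convey the correct discourse meaning and emotional coloring. This paper presents a speech synthesis model based on the description of Chinese prosodic features. It first describes the analysis and extraction of prosodic features such as pauses and lengthening, word stress, sentence stress, tone sandhi, and tone patterns, detailing the various cases of each. It then presents the prosody-based synthesis algorithm model, including its processing pipeline (word segmentation, labeling, analysis, pattern determination, correction, and output) and the generation of the acoustic parameter sequence {(h, l, s)} of the synthesized speech. Finally, experimental results for the model are given and analyzed.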

8.

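A Study of Acoustic Feature Parameters Between Specific Speakers

Du Jia, Chen Yanpu, Yang Junqiang. Journal of Computer Applications, 2009, 29(Z2)

Speaker-specific voice feature conversion is a powerful tool for personalized speech synthesis and has important military and civilian applications. With low algorithmic complexity and ease of application in mind, the paper briefly reviews the strengths and weaknesses of current mainstream speaker conversion algorithms, focuses on the relationships between acoustic feature parameters such as F0 and formants across specific speakers, and finally summarizes processing rules for speaker-specific voice feature conversion. The rule-based algorithm is simple to implement and ensures that the main acoustic features of the converted speech match those of the target speaker, providing a solid experimental basis for developing speaker conversion algorithms with high target similarity and good real-time performance.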
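The abstract does not give the paper's conversion rules. A widely used baseline for the F0 part of speaker conversion, shown here only as an illustration, is the mean-and-variance transform of log F0: each voiced source value is mapped as log f_t = mu_t + (sigma_t / sigma_s)(log f_s - mu_s), where mu and sigma are the log-F0 mean and standard deviation of the source (s) and target (t) speakers. Formant conversion, which the paper also studies, is not covered by this sketch.

```python
import math
import statistics


def log_f0_stats(f0):
    """Mean and population standard deviation of log F0 over voiced frames (F0 > 0)."""
    logs = [math.log(v) for v in f0 if v > 0]
    return statistics.mean(logs), statistics.pstdev(logs)


def convert_f0(f0, source_stats, target_stats):
    """Map a source-speaker F0 contour onto the target speaker's log-F0 distribution."""
    mu_s, sd_s = source_stats
    mu_t, sd_t = target_stats
    return [
        0.0 if v <= 0 else math.exp(mu_t + (sd_t / sd_s) * (math.log(v) - mu_s))
        for v in f0
    ]
```

In practice the statistics would be estimated from each speaker's training recordings; here `log_f0_stats` can be run on any voiced F0 list to obtain them.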

9.

A Survey of Deep Learning Speech Synthesis Techniques


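Zhang Xiaofeng, Xie Jun, Luo Jianxin, Yang Tao. Computer Engineering and Applications, 2021, 57(9): 50-59

Speech synthesis plays an important role in human-computer interaction, and advances in deep learning have driven its rapid development. Deep-learning-based speech synthesis surpasses traditional techniques in both the quality and the speed of synthesized speech. This survey covers deep-learning-based vocoders and acoustic models, discussing the working principles, strengths, and weaknesses of each type; on that basis it reviews classic deep-learning-based speech synthesis systems and closes with an outlook on the field.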

10.

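Research on Speech Synthesis Based on Statistical Acoustic Modeling

Hu Yu, Ling Zhenhua, Wang Renhua, Dai Lirong. Journal of Chinese Information Processing, 2011, 25(6): 127-137

This paper introduces speech synthesis based on statistical acoustic modeling, focusing on innovative work by the iFLYTEK Speech Lab of the University of Science and Technology of China in this frontier area. The work includes integrating articulatory parameters with acoustic parameters to make acoustic parameter generation more flexible; replacing the maximum likelihood criterion with the minimum generation error criterion to improve the quality of synthesized speech; and replacing reconstruction by a parametric synthesizer with unit selection and waveform concatenation to overcome the limited voice quality of parametric synthesizers. These innovations further improve the naturalness, expressiveness, flexibility, and multilingual capability of speech synthesis systems and have promoted the wide application of speech synthesis in call-center information services, human-machine voice interaction on mobile and embedded devices, and intelligent speech-based teaching.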

11.

Emotional Vietnamese Speech Synthesis Using Style-Transfer Learning

Thanh X. Le, An T. Le, Quang H. Nguyen. Computer Systems Science and Engineering, 2023, 44(2): 1263-1278

In recent years, speech synthesis systems have become able to produce very high-quality voices, so research in this domain is now turning to the problem of integrating emotions into speech. However, building a separate speech synthesizer for each emotion has several limitations. First, it usually requires an emotional-speech data set with many sentences, and such data sets are very time- and labor-intensive to build. Second, training each of these models requires computers with large computational capabilities and considerable effort and time for model tuning. In addition, a per-emotion model cannot take advantage of the data sets of other emotions. In this paper, we propose a new method to synthesize emotional speech in which the latent expressions of emotions are learned from a small data set of professional actors through a Flowtron model. We also provide a new method for building a speech corpus that is scalable and whose quality is easy to control. To produce a high-quality speech synthesis model, we used this data set to train a Tacotron 2 model, which then served as the pre-trained model for training the Flowtron model. We applied this method to synthesize Vietnamese speech with sadness and happiness. Mean opinion score (MOS) assessment gives 3.61 for sadness and 3.95 for happiness. In conclusion, the proposed method is effective, offering a high degree of automation and fast emotional sentence generation from a small emotional-speech data set.
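MOS figures like those reported here are averages of listener ratings, typically on a 1-to-5 scale. A minimal sketch of computing a MOS together with a confidence interval, assuming independent ratings and a normal approximation (z = 1.96 for roughly 95%), is shown below; the function name `mos_with_ci` is hypothetical and the paper's own evaluation protocol is not described in the abstract.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Mean opinion score and a normal-approximation confidence interval."""
    if len(scores) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(scores)
    # Half-width = z * sample standard deviation / sqrt(number of ratings).
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, (mean - half_width, mean + half_width)
```

Reporting the interval alongside the mean makes it easier to judge whether a gap such as 3.61 versus 3.95 exceeds the rating noise for the number of listeners used.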

12.

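Research on Computer Speech Synthesis Technology and Its Development Directions

Liu Yujun, Xia Cong. Network Security Technology & Application, 2014, (12): 22

With the development of modern science and technology and the spread of computers, intelligent computing technologies have permeated every field, and computer technology itself has improved considerably in practice. Speech synthesis is an important topic currently being studied by researchers in the speech field. As living standards rise, so do expectations of what computers can do, and human-computer communication is one of the most demanded capabilities. The main goal of speech synthesis is to enable computers to communicate through speech. A speech synthesis system, also called a text-to-speech (TTS) system, is a complex system that converts text into speech; the synthesized speech is expected to be clear, natural, intelligible, and reasonably expressive, but current technology still falls short of these expectations and needs further research. This paper briefly analyzes several computer speech synthesis techniques, such as parametric synthesis, recording-and-editing synthesis, waveform synthesis, and pitch-synchronous overlap-add, and discusses their main development directions, including read-aloud functions for various kinds of text, speaking style, speaker-image construction and visual speech, and emotional speech, as a reference for practitioners.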

13.

Humanoid Audio–Visual Avatar With Emotive Text-to-Speech Synthesis

IEEE Transactions on Multimedia, 2008, 10(6): 969-981

Emotive audio–visual avatars are virtual computer agents that have the potential to significantly improve the quality of human-machine interaction and human-human communication. However, the understanding of human communication has not yet advanced to the point where it is possible to make realistic avatars that interact using natural-sounding emotive speech and realistic-looking emotional facial expressions. In this paper, we propose the technical approaches of a novel multimodal framework leading to a text-driven emotive audio–visual avatar. Our primary work focuses on emotive speech synthesis, realistic emotional facial expression animation, and the co-articulation between speech gestures (i.e., lip movements) and facial expressions. A general framework for emotive text-to-speech (TTS) synthesis using a diphone synthesizer is designed and integrated into a generic 3-D avatar face model. Guided by this framework, we developed a realistic 3-D avatar prototype. A rule-based emotive TTS synthesis module based on the Festival-MBROLA architecture has been designed to demonstrate the effectiveness of the framework design. Subjective listening experiments were carried out to evaluate the expressiveness of the synthetic talking avatar.

14.

Prosody conversion from neutral speech to emotional speech (total citations: 1; self-citations: 0; citations by others: 1)

Jianhua Tao, Yongguo Kang, Aijun Li. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(4): 1145-1154

Emotion is an important element in expressive speech synthesis. Unlike traditional discrete emotion simulations, this paper attempts to synthesize emotional speech using "strong", "medium", and "weak" classifications. It tests three models: a linear modification model (LMM), a Gaussian mixture model (GMM), and a classification and regression tree (CART) model. The linear modification model directly modifies sentence F0 contours and syllabic durations based on the acoustic distributions of emotional speech, such as F0 topline, F0 baseline, durations, and intensities. Further analysis shows that emotional speech is also related to stress and linguistic information. Unlike the linear modification method, the GMM and CART models try to map the subtle prosody distributions between neutral and emotional speech. While the GMM uses these features alone, the CART model also integrates linguistic features into the mapping. A pitch target model optimized to describe Mandarin F0 contours is also introduced. For all conversion methods, a deviation of perceived expressiveness (DPE) measure is created to evaluate the expressiveness of the output speech. The results show that the LMM gives the worst results of the three methods. The GMM method is more suitable for a small training set, while the CART method gives better emotional speech output when trained with a large, context-balanced corpus. The methods discussed in this paper indicate ways to generate emotional speech in speech synthesis, and the objective and subjective evaluation processes are also analyzed. These results support the use of neutral semantic content text in databases for emotional speech synthesis.
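The linear modification model can be illustrated with a simplified sketch that linearly remaps a neutral F0 contour so its topline and baseline match target values for an emotion, and scales syllable durations by one ratio. Here the topline and baseline are simplified to the maximum and minimum voiced F0 of the utterance, and the target values and the names `lmm_f0` and `lmm_durations` are hypothetical; the paper derives its targets from the acoustic distributions of emotional speech, and this is not its exact formulation.

```python
def lmm_f0(f0, target_top, target_base):
    """Linearly map voiced F0 from [neutral baseline, topline] onto [target_base, target_top]."""
    voiced = [v for v in f0 if v > 0]
    if not voiced:
        return list(f0)
    top, base = max(voiced), min(voiced)
    span = top - base
    out = []
    for v in f0:
        if v <= 0:
            out.append(0.0)  # keep unvoiced frames unvoiced
        elif span == 0:
            out.append((target_top + target_base) / 2)  # flat contour: use mid-range
        else:
            out.append(target_base + (v - base) / span * (target_top - target_base))
    return out


def lmm_durations(durations, ratio):
    """Scale syllable durations by a single emotion-specific ratio."""
    return [d * ratio for d in durations]
```

Because every voiced frame goes through the same linear map, the contour shape is preserved; this rigidity is consistent with the abstract's finding that the LMM performs worst, since the GMM and CART models can learn more subtle, context-dependent changes.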

15.

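A Survey of Text-to-Visual Speech Synthesis (total citations: 2; self-citations: 1; citations by others: 2)

Wang Zhiming, Tao Jianhua. Journal of Computer Research and Development, 2006, 43(1): 145-152

Visual information is very important for understanding speech. Not only people with hearing impairments but also people with normal hearing lip-read to some extent during conversation, especially in noisy environments where speech quality is degraded. Just as a text-to-speech system lets a computer speak like a person, a text-to-visual speech synthesis system lets a computer simulate the bimodal nature of human speech, making computer interfaces friendlier. This paper reviews the development of text-to-visual speech synthesis. Text-driven visual speech synthesis methods fall into two categories: parameter-control-based methods and data-driven methods. Several key issues in the parameter-control category and several implementations in the data-driven category are described in detail, and the strengths, weaknesses, and suitable application settings of the two categories are compared.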

16.

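Research on Auditory Models and Speech Signal Processing Methods

Liu Jingjie, Zhang Gang, Wu Shuhong. Microcomputer Development, 2012, (2): 61-64, 68

In communications, speech coding is an important branch of speech signal processing. To suit channel transmission, speech must be converted into another form that carries its information while preserving the signal as far as possible. In today's communications, computer networking, and related applications, speech coding algorithms with low delay and low bit rate play a decisive role. In speech coding, linear predictive (LPC) analysis plays a key role and is applied mainly in the perceptual weighting filter, the synthesis filter, and the log-gain filter. This paper presents a hybrid LPC coefficient set (Auditory-Acoustic-Hybrid-LPC) that combines acoustic and auditory properties to improve the perceived quality of the speech synthesized after coding, which is of value for research on coding algorithms.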
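For reference, conventional LPC coefficients of the kind this paper builds on can be computed with the autocorrelation method and the Levinson-Durbin recursion. This sketch covers only standard LPC; the auditory-acoustic hybrid coefficients the paper proposes are not reproduced. The sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p is a choice made for this example.

```python
def autocorrelation(x, max_lag):
    """Unnormalized autocorrelation r[0..max_lag] of a frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for LPC coefficients from autocorrelation r.

    Returns (a, err) with a[0] == 1.0 for the inverse filter
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order, so that x[n] is
    predicted as -(a[1] x[n-1] + ... + a[order] x[n-order]);
    err is the final prediction-error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for stage i
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err
```

For a decaying sequence x[n] = 0.9^n (a first-order autoregressive signal), a second-order analysis recovers a[1] close to -0.9 and a[2] close to 0, as expected. In a real coder, the frame would first be windowed, and the coefficients would feed the synthesis and perceptual weighting filters mentioned above.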

17.

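Chinese Text-to-Visual Speech Synthesis Based on a Data-Driven Approach (total citations: 7; self-citations: 0; citations by others: 7)

Wang Zhiming, Cai Lianhong, Ai Haizhou. Journal of Software, 2005, 16(6): 1054-1063

A computer text-to-visual speech synthesis (TTVS) system can enhance speech intelligibility and make human-computer interfaces friendlier. This paper presents a Chinese TTVS system based on a data-driven (sample-based) approach that generates new visual speech by concatenating short video segments. It gives an effective method for constructing a visual confusion tree of Chinese initials and finals and proposes a coarticulation model based on the visual confusion tree and a hardness factor; the model can be used for corpus selection in the analysis stage and for unit selection in the synthesis stage. Where the two frames at a concatenation boundary differ noticeably, image morphing is used to smooth the transition. Combined with an existing text-to-speech (TTS) system, this yields a complete Chinese text-to-visual speech synthesis system.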
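Image morphing usually combines geometric warping with intensity blending. The blending part alone, a linear cross-dissolve between the last frame of one video segment and the first frame of the next, can be sketched as below. Frames are assumed to be grayscale images given as equal-sized nested lists, and the function name is hypothetical; the warping step is omitted, so this shows only one component of morphing rather than the paper's smoothing method.

```python
def cross_dissolve(frame_a, frame_b, n_steps):
    """Return n_steps intermediate frames that blend frame_a into frame_b."""
    frames = []
    for step in range(1, n_steps + 1):
        t = step / (n_steps + 1)  # weight of frame_b, strictly between 0 and 1
        frames.append([
            [(1 - t) * pa + t * pb for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)
        ])
    return frames
```

Pure blending can produce ghosting when the mouth shape moves between the two frames, which is why full morphing first warps corresponding features into alignment.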

18.

Survey on speech emotion recognition: Features, classification schemes, and databases

Moataz El Ayadi, Mohamed S. Kamel, et al. Pattern Recognition, 2011, 44(3): 572-587

Recently, increasing attention has been directed to the study of the emotional content of speech signals, and many systems have therefore been proposed to identify the emotional content of a spoken utterance. This paper is a survey of speech emotion classification addressing three important aspects of the design of a speech emotion recognition system. The first is the choice of suitable features for speech representation. The second is the design of an appropriate classification scheme, and the third is the proper preparation of an emotional speech database for evaluating system performance. Conclusions about the performance and limitations of current speech emotion recognition systems are discussed in the last section of this survey, which also suggests possible ways of improving these systems.
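As an example of the first design aspect (feature choice), two classic frame-level descriptors, short-time energy and zero-crossing rate, can be computed per frame and summarized by utterance-level statistics, a common way to build global features for a classifier. The frame size, hop, feature names, and choice of statistics below are illustrative assumptions, not recommendations from the survey.

```python
import statistics


def split_frames(x, size, hop):
    """Split a signal into overlapping frames of `size` samples taken every `hop` samples."""
    return [x[i:i + size] for i in range(0, len(x) - size + 1, hop)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(v * v for v in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def utterance_features(x, size=256, hop=128):
    """Utterance-level mean and standard deviation of frame energy and ZCR."""
    frames = split_frames(x, size, hop)
    if not frames:
        raise ValueError("signal shorter than one frame")
    energy = [short_time_energy(f) for f in frames]
    zcr = [zero_crossing_rate(f) for f in frames]
    return {
        "energy_mean": statistics.mean(energy),
        "energy_std": statistics.pstdev(energy),
        "zcr_mean": statistics.mean(zcr),
        "zcr_std": statistics.pstdev(zcr),
    }
```

The resulting fixed-length dictionary can be turned into a feature vector for any classifier; richer systems add pitch, formant, and spectral features such as MFCCs in the same frame-then-summarize pattern.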