Similar Documents
20 similar documents found (search time: 412 ms)
1.
This paper presents the design and evaluation of a multi-lingual fingerspelling recognition module designed for an information terminal. Through multimodal input and output methods, the information terminal acts as a communication medium between deaf and blind people. The system converts fingerspelled words to speech and vice versa using fingerspelling recognition, fingerspelling synthesis, speech recognition, and speech synthesis in Czech, Russian, and Turkish. We describe an adaptive skin-color-based fingersign recognition system with close-to-real-time performance and present recognition results on 88 different letters signed by five different signers, using more than four hours of training and test videos.
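The abstract does not detail the adaptive skin-color model; as a rough stand-in, a minimal fixed-threshold HSV skin mask in OpenCV might look like this (all threshold values are illustrative assumptions, not the authors' parameters):

```python
# Minimal sketch of skin-color segmentation for hand detection.
# The HSV thresholds below are illustrative assumptions, not the
# adaptive model described in the paper.
import cv2
import numpy as np

def skin_mask(frame_bgr):
    """Return a binary mask of skin-colored pixels in a BGR frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 40, 60], dtype=np.uint8)     # assumed lower bound
    upper = np.array([25, 180, 255], dtype=np.uint8)  # assumed upper bound
    mask = cv2.inRange(hsv, lower, upper)
    # Morphological opening to suppress isolated noise pixels
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```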

2.
Because audio-visual bimodal speech recognition can effectively improve recognition rates in noisy environments, this paper designs an experimental in-vehicle voice-command recognition system. The system simulates the in-vehicle environment and fuses video information captured while speaking into the speech recognition system. It consists of three parts: model training, offline recognition, and online recognition. Online recognition uses speech as the human-machine interaction method throughout and supports user adaptation. The offline recognition part collects statistics on system-generated data at multiple levels, making it well suited to research on bimodal speech recognition algorithms.

3.
An Embedded Speech Processing System for In-Vehicle Wireless Terminals (cited by: 2; self-citations: 0, other citations: 2)
刘志, 刘加, 刘润生 《计算机工程》2005, 31(6): 182-183, 202
This paper introduces a system that uses speech technology for voice-dialing recognition, speech synthesis, and voice prompts in a wireless terminal for the automotive environment. The system comprises two main modules: a speech processing module and a Bluetooth communication module. The Bluetooth module communicates with a Bluetooth-capable mobile phone: it connects to the phone for calls and downloads the phone's phonebook, passing it to the speech processing module. The speech processing module performs speech recognition, speech synthesis, voice prompting, call recording and playback via speech compression codecs, and number lookup, and it controls the overall system flow. The system can download the phone's phonebook and generate a recognition vocabulary online, with a capacity of up to 1,000 words; experiments with a 600-word vocabulary show a recognition rate above 97%. The system is based on an SoC architecture and features high integration and high stability.

4.
The evolution of robust speech recognition systems that maintain a high level of recognition accuracy in difficult and dynamically-varying acoustical environments is becoming increasingly important as speech recognition technology becomes a more integral part of mobile applications. In a distributed speech recognition (DSR) architecture, the recogniser's front-end is located in the terminal and is connected over a data network to a remote back-end recognition server. The terminal performs the feature parameter extraction, or the front-end of the speech recognition system. These features are transmitted over a data channel to the remote back-end recogniser. DSR provides particular benefits for mobile-device applications, such as improved recognition performance compared to using the voice channel and ubiquitous access from different networks with a guaranteed level of recognition performance. A feature extraction algorithm integrated into the DSR system is required to operate in real time and with the lowest possible computational cost. In this paper, two innovative front-end processing techniques for noise-robust speech recognition are presented and compared: time-domain frame attenuation (TD-FrAtt) and frequency-domain frame attenuation (FD-FrAtt). These techniques include different forms of frame attenuation, an improvement of spectral subtraction based on minimum statistics, and a mel-cepstrum feature extraction procedure. Tests are performed using the Slovenian SpeechDat II fixed telephone database and the Aurora 2 database, together with the HTK speech recognition toolkit. The results obtained are especially encouraging for mobile DSR systems with limited memory and processing power.
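The FrAtt variants themselves are not specified in the abstract; as a hedged illustration of one building block it names, a minimum-statistics noise estimate tracks the minimum of the smoothed power spectrum over a sliding window (smoothing factor and window length below are assumptions, and this is a strong simplification of full minimum-statistics tracking):

```python
# Sketch of a minimum-statistics noise estimate: track the minimum of the
# smoothed power spectrum over a sliding window (a simplified stand-in for
# the paper's improved spectral subtraction; alpha and win are assumptions).
import numpy as np

def min_statistics_noise(power_spec, alpha=0.85, win=100):
    """power_spec: (frames, bins) STFT power. Returns per-frame noise floor."""
    smoothed = np.empty_like(power_spec)
    smoothed[0] = power_spec[0]
    for t in range(1, len(power_spec)):
        # First-order recursive smoothing of the power spectrum
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * power_spec[t]
    noise = np.empty_like(power_spec)
    for t in range(len(power_spec)):
        lo = max(0, t - win + 1)
        noise[t] = smoothed[lo:t + 1].min(axis=0)  # minimum over the window
    return noise
```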

5.
Advances in Speech Information Processing (cited by: 1; self-citations: 0, other citations: 1)
This paper reviews advances in speech information processing, with particular attention to the state of Chinese speech processing. Speech information processing covers speech recognition, speaker recognition, speech synthesis, and speech perception computing. Recognition of accented and casually pronounced speech strongly supports applications such as language learning and spoken-proficiency assessment. Improving recognition accuracy under cross-channel conditions, environmental noise, multiple speakers, short utterances, and time-varying speech is a research focus of speaker recognition. Speech synthesis mainly addresses multilingual synthesis, emotional speech synthesis, and visual speech synthesis. Speech perception computing includes research on speech audiometry, noise suppression algorithms, hearing-aid frequency-response compensation, and speech enhancement algorithms. Effectively combining speech processing technology with language and networks promotes more natural human-machine speech interaction.

6.
To make remote control of intelligent robots more convenient, fast, and user-friendly, this paper designs and implements a voice-based remote control system for an intelligent robot. The design uses the Microsoft Speech SDK to build a dictation-mode large-vocabulary speech recognition module and a speech synthesis module, uses the 海量 Chinese intelligent word-segmentation component to build a keyword-spotting module, and combines VFW (Video for Windows) technology with wireless networking to build the information transmission module. Experiments show that the system achieves high speech recognition accuracy, a wide recognition range, and flexible voice input.

7.
Confidence measures enable us to assess the output of a speech recognition system: a confidence measure provides an estimate of the probability that a word in the recognizer output is correct. In this paper we discuss ways to quantify the performance of confidence measures in terms of their discrimination power and bias. In particular, we analyze two different performance metrics: the classification equal error rate and the normalized mutual information metric. We then report experimental results of using these metrics to compare four different confidence measure estimation schemes. We also discuss the relationship between these metrics and the operating point of the speech recognition system and develop an approach to the robust estimation of normalized mutual information.
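As a sketch of the normalized mutual information metric under one common definition, I(C;X)/H(C) between word correctness and a thresholded confidence decision can be estimated from counts (the paper's exact estimator may differ):

```python
# Sketch: normalized mutual information I(C;X)/H(C) between word
# correctness C and the thresholded confidence decision X.
# Illustrative only; the paper's exact estimator may differ.
import numpy as np

def normalized_mi(correct, conf, threshold=0.5):
    c = np.asarray(correct, dtype=int)               # 1 = word correct
    x = (np.asarray(conf) >= threshold).astype(int)  # 1 = word accepted
    joint = np.zeros((2, 2))
    for ci, xi in zip(c, x):
        joint[ci, xi] += 1
    joint /= joint.sum()
    pc, px = joint.sum(1), joint.sum(0)              # marginals of C and X
    mi = sum(joint[i, j] * np.log2(joint[i, j] / (pc[i] * px[j]))
             for i in range(2) for j in range(2) if joint[i, j] > 0)
    h_c = -sum(p * np.log2(p) for p in pc if p > 0)  # entropy of correctness
    return mi / h_c if h_c > 0 else 0.0
```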

8.
The speech signal carries linguistic information and also paralinguistic information such as emotion. Modern automatic speech recognition systems achieve high performance on neutral-style speech, but they cannot maintain their high recognition rate on spontaneous speech. Emotion recognition is therefore an important step toward emotional speech recognition. The accuracy of an emotion recognition system depends on several factors, such as the type and number of emotional states, the selected features, and the type of classifier. In this paper, a modular neural-support vector machine (SVM) classifier is proposed, and its performance in emotion recognition is compared to Gaussian mixture model, multi-layer perceptron neural network, and C5.0-based classifiers. The most efficient features are also selected using the analysis of variations method. The proposed modular scheme is derived from a comparative study of different features and characteristics of individual emotional states, with the aim of improving recognition performance. Empirical results show that even after discarding 22% of the features, the average emotion recognition accuracy improves by 2.2%. The proposed modular neural-SVM classifier also improves recognition accuracy by at least 8% compared to the simulated monolithic classifiers.
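The modular topology is not reproduced here; a single-stage sketch of the feature-selection-plus-SVM step, using scikit-learn's ANOVA F-test as a stand-in for the paper's analysis-of-variations method (the percentile mirrors the reported 22% discard; the kernel is an assumption):

```python
# Sketch: ANOVA-based feature selection followed by an SVM emotion
# classifier (a single-stage stand-in for the paper's modular
# neural-SVM scheme; percentile and kernel are assumptions).
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf = make_pipeline(
    StandardScaler(),
    SelectPercentile(f_classif, percentile=78),  # keep ~78% of features
    SVC(kernel="rbf"),
)
# Usage: clf.fit(X_train, y_train); clf.predict(X_test)
```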

9.
Speech translation is a technology that helps people communicate across different languages. The most commonly used speech translation model is composed of automatic speech recognition, machine translation and text-to-speech synthesis components, which share information only at the text level. However, spoken communication is different from written communication in that it uses rich acoustic cues such as prosody in order to transmit more information through non-verbal channels. This paper is concerned with speech-to-speech translation that is sensitive to this paralinguistic information. Our long-term goal is to make a system that allows users to speak a foreign language with the same expressiveness as if they were speaking in their own language. Our method works by reconstructing input acoustic features in the target language. From the many different possible paralinguistic features to handle, in this paper we choose duration and power as a first step, proposing a method that can translate these features from input speech to the output speech in continuous space. This is done in a simple and language-independent fashion by training an end-to-end model that maps source-language duration and power information into the target language. Two approaches are investigated: linear regression and neural network models. We evaluate the proposed methods and show that paralinguistic information in the input speech of the source language can be reflected in the output speech of the target language.
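Of the two approaches, the linear-regression variant is straightforward to sketch: fit a linear map from aligned source-language duration/power vectors to target-language ones (the feature layout and training data below are illustrative placeholders):

```python
# Sketch of the linear-regression variant: learn a linear map from
# source-language prosody features (e.g., per-segment duration and power)
# to target-language ones. Feature layout and data are placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression

# X_src, Y_tgt: (n_utterances, n_features) aligned duration/power vectors
X_src = np.random.rand(200, 8)   # placeholder source-language features
Y_tgt = np.random.rand(200, 8)   # placeholder target-language features

mapper = LinearRegression().fit(X_src, Y_tgt)
predicted_prosody = mapper.predict(X_src[:1])  # prosody for one utterance
```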

10.
Sensory information is indispensable for living things. It is also important for living things to integrate multiple types of senses to understand their surroundings. In human communication, human beings must further integrate the multimodal senses of audition and vision to understand intention. In this paper, we describe speech-related modalities, since speech is the most important medium for transmitting human intention. To date, there have been many studies of speech communication technologies, but performance still has room for improvement. For instance, although speech recognition has achieved remarkable progress, recognition performance still degrades seriously in acoustically adverse environments. On the other hand, perceptual research has proved the existence of complementary integration of audio speech and visual face movements in human perception mechanisms. Such research has stimulated attempts to apply visual face information to speech recognition and synthesis. This paper introduces works on audio-visual speech recognition, speech-to-lip-movement mapping for audio-visual speech synthesis, and audio-visual speech translation.

11.
This paper presents the design and the current prototype implementation of an interactive vocal information retrieval system that can be used to access articles of a large newspaper archive using a telephone. The implementation of the system highlights the limitations of current voice information retrieval technology, in particular of speech recognition and synthesis. We present our evaluation of these limitations and address the feasibility of intelligent interactive vocal information access systems.

12.
Audio-visual speech modeling for continuous speech recognition (cited by: 3; self-citations: 0, other citations: 3)
This paper describes a speech recognition system that uses both acoustic and visual speech information to improve recognition performance in noisy environments. The system consists of three components: a visual module, an acoustic module, and a sensor fusion module. The visual module locates and tracks the lip movements of a given speaker and extracts relevant speech features. This task is performed with an appearance-based lip model that is learned from example images. Visual speech features are represented by contour information of the lips and grey-level information of the mouth area. The acoustic module extracts noise-robust features from the audio signal. Finally, the sensor fusion module is responsible for the joint temporal modeling of the acoustic and visual feature streams and is realized using multistream hidden Markov models (HMMs). The multistream method allows the definition of different temporal topologies and levels of stream integration and hence enables the modeling of temporal dependencies more accurately than traditional approaches. We present two different methods to learn the asynchrony between the two modalities and show how to incorporate them in the multistream models. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits. On a recognition task at 15 dB acoustic signal-to-noise ratio (SNR), acoustic perceptual linear prediction (PLP) features lead to a 56% error rate, noise-robust RASTA-PLP (relative spectra) acoustic features to a 7.2% error rate, and combined noise-robust acoustic and visual features to a 2.5% error rate.
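A defining property of multistream HMMs is that each state combines per-stream observation log-likelihoods with stream exponent weights; a minimal sketch (the weights are assumptions, typically tuned to the acoustic SNR):

```python
# Sketch: multistream observation score for one HMM state, combining
# acoustic and visual stream log-likelihoods with exponent weights,
# i.e. log b(o) = w_a * log b_a(o_a) + w_v * log b_v(o_v).
# Weights are illustrative; in practice they are tuned to the SNR.
import numpy as np

def multistream_log_likelihood(logp_audio, logp_video,
                               w_audio=0.7, w_video=0.3):
    """Per-state combined log-likelihood for one observation frame."""
    return w_audio * np.asarray(logp_audio) + w_video * np.asarray(logp_video)
```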

13.
Audio-visual speech recognition (AVSR) has shown impressive improvements over audio-only speech recognition in the presence of acoustic noise. However, the problems of region-of-interest detection and feature extraction may limit recognition performance because visual speech information is typically obtained from planar video data. In this paper, we deviate from traditional visual speech information and propose an AVSR system integrating 3D lip information. The Microsoft Kinect multi-sensory device was adopted for data collection. Different feature extraction and selection algorithms were applied to the planar images and the 3D lip information, fusing them into a joint visual-3D lip feature. For automatic speech recognition (ASR), fusion methods were investigated and the audio-visual speech information was integrated into a state-synchronous two-stream hidden Markov model. The experimental results demonstrate that our AVSR system integrating 3D lip information improves the recognition performance of traditional ASR and AVSR systems in acoustically noisy environments.

14.
An Improved Mel Filter for Speaker Recognition (cited by: 1; self-citations: 0, other citations: 1)
项要杰, 杨俊安, 李晋徽, 陆俊 《计算机工程》2013, (11): 214-217, 222
Mel-frequency cepstral coefficients (MFCC) emphasize the low-frequency information of the speech signal, describe the signal's spectral distribution insufficiently, and cannot effectively discriminate speaker-specific information. By analyzing how speaker-specific information differs across frequency bands, and combining the different high- and low-frequency characteristics of the Mel filter and the inverse Mel filter, this paper proposes an improved Mel filter suited to speaker recognition. Experimental results show that the new features extracted by the improved Mel filter achieve better recognition performance than both conventional MFCC and inverse Mel-frequency cepstral coefficients (IMFCC), with essentially no increase in the training and recognition time of the speaker recognition system.
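The improved filter itself is not reproduced in the abstract; as background, a standard triangular Mel filterbank and an inverse-Mel (IMFCC-style) filterbank obtained by mirroring its band edges, so that resolution is finest at high frequencies, can be sketched as follows (all parameters are illustrative):

```python
# Sketch: a triangular Mel filterbank and an inverse-Mel filterbank
# obtained by mirroring the band edges around the Nyquist axis.
# Sample rate, FFT size, and filter count are illustrative.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_fbank(edges_hz, n_fft, sr):
    """Build triangular filters from a sorted list of edge frequencies."""
    bins = np.floor((n_fft + 1) * edges_hz / sr).astype(int)
    fbank = np.zeros((len(edges_hz) - 2, n_fft // 2 + 1))
    for i in range(1, len(edges_hz) - 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)  # rising slope
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)  # falling slope
    return fbank

sr, n_fft, n_filt = 16000, 512, 24
mel_edges = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filt + 2))
mel_fb = triangular_fbank(mel_edges, n_fft, sr)
# Inverse-Mel: mirror the edge frequencies, densest at high frequencies.
imel_edges = np.sort(sr / 2 - mel_edges)
imel_fb = triangular_fbank(imel_edges, n_fft, sr)
```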

15.
In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation, in which we employ an HMM statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches, as well as an end-to-end speaker-adaptive speech-to-speech translation system. Our experiments show that we can successfully apply speaker adaptation in both unsupervised and cross-lingual scenarios, and our proposed algorithms seem to generalise well for several language pairs. We also discuss important future directions, including the need for better evaluation metrics.

16.
Speech processing is an important research area that includes speaker recognition, speech synthesis, speech coding, and speech noise reduction. Many languages have different speaking styles, called accents or dialects. Identifying the accent before recognition can improve the performance of speech recognition systems, and the more accents a language has, the more crucial accent recognition becomes. Telugu is an Indian language widely spoken in the southern part of India; its main accents are coastal Andhra, Telangana, and Rayalaseema. In the present work, speech samples were collected from native speakers of different Telugu accents for both training and testing. Mel-frequency cepstral coefficient (MFCC) features are extracted from each training and test sample, and a Gaussian mixture model (GMM) is then used to classify the speech by accent. The overall accuracy of the proposed system in recognizing a speaker's region from accent is 91%.
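The described pipeline, MFCC features plus one GMM per accent with a maximum-likelihood decision, can be sketched as follows; MFCC order, mixture count, and data layout are assumptions:

```python
# Sketch of the MFCC + GMM accent classifier: one GMM per accent,
# decide by maximum average log-likelihood. MFCC order, number of
# mixtures, and data layout are illustrative assumptions.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, 13)

def train_accent_models(files_by_accent, n_components=16):
    models = {}
    for accent, paths in files_by_accent.items():
        feats = np.vstack([mfcc_frames(p) for p in paths])
        models[accent] = GaussianMixture(
            n_components, covariance_type="diag").fit(feats)
    return models

def classify(path, models):
    feats = mfcc_frames(path)
    # Highest average per-frame log-likelihood wins
    return max(models, key=lambda a: models[a].score(feats))
```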

17.
The structure of speech dialogue systems is analyzed. It is shown that the creation of speech recognition and synthesis systems is connected with the solution of direct and inverse problems as applied to the organization of speech dialogue. Attention is paid to the multilevel hierarchical organization of speech dialogue systems, in which the interaction between recognition and synthesis channels is realized through a common knowledge base. A generalized structure of data representation at different hierarchy levels is analyzed, and the hierarchy of qualities determining the properties of a dialogue system is considered. Translated from Kibernetika i Sistemnyi Analiz, No. 2, pp. 30–41, March–April 2008.

18.
It is suggested that algorithms capable of estimating and characterizing accent knowledge would provide valuable information in the development of more effective speech systems such as speech recognition, speaker identification, audio stream tagging in spoken document retrieval, channel monitoring, or voice conversion. Accent knowledge could be used for selection of alternative pronunciations in a lexicon, engage adaptation for acoustic modeling, or provide information for biasing a language model in large vocabulary speech recognition. In this paper, we propose a text-independent automatic accent classification system using phone-based models. Algorithm formulation begins with a series of experiments focused on capturing the spectral evolution information as potential accent sensitive cues. Alternative subspace representations using principal component analysis and linear discriminant analysis with projected trajectories are considered. Finally, an experimental study is performed to compare the spectral trajectory model framework to a traditional hidden Markov model recognition framework using an accent sensitive word corpus. System evaluation is performed using a corpus representing five English speaker groups with native American English, and English spoken with Mandarin Chinese, French, Thai, and Turkish accents for both male and female speakers.
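The trajectory model itself is not reproduced; the two subspace representations the paper compares can be sketched with scikit-learn (the feature layout and dimensionalities are assumptions):

```python
# Sketch: PCA vs. LDA projections of phone-level spectral-trajectory
# features for accent classification (dimensions are assumptions).
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# X: (n_tokens, n_trajectory_dims) stacked spectral trajectories,
# y: accent label for each token.
def project(X, y, n_dims=10):
    n_lda = min(n_dims, len(set(y)) - 1)  # LDA is capped at classes - 1
    X_pca = PCA(n_components=n_dims).fit_transform(X)      # unsupervised
    X_lda = LinearDiscriminantAnalysis(
        n_components=n_lda).fit_transform(X, y)            # supervised
    return X_pca, X_lda
```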

19.
Noise-Robust Speech Recognition and the Application of Speech Enhancement Algorithms (cited by: 1; self-citations: 0, other citations: 1)
汤玲, 戴斌 《计算机仿真》2006, 23(9): 80-82, 143
Improving the robustness of speech recognition systems is an important research topic. Recognition performance often degrades because the data in the training environment and the recognition environment are mismatched. To obtain satisfactory performance in noisy environments, this paper proposes a robust speech feature extraction method based on the auditory characteristics of the human ear. Before MFCC feature extraction, the noisy speech features are processed using auditory masking properties, combined with speech enhancement, to yield robust speech features. Analysis of four different experiments shows that this method improves the system's noise robustness, and that the feature processing adapts well to different noise types at different signal-to-noise ratios.
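The masking-based processing is described only at a high level; a minimal spectral-subtraction front end of the kind such enhancement methods build on, applied before MFCC extraction (over-subtraction factor, floor, and noise-frame count are assumptions):

```python
# Sketch of a spectral-subtraction front end applied before MFCC
# extraction: estimate noise from leading frames, subtract it with an
# over-subtraction factor, and floor the result. All parameters are
# illustrative, not the paper's masking-based settings.
import numpy as np

def spectral_subtract(stft_mag, noise_frames=10, alpha=2.0, beta=0.01):
    """stft_mag: (bins, frames) magnitude spectrogram of noisy speech."""
    noise = stft_mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean = stft_mag ** 2 - alpha * noise ** 2   # power-domain subtraction
    floor = beta * noise ** 2                    # spectral floor
    return np.sqrt(np.maximum(clean, floor))
```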

20.
This study explores a novel subspace projection-based approach for the analysis of stressed speech. Studies have shown that stress influences the speech production system and results in large acoustic variation between neutral and stressed speech. This degrades the discrimination capability of an automatic speech recognition system trained on neutral speech when tested on stressed speech. An effort is made to reduce the acoustic mismatch by explicitly normalizing the stress-specific attributes. The stress-specific divergences are normalized by exploiting a subspace filtering technique. To accomplish this, an orthogonal-projection-based linear relationship between the speech and the stress information is explored to filter an effective speech subspace that consists of the speech information. The speech subspace is constructed using K-means clustering followed by singular value decomposition on neutral speech data. The speech and the stress information are separated by projecting the stressed speech orthogonally onto the effective speech subspace. Experimental results indicate that the bases of an effective subspace comprise the first few eigenvectors corresponding to the highest eigenvalues. To further improve system performance, both the neutral and the stressed speech are projected onto the lower-dimensional subspace. The projections derived using the neutral speech employ heteroscedastic linear discriminant analysis in a maximum likelihood linear transformations-based semi-tied adaptation framework. Consistent improvements are noted for the proposed technique in all the discussed cases.
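Following the description, a minimal version of the subspace construction and the orthogonal projection might look like this (cluster count and subspace rank are assumptions):

```python
# Sketch of the subspace-filtering idea: build a speech subspace from
# K-means centroids of neutral-speech features via SVD, then project
# stressed-speech features onto the top-k eigenvector basis.
# Cluster count and rank are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def speech_subspace(neutral_feats, n_clusters=64, rank=20):
    centroids = KMeans(n_clusters=n_clusters,
                       n_init=10).fit(neutral_feats).cluster_centers_
    U, _, _ = np.linalg.svd(centroids.T, full_matrices=False)
    return U[:, :rank]            # basis: top-k left singular vectors

def project_onto_speech(stressed_feats, basis):
    # Orthogonal projection x_hat = B (B^T x) with orthonormal columns B
    return stressed_feats @ basis @ basis.T
```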
