Similar Documents
1.
Person-oriented visual understanding in complex scenes can improve the efficiency of intelligent social collaboration, accelerate the intelligentization of social governance, and shows great vitality in serving economic activity and building smart cities, with major social and economic value. Person visual understanding technology mainly covers real-time person identification, individual behavior analysis and group interaction understanding, human-machine collaborative learning, facial-expression and speech emotion recognition, and knowledge-guided visual understanding. When the environment is a complex scene, and especially when the holistic "person-behavior-scene" association is considered in visual representation and understanding, the related research problems become more challenging. Among them, real-time person identification in large-scale complex scenes mainly involves face detection, person feature understanding, and scene analysis, and forms an important research foundation for person visual understanding in complex scenes; individual behavior analysis and group interaction understanding mainly involve video person re-identification, video action recognition, video question answering, and video dialogue, and constitute the key behavioral components of visual understanding; within individual behavior analysis and group interaction understanding, machine learning paradigms that jointly exploit knowledge and priors have emerged, with visual question answering and dialogue and vision-language navigation as two key research directions; emotion recognition and synthesis mainly involve facial expression recognition, speech emotion recognition and synthesis, and knowledge-guided visual analysis, and are the core technologies of affective interaction. Centered on these key technologies, this paper surveys the research hotspots and application scenarios of person visual understanding in complex scenes, summarizes related results and progress at home and abroad, and outlines the frontier techniques and development trends of the field.

2.
Audio-visual speech recognition employing both acoustic and visual speech information is a novel extension of acoustic speech recognition and it significantly improves the recognition accuracy in noisy environments. Although various audio-visual speech-recognition systems have been developed, a rigorous and detailed comparison of the potential geometric visual features from speakers' faces is essential. Thus, in this paper the geometric visual features are compared and analyzed rigorously for their importance in audio-visual speech recognition. Experimental results show that among the geometric visual features analyzed, lip vertical aperture is the most relevant; and the visual feature vector formed by vertical and horizontal lip apertures and the first-order derivative of the lip corner angle leads to the best recognition results. Speech signals are modeled by hidden Markov models (HMMs), and using the optimized HMMs and geometric visual features the accuracies of acoustic-only, visual-only, and audio-visual speech recognition are compared. The audio-visual speech recognition scheme has a much improved recognition accuracy compared to acoustic-only and visual-only speech recognition, especially at high noise levels. The experimental results showed that a set of as few as three labial geometric features is sufficient to improve the recognition rate by as much as 20% (from 62%, with acoustic-only information, to 82%, with audio-visual information at a signal-to-noise ratio of 0 dB).
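
The feature vector described above (vertical and horizontal lip apertures plus the first-order derivative of the lip-corner angle) is straightforward to compute once lip landmarks are available. Below is a minimal Python sketch assuming hypothetical landmark points for the mouth corners and the upper/lower lip midpoints; it is not the authors' implementation.

```python
import numpy as np

def lip_geometric_features(upper, lower, left, right, prev_angle=None, dt=1.0):
    """Geometric lip features from four 2D lip landmarks (x, y).

    upper/lower: midpoints of the upper and lower lip contours
    left/right:  the two mouth corners
    Returns [vertical_aperture, horizontal_aperture, corner_angle, d_angle/dt].
    """
    upper, lower = np.asarray(upper, float), np.asarray(lower, float)
    left, right = np.asarray(left, float), np.asarray(right, float)
    vertical = np.linalg.norm(upper - lower)
    horizontal = np.linalg.norm(left - right)
    # Lip-corner angle: angle at the left corner between the two lip midpoints.
    v1, v2 = upper - left, lower - left
    cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    angle = float(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    d_angle = 0.0 if prev_angle is None else (angle - prev_angle) / dt
    return np.array([vertical, horizontal, angle, d_angle])

# Example with made-up landmark coordinates for one video frame.
print(lip_geometric_features(upper=(50, 40), lower=(50, 60), left=(30, 50), right=(70, 50)))
```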

3.
Audio-visual recognition systems are becoming popular because they overcome certain problems of traditional audio-only recognition systems. However, difficulties due to visual variations in video sequences can significantly degrade the recognition performance of the system. This problem can be further complicated when more than one visual variation happens at the same time. Although several databases have been created in this area, none of them includes realistic visual variations in the video sequences. With the aim of facilitating the development of robust audio-visual recognition systems, the new audio-visual UNMC-VIER database is created. This database contains various visual variations including illumination, facial expression, head pose, and image resolution variations. The most unique aspect of this database is that it includes more than one visual variation in the same video recording. For the audio part, the utterances are spoken at a slow and a normal speech pace to improve the learning process of audio-visual speech recognition systems. Hence, this database is useful for the development of robust audio-visual person recognition, speech recognition, and face recognition systems.

4.
Audio-visual speech recognition, or the combination of visual lip-reading with traditional acoustic speech recognition, has been previously shown to provide a considerable improvement over acoustic-only approaches in noisy environments, such as that present in an automotive cabin. The research presented in this paper extends the established audio-visual speech recognition literature to show that further improvements in speech recognition accuracy can be obtained when multiple frontal or near-frontal views of a speaker's face are available. A series of visual speech recognition experiments using a four-stream visual synchronous hidden Markov model (SHMM) is conducted on the four-camera AVICAR automotive audio-visual speech database. We study the relative contributions of the side and centrally oriented cameras to improving visual speech recognition accuracy. Finally, combination of the four visual streams with a single audio stream in a five-stream SHMM demonstrates a relative improvement of over 56% in word recognition accuracy when compared to the acoustic-only approach in the noisiest conditions of the AVICAR database.
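
A common way to combine synchronized streams in an SHMM is to weight each stream's emission log-likelihood per state. The following is a minimal sketch of that fusion step with illustrative stream weights; the actual SHMM training and weighting used on AVICAR are not reproduced here.

```python
import numpy as np

def fused_log_likelihood(stream_loglikes, stream_weights):
    """Combine per-stream emission log-likelihoods for one HMM state.

    stream_loglikes: array of shape (n_streams,) with log p(observation_s | state)
    stream_weights:  stream exponents, typically constrained to sum to 1
    """
    stream_loglikes = np.asarray(stream_loglikes, dtype=float)
    stream_weights = np.asarray(stream_weights, dtype=float)
    return float(np.dot(stream_weights, stream_loglikes))

# Example: one audio stream plus four visual camera streams (weights are illustrative).
print(fused_log_likelihood([-12.3, -20.1, -19.4, -22.8, -21.0],
                           [0.6, 0.1, 0.1, 0.1, 0.1]))
```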

5.
Audio-visual speech recognition (AVSR) has shown impressive improvements over audio-only speech recognition in the presence of acoustic noise. However, because the visual speech information is typically obtained from planar video data, region-of-interest detection and feature extraction can limit the recognition performance. In this paper, we go beyond traditional visual speech information and propose an AVSR system integrating 3D lip information. The Microsoft Kinect multi-sensory device was adopted for data collection. Different feature extraction and selection algorithms were applied to the planar images and the 3D lip information, so as to fuse the planar images and 3D lip features into a joint visual-3D lip feature. For automatic speech recognition (ASR), the fusion methods were investigated and the audio-visual speech information was integrated into a state-synchronous two-stream hidden Markov model. The experimental results demonstrated that our AVSR system integrating 3D lip information improves the recognition performance of traditional ASR and AVSR systems in acoustically noisy environments.

6.
Considering visual speech features along with traditional acoustic features has shown decent performance in uncontrolled auditory environments. However, most existing audio-visual speech recognition (AVSR) systems have been developed under laboratory conditions and rarely address visual-domain problems. This paper presents an active appearance model (AAM) based multiple-camera AVSR experiment. Shape and appearance information is extracted from the jaw and lip region to enhance performance in vehicle environments. First, a series of visual speech recognition (VSR) experiments is carried out to study the impact of each camera on multi-stream VSR, using the four cameras of an in-car audio-visual corpus. The individual camera streams are then fused into a four-stream synchronous hidden Markov model visual speech recognizer. Finally, the optimal four-stream VSR is combined with a single-stream acoustic HMM to build a five-stream AVSR system. The dual-modality AVSR system is more robust than the acoustic-only speech recognizer across all driving conditions.

7.
The ability of a computer to detect and appropriately respond to changes in a user's affective state has significant implications for human-computer interaction (HCI). In this paper, we present our efforts toward audio-visual affect recognition of 11 affective states customized for HCI applications (four cognitive/motivational and seven basic affective states), using 20 non-actor subjects. A smoothing method is proposed to reduce the detrimental influence of speech on facial expression recognition. The feature selection analysis shows that subjects tend to use brow movement in the face, and pitch and energy in prosody, to express their affect while speaking. For person-dependent recognition, we apply a voting method to combine the frame-based classification results from both audio and visual channels; the result shows a 7.5% improvement over the best unimodal performance. For the person-independent test, we apply a multistream HMM to combine the information from multiple component streams; this test shows a 6.1% improvement over the best component performance.
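
For the person-dependent setting, the frame-level decisions from the audio and the visual channels are combined by voting. A minimal sketch of majority voting over per-frame labels follows, using hypothetical label sequences; ties and any weighting scheme are handled differently in the paper itself.

```python
from collections import Counter

def vote(audio_frame_labels, visual_frame_labels):
    """Combine per-frame affect labels from two channels by simple majority vote."""
    counts = Counter(audio_frame_labels) + Counter(visual_frame_labels)
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical per-frame decisions for one utterance.
print(vote(["joy", "joy", "neutral"], ["joy", "surprise", "joy"]))  # -> "joy"
```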

8.
Visual speech information from the speaker's mouth plays a significant role in improving speech recognition rates in noisy environments. This paper introduces one of the key components of audio-visual speech recognition (AVSR): the front-end design of the visual information stream. It describes a machine learning method for mouth detection that processes images quickly and achieves a high detection rate: rotated Haar-like features are applied on the integral image; under the AdaBoost learning algorithm, single-valued (decision-stump) classifiers serve as the base feature classifiers, the resulting strong classifiers are combined in a cascade, and the detection region is finally partitioned to localize the mouth. Applied in an AVSR system, the method achieves essentially real-time and accurate detection of the mouth region of the face.
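
The integral image lets any upright rectangular pixel sum be evaluated with four lookups, which is what makes Haar-like features cheap enough for a cascaded AdaBoost detector. Below is a minimal sketch of the integral image and one upright two-rectangle feature, assuming a grayscale numpy array; the rotated-feature variant and the cascade itself are omitted.

```python
import numpy as np

def integral_image(gray):
    """Integral image padded with a zero row/column: ii[y, x] = sum of gray[:y, :x]."""
    ii = np.cumsum(np.cumsum(gray.astype(np.float64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, via four integral-image lookups."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])

def haar_two_rect_vertical(ii, top, left, height, width):
    """Upright two-rectangle Haar-like feature: upper half minus lower half."""
    half = height // 2
    return (rect_sum(ii, top, left, half, width)
            - rect_sum(ii, top + half, left, half, width))

# Illustrative use on a random "image"; a real detector scans many such features.
ii = integral_image(np.random.default_rng(0).integers(0, 256, (60, 80)))
print(haar_two_rect_vertical(ii, top=10, left=20, height=16, width=24))
```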

9.
Exploiting the multimodal nature of human speech perception, this paper attempts to build a continuous speech recognition system for noisy environments based on composite audio and video features. For video feature extraction, a method based on characteristic mouth shapes is introduced. Recognition experiments show that this video feature extraction method yields higher recognition rates than the traditional DCT and DWT methods, and that the audio-video hybrid continuous speech recognition system based on characteristic mouth shapes is highly robust to noise.

10.
Animating expressive faces across languages
This paper describes a morphing-based audio-driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and synthesized expressions. A novel scheme is presented for implementing a language-independent audio-driven facial animation system given a speech recognition system for just one language, in our case English. The method presented here can also be used for text-to-audio-visual speech synthesis. Visemes in new expressions are synthesized to be able to generate animations with different facial expressions. An animation sequence using optical flow between visemes is constructed, given an incoming audio stream and still pictures of a face representing different visemes. The presented techniques give improved lip synchronization and naturalness to the animated video.
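
One way to build the transition between two viseme stills is to estimate dense optical flow and warp/blend along it. Below is a minimal sketch using OpenCV's Farnebäck flow, assuming two equally sized grayscale viseme images; it is a rough approximation, not the paper's morphing pipeline.

```python
import cv2
import numpy as np

def morph_frames(viseme_a, viseme_b, n_frames=5):
    """Intermediate frames between two grayscale viseme images via optical flow."""
    flow = cv2.calcOpticalFlowFarneback(viseme_a, viseme_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = viseme_a.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    frames = []
    for t in np.linspace(0.0, 1.0, n_frames):
        # Approximate warp of A toward B by a fraction t of the flow, then cross-fade.
        map_x = (grid_x + t * flow[..., 0]).astype(np.float32)
        map_y = (grid_y + t * flow[..., 1]).astype(np.float32)
        warped = cv2.remap(viseme_a, map_x, map_y, cv2.INTER_LINEAR)
        frames.append(cv2.addWeighted(warped, 1.0 - float(t), viseme_b, float(t), 0))
    return frames
```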

11.
The presence of disfluencies in spontaneous speech, while posing a challenge for robust automatic recognition, also offers a means for gaining additional insight into a speaker's communicative and cognitive state. This paper analyzes disfluencies in children's spontaneous speech, in the context of spoken-dialog-based computer game play, and addresses the automatic detection of disfluency boundaries. Although several approaches have been proposed to detect disfluencies in speech, relatively little work has been done to utilize visual information to improve the performance and robustness of the disfluency detection system. This paper describes the use of visual information along with prosodic and language information to detect the presence of disfluencies in a child's computer-directed speech and shows how these information sources can be integrated to increase the overall information available for disfluency detection. The experimental results on our children's multimodal dialog corpus indicate that disfluency detection accuracy of over 80% can be obtained by utilizing audio-visual information. Specifically, results showed that the addition of visual information to prosody and language features yields relative improvements in disfluency detection error rates of 3.6% and 6.3%, respectively, for information fusion at the feature level and the decision level.
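
The two fusion points mentioned above can be sketched as follows: either concatenate the prosodic, language, and visual feature vectors before a single classifier (feature level), or train one classifier per modality and combine their posterior scores (decision level). A minimal illustration with scikit-learn follows, using hypothetical feature arrays; the paper's actual classifiers and features are not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)                        # 1 = disfluency boundary (hypothetical)
prosody = rng.normal(size=(200, 6)) + y[:, None]   # hypothetical per-word feature vectors
language = rng.normal(size=(200, 4)) + y[:, None]
visual = rng.normal(size=(200, 8)) + y[:, None]

# Feature-level fusion: concatenate modalities and train one classifier.
early_clf = LogisticRegression(max_iter=1000).fit(np.hstack([prosody, language, visual]), y)

# Decision-level fusion: per-modality classifiers, then average their posteriors.
modalities = [prosody, language, visual]
clfs = [LogisticRegression(max_iter=1000).fit(X, y) for X in modalities]
late_scores = np.mean([c.predict_proba(X)[:, 1] for c, X in zip(clfs, modalities)], axis=0)
late_pred = (late_scores > 0.5).astype(int)
```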

12.
Visual speech information plays an important role in automatic speech recognition (ASR) especially when audio is corrupted or even inaccessible. Despite the success of audio-based ASR, the problem of visual speech decoding remains widely open. This paper provides a detailed review of recent advances in this research area. In comparison with the previous survey [97] which covers the whole ASR system that uses visual speech information, we focus on the important questions asked by researchers and summarize the recent studies that attempt to answer them. In particular, there are three questions related to the extraction of visual features, concerning speaker dependency, pose variation and temporal information, respectively. Another question is about audio-visual speech fusion, considering the dynamic changes of modality reliabilities encountered in practice. In addition, the state-of-the-art on facial landmark localization is briefly introduced in this paper. Those advanced techniques can be used to improve the region-of-interest detection, but have been largely ignored when building a visual-based ASR system. We also provide details of audio-visual speech databases. Finally, we discuss the remaining challenges and offer our insights into the future research on visual speech decoding.

13.
周梁  高鹏  丁鹏  徐波 《中文信息学报》2006,20(3):101-106
Content-based retrieval over massive amounts of speech requires combining speech recognition with retrieval techniques. By adjusting the language model, this paper studies how keyword retrieval differs on speech recognition transcripts of different recognition accuracies, and thereby examines the correlation between speech recognition performance and retrieval performance. Experiments on 114 hours of speech data show that recognition performance and retrieval performance are correlated to a certain extent, and also that improving the retrieval method can compensate for part of the error introduced by speech recognition. The results provide a basis for further targeted improvements to the recognition engine, the representation of speech recognition output, and the corresponding fast retrieval methods.
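
The relationship studied here can be measured by running the same keyword queries over the reference transcripts and over the ASR output and comparing the hit sets. Below is a minimal evaluation sketch with hypothetical transcripts and a hypothetical query; it is not the paper's retrieval system.

```python
def keyword_hits(transcripts, keyword):
    """IDs of utterances whose transcript text contains the keyword."""
    return {uid for uid, text in transcripts.items() if keyword in text}

def retrieval_scores(reference, asr_output, keyword):
    """Precision/recall of keyword retrieval on ASR text, judged against the reference."""
    truth = keyword_hits(reference, keyword)
    found = keyword_hits(asr_output, keyword)
    precision = len(truth & found) / len(found) if found else 0.0
    recall = len(truth & found) / len(truth) if truth else 0.0
    return precision, recall

reference = {1: "天气 预报 说 明天 下雨", 2: "股票 市场 行情"}
asr_output = {1: "天气 预报 说 明天 下雨", 2: "股票 试场 行情"}  # recognition error in utterance 2
print(retrieval_scores(reference, asr_output, "市场"))  # the ASR error loses the hit
```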

14.
Recently, many audio-visual devices have been developed in the expanding field of virtual reality and the diversification of mass media. In these applications, the visual and auditory information is very important, because humans get environmental information mainly from their visual and auditory senses. In order to achieve highly realistic performance, human audio-visual characteristics and their mechanisms need to be investigated. In this study, the phase discrimination thresholds between visual and auditory stimuli are measured by changing the temporal frequency of the stimuli. Any periodicity can be expressed using a summation of several frequencies. By using frequency, the results can be applied to any stimulus and therefore have a wider application than the step function. The experimental results suggest that the phase discrimination thresholds increase as the temporal frequency of the stimuli increases. In general, it is not possible to distinguish the phase difference if the temporal frequency is higher than 3 Hz. The experimental results can be applied to effective productions in virtual reality, drama, and the mass media, and can contribute to elucidating human audio-visual mechanisms. This work was presented, in part, at the Fourth International Symposium on Artificial Life and Robotics, Oita, Japan, January 19–22, 1999.

15.
A survey of text-to-visual speech synthesis
Visual information is very important for understanding the content of speech. Not only the hearing-impaired but also ordinary people rely on lip reading to some degree during conversation, especially in noisy environments where the speech quality is degraded. Just as a text-to-speech system lets a computer talk like a person, a text-to-visual-speech synthesis system lets a computer mimic the bimodal nature of human speech, making the computer interface friendlier. This paper reviews the development of text-to-visual speech synthesis. Text-driven visual speech synthesis approaches fall into two categories: parameter-control-based methods and data-driven methods. Several key issues in the parameter-control category and several different implementations in the data-driven category are described in detail, and the advantages, disadvantages, and suitable application settings of the two categories are compared.

16.
Objective: The visual and auditory streams of a video are two co-occurring modalities that complement each other and arise simultaneously, which naturally provides a self-supervised signal. As contrastive learning has achieved strong results in the visual domain, applying this self-supervised representation-learning paradigm to the audio-visual multimodal setting has attracted great interest from researchers. This work focuses on building an efficient audio-visual negative-sample space to strengthen the audio-visual feature fusion ability of contrastive learning. Method: An audio-visual adversarial contrastive learning method for multimodal self-supervised feature fusion is proposed: 1) visual and auditory adversarial negative-sample sets are introduced to construct the audio-visual negative-sample space; 2) adversarial contrastive learning is performed both across and within modalities, so that the visual and auditory adversarial negatives in this space keep tracking the audio-visual samples that are hardest to distinguish, which effectively promotes self-supervised audio-visual feature fusion. On this basis, the audio-visual adversarial contrastive learning framework is further simplified. Results: The method is trained on a subset of the Kinetics-400 dataset to obtain audio-visual features, which are then used for action recognition and audio classification tasks with good results. Specifically, on the action recognition datasets UCF-101 and HMDB-51, compared with Cross-AVID (cross-audio visual instance discrimination…
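
The core of the cross-modal contrastive objective is an InfoNCE loss in which, per the description above, the negative set additionally contains learnable "adversarial" embeddings that are pushed toward the hardest-to-distinguish samples. Below is a minimal PyTorch sketch of the cross-modal term with such a learnable negative bank, updated by ascending the same loss; the dimensions, temperature, step size, and the intra-modal term are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def cross_modal_nce(video_emb, audio_emb, adv_negatives, temperature=0.1):
    """InfoNCE: match each video clip to its own audio against real + adversarial negatives."""
    v = F.normalize(video_emb, dim=1)            # (B, D)
    a = F.normalize(audio_emb, dim=1)            # (B, D)
    neg = F.normalize(adv_negatives, dim=1)      # (K, D) learnable adversarial negatives
    logits = torch.cat([v @ a.t(), v @ neg.t()], dim=1) / temperature  # (B, B + K)
    targets = torch.arange(v.size(0), device=v.device)  # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

B, D, K = 8, 128, 16
video_emb, audio_emb = torch.randn(B, D), torch.randn(B, D)
adv_negatives = torch.randn(K, D, requires_grad=True)

loss = cross_modal_nce(video_emb, audio_emb, adv_negatives)
loss.backward()
# Encoders would descend this loss; the adversarial negatives ascend it to stay hard.
with torch.no_grad():
    adv_negatives += 0.1 * adv_negatives.grad
    adv_negatives.grad.zero_()
```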

17.
Since bimodal audio-visual speech recognition can effectively improve recognition rates in noisy environments, this paper designs an experimental in-vehicle voice-command recognition system. The system simulates the in-vehicle environment and incorporates video of the speaker into the speech recognition system; it consists of three parts: model training, offline recognition, and online recognition. Online recognition uses speech as the human-machine interaction channel throughout and supports user adaptation. The offline recognition part collects hierarchical statistics over the data produced by the system, making it well suited to research on bimodal speech recognition algorithms.

18.
Existing audio-visual eye-fixation (saliency) prediction methods use a two-stream structure to extract audio and visual features separately and then fuse them to produce the final prediction map. However, the audio and the visual content in a dataset are not always related, so fusing the audio and visual features directly when they are inconsistent lets the audio information degrade the visual features. To address this, this paper proposes an audio-visual fixation prediction network based on audio-visual consistency (Audio-visual Consistency Network, AVCN). To verify its reliability, AVCN is added on top of an existing audio-visual fixation prediction model: it makes a binary consistency decision on the extracted audio and visual features; when the two are consistent, the fused audio-visual feature is output as the final prediction, otherwise the visually dominated feature is output as the final result. Experiments on six public datasets show that adding the AVCN module improves the overall metrics.
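
The gating rule described above reduces to a binary switch on a predicted consistency score: use the fused audio-visual feature when audio and video agree, and fall back to the visual feature otherwise. Below is a minimal PyTorch sketch of such a gate with an illustrative consistency head; it is not the AVCN architecture itself.

```python
import torch
import torch.nn as nn

class ConsistencyGate(nn.Module):
    """Choose fused audio-visual features when the modalities agree, else visual only."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)         # simple fusion of the two features
        self.consistency = nn.Linear(2 * dim, 1)    # audio-visual consistency score

    def forward(self, visual_feat, audio_feat):
        pair = torch.cat([visual_feat, audio_feat], dim=-1)
        fused = self.fuse(pair)
        # Hard binary decision shown for clarity; a soft gate would be used for end-to-end training.
        consistent = (torch.sigmoid(self.consistency(pair)) > 0.5).float()
        return consistent * fused + (1.0 - consistent) * visual_feat

gate = ConsistencyGate(dim=256)
out = gate(torch.randn(4, 256), torch.randn(4, 256))   # (batch, dim)
```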

19.
Recognizing Human Emotional State From Audiovisual Signals
Machine recognition of human emotional state is an important component for efficient human-computer interaction. The majority of existing works address this problem by utilizing audio signals alone, or visual information only. In this paper, we explore a systematic approach for recognition of human emotional state from audiovisual signals. The audio characteristics of emotional speech are represented by the extracted prosodic, Mel-frequency Cepstral Coefficient (MFCC), and formant frequency features. A face detection scheme based on HSV color model is used to detect the face from the background. The visual information is represented by Gabor wavelet features. We perform feature selection by using a stepwise method based on Mahalanobis distance. The selected audiovisual features are used to classify the data into their corresponding emotions. Based on a comparative study of different classification algorithms and specific characteristics of individual emotion, a novel multiclassifier scheme is proposed to boost the recognition performance. The feasibility of the proposed system is tested over a database that incorporates human subjects from different languages and cultural backgrounds. Experimental results demonstrate the effectiveness of the proposed system. The multiclassifier scheme achieves the best overall recognition rate of 82.14%.
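
The stepwise selection mentioned above can be sketched as greedy forward selection that, at each step, adds the feature maximizing the between-class Mahalanobis distance of the selected subset. A minimal two-class sketch follows, with hypothetical data; the paper's multi-class criterion and stopping rule are not reproduced.

```python
import numpy as np

def mahalanobis_between_classes(X0, X1):
    """Squared Mahalanobis distance between two class means under a pooled covariance."""
    diff = X0.mean(axis=0) - X1.mean(axis=0)
    pooled = 0.5 * (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False))
    pooled = np.atleast_2d(pooled) + 1e-6 * np.eye(diff.size)   # regularize for stability
    return float(diff @ np.linalg.solve(pooled, diff))

def stepwise_select(X0, X1, k):
    """Greedy forward selection of k feature indices maximizing class separation."""
    selected, remaining = [], list(range(X0.shape[1]))
    for _ in range(k):
        best = max(remaining,
                   key=lambda j: mahalanobis_between_classes(X0[:, selected + [j]],
                                                             X1[:, selected + [j]]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical audiovisual features for two emotion classes.
rng = np.random.default_rng(0)
X0, X1 = rng.normal(size=(50, 10)), rng.normal(size=(50, 10)) + 0.8
print(stepwise_select(X0, X1, k=3))
```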

20.
A class of audio-visual data (fiction entertainment: movies, TV series) is segmented into scenes, which contain dialogs, using a novel hidden Markov model (HMM) based method. Each shot is classified using both the audio track (via classification of speech, silence, and music) and the visual content (face and location information). The result of this shot-based classification is an audio-visual token to be used by the HMM state diagram to achieve scene analysis. After simulations with circular and left-to-right HMM topologies, it is observed that both perform very well with multi-modal inputs. Moreover, for the circular topology, comparisons between different training and observation sets show that audio and face information together give the most consistent results across observation sets.
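
Once each shot is reduced to a discrete audio-visual token, scene analysis amounts to decoding the most likely state sequence of a discrete-observation HMM over those tokens. Below is a minimal Viterbi sketch with an illustrative two-state model and a tiny token alphabet; the paper's actual topologies and probabilities are not reproduced.

```python
import numpy as np

def viterbi(tokens, log_start, log_trans, log_emit):
    """Most likely HMM state sequence for a sequence of discrete observation tokens."""
    delta = log_start + log_emit[:, tokens[0]]
    back = []
    for t in tokens[1:]:
        scores = delta[:, None] + log_trans          # scores[from_state, to_state]
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + log_emit[:, t]
    path = [int(delta.argmax())]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return path[::-1]

# Illustrative 2-state model ("dialog" vs. "non-dialog") over a 3-symbol shot-token alphabet.
log_start = np.log([0.5, 0.5])
log_trans = np.log([[0.8, 0.2], [0.3, 0.7]])
log_emit = np.log([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 0, 2, 2, 1], log_start, log_trans, log_emit))
```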
