Multi-Modal Emotion Recognition Based on Transformer-ESIM Attention Mechanism
Citation: XU Zhijing, GAO Shan. Multi-Modal Emotion Recognition Based on Transformer-ESIM Attention Mechanism[J]. Computer Engineering and Applications, 2022, 58(10): 132-138.
Authors: XU Zhijing  GAO Shan
Affiliation: College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
Funding: Aeronautical Science Foundation of China; National Natural Science Foundation of China
Abstract: To improve the accuracy of emotion recognition based on the fusion of speech and text, a multi-modal emotion recognition method based on the Transformer-ESIM (Transformer-enhanced sequential inference model) attention mechanism is proposed. Traditional recurrent neural networks suffer from long-term dependency problems when extracting features from speech and text sequences: their inherently sequential computation cannot capture long-distance features. The multi-head attention mechanism of the Transformer encoder layer is therefore used to process the sequences in parallel, which removes the sequence-distance limitation, fully extracts the emotional semantic information within each sequence, yields deep emotional-semantic encodings of the speech and text sequences, and improves processing speed. The ESIM interactive attention mechanism then computes similarity features between speech and text to align the two modalities, addressing the inter-modal interaction that direct fusion of multi-modal features ignores and improving the model's comprehension and generalization of emotional semantics. The method is evaluated on the IEMOCAP dataset; experimental results show that the emotion classification accuracy reaches 72.6%, a clear improvement on all metrics over other mainstream multi-modal emotion recognition methods.

Keywords: multi-modal emotion recognition; Transformer encoder layer; multi-head attention mechanism; interactive attention
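
The abstract outlines the pipeline but gives no implementation details, so the following is a minimal PyTorch sketch of the described flow, not the authors' exact model. All hyperparameters (d_model=256, 8 heads, 2 encoder layers, 4 emotion classes for IEMOCAP), the class name TransformerESIM, and the input shapes are illustrative assumptions; the enhancement step follows the standard ESIM recipe of concatenating each sequence with its aligned counterpart plus their difference and element-wise product.

```python
# Hedged sketch of the Transformer-ESIM pipeline from the abstract:
# per-modality Transformer encoders -> ESIM interactive attention for
# speech-text alignment -> ESIM enhancement -> pooling -> classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerESIM(nn.Module):  # hypothetical name
    def __init__(self, d_model=256, nhead=8, num_layers=2, num_classes=4):
        super().__init__()
        # Multi-head self-attention encodes each sequence in parallel,
        # avoiding the long-distance limitation of recurrent networks.
        self.speech_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead,
                                       dim_feedforward=4 * d_model,
                                       batch_first=True), num_layers)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead,
                                       dim_feedforward=4 * d_model,
                                       batch_first=True), num_layers)
        # Shared projection after ESIM enhancement (a design assumption).
        self.proj = nn.Sequential(nn.Linear(4 * d_model, d_model), nn.ReLU())
        self.classifier = nn.Linear(4 * d_model, num_classes)

    def forward(self, speech, text):
        # speech: (B, Ls, d_model) frame-level acoustic features
        # text:   (B, Lt, d_model) token embeddings
        a = self.speech_enc(speech)   # deep emotional-semantic encoding
        b = self.text_enc(text)
        # ESIM interactive attention: similarity matrix between modalities.
        e = torch.matmul(a, b.transpose(1, 2))        # (B, Ls, Lt)
        a_hat = torch.matmul(F.softmax(e, dim=2), b)  # text aligned to speech
        b_hat = torch.matmul(F.softmax(e, dim=1).transpose(1, 2), a)
        # ESIM enhancement: original, aligned, difference, product.
        ma = self.proj(torch.cat([a, a_hat, a - a_hat, a * a_hat], dim=-1))
        mb = self.proj(torch.cat([b, b_hat, b - b_hat, b * b_hat], dim=-1))
        # Mean + max pooling over time, fuse both modalities, classify.
        v = torch.cat([ma.mean(1), ma.max(1).values,
                       mb.mean(1), mb.max(1).values], dim=-1)
        return self.classifier(v)

# Illustrative usage with random features (shapes are assumptions):
model = TransformerESIM()
speech = torch.randn(2, 300, 256)  # e.g. 300 acoustic frames per utterance
text = torch.randn(2, 40, 256)     # e.g. 40 word embeddings per transcript
logits = model(speech, text)       # (2, 4) emotion class scores
```

The cross-attention step is what aligns the two modalities before fusion: each speech frame attends over all text tokens (and vice versa), so the fused representation carries inter-modal interaction rather than a simple concatenation of independently encoded features.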
