Audio-visual joint action recognition based on key frame selection network

Citation: CHEN Tingxiu, YIN Jianqin. Audio-visual joint action recognition based on key frame selection network[J]. Journal of Computer Applications, 2022, 42(3): 731-735.
Authors: CHEN Tingxiu, YIN Jianqin
Affiliation: School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Funding: National Natural Science Foundation of China (61673192); Fundamental Research Funds for the Central Universities (2020XD-A04)
Received: 2021-06-11; Revised: 2021-08-13

Abstract: In recent years, action recognition via audio-visual joint learning has received growing attention. In both video (the visual modality) and audio (the auditory modality), an action occurs over a short interval, and only the information within that interval expresses the action category saliently. How to better exploit the salient action information carried by the key frames of the audio-visual modalities is one of the open problems in audio-visual action recognition. To address this problem, a key frame selection network, KFIA-S, was proposed. Through a linear temporal attention mechanism based on a fully connected layer, different weights were assigned to the audio-visual information at each time step, so as to select the audio-visual features beneficial to video classification, reduce redundant information, suppress background interference, and improve action recognition accuracy. The effect of temporal attention of different strengths on action recognition was also studied. Experiments on the ActivityNet dataset show that the KFIA-S network achieves state-of-the-art recognition accuracy, which proves the effectiveness of the proposed method.

Keywords: video action recognition; audio-visual joint learning; temporal attention; deep learning; long short-term memory recurrent neural network
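The following is a minimal PyTorch sketch of the linear temporal attention the abstract describes: a single fully connected layer scores the fused audio-visual feature at each time step, a softmax turns the scores into weights, and the sequence is pooled by those weights. The module name, the concatenation-based fusion, the temperature knob used to vary the attention strength, and all dimensions are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class LinearTemporalAttention(nn.Module):
    # One fully connected layer scores each time step of the fused
    # audio-visual sequence; softmax weights then pool the sequence.
    def __init__(self, feat_dim, temperature=1.0):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)  # scalar score per time step
        self.temperature = temperature       # assumed knob for attention strength

    def forward(self, x):
        # x: (batch, time, feat_dim) fused audio-visual features
        w = self.score(x).squeeze(-1) / self.temperature  # (batch, time)
        w = torch.softmax(w, dim=1)                       # per-step weights
        return (w.unsqueeze(-1) * x).sum(dim=1)           # (batch, feat_dim)

# Hypothetical usage: per-time-step visual and audio features are fused by
# concatenation, attention-pooled, then classified (ActivityNet v1.3 has
# 200 activity classes).
B, T, D_V, D_A, NUM_CLASSES = 4, 32, 512, 128, 200
visual = torch.randn(B, T, D_V)
audio = torch.randn(B, T, D_A)
fused = torch.cat([visual, audio], dim=-1)        # (B, T, D_V + D_A)
attn = LinearTemporalAttention(D_V + D_A, temperature=0.5)
clip_feat = attn(fused)                           # (B, D_V + D_A)
logits = nn.Linear(D_V + D_A, NUM_CLASSES)(clip_feat)

Under this reading, a stronger temporal attention (lower temperature) concentrates the weights on a few key frames, while a weaker one approaches uniform average pooling, which matches the abstract's study of attention at different strengths.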