
Video Captioning Method Based on Visual Feature Guided Fusion

Cite this article:	MIAO Jiaowei, JI Yi, LIU Chunping. Video Captioning Method Based on Visual Feature Guided Fusion[J]. Computer Engineering and Applications, 2022, 58(20): 124-131.

Authors:	MIAO Jiaowei, JI Yi, LIU Chunping

Affiliation:	School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China

Abstract:	Video captioning has become a research hotspot in recent years because of its wide range of potential applications. To address the recognition errors that appear in captions when visual features and text features interact too little during decoding, a multi-feature fusion video captioning method is proposed that strengthens the interaction between visual and text features within the encoder-decoder framework. During decoding, the method uses visual features to help guide caption generation, so that each generation step receives visual reference information as well as text information. This steers the model toward more accurate words and substantially improves the quality of the generated captions. In addition, recurrent dropout is applied to alleviate overfitting in the decoder, which further improves the evaluation scores. Ablation and comparison experiments on the MSVD and MSRVTT datasets, both widely used in this field, show that the proposed method generates video captions effectively, with the overall metric increasing by 17.2 and 2.1 percentage points, respectively.

Keywords:	encoder-decoder framework; video captioning; feature fusion; dropout; feature interaction
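
The abstract names two mechanisms without giving their equations: visual features that guide every decoding step, and recurrent dropout in the decoder. The following is a minimal sketch of how such a step could look, not the authors' implementation. Everything in it is an assumption made for illustration: the sigmoid gate used for fusion, the plain tanh recurrent cell standing in for whatever decoder the paper uses, the choice of a single dropout mask sampled once per sequence and reused at every step, and all function and parameter names.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def gated_fusion(visual, text, W_gate):
    """Assumed fusion rule: a per-dimension gate mixes the visual and text vectors.

    Both vectors must have the same length; the gate sees their concatenation.
    """
    gate = [sigmoid(z) for z in matvec(W_gate, visual + text)]
    return [g * v + (1.0 - g) * t for g, v, t in zip(gate, visual, text)]


def sample_recurrent_mask(size, drop_prob, rng):
    """Assumed recurrent dropout: one inverted-dropout mask per sequence."""
    keep = 1.0 - drop_prob
    return [(1.0 / keep) if rng.random() < keep else 0.0 for _ in range(size)]


def decoder_step(hidden, visual, text, params, mask=None):
    """One step of a tanh recurrent cell (a stand-in for an LSTM or GRU)."""
    if mask is not None:
        # The same mask is applied to the hidden state at every time step.
        hidden = [h * m for h, m in zip(hidden, mask)]
    fused = gated_fusion(visual, text, params["W_gate"])
    pre = [a + b for a, b in zip(matvec(params["W_in"], fused),
                                 matvec(params["W_rec"], hidden))]
    return [math.tanh(p) for p in pre]


def run_decoder(visual_steps, text_steps, hidden_size, drop_prob=0.0, seed=0):
    """Run the decoder over paired visual and text vectors, one pair per step."""
    rng = random.Random(seed)
    dim = len(visual_steps[0])
    params = {
        "W_gate": init_matrix(dim, 2 * dim, rng),
        "W_in": init_matrix(hidden_size, dim, rng),
        "W_rec": init_matrix(hidden_size, hidden_size, rng),
    }
    mask = sample_recurrent_mask(hidden_size, drop_prob, rng) if drop_prob > 0 else None
    hidden = [0.0] * hidden_size
    states = []
    for visual, text in zip(visual_steps, text_steps):
        hidden = decoder_step(hidden, visual, text, params, mask)
        states.append(hidden)
    return states
```

Calling `run_decoder(visual_steps, text_steps, hidden_size=5, drop_prob=0.3)` with one visual vector and one word-embedding vector per step returns one hidden state per step. In a full captioning model each hidden state would feed a softmax over the vocabulary, and the dropout mask would be used only during training.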
