
An image caption generation method based on attention fusion

Cite as:	MO Hongwei, TIAN Peng. An image caption generation method based on attention fusion[J]. CAAI Transactions on Intelligent Systems (智能系统学报), 2020, 15(4): 740-749.

Authors:	MO Hongwei (莫宏伟), TIAN Peng (田朋)

Affiliation:	College of Automation, Harbin Engineering University, Harbin 150001, Heilongjiang, China

Abstract:	Both the spatial attention mechanism and the high-level semantic attention mechanism can improve image captioning, but obtaining spatial attention by directly partitioning the convolutional neural network's feature maps cannot accurately extract the features that correspond to the target objects in an image. To improve attention-based image captioning, this paper proposes an image caption model based on attention fusion. The model uses Faster R-CNN (faster region-based convolutional neural network) as the encoder, which extracts image features and at the same time detects the precise position and the noun attributes of each target object. These features then serve as high-level semantic attention (the noun attributes) and spatial attention (the object positions) to guide the generation of the word sequence. Experimental results on the COCO dataset show that the attention-fusion model outperforms the image caption model based on spatial attention alone as well as most mainstream image caption models. After training with cross entropy, the model is further trained with a reinforcement learning method that directly optimizes the image caption evaluation metric, which improves the accuracy of the attention-fusion model.

Keywords:	image caption; convolutional neural network; spatial attention; Faster R-CNN; attention mechanism; noun attribute; high-level semantic; reinforcement learning
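
The two ideas in the abstract, fusing spatial and semantic attention and then fine-tuning with reinforcement learning, can be illustrated with a small plain-Python sketch. This is not the authors' implementation. It assumes dot-product soft attention, fusion by simple concatenation, and a self-critical (greedy-baseline) policy-gradient loss for the reinforcement learning stage, none of which the abstract specifies. The function names (`attend`, `fuse_attention`, `self_critical_loss`) and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, items):
    """Soft attention: score each item against the query, return the
    weighted sum of the items (the context vector) and the weights.
    Assumption: query and items share one dimension; a real model would
    typically use learned projections instead of a raw dot product."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    context = [sum(w * item[d] for w, item in zip(weights, items))
               for d in range(dim)]
    return context, weights


def fuse_attention(hidden, region_feats, attribute_embs):
    """Attend separately over detected-region features (spatial attention)
    and attribute-word embeddings (semantic attention), then fuse the two
    context vectors by concatenation (an assumed fusion rule)."""
    spatial_ctx, spatial_w = attend(hidden, region_feats)
    semantic_ctx, semantic_w = attend(hidden, attribute_embs)
    return spatial_ctx + semantic_ctx, spatial_w, semantic_w


def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """Policy-gradient loss with the greedy caption's score as baseline.
    Rewards stand in for a caption metric score; minimizing the loss raises
    the log-probability of sampled captions that beat the greedy caption."""
    advantage = sample_reward - greedy_reward
    return -advantage * sum(sample_logprobs)


if __name__ == "__main__":
    hidden = [1.0, 0.0]                       # decoder state (toy values)
    regions = [[2.0, 0.0], [0.0, 2.0]]        # two detected regions
    attributes = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]  # three attribute words
    fused, spatial_w, semantic_w = fuse_attention(hidden, regions, attributes)
    print("fused context:", fused)
    print("spatial weights:", spatial_w)
    print("loss:", self_critical_loss([-0.5, -0.25], 1.2, 1.0))
```

In a full captioning model the fused context would be fed to the decoder at every time step, and the rewards would come from scoring the sampled and greedily decoded captions with the chosen evaluation metric.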
