
An image caption generation method based on attention fusion (基于注意力融合的图像描述生成方法)
Citation: MO Hongwei, TIAN Peng. An image caption generation method based on attention fusion[J]. CAAI Transactions on Intelligent Systems (智能系统学报), 2020, 15(4): 740-749.
Authors: MO Hongwei (莫宏伟), TIAN Peng (田朋)
Affiliation: College of Automation, Harbin Engineering University, Harbin 150001, Heilongjiang, China
Abstract: Both the spatial attention mechanism and the high-level semantic attention mechanism can improve image captioning, but extracting spatial attention by directly partitioning the convolutional neural network's features does not accurately capture the features corresponding to the target objects in the image. To improve attention-based image captioning, this paper proposes an image caption model based on attention fusion. The model uses Faster R-CNN (faster region with convolutional neural network) as the encoder, which extracts image features while also detecting the accurate position and the noun attribute of each target object. The noun attribute features then serve as high-level semantic attention and the position features as spatial attention to guide the generation of the word sequence. Experimental results on the COCO dataset show that the attention-fusion model outperforms image caption models based on spatial attention and most mainstream image caption models. Starting from cross-entropy training, the model is further trained with a reinforcement learning method that directly optimizes the image caption evaluation metric, which improves the accuracy of the attention-fusion model.

Keywords: image caption; convolutional neural network; spatial attention; Faster R-CNN; attention mechanism; noun attribute; high-level semantic; reinforcement learning
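
The abstract does not spell out how the two attention streams are combined, so the following is only a minimal sketch of one plausible reading: at each decoding step, the decoder state attends separately over region features (spatial attention) and over noun attribute embeddings (high-level semantic attention), and the two context vectors are concatenated. The dot-product scoring, the concatenation step, and the names `attend` and `fuse_attention` are all assumptions made for illustration, not the authors' implementation; in the paper the features would come from Faster R-CNN and the query from a recurrent decoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(query, items):
    """Soft attention: score each item against the query (dot product,
    an assumption), normalize the scores, and return the weights together
    with the weighted sum of the items."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    context = [
        sum(w * item[d] for w, item in zip(weights, items))
        for d in range(dim)
    ]
    return weights, context


def fuse_attention(hidden, region_feats, attr_embeds):
    """Hypothetical fusion step: attend over both feature sets and
    concatenate the spatial and semantic context vectors."""
    spatial_w, spatial_ctx = attend(hidden, region_feats)
    semantic_w, semantic_ctx = attend(hidden, attr_embeds)
    return spatial_ctx + semantic_ctx, spatial_w, semantic_w


# Toy inputs standing in for a decoder state, two detected regions,
# and three noun attribute embeddings (all values are made up).
hidden = [1.0, 0.0, 0.5]
region_feats = [[0.9, 0.1, 0.3], [0.1, 0.8, 0.2]]
attr_embeds = [[0.7, 0.0, 0.6], [0.0, 0.9, 0.1], [0.2, 0.2, 0.2]]

fused, spatial_w, semantic_w = fuse_attention(hidden, region_feats, attr_embeds)
```

In this toy setup the fused vector would be fed to the word predictor at each step. A trained model would normally replace the raw dot product with learned projections, which are omitted here to keep the sketch dependency-free.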