Visual and textual based multimodal document object detection

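Cite this article:	Li Yuteng, Shi Cao, Xu Canhui, Cheng Yuanzhi. Visual and textual based multimodal document object detection[J]. Application Research of Computers, 2023, 40(5).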

Authors:	Li Yuteng, Shi Cao, Xu Canhui, Cheng Yuanzhi

Affiliation:	School of Information Science and Technology, Qingdao University of Science and Technology (all four authors)

Funding:	National Natural Science Foundation of China (61806107, 61702135)

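Abstract:	Document images have complex layouts and an uneven distribution of object sizes, and existing detection algorithms rarely consider multimodal information or global dependencies. To address this, a multimodal document image object detection method based on vision and text is proposed. First, fusion strategies for multimodal features are explored: to make use of textual features, the text sequence information in the image is converted into a two-dimensional representation; after an initial fusion of the text and visual features, the result is fed into a backbone network to extract multiscale features, and textual features are merged in repeatedly during extraction to achieve deep fusion of the multimodal features. To ensure detection accuracy for both small and large objects, a pyramid network is designed whose lateral connections concatenate the upsampled feature maps with the bottom-up feature maps along the channel dimension, propagating high-level semantic information and low-level feature information. Experiments on the large public dataset PubLayNet show that the method reaches a detection accuracy of 95.86%, higher than that of other detection methods. The method not only achieves deep fusion of multimodal features but also enriches the fused multimodal feature information, and has good detection performance.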
Keywords:	multimodal; document image; object detection; deep learning
Received:	2022/8/4
Revised:	2023/4/12
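
The abstract says that text sequence information is converted into a two-dimensional representation and then fused with visual features, but it does not say how. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes OCR words with pixel bounding boxes, a hypothetical random embedding table standing in for learned text embeddings, and channel concatenation as the initial fusion step. The names `text_to_grid` and `fuse` are illustrative only.

```python
import numpy as np

def text_to_grid(words, height, width, embed_dim=8, seed=0):
    """Paint each word's embedding into the pixels covered by its box.

    words: list of (text, (x0, y0, x1, y1)) in pixel coordinates.
    Returns an array of shape (embed_dim, height, width); background stays zero.
    """
    rng = np.random.default_rng(seed)
    vocab = {}  # hypothetical stand-in for a learned embedding table
    grid = np.zeros((embed_dim, height, width), dtype=np.float32)
    for text, (x0, y0, x1, y1) in words:
        if text not in vocab:
            vocab[text] = rng.standard_normal(embed_dim).astype(np.float32)
        # Broadcast the (embed_dim,) vector over the box region.
        grid[:, y0:y1, x0:x1] = vocab[text][:, None, None]
    return grid

def fuse(image, text_grid):
    """Initial fusion by channel concatenation (an assumed choice)."""
    return np.concatenate([image, text_grid], axis=0)

image = np.zeros((3, 32, 48), dtype=np.float32)  # dummy RGB page, (C, H, W)
words = [("Abstract", (2, 2, 20, 6)), ("Table", (5, 10, 15, 14))]
text_grid = text_to_grid(words, 32, 48)
fused = fuse(image, text_grid)
print(fused.shape)  # (11, 32, 48)
```

Because the text grid shares the image's spatial layout, it can be resized and merged again at later backbone stages, which is one way the repeated fusion mentioned in the abstract could be realized.
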
Visual and textual based multimodal document object detection

Li Yuteng, Shi Cao, Xu Canhui and Cheng Yuanzhi. Visual and textual based multimodal document object detection[J]. Application Research of Computers, 2023, 40(5).

Authors:	Li Yuteng, Shi Cao, Xu Canhui and Cheng Yuanzhi

Affiliation:	School of Information Science and Technology, Qingdao University of Science and Technology

Abstract:	Document images have complex layouts and unevenly distributed object sizes, yet most current detection methods ignore multimodal information and global dependencies. This paper therefore proposes a multimodal document object detection method based on vision and text. First, the method explores fusion strategies for multimodal features. To make use of textual features, it converts the text sequence information in the image into a two-dimensional representation. After an initial fusion of textual and visual features, it feeds the fused features into a backbone network to extract multiscale features, and it repeatedly integrates textual features during extraction to achieve deep fusion of the multimodal features. Next, to ensure detection accuracy for both small and large objects, the paper designs a pyramid network. Its lateral connections concatenate, along the channel dimension, feature maps of the same spatial size from the bottom-up and top-down pathways, so that high-level semantic information and low-level feature information propagate between levels. Experimental results on the large public dataset PubLayNet show that the method reaches a detection accuracy of 95.86%, higher than that of other methods. The method not only achieves deep fusion of multimodal features but also enriches the fused multimodal feature information, and it delivers good detection performance.

Keywords:	multimodal; document image; object detection; deep learning
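
The pyramid network described in the abstract merges pathways by channel concatenation rather than the element-wise addition used in the standard feature pyramid network. The sketch below illustrates only that concatenation step, under stated assumptions: nearest-neighbour 2x upsampling, three levels that each halve the resolution, and no convolutions after merging. The function names and the channel counts are hypothetical and are not taken from the paper.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_concat(bottom_up):
    """Merge bottom-up maps along a top-down pathway by concatenation.

    bottom_up: list of (C_i, H_i, W_i) maps, finest first, where each
    level has half the spatial size of the previous one.
    Returns the merged maps, finest first.
    """
    merged = [bottom_up[-1]]  # the coarsest level starts the top-down pathway
    for feat in reversed(bottom_up[:-1]):
        up = upsample2x(merged[0])  # bring the coarser map to this level's size
        merged.insert(0, np.concatenate([feat, up], axis=0))
    return merged

c3 = np.ones((4, 16, 16), dtype=np.float32)
c4 = np.ones((8, 8, 8), dtype=np.float32)
c5 = np.ones((16, 4, 4), dtype=np.float32)
p3, p4, p5 = top_down_concat([c3, c4, c5])
print(p3.shape, p4.shape, p5.shape)  # (28, 16, 16) (24, 8, 8) (16, 4, 4)
```

Concatenation keeps the low-level and high-level channels separate, so the channel count grows at finer levels. A practical network would presumably follow each merge with a convolution to control the width; that step is omitted here because the abstract does not describe it.
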
