3D Visual Understanding Oriented Towards Multimodal Interactive Fusion and Progressive Refinement
Citation: He Hongtian, Chen Han, Liu Yang, Zhou Liliang, Zhang Min, Lei Yinjie. 3D visual understanding oriented towards multimodal interactive fusion and progressive refinement[J]. Application Research of Computers, 2024, 41(5).
Authors: He Hongtian  Chen Han  Liu Yang  Zhou Liliang  Zhang Min  Lei Yinjie
Affiliation: Sichuan University (all authors)
Fund: General Program of the National Natural Science Foundation of China (62276176)
Abstract: 3D visual understanding aims to intelligently perceive and interpret 3D scenes, achieving a deep understanding and analysis of objects, environments, and dynamic changes. 3D object detection, as its core technology, plays an indispensable role. To address the low detection accuracy of current 3D detection algorithms on distant and small objects, this paper proposed a 3D object detection method oriented towards multimodal interactive fusion and progressive refinement, called MIFPR. In the feature extraction stage, the method first introduced an adaptive gated information fusion module, which incorporated the geometric features of the point cloud into the image features to obtain an image representation that is more discriminative under varying lighting conditions. It then proposed a voxel centroid-based deformable cross-modal attention module to drive the fusion of the rich semantic features and contextual information in the images into the point cloud features. In the proposal refinement stage, it introduced a progressive attention module that learned and aggregated features from different stages, continuously strengthening the model's ability to extract and model fine-grained features and progressively refining the bounding boxes, thereby improving the detection accuracy of distant and small objects and, in turn, the overall capability for visual scene understanding. On the KITTI dataset, the proposed method clearly improves the detection accuracy of small objects such as those in the Pedestrian and Cyclist categories over the best-performing baseline, confirming its effectiveness.

Keywords: 3D visual understanding  multimodal  interactive fusion  progressive attention  object detection
Received: 2023-08-08
Revised: 2024-04-24
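The adaptive gated information fusion described in the abstract (injecting point-cloud geometric features into image features through a learned gate) can be illustrated with a minimal PyTorch-style sketch. The module name GatedFusion, the tensor shapes, and the per-channel sigmoid gate below are illustrative assumptions made for exposition, not the authors' released implementation.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # A minimal sketch of adaptive gated fusion: point-cloud geometric features
    # modulate image features through a learned channel-wise gate (assumed design).
    def __init__(self, img_channels, geo_channels):
        super().__init__()
        # Project point-cloud geometric features into the image feature space.
        self.geo_proj = nn.Linear(geo_channels, img_channels)
        # The gate predicts, per channel, how much geometric information to inject.
        self.gate = nn.Sequential(
            nn.Linear(img_channels * 2, img_channels),
            nn.Sigmoid(),
        )

    def forward(self, img_feat, geo_feat):
        # img_feat: (N, C_img) image features sampled at projected point locations
        # geo_feat: (N, C_geo) geometric features of the corresponding points
        geo = self.geo_proj(geo_feat)
        gate = self.gate(torch.cat([img_feat, geo], dim=-1))
        # Gated mixture of image and geometric cues.
        return gate * img_feat + (1.0 - gate) * geo

# Example usage with arbitrary feature sizes:
fusion = GatedFusion(img_channels=64, geo_channels=16)
fused = fusion(torch.randn(1024, 64), torch.randn(1024, 16))  # -> shape (1024, 64)

A gate of this form lets the network suppress unreliable image channels (e.g., under poor lighting) and lean on geometric cues instead, which matches the motivation stated in the abstract; the exact gating and projection layers used in MIFPR may differ.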
