
Attention-Aware and Semantic-Aware Network for RGB-D Indoor Semantic Segmentation
Cite this article: DUAN Li-Juan, SUN Qi-Chao, QIAO Yuan-Hua, CHEN Jun-Cheng, CUI Guo-Qin. Attention-Aware and Semantic-Aware Network for RGB-D Indoor Semantic Segmentation [J]. Chinese Journal of Computers, 2021, 44(2): 275-291.
Authors: DUAN Li-Juan  SUN Qi-Chao  QIAO Yuan-Hua  CHEN Jun-Cheng  CUI Guo-Qin
Affiliations: Faculty of Information Technology, Beijing University of Technology, Beijing 100124; Beijing Key Laboratory of Trusted Computing, Beijing 100124; National Engineering Laboratory for Key Technologies of Information Security Level Protection, Beijing 100124; Advanced Institute of Information Technology, Peking University, Hangzhou 311200; College of Applied Sciences, Beijing University of Technology, Beijing 100124; State Key Laboratory of Digital Multimedia Chip Technology, Vimicro Corporation, Beijing 100191
Funding: This work was supported by the National Key Research and Development Program of China, the Major Science and Technology Innovation Program of Hangzhou, and the joint funding program of the Beijing Natural Science Foundation and the Beijing Municipal Education Commission.
Abstract: In recent years, fully convolutional networks have substantially improved the accuracy of semantic segmentation. However, owing to the complexity of indoor environments, semantic segmentation of indoor scenes remains a challenging problem. With the advent of depth sensors, researchers have begun to exploit depth information to improve segmentation performance. Most previous studies simply fuse RGB features and depth features with equal-weight concatenation or summation, and thus fail to make full use of the complementary information between the two modalities. This paper proposes ASNet (Attention-aware and Semantic-aware Network), which effectively fuses multi-level RGB and depth features by introducing an attention-aware multi-modal fusion block and a semantic-aware multi-modal fusion block. In the attention-aware multi-modal fusion block, a cross-modal attention mechanism is designed so that RGB features and depth features guide and refine each other using their complementary information, extracting feature representations rich in spatial location information. The semantic-aware multi-modal fusion block models the semantic dependencies between the multi-modal features by integrating semantically related RGB and depth feature channels, extracting more precise semantic representations. The two fusion blocks are integrated into a two-branch encoder-decoder network with skip connections. During training, a deep supervision strategy is adopted, performing supervised learning at multiple decoding layers. Experimental results on public datasets show that the proposed algorithm outperforms existing RGB-D image semantic segmentation algorithms, improving mean accuracy and mean IoU over recent methods by 1.9% and 1.2%, respectively.
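To make the cross-modal attention idea above concrete, the following is a minimal PyTorch-style sketch of an attention-aware fusion step in which each modality produces a spatial gate for the other. The module name AttentionAwareFusion, the 1x1-convolution gates, and the residual summation are illustrative assumptions, not the exact AMF block published in the paper.

```python
# Minimal sketch of cross-modal, attention-aware fusion (assumed layout, for illustration only).
import torch
import torch.nn as nn


class AttentionAwareFusion(nn.Module):
    """RGB and depth features reweight each other with single-channel spatial attention maps."""

    def __init__(self, channels: int):
        super().__init__()
        # Each modality predicts a spatial attention map (values in [0, 1]) for the other one.
        self.rgb_to_depth_gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.depth_to_rgb_gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # Depth guides RGB and RGB guides depth; residual terms keep modality-specific information.
        rgb_refined = rgb_feat * self.depth_to_rgb_gate(depth_feat) + rgb_feat
        depth_refined = depth_feat * self.rgb_to_depth_gate(rgb_feat) + depth_feat
        # Fuse the mutually refined features (summation chosen here for simplicity).
        return rgb_refined + depth_refined


if __name__ == "__main__":
    block = AttentionAwareFusion(channels=64)
    rgb = torch.randn(1, 64, 60, 80)
    depth = torch.randn(1, 64, 60, 80)
    print(block(rgb, depth).shape)  # torch.Size([1, 64, 60, 80])
```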

Keywords: RGB-D semantic segmentation  convolutional neural network  multi-modal fusion  attention model  deep learning

Attention-Aware and Semantic-Aware Network for RGB-D Indoor Semantic Segmentation
DUAN Li-Juan, SUN Qi-Chao, QIAO Yuan-Hua, CHEN Jun-Cheng, CUI Guo-Qin. Attention-Aware and Semantic-Aware Network for RGB-D Indoor Semantic Segmentation [J]. Chinese Journal of Computers, 2021, 44(2): 275-291.
Authors:DUAN Li-Juan  SUN Qi-Chao  QIAO Yuan-Hua  CHEN Jun-Cheng  CUI Guo-Qin
Affiliation: (Faculty of Information Technology, Beijing University of Technology, Beijing 100124; Beijing Key Laboratory of Trusted Computing, Beijing 100124; National Engineering Laboratory for Key Technologies of Information Security Level Protection, Beijing 100124; Advanced Institute of Information Technology, Peking University, Hangzhou 311200; College of Applied Sciences, Beijing University of Technology, Beijing 100124; State Key Laboratory of Digital Multimedia Chip Technology, Vimicro Corporation, Beijing 100191)
Abstract: Semantic segmentation is a research hotspot in the field of computer vision. It refers to assigning every pixel in an image to a semantic class. As a fundamental problem in scene understanding, semantic segmentation is widely used in various intelligent tasks. In recent years, with the success of convolutional neural networks (CNN) in many computer vision applications, fully convolutional networks (FCN) have shown great potential on the RGB semantic segmentation task. However, semantic segmentation remains challenging due to the complexity of scene types, severe object occlusions, and varying illumination. With the availability of consumer RGB-D sensors such as the RealSense 3D camera and Microsoft Kinect, RGB images and depth information can now be captured at the same time. Depth information describes 3D geometric structure that may be missing from RGB-only images, and it can significantly reduce classification errors and improve segmentation accuracy. To make effective use of RGB information and depth information, an efficient multi-modal fusion method is crucial. According to the fusion stage, current RGB-D feature fusion methods can be divided into three types: early fusion, late fusion, and middle fusion. However, most previous studies fail to exploit the complementary information between RGB information and depth information: they simply fuse RGB features and depth features with equal-weight concatenation or summation, which fails to extract the complementary information between the two modalities and suppresses modality-specific information. In addition, the semantic information carried by the high-level features of the different modalities is not taken into account, even though it is very important for fine-grained semantic segmentation. To solve these problems, this paper presents a novel Attention-aware and Semantic-aware Multi-modal Fusion Network (ASNet) for RGB-D semantic segmentation. The network effectively fuses multi-level RGB-D features through Attention-aware Multi-modal Fusion blocks (AMF) and Semantic-aware Multi-modal Fusion blocks (SMF). Specifically, in the Attention-aware Multi-modal Fusion blocks, a cross-modal attention mechanism is designed so that RGB features and depth features guide and optimize each other through their complementary characteristics, yielding feature representations rich in spatial location information. The Semantic-aware Multi-modal Fusion blocks model the semantic interdependencies between multi-modal features by integrating semantically associated feature channels of the RGB and depth features, extracting more precise semantic representations. The two blocks are integrated into a two-branch encoder-decoder architecture that gradually restores image resolution through consecutive up-sampling operations and combines low-level and high-level features through skip connections to achieve high-resolution prediction. To optimize the training process, deep supervision is applied over multi-level decoding features. The network thus effectively learns the complementary characteristics of the two modalities and models the semantic context interdependencies between RGB features and depth features. Experimental results on two challenging public RGB-D indoor semantic segmentation datasets, SUN RGB-D and NYU Depth v2, show that our network outperforms existing RGB-D semantic segmentation methods, improving mean accuracy and mean IoU by 1.9% and 1.2%, respectively.
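As a companion to the earlier attention-aware sketch, the following illustrates what a semantic-aware fusion step could look like: a squeeze-and-excitation-style channel gate over the concatenated RGB and depth channels, followed by a 1x1 projection back to a single stream. The class name SemanticAwareFusion, the reduction ratio, and the projection layer are assumptions for illustration, not the published SMF block.

```python
# Minimal sketch of semantic-aware, channel-wise fusion (assumed layout, for illustration only).
import torch
import torch.nn as nn


class SemanticAwareFusion(nn.Module):
    """Reweight concatenated RGB/depth channels using global (image-level) statistics."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        fused = 2 * channels  # RGB channels + depth channels after concatenation
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                              # global context per channel
            nn.Conv2d(fused, fused // reduction, kernel_size=1),  # squeeze
            nn.ReLU(inplace=True),
            nn.Conv2d(fused // reduction, fused, kernel_size=1),  # excite
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(fused, channels, kernel_size=1)  # back to a single fused stream

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb_feat, depth_feat], dim=1)
        x = x * self.channel_gate(x)  # emphasize semantically related channels across both modalities
        return self.project(x)
```

In a full encoder-decoder network, one such block would typically sit at each encoder level, with the fused feature passed to the decoder through a skip connection.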
Keywords:RGB-D semantic segmentation  convolutional neural network  multi-modal fusion  attention model  deep learning