Residual dense spatial pyramid network for urban remote sensing image segmentation
Citation: Han Binbin, Zhang Yueting, Pan Zongxu, Tai Xianqing, Li Fangfang. Residual dense spatial pyramid network for urban remote sensing image segmentation[J]. Journal of Image and Graphics, 2020, 25(12): 2656-2664.
Authors: Han Binbin, Zhang Yueting, Pan Zongxu, Tai Xianqing, Li Fangfang
Affiliation: Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; Key Laboratory of Technology in Geo-Spatial Information Processing and Application Systems, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
Funding: National Key R&D Program of China (2016YFF0202700); National Natural Science Foundation of China (61701478)

Received: 2019-10-30
Revised: 2020-03-13

Abstract: Objective Remote sensing image semantic segmentation, in which each pixel in an image is classified according to its land cover type, is an important research direction in the field of remote sensing image processing. However, accurately segmenting remote sensing images and extracting their features is difficult due to the wide coverage of these images, the large scale differences among ground objects, and the complex boundaries between them. Meanwhile, traditional remote sensing image processing methods are inefficient, inaccurate, and require considerable expertise. Convolutional neural networks are deep learning networks suited to data with grid structures, such as 1D data with time-series features (e.g., speech) and image data with 2D pixel grids. Given its multi-layer structure, a convolutional neural network can automatically learn features at different levels. Such a network also has two properties that facilitate image processing. First, a convolutional neural network exploits the 2D characteristics of an image during feature extraction: given the high correlation among adjacent pixels, the neuron nodes in the network do not need to connect to all pixels; local connections suffice to extract features. Second, convolution kernel parameters are shared during convolution operations, so features at different positions of an image are computed with the same kernel, thereby greatly reducing the number of model parameters. In this paper, a fully convolutional neural network based on a residual dense spatial pyramid is applied to urban remote sensing image segmentation to achieve accurate semantic segmentation of high-resolution remote sensing images. Method To improve the semantic segmentation precision of high-resolution urban remote sensing images, we first take a 101-layer residual convolutional network as our backbone for extracting remote sensing image feature maps.
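The parameter savings from the local connections and weight sharing described above can be illustrated with a toy count; the sizes here are hypothetical, not taken from the paper:

```python
# Local connections and shared kernels keep convolutional layers small.
h, w = 224, 224          # hypothetical single-channel input size
n_maps = 64              # number of output units / feature maps
k = 3                    # convolution kernel size

# Fully connected alternative: every output unit sees every pixel.
dense_params = h * w * n_maps     # 3,211,264 weights

# Convolutional layer: 64 shared 3x3 kernels, independent of image size.
conv_params = k * k * n_maps      # 576 weights
```

The convolutional layer's weight count does not grow with the image, which is why convolutional backbones scale to high-resolution remote sensing imagery.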
When features are extracted with classic convolutional neural networks, the repeated combination of max-pooling and striding at consecutive layers significantly reduces the spatial resolution of the feature maps, typically by a factor of 32 in each direction in general deep convolutional neural networks (DCNNs), leading to spatial information loss. Semantic segmentation is a pixel-to-pixel mapping task whose prediction granularity reaches the pixel level; reducing the spatial resolution of feature maps therefore discards spatial information that the segmentation needs. To avoid such loss, the proposed model introduces atrous convolution into the residual convolutional neural network. Compared with ordinary convolution, atrous convolution uses a rate parameter r to control the receptive field of the convolution kernel during the calculation. A convolutional neural network with atrous convolution can expand the receptive field of the feature map while keeping the feature map size unchanged, thereby significantly improving the semantic segmentation performance of the proposed model. Objects in remote sensing images often show large scale variations and complex texture features, both of which challenge the accurate encoding of multi-scale high-level features. To extract multi-scale features accurately, the proposed model cascades the branches of a spatial pyramid structure through a dense connection mechanism, which gives each branch an output with denser receptive field information. In the semantic segmentation of remote sensing images, the high-level semantic features extracted by the convolutional neural network are needed to determine the category of each pixel, and low-level texture features are also needed to locate the edges of targets. Low-level texture features thus benefit the reconstruction of object edges during semantic segmentation.
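As a rough illustration of the atrous convolution described above, here is a minimal NumPy sketch (not the paper's implementation, which would use a deep learning framework such as TensorFlow): the rate r dilates a k×k kernel to an effective size k + (k−1)(r−1), enlarging the receptive field without adding parameters, while "same" padding keeps the feature map size unchanged:

```python
import numpy as np

def atrous_conv2d(x, kernel, rate=1):
    """Correlation-style 2D atrous (dilated) convolution with 'same' padding.

    The rate r inserts r-1 zeros between kernel taps, enlarging the
    effective kernel from k to k + (k-1)*(r-1) without adding parameters,
    so the receptive field grows while the output keeps the input's size.
    """
    k = kernel.shape[0]                      # square, odd-sized kernel assumed
    k_eff = k + (k - 1) * (rate - 1)         # effective (dilated) kernel size
    pad = k_eff // 2                         # 'same' zero padding
    xp = np.pad(x, pad)
    h, w = x.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # sample the padded input at dilated offsets
            patch = xp[i:i + k_eff:rate, j:j + k_eff:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
identity = np.zeros((3, 3))
identity[1, 1] = 1.0                         # delta kernel passes input through
y = atrous_conv2d(x, identity, rate=2)       # same spatial size as x
```

With rate=2 the 3×3 kernel covers a 5×5 neighborhood, which is how a dilated backbone widens its receptive field without the downsampling that pooling would require.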
Our proposed model uses a simple decoder to make effective use of both the high-level semantic features and the low-level texture features in the network. The decoder uses skip connections to fuse cross-layer network information, combining high-level semantic features with the underlying texture features. After fusing high- and low-level information, we use two 3×3 convolutions to integrate the information among channels and recover spatial detail. We finally feed the resulting feature map to a softmax classifier for pixel-level classification and obtain the remote sensing image semantic segmentation results. Result Full experiments are performed on the ISPRS (International Society for Photogrammetry and Remote Sensing) remote sensing dataset of the Vaihingen area. We use intersection over union (IoU) and F1 as the indicators for evaluating the segmentation performance of the proposed model, and we build and train our models on an NVIDIA Tesla P100 platform with the TensorFlow deep learning framework. The complexity of the tasks in the experiments increases at each stage. Experimental results show that the proposed model obtains mean IoU (MIoU) and F1 values of 69.88% and 81.39%, respectively, over six types of surface features, a large improvement over a residual convolutional network without atrous convolution. Our method also outperforms SegNet, Res-shuffling-Net, and SDFCN (symmetrical dense-shortcut fully convolutional network) in quantitative metrics and outperforms pix2pix in visual quality, confirming its validity. We then apply the model to the remote sensing image data of the Potsdam area and obtain MIoU and F1 values of 74.02% and 83.86%, respectively, demonstrating the robustness of the model. Conclusion We build an end-to-end deep learning model for the semantic segmentation of remote sensing images of high-resolution urban areas.
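The decoder head described above (skip-connection fusion followed by per-pixel softmax classification) can be sketched schematically in NumPy; the shapes, the single 1×1 projection standing in for the two 3×3 convolutions, and the random weights in place of trained ones are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps at the decoder resolution: high-level
# semantic features and low-level texture features from the skip line.
H, W, n_classes = 8, 8, 6
high = rng.standard_normal((H, W, 256))       # high-level semantics
low = rng.standard_normal((H, W, 48))         # low-level texture

# Skip connection: fuse cross-layer features by channel concatenation.
fused = np.concatenate([high, low], axis=-1)  # (H, W, 304)

# Stand-in for the two 3x3 convolutions: a single 1x1 projection,
# since the point here is the per-pixel classification that follows.
w_cls = rng.standard_normal((fused.shape[-1], n_classes))
logits = fused @ w_cls                        # (H, W, n_classes)

# Per-pixel softmax gives a class distribution at every pixel;
# the argmax over channels is the segmentation map.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
seg_map = probs.argmax(axis=-1)               # (H, W) labels in [0, n_classes)
```

Each pixel ends up with one of six labels, matching the six surface-feature classes evaluated in the experiments.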
By applying an improved spatial pyramid pooling network based on atrous convolution and dense connections, the proposed model effectively extracts multi-scale features from remote sensing images and fuses the high-level semantic information and low-level texture information in the network, which in turn improves the accuracy of remote sensing image segmentation in urban areas. Experimental results show that the proposed model performs well in both quantitative metrics and visual quality and has high application value for the semantic segmentation of high-resolution remote sensing images.
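The IoU and F1 indicators used in the experiments above are computed per class from the predicted and ground-truth label maps and then averaged over classes; a minimal sketch:

```python
import numpy as np

def mean_iou_f1(pred, gt, n_classes):
    """Mean IoU and mean F1 over classes, from predicted and
    ground-truth integer label maps of the same shape."""
    ious, f1s = [], []
    for c in range(n_classes):
        tp = np.sum((pred == c) & (gt == c))   # true positives for class c
        fp = np.sum((pred == c) & (gt != c))   # false positives
        fn = np.sum((pred != c) & (gt == c))   # false negatives
        ious.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0)
    return float(np.mean(ious)), float(np.mean(f1s))

# Tiny two-class example with one mislabeled pixel.
pred = np.array([0, 0, 1, 1])
gt = np.array([0, 1, 1, 1])
miou, mf1 = mean_iou_f1(pred, gt, n_classes=2)
```

MIoU penalizes a mistake through both the offending class and the class it displaces, which is why it is typically lower than mean F1, as in the paper's 69.88% MIoU versus 81.39% F1.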
Keywords:semantic segmentation  remote sensing images  multiscale  residual convolutional network  dense connection