
Enhanced semantic dual decoder generation model for image inpainting
Citation: Wang Qianna, Chen Yi. Enhanced semantic dual decoder generation model for image inpainting[J]. Journal of Image and Graphics, 2022, 27(10): 2994-3009.
Authors: Wang Qianna, Chen Yi
Affiliation: School of Computer and Electronic Information/School of Artificial Intelligence, Nanjing Normal University, Nanjing 210023, China
Funding: National Natural Science Foundation of China (61503188); Natural Science Foundation of Jiangsu Province (BK20180727); CERNET Innovation Project (NGII20180604)
Abstract: Objective Although image inpainting has made considerable progress, when the missing region in an image is large, the non-missing regions provide very limited information, which makes it difficult to generate semantically consistent content and to maintain visual consistency between the repaired image and the real image. In addition, image inpainting commonly adopts a two-stage network structure; models built on this structure not only require long training time but also make the inpainting result strongly dependent on the output of the first stage. To address these problems, this paper proposes a dual-decoder image inpainting method with enhanced semantic consistency. Method A dual-decoder network structure is used to remove the dependency problem of two-stage inpainting methods and to effectively shorten the training time of the model. Consistency loss, perceptual loss, and style loss are employed to better capture the contextual semantic information of the image and to resolve the visual inconsistency that arises in image inpainting. In addition, skip connections are used, and a multi-scale attention module and dilated convolutions are introduced to further improve the feature extraction ability of the network. Result For a fair evaluation, experiments are conducted on images with regular and irregular missing regions on three datasets: CelebA, Stanford Cars, and UCF Google Street View. Objective metrics are used for evaluation: mean squared error (L2), peak signal-to-noise ratio (PSNR), structural similarity (SSIM), Fréchet inception distance (FID), and inception score (IS). The experimental results show that the images repaired by the proposed method are not only visually improved but also achieve better numerical results. For example, with regular missing regions on the CelebA dataset, the FID (lower is better) of the proposed method is 39.2% lower than that of the second-best model; on the UCF Google Street View dataset, the PSNR of the proposed method is 12.64%, 6.77%, and 4.41% higher than that of the other models, respectively. Conclusion The proposed method effectively reduces the training time of the model, eliminates the dependency problem of two-stage network models, and produces repaired images with better visual consistency.
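To make the interaction of the three losses above concrete, the following is a minimal PyTorch sketch (not the authors' released code): a perceptual loss and a style (Gram-matrix) loss computed on pretrained VGG-16 features are combined with an L1 consistency term between matched encoder and decoder feature maps. The layer indices, loss weights, and the exact feature pairing used for the consistency term are illustrative assumptions, and VGG input normalization is omitted for brevity.

```python
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen VGG-16 feature extractor shared by the perceptual and style terms.
vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)


def vgg_features(x, layers=(3, 8, 15)):
    # Collect activations from a few early VGG-16 layers (indices are an assumed choice).
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layers:
            feats.append(x)
        if i == max(layers):
            break
    return feats


def gram(f):
    # Gram matrix of a feature map, normalized by its size (used by the style loss).
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)


def inpainting_loss(pred, target, enc_feat, dec_feat,
                    w_perc=0.05, w_style=120.0, w_cons=1.0):
    # Loss weights are illustrative, not the paper's values.
    fp, ft = vgg_features(pred), vgg_features(target)
    perceptual = sum(F.l1_loss(a, b) for a, b in zip(fp, ft))
    style = sum(F.l1_loss(gram(a), gram(b)) for a, b in zip(fp, ft))
    # Consistency: pull a decoder feature map toward the matching encoder feature map.
    consistency = F.l1_loss(dec_feat, enc_feat.detach())
    return w_perc * perceptual + w_style * style + w_cons * consistency
```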

Keywords: image inpainting  semantic consistency  dual decoder  skip connection  multi-scale attention module
Received: 2021-04-27
Revised: 2021-07-26

Enhanced semantic dual decoder generation model for image inpainting
Wang Qianna, Chen Yi. Enhanced semantic dual decoder generation model for image inpainting[J]. Journal of Image and Graphics, 2022, 27(10): 2994-3009.
Authors: Wang Qianna, Chen Yi
Affiliation: School of Computer and Electronic Information/School of Artificial Intelligence, Nanjing Normal University, Nanjing 210023, China
Abstract: Objective Image inpainting is widely used in computer vision applications such as image and video editing, medical imaging, and public security. When an image contains large missing regions, the non-missing regions provide very limited information, so most existing methods fail to generate semantically plausible content and cannot ensure visual consistency between the repaired image and the real image. The generator's inpainting results often suffer from color discrepancy, blurring, and other artifacts. In addition, model design has become increasingly complex in pursuit of high-quality results, especially with the two-stage network structure: the first stage coarsely predicts the content of the missing regions, and this prediction is fed into the second stage for refinement. Although this improves inpainting quality to some extent, the two-stage structure often leads to longer training time and a dependency issue, in which the final inpainting result depends strongly on the output of the first stage. Method We propose a dual-decoder image inpainting method with enhanced semantic consistency. First, a consistency loss is used to reduce the difference between the features of the encoder and those of the corresponding decoder, while a perceptual loss and a style loss are combined to improve the similarity between the repaired image and the real image. These loss functions are defined on high-level deep features, which encourages the network to better capture contextual semantic information, thereby producing semantically consistent content and ensuring visual consistency between the repaired and real images. Second, we design a single-encoder network with a dual decoder consisting of a simple path and a reconstructive path, which reduces training cost and removes the dependence of the inpainting result on the first stage that exists in two-stage structures. The simple path roughly predicts the content of the missing regions, the reconstructive path generates higher-quality results, and the two paths are regularized through weight sharing. The dual-decoder structure allows the two inpainting paths to run independently and simultaneously, eliminating both the dependency problem of the two-stage structure and the associated training cost. Finally, we adopt a U-Net structure and introduce skip connections between the encoder and decoder to improve feature extraction and to mitigate the information loss caused by down-sampling. Dilated convolutions are used in the encoder to enlarge the receptive field, and a multi-scale attention module is added in the decoder to enhance feature extraction from distant regions. Result We carried out experiments on three datasets: CelebA, Stanford Cars, and UCF Google Street View. Because images may contain both regular and irregular missing regions, we evaluated on images with central regular holes and with irregular holes. All masks and images are set to a resolution of 256×256 pixels for training and testing; the central regular mask covers 128×128 pixels, and the irregular masks are randomly generated. The qualitative results show that our method is more effective than the other six methods and that the repaired images are more visually consistent with the real images. Furthermore, quantitative comparisons between the proposed method and the other methods are conducted on five metrics: mean squared error (MSE, L2), peak signal-to-noise ratio (PSNR), structural similarity (SSIM), Fréchet inception distance (FID), and inception score (IS). The results indicate that our repaired images are superior both visually and numerically. For example, with regular missing regions on the CelebA dataset, our FID is 12.893, a 39.2% reduction (lower is better) compared with the second-best method; on the UCF Google Street View dataset, our PSNR (higher is better) is 12.64%, 6.77%, and 4.41% higher than that of the compared methods, respectively. We also carry out ablation studies to verify the effectiveness of the proposed dual decoder, as well as of the loss functions, the multi-scale attention module, and the U-Net structure. Our model effectively enhances the visual consistency between the repaired and real images and produces more plausible content for the missing regions. Conclusion We present a novel image inpainting model with multiple optimizations in network structure, training time, and inpainting quality. The proposed method effectively reduces training time by using a dual decoder and simultaneously resolves the dependency issue of two-stage network models. Owing to the consistency loss, perceptual loss, and multi-scale attention module, the repaired images exhibit better visual consistency. The inpainting quality for images with complex structures is still limited and remains to be improved in future work.
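As an illustration only, the following PyTorch sketch shows the single-encoder, dual-decoder layout described above: a simple path produces a coarse prediction while a reconstructive path with U-Net style skip connections produces the refined result, and both paths decode the same encoder features in parallel. Channel sizes, the placement of the dilated convolution, and all module names are assumptions; the weight sharing between the two paths and the multi-scale attention module from the paper are omitted for brevity.

```python
import torch
import torch.nn as nn


def conv_block(cin, cout, dilation=1):
    # 3x3 convolution that preserves spatial size, followed by ReLU.
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=dilation, dilation=dilation),
        nn.ReLU(inplace=True))


class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.e1 = conv_block(4, 64)                  # RGB image + binary mask
        self.e2 = conv_block(64, 128)
        self.e3 = conv_block(128, 256, dilation=2)   # dilated conv enlarges the receptive field
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        f1 = self.e1(x)               # full resolution
        f2 = self.e2(self.pool(f1))   # 1/2 resolution
        f3 = self.e3(self.pool(f2))   # 1/4 resolution
        return f1, f2, f3


class Decoder(nn.Module):
    def __init__(self, use_skips):
        super().__init__()
        self.use_skips = use_skips
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.d2 = conv_block(256 + (128 if use_skips else 0), 128)
        self.d1 = conv_block(128 + (64 if use_skips else 0), 64)
        self.out = nn.Conv2d(64, 3, 1)

    def forward(self, f1, f2, f3):
        x = self.up(f3)
        if self.use_skips:
            x = torch.cat([x, f2], dim=1)   # skip connection from the encoder
        x = self.d2(x)
        x = self.up(x)
        if self.use_skips:
            x = torch.cat([x, f1], dim=1)
        x = self.d1(x)
        return torch.tanh(self.out(x))


class DualDecoderGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()
        self.simple_path = Decoder(use_skips=False)          # coarse prediction
        self.reconstructive_path = Decoder(use_skips=True)   # refined prediction

    def forward(self, image, mask):
        f1, f2, f3 = self.encoder(torch.cat([image, mask], dim=1))
        return self.simple_path(f1, f2, f3), self.reconstructive_path(f1, f2, f3)
```

Calling DualDecoderGenerator()(image, mask) on a 256×256 image and its single-channel binary mask returns the coarse and refined predictions from the two paths, which run in parallel rather than sequentially as in a two-stage model.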
Keywords: image inpainting  semantic consistency  dual decoder  skip connection  multi-scale attention module