Despite the tremendous achievements of deep convolutional neural networks (CNNs) in many computer vision tasks, understanding how they actually work remains a significant challenge. In this paper, we propose a novel two-step understanding method, the Salient Relevance (SR) map, which aims to shed light on how deep CNNs recognize images and learn features from areas therein, referred to as attention areas. Our proposed method starts with a layer-wise relevance propagation (LRP) step, which estimates a pixel-wise relevance map over the input image. Next, we construct a context-aware saliency map, the SR map, from the LRP-generated map; it predicts areas close to the foci of attention instead of the isolated pixels that LRP reveals. In the human visual system, region-level information is more important than pixel-level information for recognition, so our proposed approach closely simulates human recognition. Experimental results on the ILSVRC2012 validation dataset with two well-established deep CNN models, AlexNet and VGG-16, clearly demonstrate that our approach identifies not only key pixels but also the attention areas that contribute to the underlying network's comprehension of the given images. As such, the proposed SR map constitutes a convenient visual interface that unveils the visual attention of the network and reveals which types of objects the model has learned to recognize after training. The source code is available at https://github.com/Hey1Li/Salient-Relevance-Propagation.
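The LRP step above redistributes an output score backward through the network, layer by layer, onto the input pixels. The following is a minimal sketch of the LRP-epsilon rule on a toy two-layer fully connected network, not the authors' implementation; the layer sizes, random weights, and epsilon value are illustrative.

```python
import numpy as np

def lrp_epsilon(weights, activations, relevance, eps=1e-6):
    """Propagate relevance one layer back with the LRP-epsilon rule.

    weights:     (in, out) matrix of a linear layer
    activations: (in,) input activations of that layer
    relevance:   (out,) relevance scores at the layer's output
    """
    z = activations @ weights           # pre-activations, shape (out,)
    z = z + eps * np.sign(z)            # epsilon stabilizer for the division
    s = relevance / z                   # relevance per unit of pre-activation
    return activations * (weights @ s)  # redistribute relevance to the inputs

# Toy two-layer network: 4 inputs -> 3 hidden (ReLU) -> 2 outputs
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
x = rng.normal(size=4)
h = np.maximum(x @ W1, 0.0)
out = h @ W2

R_out = out                           # start from the raw output scores
R_h = lrp_epsilon(W2, h, R_out)       # relevance of hidden units
R_x = lrp_epsilon(W1, x, R_h)         # pixel-wise relevance map over the input
```

Relevance is approximately conserved across layers (the sums of `R_x`, `R_h`, and `R_out` agree up to the epsilon stabilizer), which is the property that makes the resulting map interpretable as a decomposition of the prediction.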
Deep learning and, in particular, convolutional neural networks (CNNs) achieve very good results in several computer vision applications such as security and surveillance, where image and video analysis are required. These networks are quite demanding in terms of computation and memory and are therefore usually deployed on high-performance computing platforms or devices. Running CNNs on embedded platforms or devices with low computational and memory resources requires careful optimization of system architectures and algorithms to obtain very efficient designs. In this context, Field Programmable Gate Arrays (FPGAs) can achieve this efficiency, since the programmable hardware fabric can be tailored to each specific network. In this paper, a very efficient configurable architecture for CNN inference targeting FPGAs of any density is described. The architecture uses fixed-point arithmetic and image batching to reduce computation, memory, and memory-bandwidth requirements without compromising network accuracy. The developed architecture supports the execution of large CNNs on any FPGA device, including those with small on-chip memory and limited logic resources. With the proposed architecture, it is possible to infer an image through AlexNet in 4.3 ms on a ZYNQ7020 and 1.2 ms on a ZYNQ7045.
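The fixed-point arithmetic mentioned above replaces floating-point multiply-accumulates with integer ones, which map cheaply onto FPGA DSP blocks. A minimal software sketch of the idea follows; the 16-bit word with 8 fractional bits is an illustrative choice, not the format used in the paper.

```python
import numpy as np

def to_fixed(x, frac_bits=8, word_bits=16):
    """Quantize floats to signed fixed-point integers (round-to-nearest, saturating)."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return np.clip(np.round(x * scale), lo, hi).astype(np.int32)

def from_fixed(q, frac_bits=8):
    """Convert fixed-point integers back to floats."""
    return q.astype(np.float64) / (1 << frac_bits)

def fixed_dot(qa, qb, frac_bits=8):
    """Integer dot product: each product carries 2*frac_bits fractional bits,
    so the accumulator is shifted right by frac_bits at the end."""
    acc = np.sum(qa.astype(np.int64) * qb.astype(np.int64))
    return int(acc >> frac_bits)

w = np.array([0.5, -0.25, 0.125])   # values exactly representable at 8 fractional bits
x = np.array([1.0, 2.0, -1.0])
qw, qx = to_fixed(w), to_fixed(x)
y_fixed = from_fixed(np.array([fixed_dot(qw, qx)]))[0]
# float reference: 0.5*1 + (-0.25)*2 + 0.125*(-1) = -0.125
```

For weights and activations that fit the chosen range, the quantization error is bounded by half a least-significant bit per value, which is why modest word widths can preserve network accuracy.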
Methods based on convolutional neural networks have achieved excellent performance on the image dehazing task. Unfortunately, most existing dehazing methods suffer from loss of detail in the convolution and activation operations and fail to consider the effects of superimposed haze of different intensities, such as under-exposed and over-exposed images. To address the detail-loss problem, we propose a dynamic dehazing convolution (DDC) based on attentional weight calculation and dynamic weight fusion, and a dynamic dehazing activation (DDA) based on a global context encoding function of the input. Building on DDC and DDA, we propose a multi-scaled feature-fused image dehazing network (MFID-Net) to address the effects of haze superposition. We also design a loss function based on the physical model with dynamic weights. Extensive experimental results demonstrate that the proposed MFID-Net performs favorably against state-of-the-art algorithms on hazy datasets, improves further on hazy images with large differences in haze concentration, and produces satisfactory dehazing results. The code is available at https://github.com/awhitewhale/MFID-Net.
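The "attentional weight calculation and dynamic weight fusion" idea can be illustrated generically: a global context vector produces softmax attention over several candidate kernels, which are fused into one input-dependent kernel before convolving. This is a minimal sketch of generic dynamic convolution, not the MFID-Net code; all shapes and the 1x1 kernel restriction are assumptions for brevity.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def dynamic_conv1x1(x, kernels, attn_proj):
    """Fuse K candidate 1x1 kernels with input-dependent attention weights.

    x:         (C, H, W) feature map
    kernels:   (K, C_out, C) candidate 1x1 convolution kernels
    attn_proj: (K, C) projection producing one attention logit per candidate
    """
    context = x.mean(axis=(1, 2))                   # global average pooling -> (C,)
    weights = softmax(attn_proj @ context)          # (K,) attention over kernels
    fused = np.tensordot(weights, kernels, axes=1)  # weighted kernel fusion -> (C_out, C)
    # A 1x1 convolution is a per-pixel matrix multiply over channels
    return np.einsum('oc,chw->ohw', fused, x)

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8, 8))
kernels = rng.normal(size=(3, 6, 4))
attn_proj = rng.normal(size=(3, 4))
y = dynamic_conv1x1(x, kernels, attn_proj)
```

Because the attention weights depend on the input's global statistics, two images with different haze intensities are effectively processed by different convolution kernels, which is the motivation for dynamic designs of this kind.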
Recently, cellular neural networks (CNNs) have been demonstrated to be a highly effective paradigm applicable to a wide range of areas. Typically, CNNs can be implemented with VLSI circuits, but this unavoidably requires additional hardware. Alternatively, CNNs can be implemented purely in software; this, however, yields very low performance for large CNN problem sizes. Nowadays, conventional desktop computers are usually equipped with programmable graphics processing units (GPUs) that support parallel data processing. This paper introduces a GPU-based CNN simulator. In detail, we carefully organize the CNN data as 4-channel textures and efficiently implement the CNN computation as fragment programs running in parallel on the GPU. In this way, we obtain a high-performance but low-cost CNN simulator. Experimentally, we demonstrate that the resulting GPU-based CNN simulator runs 8–17 times faster than a CPU-based one.
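The per-cell update of a cellular neural network depends only on a fixed local neighborhood, which is why it maps so naturally onto per-fragment GPU programs. The sketch below is a CPU-side reference of one discretized update step, not the paper's simulator; the classic 3x3 edge-detection templates (center-2 feedback, 8/-1 control, bias -1) are a standard textbook example, and the periodic boundary handling is a simplification.

```python
import numpy as np

def cnn_step(x, u, A, B, z, dt=0.1):
    """One Euler step of cellular-network dynamics:
    x' = x + dt * (-x + A*y + B*u + z), with output y = clamp(x, -1, 1).
    A is the 3x3 feedback template, B the 3x3 control template.
    Every cell updates independently from its neighborhood, which is
    exactly the data-parallel pattern a fragment program exploits."""
    y = np.clip(x, -1.0, 1.0)
    conv = lambda img, t: sum(
        t[i, j] * np.roll(np.roll(img, 1 - i, axis=0), 1 - j, axis=1)
        for i in range(3) for j in range(3))  # periodic boundary for brevity
    dx = -x + conv(y, A) + conv(u, B) + z
    return x + dt * dx

# Standard edge-detection templates from the cellular-CNN literature
A = np.zeros((3, 3)); A[1, 1] = 2.0
B = -np.ones((3, 3)); B[1, 1] = 8.0
u = np.zeros((16, 16)); u[4:12, 4:12] = 1.0   # white square on black input
x = np.zeros((16, 16))                        # start from a zero state
for _ in range(50):
    x = cnn_step(x, u, A, B, z=-1.0)
# The state converges so that cells on the square's edge saturate to +1
# and all other cells saturate to -1.
```

Each cell's update reads only a 3x3 neighborhood and writes one value, so packing four cell states into one RGBA texel and evaluating this step in a fragment shader gives the texture-based parallelization the paper describes.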
Convolutional neural networks (CNNs) have shown tremendous progress and performance in recent years. Since their emergence, CNNs have exhibited excellent performance in most classification and segmentation tasks, and the CNN family now includes various architectures that dominate major vision-based recognition tasks. However, building a neural network (NN) by simply stacking convolution blocks inevitably limits its optimization ability and introduces overfitting and vanishing-gradient problems. A key reason for these issues is network singularities, which cause degenerate manifolds in the loss landscape; this leads to slow learning and lower performance. In this scenario, skip connections have turned out to be an essential unit of CNN design for mitigating network singularities. The idea of this research is to introduce skip connections into NN architectures to augment information flow, mitigate singularities, and improve performance. We experimented with different levels of skip connections and propose a placement strategy for these links applicable to any CNN. To test the proposed hypothesis, we designed an experimental CNN architecture, named Shallow Wide ResNet or SRNet, as it uses a wide residual network as its base design. We performed numerous experiments to assess the validity of the proposed idea, using two well-known datasets, CIFAR-10 and CIFAR-100, for training and testing. The final empirical results show many promising outcomes in terms of performance, efficiency, and the reduction of network-singularity issues.
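The core mechanism of a skip connection can be stated in a few lines: the block computes a residual and adds the unchanged input back, so the identity path carries information (and gradients) even when the learned transformation contributes little. The following is a minimal dense-layer sketch of a residual block, not the SRNet architecture; the shapes and zero-weight demonstration are illustrative.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x, W1, W2):
    """Two linear+ReLU layers with an identity skip connection:
    out = ReLU(x + W2 @ ReLU(W1 @ x)).
    The identity path keeps information flowing even when W1 and W2
    contribute nothing, which is how skip links mitigate vanishing
    gradients and the degenerate (singular) configurations in the text."""
    return relu(x + W2 @ relu(W1 @ x))

# With all-zero weights the block degenerates to the identity
# (for non-negative inputs), rather than destroying the signal:
x = np.array([0.5, 1.0, 2.0])
Z = np.zeros((3, 3))
out = residual_block(x, Z, Z)
```

A plain stacked block with the same zero weights would output all zeros; the skip connection is what guarantees the block can always fall back to passing its input through, which is the property the placement strategy above exploits.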
As a special group, visually impaired people (VIP) find it difficult to access and use visual information in the same way as sighted individuals. In recent years, benefiting from the development of computer hardware and deep learning techniques, significant progress has been made in assisting VIP with visual perception. However, most existing datasets are annotated in a single scenario and lack sufficient annotations of diverse obstacles to meet the realistic needs of VIP. To address this issue, we propose a new dataset called Walk On The Road (WOTR), which has nearly 190 K objects, with approximately 13.6 objects per image. Specifically, WOTR contains 15 categories of common obstacles and 5 categories of road-judging objects, covering multiple scenarios of walking on sidewalks, tactile pavings, crossings, and other locations. Additionally, we provide a series of baselines by training several advanced object detectors on WOTR. Furthermore, we propose a simple but effective PC-YOLO that obtains excellent detection results on the WOTR and PASCAL VOC datasets. The WOTR dataset is available at https://github.com/kxzr/WOTR.
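The detection baselines above are scored by matching predicted boxes to ground truth via intersection-over-union (IoU). As a generic illustration (not the PC-YOLO or WOTR evaluation code), a minimal IoU computation with the usual PASCAL VOC true-positive threshold looks like this:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Under the PASCAL VOC criterion, a prediction counts as a true positive
# when its IoU with an unmatched ground-truth box is at least 0.5.
is_true_positive = iou((0, 0, 2, 2), (1, 1, 3, 3)) >= 0.5
```

Per-class average precision is then accumulated over these matches, which is how the baseline results on WOTR and PASCAL VOC are typically reported.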