Similar Documents (20 results)
1.

Rare-class objects in natural scene images, which are usually small and infrequent, often convey more important information for scene understanding than common ones. However, they are often overlooked in scene labeling studies for two main reasons: low occurrence frequency and limited spatial coverage. Many methods have been proposed to enhance overall semantic labeling performance, but only a few consider rare-class objects. In this work, we present a deep semantic labeling framework that gives special consideration to rare classes via three techniques. First, a novel dual-resolution coarse-to-fine superpixel representation is developed, where fine and coarse superpixels are applied to rare classes and background areas, respectively. This dual representation allows seamless incorporation of shape features into integrated global and local convolutional neural network (CNN) models. Second, shape information is directly involved during CNN feature learning for both frequent and rare classes from the re-balanced training data, and is also explicitly involved in data inference. Third, the proposed framework incorporates both shape information and the CNN architecture into semantic labeling through a fusion of probabilistic multi-class likelihoods. Experimental results demonstrate competitive semantic labeling performance on two standard datasets, both qualitatively and quantitatively, especially for rare-class objects.
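The fusion step in the third technique lends itself to a short illustration. Below is a minimal sketch of a log-linear fusion of per-superpixel class likelihoods from a global CNN, a local CNN, and a shape model; the weights, the epsilon smoothing, and the function name are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def fuse_class_likelihoods(p_global, p_local, p_shape, weights=(1.0, 1.0, 0.5)):
    """Log-linear fusion of per-superpixel class likelihoods.

    p_global, p_local, p_shape: arrays of shape (n_superpixels, n_classes),
    each row a probability distribution from one model.
    Returns the fused distribution and the per-superpixel label.
    """
    w_g, w_l, w_s = weights
    log_p = (w_g * np.log(p_global + 1e-12)
             + w_l * np.log(p_local + 1e-12)
             + w_s * np.log(p_shape + 1e-12))
    fused = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    fused /= fused.sum(axis=1, keepdims=True)
    return fused, fused.argmax(axis=1)
```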


2.
Robust camera pose and scene structure analysis for service robotics
Successful path planning and object manipulation in service robotics applications rely both on a good estimate of the robot’s position and orientation (pose) in the environment and on a reliable understanding of the visualized scene. In this paper, a robust real-time camera pose and scene structure estimation system is proposed. First, the pose of the camera is estimated through the analysis of so-called tracks. The tracks include key features from the imaged scene and geometric constraints which are used to solve the pose estimation problem. Second, based on the calculated pose of the camera, i.e. the robot, the scene is analyzed via a robust depth segmentation and object classification approach. In order to segment object depth reliably, a feedback control technique at the image-processing level is used to improve the robustness of the robotic vision system with respect to external influences such as cluttered scenes and variable illumination conditions. The control strategy detailed in this paper is based on the traditional open-loop mathematical model of the depth estimation process. In order to control a robotic system, the obtained visual information is classified into objects of interest and obstacles. The proposed scene analysis architecture is evaluated through experimental results within a robotic collision avoidance system.
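The abstract does not spell out the pose solver, but one common way to recover camera pose from tracked keypoints with known scene geometry is a RANSAC-based PnP solver. The sketch below uses OpenCV's solvePnPRansac as a stand-in; it is not the paper's exact pipeline.

```python
import cv2
import numpy as np

def estimate_pose(points_3d, points_2d, K):
    """Estimate camera pose from 2D-3D correspondences.

    points_3d: (N, 3) scene points from the tracks' geometric constraints.
    points_2d: (N, 2) their observations in the current frame.
    K: (3, 3) camera intrinsic matrix.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float32),
        points_2d.astype(np.float32),
        K, distCoeffs=None)
    if not ok:
        raise RuntimeError("pose estimation failed")
    R, _ = cv2.Rodrigues(rvec)   # rotation matrix of the camera
    return R, tvec               # camera pose with respect to the scene
```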

3.
An image caption generation model that fuses image scene and object prior knowledge
Objective: Current image captioning methods based on deep convolutional neural networks (CNN) and long short-term memory (LSTM) networks generally use object-category information as prior knowledge to extract CNN image features, while ignoring scene prior knowledge. As a result, the generated sentences lack an accurate description of the scene and tend to misjudge, for example, the spatial relations between objects. To address this problem, we design an image caption generation model that fuses scene and object-category prior information (F-SOCPK). Both kinds of prior are incorporated into the model so that they jointly generate the caption and improve sentence quality. Method: First, the parameters of the CNN-S model are trained on the large-scale scene-category dataset Place205 so that CNN-S captures richer scene prior information; these parameters are then transferred to CNNd-S via transfer learning to capture the scene information of the image to be described. In parallel, the parameters of the CNN-O model are trained on the large-scale object-category dataset ImageNet and transferred to CNNd-O to capture object information. After the scene and object information is extracted, it is fed into the language models LM-S and LM-O, respectively; the outputs of LM-S and LM-O are passed through a Softmax transform to obtain a probability score for each word in the vocabulary. Finally, a weighted fusion computes the final score of each word, and the word with the highest probability is taken as the output at the current time step, producing the caption sentence. Results: Experiments were conducted on three public datasets: MSCOCO, Flickr30k and Flickr8k. The proposed model exceeds the model that uses object-category information alone on multiple metrics, including BLEU (sentence fluency and precision), METEOR (word-level precision and recall) and CIDEr (semantic richness). On Flickr8k in particular, it improves CIDEr by 9% over the Object-based model (object category only) and by nearly 11% over the Scene-based model (scene category only). Conclusion: The proposed method is effective, with a large performance gain over the baseline, and compares favorably with other mainstream methods, especially on larger datasets such as MSCOCO; on smaller datasets such as Flickr8k, performance still has room for improvement. In future work, we will incorporate more visual priors, such as action categories and object-object relations, to further improve caption quality, and combine more vision techniques, such as deeper CNN models, object detection and scene understanding, to further improve sentence accuracy.
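The final weighted fusion of the two language-model branches is easy to sketch. Below, the softmax outputs of LM-S and LM-O are combined with a single scene-branch weight alpha; the value of alpha and the function name are assumptions (in practice the weight would be tuned on a validation set).

```python
import numpy as np

def fuse_word_scores(logits_scene, logits_object, alpha=0.5):
    """Weighted fusion of the scene- and object-branch word distributions.

    logits_scene / logits_object: (vocab_size,) scores from LM-S and LM-O
    at the current time step. alpha is the assumed scene-branch weight.
    """
    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    p = alpha * softmax(logits_scene) + (1.0 - alpha) * softmax(logits_object)
    return int(np.argmax(p)), p   # next word id and fused distribution
```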

4.
Objective: In dynamic scenes, static noise such as stationary objects and background texture, together with dynamic noise such as background motion and camera shake, easily causes false or missed detections of moving objects. To address this problem, this paper proposes a moving-object detection method based on a motion saliency probability map. Method: The method first constructs, on the temporal scale, time-series groups containing both short-term and long-term motion information, and then computes saliency values with the temporal Fourier transform (TFT), yielding conditional motion saliency probability maps. Guided by the law of total probability, the motion saliency probability map is then obtained and used to determine foreground candidate pixels, enhancing the saliency of moving objects while suppressing that of the background. Finally, on this basis, the spatial information of the pixels is modeled to detect the moving objects. Results: The proposed method was compared with nine moving-object detection methods in three typical dynamic scenes: scenes with static noise, scenes with dynamic noise, and scenes with both static and dynamic noise. In the static-noise scenes, the F-score rises to 92.91%, precision to 96.47%, and the false positive rate drops to 0.02%. In the dynamic-noise scenes, the F-score rises to 95.52%, precision to 95.15%, and the false positive rate drops to 0.002%. In these two scenes, recall does not achieve the best performance because, although the method envelops the object region well, it sometimes misclassifies parts of the object region as background, especially when the object region is small. However, the misclassification rate stays low and recall remains high, which fully meets the needs of practical applications and does not offset the significant overall improvement. In the scenes with both static and dynamic noise, all four metrics achieve the best performance. The method therefore effectively removes interference from static objects, suppresses dynamic noise such as background motion and camera shake, and accurately detects moving objects in video sequences. Conclusion: The proposed method better suppresses static background noise and the dynamic noise caused by background changes (rippling water, camera shake, etc.), accurately detects moving objects against complex noisy backgrounds, and improves the robustness and generality of moving-object detection.
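As a rough illustration of Fourier-based temporal saliency, the sketch below applies a phase-only reconstruction along the time axis of a frame group; treating the temporal Fourier transform this way is an assumption made for illustration, not the paper's exact TFT formulation.

```python
import numpy as np

def temporal_saliency(stack):
    """Phase-only Fourier saliency along the time axis.

    stack: (T, H, W) grayscale frame group (a short- or long-term time series).
    Returns an (H, W) saliency map for the most recent frame.
    """
    F = np.fft.fft(stack.astype(np.float64), axis=0)
    phase_only = F / (np.abs(F) + 1e-12)       # keep phase, flatten magnitude
    recon = np.fft.ifft(phase_only, axis=0).real
    sal = recon[-1] ** 2                        # energy at the latest frame
    return sal / (sal.max() + 1e-12)
```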

5.
Automatic scene understanding from multimodal data is a key task in the design of fully autonomous vehicles. The theory of belief functions has proved effective for fusing information from several sensors at the superpixel level. Here, we propose a novel framework, called evidential grammars, which extends stochastic grammars by replacing probabilities with belief functions. This framework allows us to fuse local information with prior and contextual information, also modeled as belief functions. The use of belief functions in a compositional model is shown to allow for a better representation of the uncertainty in the priors and for greater flexibility of the model. The relevance of our approach is demonstrated on multi-modal traffic scene data from the KITTI benchmark suite.
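The grammar machinery is beyond a few lines, but the elementary fusion step in belief-function theory is Dempster's rule of combination, which the sketch below implements; the frame of discernment {road, car, sky} and the mass values in the usage example are invented for illustration.

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    m1, m2: dicts mapping frozenset focal elements to masses summing to 1.
    Returns the combined mass function (conflict mass renormalized away).
    """
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("total conflict: sources are incompatible")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Example: fusing a local classifier's output with contextual prior evidence
# over the invented frame {road, car, sky} for one superpixel.
road, car = frozenset({"road"}), frozenset({"car"})
theta = frozenset({"road", "car", "sky"})
local = {road: 0.6, car: 0.2, theta: 0.2}
prior = {road | car: 0.7, theta: 0.3}
print(dempster_combine(local, prior))
```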

6.
Advanced Robotics, 2012, 26(17): 1995-2020
Abstract

In this paper, we propose a robot that acquires multimodal information, i.e. visual, auditory, and haptic information, fully autonomously using its embodiment. We also propose batch and online algorithms for multimodal categorization based on the acquired multimodal information and partial words given by human users. To obtain multimodal information, the robot detects an object on a flat surface. Then, the robot grasps and shakes it to obtain haptic and auditory information. For obtaining visual information, the robot uses a small hand-held observation table with an XBee wireless controller to control the viewpoints for observing the object. In this paper, for multimodal concept formation, multimodal latent Dirichlet allocation using Gibbs sampling is extended to an online version. This framework makes it possible for the robot to learn object concepts naturally in everyday operation in conjunction with a small amount of linguistic information from human users. The proposed algorithms are implemented on a real robot and tested using real everyday objects to show the validity of the proposed system.

7.
A multimodal interaction system for children
李杰, 田丰, 王维信, 戴国忠. Journal of Software (软件学报), 2002, 13(9): 1846-1851
A pen- and speech-based multimodal 3D interaction system for children is designed and implemented. The system contains a pen-and-speech interaction information integration framework that fuses the pen and speech input produced by children. It also defines a set of pen-and-speech interaction techniques that allow children to interact with the system in a natural way: the pen is used to sketch 3D scenes and entities such as small animals, while pen and speech together are used to interact with the scene and the entities in it.

8.
This paper presents a novel content‐based method for transferring the colour patterns between images. Unlike previous methods that rely on image colour statistics, our method puts an emphasis on high‐level scene content analysis. We first automatically extract the foreground subject areas and background scene layout from the scene. The semantic correspondences of the regions between source and target images are established. In the second step, the source image is re‐coloured in a novel optimization framework, which incorporates the extracted content information and the spatial distributions of the target colour styles. A new progressive transfer scheme is proposed to integrate the advantages of both global and local transfer algorithms, as well as avoid the over‐segmentation artefact in the result. Experiments show that with a better understanding of the scene contents, our method well preserves the spatial layout, the colour distribution and the visual coherence in the transfer process. As an interesting extension, our method can also be used to re‐colour video clips with spatially‐varied colour effects.

9.
We propose a novel framework called transient imaging for image formation and scene understanding through impulse illumination and time images. Using time-of-flight cameras and multi-path analysis of global light transport, we pioneer new algorithms and systems for scene understanding through time images. We demonstrate that our proposed transient imaging framework allows us to accomplish tasks that are well beyond the reach of existing imaging technology. For example, one can infer the geometry of not only the visible but also the hidden parts of a scene, enabling us to look around corners. Traditional cameras estimate intensity per pixel I(x,y). Our transient imaging camera captures a 3D time-image I(x,y,t) for each pixel and uses an ultra-short pulse laser for illumination. Emerging technologies are supporting cameras with a temporal-profile per pixel at picosecond resolution, allowing us to capture an ultra-high speed time-image. This time-image contains the time profile of irradiance incident at a sensor pixel. We experimentally corroborated our theory with free space hardware experiments using a femtosecond laser and a picosecond accurate sensing device. The ability to infer the structure of hidden scene elements, unobservable by both the camera and illumination source, will create a range of new computer vision opportunities.

10.
This research explores the interaction of textual and photographic information in image understanding. Specifically, it presents a computational model whereby textual captions are used as collateral information in the interpretation of the corresponding photographs. The final understanding of the picture and caption reflects a consolidation of the information obtained from each of the two sources and can thus be used in intelligent information retrieval tasks. The problem of building a general-purpose computer vision system without a priori knowledge is very difficult at best. The concept of using collateral information in scene understanding has been explored in systems that use general scene context in the task of object identification. The work described here extends this notion by incorporating picture-specific information. A multi-stage system, PICTION, which uses captions to identify humans in an accompanying photograph, is described. This provides a computationally less expensive alternative to traditional methods of face recognition. A key component of the system is the utilisation of spatial and characteristic constraints (derived from the caption) in labeling face candidates (generated by a face locator). This work was supported in part by ARPA Contract 93-F148900-000. I would like to thank William Rapaport for serving as my advisor in my doctoral work; Venu Govindaraju for his work on the face locator; and more recently, Rajiv Chopra, Debra Burhans and Toshio Morita for their work on the new implementation of PICTION as well as for valuable feedback.
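The constraint-satisfaction flavor of the labeling step can be illustrated compactly: caption-derived spatial constraints prune the possible assignments of names to face candidates. The names, boxes and the brute-force search below are hypothetical and only hint at how such constraints could be applied; they are not PICTION's actual algorithm.

```python
from itertools import permutations

def label_faces(face_boxes, names, constraints):
    """Brute-force assignment of caption names to face candidates.

    face_boxes: list of (x, y, w, h) candidates from a face locator.
    names: list of names mentioned in the caption.
    constraints: callables (assignment -> bool) derived from the caption,
    e.g. "Alice is to the left of Bob".
    """
    for perm in permutations(range(len(face_boxes)), len(names)):
        assignment = {name: face_boxes[i] for name, i in zip(names, perm)}
        if all(c(assignment) for c in constraints):
            return assignment
    return None

# Hypothetical example: two candidates, caption says Alice stands left of Bob.
boxes = [(40, 60, 50, 50), (200, 55, 48, 48)]
left_of = lambda a: a["Alice"][0] < a["Bob"][0]
print(label_faces(boxes, ["Alice", "Bob"], [left_of]))
```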

11.
Liu Zihe, Hou Weiying, Zhang Jiayi, Cao Chenyu, Wu Bin. Multimedia Tools and Applications, 2022, 81(4): 4909-4934

Automatically interpreting social relations, e.g., friendship, kinship, etc., from visual scenes has huge potential application value in areas such as knowledge graph construction, person behavior and emotion analysis, and entertainment ecology. Great progress has been made in social analysis based on structured data. However, existing video-based methods treat social relationship extraction as a general classification task and categorize videos into only predefined types. Such methods are unable to recognize multiple relations in multi-person videos, which is clearly inconsistent with actual application scenarios. At the same time, videos are inherently multimodal; subtitles in the video also provide abundant cues for relationship recognition that are often ignored by researchers. In this paper, we introduce and define a new task named “Multiple-Relation Extraction in Videos (MREV)”. To solve the MREV task, we propose the Visual-Textual Fusion (VTF) framework for jointly modeling visual and textual information. For the spatial representation, we not only adopt a SlowFast network to learn global action and scene information, but also exploit the unique cues of face, body and dialogue between characters. For the temporal domain, we propose a Temporal Feature Aggregation module to perform temporal reasoning, which adaptively assesses the quality of different frames. We then use a Multi-Conv Attention module to capture the inter-modal correlation and map the features of the different modalities into a coordinated feature space. By these means, our VTF framework comprehensively exploits abundant multimodal cues for the MREV task and achieves 49.2% and 50.4% average accuracy on a self-constructed Video Multiple-Relation (VMR) dataset and the ViSR dataset, respectively. Extensive experiments on the VMR and ViSR datasets demonstrate the effectiveness of the proposed framework.
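A temporal aggregation step that adaptively weights frames can be sketched as attention-style pooling over per-frame features; the norm-based quality score below is a stand-in assumption for whatever learned scorer the VTF framework actually uses.

```python
import numpy as np

def temporal_aggregate(frame_feats):
    """Quality-weighted pooling of per-frame features.

    frame_feats: (T, D) features for T frames. A learned scorer would
    normally produce the quality logits; a norm-based proxy stands in here.
    """
    logits = np.linalg.norm(frame_feats, axis=1)    # stand-in quality score
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return (w[:, None] * frame_feats).sum(axis=0)   # (D,) clip-level feature
```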


12.
To segment video scenes quickly and effectively, this paper proposes a multimodal video scene segmentation algorithm based on shot competitiveness. The algorithm fully exploits the temporal co-occurrence relations among the multiple modalities in a video, computes inter-shot similarity by extracting and fusing the physical features of the video, and segments scenes by applying the shot-competitiveness decision criterion. Experimental results show that the algorithm performs video scene segmentation efficiently, reaching a recall of 82.1% and a precision of 86.7%.
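One plausible reading of the inter-shot similarity computation is a weighted fusion of per-modality similarities; the modality names and weights in the sketch below are illustrative assumptions, not the paper's exact feature set.

```python
import numpy as np

def shot_similarity(feat_a, feat_b, weights):
    """Fused similarity between two shots.

    feat_a, feat_b: dicts of per-modality feature vectors, e.g. keys
    'color', 'motion', 'audio' (assumed modalities).
    weights: per-modality fusion weights summing to 1 (assumed values).
    """
    def cos(u, v):
        return float(np.dot(u, v) /
                     (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
    return sum(w * cos(feat_a[m], feat_b[m]) for m, w in weights.items())
```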

13.
R. Englert. Computing, 1999, 62(4): 369-385
Nearly all three-dimensional reconstruction methods lack proper model knowledge that reflects the scene. Model knowledge is required in order to reduce the ambiguities which occur during the reconstruction process. It must encompass the scene and is therefore complex and, in addition, difficult to acquire. In this paper we present an approach for learning complex model knowledge. A (large) sample set of three-dimensionally acquired buildings, represented as graphs, is generalized by the use of background knowledge. The background knowledge entails domain-specific knowledge and is utilized to guide the search during the generalization process of EXRES. The generalization result is a distribution of relevant patterns which reduces the ambiguities occurring in 3D object reconstruction (here: buildings). Three different applications for the 3D reconstruction of buildings from aerial images are carried out, in which binary relations of so-called building atoms, namely tertiary nodes and faces, as well as building models, are learned. These applications are evaluated based on (a) the estimated empirical generalization error and (b) the use of information coding theory and statistics, by comparing the learned knowledge with otherwise unavailable a priori knowledge.

14.
In order for an agent to achieve its objectives, make sound decisions, and communicate and collaborate with others effectively, it must have high-quality representations. Representations can encapsulate objects, situations, experiences, decisions and behavior, to name just a few. Our interest is in designing high-quality representations, so it makes sense to ask of any representation: what does it represent, why is it represented, how is it represented, and, importantly, how well is it represented. This paper identifies the need to develop a better understanding of the grounding process as key to answering these important questions. The lack of a comprehensive understanding of grounding is a major obstacle in the quest to develop genuinely intelligent systems that can make their own representations as they seek to achieve their objectives. We develop an innovative framework which provides a powerful tool for describing, dissecting and inspecting grounding capabilities, with the necessary flexibility to conduct meaningful and insightful analysis and evaluation. The framework is based on a set of clearly articulated principles and has three main applications. First, it can be used at both theoretical and practical levels to analyze the grounding capabilities of a single system and to evaluate its performance. Second, it can be used to conduct comparative analysis and evaluation of grounding capabilities across a set of systems. Third, it offers a practical guide to assist the design and construction of high-performance systems with effective grounding capabilities.

15.
Objective: Infrared and visible image fusion tends to lose edge and detail information and to produce halo artifacts in the fused result. To address these problems and to fully capture the important features of the multi-source images, this paper combines anisotropic guided filtering with phase congruency and proposes an infrared-visible image fusion algorithm. Method: First, anisotropic guided filtering is applied to the source images to obtain a base layer containing large-scale variations and a series of detail layers containing small-scale details. Second, saliency maps are computed with phase congruency and Gaussian filtering; initial binary weight maps are then obtained by comparing pixel saliency, and the weight maps are refined with anisotropic guided filtering to remove noise and suppress halo artifacts. Finally, the fused result is obtained by image reconstruction. Results: The proposed method was compared, both subjectively and objectively, with four classical infrared-visible fusion methods, namely convolutional neural network (CNN), dual-tree complex wavelet transform (DTCWT), guided filtering (GFF) and anisotropic diffusion (ADF), on the public TNO dataset. Subjectively, the results of the proposed algorithm are superior to the other four methods in edge detail, background preservation and target completeness. Objectively, four image quality indices were used for a comprehensive evaluation: mutual information (MI), degree of edge information preservation (QAB/F), entropy (EN) and gradient-based feature mutual information (FMI_gradient). Compared with the other four methods, all indices improve to some extent: average MI is 21.67% higher than GFF, average QAB/F is 20.21% higher than CNN, average EN is 5.69% higher than CNN, and average FMI_gradient is 3.14% higher than GFF. Conclusion: The proposed fusion algorithm based on anisotropic guided filtering resolves the detail "halo" problem of the original guided filter and effectively suppresses artifacts in the fused result. Thanks to its scale-awareness, it better preserves the edge details and background information of the source images and improves the accuracy of the fusion result.
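The two-scale decomposition and weighted reconstruction can be sketched in a few lines. In the sketch below a plain Gaussian filter stands in for the anisotropic guided filter and a difference-of-Gaussians magnitude stands in for the phase congruency saliency; both substitutions are simplifying assumptions, not the paper's method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fuse_pair(ir, vis, sigma=5.0):
    """Two-scale fusion sketch: base/detail split plus saliency-based weights.

    ir, vis: (H, W) float grayscale infrared and visible images.
    """
    base_ir, base_vis = gaussian_filter(ir, sigma), gaussian_filter(vis, sigma)
    det_ir, det_vis = ir - base_ir, vis - base_vis

    # Difference-of-Gaussians magnitude as a stand-in saliency measure.
    sal_ir = np.abs(gaussian_filter(ir, 1.0) - gaussian_filter(ir, 3.0))
    sal_vis = np.abs(gaussian_filter(vis, 1.0) - gaussian_filter(vis, 3.0))
    w = (sal_ir >= sal_vis).astype(np.float64)   # initial binary weight map
    w = gaussian_filter(w, sigma)                # smoothing in lieu of guided refinement

    base = 0.5 * (base_ir + base_vis)            # average the base layers
    detail = w * det_ir + (1.0 - w) * det_vis    # weighted detail layers
    return base + detail
```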

16.
Objective: Existing visual question answering (VQA) methods usually attend only to the visual objects in an image and overlook its key textual content, which limits the depth and precision of image understanding. Given the importance of the text embedded in images for understanding them, researchers proposed the "scene text VQA" task to quantify a model's ability to understand scene text, and built the corresponding benchmark datasets TextVQA (text visual question answering) and ST-VQA (scene text visual question answering). This paper focuses on the scene text VQA task. To address the performance bottleneck of existing self-attention based methods caused by the risk of overfitting, we propose a scene text VQA method based on a multimodal Transformer fused with knowledge representations, which effectively improves robustness and accuracy. Method: We improve the existing baseline model M4C (multimodal multi-copy mesh) by modeling two complementary kinds of prior knowledge: the "spatial relations" between visual objects and the "semantic relations" between text words. On this basis, we design a general knowledge-representation-enhanced attention module that encodes the two kinds of relations in a unified way, yielding the knowledge-representation-enhanced KR-M4C (knowledge-representation-enhanced M4C) method. Results: KR-M4C is compared with the latest methods on the two scene text VQA benchmarks TextVQA and ST-VQA. On TextVQA, compared with the best result among the compared methods, the test-set accuracy improves by 2.4% without extra training data and by 1.1% when ST-VQA is added as training data. On ST-VQA, the average normalized Levenshtein similarity on the test set improves by 5% over the best compared result. Ablation experiments on TextVQA verify the effectiveness of the two kinds of prior knowledge and show that the proposed KR-M4C model improves the accuracy of the predicted answers. Conclusion: The proposed KR-M4C method yields significant gains on both the TextVQA and ST-VQA benchmarks, achieving the best results on this task.
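One natural way to fold such relational prior knowledge into a Transformer is an additive bias on the attention logits. The sketch below shows that pattern for a single head; the additive form is an illustrative assumption rather than the exact KR-M4C module.

```python
import numpy as np

def relation_biased_attention(Q, K, V, rel_bias):
    """Single-head attention with an additive relation bias.

    Q, K, V: (n, d) query/key/value matrices for n tokens (visual objects
    and OCR tokens). rel_bias: (n, n) scores encoding prior knowledge such
    as spatial overlap or word-embedding similarity.
    """
    d = Q.shape[1]
    logits = Q @ K.T / np.sqrt(d) + rel_bias
    logits -= logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ V
```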

17.

The performance of any nature-inspired optimization technique depends heavily on its underlying framework. Since the commonly used backtracking search algorithm (BSA) has a fixed framework, it is not equally suited to problems of all difficulty levels and therefore may not search the entire search space proficiently. To address this limitation, we propose a modified BSA framework, called gQR-BSA, based on quasi-reflection-based initialization, quantum Gaussian mutations, adaptive parameter execution, and quasi-reflection-based jumping to change the coordinate structure of the BSA. In gQR-BSA, a quantum Gaussian mechanism was developed based on the best-population-information mechanism to boost the population distribution information. As population distribution data can represent characteristics of a function landscape, gQR-BSA is able to distinguish the characteristics of the landscape when performing quasi-reflection-based jumping. An automatically managed parameter-control framework is also incorporated into the proposed algorithm. In every iteration, the quasi-reflection-based jumps aim to escape local optima and are adaptively modified, based on knowledge obtained from the offspring, toward the global optimum. Herein, the proposed gQR-BSA was used to solve three sets of well-known benchmark functions, including unimodal, multimodal and multimodal fixed-dimension functions, and three well-known engineering optimization problems. The numerical and experimental results reveal that the algorithm can obtain highly efficient solutions to both benchmark and real-life optimization problems.
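Quasi-reflection-based initialization is simple to sketch: each random individual is paired with a quasi-reflected point drawn uniformly between the search-space centre and the individual, and the fittest half of the combined pool is kept. The function below follows that recipe under stated assumptions (minimization, box bounds); it is a sketch, not gQR-BSA's full initialization.

```python
import numpy as np

def qr_initialize(pop_size, lower, upper, fitness):
    """Quasi-reflection-based initialization sketch.

    lower, upper: (dim,) box bounds of the search space.
    fitness: callable mapping an individual to a scalar (lower is better).
    A quasi-reflected point lies uniformly between the centre and the individual.
    """
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    dim = lower.size
    pop = lower + np.random.rand(pop_size, dim) * (upper - lower)
    centre = (lower + upper) / 2.0
    qr = centre + np.random.rand(pop_size, dim) * (pop - centre)  # quasi-reflected points
    both = np.vstack([pop, qr])
    scores = np.array([fitness(x) for x in both])
    return both[np.argsort(scores)[:pop_size]]   # keep the fittest pop_size individuals
```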


18.

This work introduces a novel approach to extracting meaningful content information from video by the collaborative integration of image understanding and natural language processing. We developed a person browser system that associates faces with overlaid name texts in videos. The approach takes news videos as a knowledge source and automatically extracts faces and associated name texts as content information. The proposed framework consists of a text detection module, a face detection module and a person indexing database module. The successful results of person extraction show that the proposed methodology of integrating image understanding techniques with natural language processing techniques is headed in the right direction toward our goal of accessing the real content of multimedia information.


19.
Video structured description is a technique for extracting and exploiting video content information: using spatio-temporal segmentation, feature extraction, object recognition and similar processing, it organizes video content, according to its semantic relations, into textual information that both computers and humans can understand. This paper presents an indoor scene description system based on this technique; the system describes indoor scenes and stores and distributes the resulting description data. The results show that structured description of video improves application efficiency.

20.