Similar Documents
20 similar documents found (search time: 15 ms)
1.
Bag-of-visual-words (BoW) has recently become a popular representation for describing video and image content. Most existing approaches, nevertheless, neglect inter-word relatedness and measure similarity by bin-to-bin comparison of visual-word histograms. In this paper, we explore the linguistic and ontological aspects of visual words for video analysis. Two approaches, soft-weighting and constraint-based earth mover's distance (CEMD), are proposed to model different aspects of visual-word linguistics and proximity. In soft-weighting, visual words are weighted so that the linguistic meaning of words is taken into account during bin-to-bin histogram comparison. In CEMD, a cross-bin matching algorithm is formulated such that the ground-distance measure considers the linguistic similarity of words. In particular, a BoW ontology that hierarchically specifies the hyponym relationships among words is constructed to assist the reasoning. We demonstrate soft-weighting and CEMD on two tasks: video semantic indexing and near-duplicate keyframe retrieval. Experimental results indicate that soft-weighting is superior to other popular weighting schemes, such as term-frequency (TF) weighting, on a large-scale video database. In addition, CEMD shows excellent performance compared with cosine similarity in near-duplicate retrieval.
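As a rough sketch of the soft-weighting idea above (hard assignment replaced by contributions to the k nearest visual words), the snippet below uses an illustrative 1/2^i decay for the i-th nearest word; this follows a common soft-weighting formulation but is an assumption, not necessarily the paper's exact scheme, and all names and toy data are made up:

```python
import math

def soft_weight_histogram(features, vocabulary, k=2):
    """Build a soft-weighted BoW histogram: each local feature
    contributes to its k nearest visual words, the i-th nearest
    word receiving weight 1 / 2**i (i = 0 for the nearest).
    The decay factor is illustrative, not the paper's formula."""
    hist = [0.0] * len(vocabulary)
    for f in features:
        # Euclidean distance from this feature to every visual word
        dists = sorted(
            (math.dist(f, w), idx) for idx, w in enumerate(vocabulary)
        )
        for i, (_, idx) in enumerate(dists[:k]):
            hist[idx] += 1.0 / 2 ** i
    return hist

# toy 2-D "descriptors" and a 3-word vocabulary
vocab = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
feats = [(0.1, 0.0), (0.9, 0.1)]
h = soft_weight_histogram(feats, vocab, k=2)
```

With hard assignment each feature would vote for a single word; here ambiguous features near a word boundary spread their mass, which is what makes the histogram comparison more tolerant.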

2.
Methods based on bag-of-visual-words (BoW) representations derived from local keypoints have recently shown promise for video annotation. The visual-word weighting scheme has a critical impact on the performance of BoW methods. In this paper, we propose a new visual-word weighting scheme, referred to as emerging-patterns weighting (EP-weighting), which efficiently captures the co-occurrence relationships of visual words and improves the effectiveness of video annotation. The proposed scheme first finds emerging patterns (EPs) of visual keywords in the training dataset, and then performs adaptive weight assignment for each visual word according to the EPs. The adjusted BoW features are used to train classifiers for video annotation. A systematic performance study on a TRECVID corpus covering 20 semantic concepts shows that the proposed scheme is more effective than other popular weighting schemes.
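Mining emerging patterns proper is a full data-mining procedure; as a rough illustration of only the co-occurrence statistics it builds on, the sketch below counts which visual words appear together across training histograms and boosts frequently co-occurring words. The boost formula and all names are hypothetical, not the paper's method:

```python
from itertools import combinations
from collections import Counter

def cooccurrence_boost(histograms, base=1.0, alpha=0.1):
    """Count, over a set of BoW histograms, how often each pair of
    visual words appears together, then derive a per-word weight:
    words that co-occur often are boosted above the base weight.
    (Illustrative surrogate for EP-based weighting.)"""
    pair_counts = Counter()
    n_words = len(histograms[0])
    for h in histograms:
        present = [i for i, v in enumerate(h) if v > 0]
        pair_counts.update(combinations(present, 2))
    weights = [base] * n_words
    for (a, b), c in pair_counts.items():
        weights[a] += alpha * c
        weights[b] += alpha * c
    return weights

# toy histograms: words 0 and 1 co-occur twice, word 2 is isolated
hists = [[2, 1, 0], [1, 3, 0], [0, 0, 4]]
w = cooccurrence_boost(hists)
```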

3.
Recently, image representation based on the bag-of-visual-words (BoW) model has been widely applied in image and vision domains. In BoW, a visual codebook of visual words is defined, usually by clustering local features, so that any novel image can be represented by the occurrences of the visual words it contains. Given a set of images, we argue that the significance of each image is determined by the significance of its contained visual words. Traditionally, the significance of visual words is defined by term frequency-inverse document frequency (tf-idf), which does not necessarily capture the intrinsic visual context. In this paper, we propose a new scheme of latent visual context learning (LVCL). The visual context among images and visual words is formulated from latent semantic context and visual link-graph analysis. With LVCL, the importance of visual words and of images can be distinguished, which facilitates image-level applications such as image re-ranking and canonical image selection. We validate our approach on text-query-based search results returned by Google Image Search. Experimental results demonstrate the effectiveness and potential of LVCL for image re-ranking and canonical image selection, compared with state-of-the-art approaches.
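For reference, the tf-idf baseline that LVCL is compared against can be sketched in a few lines. The tf = count/total and idf = log(N/df) variant used here is one common formulation (libraries differ on smoothing), and the histograms are made-up toy data:

```python
import math

def tfidf(histograms):
    """tf-idf weighting for visual-word histograms:
    tf  = word count / total counts in the image,
    idf = log(N / number of images containing the word).
    Words appearing in every image get idf = 0."""
    n_docs = len(histograms)
    n_words = len(histograms[0])
    # document frequency of each visual word
    df = [sum(1 for h in histograms if h[w] > 0) for w in range(n_words)]
    out = []
    for h in histograms:
        total = sum(h) or 1
        out.append([
            (h[w] / total) * math.log(n_docs / df[w]) if df[w] else 0.0
            for w in range(n_words)
        ])
    return out

# word 0 occurs in both images (idf 0); word 1 only in the second
hists = [[3, 0], [1, 2]]
weighted = tfidf(hists)
```

Note how a word occurring in every image is zeroed out entirely, which is exactly the kind of purely statistical behavior the abstract argues can miss visual context.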

4.
In this paper, we present the results of a project that seeks to transform low-level features into a higher level of meaning. The project concerns latent semantic indexing (LSI), a technique which, in conjunction with normalization and term weighting, has been used for full-text retrieval for many years. In that setting, LSI determines clusters of co-occurring keywords, sometimes called concepts, so that a query using a particular keyword can retrieve documents that do not contain this keyword but do contain other keywords from the same cluster. In this paper, we examine the use of this technique for content-based image retrieval, using two different approaches to image feature representation. We also study the integration of visual features and textual keywords, and the results show that it can significantly improve retrieval performance.
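The cluster behavior described above (a query keyword retrieving a document that lacks the keyword but shares a co-occurrence cluster) can be demonstrated with a toy rank-1 LSI, computed here by power iteration on the term-document matrix; the vocabulary, documents, and iteration count are made up for illustration:

```python
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def top_singular_vector(A, iters=100):
    """Power iteration on A·Aᵀ yields the dominant left singular
    vector of term-document matrix A: the rank-1 'concept'."""
    n, m = len(A), len(A[0])
    B = [[sum(A[i][k] * A[j][k] for k in range(m)) for j in range(n)]
         for i in range(n)]
    v = [1.0] * n
    for _ in range(iters):
        v = matvec(B, v)
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

terms = ["car", "automobile", "flower"]
A = [                # rows = terms, columns = documents
    [1, 0, 0],       # "car":        only in doc 0
    [1, 1, 0],       # "automobile": in docs 0 and 1
    [0, 0, 1],       # "flower":     only in doc 2
]
u = top_singular_vector(A)
# project documents and the query "car" into the 1-D concept space
doc_proj = [sum(u[i] * A[i][j] for i in range(3)) for j in range(3)]
q_proj = sum(u[i] * q for i, q in enumerate([1.0, 0.0, 0.0]))
scores = [q_proj * d for d in doc_proj]
```

Document 1 never mentions "car", yet it scores well because "car" and "automobile" co-occur in document 0 and therefore load on the same concept; the unrelated "flower" document scores essentially zero.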

5.
Improving Image Classification Using Semantic Attributes
The bag-of-words (BoW) model, commonly used for image classification, has two strong limitations: on one hand, visual words lack explicit meaning; on the other hand, they are usually polysemous. This paper proposes to address these two limitations by introducing an intermediate representation based on semantic attributes. Specifically, two different approaches are proposed. Both predict a set of semantic attributes for entire images as well as for local image regions, and use these predictions to build intermediate-level features. Experiments on four challenging image databases (PASCAL VOC 2007, Scene-15, MSRCv2 and SUN-397) show that both approaches significantly improve the performance of the BoW model. Moreover, their combination achieves state-of-the-art results on several of these databases.

6.
The visual-vocabulary representation approach has been successfully applied to many multimedia and vision applications, including visual recognition, image retrieval, and scene modeling/categorization. The idea behind the visual-vocabulary representation is that an image can be represented by visual words, a collection of local features of images. In this work, we develop a new scheme for constructing a visual vocabulary based on the analysis of visual-word content. By considering the content homogeneity of visual words, we design a visual vocabulary containing macro-sense and micro-sense visual words; the two types of visual words are then appropriately combined to describe an image effectively. We also apply the visual vocabulary to build image retrieval and categorization systems. The performance evaluation of the two systems indicates that the proposed visual vocabulary achieves promising results.

7.
8.
Multimodal detection methods are an effective means of filtering adult video, but existing methods lack an accurate semantic representation of audio. This paper therefore proposes an adult-video detection method that fuses audio words with visual features. First, a periodicity-based energy-envelope unit (EE) segmentation algorithm is proposed to accurately segment the audio stream into a sequence of EEs. Second, an audio semantic representation based on EEs and bag-of-words (BoW) is proposed, describing each EE's features as occurrence probabilities of audio words. A composite weighting method then fuses the detection results from audio words and visual features. A periodicity-based adult-video discrimination algorithm is also proposed, working in tandem with the periodicity-based EE segmentation to fully exploit periodicity for detection. Experimental results show that, compared with methods based on visual features alone, the proposed method significantly improves detection performance: at a false-positive rate of 9.76%, the detection rate reaches 94.44%.

9.
A new region filtering and region weighting method is proposed based on region representations; it filters out unnecessary regions from images and learns region importance from the size and spatial location of regions in an image. It weights the regions optimally and improves the performance of a region-based retrieval system built on relevance feedback. Because of the semantic gap between the low-level feature representation and the high-level concept in a query image, semantically relevant images may exhibit very different visual characteristics and may be scattered across several clusters in the feature space. Our main goal is to find semantically related clusters and their weights in order to reduce this semantic gap. Experimental results demonstrate the efficiency and effectiveness of the proposed region filtering and weighting method in comparison with the area-percentage method and the region-frequency method weighted by inverse image frequency, respectively.

10.
Loop-closure detection can eliminate the accumulated error of visual SLAM and is therefore of great importance to SLAM systems. The widely used bag-of-visual-words approach, however, suffers from visual-word identity and ambiguity problems, which degrade loop-closure detection. To mitigate these problems and improve detection performance, a loop-closure detection algorithm based on soft-assigned SIFT (scale-invariant feature transform) features is proposed. The algorithm assigns each SIFT feature point extracted from an image to the several words nearest in Euclidean distance, weights them by distance rank, and discards feature points far from any word, producing a more discriminative descriptor. When screening candidates, constraints on the proportion of feature points sharing the same words and on word-offset stability are added, so that only a small number of candidates are selected. In experiments, compared with the traditional bag-of-visual-words model and several recent loop-closure detection algorithms, the proposed algorithm improves recall at 100% precision on three datasets, with an average image query time of about 40 ms. The results show that the algorithm improves loop-closure detection while remaining real-time.

11.
An Object Recognition Method Based on an Optimized Bag-of-Words Model
To address the shortcomings of existing object recognition methods based on the traditional bag-of-words model, the feature representation, visual dictionary, and image representation are optimized to improve recognition accuracy. HUE histograms and SIFT descriptors are used to describe the color and shape features around interest points, respectively, achieving feature-level and image-level fusion of the two features under the bag-of-words model. The K-means++ clustering algorithm is introduced to generate the visual dictionary, and soft weighting is used to map feature vectors onto visual words to form image histograms. Experimental results show that the proposed method yields high object recognition accuracy and that the recognition results are insensitive to the fusion weights of the two features.

12.

In recent years, the rapid growth of multimedia content has made image retrieval a challenging research task. Content-based image retrieval (CBIR) is a technique that uses image features to find the images a user requires in a large image dataset, given the user's request in the form of a query image. Effective feature representation and similarity measures are crucial to the retrieval performance of CBIR. The key challenge has been attributed to the well-known semantic-gap issue. Machine learning has been actively investigated as a possible way to bridge the semantic gap, and the recent success of deep learning offers hope of doing so in CBIR. In this paper, we investigate deep learning approaches to CBIR tasks under varied settings; from our empirical studies, we draw some encouraging conclusions and insights for future research.


13.
Xu  Xing  Wu  Haiping  Yang  Yang  Shen  Fumin  Xie  Ning  Ji  Yanli 《Multimedia Tools and Applications》2018,77(17):22185-22198

Recent years have witnessed unprecedented efforts in visual representation for enabling various efficient and effective multimedia applications. In this paper, we propose a novel visual representation learning framework that generates efficient semantic hash codes for visual samples by substantially exploring concepts, semantic attributes, and their inter-correlations. Specifically, we construct a conceptual space in which the semantic knowledge of concepts and attributes is embedded. We then develop an effective online feature coding scheme for visual objects that leverages inter-concept relationships through the intermediate representative power of attributes. The coding process is formulated as an overlapping group lasso problem, which can be solved efficiently. Finally, the visual representation may be binarized to generate compact hash codes. Extensive experiments illustrate the superiority of the proposed framework on the visual retrieval task compared with state-of-the-art methods.


14.
Projection-domain drift and biased prediction prevent existing schemes from coping well with the generalized zero-shot learning challenge. Building on the CADA-VAE model, a semi-supervised learning scheme based on modality fusion is proposed, offering an approach to intra-modal self-learning that exploits unlabeled samples and a semantically assisted model. The scheme uses the latent vector space as a bridge for fusing the visual and semantic modalities, and introduces the concepts of visual centroids and heterogeneous semantic latent vectors to guide inter-modal mutual learning; ...

15.
Automatic image annotation is an attractive service for users and administrators of online photo-sharing websites. In this paper, we propose an image annotation approach that exploits cross-modal saliency correlation, covering both visual and textual saliency. For textual saliency, a concept graph is first established based on associations between labels; semantic communities and latent textual saliency are then detected. For visual saliency, we adopt a dual-layer BoW (DL-BoW) model that integrates local features with the salient regions of the image. Experiments on the MIRFlickr and IAPR TC-12 datasets demonstrate that the proposed method outperforms other state-of-the-art approaches.

16.
Auditory scenes are temporal audio segments with coherent semantic content. Automatically classifying and grouping auditory scenes with similar semantics into categories is beneficial for many multimedia applications, such as semantic event detection and indexing. For such semantic categorization, auditory scenes are first characterized with either low-level acoustic features or mid-level representations such as audio effects, and then supervised classifiers or unsupervised clustering algorithms are employed to group scene segments into semantic categories. In this paper, we focus on automatically categorizing audio scenes in an unsupervised manner. To achieve more reasonable clustering results, we introduce a co-clustering scheme to exploit potential grouping trends among the different dimensions of the feature space (whether low-level or mid-level), providing a more accurate similarity measure for comparing auditory scenes. Moreover, we extend the co-clustering scheme with a strategy based on the Bayesian information criterion (BIC) to automatically estimate the number of clusters. An evaluation performed on 272 auditory scenes extracted from 12 hours of audio data shows very encouraging categorization results: co-clustering outperformed several traditional one-way clustering algorithms, both on low-level acoustic features and on mid-level audio-effect representations. Finally, we present our vision of the applicability of this approach to general multimedia data, and show some preliminary results on content-based image clustering.

17.
Cui  Zheng  Hu  Yongli  Sun  Yanfeng  Gao  Junbin  Yin  Baocai 《Multimedia Tools and Applications》2022,81(17):23615-23632

The image-text retrieval task has received much attention in modern artificial-intelligence research. It remains challenging because image and text are heterogeneous cross-modal data. The key issue in image-text retrieval is how to learn a common feature space while preserving the semantic correspondence between image and text. Existing works cannot obtain fine cross-modal feature representations because the semantic relations between local features are not effectively utilized and noise is not suppressed. To address these issues, we propose a Cross-modal Alignment with Graph Reasoning (CAGR) model, in which refined cross-modal features in the common feature space are learned and a fine-grained cross-modal alignment method is then applied. Specifically, we introduce a graph reasoning module that explores semantic connections between local elements in each modality and measures their importance with a self-attention mechanism. Through multi-step reasoning, the visual semantic graph and textual semantic graph can be effectively learned, yielding refined visual and textual features. Finally, to measure the similarity between image and text, a novel alignment approach named cross-modal attentional fine-grained alignment computes a similarity score between the two feature sets. Our model achieves competitive performance compared with state-of-the-art methods on the Flickr30K and MS-COCO datasets, and extensive experiments demonstrate its effectiveness.


18.
This paper investigates the potential benefit of low-level human-vision behaviors in the context of high-level semantic concept detection. A large part of current approaches relies on the bag-of-words (BoW) model, which has proven to be a good choice, especially for object recognition in images. Its extension from static images to video sequences raises new problems, mainly concerning how to use the temporal information related to the concepts to detect (swimming, drinking, and so on). In this study, we propose to apply a human retina model to preprocess video sequences before performing the state-of-the-art BoW analysis. This preprocessing, designed to enhance relevant information, increases performance by introducing robustness to traditional image and video problems such as luminance variation, shadows, compression artifacts, and noise. Additionally, we propose a new segmentation method that selects low-level spatio-temporal potential areas of interest from the visual scene, without slowing the computation as much as a high-level saliency model would. These approaches are evaluated on the TRECVID 2010 and 2011 Semantic Indexing Task datasets, containing from 130 to 346 high-level semantic concepts. We also experiment with various parameter settings to check their effect on performance.

19.
To address the quantization-error problem in current bag-of-words (BoW) approaches to video semantic-concept detection, and to extract low-level video features more effectively and automatically, a video semantic-concept detection algorithm based on topographic independent component analysis (TICA) and Gaussian mixture models (GMM) is proposed. First, features are extracted from video segments with the TICA algorithm, which can learn complex invariant features of the segments. Second, GMMs are used to model the visual features of the video and describe their distribution. Finally, a GMM supervector is constructed for each video segment and a support vector machine (SVM) performs semantic-concept detection. The GMM is an extension of BoW within a probabilistic framework; it reduces quantization error and offers good robustness. Comparative experiments against the traditional BoW and SIFT-GMM methods on the TRECVID 2012 and OV video datasets show that the TICA-and-GMM-based method improves the accuracy of video semantic-concept detection.

20.
In recent years, the rapid growth of multimedia content has made content-based image retrieval (CBIR) a challenging research problem. The content-based attributes of an image are associated with the positions of objects and regions within it, and adding such attributes to image retrieval enhances its performance. In the last few years, the bag-of-visual-words (BoVW) image representation model has gained attention and significantly improved the efficiency and effectiveness of CBIR. In the BoVW model, an image is represented as an orderless histogram of visual words, ignoring spatial attributes. In this paper, we present a novel image representation based on the weighted average of triangular histograms (WATH) of visual words. The proposed approach adds the image's spatial contents to the inverted index of the BoVW model, reduces overfitting at larger dictionary sizes, and narrows the semantic gap between high-level image semantics and low-level image features. Qualitative and quantitative analysis on three image benchmarks demonstrates the effectiveness of the proposed WATH-based approach.
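The idea of triangular spatial binning behind representations like WATH can be illustrated as follows. The 4-triangle split along the two image diagonals, the mixing weight, and all names are assumptions for illustration, not the paper's exact formulation:

```python
def triangle_of(x, y, w, h):
    """Assign a point to one of the 4 triangles formed by the two
    image diagonals: 0=top, 1=right, 2=bottom, 3=left
    (image coordinates, y increasing downward)."""
    above_main = y * w < x * h          # above the (0,0)->(w,h) diagonal
    above_anti = y * w < (w - x) * h    # above the (w,0)->(0,h) diagonal
    if above_main and above_anti:
        return 0
    if above_main:
        return 1
    if above_anti:
        return 3
    return 2

def spatial_triangle_histogram(keypoints, n_words, w, h, tri_weight=0.5):
    """Combine a global BoW histogram with four per-triangle
    histograms by weighted concatenation, so the feature keeps
    some spatial layout. tri_weight is an illustrative mixing
    parameter, not a value from the paper."""
    global_h = [0.0] * n_words
    tri_h = [[0.0] * n_words for _ in range(4)]
    for x, y, word in keypoints:
        global_h[word] += 1
        tri_h[triangle_of(x, y, w, h)][word] += 1
    feat = [(1 - tri_weight) * v for v in global_h]
    for t in tri_h:
        feat.extend(tri_weight * v for v in t)
    return feat

# (x, y, visual word) keypoints in a 100x100 image:
# word 0 lands in the top triangle, word 1 in the right triangle
kps = [(10, 5, 0), (90, 50, 1)]
f = spatial_triangle_histogram(kps, 2, 100, 100)
```

A plain BoVW histogram of these two keypoints would be identical wherever they fell in the image; the triangle bins are what make the representation sensitive to their positions.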
