首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 835 毫秒
1.
The keyphrases of a text entity are a set of words or phrases that concisely describe the main content of that text. Automatic keyphrase extraction plays an important role in natural language processing and information retrieval tasks such as text summarization, text categorization, full-text indexing, and cross-lingual text reuse. However, automatic keyphrase extraction is still a complicated task and the performance of the current keyphrase extraction methods is low. Automatic discovery of high-quality and meaningful keyphrases requires the application of useful information and suitable mining techniques. This paper proposes Topical and Structural Keyphrase Extractor (TSAKE) for the task of automatic keyphrase extraction. TSAKE combines the prior knowledge about the input langue learned by an N-gram topical model (TNG) with the co-occurrence graph of the input text to form some topical graphs. Different from most of the recent keyphrase extraction models, TSAKE uses the topic model to weight the edges instead of the nodes of the co-occurrence graph. Moreover, while TNG represents the general topics of the language, TSAKE applies network analysis techniques to each topical graph to detect finer grained sub-topics and extract more important words of each sub-topic. The use of these informative words in the ranking process of the candidate keyphrases improves the quality of the final keyphrases proposed by TSAKE. The results of our experiment studies conducted on three manually annotated datasets show the superiority of the proposed model over three baseline techniques and six state-of-the-art models.  相似文献   

2.
针对现有的基于图的关键词提取方法未能有效整合文本序列中词与词之间的潜在语义关系的问题,提出了一个融合词向量与位置信息的基于图的关键词提取算法EPRank。通过词向量表示模型学得目标文档中每个词的表示向量;将该反映词与词之间的潜在语义关系的词向量与位置特征相结合融合到PageRank评分模型中;选择几个排名靠前的单词或短语作为目标文档的关键词。实验结果表明,提出的EPRank方法在KDD和SIGIR两个数据集上的各项评估指标均高于5个现有的关键词提取方法。  相似文献   

3.
While several automatic keyphrase extraction (AKE) techniques have been developed and analyzed, there is little consensus on the definition of the task and a lack of overview of the effectiveness of different techniques. Proper evaluation of keyphrase extraction requires large test collections with multiple opinions, currently not available for research. In this paper, we (i) present a set of test collections derived from various sources with multiple annotations (which we also refer to as opinions in the remained of the paper) for each document, (ii) systematically evaluate keyphrase extraction using several supervised and unsupervised AKE techniques, (iii) and experimentally analyze the effects of disagreement on AKE evaluation. Our newly created set of test collections spans different types of topical content from general news and magazines, and is annotated with multiple annotations per article by a large annotator panel. Our annotator study shows that for a given document there seems to be a large disagreement on the preferred keyphrases, suggesting the need for multiple opinions per document. A first systematic evaluation of ranking and classification of keyphrases using both unsupervised and supervised AKE techniques on the test collections shows a superior effectiveness of supervised models, even for a low annotation effort and with basic positional and frequency features, and highlights the importance of a suitable keyphrase candidate generation approach. We also study the influence of multiple opinions, training data and document length on evaluation of keyphrase extraction. Our new test collection for keyphrase extraction is one of the largest of its kind and will be made available to stimulate future work to improve reliable evaluation of new keyphrase extractors.  相似文献   

4.
Multimodal Retrieval is a well-established approach for image retrieval. Usually, images are accompanied by text caption along with associated documents describing the image. Textual query expansion as a form of enhancing image retrieval is a relatively less explored area. In this paper, we first study the effect of expanding textual query on both image and its associated text retrieval. Our study reveals that judicious expansion of textual query through keyphrase extraction can lead to better results, either in terms of text-retrieval or both image and text-retrieval. To establish this, we use two well-known keyphrase extraction techniques based on tf-idf and KEA. While query expansion results in increased retrieval efficiency, it is imperative that the expansion be semantically justified. So, we propose a graph-based keyphrase extraction model that captures the relatedness between words in terms of both mutual information and relevance feedback. Most of the existing works have stressed on bridging the semantic gap by using textual and visual features, either in combination or individually. The way these text and image features are combined determines the efficacy of any retrieval. For this purpose, we adopt Fisher-LDA to adjudge the appropriate weights for each modality. This provides us with an intelligent decision-making process favoring the feature set to be infused into the final query. Our proposed algorithm is shown to supersede the previously mentioned keyphrase extraction algorithms for query expansion significantly. A rigorous set of experiments performed on ImageCLEF-2011 Wikipedia Retrieval task dataset validates our claim that capturing the semantic relation between words through Mutual Information followed by expansion of a textual query using relevance feedback can simultaneously enhance both text and image retrieval.  相似文献   

5.
面向文本的关键词自动提取一直以来是自然语言处理领域的一个关键基础问题和研究热点.特别是,随着当前对文本数据应用需求的不断增加,使得关键词提取技术进一步得到研究者的广泛关注.尽管近年来关键词提取技术得到长足的发展,但提取结果目前还远未取得令人满意的效果.为了促进关键词提取问题的解决,本文对近年来国内、外学者在该研究领域取得的成果进行了系统总结,具体包括候选关键词生成、特征工程和关键词提取三个主要步骤,并对未来可能的研究方向进行了探讨和展望.不同于围绕提取方法进行总结的综述文献,本文主要围绕着各种方法使用的特征信息归纳总结现有成果,这种从特征驱动的视角考察现有研究成果的方式有助于综合利用现有特征或提出新特征,进而提出更有效的关键词提取方法.  相似文献   

6.
当前主流的Web图像检索方法仅考虑了视觉特征,没有充分利用Web图像附带的文本信息,并忽略了相关文本中涉及的有价值的语义,从而导致其图像表达能力不强。针对这一问题,提出了一种新的无监督图像哈希方法——基于语义迁移的深度图像哈希(semantic transfer deep visual hashing,STDVH)。该方法首先利用谱聚类挖掘训练文本的语义信息;然后构建深度卷积神经网络将文本语义信息迁移到图像哈希码的学习中;最后在统一框架中训练得到图像的哈希码和哈希函数,在低维汉明空间中完成对大规模Web图像数据的有效检索。通过在Wiki和MIR Flickr这两个公开的Web图像集上进行实验,证明了该方法相比其他先进的哈希算法的优越性。  相似文献   

7.
提出了一种基于高频词和共现词的文本主题词抽取方法。该方法充分考虑到文档的统计信息和语义信息, 通过对提问问题和答案库中答案的相似度计算排序, 输出候选答案。提出一种具体的应用模型, 分别从问题的分析、信息检索和答案抽取三个模块进行系统的设计, 具有一定的应用价值。  相似文献   

8.

The internet changed the way that people communicate, and this has led to a vast amount of Text that is available in electronic format. It includes things like e-mail, technical and scientific reports, tweets, physician notes and military field reports. Providing key-phrases for these extensive text collections thus allows users to grab the essence of the lengthy contents quickly and helps to locate information with high efficiency. While designing a Keyword Extraction and Indexing system, it is essential to pick unique properties, called features. In this article, we proposed different unsupervised keyword extraction approaches, which is independent of the structure, size and domain of the documents. The proposed method relies on the novel and cognitive inspired set of standard, phrase, word embedding and external knowledge source features. The individual and selected feature results are reported through experimentation on four different datasets viz. SemEval, KDD, Inspec, and DUC. The selected (feature selection) and word embedding based features are the best features set to be used for keywords extraction and indexing among all mentioned datasets. That is the proposed distributed word vector with additional knowledge improves the results significantly over the use of individual features, combined features after feature selection and state-of-the-art. After successfully achieving the objective of developing various keyphrase extraction methods we also experimented it for document classification task.

  相似文献   

9.
关键词抽取技术是自然语言处理领域的一个研究热点。在目前的关键词抽取算法中,深度学习方法较少考虑到中文的特点,汉字粒度的信息利用不充分,中文短文本关键词的提取效果仍有较大的提升空间。为了改进短文本的关键词提取效果,针对论文摘要关键词自动抽取任务,提出了一种将双向长短时记忆神经网络(Bidirectional Long Shot-Term Memory,BiLSTM)与注意力机制(Attention)相结合的基于序列标注(Sequence Tagging)的关键词提取模型(Bidirectional Long Short-term Memory and Attention Mechanism Based on Sequence Tagging,BAST)。首先使用基于词语粒度的词向量和基于字粒度的字向量分别表示输入文本信息;然后,训练BAST模型,利用BiLSTM和注意力机制提取文本特征,并对每个单词的标签进行分类预测;最后使用字向量模型校正词向量模型的关键词抽取结果。实验结果表明,在8159条论文摘要数据上,BAST模型的F1值达到66.93%,比BiLSTM-CRF(Bidirectional Long Shoft-Term Memory and Conditional Random Field)算法提升了2.08%,较其他传统关键词抽取算法也有进一步的提高。该模型的创新之处在于结合了字向量和词向量模型的抽取结果,充分利用了中文文本信息的特征,可以有效提取短文本的关键词,提取效果得到了进一步的改进。  相似文献   

10.
李雄  丁治明  苏醒  郭黎敏 《计算机科学》2018,45(Z11):417-421, 438
本研究主要解决在大量文本数据中 抽取 关键语义信息的问题。文本是自然语言的信息载体,在分析和处理文本信息时,由于目标与方式不同,对文本信息的特征表达方式也各不相同。已有的语义抽取方法往往是针对单篇文本的,忽略了不同文本间的语义联系。为此,文中提出了基于词项聚类的文本语义标签提取方法。该方法以语义抽取为目标,以Hinton的分布式表示假说为文本信息的表达方式,并以最大化语义标签与原文本数据间的语义相似度为目标,使用聚类算法对语义标签进行聚类。实验表明,所提方法由于是基于全体词汇表对语义信息分布进行聚类计算的,因此在语义丰富度和表达能力上相比很多现有方法具有更好的表现。  相似文献   

11.
基于语义的关键词提取算法   总被引:3,自引:1,他引:2  
关键词1提供了文档内容的概要信息,它们被使用在很多数据挖掘的应用中,在目前的关键词提取算法中,我们发现词汇层面(代表意思的词)和概念层面(意思本身)的差别导致了关键字提取的不准确,比如不同语法的词可能有着相同的意思,而相同语法的词在不同的上下文有着不同的意思.为了解决这个问题,这篇文章提出使用词义代替词并且通过考虑关键候选词的语义信息来提高关键词提取算法性能的方法.与现有的关键词提取方法不同,该方法首先通过使用消歧算法,通过上下文得到候选词的词义;然后在后面的词合并、特征提取和评估的步骤中,候选词义之间的语义相关度被用来提高算法的性能.在评估算法时,我们采用一种更为有效的基于语义的评估方法与著名的Kea系统作比较.在不同领域间的实验中可以发现,当考虑语义信息后,关键词提取算法的性能能够得到很大的提高.在同领域的实验中,我们的算法的性能与Kea 算法的相近.我们的算法没有领域的限制性,因此具有更好的应用前景.  相似文献   

12.

A great many of approaches have been developed for cross-modal retrieval, among which subspace learning based ones dominate the landscape. Concerning whether using the semantic label information or not, subspace learning based approaches can be categorized into two paradigms, unsupervised and supervised. However, for multi-label cross-modal retrieval, supervised approaches just simply exploit multi-label information towards a discriminative subspace, without considering the correlations between multiple labels shared by multi-modalities, which often leads to an unsatisfactory retrieval performance. To address this issue, in this paper we propose a general framework, which jointly incorporates semantic correlations into subspace learning for multi-label cross-modal retrieval. By introducing the HSIC-based regularization term, the correlation information among multiple labels can be not only leveraged but also the consistency between the modality similarity from each modality is well preserved. Besides, based on the semantic-consistency projection, the semantic gap between the low-level feature space of each modality and the shared high-level semantic space can be balanced by a mid-level consistent one, where multi-label cross-modal retrieval can be performed effectively and efficiently. To solve the optimization problem, an effective iterative algorithm is designed, along with its convergence analysis theoretically and experimentally. Experimental results on real-world datasets have shown the superiority of the proposed method over several existing cross-modal subspace learning methods.

  相似文献   

13.
Using Webcast Text for Semantic Event Detection in Broadcast Sports Video   总被引:1,自引:0,他引:1  
Sports video semantic event detection is essential for sports video summarization and retrieval. Extensive research efforts have been devoted to this area in recent years. However, the existing sports video event detection approaches heavily rely on either video content itself, which face the difficulty of high-level semantic information extraction from video content using computer vision and image processing techniques, or manually generated video ontology, which is domain specific and difficult to be automatically aligned with the video content. In this paper, we present a novel approach for sports video semantic event detection based on analysis and alignment of webcast text and broadcast video. Webcast text is a text broadcast channel for sports game which is co-produced with the broadcast video and is easily obtained from the web. We first analyze webcast text to cluster and detect text events in an unsupervised way using probabilistic latent semantic analysis (pLSA). Based on the detected text event and video structure analysis, we employ a conditional random field model (CRFM) to align text event and video event by detecting event moment and event boundary in the video. Incorporation of webcast text into sports video analysis significantly facilitates sports video semantic event detection. We conducted experiments on 33 hours of soccer and basketball games for webcast analysis, broadcast video analysis and text/video semantic alignment. The results are encouraging and compared with the manually labeled ground truth.   相似文献   

14.
Efficiently representing and recognizing the semantic classes of the subregions of large-scale high spatial resolution (HSR) remote-sensing images are challenging and critical problems. Most of the existing scene classification methods concentrate on the feature coding approach with handcrafted low-level features or the low-level unsupervised feature learning approaches, which essentially prevent them from better recognizing the semantic categories of the scene due to their limited mid-level feature representation ability. In this article, to overcome the inadequate mid-level representation, a patch-based spatial-spectral hierarchical convolutional sparse auto-encoder (HCSAE) algorithm, based on deep learning, is proposed for HSR remote-sensing imagery scene classification. The HCSAE framework uses an unsupervised hierarchical network based on a sparse auto-encoder (SAE) model. In contrast to the single-level SAE, the HCSAE framework utilizes the significant features from the single-level algorithm in a feedforward and full connection approach to the maximum extent, which adequately represents the scene semantics in the high level of the HCSAE. To ensure robust feature learning and extraction during the SAE feature extraction procedure, a ‘dropout’ strategy is also introduced. The experimental results using the UC Merced data set with 21 classes and a Google Earth data set with 12 classes demonstrate that the proposed HCSAE framework can provide better accuracy than the traditional scene classification methods and the single-level convolutional sparse auto-encoder (CSAE) algorithm.  相似文献   

15.
基于主题特征的关键词抽取   总被引:2,自引:1,他引:1  
为了使抽取出的关键词更能反映文档主题,提出了一种新的词的主题特征(topic feature,TF)计算方法,该方法利用主题模型中词和主题的分布情况计算词的主题特征。并将该特征与关键词抽取中的常用特征结合,用装袋决策树方法构造一个关键词抽取模型。实验结果表明提出的主题特征可以提升关键词抽取的效果,同时验证了装袋决策树在关键词抽取中的适用性。  相似文献   

16.
Linked Open Data cloud is being conceived and published to improve the usability and performance of various applications including Recommender System. While most of the existing works incorporate semantic web information into recommendation system by exploiting content based method, we introduced a collaborative filtering based semantic dual probabilistic matrix factorization approach. We have used semantic item features, generated in an unsupervised manner and incorporated them into user-item preference matrix for recommendation by co-factoring two matrices. To mitigate the difficulty of high dimensionality, sparsity and possible noise in the semantic item-property matrix, Singular Value Decomposition was used as an unsupervised preprocessing step. To evaluate our new approach, RMSE are compared with 10 state-of-art algorithms especially Probabilistic Matrix Factorization which our method is based on, Precision is also compared between our method and PMF. Although similar dual matrix factorization approaches exist but most of them deal with very small item property matrix with abundant entries while our approach introduced a high dimensional and sparse item property matrix through an unsupervised automatic way which also alleviate the efforts to create those item property matrices.  相似文献   

17.
As a valuable tool for text understanding, semantic similarity measurement enables discriminative semantic-based applications in the fields of natural language processing, information retrieval, computational linguistics and artificial intelligence. Most of the existing studies have used structured taxonomies such as WordNet to explore the lexical semantic relationship, however, the improvement of computation accuracy is still a challenge for them. To address this problem, in this paper, we propose a hybrid WordNet-based approach CSSM-ICSP to measuring concept semantic similarity, which leverage the information content(IC) of concepts to weight the shortest path distance between concepts. To improve the performance of IC computation, we also develop a novel model of the intrinsic IC of concepts, where a variety of semantic properties involved in the structure of WordNet are taken into consideration. In addition, we summarize and classify the technical characteristics of previous WordNet-based approaches, as well as evaluate our approach against these approaches on various benchmarks. The experimental results of the proposed approaches are more correlated with human judgment of similarity in term of the correlation coefficient, which indicates that our IC model and similarity detection approach are comparable or even better for semantic similarity measurement as compared to others.  相似文献   

18.
Document clustering is text processing that groups documents with similar concepts. It's usually considered an unsupervised learning approach because there's no teacher to guide the training process, and topical information is often assumed to be unavailable. A guided approach to document clustering that integrates linguistic top-down knowledge from WordNet into text vector representations based on the extended significance vector weighting technique improves both classification accuracy and average quantization error. In our guided self-organization approach we integrate topical and semantic information from WordNet. Because a document-training set with preclassified information implies relationships between a word and its preference class, we propose a novel document vector representation approach to extract these relationships for document clustering. Furthermore, merging statistical methods, competitive neural models, and semantic relationships from symbolic Word-Net, our hybrid learning approach is robust and scales up to a real-world task of clustering 100,000 news documents.  相似文献   

19.
关键词生成是自然语言处理中一项经典但具有挑战性的任务,需要从文档中自动生成一组具有代表性和特征性的词语。基于深度学习的序列到序列模型在这项任务中取得了显著的效果,弥补了以往关键词抽取存在的一个严重缺陷:无法产生不存在于原文中的关键词。由于其产生的结果更切合实际,关键词生成方法逐渐超越了以往的抽取方法,成为了关键词提取任务的主流方法。介绍了关键词提取的发展历程以及关键词生成任务的主要数据集,对基础设计采用序列到序列模型的关键词生成方法进行了分类梳理,分析其原理和优缺点。概述了关键词生成任务的评价方法,并对其未来研究重点进行了展望。  相似文献   

20.
从非结构化商品描述文本中抽取结构化属性信息,对于电子商务实现商品的对比与推荐及用户需求预测等功能具有重要意义.现有结构化方法大多采用监督或半监督的分类方法抽取属性值与属性名,通过文法分析器分析属性值与属性名之间的文法依存关系,并根据关联规则实现属性值与属性名的匹配.这些方法存在以下不足:(1)需要人工标记部分属性值、属性名及它们之间的对应关系;(2)属性值-属性名匹配的准确度受到语言习惯、句意逻辑、语料库及属性名候选集质量的严重制约.提出了一种无监督的中文商品属性结构化方法.该方法借助搜索引擎,基于小概率事件原理分析文法关系来抽取属性值与属性名.同时,提出相对不选取条件概率场,并使用Page Rank算法来计算属性值与属性名的配对概率.该方法无需人工标记的开销,且无论商品描述中是否显式地包含相应的属性名,该方法都能自动抽取到属性值并匹配相应的属性名.使用百度搜索引擎上的真实语料,针对4类商品的中文描述进行了实验.实验结果验证了对于候选属性名的自动生成,所提出的基于搜索引擎搜索属性值,并在包含属性值的搜索结果中抽取一般名词的候选属性名生成方法与只在描述句中抽取一般名词的候选属性名生成方法相比,查全率提高了20%以上;对于非量化类属性,所提出的基于相对不选取条件概率场的属性值-属性名匹配方法与基于依存关联的方法相比,Rank-1的准确率提高了30%以上,平均MRR提高了0.3以上.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号