Similar Documents
20 similar documents found (search time: 46 ms)
1.
Statistical semantics for enhancing document clustering   (Cited by: 1; self-citations: 1, citations by others: 0)
Document clustering algorithms usually use vector space model (VSM) as their underlying model for document representation. VSM assumes that terms are independent and accordingly ignores any semantic relations between them. This results in mapping documents to a space where the proximity between document vectors does not reflect their true semantic similarity. This paper proposes new models for document representation that capture semantic similarity between documents based on measures of correlations between their terms. The paper uses the proposed models to enhance the effectiveness of different algorithms for document clustering. The proposed representation models define a corpus-specific semantic similarity by estimating measures of term–term correlations from the documents to be clustered. The corpus of documents accordingly defines a context in which semantic similarity is calculated. Experiments have been conducted on thirteen benchmark data sets to empirically evaluate the effectiveness of the proposed models and compare them to VSM and other well-known models for capturing semantic similarity.
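The idea in item 1, letting term–term correlations mediate document similarity, can be sketched as follows. This is a minimal illustration, not the paper's actual estimators: the toy corpus and the choice of cosine over term co-occurrence profiles as the correlation measure are assumptions.

```python
from math import sqrt

# Toy corpus: documents as bags of terms.
docs = [["car", "engine"], ["automobile", "engine"], ["banana", "fruit"]]
vocab = sorted({t for d in docs for t in d})

# Term-document incidence vectors: one row per term.
rows = {t: [1.0 if t in d else 0.0 for d in docs] for t in vocab}

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

# Corpus-specific term-term correlation matrix C (here: cosine of
# co-occurrence profiles; the paper studies several such measures).
C = {(s, t): cosine(rows[s], rows[t]) for s in vocab for t in vocab}

def semantic_sim(d1, d2):
    """Document similarity mediated by term correlations: avg of C over term pairs."""
    return sum(C[(s, t)] for s in d1 for t in d2) / (len(d1) * len(d2))

# "car engine" and "automobile engine" share only "engine", yet "car" and
# "automobile" both correlate with "engine", so the pair still scores high,
# while the unrelated "banana fruit" document scores zero.
print(semantic_sim(docs[0], docs[1]), semantic_sim(docs[0], docs[2]))
```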

2.
Text document clustering using global term context vectors   (Cited by: 2; self-citations: 2, citations by others: 0)
Despite the advantages of the traditional vector space model (VSM) representation, there are known deficiencies concerning the term independence assumption. The high dimensionality and sparsity of the text feature space and phenomena such as polysemy and synonymy can only be handled if a way is provided to measure term similarity. Many approaches have been proposed that map document vectors onto a new feature space where learning algorithms can achieve better solutions. This paper presents the global term context vector-VSM (GTCV-VSM) method for text document representation. It is an extension of VSM that: (i) captures local contextual information for each term occurrence in the term sequences of documents; (ii) combines the local contexts of a term's occurrences to define the global context of that term; (iii) constructs a semantic matrix from the global contexts of all terms; (iv) uses this matrix to linearly map traditional VSM (Bag-of-Words, BOW) document vectors onto a 'semantically smoothed' feature space where problems such as text document clustering can be solved more efficiently. We present an experimental study demonstrating the improvement of clustering results when the proposed GTCV-VSM representation is used compared with traditional VSM-based approaches.

3.
Many natural language applications need to represent input text as a fixed-length vector. Existing techniques such as word embeddings and document representation provide feature representations for natural language tasks, but they account neither for the differing importance of individual words within a sentence nor for the differing importance of individual sentences within a document. This paper proposes a document representation model based on a hierarchical attention mechanism (HADR) that considers both the important sentences in a document and the important words within each sentence. Experimental results show that document representations that account for word and sentence importance achieve better performance: the model attains higher accuracy on document sentiment classification (IMDB) than the Doc2Vec and Word2Vec models.

4.
This paper proposes a vector space model that combines keyword features with co-occurring word-pair features. First, candidate keywords are extracted from the text through word segmentation and stop-word removal, and keyword features are selected by document frequency. Then, candidate co-occurring word pairs are constructed from all pairs of the selected keyword features, and support and confidence measures are defined to select the word-pair features. Finally, the keyword features and co-occurring word-pair features are combined to build the vector space model. Text classification experiments show that the proposed model has stronger text classification capability.

5.
Traditional text classification relies on surface features of the text, most commonly through the vector space model: documents are represented as space vectors and classified by similarity comparison. To overcome the term-independence assumption of the vector space model, this paper proposes a text classification model based on latent semantic indexing. Statistical analysis of a large text collection reveals the contextual meanings of words; singular value decomposition effectively reduces the dimensionality of the vector space and eliminates the effects of synonymy and polysemy, thereby improving text classification accuracy.
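The latent semantic indexing step described above, truncated singular value decomposition of the term-document matrix, can be sketched with numpy. This is a generic LSI illustration under an assumed toy matrix, not the paper's implementation:

```python
import numpy as np

# Term-document matrix A (rows: car, automobile, engine, banana, fruit).
A = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])

# Truncated SVD: A ~= U_k S_k V_k'; documents live in a k-dim latent space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_latent = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dim row per document

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# In the latent space the "car" and "automobile" documents collapse together
# (synonymy folded by SVD), while the "banana fruit" document stays orthogonal.
print(cos(docs_latent[0], docs_latent[1]), cos(docs_latent[0], docs_latent[2]))
```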

6.
Text representation is an essential task in transforming text input into features that can later be used for further text mining and information retrieval tasks. The commonly used text representation models are Bag-of-Words (BOW) and the N-gram model. Nevertheless, known issues of these models, namely inaccurate semantic representation of text and the high dimensionality of word combinations, need to be addressed. A pattern-based model named Frequent Adjacent Sequential Pattern (FASP) is introduced to represent text using sets of adjacent word sequences that occur frequently across the document collection. The purpose of this study is to discover textual pattern similarity between documents that can later be converted to a set of rules describing the main news event. FASP is based on the pattern-growth divide-and-conquer strategy; the main difference between FASP and the prior technique lies in the pattern generation phase. The approach is tested against the BOW and N-gram text representation models on Malay- and English-language news datasets with different term weightings in the vector space model (VSM). The findings demonstrate that the FASP model performs promisingly in finding similarities between documents, with an average vector size reduction of 34% against BOW and 77% against the N-gram model on the Malay dataset. Results on the English dataset are consistent, indicating that the FASP approach is also language independent.

7.
Text classification based on a WordNet concept vector space model   (Cited by: 5; self-citations: 0, citations by others: 5)
This paper proposes a text feature extraction method based on the WordNet lexical ontology. Synonym-set (synset) concepts replace individual terms, and hypernym–hyponym relations between synsets are also considered, building a concept vector space model as the document feature vector so that high-level information representative of each category can be extracted during training. Experimental results show that when the training text collection is small, the method substantially improves text classification accuracy.

8.
Text categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has been an application area for many learning approaches, which have proved effective. Nevertheless, TC poses many challenges to machine learning. In this paper, we suggest, for text categorization, the integration of external WordNet lexical information to supplement training data for a semi-supervised clustering algorithm which (i) uses a finite design set of labeled data to (ii) help agglomerative hierarchical clustering (AHC) algorithms partition a finite set of unlabeled data and then (iii) terminates without the capacity to classify other objects. This algorithm is the "semi-supervised agglomerative hierarchical clustering algorithm" (ssAHC). Our experiments use the Reuters-21578 database and consist of binary classifications for categories selected from the 89 TOPICS classes of the Reuters collection. Using the vector space model (VSM), each document is represented by its original feature vector augmented with an external feature vector generated using WordNet. We verify experimentally that the integration of WordNet helps ssAHC improve its performance, effectively addresses the classification of documents into categories with few training documents, and does not interfere with the use of training data. © 2001 John Wiley & Sons, Inc.

9.
Text classification is one of the core tasks in natural language processing, and the development of deep learning has opened broader prospects for it. Addressing the strengths and weaknesses of current deep-learning-based text classification methods on long texts, this paper proposes a text classification model that introduces a hybrid attention mechanism on top of a hierarchical model to focus on the important parts of a text. First, sentences and documents are encoded separately according to the document's hierarchical structure; second, attention is applied at each level. During sentence encoding, max pooling is used alongside the global target vector to extract a sentence-specific target vector, so that the encoded document vector carries more distinct class features and better attends to the most discriminative semantic features of each text. Finally, documents are classified from the constructed document representation. Experimental results on public and industry datasets show that the model achieves better classification performance on long texts with hierarchical structure.

10.
In information processing, most existing text classification algorithms are based on the vector space model, which cannot effectively express a document's structural information and hence cannot fully express its semantics. To represent document semantics more effectively, this paper first proposes a new document representation model, a graph model, in which a weighted labeled graph expresses a document's feature terms and their positional relationships. On this basis, a new document similarity measure is proposed and applied to Chinese text classification. Experimental results show that this graph-based document representation is effective and feasible.

11.
Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. The compact representation of documents can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. For instance, text classification is a domain with the challenge of a high-dimensional feature space. Hence, extracting the most important and relevant words about the content of a document and using these keywords as features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most-frequent-measure-based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and the TextRank algorithm) with classification algorithms and ensemble methods for scientific text document classification (categorization). A comprehensive comparison of base learning algorithms (Naïve Bayes, support vector machines, logistic regression and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting) is conducted. To the best of our knowledge, this is the first empirical analysis that evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The classification schemes are compared in terms of classification accuracy, F-measure and area-under-curve values. To validate the empirical analysis, a two-way ANOVA test is employed. The experimental analysis indicates that a Bagging ensemble of Random Forest with the most-frequent-based keyword extraction method yields promising results for text classification.
For the ACM document collection, the highest average predictive performance (93.80%) is obtained with the most-frequent-based keyword extraction method and a Bagging ensemble of the Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in application fields of text classification.
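The simplest of the five extractors in item 11, most-frequent-measure keyword extraction, reduces to ranking terms by corpus frequency and keeping the top k as features. A minimal sketch follows; the toy corpus and k are assumptions, and a real system would first remove stop words:

```python
from collections import Counter

def top_k_keywords(docs, k):
    """Most-frequent-measure keyword extraction: rank terms by corpus frequency."""
    counts = Counter(t for d in docs for t in d.split())
    return [term for term, _ in counts.most_common(k)]

docs = [
    "svm margin kernel svm",
    "kernel trick svm classifier",
    "random forest classifier ensemble",
]
keywords = top_k_keywords(docs, 3)

# Documents are then re-represented over the keyword features only,
# shrinking the feature space before a classifier or ensemble is trained.
features = [[d.split().count(t) for t in keywords] for d in docs]
print(keywords)
```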

12.
A Bayesian text classification method based on the vector space model   (Cited by: 2; self-citations: 0, citations by others: 2)
This paper proposes a Bayesian text classification method based on the vector space model. First, feature terms are extracted from the training text collection to build a feature vector space model. Then a Bayesian text classifier assigns documents of unknown category to classes. A detailed description of the Bayesian text classification procedure and a test example of text classification are given.
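The pipeline in item 12, feature terms forming a vector space and a Bayes classifier over them, can be sketched as a multinomial Naive Bayes with Laplace smoothing. This is a generic illustration on an assumed toy training set, not that paper's exact procedure:

```python
from collections import Counter, defaultdict
from math import log

train = [
    ("buy cheap pills now", "spam"),
    ("cheap offer buy now", "spam"),
    ("meeting agenda attached", "ham"),
    ("project meeting tomorrow", "ham"),
]

vocab = {t for text, _ in train for t in text.split()}
class_docs = defaultdict(list)
for text, label in train:
    class_docs[label].extend(text.split())

priors = {c: sum(1 for _, l in train if l == c) / len(train) for c in class_docs}
counts = {c: Counter(toks) for c, toks in class_docs.items()}

def predict(text):
    """argmax_c [ log P(c) + sum_t log P(t | c) ], with add-one smoothing."""
    def score(c):
        total = sum(counts[c].values())
        s = log(priors[c])
        for t in text.split():
            s += log((counts[c][t] + 1) / (total + len(vocab)))
        return s
    return max(class_docs, key=score)

print(predict("cheap pills offer"))   # → "spam"
```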

13.
Web information retrieval based on XML and an N-level VSM   (Cited by: 2; self-citations: 0, citations by others: 2)
XML documents are well-formed and clearly hierarchical, which makes their structure easy to manipulate and analyze. After converting HTML documents on the Web into XML, this paper analyzes the document hierarchy through the DOM tree in Java and divides each document into hierarchical text segments. The traditional VSM algorithm is improved by converting each text segment into a space vector, yielding an N-level VSM algorithm. Experiments show that the improved algorithm outperforms the traditional VSM algorithm in both recall and precision.

14.
An immune clonal feature selection algorithm applied to text classification   (Cited by: 2; self-citations: 0, citations by others: 2)
Selecting the feature terms that best express the topic of a text, and thereby reducing the dimensionality of the feature space, is a key problem in text classification. To address it, this paper proposes an immune clonal feature selection algorithm based on the vector space model (VSM). Experiments show that the method effectively improves text classification accuracy and has a clear advantage over the document frequency method and genetic algorithms.

15.
Web text representation is the prerequisite for Web text feature extraction and classification, and the most common representation is the vector space model (VSM), whose vectors are generally built from word-based feature terms. VSM itself ignores the latent conceptual structure in a text's context (such as co-occurrence relations between words), while Web text is semi-structured and frequently contains new words. This paper therefore proposes a Web text representation method based on new-word detection on top of VSM: first, preprocessing converts web pages into text; then the text is segmented into words; next, new words are detected via bigram mutual information and added to the dictionary, after which the text is re-segmented; finally, both words and new words are used together to represent the Web text. Experimental results show that the method helps identify out-of-vocabulary words and extend the existing dictionary, strengthens Web text representation, improves the quality of Web text feature terms, and improves Web text classification.

16.
In the vector space model (VSM), text representation is the task of transforming the content of a textual document into a vector in the term space so that the document can be recognized and classified by a computer or a classifier. Different terms (i.e. words, phrases, or any other indexing units used to identify the contents of a text) have different importance in a text. Term weighting methods assign appropriate weights to the terms to improve the performance of text categorization. In this study, we investigate several widely used unsupervised (traditional) and supervised term weighting methods on benchmark data collections in combination with SVM and kNN algorithms. In consideration of the distribution of relevant documents in the collection, we propose a new simple supervised term weighting method, tf.rf, to improve the terms' discriminating power for the text categorization task. In our controlled experiments, the supervised term weighting methods show mixed performance. Specifically, our proposed supervised term weighting method, tf.rf, consistently outperforms the other term weighting methods, while other supervised term weighting methods based on information theory or statistical metrics perform the worst in all experiments. On the other hand, the popular tf.idf method does not show uniformly good performance across different data sets.
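The tf.rf scheme in item 16 weights a term by its frequency times a relevance factor, commonly given as rf = log2(2 + a / max(1, c)), where a and c count the positive- and negative-class documents containing the term. A minimal sketch with assumed toy counts:

```python
from math import log2

def tf_rf(tf, a, c):
    """tf.rf term weight: tf * log2(2 + a / max(1, c)).

    a: number of positive-class documents containing the term
    c: number of negative-class documents containing the term
    """
    return tf * log2(2 + a / max(1, c))

# A term concentrated in the positive class gets boosted...
print(tf_rf(3, a=9, c=1))   # 3 * log2(11)
# ...while an evenly spread term stays near plain tf (factor log2(3) ~ 1.58).
print(tf_rf(3, a=5, c=5))
```

The max(1, c) guard keeps the factor finite when a term never occurs in the negative class.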

17.
Text categorization is one of the most common themes in the data mining and machine learning fields. Unlike structured data, unstructured text data is more difficult to analyze because it contains complicated syntactic and semantic information. In this paper, we propose a two-level representation model (2RM) to represent text data: one level for syntactic information and the other for semantic information. At the syntactic level, each document is represented as a term vector where the value of each component is the term frequency and inverse document frequency. The Wikipedia concepts related to terms at the syntactic level are used to represent the document at the semantic level. Meanwhile, we design a multi-layer classification framework (MLCLA) to make use of the semantic and syntactic information represented in the 2RM model. The MLCLA framework contains three classifiers. Two of them are applied to the syntactic level and the semantic level in parallel; their outputs are combined and fed to the third classifier to obtain the final results. Experimental results on benchmark data sets (20Newsgroups, Reuters-21578 and Classic3) show that the proposed 2RM model plus the MLCLA framework improves text classification performance compared with existing flat text representation models (term-based VSM, Term Semantic Kernel Model, concept-based VSM, Concept Semantic Kernel Model and Term + Concept VSM) combined with existing classification methods.

18.
String alignment for automated document versioning   (Cited by: 2; self-citations: 2, citations by others: 0)
The automated analysis of documents is an important task given the rapid increase in availability of digital texts. Automatic text processing systems often encode documents as vectors of term occurrence frequencies, a representation which facilitates the classification and clustering of documents. Historically, this approach derives from the related field of data mining, where database entries are commonly represented as points in a vector space. While this lineage has certainly contributed to the development of text processing, there are situations where document collections do not conform to this clustered structure, and where the vector representation may be unsuitable for text analysis. As a proof of concept, we had previously presented a framework where the optimal alignments of documents could be used for visualising the relationships within small sets of documents. In this paper we develop this approach further by using it to automatically generate the version histories of various document collections. For comparison, version histories generated using conventional methods of document representation are also produced. To facilitate this comparison, a simple procedure for evaluating the accuracy of the version histories thus generated is proposed.

19.
Document Similarity Using a Phrase Indexing Graph Model   (Cited by: 3; self-citations: 1, citations by others: 2)
Document clustering techniques mostly rely on single term analysis of text, such as the vector space model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the Document Index Graph, which indexes Web documents based on phrases rather than on single terms only. The semistructured Web documents help in identifying potential phrases that, when matched with other documents, indicate strong similarity between the documents. The Document Index Graph captures this information, and finding significant matching phrases between documents becomes easy and efficient with such a model. The model is flexible in that it can revert to a compact representation of the vector space model if we choose not to index phrases. However, using phrase indexing yields more accurate document similarity calculations. The similarity between documents is based on both single-term weights and matching-phrase weights. The combined similarities are used with standard document clustering techniques to test their effect on the clustering quality. Experimental results show that our phrase-based similarity, combined with single-term similarity measures, gives a more accurate measure of document similarity and thus significantly enhances Web document clustering quality.
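The combined measure in item 19, blending matching-phrase similarity with single-term similarity, can be sketched as a convex combination. This illustration does not reproduce the Document Index Graph itself: phrase similarity is approximated by Jaccard overlap of word bigrams, and the blend weight alpha is an assumption.

```python
def ngrams(tokens, n):
    """Set of contiguous n-word sequences in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def blended_similarity(d1, d2, alpha=0.7):
    """Blend single-term and phrase (word-bigram) similarity:
    phrases weighted by alpha, loose terms by 1 - alpha."""
    t1, t2 = d1.split(), d2.split()
    term_sim = jaccard(set(t1), set(t2))
    phrase_sim = jaccard(ngrams(t1, 2), ngrams(t2, 2))
    return alpha * phrase_sim + (1 - alpha) * term_sim

# Both pairs share the same words, but only the first shares whole phrases,
# so it scores higher under the blended measure.
a = blended_similarity("machine learning is fun", "machine learning is hard")
b = blended_similarity("learning machine fun is", "machine learning is hard")
print(a, b)
```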

20.
A text feature acquisition algorithm based on co-evolution   (Cited by: 3; self-citations: 0, citations by others: 3)
For a securities regulator, accurately classifying textual information drawn from massive amounts of online data is important for improving the efficiency of day-to-day supervision. Based on data mining techniques and using the vector space model (VSM) as the text representation, this paper proposes a multi-document feature extraction algorithm based on a co-evolutionary genetic algorithm. It effectively reduces the dimensionality of text feature vectors and provides a feasible solution for multi-document feature acquisition problems such as obtaining text classification templates.


