首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
In this paper, genetic algorithm oriented latent semantic features (GALSF) are proposed to obtain better representation of documents in text classification. The proposed approach consists of feature selection and feature transformation stages. The first stage is carried out using the state-of-the-art filter-based methods. The second stage employs latent semantic indexing (LSI) empowered by genetic algorithm such that a better projection is attained using appropriate singular vectors, which are not limited to the ones corresponding to the largest singular values, unlike standard LSI approach. In this way, the singular vectors with small singular values may also be used for projection whereas the vectors with large singular values may be eliminated as well to obtain better discrimination. Experimental results demonstrate that GALSF outperforms both LSI and filter-based feature selection methods on benchmark datasets for various feature dimensions.  相似文献   

2.
刘美茹 《计算机工程》2007,33(15):217-219
文本分类技术是文本数据挖掘的基础和核心,是基于自然语言处理技术和机器学习算法的一个具体应用。特征选择和分类算法是文本分类中两个最关键的技术,该文提出了利用潜在语义索引进行特征提取和降维,并结合支持向量机(SVM)算法进行多类分类,实验结果显示与向量空间模型(VSM)结合SVM方法和LSI结合K近邻(KNN)方法相比,取得了更好的效果,在文本类别数较少、类别划分比较清晰的情况下可以达到实用效果。  相似文献   

3.
《Applied Soft Computing》2007,7(3):908-914
This paper presents a least square support vector machine (LS-SVM) that performs text classification of noisy document titles according to different predetermined categories. The system's potential is demonstrated with a corpus of 91,229 words from University of Denver's Penrose Library catalogue. The classification accuracy of the proposed LS-SVM based system is found to be over 99.9%. The final classifier is an LS-SVM array with Gaussian radial basis function (GRBF) kernel, which uses the coefficients generated by the latent semantic indexing algorithm for classification of the text titles. These coefficients are also used to generate the confidence factors for the inference engine that present the final decision of the entire classifier. The system is also compared with a K-nearest neighbor (KNN) and Naïve Bayes (NB) classifier and the comparison clearly claims that the proposed LS-SVM based architecture outperforms the KNN and NB based system. The comparison between the conventional linear SVM based classifiers and neural network based classifying agents shows that the LS-SVM with LSI based classifying agents improves text categorization performance significantly and holds a lot of potential for developing robust learning based agents for text classification.  相似文献   

4.
采用向量空间模型(V SM)描述文本,利用隐性语义索引(LSI)技术进行特征重构与降维,构造了BP神经网络文本分类器。将贝叶斯分类技术与前者结合构造了一种混合文本分类器。实验结果表明混合分类器分类准确度和分类速度得到提高。  相似文献   

5.
Document clustering using locality preserving indexing   总被引:7,自引:0,他引:7  
We propose a novel document clustering method which aims to cluster the documents into different semantic classes. The document space is generally of high dimensionality and clustering in such a high dimensional space is often infeasible due to the curse of dimensionality. By using locality preserving indexing (LPI), the documents can be projected into a lower-dimensional semantic space in which the documents related to the same semantics are close to each other. Different from previous document clustering methods based on latent semantic indexing (LSI) or nonnegative matrix factorization (NMF), our method tries to discover both the geometric and discriminating structures of the document space. Theoretical analysis of our method shows that LPI is an unsupervised approximation of the supervised linear discriminant analysis (LDA) method, which gives the intuitive motivation of our method. Extensive experimental evaluations are performed on the Reuters-21578 and TDT2 data sets.  相似文献   

6.
基于潜在语义索引和句子聚类的中文自动文摘   总被引:2,自引:0,他引:2  
自动文摘是自然语言处理领域的一项重要的研究课题.提出一种基于潜在语义索引和句子聚类的中文自动文摘方法.该方法的特色在于:使用潜在语义索引计算句子的相似度,并将层次聚类算法和K-中心聚类算法相结合进行句子聚类,这样提高了句子相似度计算和主题划分的准确性,有利于生成的文摘在全面覆盖文档主题的同时减少自身的冗余.实验结果验证了该文提出的方法的有效性,对比传统的基于聚类的自动文摘方法,该方法生成的文摘质量获得了显著的提高.  相似文献   

7.
基于Rough集潜在语义索引的Web文档分类   总被引:5,自引:0,他引:5  
Rough集(粗糙集)埋论是一种处理不确定或模糊知识的数学工具。提出了一种基于Rough集理论的潜在语义索引的Web文档分类方法。首先应用向量空间模型表示Web文档信息,然后通过矩阵的奇异值分解来进行信息过滤和潜在语义索引;运用属性约简算法生成分类规则,最后利用多知识库进行文档分类。通过试验比较,该方法具有较好的分类效果。  相似文献   

8.
本文介绍了一种信息抽取和自动分类的新应用,分析了传统分类方法的不足,介绍了一种基于隐含语义索引技术的文本分类改进方案。该技术是一新型的检索模型,它通过奇异值分解,或增强或消减词在文档中的语义影响力,使得文档之间的语义关系更为明晰,从而能容易地剔除掉那些语义关联弱的噪声数据,提高特征值提取精度和最后的分类准确度。  相似文献   

9.
基于概念空间的文本分类研究   总被引:3,自引:0,他引:3  
1.引言随着文本信息的快速增长,特别是Internet上在线信息的增加,文本(网页)自动分类已成为一项具有较大实用价值的关键技术,是组织和管理数据的有力手段。文本分类的方法分为两类:一是基于知识的分类方法;二是基于统计的分类方法。基于知识的文本分类系统应用于某一具体领域,需要该领域的知识库作为支撑。由于知识提取、更新、维护以及自我学习等方面存在的种种问题,使得它适用  相似文献   

10.
李旻松  段琢华 《计算机应用》2011,31(9):2429-2431
隐含语意索引(LSI)是一个能有效捕获文档中词的隐含语意特征的方法。然而,用该方法选择的特征空间对文本分类来说可能不是最适合的,因为这种方法按照词的变化排序特征,而没有考虑到分类能力。支持向量机(SVM)高度的泛化能力使它特别适用于高维数据例如文档的分类。为此提出基于支持向量机的特征提取方法用于选择适于分类的LSI特征。该方法利用SVM高度泛化的分类能力, 通过使用在每一个规则下训练的分类器的参数对第k个特征对反向平方分解面的贡献w2k的值进行估计。实验表明当需要比LSI更少的训练和测试时间时,该方法能够以更为紧凑的表示方式提高分类性能。  相似文献   

11.
In this paper, we develop a genetic algorithm method based on a latent semantic model (GAL) for text clustering. The main difficulty in the application of genetic algorithms (GAs) for document clustering is thousands or even tens of thousands of dimensions in feature space which is typical for textual data. Because the most straightforward and popular approach represents texts with the vector space model (VSM), that is, each unique term in the vocabulary represents one dimension. Latent semantic indexing (LSI) is a successful technology in information retrieval which attempts to explore the latent semantics implied by a query or a document through representing them in a dimension-reduced space. Meanwhile, LSI takes into account the effects of synonymy and polysemy, which constructs a semantic structure in textual data. GA belongs to search techniques that can efficiently evolve the optimal solution in the reduced space. We propose a variable string length genetic algorithm which has been exploited for automatically evolving the proper number of clusters as well as providing near optimal data set clustering. GA can be used in conjunction with the reduced latent semantic structure and improve clustering efficiency and accuracy. The superiority of GAL approach over conventional GA applied in VSM model is demonstrated by providing good Reuter document clustering results.  相似文献   

12.
基于概率潜在语义分析的中文信息检索   总被引:1,自引:1,他引:0       下载免费PDF全文
罗景  涂新辉 《计算机工程》2008,34(2):199-201
传统的信息检索模型把词看作孤立的单元,没有考虑自然语言中存在大量的同义词、多义词现象,对召回率和准确率有不利的影响。概率潜在语义模型使用统计的方法建立“文档-潜在语义-词”之间概率分布关系并利用这种关系进行检索。该文将概率潜在语义模型用于中文信息检索,实验结果表明,概率潜在语义模型相对于传统的向量空间模型能够显著地提高检索的平均精度。  相似文献   

13.
一种压缩域特征提取与语义图像检索技术   总被引:1,自引:0,他引:1  
为了解决“语义鸿沟”问题,通过将隐含语义索引(LSI)技术引入到图像语义提取问题的研究中,试图从图像的视觉特征中抽取出“高层概念”.基于GM(1,1)压缩域中的一种图像特征,提出了一种建立“图像视觉特征”与“语义信息”之间映射的技术方法.实验研究表明,这种基于压缩域特征和LSI技术的图像检索方法能显著改善图像检索的性能,提高图像检索的质量.  相似文献   

14.
We present the results of our work that seek to negotiate the gap between low-level features and high-level concepts in the domain of web document retrieval. This work concerns a technique, called the latent semantic indexing (LSI), which has been used for textual information retrieval for many years. In this environment, LSI determines clusters of co-occurring keywords so that a query which uses a particular keyword can then retrieve documents perhaps not containing this keyword, but containing other keywords from the same cluster. In this paper, we examine the use of this technique for content-based web document retrieval, using both keywords and image features to represent the documents. Two different approaches to image feature representation, namely, color histograms and color anglograms, are adopted and evaluated. Experimental results show that LSI, together with both textual and visual features, is able to extract the underlying semantic structure of web documents, thus helping to improve the retrieval performance significantly, even when querying is done using only keywords.  相似文献   

15.
传统潜在语义分析(Latent Semantic Analysis, LSA)方法无法获得场景目标空间分布信息和潜在主题的判别信息。针对这一问题提出了一种基于多尺度空间判别性概率潜在语义分析(Probabilistic Latent Semantic Analysis, PLSA)的场景分类方法。首先通过空间金字塔方法对图像进行空间多尺度划分获得图像空间信息,结合PLSA模型获得每个局部块的潜在语义信息;然后串接每个特定局部块中的语义信息得到图像多尺度空间潜在语义信息;最后结合提出的权值学习方法来学习不同图像主题间的判别信息,从而得到图像的多尺度空间判别性潜在语义信息,并将学习到的权值信息嵌入支持向量基(Support Vector Machine, SVM)分类器中完成图像的场景分类。在常用的三个场景图像库(Scene-13、Scene-15和Caltech-101)上的实验表明,该方法平均分类精度比现有许多state-of-art方法均优。验证了其有效性和鲁棒性。  相似文献   

16.
本文针对当前传统潜在语义索引(LSI——latent semantic indexing)技术在提供信息过滤服务时已经不能满足用户个性化需求这一实际情况,提出利用隐式反馈技术来解决如何提供给不同用户以不同信息结果这一问题。在传统的LSI技术上提出了一种基于隐式反馈的LSI个性化信息过滤方法,该方法通过引入隐式反馈技术,将其应用于信息过滤中,从而可以为不同用户提供更多更有针对性的信息结果。本文给出了该方法的公式和具体算法,为其应用的实现提供了理论基础。  相似文献   

17.
特征选择和分类算法是网页文本聚类中最关键的技术。提出对网页文本提取特征值后,利用潜在语义索引对网页文本降维,采用支持向量聚类(SVC)算法对降维后的特征向量进行聚类,以此进行文本分类。实验结果显示具有较好的效果。  相似文献   

18.
针对传统潜在语义检索模型计算成本大、检索速度慢、不利于应用在大规模农业信息检索领域的缺陷,文中提出一种针对农业主题的改进潜在语义检索模型(ALSI)。该模型先利用全文检索生成农业信息全文倒排索引库,然后利用农业高频词库和潜在语义分析生成的语义索引库,进行语义检索。通过多组实验分析确定了该模型所采用的词条权重计算方法和语义空间维数。最后,通过实验分析对比了改进后的潜在语义检索模型(ALSI)与传统潜在语义检索模型(LSI)的检索效果。结果表明,ALSI的检索效果明显好于LSI,适合应用于较大规模农业信息检索。  相似文献   

19.
Probabilistic latent semantic analysis (PLSA) is a method for computing term and document relationships from a document set. The probabilistic latent semantic index (PLSI) has been used to store PLSA information, but unfortunately the PLSI uses excessive storage space relative to a simple term frequency index, which causes lengthy query times. To overcome the storage and speed problems of PLSI, we introduce the probabilistic latent semantic thesaurus (PLST); an efficient and effective method of storing the PLSA information. We show that through methods such as document thresholding and term pruning, we are able to maintain the high precision results found using PLSA while using a very small percent (0.15%) of the storage space of PLSI.  相似文献   

20.
传统的文本分类都是根据文本的外在特征进行的,最常见的就是基于向量空间模型的方法,使用空间向量表示文本,通过相似度比较来确定分类。为了克服向量空间模型中的词条独立性假设,文章提出了一种基于潜在语义索引的文本分类模型,通过对大量的文本集进行统计分析,揭示了词语的上下文使用含义,通过奇异值分解有效地降低了向量空间的维数,消除了同义词、多义词的影响,从而提高了文本分类的精度。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号