
1.

Text classification using genetic algorithm oriented latent semantic features

《Expert systems with applications》2014,41(13):5938-5947

In this paper, genetic algorithm oriented latent semantic features (GALSF) are proposed to obtain a better representation of documents in text classification. The proposed approach consists of feature selection and feature transformation stages. The first stage is carried out using state-of-the-art filter-based methods. The second stage employs latent semantic indexing (LSI) empowered by a genetic algorithm, so that a better projection is attained using appropriate singular vectors, which are not limited to the ones corresponding to the largest singular values, unlike the standard LSI approach. In this way, singular vectors with small singular values may also be used for projection, whereas vectors with large singular values may be eliminated, to obtain better discrimination. Experimental results demonstrate that GALSF outperforms both LSI and filter-based feature selection methods on benchmark datasets for various feature dimensions.
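
The core idea of projecting onto singular vectors other than the leading ones can be sketched with NumPy. This is only an illustration under assumptions: the toy matrix `X`, the helper name `lsi_project`, and the boolean mask standing in for one genetic-algorithm individual are all hypothetical, and the search itself (fitness, crossover, mutation) is not shown.

```python
import numpy as np

def lsi_project(X, selected):
    """Project documents (rows of X) onto the chosen right singular vectors."""
    # Thin SVD of the documents-by-terms matrix: X = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[selected]            # shape: (number selected, number of terms)
    return X @ basis.T              # shape: (number of documents, number selected)

# Hypothetical toy data: 4 documents described by 5 term weights
X = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
])

# Standard LSI keeps the singular vectors with the largest singular values.
standard = lsi_project(X, [0, 1])

# A GA individual could instead be a bit mask over all singular vectors
# (assumed encoding); here it keeps the 2nd and 4th vectors.
mask = np.array([False, True, False, True])
alternative = lsi_project(X, np.flatnonzero(mask))
```

Both projections have the same dimensionality; a GA would compare such masks by the classification accuracy they lead to.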

2.

Research on text classification based on LSI and SVM

Liu Meiru (刘美茹)《计算机工程》2007,33(15):217-219

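Text classification is the foundation and core of text data mining, and a concrete application of natural language processing and machine learning algorithms. Feature selection and the classification algorithm are the two most critical techniques in text classification. This paper proposes using latent semantic indexing (LSI) for feature extraction and dimensionality reduction, combined with the support vector machine (SVM) algorithm for multi-class classification. Experimental results show better performance than both the vector space model (VSM) combined with SVM and LSI combined with K-nearest neighbor (KNN); the method reaches practical usability when the number of text categories is small and the categories are clearly separated.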

3.

Text classification: A least square support vector machine approach

《Applied Soft Computing》2007,7(3):908-914

This paper presents a least square support vector machine (LS-SVM) that performs text classification of noisy document titles according to different predetermined categories. The system's potential is demonstrated with a corpus of 91,229 words from University of Denver's Penrose Library catalogue. The classification accuracy of the proposed LS-SVM based system is found to be over 99.9%. The final classifier is an LS-SVM array with a Gaussian radial basis function (GRBF) kernel, which uses the coefficients generated by the latent semantic indexing algorithm for classification of the text titles. These coefficients are also used to generate the confidence factors for the inference engine that presents the final decision of the entire classifier. The system is also compared with K-nearest neighbor (KNN) and Naïve Bayes (NB) classifiers, and the comparison clearly shows that the proposed LS-SVM based architecture outperforms the KNN and NB based systems. The comparison between the conventional linear SVM based classifiers and neural network based classifying agents shows that the LS-SVM with LSI based classifying agents improves text categorization performance significantly and holds a lot of potential for developing robust learning based agents for text classification.
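
As a rough illustration of what a single binary LS-SVM with a Gaussian RBF kernel involves, the sketch below trains one by solving the usual LS-SVM linear system instead of a quadratic program. It is not the system described above: the 2-D points merely stand in for LSI coefficients, and `gamma`, `sigma`, and all function names are assumptions chosen for the example.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Squared Euclidean distance between every row of A and every row of B
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]."""
    n = len(y)
    omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]          # bias b, multipliers alpha

def lssvm_predict(X_train, y, b, alpha, X_new, sigma=1.0):
    # Decision function: sign(sum_i alpha_i * y_i * K(x, x_i) + b)
    K = rbf_kernel(X_new, X_train, sigma)
    return np.sign(K @ (alpha * y) + b)

# Hypothetical 2-D features (stand-ins for LSI coefficients), labels in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [5, 4]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1], dtype=float)

b, alpha = lssvm_train(X, y)
pred = lssvm_predict(X, y, b, alpha, np.array([[0.5, 0.5], [4.5, 4.5]]))
```

A multi-class "array" of such classifiers could be built one-vs-rest, though the abstract does not say how the array is organized.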

4.

Research on hybrid text classification based on neural networks and Bayes

Chen Shili (陈世立), Gao Yejun (高野军)《电脑开发与应用》2006,19(12):27-29,32

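Texts are represented with the vector space model (VSM), and latent semantic indexing (LSI) is used for feature reconstruction and dimensionality reduction to build a BP neural network text classifier. Bayesian classification is then combined with this classifier to construct a hybrid text classifier. Experimental results show that the hybrid classifier improves both classification accuracy and classification speed.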

5.

Document clustering using locality preserving indexing (Total citations: 7; self-citations: 0; citations by others: 7)

Cai D., He X., Han J. 《Knowledge and Data Engineering, IEEE Transactions on》2005,17(12):1624-1637

We propose a novel document clustering method which aims to cluster the documents into different semantic classes. The document space is generally of high dimensionality and clustering in such a high dimensional space is often infeasible due to the curse of dimensionality. By using locality preserving indexing (LPI), the documents can be projected into a lower-dimensional semantic space in which the documents related to the same semantics are close to each other. Different from previous document clustering methods based on latent semantic indexing (LSI) or nonnegative matrix factorization (NMF), our method tries to discover both the geometric and discriminating structures of the document space. Theoretical analysis of our method shows that LPI is an unsupervised approximation of the supervised linear discriminant analysis (LDA) method, which gives the intuitive motivation of our method. Extensive experimental evaluations are performed on the Reuters-21578 and TDT2 data sets.

6.

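Chinese automatic summarization based on latent semantic indexing and sentence clustering (Total citations: 2; self-citations: 0; citations by others: 2)

Chen Ge (陈戈), Duan Jianyong (段建勇), Lu Ruzhan (陆汝占)《计算机仿真》2008,25(7)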

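Automatic summarization is an important research topic in natural language processing. This paper proposes a Chinese automatic summarization method based on latent semantic indexing and sentence clustering. Its distinguishing features are that it uses latent semantic indexing to compute sentence similarity and combines hierarchical clustering with K-medoids clustering to cluster sentences. This improves the accuracy of sentence similarity computation and topic partitioning, which helps the generated summary cover the document's topics comprehensively while reducing its own redundancy. Experimental results confirm the effectiveness of the proposed method: compared with traditional clustering-based summarization methods, the quality of the generated summaries improves significantly.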

7.

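Web document classification based on Rough set latent semantic indexing (Total citations: 5; self-citations: 0; citations by others: 5)

He Ming (何明), Feng Boqin (冯博琴), Fu Xianghua (傅向华)《计算机工程》2004,30(13):3-5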

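Rough set theory is a mathematical tool for handling uncertain or vague knowledge. This paper proposes a Web document classification method that uses latent semantic indexing based on Rough set theory. Web document information is first represented with the vector space model; singular value decomposition of the matrix is then used for information filtering and latent semantic indexing; an attribute reduction algorithm generates the classification rules; and finally documents are classified using multiple knowledge bases. Comparative experiments show that the method achieves good classification results.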

8.

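Application of latent semantic indexing to the classification of supply and demand information

Zhu Xuehao (朱学昊), Wang Rujing (王儒敬)《计算机工程与应用》2007,43(14):192-194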

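This paper introduces a new application of information extraction and automatic classification, analyzes the shortcomings of traditional classification methods, and presents an improved text classification scheme based on latent semantic indexing. The technique is a new type of retrieval model: through singular value decomposition it strengthens or weakens the semantic influence of terms in documents, making the semantic relationships among documents clearer, so that noisy data with weak semantic association can be removed easily, improving the precision of feature extraction and the final classification accuracy.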

9.

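Research on text classification based on concept space (Total citations: 3; self-citations: 0; citations by others: 3)

Huang Haiying (黄海英)《计算机科学》2003,30(3):46-49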

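1. Introduction. With the rapid growth of textual information, especially online information on the Internet, automatic text (web page) classification has become a key technology of considerable practical value and a powerful means of organizing and managing data. Text classification methods fall into two categories: knowledge-based methods and statistics-based methods. A knowledge-based text classification system is applied to a specific domain and needs that domain's knowledge base as support. Because of the many problems in knowledge extraction, updating, maintenance, and self-learning, its applicability ...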

10.

A latent semantic feature selection method based on support vector machines

Li Minsong (李旻松), Duan Zhuohua (段琢华)《计算机应用》2011,31(9):2429-2431

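Latent semantic indexing (LSI) is a method that effectively captures the latent semantic features of the terms in documents. However, the feature space it selects may not be the most suitable for text classification, because it orders features by the variance of terms without considering classification ability. The strong generalization ability of the support vector machine (SVM) makes it particularly suitable for classifying high-dimensional data such as documents. A feature extraction method based on SVM is therefore proposed for selecting the LSI features suited to classification. The method exploits the strong generalization ability of SVM, estimating the contribution $w_k^2$ of the k-th feature to the inverse squared margin from the parameters of the classifiers trained under each rule. Experiments show that the method improves classification performance with a more compact representation while requiring less training and testing time than LSI.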
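
The ranking step can be illustrated in a few lines, under stated assumptions: the weight matrix `W` below is made up and merely stands in for the weight vectors of trained linear classifiers over LSI features, and summing the squared weights across classifiers is a choice made for this sketch, not something the abstract specifies.

```python
import numpy as np

def select_features(W, m):
    """Return the indices of the m features with the largest summed w_k^2."""
    # For a linear SVM the inverse squared margin is proportional to sum_k w_k^2,
    # so w_k^2 measures how much feature k contributes to it.
    scores = (W ** 2).sum(axis=0)
    return np.argsort(-scores)[:m]

# Hypothetical weights: 2 classifiers (rows) over 3 LSI features (columns)
W = np.array([
    [0.1, -2.0,  0.5],
    [0.3,  1.0, -0.2],
])

kept = select_features(W, 2)
```

Features that LSI would rank highly by singular value can end up discarded here if no classifier gives them a large weight.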

11.

Genetic algorithm for text clustering based on latent semantic indexing

Wei Song, Soon Cheol Park 《Computers & Mathematics with Applications》2009,57(11-12):1901

In this paper, we develop a genetic algorithm method based on a latent semantic model (GAL) for text clustering. The main difficulty in applying genetic algorithms (GAs) to document clustering is the thousands or even tens of thousands of dimensions in the feature space that are typical for textual data. This is because the most straightforward and popular approach represents texts with the vector space model (VSM), in which each unique term in the vocabulary represents one dimension. Latent semantic indexing (LSI) is a successful technology in information retrieval which attempts to explore the latent semantics implied by a query or a document by representing them in a dimension-reduced space. Meanwhile, LSI takes into account the effects of synonymy and polysemy, which constructs a semantic structure in textual data. GA belongs to the search techniques that can efficiently evolve the optimal solution in the reduced space. We propose a variable string length genetic algorithm which has been exploited for automatically evolving the proper number of clusters as well as providing near optimal data set clustering. GA can be used in conjunction with the reduced latent semantic structure to improve clustering efficiency and accuracy. The superiority of the GAL approach over a conventional GA applied in the VSM model is demonstrated by good Reuters document clustering results.

12.

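Chinese information retrieval based on probabilistic latent semantic analysis (Total citations: 1; self-citations: 1; citations by others: 0)

Luo Jing (罗景), Tu Xinhui (涂新辉)《计算机工程》2008,34(2):199-201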

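Traditional information retrieval models treat words as isolated units and do not account for the large number of synonyms and polysemous words in natural language, which hurts recall and precision. The probabilistic latent semantic model uses statistical methods to establish the probability distributions relating documents, latent semantics, and words, and uses these relations for retrieval. This paper applies the probabilistic latent semantic model to Chinese information retrieval; experimental results show that, compared with the traditional vector space model, it significantly improves mean retrieval precision.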
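
The document, latent-topic, and word relationship is typically fitted with expectation-maximization. The sketch below is a generic PLSA fit, not the paper's implementation: the count matrix `N`, the number of topics, the iteration count, and the function name `plsa` are all assumptions for illustration.

```python
import numpy as np

def plsa(N, n_topics, n_iter=50, seed=0):
    """Fit P(z|d) and P(w|z) to a document-by-word count matrix N with EM."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = N.shape
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]     # (docs, topics, words)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both distributions from expected counts
        weighted = N[:, None, :] * post                   # (docs, topics, words)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

# Hypothetical counts: 4 documents, 6 words, two loosely separated themes
N = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 1, 2, 3],
], dtype=float)

p_z_d, p_w_z = plsa(N, n_topics=2)
```

For retrieval, a query can then be matched against documents through the topic mixtures `p_z_d` instead of raw term overlap.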

13.

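A compressed-domain feature extraction and semantic image retrieval technique (Total citations: 1; self-citations: 0; citations by others: 1)

Cao Kui (曹奎), Feng Yucai (冯玉才)《小型微型计算机系统》2005,26(1):151-155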

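To address the "semantic gap" problem, latent semantic indexing (LSI) is introduced into the study of image semantic extraction in an attempt to extract "high-level concepts" from the visual features of images. Based on an image feature in the GM(1,1) compressed domain, a technique is proposed for establishing a mapping between "image visual features" and "semantic information". Experiments show that this image retrieval method based on compressed-domain features and LSI significantly improves image retrieval performance and quality.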

14.

Narrowing the semantic gap - improved text-based web document retrieval using visual features (Total citations: 2; self-citations: 0; citations by others: 2)

Rong Zhao, Grosky W.I. 《Multimedia, IEEE Transactions on》2002,4(2):189-200

We present the results of our work that seeks to negotiate the gap between low-level features and high-level concepts in the domain of web document retrieval. This work concerns a technique, called latent semantic indexing (LSI), which has been used for textual information retrieval for many years. In this environment, LSI determines clusters of co-occurring keywords so that a query which uses a particular keyword can then retrieve documents perhaps not containing this keyword, but containing other keywords from the same cluster. In this paper, we examine the use of this technique for content-based web document retrieval, using both keywords and image features to represent the documents. Two different approaches to image feature representation, namely, color histograms and color anglograms, are adopted and evaluated. Experimental results show that LSI, together with both textual and visual features, is able to extract the underlying semantic structure of web documents, thus helping to improve the retrieval performance significantly, even when querying is done using only keywords.
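
The keyword-cluster effect described above can be reproduced on a toy term-by-document matrix. The sketch uses text terms only (no image features), and the vocabulary, the matrix `A`, and the helper `lsi_scores` are hypothetical. The query is folded into the reduced space with the common formula q_hat = q^T U_k S_k^{-1} and compared to documents by cosine similarity.

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal"]
# Rows are terms, columns are documents (hypothetical counts)
A = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 1],   # petal
], dtype=float)

def lsi_scores(A, q, k):
    """Cosine similarity between a folded-in query and each document in rank-k LSI space."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k].T      # Vk rows are document coordinates
    q_hat = (q @ Uk) / sk                        # fold the query into the latent space
    num = Vk @ q_hat
    den = np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_hat)
    return num / den

# Query containing only the keyword "car"
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0
scores = lsi_scores(A, q, k=2)
```

In this toy case the second document never mentions "car", yet it scores as high as the first because both share "engine", while the two "flower" documents score near zero.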

15.

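Scene classification using multi-scale spatial discriminative probabilistic latent semantic analysis

Ji Haifeng (季海峰), Gao Jun (高隽), Zheng Peng (郑鹏), Wang Jing (王婧)《中国图象图形学报》2014,19(1):109-118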

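The traditional latent semantic analysis (LSA) method cannot capture the spatial distribution of scene objects or the discriminative information of latent topics. To address this, a scene classification method based on multi-scale spatial discriminative probabilistic latent semantic analysis (PLSA) is proposed. First, the spatial pyramid method partitions the image at multiple spatial scales to obtain spatial information, and the PLSA model is used to obtain the latent semantic information of each local block. Next, the semantic information of the local blocks is concatenated to obtain the image's multi-scale spatial latent semantic information. Finally, the proposed weight learning method learns the discriminative information between image topics, yielding the multi-scale spatial discriminative latent semantic information of the image, and the learned weights are embedded into a support vector machine (SVM) classifier to perform scene classification. Experiments on three commonly used scene image datasets (Scene-13, Scene-15, and Caltech-101) show that the method's average classification accuracy exceeds that of many existing state-of-the-art methods, confirming its effectiveness and robustness.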

16.

Research on an LSI personalized information filtering method based on implicit feedback

ZHANG Hong, XU Qun-yi, SU Chen 《数字社区&智能家居》2008,(12)

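Traditional latent semantic indexing (LSI) can no longer meet users' personalized needs when providing information filtering services. To address this, the paper proposes using implicit feedback to solve the problem of delivering different results to different users. Building on traditional LSI, it proposes an LSI personalized information filtering method based on implicit feedback; by introducing implicit feedback and applying it to information filtering, the method can provide different users with more, and better targeted, results. The paper gives the formulas and the concrete algorithm of the method, providing a theoretical basis for implementing its application.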

17.

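A web page text classification algorithm combining LSI and support vector clustering

Shi Changqiong (史长琼), Huang Hui (黄辉), Wang Dawei (王大卫), Jiang Lalin (姜腊林), Fu Zongwen (扶宗文)《计算机应用研究》2009,26(12):4523-4525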

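Feature selection and the classification algorithm are the most critical techniques in web page text clustering. After feature values are extracted from web page text, latent semantic indexing is used to reduce the dimensionality of the text, and the support vector clustering (SVC) algorithm clusters the reduced feature vectors, which serves as the basis for text classification. Experimental results show that the approach performs well.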

18.

Research on an improved latent semantic retrieval model

Chen Yanhong (陈燕红); Liu Fenghua (刘风华)《微机发展》2014,(9):120-124

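Traditional latent semantic retrieval models have high computational cost and slow retrieval speed, which makes them ill-suited to large-scale agricultural information retrieval. To address these shortcomings, this paper proposes an improved latent semantic retrieval model for agricultural topics (ALSI). The model first uses full-text retrieval to build a full-text inverted index of agricultural information, and then performs semantic retrieval using a lexicon of high-frequency agricultural terms together with a semantic index generated by latent semantic analysis. Several sets of experiments determine the term weighting method and the dimensionality of the semantic space used by the model. Finally, experiments compare the retrieval performance of ALSI with that of the traditional latent semantic retrieval model (LSI). The results show that ALSI retrieves markedly better than LSI and is suitable for larger-scale agricultural information retrieval.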

19.

Efficient storage and retrieval of probabilistic latent semantic information for information retrieval

Laurence A. F. Park, Kotagiri Ramamohanarao 《The VLDB Journal The International Journal on Very Large Data Bases》2009,18(1):141-155

Probabilistic latent semantic analysis (PLSA) is a method for computing term and document relationships from a document set. The probabilistic latent semantic index (PLSI) has been used to store PLSA information, but unfortunately the PLSI uses excessive storage space relative to a simple term frequency index, which causes lengthy query times. To overcome the storage and speed problems of PLSI, we introduce the probabilistic latent semantic thesaurus (PLST), an efficient and effective method of storing the PLSA information. We show that through methods such as document thresholding and term pruning, we are able to maintain the high precision results found using PLSA while using a very small percentage (0.15%) of the storage space of PLSI.

20.

Application of latent semantic indexing in text classification

Wu Jianjun (伍建军), Kang Yaohong (康耀红)《电脑与信息技术》2006,14(5):32-34,38

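Traditional text classification relies on the surface features of text. The most common approach is based on the vector space model, which represents texts as vectors and determines the class by similarity comparison. To overcome the term-independence assumption of the vector space model, this article proposes a text classification model based on latent semantic indexing. Statistical analysis of a large text collection reveals the contextual usage meaning of words, and singular value decomposition effectively reduces the dimensionality of the vector space and removes the effects of synonyms and polysemous words, thereby improving classification accuracy.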