Research on Text Categorization Based on the LDA Model (基于LDA模型的文本分类研究)
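Cite this article: 姚全珠, 宋志理, 彭程. 基于LDA模型的文本分类研究[J]. 计算机工程与应用, 2011, 47(13): 150-153.
Authors: 姚全珠 (YAO Quanzhu)  宋志理 (SONG Zhili)  彭程 (PENG Cheng)
Affiliation: School of Computer Science and Engineering, Xi'an University of Technology (西安理工大学), Xi'an 710048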
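Abstract: Traditional dimensionality-reduction algorithms have limitations when applied to high-dimensional, large-scale text classification. To address them, this paper proposes a text classification algorithm based on the LDA model. Within the framework of the discriminative SVM model, the LDA generative probabilistic model is used to build a topic model of the document collection, and an SVM is trained on the resulting latent topic-document matrix to construct the text classifier. Parameters are inferred by Gibbs sampling, so that each text is represented as a probability distribution over a fixed set of latent topics. A standard method from Bayesian statistics is used to determine the optimal number of topics T. Classification experiments on a corpus show that the method classifies better than VSM text representation combined with SVM and than LSI combined with SVM.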

Keywords: text categorization  Latent Dirichlet Allocation (LDA) model  Gibbs sampling  Bayesian statistical theory
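
The abstract names Gibbs sampling and a standard Bayesian criterion for the number of topics T but does not give formulas. As a hedged illustration only, the widely used collapsed Gibbs sampler for LDA and a harmonic-mean estimate of the evidence P(w | T) are written out below. Whether the paper uses exactly these forms is an assumption; the symbols (alpha and beta for the symmetric Dirichlet hyperparameters, W for the vocabulary size, M for the number of retained Gibbs samples, and the count variables n) are introduced here and do not come from the abstract.

```latex
% Collapsed Gibbs update for the topic z_i of token i (word w_i in document d_i);
% the subscript -i means the counts exclude token i itself.
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
  \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}
  \cdot
  \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}

% Document-topic estimate read off the counts: one row of the topic-document matrix.
\hat{\theta}^{(d)}_{j} = \frac{n^{(d)}_{j} + \alpha}{n^{(d)}_{\cdot} + T\alpha}

% One common way to choose T: maximize \log P(\mathbf{w} \mid T), approximated by the
% harmonic mean of the likelihoods of M Gibbs samples \mathbf{z}^{(1)}, \dots, \mathbf{z}^{(M)}.
P(\mathbf{w} \mid T) \;\approx\;
  \left( \frac{1}{M} \sum_{m=1}^{M} \frac{1}{P(\mathbf{w} \mid \mathbf{z}^{(m)}, T)} \right)^{-1}
```

Under this reading, T would be chosen by running the sampler for several candidate values and keeping the one with the largest estimated log P(w | T).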

Research on text categorization based on LDA
YAO Quanzhu, SONG Zhili, PENG Cheng. Research on text categorization based on LDA[J]. Computer Engineering and Applications, 2011, 47(13): 150-153.
Authors:YAO Quanzhu  SONG Zhili  PENG Cheng
Affiliation: School of Computer Science & Engineering, Xi’an University of Technology, Xi’an 710048, China
Abstract: Traditional dimensionality-reduction algorithms show their limitations when text corpora are high-dimensional and large-scale. This paper presents a Chinese text categorization algorithm based on LDA. Within the discriminative framework of the Support Vector Machine (SVM), Latent Dirichlet Allocation (LDA) provides a generative probabilistic model of the text corpus that reduces each document to a fixed set of real-valued features: its probability distribution over a set of latent topics. Gibbs sampling is used for parameter estimation. Modeling the corpus yields a latent topic-document matrix on which the SVM is trained. A standard method from Bayesian statistics is used to select the best number of topics. Experiments compare the proposed method with two baselines, Vector Space Model (VSM) text representation combined with SVM and Latent Semantic Indexing (LSI) combined with SVM; the results show that the proposed method is practicable and effective for text categorization.
Keywords:text categorization  Latent Dirichlet Allocation (LDA)  Gibbs sampling  Bayes statistics theory
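
The feature-extraction step described in the abstract (Gibbs sampling that turns each document into a distribution over latent topics) can be sketched as follows. This is a minimal, standard-library-only illustration of collapsed Gibbs sampling for LDA, not the authors' implementation: the function name `gibbs_lda`, the hyperparameter defaults, the iteration count, and the toy corpus are all assumptions made for the example.

```python
import random

def gibbs_lda(docs, vocab_size, num_topics, alpha=0.5, beta=0.1, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs is a list of documents, each a list of integer word ids.
    Returns theta, where theta[d][k] estimates P(topic k | document d).
    """
    rng = random.Random(seed)
    # Count tables: document-topic, topic-word, and tokens per topic.
    n_dt = [[0] * num_topics for _ in docs]
    n_tw = [[0] * vocab_size for _ in range(num_topics)]
    n_t = [0] * num_topics

    # Start from a random topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(num_topics)
            z_d.append(t)
            n_dt[d][t] += 1
            n_tw[t][w] += 1
            n_t[t] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                t = z[d][i]
                n_dt[d][t] -= 1
                n_tw[t][w] -= 1
                n_t[t] -= 1
                # Unnormalized conditional for each topic. The document-side
                # denominator is the same for every topic, so it is omitted.
                weights = [
                    (n_tw[k][w] + beta) / (n_t[k] + vocab_size * beta)
                    * (n_dt[d][k] + alpha)
                    for k in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = t
                n_dt[d][t] += 1
                n_tw[t][w] += 1
                n_t[t] += 1

    # Smoothed document-topic proportions: one row per document.
    return [
        [(n_dt[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]

# Toy corpus (hypothetical): two documents over words 0-2, two over words 3-5.
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3, 4], [4, 5, 3, 5]]
theta = gibbs_lda(docs, vocab_size=6, num_topics=2)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` is a fixed-length feature vector whose length equals the number of topics, regardless of vocabulary size. In a pipeline like the one the abstract describes, these rows and the document labels would then be handed to an SVM trainer (for example scikit-learn's `SVC`, which is not used here to keep the sketch dependency-free).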
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.