
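Text classification algorithm based on hidden Markov model
Cite as: YANG Jian, WANG Hai-hang. Text classification algorithm based on hidden Markov model[J]. Journal of Computer Applications, 2010, 30(9): 2348-2350.
Authors: YANG Jian, WANG Hai-hang
Affiliations: 1. Dali University; 2. School of Electronics and Information Engineering, Tongji University
Funding: Science and Technology Support Program of the Science and Technology Commission of Shanghai Municipality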
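Abstract: The field of automatic text classification has produced several mature classification algorithms in recent years, but these algorithms are based mainly on probabilistic and statistical models and are not connected to the syntax and semantics of the text itself. This paper proposes an algorithm that applies the hidden Markov model (HMM), a sequence analysis model, to automatic text classification. A set of feature words representing each document category is constructed first, and the feature-word sequence of each category is used as the observation sequence of that category's HMM classifier, while the HMM's state transition sequence implicitly represents how the content of documents in that category is formed and evolves. At classification time, the class label of the HMM classifier with the highest generation probability is the result for the test document. The classifier models built by the algorithm reflect, to some extent, the syntactic and semantic features of the different document categories, support multi-category automatic text classification, and classify efficiently.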

Keywords: text classification; hidden Markov model; information gain; χ² test; term frequency-inverse document frequency
Received: 2010-03-08
Revised: 2010-04-27

Text classification algorithm based on hidden Markov model
YANG Jian, WANG Hai-hang. Text classification algorithm based on hidden Markov model[J]. Journal of Computer Applications, 2010, 30(9): 2348-2350.
Authors: YANG Jian, WANG Hai-hang
Abstract: Several mature algorithms for automatic text classification have been proposed in recent years, but they rely mainly on probabilistic and statistical models and do not relate to the syntax and semantics of the text. This paper proposes an automatic text classification algorithm based on the Hidden Markov Model (HMM). First, a set of feature words is built to distinguish the document categories. The feature-word sequence of each category is then treated as the observation sequence generated by that category's HMM classifier, and the state transition sequence of each classifier implicitly represents how documents of that category are formed and evolve. To classify a document, the label of the HMM classifier that assigns it the greatest generation probability is taken as the result. To some extent, the classification model captures syntactic and semantic features of the different document categories. The model can be applied to automatic multi-category text classification, and it has high classification efficiency.
Keywords: text classification; Hidden Markov Model (HMM); information gain; χ² test; Term Frequency-Inverse Document Frequency (TF-IDF)
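
The decision rule described in the abstract (one HMM per category, with the label of the model giving the highest generation probability as the result) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the vocabulary, the two categories, the two-state topology, and all probability values are hypothetical and hand-set, whereas real parameters would have to be estimated from training documents by a procedure the abstract does not specify. The likelihood is computed with the standard forward algorithm in log space.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | HMM) via the forward algorithm.

    Assumes every probability is strictly positive (i.e. already smoothed),
    so that math.log never receives zero.
    """
    n = len(start)
    # alpha[i] = log P(o_1..o_t, state_t = i)
    alpha = [math.log(start[i]) + math.log(emit[i][obs[0]]) for i in range(n)]
    for o in obs[1:]:
        alpha = [
            log_sum_exp([alpha[i] + math.log(trans[i][j]) for i in range(n)])
            + math.log(emit[j][o])
            for j in range(n)
        ]
    return log_sum_exp(alpha)


# Hypothetical feature-word set, mapped to observation symbol indices.
VOCAB = {"goal": 0, "match": 1, "team": 2, "stock": 3, "market": 4, "bank": 5}

# Hypothetical per-category HMMs: (start, transition, emission) probabilities.
MODELS = {
    "sports": (
        [0.6, 0.4],
        [[0.7, 0.3], [0.4, 0.6]],
        [[0.40, 0.30, 0.20, 0.04, 0.03, 0.03],
         [0.20, 0.30, 0.40, 0.03, 0.04, 0.03]],
    ),
    "finance": (
        [0.5, 0.5],
        [[0.6, 0.4], [0.3, 0.7]],
        [[0.03, 0.03, 0.04, 0.40, 0.30, 0.20],
         [0.04, 0.03, 0.03, 0.20, 0.30, 0.40]],
    ),
}


def classify(tokens, models=MODELS, vocab=VOCAB):
    """Return the label whose HMM gives the feature-word sequence the highest likelihood."""
    # Keep only feature words, preserving their order in the document.
    obs = [vocab[w] for w in tokens if w in vocab]
    if not obs:
        raise ValueError("document contains no feature words")
    return max(models, key=lambda label: forward_log_likelihood(obs, *models[label]))


print(classify("the team scored a goal in the match".split()))  # sports
print(classify("bank stock market".split()))                    # finance
```

Working in log space avoids numerical underflow on long feature-word sequences, and because each category has its own model, adding a category only means adding one more entry to `MODELS`.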
This article is indexed in Wanfang Data and other databases.