
Improvement of Algorithm for Weight of Characteristic Item in Text Classification
(Chinese title: 用于文本分类的特征项权重算法改进)
Cite this article: GONG Jing, HU Ping-xia, HU Can. Improvement of Algorithm for Weight of Characteristic Item in Text Classification [J]. Microcomputer Development (微机发展), 2014(9): 128-132.
Authors: GONG Jing (龚静), HU Ping-xia (胡平霞), HU Can (胡灿)
Affiliation: Department of Information Technology, Hunan Environment and Biological Polytechnic, Hengyang 421005, Hunan, China
Funding: Hunan Provincial Education Science and Technology Plan Project (07D036); Joint Funding Project of the Hunan Provincial Department of Education and Department of Finance (12C1056)
Abstract: TF-IDF is a commonly used term-weighting method in text classification. However, it considers only how often a feature term occurs in a document and how frequently that term appears in the training set; it accounts for neither the distribution of feature terms across classes nor their semantic information. To address these shortcomings, an improved TF-IDF algorithm is proposed that considers both the distribution of a feature term within a class and semantic factors such as the term's position and length, so that it better reflects the term's importance. Its effectiveness is verified with a Naive Bayes classifier. The experimental results show that the proposed algorithm outperforms TF-IDF and improves the accuracy of text classification.

Keywords: text classification; feature item; weight; improvement
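
For orientation, the baseline weighting that the abstract criticizes can be sketched as follows. This is a minimal illustration of the textbook TF-IDF formula (relative term frequency times the logarithm of inverse document frequency), not the authors' improved algorithm, whose formula is not given in this record. The function name `tf_idf` and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Textbook TF-IDF only: weight = (count / doc length) * log(N / df).
    Hypothetical helper for illustration, not the paper's method.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Toy corpus (made up for the example).
docs = [
    ["stock", "market", "stock"],
    ["market", "football"],
    ["football", "match", "football"],
]
w = tf_idf(docs)
```

Note that `tf_idf` takes no class labels and no positional information: the weight depends only on counts within a document and across the training set, which is the limitation the abstract points out.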
This article is indexed in VIP (维普) and other databases.