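Improved Algorithm of Text Feature Weighting Based on Information Gain (基于信息增益的文本特征权重改进算法)
Cite as:LI Kai-qi, DIAO Xing-chun, CAO Jian-jun. Improved Algorithm of Text Feature Weighting Based on Information Gain[J]. Computer Engineering (计算机工程), 2011, 37(1): 16-18, 21.
Authors:LI Kai-qi (李凯齐)  DIAO Xing-chun (刁兴春)  CAO Jian-jun (曹建军)
Affiliation:1. Institute of Command Automation, PLA University of Science and Technology, Nanjing 210007; The 63rd Research Institute, PLA General Staff Headquarters, Nanjing 210007
2. The 63rd Research Institute, PLA General Staff Headquarters, Nanjing 210007
Funding:China Postdoctoral Science Foundation (20090461425); Jiangsu Province Postdoctoral Research Program Foundation (0901014B)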
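Abstract:The idf function in the traditional tf.idf algorithm evaluates a feature's ability to discriminate between documents only at a macroscopic level. It cannot reflect how differences in a feature's distribution across the individual documents and across the classes of the training set affect the computed feature weight, which reduces the accuracy of text representation. To address this problem, an improved feature weighting method, tf.igt.igC, is proposed. Starting from an examination of feature distribution, the method introduces the concept of information gain from information theory to take both of these distribution dimensions into account together, overcoming the shortcomings of the traditional formula. Experimental results show that, compared with the two feature weighting methods tf.idf.ig and tf.idf.igc, tf.igt.igC is more effective in computing feature weights.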

Keywords:feature distribution  feature weighting  text classification

Improved Algorithm of Text Feature Weighting Based on Information Gain
LI Kai-qi,DIAO Xing-chun,CAO Jian-jun.Improved Algorithm of Text Feature Weighting Based on Information Gain[J].Computer Engineering,2011,37(1):16-18,21.
Authors:LI Kai-qi  DIAO Xing-chun  CAO Jian-jun
Affiliation:(1. Institute of Command Automation, PLA Univ. of Sci. & Tech., Nanjing 210007, China; 2. The 63rd Research Institute, PLA General Staff Headquarters, Nanjing 210007, China)
Abstract:The idf function of the traditional tf.idf algorithm can evaluate the ability of features to discriminate between documents only at a macroscopic level. It cannot reflect the differences in the distribution proportions of features across each document and each class of the training set, which reduces the accuracy of text representation. To solve this problem, this paper proposes an improved feature weighting method called tf.igt.igC. Starting from an analysis of feature distribution, the method introduces the concept of information gain from information theory to consider the two specific dimensions of feature distribution jointly, and so overcomes the shortcomings of the traditional formula. Experimental results on two open-source corpora show that, compared with two other feature weighting methods, tf.igt.igC is more effective at calculating feature weights.
Keywords:feature distribution  feature weighting  text classification
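
The limitation described in the abstract can be illustrated with a small sketch. The abstract does not give the tf.igt.igC formula, so the code below does not implement it. It only contrasts the classic idf, log(N/df), with the textbook information gain of a term's presence or absence with respect to the class label. The toy corpus and the function names `idf`, `entropy`, and `information_gain` are assumptions made for illustration.

```python
import math
from collections import Counter


def idf(term, docs):
    """Classic inverse document frequency: log(N / df)."""
    df = sum(1 for tokens, _ in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(term, docs):
    """Information gain of the term's presence/absence w.r.t. the class."""
    labels = [label for _, label in docs]
    present = [label for tokens, label in docs if term in tokens]
    absent = [label for tokens, label in docs if term not in tokens]
    n = len(docs)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


# Hypothetical toy corpus: (set of tokens, class label).
docs = [
    ({"goal", "match"}, "sports"),
    ({"goal", "team"}, "sports"),
    ({"match", "market"}, "finance"),
    ({"stock", "market"}, "finance"),
]

for term in ("goal", "match"):
    print(term, idf(term, docs), information_gain(term, docs))
```

In this toy corpus "goal" and "match" each occur in two of the four documents, so their idf values are identical. "goal" occurs only in the sports class and has an information gain of 1 bit, while "match" is split evenly between the two classes and has an information gain of 0. This is the kind of class-distribution difference that idf alone cannot see.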
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.