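Knowledge Gain: A New Feature Selection Method in Text Categorization
Cite this article: XU Yan, WANG Bin, LI Jin-tao, SUN Chun-ming. Knowledge Gain: A New Feature Selection Method in Text Categorization[J]. Journal of Chinese Information Processing, 2008, 22(1): 44-50
Authors: XU Yan  WANG Bin  LI Jin-tao  SUN Chun-ming
Affiliation: Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
Funding: National 973 Program of China (2004CB318109); National Natural Science Foundation of China (60473002, 60603094); Beijing Natural Science Foundation (4051004)
Abstract: Feature selection plays an important role in text categorization. Feature selection methods such as document frequency (DF), information gain (IG), and mutual information (MI) are widely used in text categorization. Existing experimental results show that IG, which is based on Shannon's information theory, is one of the most effective feature selection algorithms. Based on rough set theory, this paper proposes a new feature selection method (the KG algorithm). Following the rough set view that knowledge is the ability to classify objects, the method quantifies knowledge, introduces the concept of knowledge gain, and derives a feature selection method based on it. Classification experiments on two widely used corpora, OHSUMED and NewsGroup, show that the KG algorithm outperforms IG on both, most noticeably when the feature space is reduced to low dimensionality, indicating that the KG algorithm performs well.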

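Keywords: computer application  Chinese information processing  text categorization  feature selection  rough set  information retrieval
Article ID: 1003-0077(2008)01-0044-07
Received: 2007-05-29
Revised: 2007-12-02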

Knowledge Gain: A New Feature Selection Method in Text Categorization
XU Yan, WANG Bin, LI Jin-tao, SUN Chun-ming. Knowledge Gain: A New Feature Selection Method in Text Categorization[J]. Journal of Chinese Information Processing, 2008, 22(1): 44-50
Authors: XU Yan  WANG Bin  LI Jin-tao  SUN Chun-ming
Affiliation: Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
Abstract: Feature selection (FS) plays an important role in text categorization (TC). Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are commonly applied in text categorization. Existing experiments show that IG is one of the most effective methods. In this paper, a feature selection method is proposed based on rough set theory. According to rough set theory, knowledge about a universe of objects may be defined as classifications based on certain properties of the objects; that is, rough set theory assumes that knowledge is the ability to partition objects. We quantify this ability to classify objects and call the resulting amount the knowledge quantity. Building on this quantification, we introduce the notion of "knowledge gain" and propose a knowledge gain-based feature selection method (the KG method). Experiments on the NewsGroup and OHSUMED collections show that KG performs better than IG, especially under extremely aggressive feature reduction.
Keywords: computer application  Chinese information processing  feature selection  text categorization  rough set  information retrieval
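
For orientation, the IG baseline named in the abstract can be sketched as follows. This is a minimal illustration of the standard information gain criterion for binary term presence, not the paper's KG method (the abstract does not give the knowledge quantity formula, so it is not reproduced here). The function names, the set-of-terms document representation, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)].

    docs is a list of term sets (hypothetical representation);
    labels is the parallel list of class labels.
    """
    with_term = [c for d, c in zip(docs, labels) if term in d]
    without_term = [c for d, c in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


def select_features(docs, labels, k):
    """Keep the k terms with the highest IG (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Toy corpus (invented for illustration only).
docs = [
    {"heart", "patient"},
    {"heart", "surgery"},
    {"game", "team"},
    {"team", "patient"},
]
labels = ["med", "med", "sport", "sport"]

top_terms = select_features(docs, labels, 2)
```

In this toy corpus, "heart" and "team" each split the two classes perfectly and so rank first, while "patient" appears once in each class and carries no information. Reducing `k` corresponds to the aggressive feature reduction setting the abstract refers to.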
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.