Automatic text categorization feature selection methods research
Cite this article: ZHANG Hai-long, WANG Lian-zhi. Automatic text categorization feature selection methods research[J]. Computer Engineering and Design, 2006, 27(20): 3840-3841.
Authors: ZHANG Hai-long, WANG Lian-zhi
Affiliation: College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
Article number: 1000-7024(2006)20-3838-04
Received: 2005-08-19
Revised: 2005-08-19

Abstract: Text categorization is the task of assigning large numbers of text documents to one or more categories according to their content. Text representation is one of the core techniques of text categorization, and feature selection is in turn a key part of text representation that strongly affects classification performance. Text feature selection is the process of identifying and removing as much redundant information as possible in order to improve the quality of the training data set. This paper introduces, analyzes, and compares feature selection methods for text categorization, including information gain, mutual information, the χ² statistic, document frequency, low-loss dimensionality reduction, and relative frequency difference.

Keywords: text categorization; feature selection; information gain; mutual information; χ² statistic; document frequency; low-loss dimensionality reduction; relative frequency difference
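
As a rough illustration of two of the methods named in the abstract, the sketch below scores terms by document frequency and by the χ² statistic. The abstract does not give the paper's own formulas, so this assumes the common textbook definition of χ² over a 2x2 term/category contingency table; the toy corpus, the category labels, and the function names are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def chi_square(docs, labels, term, category):
    """Chi-square statistic for one (term, category) pair.

    Assumed textbook form: N * (A*D - C*B)^2 / ((A+C) * (B+D) * (A+B) * (C+D)),
    where A, B, C, D are the cells of the 2x2 contingency table below.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document outside category
        elif in_category:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document outside category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: term or category covers all or no documents
    return n * (a * d - c * b) ** 2 / denominator


# Hypothetical toy corpus: already tokenized documents with one label each.
docs = [
    ["wheat", "soil", "yield"],
    ["soil", "irrigation"],
    ["network", "router", "yield"],
    ["network", "protocol"],
]
labels = ["agri", "agri", "net", "net"]

df = document_frequency(docs)
chi_soil = chi_square(docs, labels, "soil", "agri")
chi_yield = chi_square(docs, labels, "yield", "agri")
```

In this toy corpus "soil" and "yield" have the same document frequency, but "soil" occurs only in the "agri" documents while "yield" is spread evenly across both categories, so χ² ranks "soil" higher. A document-frequency threshold alone would not separate the two terms.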
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.