
文本分类TF-IDF算法的改进研究
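Cite this article: 叶雪梅, 毛雪岷, 夏锦春, 王波. 文本分类TF-IDF算法的改进研究[J]. 计算机工程与应用, 2019, 55(2): 104-109.
Authors: 叶雪梅  毛雪岷  夏锦春  王波
Affiliations: School of Management, Hefei University of Technology, Hefei 230009; Key Laboratory of Process Optimization and Intelligent Decision-Making (Ministry of Education), Hefei University of Technology, Hefei 230009
Funding: Anhui Province Annual Key Scientific Research Project Program; Innovative Research Group Project of the National Natural Science Foundation of China
Abstract: The growth of China's Internet environment has brought a large number of information-rich new words into common use. The traditional feature-term weighting algorithm TF-IDF (Term Frequency-Inverse Document Frequency), however, considers mainly two factors, TF and IDF, and does not exploit the advantages of this emerging class of words. To address the effect that new words among the feature terms have on classification results, an improved TF-IDF algorithm for text classification based on Internet new words is proposed. New words are identified during text preprocessing, and the feature-weight formula is changed in the vector space model representation. Experimental results show that adding new-word discovery to text preprocessing achieves feature dimensionality reduction, and that the improved feature-weighting algorithm improves text classification results.

Keywords: new words  term frequency-inverse document frequency (TF-IDF)  vector space model  text classification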

Improved Approach to TF-IDF Algorithm in Text Classification
YE Xuemei, MAO Xuemin, XIA Jinchun, WANG Bo. Improved Approach to TF-IDF Algorithm in Text Classification[J]. Computer Engineering and Applications, 2019, 55(2): 104-109.
Authors: YE Xuemei  MAO Xuemin  XIA Jinchun  WANG Bo
Affiliation: 1. School of Management, Hefei University of Technology, Hefei 230009, China; 2. Key Laboratory of Process Optimization and Intelligent Decision-Making (MoE), Hefei University of Technology, Hefei 230009, China
Abstract: With the development of the Internet in China, many new words carrying rich information have come into common use. The traditional term-weighting algorithm TF-IDF (Term Frequency-Inverse Document Frequency) mainly considers two factors, TF and IDF, and does not exploit the advantages of new words. In view of the influence that new words among the feature terms have on classification results, an improved TF-IDF algorithm based on Internet new words is proposed for text classification. New words are recognized during text preprocessing, and their weight calculation formula is modified in the vector space model representation. Experimental results show that adding a new-word discovery step to text preprocessing reduces the feature dimension, and that the improved TF-IDF algorithm improves text classification results.
Keywords: new words  Term Frequency-Inverse Document Frequency (TF-IDF)  vector space model  text classification
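
The abstract does not reproduce the modified weight formula, so the following is only a minimal sketch of the general idea, not the authors' method. It computes the classic weight tf(t, d) * log(N / df(t)) over pre-tokenized documents and then multiplies the weight of any term found in a new-word lexicon by a constant. The function name `tfidf_weights`, the `new_words` set, the `boost` factor, and the toy tokens are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_weights(docs, new_words=frozenset(), boost=1.0):
    """Return one {term: weight} dict per tokenized document.

    Baseline: weight(t, d) = tf(t, d) * log(N / df(t)).
    Terms listed in `new_words` are additionally multiplied by `boost`.
    The boost is a hypothetical stand-in; the paper's actual formula
    is not given in the abstract.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        doc_weights = {}
        for term, count in counts.items():
            tf = count / total                 # relative term frequency
            idf = math.log(n_docs / df[term])  # inverse document frequency
            factor = boost if term in new_words else 1.0
            doc_weights[term] = tf * idf * factor
        weights.append(doc_weights)
    return weights


# Toy corpus: tokens are illustrative; "geili" stands in for a new word
# that a new-word discovery step has kept as a single token.
docs = [
    ["geili", "phone", "review"],
    ["phone", "price", "review"],
    ["team", "performance", "geili"],
]
baseline = tfidf_weights(docs)
boosted = tfidf_weights(docs, new_words={"geili"}, boost=2.0)
```

With `boost=1.0` the function reduces to plain TF-IDF, which makes it easy to compare a classifier trained on `baseline` vectors against one trained on `boosted` vectors. A constant multiplier is the simplest possible choice; any weighting that depends on class distribution or word position would need the formula from the full text.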
This article is indexed in 万方数据 (Wanfang Data) and other databases.