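Text Topic Partitioning Based on Topic-Word Frequency Features
Cite this article: KANG Kai, LIN Kun-hui, ZHOU Chang-le. Text topic partitioning based on topic-word frequency features[J]. Journal of Computer Applications, 2006, 26(8): 1993-1995.
Authors: KANG Kai  LIN Kun-hui  ZHOU Chang-le
Affiliations: 1. School of Software, Xiamen University, Xiamen, Fujian 361005
2. School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005
Funding: Xiamen University research and teaching-reform project
Abstract: The document-term frequency matrix currently used in text classification has two drawbacks: its term dimension is very high and it is extremely sparse, both of which make computation difficult. To address this, a text topic partitioning method based on topic-word frequency features is proposed, motivated by how search engine users decide which documents they need. The method first uses statistical methods to select the topic words of each text class, and then performs text clustering with the fuzzy C-means (FCM) algorithm, using topic-word classes rather than individual words as features. Experiments yielded good topic partitioning results. The method was also compared, in several aspects of both process and results, with a text clustering method based on word clustering, leading to some meaningful conclusions about implementation essentials and application settings.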

Keywords: search engine; text clustering; fuzzy C-means; topic-word filtering
Article ID: 1001-9081(2006)08-1993-03
Received: 2006-02-28
Revised: 2006-02-28; 2006-04-28

New text categorization method based on the frequency of topic words
KANG Kai, LIN Kun-hui, ZHOU Chang-le. New text categorization method based on the frequency of topic words[J]. Journal of Computer Applications, 2006, 26(8): 1993-1995.
Authors:KANG Kai  LIN Kun-hui  ZHOU Chang-le
Affiliation:1. School of Software, Xiamen University, Fujian Xiamen 361005, China; 2. School of Information Science and Technology, Xiamen University, Fujian Xiamen 361005, China
Abstract: The word frequency matrix currently used in text categorization is characterized by high dimensionality and excessive sparsity, which make computation difficult. To solve this problem, a new text categorization method based on topic-word frequency features is proposed, motivated by how search engine users select the documents they need. The approach first selects the topic words of each text class by a statistical method, and then applies the fuzzy C-means (FCM) clustering algorithm to the documents, using the frequency of topic-word classes rather than the frequency of single words as the feature. The method performed well in the experiment. It was also compared in several aspects with a text clustering method based on word clustering, and some useful conclusions about implementation and application were reached.
Keywords:search engine  document clustering  Fuzzy C-Means(FCM)  topic word filtering  
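
The two-stage idea in the abstract (replace single-word features with topic-word class counts, then cluster with FCM) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the statistical selection step, so the topic-word classes below are simply assumed to be given, and the names topic_class_features, fcm, topic_classes and docs, the toy documents, and the parameter defaults (m=2.0, tol, seed) are all hypothetical. The clustering part is textbook fuzzy C-means written with the Python standard library only.

```python
import random

def topic_class_features(tokens, topic_classes):
    """Count how many tokens of one document fall into each topic-word class."""
    return [float(sum(1 for t in tokens if t in words)) for words in topic_classes]

def fcm(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Textbook fuzzy C-means. Returns (centers, memberships)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Assumption: start from c randomly chosen data points as centers.
    centers = [list(p) for p in rng.sample(points, c)]
    u = [[0.0] * c for _ in range(n)]
    for _ in range(iters):
        # Membership update: u_ij is proportional to d_ij^(-2/(m-1)).
        new_u = []
        for x in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(x, ctr)) for ctr in centers]
            if min(d2) == 0.0:
                # The point sits exactly on a center: give it full membership there.
                hits = [j for j in range(c) if d2[j] == 0.0]
                row = [1.0 / len(hits) if j in hits else 0.0 for j in range(c)]
            else:
                inv = [v ** (-1.0 / (m - 1.0)) for v in d2]
                total = sum(inv)
                row = [v / total for v in inv]
            new_u.append(row)
        change = max(abs(a - b) for ra, rb in zip(new_u, u) for a, b in zip(ra, rb))
        u = new_u
        if change < tol:
            break
        # Center update: mean of the points weighted by u_ij^m.
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centers[j] = [sum(w[i] * points[i][k] for i in range(n)) / total
                          for k in range(dim)]
    return centers, u

# Hypothetical topic-word classes and toy documents (not from the paper).
topic_classes = [{"goal", "match", "team"}, {"stock", "market", "bank"}]
docs = [
    "team wins match with late goal".split(),
    "goal goal team match".split(),
    "stock market falls bank shares".split(),
    "bank stock market market".split(),
]
features = [topic_class_features(d, topic_classes) for d in docs]
centers, u = fcm(features, c=2)
# Hard label per document = cluster with the largest membership.
labels = [row.index(max(row)) for row in u]
print(features)
print(labels)
```

Each document is reduced to one count per topic-word class, so the feature dimension equals the number of classes rather than the vocabulary size, which is the dimensionality and sparsity reduction the abstract refers to. The membership rows can also be kept as soft assignments instead of being reduced to hard labels.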
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.