
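Text clustering method based on R-Grams
Cite this article:WANG Xianming, GU Qiong, HU Zhiwen. Text clustering method based on R-Grams[J]. Journal of Computer Applications, 2015, 35(11):3130-3134.
Authors:WANG Xianming  GU Qiong  HU Zhiwen
Affiliation:1. Oujiang College, Wenzhou University, Wenzhou Zhejiang 325035, China;2. Network Research Institute of Wenzhou, Wenzhou Zhejiang 325035, China;3. School of Mathematics and Computer Science, Hubei University of Arts and Science, Xiangyang Hubei 441053, China;4. Institute of Logic and Intelligence, Southwest University, Chongqing 400715, China;5. College of New Media, Zhejiang University of Media and Communications, Hangzhou 310018, China
Funding:Zhejiang Provincial Natural Science Foundation (LY13F010005); Humanities and Social Sciences Research Project of the Ministry of Education (15YJAZH015); Hubei Provincial Science and Technology Support Program, Soft Science Project (2015BDH109); Wenzhou Science and Technology Program Project (R20130021).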
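Abstract:To address the difficulty of balancing clustering accuracy and recall in traditional text clustering, a text clustering method based on the R-Grams text similarity algorithm is proposed. The method first sorts the documents to be clustered in descending order. Next, it uses the R-Grams text similarity algorithm to compute the similarity between texts, determines the marker document of each cluster from these similarities, and completes an initial clustering. Finally, it merges the initial clusters to obtain the final clustering. Experimental results show that the clustering result can be tuned flexibly through the clustering threshold to suit different needs, and that the best clustering threshold is about 15. As the clustering threshold increases, the accuracy of each cluster increases, while recall first rises and then falls. In addition, the method avoids extensive and tedious processing such as word segmentation and feature extraction, and it is simple to implement.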

Keywords:text  clustering  random  R-Grams
Received:2015-06-17
Revised:2015-07-15

Novel text clustering approach based on R-Grams
WANG Xianming, GU Qiong, HU Zhiwen. Novel text clustering approach based on R-Grams[J]. Journal of Computer Applications, 2015, 35(11):3130-3134.
Authors:WANG Xianming  GU Qiong  HU Zhiwen
Affiliation:1. Oujiang College, Wenzhou University, Wenzhou Zhejiang 325035, China;2. Network Research Institute of Wenzhou, Wenzhou Zhejiang 325035, China;3. School of Mathematics and Computer Science, Hubei University of Arts and Science, Xiangyang Hubei 441053, China;4. Institute of Logic and Intelligence, Southwest University, Chongqing 400715, China;5. College of New Media, Zhejiang University of Media and Communications, Hangzhou Zhejiang 310018, China
Abstract:To address the difficulty of balancing clustering accuracy and recall in traditional text clustering algorithms, a clustering approach based on the R-Grams text similarity algorithm is proposed. First, the documents to be clustered are sorted in descending order. Second, an R-Grams-based similarity algorithm is used to compute the similarity between texts, identify the symbolic (marker) document of each cluster, and produce an initial clustering. Finally, the initial clusters are merged to obtain the final clustering. The experimental results show that the proposed approach can flexibly regulate the clustering results through the clustering threshold parameter to satisfy different demands, and that the optimal threshold is about 15. As the clustering threshold increases, the clustering accuracy increases, while the recall first increases and then decreases. In addition, the approach avoids time-consuming processing steps such as word segmentation and feature extraction and is easy to implement.
Keywords:text  clustering  random  R-Grams
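
The three stages named in the abstract (descending sort, initial clustering around marker ("symbolic") documents, then merging) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the R-Grams similarity, the sort key, or the merge rule, so `rgram_similarity` is a hypothetical stand-in that samples random substrings, documents are sorted by length, and clusters are merged when one marker is similar to any member of an earlier cluster. All function and parameter names are invented for the example, and the threshold here is on this sketch's own 0-100 scale, which is not claimed to match the scale behind the reported value of about 15.

```python
import random


def rgram_similarity(a, b, rng, samples=200, min_len=2, max_len=4):
    """Percentage (0-100) of randomly sampled substrings of `a` found in `b`.

    Hypothetical stand-in for the R-Grams measure, which the abstract
    does not define.
    """
    if len(a) < min_len or not b:
        return 0.0
    hits = 0
    for _ in range(samples):
        n = rng.randint(min_len, min(max_len, len(a)))  # random gram length
        start = rng.randint(0, len(a) - n)              # random gram position
        if a[start:start + n] in b:
            hits += 1
    return 100.0 * hits / samples


def cluster(docs, threshold, seed=0):
    """Three-stage clustering sketch; returns a list of clusters (lists of str)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable

    # Stage 1: sort in descending order (by length; the key is an assumption).
    ordered = sorted(docs, key=len, reverse=True)

    # Stage 2: initial clustering. Element 0 of each cluster is its marker.
    clusters = []
    for doc in ordered:
        best, best_sim = None, threshold
        for c in clusters:
            sim = rgram_similarity(doc, c[0], rng)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([doc])  # no marker is close enough: new marker
        else:
            best.append(doc)

    # Stage 3: merge a cluster into an earlier one when its marker is
    # similar enough to any member of that earlier cluster (assumed rule).
    merged = []
    for c in clusters:
        for m in merged:
            if any(rgram_similarity(c[0], d, rng) >= threshold for d in m):
                m.extend(c)
                break
        else:
            merged.append(c)
    return merged
```

Because the stand-in similarity works directly on characters, the sketch needs no word segmentation or feature extraction, which matches the property claimed in the abstract. Raising `threshold` makes clusters stricter and more numerous; lowering it makes them fewer and looser.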
This article is indexed in Wanfang Data and other databases.
