
Research on text clustering for selecting initial cluster center based on Cosine distance
Citation: WANG Binyu, LIU Wenfen, HU Xuexian, WEI Jianghong. Research on text clustering for selecting initial cluster center based on Cosine distance[J]. Computer Engineering and Applications (计算机工程与应用), 2018, 54(10): 11-18.
Authors: WANG Binyu, LIU Wenfen, HU Xuexian, WEI Jianghong
Affiliation: 1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450000, China; 2. Guangxi Key Laboratory of Cryptography and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541000, China
Abstract: Text clustering is an important means of organizing, summarizing, and navigating text information, and the K-means algorithm based on cosine similarity is one of the most important and most widely used text clustering algorithms. Improvements to cosine-similarity K-means are difficult to design, and many effective improvements developed for Euclidean-distance K-means cannot be applied to it. To address this, the paper examines the relationship between cosine similarity and Euclidean distance and obtains a conversion formula between the two for normalized (unit-length) vectors. On that basis it defines a cosine distance that is close in meaning and closely related to Euclidean distance, so that existing Euclidean-distance improvements to K-means can be carried over to cosine-similarity K-means through the cosine distance. The paper then derives theoretically how to compute the cluster centers in cosine K-means and its extensions, and further improves the scheme for selecting the initial cluster centers, yielding a new text clustering algorithm, MCSKM++. Experiments show that the algorithm improves clustering accuracy while reducing both the number of iterations and the running time.

Keywords: text clustering; K-means algorithm; cosine similarity; cosine distance; initial point selection
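
The abstract does not state the conversion formula or the exact definition of the cosine distance. As an illustration only, the sketch below uses the standard identity for unit vectors, ||x - y||^2 = 2(1 - cos(x, y)), and assumes the cosine distance is d(x, y) = 1 - cos(x, y). Under that assumption, the usual k-means++ seeding rule (sample proportionally to squared Euclidean distance) is the same as sampling proportionally to d. The function names, the seeding routine, and the center update (the normalized mean used in spherical K-means) are all assumptions made for this example and are not the paper's MCSKM++ procedure.

```python
import numpy as np

def normalize_rows(X):
    """Scale each row to unit L2 norm (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def cosine_distance(X, c):
    """Assumed definition: 1 - cosine similarity, for unit-norm rows of X and a unit vector c."""
    return 1.0 - X @ c

def seed_centers(X, k, rng):
    """k-means++-style seeding: each new center is sampled with probability
    proportional to its cosine distance to the nearest center chosen so far."""
    n = X.shape[0]
    first = rng.integers(n)
    centers = [X[first]]
    d = cosine_distance(X, X[first])
    for _ in range(k - 1):
        d = np.clip(d, 0.0, None)          # guard against tiny negative round-off
        total = d.sum()
        probs = d / total if total > 0 else np.full(n, 1.0 / n)
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        d = np.minimum(d, cosine_distance(X, X[idx]))
    return np.vstack(centers)

def update_center(members):
    """Spherical K-means update (an assumption here): the normalized mean of the members."""
    return normalize_rows(members.mean(axis=0, keepdims=True))[0]

rng = np.random.default_rng(0)
X = normalize_rows(rng.random((200, 50)))  # synthetic stand-in for document vectors

# Numerical check of the unit-vector identity ||x - y||^2 = 2 * (1 - cos(x, y)).
x, y = X[0], X[1]
identity_holds = bool(np.isclose(np.sum((x - y) ** 2), 2.0 * (1.0 - x @ y)))

C = seed_centers(X, 5, rng)
center = update_center(X[:20])
```

Because the rows are normalized once up front, every similarity is a plain dot product, which is also why Euclidean-style seeding carries over without change in this sketch.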