首页 | 官方网站   微博 | 高级检索  
     

基于Zipf分布的网格密度峰值聚类算法
引用本文:马福民,宫婷,杨帆,张腾飞.基于Zipf分布的网格密度峰值聚类算法[J].控制与决策,2024,39(2):577-587.
作者姓名:马福民  宫婷  杨帆  张腾飞
作者单位:南京财经大学 信息工程学院,南京 210023;南京邮电大学 自动化学院、人工智能学院,南京 210023
基金项目:国家自然科学基金项目(61973151,62073173);江苏省自然科学基金项目(BK20191406);江苏省重点研发计划项目(BE2021001-4);江苏省研究生科研与实践创新计划项目(G-TXW21001);江苏省高校青蓝工程项目.
摘    要:网格密度峰值聚类在兼顾密度峰值聚类算法可识别任意形状类簇的基础上,通过数据集的网格化简化整体计算量,成为当前备受关注的聚类方法.针对大规模数据,如何进一步区分稠密与稀疏网格,减少网格密度峰值聚类中参与计算的非空网格代表点的数量是解决“网格灾难”的关键.结合以网格密度为变量的概率密度分布呈现出类Zipf分布的特点,提出一种基于Zipf分布的网格密度峰值聚类算法.首先计算所有非空网格的密度并映射为Zipf分布,根据对应的Zipf分布筛选出稠密中心网格和稀疏边缘网格;然后仅对稠密中心网格进行密度峰值聚类,在自适应确定潜在聚类中心的同时减少欧氏距离的计算量,降低算法复杂度;最后通过对稀疏边缘网格的处理,进一步优化类簇边界并提高聚类精度.人工数据集和UCI数据集下的实验结果表明,所提出算法对大规模、类簇交叉数据的聚类具有明显优势,能够在保证聚类精度的同时降低时间复杂度.

关 键 词:聚类  密度峰值  网格  Zipf分布  密度阈值

Grid density peak clustering algorithm based on Zipf distribution
MA Fu-min,GONG Ting,YANG Fan,ZHANG Teng-fei.Grid density peak clustering algorithm based on Zipf distribution[J].Control and Decision,2024,39(2):577-587.
Authors:MA Fu-min  GONG Ting  YANG Fan  ZHANG Teng-fei
Affiliation:College of Information Engineering,Nanjing University of Finance and Economics,Nanjing 210023,China; College of Automation & College of Artificial Intelligence,Nanjing University of Posts and Telecommunications,Nanjing 210023,China
Abstract:Grid density peak clustering, taking advantage of the strong ability that the density peak clustering algorithm can identify arbitrary shape clusters, greatly reduces the computational cost by gridding the data set. It has become a popular clustering method nowadays. However, for large-scale data, how to further distinguish the dense and sparse grids and reduce the number of non-empty grids involved in the calculation of the grid density peak clustering is the key to solve the "grid disaster". In view of the probability density distribution with grid density as variable shows a Zipf-like distribution, a grid density peak clustering algorithm based on Zipf distribution is proposed. Firstly, the density of all non-empty grids is calculated and mapped to Zipf distribution, so the dense center grids and sparse edge grids are filtered according to the Zipf distribution. And then, only the dense center grids are clustered with peak density, and the calculation of Euclidean distance is reduced while the cluster centers are determined heuristically, which reduces the complexity of the algorithm. Finally, the sparse edge grids are processed to further optimize the cluster boundary and improve the clustering accuracy. Experimental results on the synthetic datasets and UCI datasets show that the proposed algorithm has obvious advantages for the clustering of large-scale and cross-cluster data, and can reduce the time complexity while ensuring the clustering accuracy.
Keywords:
点击此处可从《控制与决策》浏览原始摘要信息
点击此处可从《控制与决策》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号