
Clustering tendency analysis algorithm based on data stream
FAN Zhongxin. Clustering tendency analysis algorithm based on data stream[J]. Journal of Computer Applications, 2020, 40(8): 2248-2254.
Original title (Chinese): 基于数据流的聚类趋势分析算法
Authors: FAN Zhongxin
Affiliation: National Experimental Teaching Demonstration Center for Atmospheric Science and Environmental Meteorology (Nanjing University of Information Science and Technology), Nanjing Jiangsu 210044, China
Funding: National Key Research and Development Program of China (2018YFC1505804).
Received: 2020-01-19
Revised: 2020-03-17

Abstract: Clustering tendency analysis algorithms based on sampling yield unstable and one-sided clustering tendency indices, and because they do not suit the batch-incremental nature of data streams, the clustering tendency index has to be recomputed repeatedly. To address these issues, MDCG-CTI, a Clustering Tendency Index analysis algorithm based on the Minimum Distance Connected Graph (MDCG), was proposed; it performs an overall analysis on all of the data rather than on a sample. First, stack-based depth-first traversal was used to update the nearest paths of incremental data, which reduces the complexity of building the MDCG. Then, the clustering tendency index was computed and the judgment threshold for clusterability was determined. Finally, the proposed algorithm was integrated with batch incremental Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Experiments on self-built datasets show that the proposed algorithm judges clusterability more accurately than existing algorithms for single-cluster data and for data with a large number of noise points. On the large datasets pendigits and avila, its cumulative time consumption is 38% and 42% lower, respectively, than that of Spectral Visual Assessment of cluster Tendency (SpecVAT). Compared with SpecVAT combined with batch incremental DBSCAN, the proposed algorithm combined with batch incremental DBSCAN increases the average clustering accuracy by 6% and 11% and reduces the cumulative clustering time by 7% and 8%, respectively. These results indicate that the proposed algorithm determines clustering tendency accurately and without parameters, and clearly improves the effectiveness and running efficiency of incremental clustering.

Keywords: clustering tendency; Minimum Distance Connected Graph (MDCG); data stream clustering; batch incremental clustering; Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
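
The abstract's starting point, that sampling-based clustering tendency indices are unstable, can be illustrated with a small sketch. The example below uses the Hopkins statistic, a widely used sampling-based measure; the abstract does not name it, and this is not MDCG-CTI, whose graph construction and index formula are not given here. The function names, the synthetic data, and the sample size `m` are all assumptions made for illustration.

```python
import math
import random


def nearest_distance(p, points, skip_index=None):
    """Distance from p to its nearest point in points (optionally skipping one index)."""
    best = math.inf
    for i, q in enumerate(points):
        if i == skip_index:
            continue
        best = min(best, math.dist(p, q))
    return best


def hopkins(points, m, rng):
    """Simple Hopkins statistic (no dimension exponent), computed from m samples.

    Values near 0.5 suggest spatially random data; values near 1 suggest clusters.
    """
    dims = len(points[0])
    lows = [min(p[k] for p in points) for k in range(dims)]
    highs = [max(p[k] for p in points) for k in range(dims)]

    # u: distances from m uniform random probes (inside the bounding box)
    # to their nearest data point.
    u = 0.0
    for _ in range(m):
        probe = [rng.uniform(lows[k], highs[k]) for k in range(dims)]
        u += nearest_distance(probe, points)

    # w: distances from m sampled data points to their nearest other data point.
    w = 0.0
    for i in rng.sample(range(len(points)), m):
        w += nearest_distance(points[i], points, skip_index=i)

    return u / (u + w)


def make_blobs(rng, centers, per_center, spread):
    """Hypothetical helper: Gaussian blobs around the given 2-D centers."""
    return [[rng.gauss(cx, spread), rng.gauss(cy, spread)]
            for cx, cy in centers for _ in range(per_center)]


data_rng = random.Random(0)
clustered = make_blobs(data_rng, [(0, 0), (10, 10)], 100, 0.5)
uniform = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(200)]

# Same data, different sampling seeds: the index changes from run to run.
clustered_scores = [hopkins(clustered, 10, random.Random(s)) for s in range(5)]
uniform_scores = [hopkins(uniform, 10, random.Random(s)) for s in range(5)]

print("clustered:", [round(s, 3) for s in clustered_scores])
print("uniform:  ", [round(s, 3) for s in uniform_scores])
```

Each call draws a different sample, so the score for a fixed dataset varies with the seed; a larger `m` narrows the spread at higher cost, and the whole computation must be redone when a new batch arrives. An index computed over all of the data, which is the approach the abstract describes, does not depend on such a sample.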
This article is indexed in Wanfang Data and other databases.
