首页 | 官方网站   微博 | 高级检索  
     

基于Spark的并行DBSCAN算法的设计与实现
引用本文:黄明吉,张倩. 基于Spark的并行DBSCAN算法的设计与实现[J]. 计算机科学, 2017, 44(Z11): 524-529
作者姓名:黄明吉  张倩
作者单位:北京科技大学机械工程学院 北京100083,北京科技大学机械工程学院 北京100083
基金项目:本文受北京市自然科学基金(2112011),中央高校基本科研业务费基金(2050205)资助
摘    要:随着云应用对运行时间和性能水平要求的逐步提高,以及内存价格的持续走低,基于内存的分布式计算框架Spark获得了前所未有的关注。主要研究DBSCAN算法在Spark上并行化的设计与实现,通过整体分析找到算法并行化可能的性能瓶颈,并从Spark的角度设计了并行DBSCAN算法的DAG图,优化了算法的并行化策略,最大化地降低了shuffle频率和数据量。最后将并行DBSCAN算法与单机DBSCAN算法进行性能对比,并通过实验分析不同参数对聚类结果的影响。结果表明,与单机DBSCAN算法相比,基于Spark的并行DBSCAN算法在聚类精度没有明显损失的情况下,数据量在3百万行时运行效率提高了37.2%,且加速比达到1.6。

关 键 词:Spark  并行DBSCAN算法  DAG  并行化策略

Design and Implementation of Parallel DBSCAN Algorithm Based on Spark
HUANG Ming-ji and ZHANG Qian. Design and Implementation of Parallel DBSCAN Algorithm Based on Spark[J]. Computer Science, 2017, 44(Z11): 524-529
Authors:HUANG Ming-ji and ZHANG Qian
Affiliation:School of Mechanical Engineering,University of Science and Technology Beijing,Beijing 100083,China and School of Mechanical Engineering,University of Science and Technology Beijing,Beijing 100083,China
Abstract:With the cloud application requiring less running time and higher performance as well as memory prices continuing to decline,the technology of memory-based distributed computing framework,such as Spark,has received unprecedented attention and is widely applied.We designed and implemented parallelized DBSCAN algorithm based on Spark,which can minimize the shuffle frequency and the data amount in shuffle.In order to optimize the algorithm para-llelization strategy,we found the bottleneck of algorithm performance through the analysis and described the DAG of parallel DBSCAN algorithm based on Spark.Finally,we compared the performance of parallel DBSCAN algorithm with the DBSCAN algorithm,and analyzed the influence of different parameters on clustering results.Experimental results show that,compared with DBSCAN algorithm,the running time of parallel DBSCAN algorithm based on Spark decreases 37.2% and the speedup reaches 1.6 respectively on a data set containing three million lines without obvious loss of clustering accuracy.
Keywords:Spark  Parallel DBSCAN algorithm  DAG  Parallelization strategy
点击此处可从《计算机科学》浏览原始摘要信息
点击此处可从《计算机科学》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号