基于Spark的并行DBSCAN算法的设计与实现 Design and Implementation of Parallel DBSCAN Algorithm Based on Spark期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

基于Spark的并行DBSCAN算法的设计与实现

引用本文：	黄明吉,张倩. 基于Spark的并行DBSCAN算法的设计与实现[J]. 计算机科学, 2017, 44(Z11): 524-529

作者姓名：	黄明吉张倩

作者单位：	北京科技大学机械工程学院北京100083,北京科技大学机械工程学院北京100083

基金项目：	本文受北京市自然科学基金(2112011),中央高校基本科研业务费基金(2050205)资助

摘要：	随着云应用对运行时间和性能水平要求的逐步提高,以及内存价格的持续走低,基于内存的分布式计算框架Spark获得了前所未有的关注。主要研究DBSCAN算法在Spark上并行化的设计与实现,通过整体分析找到算法并行化可能的性能瓶颈,并从Spark的角度设计了并行DBSCAN算法的DAG图,优化了算法的并行化策略,最大化地降低了shuffle频率和数据量。最后将并行DBSCAN算法与单机DBSCAN算法进行性能对比,并通过实验分析不同参数对聚类结果的影响。结果表明,与单机DBSCAN算法相比,基于Spark的并行DBSCAN算法在聚类精度没有明显损失的情况下,数据量在3百万行时运行效率提高了37.2%,且加速比达到1.6。
关键词：	Spark 并行DBSCAN算法 DAG 并行化策略
Design and Implementation of Parallel DBSCAN Algorithm Based on Spark

HUANG Ming-ji and ZHANG Qian. Design and Implementation of Parallel DBSCAN Algorithm Based on Spark[J]. Computer Science, 2017, 44(Z11): 524-529

Authors:	HUANG Ming-ji and ZHANG Qian

Affiliation:	School of Mechanical Engineering,University of Science and Technology Beijing,Beijing 100083,China and School of Mechanical Engineering,University of Science and Technology Beijing,Beijing 100083,China

Abstract:	With the cloud application requiring less running time and higher performance as well as memory prices continuing to decline,the technology of memory-based distributed computing framework,such as Spark,has received unprecedented attention and is widely applied.We designed and implemented parallelized DBSCAN algorithm based on Spark,which can minimize the shuffle frequency and the data amount in shuffle.In order to optimize the algorithm para-llelization strategy,we found the bottleneck of algorithm performance through the analysis and described the DAG of parallel DBSCAN algorithm based on Spark.Finally,we compared the performance of parallel DBSCAN algorithm with the DBSCAN algorithm,and analyzed the influence of different parameters on clustering results.Experimental results show that,compared with DBSCAN algorithm,the running time of parallel DBSCAN algorithm based on Spark decreases 37.2% and the speedup reaches 1.6 respectively on a data set containing three million lines without obvious loss of clustering accuracy.

Keywords:	Spark Parallel DBSCAN algorithm DAG Parallelization strategy

	点击此处可从《计算机科学》浏览原始摘要信息
	点击此处可从《计算机科学》下载全文

设为首页 | 免责声明 | 关于勤云 | 加入收藏