首页 | 官方网站   微博 | 高级检索  
     

基于Spark的BIRCH算法并行化的设计与实现
引用本文:李帅,吴斌,杜修明,陈玉峰.基于Spark的BIRCH算法并行化的设计与实现[J].计算机工程与科学,2017,39(1):35-41.
作者姓名:李帅  吴斌  杜修明  陈玉峰
作者单位:(1.北京邮电大学智能通信软件与多媒体北京重点实验室,北京 100876); 2.北京邮电大学计算机学院,北京 100876;3.国网山东省电力公司电力科学研究院,山东 济南 250000)
基金项目:国家863计划(2015AA050204);国网科技项目(60873120)
摘    要:在分布式计算和内存为王的时代,Spark作为基于内存计算的分布式框架技术得到了前所未有的关注与应用。着重研究BIRCH算法在Spark上并行化的设计和实现,经过理论性能分析得到并行化过程中时间消耗较多的Spark转化操作,同时根据并行化BIRCH算法的有向无环图DAG,减少shuffle和磁盘读写频率,以期达到性能优化。最后,将并行化后的BIRCH算法分别与单机的BIRCH算法和MLlib中的K-Means聚类算法做了性能对比实验。实验结果表明,通过Spark对BIRCH算法并行化,其聚类质量没有明显的损失,并且获得了比较理想的运行时间和加速比。

关 键 词:Spark  BIRCH并行化  性能优化
收稿时间:2016-09-05
修稿时间:2017-01-25

Design and implementation of BIRCH algorithm parallelization based on Spark
LI Shuai,WU Bin,DU Xiu ming,CHEN Yu feng.Design and implementation of BIRCH algorithm parallelization based on Spark[J].Computer Engineering & Science,2017,39(1):35-41.
Authors:LI Shuai  WU Bin  DU Xiu ming  CHEN Yu feng
Affiliation:(1.Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunicaions,Beijing 100876; 2.School of Computer Science,Beijing University of Posts and Telecommunications,Beijing 100876; 3.State Gride Shandong Electric Power Research Institute,Jinan 250000,China)
Abstract:In the era when distributed computing and memory highly count, the technology of memory based distributed computing framework, such as Spark, has gained unprecedented attention and is widely applied. We design and implement the BIRCH algorithm parallelization based on Spark, which can maximize performance optimization and reduce the frequency of shuffling and disk accessing. We do some theory analysis and describe the DAG of the BIRCH based on Spark. Finally, we compare the performance of the parallelized BIRCH algorithm with the BIRCH algorithm of a single machine and the MLlib K Means clustering algorithm. Experimental results show that the parallel BIRCH algorithm based on Spark obtains ideal running time and speedup without obvious clustering quality loss.
Keywords:Spark  BIRCH parallelization  performance optimization  
点击此处可从《计算机工程与科学》浏览原始摘要信息
点击此处可从《计算机工程与科学》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号