基于Spark的并行频繁项集挖掘算法 A PARALLEL FREQUENT ITEMSETS MINING ALGORITHM BASED ON SPARK期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

基于Spark的并行频繁项集挖掘算法

引用本文：	张素琪,孙云飞,武君艳,顾军华.基于Spark的并行频繁项集挖掘算法[J].计算机应用与软件,2019,36(2):24-28,143.

作者姓名：	张素琪孙云飞武君艳顾军华

作者单位：	天津商业大学信息工程学院天津300134;河北工业大学人工智能与数据科学学院天津300401;河北省大数据计算重点实验室天津300401

基金项目：	河北省科技计划;天津市科技计划;天津市科技计划

摘要：	关联规则挖掘是数据挖掘领域的重要研究方向之一。频繁项集的挖掘是关联规则挖掘的第一步,也是最重要的步骤。FP-Growth(Frequent Pattern-Growth)算法因其挖掘效率以及空间复杂度方面的优势被广泛应用于频繁项集挖掘任务中。面对海量数据,FP-Growth算法挖掘效率变得极低甚至失效。在Hadoop大数据平台上实现的基于MapReduce框架的并行FP-Growth算法——PFP算法解决在处理大规模数据时传统算法失效的问题,但是由于其将每次执行之后的中间结果输出到磁盘,降低算法执行效率。为提高并行FP-Growth算法执行效率,提出一种基于Spark的SPFPG算法。该算法运用负载均衡思想对分组策略进行改进,综合考虑分区计算量和FP-Tree规模两个因素,保证每个组之间负载总和近似相等。在Spark上实现FP-Growth算法——SFPG算法的基础上,实现优化后的SPFPG算法。实验结果表明,SPFPG算法相比SFPG算法挖掘效率更高,且算法具有良好的扩展性。
关键词：	大数据平台关联规则频繁项集 FP-GROWTH SPARK
A PARALLEL FREQUENT ITEMSETS MINING ALGORITHM BASED ON SPARK

Zhang Suqi,Sun Yunfei,Wu Junyan,Gu Junhua.A PARALLEL FREQUENT ITEMSETS MINING ALGORITHM BASED ON SPARK[J].Computer Applications and Software,2019,36(2):24-28,143.

Authors:	Zhang Suqi Sun Yunfei Wu Junyan Gu Junhua

Affiliation:	(School of Information Engineering, Tianjin University of Commerce, Tianjin 300134, China;School of Artificial Intelligence and Data Science Institute, Hebei University of Technology, Tianjin 300401, China;Hebei Key Laboratory of Big Data Computing, Tianjin 300401, China)

Abstract:	Association rule mining is one of the important research directions in data mining.The mining of frequent itemsets is the first and most important step in association rule mining.FP-Growth algorithm is widely used in frequent itemsets mining tasks because of its mining efficiency and its spatial complexity.Faced with big data, mining efficiency of FP-Growth algorithm becomes extremely low or even invalid.PFP algorithm, the parallel FP-Growth algorithm based on MapReduce framework implemented on Hadoop, solves the problem of failure of traditional algorithms when dealing with large-scale data.However, the efficiency of the algorithm is reduced because it output the intermediate results to disk after each execution.In order to improve the execution efficiency of parallel FP-Growth algorithm, a Spark-based SPFPG algorithm was proposed in the paper.The algorithm adopted the load balancing idea to improve the packet strategy. Considering the partition calculation and FP-Tree size, the sum of loads among each group was approximately equal. On the basis of SFPG, FP-Growth implemented on Spark, the optimized SPFPG algorithm was realized.The experimental results show that SPFPG algorithm is more efficient than SFPG algorithm and it has a good scalability.

Keywords:	Big data platform Association rules Frequent itemsets FP-Growth Spark
本文献已被维普万方数据等数据库收录！

设为首页 | 免责声明 | 关于勤云 | 加入收藏