首页 | 官方网站   微博 | 高级检索  
     

GPPR:跨域环境下的个性化PageRank算法
引用本文:陈子俊,马德龙,王一舒,袁野.GPPR:跨域环境下的个性化PageRank算法[J].软件学报,2024,35(3).
作者姓名:陈子俊  马德龙  王一舒  袁野
作者单位:东北大学 计算机科学与工程学院, 辽宁 沈阳 110169;北京理工大学 计算机学院, 北京 100081
基金项目:国家重点研发计划(2022YFB2702100); 国家自然科学基金(61932004, 62225203, U21A20516); DITDP(JCKY2021211B017); 中央高校基本科研业务专项资金资助(N232405-16)
摘    要:个性化PageRank作为大图分析中的的基本算法,在搜索引擎、社交推荐、社区检测等领域具有广泛的应用,一直是研究者们关注的热点问题.现有的分布式个性化PageRank算法均假设所有数据位于同一地理位置,且数据所在的计算节点之间具有相同的网络环境.然而,在现实世界中,这些数据可能分布在跨洲际的多个数据中心中,这些跨域分布(Geo-Distributed)的数据中心之间通过广域网连接,存在网络带宽异构、硬件差异巨大、通信费用高昂等特点.而分布式个性化PageRank算法需要多轮迭代,并在全局图上进行随机游走.因此,现有的分布式个性化PageRank算法不适用于跨域环境.针对此问题,本研究提出了GPPR(Geo-Distributed Personalized PageRank)算法.该算法首先对跨域环境中的大图数据进行预处理,通过采用启发式算法映射图数据,以降低网络带宽异构对算法迭代速度的影响.其次,GPPR改进了随机游走方式,提出了基于概率的push算法,通过减少工作节点之间传输数据的带宽负载,进一步减少算法所需的迭代次数.我们基于Spark框架实现了GPPR算法,并在阿里云中构建真实的跨域环境,在8个开源大图数据上与现有的多个代表性分布式个性化PageRank算法进行了对比实验.结果显示,GPPR的通信数据量在跨域环境中较其他算法平均减少30%.在算法运行效率方面,GPPR较其他算法平均提升2.5倍.

关 键 词:跨域分布式  个性化PageRank  近似计算
收稿时间:2023/7/17 0:00:00
修稿时间:2023/9/5 0:00:00

GPPR:Geo-distributed Personalized PageRank Algorithm
CHEN Zi-Jun,MA De-Long,WANG Yi-Shu,YUAN Ye.GPPR:Geo-distributed Personalized PageRank Algorithm[J].Journal of Software,2024,35(3).
Authors:CHEN Zi-Jun  MA De-Long  WANG Yi-Shu  YUAN Ye
Affiliation:School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China; School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
Abstract:Personalized PageRank, as a basic algorithm in large graph analysis, has a wide range of applications in search engines, social recommendation, community detection and other fields, and has been a hot problem of interest to researchers. The existing distributed personalized PageRank algorithms assume that all data are located in the same geographic location and the network environment is the same among the computing nodes where the data are located. However, in the real world, these data may be distributed in multiple data centers across continents, and these Geo-Distributed data centers are connected to each other through WANs, which are characterized by heterogeneous network bandwidth, huge hardware differences, and high communication costs. And the distributed personalized PageRank algorithm requires multiple iterations and random wandering on the global graph. Therefore, the existing distributed personalized PageRank algorithms are not applicable to the geo-distributed environment. To address this problem, the GPPR (Geo-Distributed Personalized PageRank) algorithm is proposed in this study. The algorithm first preprocesses the big graph data in the geo-distributed environment and maps the graph data by using a heuristic algorithm to reduce the impact of network bandwidth heterogeneity on the iteration speed of the algorithm. Secondly, GPPR improves the random wandering approach and proposes a probability-based push algorithm to further reduce the number of iterations required by the algorithm by reducing the bandwidth load of transmitting data between working nodes. We implement the GPPR algorithm based on the Spark framework and build a real geo-distributed environment in AliCloud to conduct experiments on eight open-source big graph data compared with several existing representative distributed personalized PageRank algorithms. The results show that the communication data volume of GPPR is reduced by 30% on average in the geo-distributed environment compared with other algorithms. In terms of algorithm running efficiency, GPPR improves by an average of 2.5 times compared to other algorithms.
Keywords:geo-distributed  personalized pagerank  approximate counting
点击此处可从《软件学报》浏览原始摘要信息
点击此处可从《软件学报》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号