Research on Memory Management and Cache Replacement Policies in Spark

Citation: MENG Hong-tao, YU Song-ping, LIU Fang, XIAO Nong. Research on Memory Management and Cache Replacement Policies in Spark[J]. Computer Science, 2017, 44(6): 31-35, 74.
Authors: MENG Hong-tao  YU Song-ping  LIU Fang  XIAO Nong
Affiliation: School of Computer, National University of Defense Technology, Changsha 410072, China (all authors)
Funding: Supported by the sub-project "Parallel Processing System and Research Based on In-Memory Computing" of the National 863 Program project "Key Technologies and Systems of In-Memory Computing for Big Data".
Abstract: Spark is a big data processing framework based on the Map-Reduce model. Spark can make full use of cluster memory, thereby accelerating data processing. Spark divides its memory by function into Shuffle Memory, Storage Memory, and Unroll Memory, and each region has its own usage characteristics. First, the usage characteristics of Shuffle Memory and Storage Memory were tested and analyzed. RDD (Resilient Distributed Dataset) is the most important abstraction in Spark; RDDs can be cached in cluster memory, and when memory is insufficient, some RDD partitions must be evicted to make room for new ones. Then, a new cache replacement policy called DWRP (Distributed Weight Replacement Policy) was proposed. DWRP computes a weight for every cached RDD partition from its residence time in memory, its size, and its frequency of use, and selects the partitions to evict according to the distributed characteristics of each RDD. Finally, the performance of several cache replacement policies was tested and analyzed.

Keywords: Big data  Spark memory management  RDD cache  Cache replacement policy
Received: 2016-11-11    Revised: 2017-01-02
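
For context, the caching behavior the abstract describes is exposed through Spark's public RDD API. The sketch below is a minimal example (the app name, master URL, and input path are placeholders): persist(StorageLevel.MEMORY_ONLY) caches an RDD's partitions in Storage Memory, Unroll Memory is consumed while each partition is materialized into a block, and when Storage Memory runs short Spark's default policy evicts cached blocks in LRU order.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddCacheDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-cache-demo").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // persist() marks the RDD for caching in Storage Memory; rdd.cache()
    // is shorthand for persist(StorageLevel.MEMORY_ONLY).
    val words = sc.textFile("hdfs:///input/data.txt") // placeholder path
      .flatMap(_.split("\\s+"))
      .persist(StorageLevel.MEMORY_ONLY)

    // The first action materializes each partition (drawing on Unroll
    // Memory) and stores it as a block in Storage Memory; later actions
    // reuse the cached blocks instead of re-reading the input.
    println(words.count())
    println(words.distinct().count())

    sc.stop()
  }
}
```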

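The abstract names the three signals DWRP weighs (residence time in memory, partition size, use frequency) but not the formula itself, so the following sketch is only an illustration of weight-based replacement under assumed definitions: PartitionInfo, the weight expression, and the victim-selection loop are hypothetical, not the paper's algorithm, and the distribution-aware step DWRP adds on top is omitted.

```scala
// Hypothetical bookkeeping for one cached RDD partition; these fields are
// assumptions modeled on the three factors the abstract names.
final case class PartitionInfo(
  id: String,
  cachedAtMs: Long, // when the block entered Storage Memory
  sizeBytes: Long,  // size of the cached block
  useCount: Long    // how many times tasks have read the block
)

object DwrpSketch {
  // Assumed weight: partitions that are read often, cached recently, and
  // small score high; low-weight partitions are evicted first. This is NOT
  // the paper's formula, only an illustration of the idea.
  def weight(p: PartitionInfo, nowMs: Long): Double = {
    val ageSec = math.max(1.0, (nowMs - p.cachedAtMs) / 1000.0)
    (p.useCount + 1.0) / (ageSec * p.sizeBytes.toDouble)
  }

  // Evict lowest-weight partitions until `neededBytes` have been freed.
  def selectVictims(cached: Seq[PartitionInfo], neededBytes: Long,
                    nowMs: Long): Seq[PartitionInfo] = {
    var freed = 0L
    cached.sortBy(weight(_, nowMs)).takeWhile { p =>
      val stillNeeded = freed < neededBytes
      if (stillNeeded) freed += p.sizeBytes
      stillNeeded
    }
  }
}
```

Under this assumed weight, a large partition that was cached long ago and rarely read scores lowest and is evicted first, matching the intuition the abstract gives; the distributed selection the paper describes would then choose among the nodes holding candidate partitions.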