Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform


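Cite this article:	ZHANG Yi-ran1, CHEN Long2, AN Xiang-zhe2, YAN Shen-gen3. Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform[J]. Computer Science, 2019, 46(2): 306-309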

Authors:	ZHANG Yi-ran1 CHEN Long2 AN Xiang-zhe2 YAN Shen-gen3

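Affiliations:	Beijing Information Science & Technology University, Beijing 100049; BGP Inc., China National Petroleum Corporation, Zhuozhou, Hebei 072751; Shenzhen SenseTime Technology Co., Ltd., Shenzhen, Guangdong 518000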

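Abstract:	Reduction is widely used in scientific computing, image processing, and other fields, and it is one of the fundamental algorithms of parallel computing, so accelerating it is of considerable importance. To fully exploit the computing power of the GPU on heterogeneous computing platforms, this paper proposes a reduction optimization method that works at three levels: inner-thread reduction, inner-work-group reduction, and inter-work-group reduction. It breaks with the conventional approach of earlier work, which concentrated optimization effort on inner-work-group reduction, and argues that inner-thread reduction is the real bottleneck of the reduction algorithm. Experimental results show that, across different data sizes, the proposed algorithm achieves speedups of 3.91~15.93 and 2.97~20.24 on the AMD W8000 and NVIDIA Tesla K20M platforms, respectively, over the carefully optimized CPU version of the OpenCV library. On the NVIDIA Tesla K20M, it achieves speedups of 2.25~5.97 and 1.25~1.75 over the CUDA and OpenCL versions of the OpenCV library, respectively. On the AMD W8000, it achieves a speedup of 1.24~5.15 over the OpenCL version. The work not only delivers high performance for reduction on GPU computing platforms but also achieves performance portability across different GPU computing platforms.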
Keywords:	Reduction algorithm; GPU; Inner-thread reduction; OpenCL
Received:	2018-09-12
Revised:	2018-11-20
Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform

ZHANG Yi-ran, CHEN Long, AN Xiang-zhe and YAN Shen-gen. Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform[J]. Computer Science, 2019, 46(2): 306-309

Authors:	ZHANG Yi-ran CHEN Long AN Xiang-zhe YAN Shen-gen

Affiliation:	Beijing Information Science & Technology University, Beijing 100049, China; BGP Inc., China National Petroleum Corporation, Zhuozhou, Hebei 072751, China; SenseTime, Shenzhen, Guangdong 518000, China

Abstract:	Reduction algorithms are widely applied in scientific computing and image processing, and reduction is one of the basic algorithms of parallel computing. Accelerating it is therefore significant. To fully exploit the capability of the GPU for general-purpose computing on heterogeneous processing platforms, this paper proposes a multi-level reduction optimization algorithm comprising inner-thread reduction, inner-work-group reduction, and inter-work-group reduction. Unlike traditional reduction optimization, which puts more emphasis on inner-work-group reduction, this paper argues that inner-thread reduction is the true bottleneck of the reduction algorithm. Experimental results show that, under different data set sizes, the proposed algorithm achieves speedups of 3.91~15.93 on the AMD W8000 and 2.97~20.24 on the NVIDIA Tesla K20M compared with the carefully optimized CPU version of the OpenCV library. On the NVIDIA Tesla K20M, it achieves speedups of 2.25~5.97 and 1.25~1.75 compared with the CUDA and OpenCL versions of the OpenCV library, respectively. On the AMD W8000, it achieves a speedup of 1.24~5.15 compared with the OpenCL version of the OpenCV library. This work not only achieves high performance for the reduction algorithm on GPU platforms, but also achieves performance portability between different GPU computing platforms.

Keywords:	Reduction algorithm; GPU; Inner-thread reduction; OpenCL
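
The three levels named in the abstract can be illustrated with a small CPU-side simulation. This is only a sketch of the general structure, not the authors' OpenCL implementation: the function name `three_level_reduce`, the parameters `num_groups` and `group_size`, and the strided access pattern are all assumptions made for illustration, and the sketch says nothing about the paper's actual tuning choices.

```python
def three_level_reduce(data, num_groups=4, group_size=8):
    """Sum `data` in three stages, simulating GPU work-groups on the CPU.

    Hypothetical illustration only; names and parameters are not from the paper.
    `group_size` must be a power of two for the tree reduction in stage 2.
    """
    assert group_size > 0 and group_size & (group_size - 1) == 0
    total_threads = num_groups * group_size
    group_results = []

    for g in range(num_groups):
        # Stage 1: inner-thread reduction. Each simulated thread walks the
        # input with a stride of total_threads and keeps a private partial sum,
        # so most of the input is consumed here.
        partial = []
        for t in range(group_size):
            global_id = g * group_size + t
            partial.append(sum(data[global_id::total_threads]))

        # Stage 2: inner-work-group reduction. A tree reduction halves the
        # number of active threads each step (local memory on a real GPU).
        stride = group_size // 2
        while stride > 0:
            for t in range(stride):
                partial[t] += partial[t + stride]
            stride //= 2
        group_results.append(partial[0])

    # Stage 3: inter-work-group reduction. Combine one value per work-group.
    return sum(group_results)


print(three_level_reduce(list(range(1000))))  # 499500
```

With 1000 inputs and 32 simulated threads, each thread sums about 31 elements in stage 1, while stages 2 and 3 together combine only 32 values. That imbalance is consistent with the abstract's claim that the inner-thread stage dominates, although the sketch itself does not measure performance.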
This article is indexed in Wanfang Data and other databases.