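A matrix multiplication accelerator design for optimization blocking strategy
Cite this article: SHEN Jun-zhong, XIAO Tao, QIAO Yu-ran, YANG Qian-ming, WEN Mei. A matrix multiplication accelerator design for optimization blocking strategy[J]. Computer Engineering & Science, 2016, 38(9): 1748-1754.
Authors: SHEN Jun-zhong  XIAO Tao  QIAO Yu-ran  YANG Qian-ming  WEN Mei
Affiliation: 1. College of Computer, National University of Defense Technology
Funding: National 863 Program (2012AA012706); National Natural Science Foundation of China (61272145)
Abstract: In many application domains, large-scale floating-point matrix multiplication is often one of the most time-consuming computational kernels. Emerging applications frequently involve large matrices in which at least one dimension is small; we call matrices with this property non-uniform matrices. Because the on-chip memory available on an FPGA for storing intermediate results is very limited, large-scale matrix multiplication usually has to be partitioned into fine-grained sub-block computing tasks. When accelerating non-uniform matrix multiplication, most existing hardware matrix multipliers with a linear array architecture suffer a large performance drop because they support only a fixed block size. To address this problem, an efficient optimized blocking strategy is proposed. On this basis, a matrix multiplier supporting variable block sizes is implemented on a Xilinx Zynq XC7Z045 FPGA. By integrating 224 processing elements, the multiplier achieves a measured performance of 48 GFLOPS on non-uniform matrix multiplication from real applications at a clock frequency of 150 MHz, while requiring only 4.8 GB/s of bandwidth. Experimental results show that the proposed blocking strategy delivers a performance improvement of up to 12% over traditional blocking algorithms.

Keywords: FPGA  non-uniform matrix  matrix multiplication  blocking strategy
Received: 2015-12-10
Revised: 2016-09-25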
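
The throughput figures quoted in the abstract can be put in context with a short back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes that each processing element completes one multiply and one add per cycle (2 FLOPs), which the abstract does not state, and the helper name `peak_gflops` is hypothetical.

```python
def peak_gflops(num_pes: int, clock_mhz: float, flops_per_pe_per_cycle: int = 2) -> float:
    """Theoretical peak throughput of a PE array in GFLOPS.

    flops_per_pe_per_cycle=2 is an assumption (one multiply plus one add
    per PE per cycle); the abstract does not give this figure.
    """
    return num_pes * clock_mhz * 1e6 * flops_per_pe_per_cycle / 1e9


# Figures quoted in the abstract.
NUM_PES = 224
CLOCK_MHZ = 150.0
MEASURED_GFLOPS = 48.0
BANDWIDTH_GB_S = 4.8

peak = peak_gflops(NUM_PES, CLOCK_MHZ)           # about 67.2 GFLOPS under the assumption
efficiency = MEASURED_GFLOPS / peak              # fraction of that peak actually reached
bytes_per_flop = BANDWIDTH_GB_S / MEASURED_GFLOPS  # off-chip traffic per floating-point op

print(f"peak: {peak:.1f} GFLOPS")
print(f"efficiency: {efficiency:.1%}")
print(f"bytes per FLOP: {bytes_per_flop:.2f}")
```

Under that assumption the array would peak at about 67.2 GFLOPS, so the reported 48 GFLOPS corresponds to roughly 71% of peak, with about 0.1 bytes of off-chip traffic per floating-point operation.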

A matrix multiplication accelerator design for optimization blocking strategy
SHEN Jun-zhong, XIAO Tao, QIAO Yu-ran, YANG Qian-ming, WEN Mei. A matrix multiplication accelerator design for optimization blocking strategy[J]. Computer Engineering & Science, 2016, 38(9): 1748-1754.
Authors: SHEN Jun-zhong  XIAO Tao  QIAO Yu-ran  YANG Qian-ming  WEN Mei
Affiliation: College of Computer, National University of Defense Technology, Changsha 410073, China
Abstract: Large-scale floating-point matrix multiplication is one of the most time-consuming computational kernels in many applications. In emerging applications, the matrices involved often have at least one small dimension; we refer to this case as non-uniform large-scale matrix multiplication. Because the on-chip memory available for storing intermediate results on an FPGA is limited, large-scale matrix multiplication must be partitioned into fine-grained sub-block computational tasks. When accelerating non-uniform matrix multiplication, most existing hardware matrix multipliers with a linear array architecture suffer a large performance loss because they support only a fixed sub-block size. To solve this problem, we propose an efficient optimized blocking strategy. Based on it, we implement a novel matrix multiplier that supports variable sub-block sizes on a Xilinx Zynq XC7Z045 FPGA. By integrating 224 processing elements (PEs), the multiplier achieves up to 48 GFLOPS for non-uniform matrix multiplication from a real application at 150 MHz, while requiring 4.8 GB/s of memory bandwidth. Results show that the proposed blocking strategy improves performance by up to 12% compared with traditional blocking algorithms.
Keywords: FPGA  non-uniform matrix  matrix multiplication  blocking strategy
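
To illustrate why a fixed sub-block size hurts non-uniform matrices, the following minimal sketch tiles the output matrix C into blocks and measures how much of the padded work is useful. It is an illustration only, not the blocking strategy proposed in the paper, which the abstract does not detail. The zero-padding cost model, the function names, and the example dimensions are all assumptions.

```python
import math


def padded_utilization(m: int, n: int, block_m: int, block_n: int) -> float:
    """Fraction of computed output elements that are useful when the
    m x n result is tiled with block_m x block_n blocks and edge blocks
    are padded up to the full block size (an assumed cost model)."""
    padded_m = math.ceil(m / block_m) * block_m
    padded_n = math.ceil(n / block_n) * block_n
    return (m * n) / (padded_m * padded_n)


def blocked_matmul(a, b, block_m: int, block_n: int):
    """Compute C = A x B one block_m x block_n sub-block of C at a time.
    Matrices are plain lists of rows; edge blocks are clipped, not padded."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block_m):          # block row of C
        for j0 in range(0, n, block_n):      # block column of C
            for i in range(i0, min(i0 + block_m, m)):
                for j in range(j0, min(j0 + block_n, n)):
                    c[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return c


# Hypothetical non-uniform case: a tall result with one small dimension.
M, N = 1024, 32
fixed = padded_utilization(M, N, 224, 224)     # square block sized to the PE count
variable = padded_utilization(M, N, 224, 32)   # block width matched to the small dimension
print(f"fixed 224x224 blocks: {fixed:.2f}")
print(f"variable 224x32 blocks: {variable:.2f}")

small = blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], 1, 1)
print(small)
```

With these hypothetical dimensions, a fixed 224 x 224 block leaves only about 13% of the padded work useful, while a 224 x 32 block raises that to about 91%. This is the kind of gap that support for variable block sizes is meant to close.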