1.

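贾迅钱磊原昊张昆吴东, Computer Engineering & Science (《计算机工程与科学》), 2020, 42(11): 1913-1921

BLAS level 3 operations have high computational complexity and often become the performance bottleneck of applications. A matrix multiplication coprocessor built on a linear array structure can perform matrix multiplication with high performance and high efficiency. Implementing BLAS level 3 operations efficiently on such a coprocessor is crucial for accelerating large-scale scientific and engineering simulation applications. Taking matrix multiplication as the core operation and exploiting the structural characteristics of the linear array, this paper proposes a design for BLAS level 3 operations on a matrix multiplication coprocessor and builds a corresponding performance analysis model. Experimental results show that the computational efficiency of the SYMM, SYRK, and TRMM operations on the coprocessor reaches 99%, 98%, and 80%, respectively, which is up to 31% higher than the computational efficiency of matrix operations on the SW26010 and the NVIDIA V100 GPU.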

2.

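GPU-based sparse matrix Cholesky factorization

Chinese Journal of Computers (《计算机学报》), 2014, (7)

Sparse Cholesky factorization is the core algorithm for solving large sparse linear systems and also the most time-consuming part of the solution process. In recent years a number of parallel algorithms have achieved significant speedups on graphics processing units (GPUs); however, because of irregular memory access and the many data dependencies between tasks, sparse Cholesky factorization runs with low computational efficiency on GPUs. This paper implements a new GPU-based sparse Cholesky factorization algorithm. For data organization, it improves the supernode data structure for sparse matrices and controls computational granularity through supernode merging and blocking. For computation scheduling, it maps the factorization onto a series of data-block tasks and designs corresponding task generation and scheduling algorithms that increase task parallelism while satisfying data dependencies. Experimental results show that the algorithm significantly improves the efficiency of sparse Cholesky factorization on the GPU, achieving a speedup of 2.69 to 3.88 times on a single GPU relative to a 4-core CPU platform.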

3.

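Research on optimizing sparse matrix storage formats on GPU

杨世伟蒋国平宋玉蓉涂潇, Computer Engineering (《计算机工程》), 2019, 45(9)

Sparse matrix-vector multiplication (SpMV) is inefficient under existing sparse matrix storage formats, and results computed with the blocked row-column (BRC) storage format lack reproducibility and determinism. To address this, an improved storage format, BRCP, is proposed. It adopts a different two-dimensional blocking strategy and adaptively tunes the blocking parameters according to the statistical distribution of nonzero elements in each row of the matrix, which improves the parallelism of SpMV on GPU platforms. A GPU kernel based on a fast segmented summation algorithm is designed to guarantee that results are deterministic and reproducible across different GPU platforms. Experimental results show that the BRCP format achieves high computational efficiency and, compared with the BRC format, reduces SpMV computation error in parallel environments and improves the accuracy of PageRank ranking.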

4.

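A parallel transitive closure algorithm for heterogeneous architectures

肖汉郭宝云李彩林周清雷, Computer Engineering (《计算机工程》), 2021, 47(8): 131-139

Traditional methods for computing the transitive closure of a graph suffer from heavy computation and long running times. To speed up transitive closure computation on large data sets, a parallel transitive closure algorithm based on OpenCL is proposed. It combines the compute-intensive nature of the algorithm with the features of the Open Computing Language (OpenCL) framework and uses parallel sub-matrix multiplication optimized with local memory together with parallel blocked matrix multiplication. The local-memory-optimized parallel sub-matrix multiplication streamlines the computation steps, improves memory utilization on the graphics processing unit (GPU), and reduces data access latency. The parallel blocked matrix multiplication handles large matrix products and improves utilization of the GPU compute cores. Experimental results show that, compared with a serial CPU algorithm, a parallel algorithm based on Open Multi-Processing (OpenMP), and a parallel algorithm based on the Compute Unified Device Architecture (CUDA), the proposed algorithm under the OpenCL architecture achieves speedups of 593.14 times, 208.62 times, and 1.05 times, respectively, on an NVIDIA GeForce GTX 1070 platform.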
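
The link between transitive closure and matrix multiplication mentioned in this abstract can be illustrated with a small serial sketch. It is not the paper's OpenCL kernel: it assumes the common formulation in which the closure is obtained by repeatedly squaring the Boolean adjacency matrix with self-loops added, and the function names `bool_matmul` and `transitive_closure` are hypothetical.

```python
def bool_matmul(a, b):
    """Boolean matrix product: c[i][j] is True if some k has a[i][k] and b[k][j]."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(adj):
    """Reflexive-transitive closure by repeated squaring of (A OR I).

    Assumption: adj is a square 0/1 (or bool) adjacency matrix.
    """
    n = len(adj)
    # Self-loops make each squaring keep all shorter paths as well.
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    covered = 1  # longest path length currently represented in reach
    while covered < n - 1:
        reach = bool_matmul(reach, reach)
        covered *= 2
    return reach


# A chain 0 -> 1 -> 2 -> 3: vertex 0 reaches 3, but 3 does not reach 0.
chain = [[0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1],
         [0, 0, 0, 0]]
closure = transitive_closure(chain)
```

Each squaring doubles the path length covered, so only about log2(n) matrix products are needed; those products are the part a GPU implementation would parallelize.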

5.

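Dynamic programming algorithms on GPU-CPU clusters

冯高锋, Journal of Computer Applications (《计算机应用》), 2007, 27(Z2): 281-282

With the rapid development of GPUs, using them for high-performance computing beyond graphics has become a research hotspot. This paper therefore proposes using the GPU as a coprocessor plugged into general-purpose compute nodes to build a GPU-CPU cluster system. A corresponding blocking algorithm partitions the computation matrix into blocks, and the function offload programming model then maps the dynamic programming algorithm onto the GPU for accelerated computation. Experiments show that optimizing the dynamic programming algorithm on this system yields good performance gains and speedup.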

6.

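GPU-accelerated incomplete Cholesky factorization preconditioned conjugate gradient method

陈尧赵永华赵慰赵莲, Journal of Computer Research and Development (《计算机研究与发展》), 2015, (4): 843-850

The incomplete Cholesky factorization preconditioned conjugate gradient (ICCG) method is an effective method for solving large sparse symmetric positive definite linear systems. However, ICCG requires two sparse triangular systems to be solved in every iteration, and the inherently serial nature of sparse triangular solves becomes the bottleneck when running ICCG in parallel on a GPU. An effective GPU-accelerated method for sparse triangular solves is presented. To increase the multithreaded parallelism of the sparse triangular solve on the GPU, level scheduling is applied to the sparse triangular matrices produced by the incomplete Cholesky factorization. To further improve parallel performance, the coefficient matrix is reordered with the approximate minimum degree (AMD) algorithm before level scheduling, and the levels of the sparse triangular matrix are sorted after level scheduling. This reduces the number of levels produced by level scheduling and optimizes the GPU memory access pattern of the sparse triangular solve. Numerical experiments show that, compared with an ICCG implementation built on NVIDIA CUSPARSE, the proposed method improves performance by more than 100% on average.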
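
As a rough illustration of level scheduling, the sketch below groups the rows of a sparse lower-triangular matrix into levels whose rows can be solved independently. It covers only the textbook level computation, not the AMD reordering or level sorting described in the abstract. The input layout (a list of off-diagonal column indices per row) and the name `level_schedule` are assumptions made for the example.

```python
def level_schedule(rows):
    """Group the rows of a sparse lower-triangular matrix into levels.

    rows[i] lists the column indices j < i of the off-diagonal nonzeros in row i.
    Solving for x[i] needs every such x[j], so level(i) = 1 + max(level(j)),
    or 0 when row i has no off-diagonal entries. Rows sharing a level have no
    dependencies on each other and can be solved in parallel.
    """
    level = [0] * len(rows)
    for i, cols in enumerate(rows):  # rows are visited in dependency order
        for j in cols:
            level[i] = max(level[i], level[j] + 1)
    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        levels[lv].append(i)
    return levels


# Row 2 depends on row 0, row 3 on rows 1 and 2, row 4 on row 0.
schedule = level_schedule([[], [], [0], [1, 2], [0]])
```

A GPU solver would launch one kernel (or one synchronized phase) per level, which is why reducing the number of levels matters for performance.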

7.

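GPU acceleration of iterative algorithms for solving linear systems (cited by: 1; self-citations: 0; citations by others: 1)

葛振杨灿群吴强陈娟, Computer Engineering & Science (《计算机工程与科学》), 2009, 31(Z1)

Iterative methods are a basic approach to solving linear systems, and using them efficiently becomes especially important when the coefficient matrix is a large sparse matrix. By analyzing the general characteristics of iterative methods, this paper proposes general techniques for accelerating them on GPUs, which offer high computing power and memory bandwidth. Using these techniques, a classical iterative method, PQMRCGSTAB, is implemented on two mainstream GPU platforms, and specific optimizations are proposed for the characteristics of each platform. Compared with a 4-core 2.4 GHz AMD Opteron processor, the double-precision version of PQMRCGSTAB runs 31 times faster when accelerated by an NVIDIA Tesla S1070 and 9 times faster when accelerated by an AMD Radeon HD 4870 X2.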

8.

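ChattyGraph: a highly scalable graph computing system for heterogeneous multi-coprocessors

蒋筱斌熊轶翔张珩武延军赵琛, Journal of Software (《软件学报》), 2023, 34(4): 1977-1996

As data continues to grow in scale and diversify in structure, how to use modern, link-interconnected heterogeneous multi-coprocessors to provide a real-time, reliable parallel runtime environment for large-scale data processing has become a research hotspot in the high-performance computing and database fields. Modern server hardware equipped with multiple coprocessor (GPU) devices (multi-GPU servers) has become the preferred high-performance platform for analyzing large-scale, irregular graph data. Graph computing systems and algorithms designed for multi-GPU server architectures in existing work (such as breadth-first traversal and shortest-path algorithms) already significantly outperform multi-core CPU environments overall. In these systems, however, the transfer of graph partition data between GPU coprocessors is limited by PCI-E bus bandwidth and local latency, so adding GPU devices does not yield near-linear growth in overall system performance and can even cause severe latency jitter. As a result, such systems cannot meet the high scalability requirements of large-scale parallel graph computing. A series of benchmark experiments reveals two classes of defects in existing systems: (1) the hardware architecture of the data paths between modern GPU devices keeps evolving (e.g., NVLink-V1, NVLink-V2), with greatly improved link bandwidth and latency, yet existing systems are confined to the PCI-E bus for block data communication and cannot fully exploit modern GPU link resources (including link topology, connectivity, and routing); (2) ...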

9.

FPGA design and implementation of sparse matrix-vector multiplication

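宋庆增顾军华, Computer Engineering (《计算机工程》), 2011, 37(23): 214-216

To address the low efficiency of sparse matrix-vector multiplication on traditional general-purpose processor (GPP) platforms, a SpMXV coprocessor design based on a reconfigurable computing platform is proposed. The design uses a deeply pipelined binary-tree dataflow, the IEEE-754 32-bit floating-point data format, and a diagonal storage format. The datapath is organized as a pipeline, which optimizes computational performance. Simulation results show that, compared with a software implementation on a GPP platform, the hardware design achieves a speedup of up to 2.69 times.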
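
To make the diagonal storage format concrete, here is a minimal software sketch of sparse matrix-vector multiplication over a matrix stored by diagonals. It is not the paper's hardware design. The row-aligned layout (`diagonals[d][i]` holds `A[i][i + k]`) is one of several conventions and is an assumption here, as is the function name `dia_spmv`.

```python
def dia_spmv(offsets, diagonals, x):
    """Compute y = A @ x for a square matrix A stored by diagonals.

    offsets[d] is the diagonal offset k (0 = main, k > 0 above, k < 0 below).
    diagonals[d][i] holds A[i][i + k]; slots falling outside the matrix are
    padding and are never read.
    """
    n = len(x)
    y = [0.0] * n
    for k, diag in zip(offsets, diagonals):
        # Only rows whose column index i + k lies inside the matrix contribute.
        for i in range(max(0, -k), min(n, n - k)):
            y[i] += diag[i] * x[i + k]
    return y


# Tridiagonal matrix [[2, -1, 0], [-1, 2, -1], [0, -1, 2]] times [1, 2, 3].
y = dia_spmv([-1, 0, 1],
             [[0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0]],
             [1.0, 2.0, 3.0])
```

Because every diagonal is a dense, regularly strided array, the inner loop needs no index lookups, which is one reason the format suits pipelined hardware.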

10.

Implementation and verification of general-purpose computing on graphics processors

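齐记杨孔庆杨磊, Computer Engineering and Applications (《计算机工程与应用》), 2009, 45(33): 67-69

This paper discusses the use of graphics cards for general-purpose scientific computing and, using basic operations on large matrices, compares CPU and GPU computation in detail. For basic matrix operations with suitable matrix blocking, the GPU computes about 50 times faster than the CPU. Moreover, the low price of graphics cards makes large-scale computation accessible to more researchers.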
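
The matrix blocking mentioned here can be sketched in plain Python. This shows only the loop structure of a blocked (tiled) matrix product, which a GPU version would map onto thread blocks and fast on-chip memory. The function name `blocked_matmul` and the default block size are assumptions for illustration, not details from the paper.

```python
def blocked_matmul(a, b, block=2):
    """Multiply an n x m matrix a by an m x p matrix b, one tile at a time."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # The three outer loops walk over tiles; the three inner loops stay within
    # one tile of a, b, and c, so the working set remains small.
    for i0 in range(0, n, block):
        for k0 in range(0, m, block):
            for j0 in range(0, p, block):
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


product = blocked_matmul([[1, 2, 3], [4, 5, 6]],
                         [[7, 8], [9, 10], [11, 12]])
```

The `min(...)` bounds handle matrices whose dimensions are not multiples of the block size.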

11.

Recursion based parallelization of exact dense linear algebra routines for Gaussian elimination

《Parallel Computing》2016

We present block algorithms and their implementation for the parallelization of sub-cubic Gaussian elimination on shared memory architectures. Contrary to the classical cubic algorithms in parallel numerical linear algebra, we focus here on recursive algorithms and coarse grain parallelization. Indeed, sub-cubic matrix arithmetic can only be achieved through recursive algorithms, making coarse grain block algorithms perform more efficiently than fine grain ones. This work is motivated by the design and implementation of dense linear algebra over a finite field, where fast matrix multiplication is used extensively and where costly modular reductions also advocate for coarse grain block decomposition. We incrementally build efficient kernels, for matrix multiplication first, then triangular system solving, on top of which a recursive PLUQ decomposition algorithm is built. We study the parallelization of these kernels using several algorithmic variants: either iterative or recursive and using different splitting strategies. Experiments show that recursive adaptive methods for matrix multiplication, hybrid recursive–iterative methods for triangular system solve and tile recursive versions of the PLUQ decomposition, together with various data mapping policies, provide the best performance on a 32-core NUMA architecture. Overall, we show that the overhead of modular reductions is more than compensated by the fast linear algebra algorithms and that exact dense linear algebra matches the performance of full rank reference numerical software even in the presence of rank deficiencies.

12.

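A matrix multiplication accelerator design supporting an optimized blocking strategy

沈俊忠肖涛乔寓然杨乾明文梅, Computer Engineering & Science (《计算机工程与科学》), 2016, 38(9): 1748-1754

In many application domains, large-scale floating-point matrix multiplication is often one of the most time-consuming computational kernels. Emerging applications frequently involve large matrices in which at least one dimension is very small; we call matrices with this property non-uniform matrices. Because the on-chip memory available on an FPGA for storing intermediate results is very limited, computing a large matrix product usually requires partitioning the matrices into fine-grained sub-block tasks. When accelerating non-uniform matrix multiplication, most existing hardware matrix multipliers with a linear array structure suffer a large performance drop because they support only a fixed block size. To solve this problem, an effective optimized blocking strategy is proposed. On this basis, a matrix multiplier supporting variable block sizes is implemented on a Xilinx Zynq XC7Z045 FPGA. With 224 processing elements integrated, the multiplier achieves a measured performance of 48 GFLOPS at a clock frequency of 150 MHz on non-uniform matrix multiplications from real applications, while requiring only 4.8 GB/s of bandwidth. Experimental results show that the proposed blocking strategy delivers up to a 12% performance improvement over the traditional blocking algorithm.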

13.

The simpler GMRES method combined with finite volume method for simulating viscoelastic flows on triangular grid

《Advances in Engineering Software》2015

An efficient solver integrating the restarted simpler generalized minimal residual method (SGMRES(m)) with the finite volume method (FVM) on a triangular grid is developed to simulate viscoelastic fluid flows. In particular, the SGMRES(m) solver is used to solve the large-scale sparse linear systems that arise from FVM discretization on a triangular grid when modeling Newtonian and viscoelastic fluid flows. To examine the performance of the solver for the nonlinear flow equations of viscoelastic fluids, we consider two types of numerical tests: the Newtonian flow past a circular cylinder, and the Oldroyd-B fluid flow in a planar channel and past a circular cylinder. It is shown that the numerical results obtained by SGMRES(m) are consistent with the analytical solutions or empirical values. By comparing the CPU time of different solvers, we find that our solver is highly efficient for solving the flow equations of viscoelastic fluids.

14.

Improving the scalability of a symmetric iterative eigensolver for multi‐core platforms

Hasan Metin Aktulga Chao Yang Esmond G. Ng Pieter Maris James P. Vary 《Concurrency and Computation》2014,26(16):2631-2651

We describe an efficient and scalable symmetric iterative eigensolver developed for distributed memory multi‐core platforms. We achieve over 80% parallel efficiency by major reductions in communication overheads for the sparse matrix‐vector multiplication and basis orthogonalization tasks. We show that the scalability of the solver is significantly improved compared to an earlier version, after we carefully reorganize the computational tasks and map them to processing units in a way that exploits the network topology. We discuss the advantage of using a hybrid OpenMP/MPI programming model to implement such a solver. We also present strategies for hiding communication on a multi‐core platform. We demonstrate the effectiveness of these techniques by reporting the performance improvements achieved when we apply our solver to large‐scale eigenvalue problems arising in nuclear structure calculations. Because sparse matrix‐vector multiplication and inner product computation constitute the main kernels in most iterative methods, our ideas are applicable in general to the solution of problems involving large‐scale symmetric sparse matrices with irregular sparsity patterns. Copyright © 2013 John Wiley & Sons, Ltd.

15.

Numerical solution of an integral equations system of the first kind by using an operational matrix with block pulse functions

K. Maleknejad H. Safdari M. Nouri 《International journal of systems science》2013,44(1):195-199

This article proposes a simple efficient method for solving a Volterra integral equations system of the first kind. By using block pulse functions and their operational matrix of integration, a first kind integral equations system can be reduced to a linear system of algebraic equations. The coefficient matrix of this system is a block matrix with lower triangular blocks. Numerical examples show that the approximate solutions have a good degree of accuracy.

16.

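FPGA design and implementation of large sparse matrix-vector multiplication over the binary field

苏锦柱邬贵明贾迅, Computer Engineering & Science (《计算机工程与科学》), 2016, 38(8): 1530-1535

As the core of the Wiedemann algorithm, sparse matrix-vector multiplication is the main step in solving large sparse linear systems over the binary field. An FPGA-based ring-network hardware architecture is proposed for large sparse matrix-vector multiplication over the binary field, together with a new parallel computing structure that handles the repeated sparse matrix-vector multiplications in the Wiedemann algorithm. Experimental analysis shows that the proposed architecture increases the parallelism of sparse matrix-vector multiplication in the Wiedemann algorithm while making full use of the FPGA's on-chip memory and gigabit transceivers. Compared with the best-performing partially reconfigurable computing (PR) model to date, it achieves a speedup of 2.65 times.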

17.

Improvement of workload balancing using parallel loop self-scheduling on Intel Xeon Phi

Chao-Tung Yang Chao-Wei Huang Shuo-Tsung Chen 《The Journal of supercomputing》2017,73(11):4981-5005

In recent years, Intel has promoted its Xeon Phi coprocessor, whose architecture is similar to x86. It has about 60 cores and can be regarded as a single computing node with considerable computing power. This work aims to improve workload balance with a parallel loop self-scheduling scheme running on a Xeon Phi-based computer cluster. The proposed concept is implemented with hybrid MPI and OpenMP parallel programming in C. Since parallel loop self-scheduling consists of static and dynamic allocation, a weighting algorithm is adopted in the static part, while the well-known loop self-scheduling is adopted in the dynamic part. The loop block is partitioned according to the weighting of the MIC and HOST nodes. Accordingly, the many-core Xeon Phi is used to implement parallel loop self-scheduling. Finally, we test the performance experimentally on four application problems: matrix multiplication, sparse matrix multiplication, Mandelbrot set and circuit meet. The experimental results indicate how to perform the weight allocation and which scheduling method achieves the best performance.

18.

Unconditionally stable ADI scheme of higher-order for linear hyperbolic equations

International Journal of Computer Mathematics (《国际计算机数学杂志》), 2012, 89(13): 3030-3038

An unconditionally stable alternating direction implicit (ADI) method of higher order in space is proposed for solving two- and three-dimensional linear hyperbolic equations. The method is fourth-order in space and second-order in time. The solution procedure consists of repeated use of a one-dimensional matrix solver, which makes the overall solver computationally cost-effective. Numerical experiments are conducted to compare the new scheme with the existing scheme based on second-order spatial discretization. The numerical results demonstrate the effectiveness of the new scheme.

19.

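KMA-α: an approximate computation algorithm for support vector machine kernel matrices

丁立中廖士中, Journal of Computer Research and Development (《计算机研究与发展》), 2012, 49(4): 746-753

Kernel matrix computation is the key to solving support vector machines (SVMs), and existing exact methods have difficulty handling large-scale sample data. This paper therefore studies approximate methods for computing the kernel matrix. First, using the convex quadratically constrained linear programming formulation of the SVM, second-order cone programming formulations of the SVM and the multiple-kernel SVM are given. Then, combining the Monte Carlo method with incomplete Cholesky factorization, a new kernel matrix approximation algorithm, KMA-α, is proposed. The algorithm first applies Monte Carlo random sampling to the kernel matrix; instead of performing a singular value decomposition directly on the sample, it applies incomplete Cholesky factorization with symmetric permutation to compute a near-optimal low-rank approximation. Using the approximate kernel matrix output by KMA-α as the input to the SVM improves the efficiency of solving the SVM's second-order cone program. Furthermore, the algorithmic complexity of KMA-α is analyzed and a theorem bounding its approximation error is proved. Finally, experiments on standard data sets verify the soundness and computational efficiency of KMA-α. Theoretical analysis and experimental results show that KMA-α is a sound and effective kernel matrix approximation algorithm.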
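
One ingredient of the approach described above, incomplete Cholesky factorization with symmetric pivoting, can be sketched as follows. This is not KMA-α itself: the Monte Carlo sampling step is omitted, and the function name `pivoted_cholesky`, the tolerance, and the dense list-of-lists input are assumptions for illustration. The sketch returns a factor `G` such that `G @ G.T` approximates the symmetric positive semidefinite matrix `K` with at most `rank` columns.

```python
def pivoted_cholesky(K, rank, tol=1e-12):
    """Low-rank factor G (n x rank) with K approximately equal to G @ G.T."""
    n = len(K)
    d = [float(K[i][i]) for i in range(n)]  # diagonal of the current residual
    G = [[0.0] * rank for _ in range(n)]
    for r in range(rank):
        # Pivot on the largest remaining diagonal entry (symmetric permutation).
        p = max(range(n), key=lambda i: d[i])
        if d[p] <= tol:
            break  # residual is numerically zero; remaining columns stay 0
        root = d[p] ** 0.5
        for i in range(n):
            partial = sum(G[i][t] * G[p][t] for t in range(r))
            G[i][r] = (K[i][p] - partial) / root
        for i in range(n):
            d[i] -= G[i][r] ** 2
    return G


# A rank-1 kernel matrix v v^T with v = [1, 2, 3] is recovered by one column.
K = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
G = pivoted_cholesky(K, rank=1)
```

Only the pivot columns of `K` are ever read, which is what makes this kind of factorization attractive when the full kernel matrix is too large to form.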