Similar Documents
19 similar documents found (search time: 171 ms)
1.
Efficient Algorithms for a Parallel Matrix Eigenproblem Solver on SMP Cluster Systems   (Cited by: 2; self-citations: 0; other citations: 2)
Tridiagonalization of symmetric matrices and eigenvalue computation for symmetric tridiagonal matrices are the key steps in a parallel solver for dense symmetric eigenproblems. Targeting the multi-level architecture of SMP cluster systems, hybrid MPI+OpenMP parallel algorithms are presented for Householder-based matrix tridiagonalization and for the divide-and-conquer algorithm for the tridiagonal eigenvalue problem. The study focuses on load balancing, communication overhead, and performance evaluation on SMP clusters. The hybrid design combines a coarse-grained thread-parallel model with dynamic task sharing, which improves the load balance of the pure MPI algorithm and reduces communication overhead. Experiments on the DeepComp 6800 show that the solver based on the hybrid algorithm achieves better performance and scalability than the pure MPI version.
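For reference, the coarse-grained MPI+OpenMP pattern described above can be sketched as follows (a minimal illustration with assumed names and problem size, not the authors' solver): each MPI rank owns a block of rows, one OpenMP parallel region does the local work with dynamic scheduling, and communication is funneled through the master thread.

```cuda
// Minimal MPI+OpenMP hybrid sketch (illustrative only, not the paper's solver).
// Each MPI rank owns a block of rows; one coarse-grained OpenMP region per rank
// updates its local rows, mirroring the coarse-grained thread model described above.
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024  /* global problem size (assumed for the sketch) */

int main(int argc, char **argv) {
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_rows = N / size;              /* assume N divisible by size */
    double *block = (double *)malloc((size_t)local_rows * N * sizeof(double));

    /* Coarse-grained OpenMP region: threads share the local row block. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < local_rows; ++i)
        for (int j = 0; j < N; ++j)
            block[(size_t)i * N + j] = 1.0 / (rank * local_rows + i + j + 1);

    /* Communication is funneled through the master thread, outside the region. */
    double local_sum = 0.0, global_sum = 0.0;
    for (size_t k = 0; k < (size_t)local_rows * N; ++k) local_sum += block[k];
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("checksum = %f\n", global_sum);

    free(block);
    MPI_Finalize();
    return 0;
}
```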

2.
The conjugate gradient (CG) method is one of the most important methods for solving symmetric positive definite linear systems, and the problems it solves are typically sparse. As problem sizes keep growing, a single CPU can no longer meet the real-time requirements of solving large sparse linear systems because of its limited memory and computing capacity. This paper therefore proposes an MPI+CUDA heterogeneous parallel solver for CPU+GPU platforms. First, a hotspot performance analysis of the CG algorithm is carried out to expose its computational difficulties and challenges. Next, tasks are partitioned according to the characteristics of the algorithm to obtain the heterogeneous parallel design. Finally, the design is optimized to address communication, data transfer, and memory access overheads, further improving efficiency and performance. Experimental results show that, compared with MPI parallelization and CUDALib parallelization, the MPI+CUDA hybrid achieves performance gains of 336% and 33% respectively on the Jacobi-preconditioned CG algorithm, whose serial fraction is small, and gains of 25% and 7% respectively on the ILU-preconditioned CG algorithm, whose serial fraction is larger. The results also show that the MPI+CUDA hybrid scales reasonably as the number of nodes increases.
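As a rough illustration of the task split such a solver implies (our sketch, assuming a row-wise CSR partition across ranks; not the paper's code), each rank runs the sparse matrix-vector product on its GPU, and the dot products inside each CG iteration are combined across ranks with MPI_Allreduce:

```cuda
// Sketch of the per-rank GPU work inside one CG iteration (illustrative fragment;
// the CG driver, preconditioner, allocation, and error checks are omitted).
// Each rank holds a horizontal slice of the CSR matrix; dot products are
// combined across ranks with MPI_Allreduce, as a hybrid MPI+CUDA CG would do.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// y = A_local * x for the local CSR slice (one thread per local row).
__global__ void spmv_csr(int local_rows, const int *row_ptr, const int *col_idx,
                         const double *val, const double *x, double *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < local_rows) {
        double sum = 0.0;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[row] = sum;
    }
}

// Global dot product: local cuBLAS dot on the GPU, then MPI_Allreduce across ranks.
double global_dot(cublasHandle_t h, int local_n, const double *d_a, const double *d_b) {
    double local = 0.0, global = 0.0;
    cublasDdot(h, local_n, d_a, 1, d_b, 1, &local);
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}
```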

3.
To study the general-purpose computing capability of GPUs and programming models suited to SMP clusters, a new multi-granularity MPI+CUDA hybrid parallel programming approach is proposed for the first time, using MPI for coarse-grained parallelism between nodes and CUDA for fine-grained parallelism within each node. The approach is evaluated on a three-node SMP cluster using large-scale matrix multiplication. Experimental results show that it significantly improves parallel efficiency, and demonstrate that the MPI+CUDA hybrid programming model fully exploits both the distributed memory between nodes and the shared memory within nodes of an SMP cluster, providing an effective parallel strategy for SMP clusters equipped with CUDA-enabled GPUs.
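A minimal sketch of this coarse/fine split (illustrative sizes and data, not the paper's program): MPI scatters row blocks of A and broadcasts B between nodes, and each node multiplies its block on the GPU with a naive CUDA kernel.

```cuda
// Illustrative coarse-grained (MPI) plus fine-grained (CUDA) matrix multiply.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N 512   /* matrix dimension, assumed divisible by the number of ranks */

__global__ void matmul_block(const double *A, const double *B, double *C,
                             int rows, int n) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  /* local row  */
    int j = blockIdx.x * blockDim.x + threadIdx.x;  /* column     */
    if (i < rows && j < n) {
        double sum = 0.0;
        for (int k = 0; k < n; ++k) sum += A[i * n + k] * B[k * n + j];
        C[i * n + j] = sum;
    }
}

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;
    size_t blk = (size_t)rows * N * sizeof(double), full = (size_t)N * N * sizeof(double);
    double *A = NULL, *C = NULL;
    double *B = (double *)malloc(full);
    double *Ablk = (double *)malloc(blk), *Cblk = (double *)malloc(blk);

    if (rank == 0) {                      /* rank 0 owns the full matrices */
        A = (double *)malloc(full); C = (double *)malloc(full);
        for (int i = 0; i < N * N; ++i) { A[i] = 1.0; B[i] = 2.0; }
    }
    /* Coarse grain: distribute work between nodes with MPI. */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, Ablk, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Fine grain: each node multiplies its block on the GPU with CUDA. */
    double *dA, *dB, *dC;
    cudaMalloc(&dA, blk); cudaMalloc(&dB, full); cudaMalloc(&dC, blk);
    cudaMemcpy(dA, Ablk, blk, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, full, cudaMemcpyHostToDevice);
    dim3 t(16, 16), g((N + 15) / 16, (rows + 15) / 16);
    matmul_block<<<g, t>>>(dA, dB, dC, rows, N);
    cudaMemcpy(Cblk, dC, blk, cudaMemcpyDeviceToHost);

    MPI_Gather(Cblk, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("C[0] = %f (expect %f)\n", C[0], 2.0 * N);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(B); free(Ablk); free(Cblk);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}
```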

4.
Solving the Helmholtz equation is a core part of the dynamical framework of the GRAPES numerical weather prediction system and can be reformulated as the solution of a large sparse linear system; constrained by hardware resources and data size, its solution efficiency has become a bottleneck for improving the computational performance of the whole system. The generalized conjugate residual (GCR) method for large sparse linear systems is implemented with three parallel schemes, MPI, MPI+OpenMP, and CUDA, and an incomplete LU (ILU) preconditioner is used to improve the condition number of the coefficient matrix and accelerate convergence of the iterative method. In the CPU schemes, MPI handles coarse-grained inter-process parallelism and communication while OpenMP provides fine-grained intra-process parallelism over shared memory; in the GPU scheme, the CUDA implementation applies optimizations for data transfer, coalesced memory access, and shared memory. Experimental results show that reducing the iteration count through preconditioning clearly improves performance; the MPI+OpenMP hybrid is about 35% faster than the MPI version, and the CUDA version is about 50% faster than the MPI+OpenMP hybrid, giving the best performance.
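The preconditioning step is not shown in the abstract; as a much simpler stand-in for the ILU preconditioner, the sketch below applies a diagonal (Jacobi) preconditioner on the GPU. The z = M^{-1} r slot it fills is where the ILU triangular solves would go.

```cuda
// Illustrative preconditioner application inside a Krylov iteration:
// z = M^{-1} r with a diagonal (Jacobi) preconditioner. This is a simplified
// stand-in for the ILU preconditioner used in the paper; ILU would instead
// perform sparse triangular solves L(Uz) = r at this point.
__global__ void jacobi_precond(int n, const double *diag, const double *r, double *z) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        z[i] = r[i] / diag[i];   /* assumes the diagonal entries are nonzero */
}

/* Host-side call, assuming d_diag, d_r, d_z are device arrays of length n:
   jacobi_precond<<<(n + 255) / 256, 256>>>(n, d_diag, d_r, d_z);            */
```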

5.
Design of a Hybrid Parallel Algorithm for Symmetric Matrix Tridiagonalization   (Cited by: 2; self-citations: 0; other citations: 2)
赵永华  迟学斌  陈江 《计算机工程》2005,31(22):39-41,53
Based on Householder transformations, a hybrid MPI+OpenMP parallel algorithm for tridiagonalizing dense symmetric matrices is presented. The work focuses on load balancing, communication overhead, and performance evaluation on SMP cluster systems. The OpenMP shared-memory parallelization adopts a coarse-grained approach, which resolves the load-balancing problem of the pure MPI algorithm and reduces communication overhead. Experiments on the DeepComp 6800 show that the MPI+OpenMP version achieves better performance and scalability than the pure MPI version.
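The dominant cost of Householder tridiagonalization is the symmetric rank-2 update of the trailing matrix, A <- A - v w^T - w v^T; a coarse-grained OpenMP sketch of that update (illustrative only, column-major storage assumed) looks like this:

```cuda
// Illustrative OpenMP parallelization of the symmetric rank-2 update
// A <- A - v*w^T - w*v^T that dominates Householder tridiagonalization.
// Only the lower triangle is updated (column-major storage assumed).
#include <stddef.h>

void syr2_update(int n, double *A, int lda, const double *v, const double *w) {
    /* Coarse grain: each thread owns a set of columns of the trailing matrix. */
    #pragma omp parallel for schedule(dynamic)
    for (int j = 0; j < n; ++j)
        for (int i = j; i < n; ++i)
            A[i + (size_t)j * lda] -= v[i] * w[j] + w[i] * v[j];
}
```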

6.
Design of an Efficient Parallel Block Algorithm for Symmetric Matrix Tridiagonalization   (Cited by: 1; self-citations: 0; other citations: 1)
In numerical matrix computations, block algorithms are usually more efficient than unblocked ones, but they also make parallel algorithms harder to design and implement. In a parallel solver for generalized dense symmetric eigenproblems, parallel block algorithms can be applied to the Cholesky factorization of symmetric positive definite matrices, to the tridiagonalization of symmetric matrices, and to the back-transformation step. This paper focuses the discussion of parallel block algorithms on the representative case of symmetric matrix tridiagonalization and presents a design method for a parallel block tridiagonalization algorithm under a non-blocked storage scheme. The relationship between block size, matrix size, and number of processors is analyzed. Experiments on the DeepComp 6800 show that the algorithm performs very well and achieves higher performance than ScaLAPACK.
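To illustrate what blocking buys (our sketch, not the paper's kernel): nb Householder vectors are accumulated into panels V and W, and the trailing matrix is then updated once per panel instead of once per vector, turning memory-bound vector updates into a matrix-matrix style update.

```cuda
// Illustrative blocked trailing update A22 <- A22 - V*W^T - W*V^T, where the
// panels V and W collect nb Householder vectors. Working on nb columns at a time
// replaces nb memory-bound rank-2 updates with one rank-2*nb update, which is
// why block algorithms run faster than unblocked ones.
#include <stddef.h>

void blocked_update(int n, int nb, double *A22, int lda,
                    const double *V, const double *W, int ldv) {
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < n; ++j)
        for (int i = j; i < n; ++i) {      /* lower triangle only */
            double s = 0.0;
            for (int k = 0; k < nb; ++k)
                s += V[i + (size_t)k * ldv] * W[j + (size_t)k * ldv]
                   + W[i + (size_t)k * ldv] * V[j + (size_t)k * ldv];
            A22[i + (size_t)j * lda] -= s;
        }
}
```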

7.
Based on the algorithmic characteristics of the 21CMA correlator, and after comparing the architectures of a CPU-parallel MPI cluster, an MPI+CUDA heterogeneous cluster, and a Hadoop+CUDA heterogeneous cluster, a method for implementing a software correlator on a Hadoop+CUDA platform is proposed. The method exploits the strength of GPUs in compute-intensive operations such as FFTs, vector multiplication, and vector addition to design the correlator's parallel model, and its performance is greatly improved over the earlier correlator implemented on a CPU-parallel MPI cluster. The Hadoop software stack, widely used in big-data processing platforms, is adopted, and the Hadoop Streaming tool is used to run programs not written in Java in parallel on the distributed system, conveniently obtaining linear speedup on the cluster. The Hadoop HDFS parallel file system manages result data and processing logs more flexibly and reliably, providing a supporting environment for subsequent big-data analysis.
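As an illustration of the GPU-friendly inner step of an FX-style correlator (a simplified sketch, not the 21CMA production code): after the per-antenna FFTs, each baseline accumulates X(f) * conj(Y(f)) over time for every frequency channel, while Hadoop Streaming merely launches the compiled binary on each data chunk.

```cuda
// Illustrative core of an FX-style software correlator: after the per-antenna
// FFTs, each frequency channel accumulates X(f) * conj(Y(f)) over time.
// cuFFT would produce the spectra; Hadoop Streaming launches the binary per chunk.
#include <cuComplex.h>

__global__ void cross_correlate(int nchan, int ntime,
                                const cuDoubleComplex *X,   /* [ntime][nchan] */
                                const cuDoubleComplex *Y,   /* [ntime][nchan] */
                                cuDoubleComplex *acc) {     /* [nchan]        */
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f < nchan) {
        cuDoubleComplex sum = make_cuDoubleComplex(0.0, 0.0);
        for (int t = 0; t < ntime; ++t)
            sum = cuCadd(sum, cuCmul(X[t * nchan + f], cuConj(Y[t * nchan + f])));
        acc[f] = sum;
    }
}
```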

8.
On a heterogeneous parallel computing platform consisting of a cluster and GPUs, the charge distribution calculation in molecular dynamics simulations based on the ABEEMσπ model is implemented with a hybrid MPI+CUDA programming model. The computational part of the charge distribution solver is ported to the GPU, and the problems of high communication overhead and under-utilized resources are addressed by asynchronous, concurrent execution on the heterogeneous platform, improving solution efficiency. Performance tests show that, compared with the pure MPI parallel algorithm, the optimized GPU-accelerated heterogeneous parallel algorithm has a clear performance advantage for charge distribution calculations of large chemical molecules.
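The asynchronous concurrency mentioned above usually follows the pattern sketched below (illustrative kernel and names, not the ABEEMσπ code): the work is split into chunks, and pinned host memory plus two CUDA streams let the transfer of one chunk overlap the kernel running on another.

```cuda
// Illustrative asynchronous overlap pattern: work is split into chunks, and
// pinned host memory plus two CUDA streams let the transfer of one chunk
// overlap the kernel working on another chunk.
#include <cuda_runtime.h>

__global__ void process(double *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i] + 1.0;   /* stand-in for the real computation */
}

void process_in_chunks(double *h_data /* pinned, length n */, int n, int nchunks) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
    int chunk = n / nchunks;                 /* assume n divisible by nchunks */
    double *d_buf[2];
    cudaMalloc(&d_buf[0], chunk * sizeof(double));
    cudaMalloc(&d_buf[1], chunk * sizeof(double));

    for (int c = 0; c < nchunks; ++c) {
        int b = c % 2;                       /* alternate buffers and streams  */
        cudaMemcpyAsync(d_buf[b], h_data + (size_t)c * chunk, chunk * sizeof(double),
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d_buf[b], chunk);
        cudaMemcpyAsync(h_data + (size_t)c * chunk, d_buf[b], chunk * sizeof(double),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();
    cudaFree(d_buf[0]); cudaFree(d_buf[1]);
    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
}
```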

9.
The performance of a parallel solver for generalized Hermitian eigenproblems depends on many factors, including the chosen parallel algorithm and the matrix distribution strategy. Based on block storage and a blocked-algorithm strategy, a new parallel algorithm for the reduction to standard form is proposed; it integrates the Cholesky factorization into the standardization of the generalized eigenproblem, reducing the communication overhead of existing parallel algorithms and increasing parallelism. The new algorithm significantly improves the performance and scalability of existing parallel algorithms. In addition, an efficient parallel block algorithm is given for solving the triangular matrix equation AX=B with multiple right-hand sides. Test results with the in-house parallel eigenproblem software package PSEPS show that the new algorithm is about twice as fast as the conventional parallel algorithm and scales well.
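For reference, the standardization step that the new algorithm fuses with the Cholesky factorization is the usual reduction of the generalized problem to standard form:

```latex
% Reduction of the generalized Hermitian eigenproblem to standard form.
% Given A x = \lambda B x with B Hermitian positive definite:
\begin{aligned}
  B &= L L^{H} \quad\text{(Cholesky factorization)} \\
  C &= L^{-1} A L^{-H}, \qquad y = L^{H} x \\
  C y &= \lambda y \quad\text{(standard Hermitian eigenproblem)}
\end{aligned}
% The triangular matrix equations with multiple right-hand sides (the AX = B of
% the abstract) arise when forming L^{-1} A and when back-transforming eigenvectors.
```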

10.
This paper presents a parallel distributed algorithm for a linear solver for large real symmetric positive definite dense systems and complex symmetric non-Hermitian dense systems. Unlike ScaLAPACK, it uses a J-variant block Cholesky factorization algorithm and a one-dimensional block-cyclic column data distribution. With MPI as the message-passing library, the algorithm delivers floating-point performance comparable to ScaLAPACK for real symmetric positive definite dense systems on clusters of up to 16 processors, and it can solve some electromagnetic scattering problems involving complex symmetric non-Hermitian dense systems. Its advantage is that the storage needed for the Cholesky factorization is only half that of the standard parallel library ScaLAPACK. Numerical simulation results show that the algorithm is correct and efficient.
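A one-dimensional block-cyclic column distribution assigns column blocks to processes in round-robin order; a small sketch of the owner and local-index computation (our illustration, with block width nb and p processes) is:

```cuda
// Illustrative owner and local-index computation for a 1-D block-cyclic
// column distribution: columns are grouped into blocks of width nb, and
// block k is owned by process k mod p. The solver's J-variant Cholesky
// ordering is not shown here.
#include <stdio.h>

int owner_of_column(int j, int nb, int p)       { return (j / nb) % p; }
int local_index_of_column(int j, int nb, int p) {
    int block = j / nb;                 /* global block index            */
    int local_block = block / p;        /* how many of my blocks precede */
    return local_block * nb + (j % nb); /* local column index            */
}

int main(void) {
    int nb = 4, p = 3;
    for (int j = 0; j < 12; ++j)
        printf("col %2d -> proc %d, local col %d\n",
               j, owner_of_column(j, nb, p), local_index_of_column(j, nb, p));
    return 0;
}
```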

11.
张丹丹  徐莹  徐磊 《计算机科学》2012,39(4):296-298,303
Several parallel programming models for CPU+GPU heterogeneous platforms are studied, and CUDA, MPI+CUDA, and MPI+OpenMP+CUDA multi-level parallel algorithms are implemented for the lattice Boltzmann method. The results show good speedup; the proposed method of tuning the load balance between CPU and GPU through a workload-ratio parameter provides a useful reference for multi-level parallel processing and effective resource utilization on heterogeneous platforms.
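The workload-ratio idea can be sketched as follows (our illustration with a hypothetical ratio parameter, not the paper's code): a fraction of the lattice rows goes to the GPU and the remainder to the CPU threads, with the ratio tuned per platform.

```cuda
// Illustrative CPU/GPU load split controlled by a workload-ratio parameter:
// a fraction `ratio` of the lattice rows is assigned to the GPU and the rest
// to OpenMP threads on the CPU. The ratio would be tuned for each platform.
typedef struct { int gpu_rows; int cpu_rows; } split_t;

split_t split_rows(int total_rows, double ratio /* 0.0 .. 1.0, GPU share */) {
    split_t s;
    s.gpu_rows = (int)(total_rows * ratio + 0.5);  /* round to nearest row */
    if (s.gpu_rows > total_rows) s.gpu_rows = total_rows;
    s.cpu_rows = total_rows - s.gpu_rows;
    return s;
}
/* Example: split_rows(1024, 0.8) gives 819 rows to the GPU and 205 to the CPU. */
```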

12.
The numerical solution of two-layer shallow water systems is required to accurately simulate stratified fluids, which are ubiquitous in nature: they appear in atmospheric flows, ocean currents, oil spills, etc. Moreover, the implementation of the numerical schemes to solve these models in realistic scenarios imposes huge demands on computing power. In this paper, we tackle the acceleration of these simulations on triangular meshes by exploiting the combined power of several CUDA-enabled GPUs in a GPU cluster. For that purpose, an improvement of a path-conservative Roe-type finite volume scheme which is specially suitable for GPU implementation is presented, and a distributed implementation of this scheme which uses CUDA and MPI to exploit the potential of a GPU cluster is developed. This implementation overlaps MPI communication with CPU-GPU memory transfers and GPU computation to increase efficiency. Several numerical experiments, performed on a cluster of modern CUDA-enabled GPUs, show the efficiency of the distributed solver.
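The overlap the authors describe typically follows the pattern sketched below (a simplified stand-in with our own kernels and layout, not the paper's finite volume solver): the interior update is launched on one stream while the boundary values are copied out, exchanged over MPI, and copied back on another.

```cuda
// Illustrative overlap of MPI halo exchange with GPU computation. The kernels
// are trivial stand-ins for the real finite-volume update; h_send/h_recv should
// ideally be pinned host buffers, and d_halo is a preallocated device buffer.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void update_interior(double *u, int nx, int ny) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* stand-in for the real update */
    if (i >= ny && i < (nx - 1) * ny) u[i] = 0.5 * u[i];
}
__global__ void update_boundary(double *u, const double *halo, int ny) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;   /* uses the received halo row  */
    if (j < ny) u[j] = 0.5 * (u[j] + halo[j]);
}

void exchange_and_compute(double *d_u, double *h_send, double *h_recv, double *d_halo,
                          int nx, int ny, int left, int right,
                          cudaStream_t s_int, cudaStream_t s_bnd) {
    MPI_Request req[2];

    /* 1. Launch the interior update; it does not depend on halo data. */
    update_interior<<<(nx * ny + 255) / 256, 256, 0, s_int>>>(d_u, nx, ny);

    /* 2. Copy the outgoing boundary row to the host on the boundary stream. */
    cudaMemcpyAsync(h_send, d_u, ny * sizeof(double), cudaMemcpyDeviceToHost, s_bnd);
    cudaStreamSynchronize(s_bnd);

    /* 3. Exchange halos with neighbouring ranks while the interior still runs. */
    MPI_Irecv(h_recv, ny, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(h_send, ny, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    /* 4. Push the received halo back to the device and finish the boundary. */
    cudaMemcpyAsync(d_halo, h_recv, ny * sizeof(double), cudaMemcpyHostToDevice, s_bnd);
    update_boundary<<<(ny + 255) / 256, 256, 0, s_bnd>>>(d_u, d_halo, ny);
    cudaDeviceSynchronize();
}
```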

13.
MPI (Message Passing Interface) was designed for large, node-dense computing clusters; however, with the emergence of MPI+CUDA (Compute Unified Device Architecture) applications and of clusters whose compute nodes carry GPUs, traditional MPI-style communication libraries can no longer keep up. Machine learning faces the same challenge: in deep learning frameworks such as Caffe and CNTK (Microsoft Cognitive Toolkit), GPUs buffer huge volumes of data during training, and most training optimization algorithms are iterative, so inter-GPU communication is both large and frequent; this has become one of the main factors limiting deep learning training performance. Collective communication libraries such as NCCL (NVIDIA Collective multi-GPU Communication Library) address deep learning communication, but they have issues such as incompatibility with MPI. It is therefore important to design a more efficient communication acceleration mechanism that matches current trends. To address these challenges, this paper proposes two new broadcast mechanisms: (1) a pipelined chain (PC) mechanism based on MPI_Bcast that provides efficient intra-node and inter-node communication for GPU buffers, and (2) a topology-aware pipelined chain (TA-PC) mechanism for multi-GPU clusters that fully exploits the available PCIe links between the GPUs of a node. To validate the new broadcast designs, experiments were run on three differently configured GPU clusters: a GPU-dense cluster RX1, a node-dense cluster RX2, and a balanced cluster RX3. Compared with MPI+NCCL1 MPI_Bcast, the new designs achieve roughly 14x and 16.6x improvements for intra-node and inter-node communication respectively; compared with NCCL2, they achieve roughly 10x improvements for small and medium messages and comparable performance for large messages, and the TA-PC design is about 50% faster than the PC design on a 64-GPU cluster. The experimental results show that the proposed solutions have clear advantages in both portability and performance.
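The pipelined chain (PC) idea can be sketched as follows (a simplified host-memory illustration, not the paper's GPU-aware implementation): the broadcast buffer is cut into segments, and each rank forwards segment k to its successor with a non-blocking send so it can start receiving segment k+1 immediately, keeping every link of the chain busy.

```cuda
// Simplified sketch of a pipelined chain broadcast over host buffers.
// Rank 0 is assumed to be the broadcast root; the chain order is 0 -> 1 -> ...
#include <mpi.h>
#include <stdlib.h>

void chain_bcast(double *buf, long count, int segsize, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int nseg = (int)((count + segsize - 1) / segsize);
    MPI_Request *reqs = (MPI_Request *)malloc(nseg * sizeof(MPI_Request));
    int nsend = 0;

    for (int s = 0; s < nseg; ++s) {
        long off = (long)s * segsize;
        int len = (int)((off + segsize <= count) ? segsize : count - off);
        if (rank > 0)            /* receive segment s from the predecessor          */
            MPI_Recv(buf + off, len, MPI_DOUBLE, rank - 1, s, comm, MPI_STATUS_IGNORE);
        if (rank + 1 < size)     /* forward it immediately with a non-blocking send */
            MPI_Isend(buf + off, len, MPI_DOUBLE, rank + 1, s, comm, &reqs[nsend++]);
    }
    MPI_Waitall(nsend, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```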

14.
Hybrid CPU/GPU clusters have recently drawn much attention in high-performance computing because of their excellent execution performance and energy efficiency. Many supercomputing sites in the latest TOP500 and Green500 lists are built from hybrid CPU/GPU clusters instead of CPU clusters. However, the programming complexity of hybrid CPU/GPU clusters is so high that most users hesitate to move toward this new cluster computing platform. To resolve this problem, this paper proposes a distributed PTX virtual machine called BigGPU for heterogeneous clusters. As the name suggests, this virtual machine is physically a distributed system aimed at re-compiling and executing PTX code in parallel by aggregating the CPUs and GPUs available in a computational cluster. With the support of this virtual machine, users can regard a hybrid CPU/GPU cluster as a single large-scale GPU. Consequently, they can develop applications using only CUDA, without combining MPI and multithreading APIs, while simultaneously using distributed CPUs and GPUs to solve the same problem. Moreover, they need not handle load balancing among heterogeneous processors or the device-memory and thread-configuration constraints of physical GPUs, because BigGPU supports a large-scale virtual device memory space and thread configuration. This paper also evaluates the execution performance of BigGPU. The experimental results show that BigGPU can indeed effectively exploit the computational power of CPUs and GPUs to enhance the execution performance of users' CUDA programs.

15.
To improve the parallel efficiency of large-scale parallel computing, to exploit the respective strengths of CPUs and GPUs, and in particular to bring out the powerful computing capability of GPUs, this work proposes connecting a group of GPUs with the Message Passing Interface (MPI) and combining general-purpose GPU computing with the lattice Boltzmann method (LBM) from computational fluid dynamics. Based on the principles of GPGPU computing and the LBM, MPI is used as the mechanism for distributing work and CUDA (Compute Unified Device Architecture) as the main computation engine; a CUDA-enabled GPU cluster is built, on which the D2Q9 LBM model is used for two-dimensional lid-driven cavity flow simulations. Experimental results show that the GPU-group simulation agrees with the CPU simulation, makes fuller use of the GPUs' computing power, and improves parallel efficiency.

16.
Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters   (Cited by: 2; self-citations: 0; other citations: 2)
Nowadays, NVIDIA's CUDA is a general-purpose scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to one MPI process are processed in parallel by CUDA and run on the processor cores of the same computational node.

17.
This paper studies a CUDA-MPI algorithm for the fast Fourier transform on hexagonal domains (FFTH) and its implementation. First, by fully exploiting CUDA's hierarchical parallelism and its library functions, we design an efficient CUDA algorithm for FFTH. For double-precision complex data of size 3×2048², our CUDA program achieves a 12x speedup over the serial CPU program, and up to 40x if data transfers between host and device memory are excluded; its computational efficiency is essentially the same as that of the 2D square-domain FFT provided by cuFFT. On this basis, by studying transpose and reordering algorithms for distributed parallel data on GPUs, we design an optimized CUDA-MPI algorithm for FFTH. For a problem of size 3×8192² on 10 nodes with 6 GPUs each, our CUDA-MPI program achieves a 55x speedup over the serial CPU program; its efficiency is much higher than that of MPI-parallel FFTW and of a square-domain parallel FFT based on local cuFFT computation with FFTW parallel transposes. The study and testing of the CUDA-MPI FFTH algorithm provide a reference for exploring scalable new algorithms on large CPU+GPU heterogeneous computer systems.
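The hexagonal-domain transform itself is not reproduced here; as a point of reference, the square-domain building block that the abstract compares against, a double-precision 2D complex-to-complex FFT with cuFFT, is used roughly as follows (illustrative wrapper, error handling reduced to return codes):

```cuda
// Reference usage of cuFFT for a double-precision 2D complex-to-complex FFT on
// a square domain (the square-domain baseline the abstract compares against);
// the hexagonal-domain FFTH algorithm itself is not reproduced here.
#include <cuda_runtime.h>
#include <cufft.h>

int fft2d_forward(cufftDoubleComplex *d_data, int nx, int ny) {
    cufftHandle plan;
    if (cufftPlan2d(&plan, nx, ny, CUFFT_Z2Z) != CUFFT_SUCCESS) return -1;
    /* In-place forward transform on device data laid out row-major (nx x ny). */
    if (cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD) != CUFFT_SUCCESS) return -1;
    cudaDeviceSynchronize();
    cufftDestroy(plan);
    return 0;
}
```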

18.
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.

Program summary

Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel computing clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell processors, and 1-32 NVIDIA GPUs.
RAM: Tested on problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of shallow water equations using a high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides 3 implementations of a high-resolution 2D shallow water equation solver on regular Cartesian grids, for CPU, Cell processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.

19.
Metagenomic gene clustering is a new way of screening for disease-related genes; it relies on massive sequencing data, effective clustering algorithms, and powerful computers. Computing the correlation coefficient matrix is a required step before clustering and accounts for a large fraction of the total computation. For one gene catalogue with 1300 samples and about a million genes per sample, a single-threaded run would take 27 years. Fully exploiting multi-core CPUs, harnessing the computing power of GPU accelerator cards, and scaling the program out to multi-node clusters is therefore important and urgent work. After careful analysis of the algorithm, efficient implementations were first produced for a single CPU node and a single GPU card, achieving nearly ideal speedups; cache optimization then improved performance further; finally, a load-balancing scheme was used to distribute computing tasks among MPI processes, achieving good scaling. Compared with the unoptimized single-threaded program, 16 CPU nodes achieved a 238.8x speedup and 6 GPU cards achieved a 263.8x speedup.
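A sketch of how the correlation-matrix computation distributes (our illustration, not the paper's cache-optimized kernels): gene rows are dealt out round-robin to MPI ranks, and each rank computes the Pearson correlations of its rows against all genes with OpenMP threads.

```cuda
// Illustrative distribution of the correlation-matrix computation: gene rows are
// dealt out round-robin to MPI ranks (rank/nranks come from MPI_Comm_rank/size),
// and each rank fills its rows in parallel with OpenMP. Gathering the owned rows
// to rank 0 (e.g. with MPI_Gatherv) is omitted.
#include <omp.h>
#include <math.h>
#include <stddef.h>

/* Pearson correlation of two gene-abundance vectors of length n. */
double pearson(const double *x, const double *y, int n) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n, vy = syy - sy * sy / n;
    return cov / sqrt(vx * vy);
}

void correlate_rows(const double *genes /* [ngenes][nsamples], row-major */,
                    int ngenes, int nsamples, int rank, int nranks,
                    double *corr /* [ngenes][ngenes], only owned rows filled */) {
    for (int g1 = rank; g1 < ngenes; g1 += nranks) {    /* round-robin rows per rank */
        #pragma omp parallel for schedule(dynamic)
        for (int g2 = 0; g2 < ngenes; ++g2)
            corr[(size_t)g1 * ngenes + g2] =
                pearson(genes + (size_t)g1 * nsamples,
                        genes + (size_t)g2 * nsamples, nsamples);
    }
}
```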
