Similar Literature
20 similar records found
1.
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.

Program summary

Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs.
RAM: Tested on problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.
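To make the structured-grid, explicit-time-integration approach above concrete, the following is a minimal CUDA sketch of one explicit update step for the 1D shallow water equations using a simple Lax-Friedrichs scheme. It is illustrative only: SWsolver uses a high-resolution 2D scheme and MPI halo exchanges between ranks, both omitted here, and all names are hypothetical.

// Minimal sketch (not SWsolver itself): one explicit Lax-Friedrichs step for the
// 1D shallow water equations on a regular grid. In the cluster codes described
// above, each MPI rank would own a sub-grid and exchange ghost cells before
// launching a kernel of this kind; the halo exchange is omitted here.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void lax_friedrichs_step(const double* h, const double* hu,
                                    double* h_new, double* hu_new,
                                    int n, double dtdx, double g)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 1 || i >= n - 1) return;             // interior cells only

    // Fluxes F(U) = (hu, hu^2/h + 0.5*g*h^2) at the two neighbours.
    double fL_h  = hu[i - 1];
    double fL_hu = hu[i - 1] * hu[i - 1] / h[i - 1] + 0.5 * g * h[i - 1] * h[i - 1];
    double fR_h  = hu[i + 1];
    double fR_hu = hu[i + 1] * hu[i + 1] / h[i + 1] + 0.5 * g * h[i + 1] * h[i + 1];

    h_new[i]  = 0.5 * (h[i - 1] + h[i + 1])   - 0.5 * dtdx * (fR_h  - fL_h);
    hu_new[i] = 0.5 * (hu[i - 1] + hu[i + 1]) - 0.5 * dtdx * (fR_hu - fL_hu);
}

int main()
{
    const int n = 1 << 20;
    const double g = 9.81, dx = 1.0, dt = 0.01;
    double *h, *hu, *h_new, *hu_new;
    cudaMallocManaged(&h, n * sizeof(double));
    cudaMallocManaged(&hu, n * sizeof(double));
    cudaMallocManaged(&h_new, n * sizeof(double));
    cudaMallocManaged(&hu_new, n * sizeof(double));

    for (int i = 0; i < n; ++i) {                // dam-break initial condition
        h[i] = (i < n / 2) ? 2.0 : 1.0;
        hu[i] = 0.0;
    }
    lax_friedrichs_step<<<(n + 255) / 256, 256>>>(h, hu, h_new, hu_new, n, dt / dx, g);
    cudaDeviceSynchronize();
    printf("h at midpoint after one step: %f\n", h_new[n / 2]);
    return 0;
}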

2.
We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high level programming language (Python) to achieve state of the art performance, while allowing easy experimentation with different LBM models and tuning for various types of hardware. We discuss the general design principles of the code, scaling to multiple GPUs in a distributed environment, as well as the GPU implementation and optimization of many different LBM models, both single component (BGK, MRT, ELBM) and multicomponent (Shan–Chen, free energy). The paper also presents results of performance benchmarks spanning the last three NVIDIA GPU generations (Tesla, Fermi, Kepler), which we hope will be useful for researchers working with this type of hardware and similar codes.

3.
Graphics processing units (GPU) have taken an important role in the general purpose computing market in recent years. At present, the common approach to programming GPU units is to write GPU specific cod...

4.
This work explores the performance of single- and multi-GPU computing on state-of-the-art NVIDIA- and AMD-based server-class hardware using various programming interfaces to accelerate a real-world scientific application for solidification modeling based on the phase-field method. The main computations of this memory-bound application correspond to 20 stencils computed across grid nodes. We investigate the application's scalability for two basic schemes of organizing computation: without and with hiding data transfers behind computation, combined with using either peer-to-peer inter-GPU data transfers through NVIDIA NVLink and AMD Infinity interconnects or communication over PCIe and main memory. The studied programming interfaces are CUDA, HIP, and the OpenMP Accelerator Model. While the first two are designed to write codes for a specific hardware platform, OpenMP enables code portability between NVIDIA and AMD GPUs. The resulting performance is experimentally assessed on computing platforms containing NVIDIA V100 (up to 8 GPUs) and A100 (one GPU), as well as AMD MI210 (one device) and MI250 (up to 8 logical GPUs).
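The "hiding data transfers behind computation" scheme mentioned above can be sketched in CUDA as follows: the interior of a sub-domain is updated in one stream while halo data travels to a neighbouring GPU in another stream via a peer-to-peer copy. This is a hedged, minimal sketch with a toy stencil and assumed names, not the paper's solidification code; on AMD hardware the analogous HIP calls would be used.

// Illustrative sketch of overlapping halo transfers with computation (not the
// paper's code): the interior update runs in one CUDA stream while one halo cell
// is copied to a neighbouring GPU in another stream. The subsequent boundary
// update after the transfer completes is omitted. Requires at least two GPUs.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void update_interior(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) out[i] = 0.5f * (in[i - 1] + in[i + 1]);   // toy stencil
}

int main()
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev < 2) { printf("need 2 GPUs for the peer-to-peer path\n"); return 0; }

    const int n = 1 << 22, halo = 1;
    float *u0_in, *u0_out, *u1_ghost;
    cudaSetDevice(0);
    cudaMalloc(&u0_in,  n * sizeof(float));
    cudaMalloc(&u0_out, n * sizeof(float));
    cudaSetDevice(1);
    cudaMalloc(&u1_ghost, halo * sizeof(float));

    // Enable direct GPU-to-GPU access (NVLink or PCIe, whichever is available).
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);

    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);

    // Overlap: interior update on 'compute' while the halo cell travels on 'comm'.
    update_interior<<<(n + 255) / 256, 256, 0, compute>>>(u0_in, u0_out, n);
    cudaMemcpyPeerAsync(u1_ghost, 1,            // destination: ghost cell on GPU 1
                        u0_in + n - halo, 0,    // source: last cell of GPU 0's array
                        halo * sizeof(float), comm);

    cudaStreamSynchronize(compute);
    cudaStreamSynchronize(comm);
    printf("overlapped step done\n");
    return 0;
}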

5.
The Stochastic Simulation Algorithm (SSA) developed by Gillespie provides a powerful mechanism for exploring the behavior of chemical systems with small species populations or with important noise contributions. Gene circuit simulations for systems biology commonly employ the SSA method, as do ecological applications. This algorithm tends to be computationally expensive, so researchers seek an efficient implementation of SSA. In this program package, the Accelerated Exact Stochastic Simulation Algorithm (AESS) contains optimized implementations of Gillespie's SSA that improve the performance of individual simulation runs or ensembles of simulations used for sweeping parameters or to provide statistically significant results.

Program summary

Program title: AESS
Catalogue identifier: AEJW_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEJW_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: University of Tennessee copyright agreement
No. of lines in distributed program, including test data, etc.: 10 861
No. of bytes in distributed program, including test data, etc.: 394 631
Distribution format: tar.gz
Programming language: C for processors, CUDA for NVIDIA GPUs
Computer: Developed and tested on various x86 computers and NVIDIA C1060 Tesla and GTX 480 Fermi GPUs. The system targets x86 workstations, optionally with multicore processors or NVIDIA GPUs as accelerators.
Operating system: Tested under Ubuntu Linux OS and CentOS 5.5 Linux OS
Classification: 3, 16.12
Nature of problem: Simulation of chemical systems, particularly with low species populations, can be accurately performed using Gillespie's method of stochastic simulation. Numerous variations on the original stochastic simulation algorithm have been developed, including approaches that produce results with statistics that exactly match the chemical master equation (CME) as well as other approaches that approximate the CME.
Solution method: The Accelerated Exact Stochastic Simulation (AESS) tool provides implementations of a wide variety of popular variations on the Gillespie method. Users can select the specific algorithm considered most appropriate. Comparisons between the methods and with other available implementations indicate that AESS provides the fastest known implementation of Gillespie's method for a variety of test models. Users may wish to execute ensembles of simulations to sweep parameters or to obtain better statistical results, so AESS supports acceleration of ensembles of simulations using parallel processing with MPI, SSE vector units on x86 processors, and/or NVIDIA GPUs with CUDA.
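As a rough illustration of the ensemble parallelism AESS exploits, the sketch below runs Gillespie's direct method with one CUDA thread per independent realization of a toy reversible isomerization A <-> B. The model, kernel, and parameter names are hypothetical; this is not AESS code.

// Sketch of Gillespie's direct method with one CUDA thread per independent
// realization, the kind of ensemble run AESS accelerates. Toy model: A <-> B.
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void ssa_direct(unsigned long long seed, int n_real, double t_end,
                           int a0, int b0, double k1, double k2, int* a_final)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_real) return;

    curandState rng;
    curand_init(seed, tid, 0, &rng);

    int a = a0, b = b0;
    double t = 0.0;
    while (t < t_end) {
        double p1 = k1 * a;                   // propensity of A -> B
        double p2 = k2 * b;                   // propensity of B -> A
        double ptot = p1 + p2;
        if (ptot <= 0.0) break;

        t += -log(curand_uniform_double(&rng)) / ptot;   // exponential waiting time
        if (t >= t_end) break;                           // do not fire past t_end
        if (curand_uniform_double(&rng) * ptot < p1) { --a; ++b; }
        else                                         { ++a; --b; }
    }
    a_final[tid] = a;
}

int main()
{
    const int n_real = 1 << 16;
    int* a_final;
    cudaMallocManaged(&a_final, n_real * sizeof(int));

    ssa_direct<<<(n_real + 127) / 128, 128>>>(1234ULL, n_real, 10.0,
                                              100, 0, 1.0, 0.5, a_final);
    cudaDeviceSynchronize();

    long long sum = 0;
    for (int i = 0; i < n_real; ++i) sum += a_final[i];   // ensemble statistics
    printf("mean A after t=10: %f\n", (double)sum / n_real);
    return 0;
}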

6.
We present HONEI, an open-source collection of libraries offering a hardware-oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI's libraries, we achieve a two-fold speedup over straightforward C++ code using HONEI's SSE backend, and an additional 3–4 and 4–16 times faster execution on the Cell and a GPU. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development.

Program summary

Program title: HONEI
Catalogue identifier: AEDW_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDW_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPLv2
No. of lines in distributed program, including test data, etc.: 216 180
No. of bytes in distributed program, including test data, etc.: 1 270 140
Distribution format: tar.gz
Programming language: C++
Computer: x86, x86_64, NVIDIA CUDA GPUs, Cell blades and PlayStation 3
Operating system: Linux
RAM: at least 500 MB free
Classification: 4.8, 4.3, 6.1
External routines: SSE: none; [1] for GPU, [2] for Cell backend
Nature of problem: Computational science in general and numerical simulation in particular have reached a turning point. The revolution developers are facing is not primarily driven by a change in (problem-specific) methodology, but rather by the fundamental paradigm shift of the underlying hardware towards heterogeneity and parallelism. This is particularly relevant for data-intensive problems stemming from discretisations with local support, such as finite differences, volumes and elements.
Solution method: To address these issues, we present a hardware-aware collection of libraries combining the advantages of modern software techniques and hardware-oriented programming. Applications built on top of these libraries can be configured trivially to execute on CPUs, GPUs or the Cell processor. In order to evaluate the performance and accuracy of our approach, we provide two domain-specific applications: a multigrid solver for the Poisson problem and a fully explicit solver for 2D shallow water equations.
Restrictions: HONEI is actively being developed, and its feature list is continuously expanded. Not all combinations of operations and architectures might be supported in earlier versions of the code. Obtaining snapshots from http://www.honei.org is recommended.
Unusual features: The considered applications as well as all library operations can be run on NVIDIA GPUs and the Cell BE.
Running time: Depending on the application and the input sizes. The Poisson solver executes in a few seconds, while the SWE solver requires up to 5 minutes for large spatial discretisations or small timesteps.
References:
[1] http://www.nvidia.com/cuda
[2] http://www.ibm.com/developerworks/power/cell

7.
The application of graphics processing units (GPU) to solve partial differential equations is gaining popularity with the advent of improved computer hardware. Various lower level interfaces exist that allow the user to access GPU specific functions. One such interface is NVIDIA's Compute Unified Device Architecture (CUDA) library. However, porting existing codes to run on the GPU requires the user to write kernels that execute on multiple cores, in the form of Single Instruction Multiple Data (SIMD). In the present work, a higher level framework, termed CU++, has been developed that uses object oriented programming techniques available in C++ such as polymorphism, operator overloading, and template metaprogramming. Using this approach, CUDA kernels can be generated automatically at compile time. Briefly, CU++ allows a code developer with just C/C++ knowledge to write computer programs that will execute on the GPU without any knowledge of specific programming techniques in CUDA. This approach is tremendously beneficial for Computational Fluid Dynamics (CFD) code development because it mitigates the necessity of creating hundreds of GPU kernels for various purposes. In its current form, CU++ provides a framework for parallel array arithmetic, simplified data structures to interface with the GPU, and smart array indexing. An implementation of heterogeneous parallelism, i.e., utilizing multiple GPUs to simultaneously process a partitioned grid system with communication at the interfaces using the Message Passing Interface (MPI), has been developed and tested.
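The core idea behind CU++, generating element-wise GPU kernels from overloaded array expressions, can be illustrated with a small expression-template sketch in CUDA C++. The types and functions below (Leaf, Add, Mul, assign) are hypothetical stand-ins, not CU++'s actual API.

// Minimal illustration of the technique CU++ builds on: operator overloading and
// templates let an array expression like a + b * c compile into a single
// element-wise CUDA kernel, so the user never writes the kernel by hand.
#include <cstdio>
#include <cuda_runtime.h>

struct Array { float* d; int n; };                     // device array handle

struct Leaf { const float* d; __device__ float operator()(int i) const { return d[i]; } };

template <typename L, typename R>
struct Add { L l; R r; __device__ float operator()(int i) const { return l(i) + r(i); } };

template <typename L, typename R>
struct Mul { L l; R r; __device__ float operator()(int i) const { return l(i) * r(i); } };

// Overloads restricted to expression types so ordinary arithmetic is unaffected.
Leaf ref(const Array& a) { return Leaf{a.d}; }
template <typename R> Add<Leaf, R> operator+(Leaf l, R r) { return {l, r}; }
Mul<Leaf, Leaf> operator*(Leaf l, Leaf r) { return {l, r}; }

template <typename Expr>
__global__ void eval_kernel(float* out, Expr e, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = e(i);                          // generated element-wise body
}

// "Assignment" triggers the kernel launch, mimicking lhs = rhs in such frameworks.
template <typename Expr>
void assign(Array& lhs, Expr rhs)
{
    eval_kernel<<<(lhs.n + 255) / 256, 256>>>(lhs.d, rhs, lhs.n);
}

int main()
{
    const int n = 1 << 20;
    Array a{nullptr, n}, b{nullptr, n}, c{nullptr, n}, r{nullptr, n};
    cudaMallocManaged(&a.d, n * sizeof(float));
    cudaMallocManaged(&b.d, n * sizeof(float));
    cudaMallocManaged(&c.d, n * sizeof(float));
    cudaMallocManaged(&r.d, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a.d[i] = 1.f; b.d[i] = 2.f; c.d[i] = 3.f; }

    assign(r, ref(a) + ref(b) * ref(c));               // r = a + b * c, on the GPU
    cudaDeviceSynchronize();
    printf("r[0] = %f\n", r.d[0]);                     // expect 7.0
    return 0;
}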

8.
GROMACS is a widely used open-source molecular dynamics simulation package that currently relies mainly on CUDA to accelerate computations on NVIDIA GPUs. ROCm is an open-source high-performance heterogeneous computing platform. Using the HIP programming language on the ROCm platform, we present the first complete port of the GROMACS 2020 series to ROCm. On an MI50 GPU, taking a complex ionic-liquid simulation case as the target workload, the ported code was profiled with the GPU performance analysis tool rocprof. Targeting the hardware characteristics of the MI50, the bonded-force kernels, the PME kernels for electrostatics, and the short-range non-bonded kernels were optimized in turn. After optimization, overall performance on the target case is about 2.8 times that of the initial port, 60.5% higher than GROMACS's original OpenCL code on the MI50, and about 2.7 times faster than the pure CPU version. In single-node tests on two other representative cases and a multi-node scalability test on the ionic-liquid case, the optimized code also achieves good performance improvements, indicating that the adopted optimizations are reasonably general.

9.
OpenCL is a general-purpose programming framework for heterogeneous computing platforms. However, because hardware architectures differ, achieving performance portability on top of functional portability across platforms remains an open research problem. Existing optimization studies usually target a single hardware platform and rarely run efficiently on other platforms. Based on an analysis of the underlying hardware architectures of different GPU platforms, this work examines how different optimization methods affect performance on different GPU hardware from several angles, including global memory access efficiency, effective utilization of GPU compute resources, and hardware resource limits, and on this basis implements an OpenCL-based Laplacian image enhancement algorithm. Experimental results show that, excluding data transfer time, the optimized algorithm achieves speedups of 3.7x to 136.1x (56.7x on average) on both AMD and NVIDIA GPUs, and the optimized kernels outperform the corresponding functions in the NVIDIA NPP library by 12.3% to 346.7% (143.1% on average), confirming the effectiveness and performance portability of the proposed optimization methods.

10.
Two classic recommender-system algorithms, Alternating Least Squares (ALS) and Cyclic Coordinate Descent (CCD), are designed and implemented using the OpenCL standard. They are run on CPU, GPU, and MIC multi-core and many-core platforms, and the factors affecting performance on these platforms, namely the number of latent features and the number of threads, are explored. The OpenCL implementations of the two algorithms are also compared with CUDA and OpenMP implementations, leading to a series of conclusions. Under the same conditions, CCD is more accurate and converges faster and more stably than ALS, but takes longer per run. The OpenCL implementations of ALS and CCD perform no worse than their CUDA counterparts (speedups of 1.03x for CCD and 1.2x for ALS) or their OpenMP counterparts (speedups of roughly 1.6-1.7x for both CCD and ALS), and both algorithms perform better on the CPU platform than on the GPU and MIC.

11.
Graphics processing units (GPUs), as stream-architecture processors, are now widely used for general-purpose high-performance computing rather than only for graphics. NVIDIA's CUDA and AMD's Stream SDK are currently the most popular stream programming environments for general-purpose GPU computing (GPGPU). However, they have their own shortcomings and limitations, chief among them the lack of binary compatibility across different GPUs and the high cost of rewriting existing program source code. By applying binary analysis and dynamic binary translation techniques, we implement an automated execution framework, GxBit, which provides a way to extract stream patterns from x86 binaries and map them to the NVIDIA CUDA programming environment. The framework has been validated on several programs from the CUDA SDK samples and the Parboil Benchmark Suite, achieving an average performance improvement of more than 10x.

12.
Open Computing Language (OpenCL) is a parallel processing language that is ideally suited for running parallel algorithms on Graphical Processing Units (GPUs). In the present work we report on the development of a generic parallel single-GPU code for the numerical solution of a system of first-order ordinary differential equations (ODEs) based on the OpenCL model. We have applied the code in the case of the Time-Dependent Schrödinger Equation of atomic hydrogen in a strong laser field and studied its performance on NVIDIA and AMD GPUs against the serial performance on a CPU. We found excellent scalability and a significant speedup of the GPU over the CPU device. The speedup in the benchmark tended towards a value of about 40, with significant speedups expected against multi-core CPUs. Furthermore, though we do not present the detailed benchmarks here, we have also achieved speedup values of around 75 by performing a slight optimization of the described algorithm.

13.
Algorithmic skeletons simplify software development: they abstract typical patterns of parallelism and provide their efficient implementations, allowing the application developer to focus on the structure of algorithms rather than on implementation details. This becomes especially important for modern parallel systems with multiple graphics processing units (GPUs), whose programming is complex and error-prone because state-of-the-art programming approaches like CUDA and OpenCL lack high-level abstractions. We define a new algorithmic skeleton for allpairs computations which occur in real-world applications ranging from bioinformatics to physics. We develop the skeleton's generic parallel implementation for multi-GPU systems in OpenCL. To enable the automatic use of the fast GPU memory, we identify and implement an optimized version of the allpairs skeleton with a customizing function that follows a certain memory access pattern. We use matrix multiplication as an application study for the allpairs skeleton and its two implementations and demonstrate that the skeleton greatly simplifies programming, saving up to 90% of lines of code as compared to OpenCL. The performance of our optimized implementation is up to 6.8 times higher than that of the generic implementation and is competitive with the performance of manually written, optimized OpenCL code.
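The memory access pattern that the optimized allpairs skeleton automates corresponds, for matrix multiplication, to staging tiles of the inputs in fast on-chip memory and reusing them across a work-group. The following generic CUDA tiled matrix multiplication (not the paper's OpenCL code) shows that pattern written by hand.

// Plain CUDA illustration of the memory access pattern the optimized allpairs
// skeleton automates: tiles of the two input matrices are staged in shared
// memory and reused by all threads of a block. Generic tiled matrix multiply.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16

__global__ void matmul_tiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of the A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)           // reuse the staged tiles
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

int main()
{
    const int n = 512;                           // multiple of TILE for simplicity
    float *A, *B, *C;
    cudaMallocManaged(&A, n * n * sizeof(float));
    cudaMallocManaged(&B, n * n * sizeof(float));
    cudaMallocManaged(&C, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    matmul_tiled<<<grid, block>>>(A, B, C, n);
    cudaDeviceSynchronize();
    printf("C[0] = %f (expected %f)\n", C[0], 2.0f * n);
    return 0;
}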

14.
We report on our implementation of the RHMC algorithm for the simulation of lattice QCD with two staggered flavors on Graphics Processing Units, using the NVIDIA CUDA programming language. The main feature of our code is that the GPU is not used just as an accelerator, but instead the whole Molecular Dynamics trajectory is performed on it. After pointing out the main bottlenecks and how to circumvent them, we discuss the obtained performances. We present some preliminary results regarding OpenCL and multiGPU extensions of our code and discuss future perspectives.

15.
Numerical integration of stochastic differential equations is commonly used in many branches of science. In this paper we present how to accelerate this kind of numerical calculation with popular NVIDIA Graphics Processing Units using the CUDA programming environment. We address general aspects of numerical programming on stream processors and illustrate them with two examples: the noisy phase dynamics in a Josephson junction and the noisy Kuramoto model. In the presented cases the measured speedup can be as high as 675× compared to a typical CPU, which corresponds to several billion integration steps per second. This means that calculations which took weeks can now be completed in less than one hour. This brings stochastic simulation to a completely new level, opening up for research a whole new range of problems which can now be solved interactively.

Program summary

Program title: SDE
Catalogue identifier: AEFG_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEFG_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GNU GPL v3
No. of lines in distributed program, including test data, etc.: 978
No. of bytes in distributed program, including test data, etc.: 5905
Distribution format: tar.gz
Programming language: CUDA C
Computer: any system with a CUDA-compatible GPU
Operating system: Linux
RAM: 64 MB of GPU memory
Classification: 4.3
External routines: The program requires the NVIDIA CUDA Toolkit Version 2.0 or newer and the GNU Scientific Library v1.0 or newer. Optionally gnuplot is recommended for quick visualization of the results.
Nature of problem: Direct numerical integration of stochastic differential equations is a computationally intensive problem, due to the necessity of calculating multiple independent realizations of the system. We exploit the inherent parallelism of this problem and perform the calculations on GPUs using the CUDA programming environment. The GPU's ability to execute hundreds of threads simultaneously makes it possible to speed up the computation by over two orders of magnitude, compared to a typical modern CPU.
Solution method: The stochastic Runge-Kutta method of the second order is applied to integrate the equation of motion. Ensemble-averaged quantities of interest are obtained through averaging over multiple independent realizations of the system.
Unusual features: The numerical solution of the stochastic differential equations in question is performed on a GPU using the CUDA environment.
Running time: < 1 minute
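A minimal CUDA sketch of the solution method named above, second-order stochastic Runge-Kutta (Heun) integration with one thread per realization and ensemble averaging on the host, is given below for a toy overdamped phase equation dx = (a - sin x) dt + sqrt(2D) dW. It assumes cuRAND for noise generation and is not the SDE package itself; all names are illustrative.

// Hedged sketch: stochastic Heun (2nd-order Runge-Kutta) integration with one
// CUDA thread per independent realization of a toy phase equation.
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__device__ double drift(double x, double a) { return a - sin(x); }

__global__ void heun_sde(unsigned long long seed, int n_real, int n_steps,
                         double dt, double a, double D, double* x_out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_real) return;

    curandState rng;
    curand_init(seed, tid, 0, &rng);

    double x = 0.0;
    double g = sqrt(2.0 * D);                        // additive noise amplitude
    for (int s = 0; s < n_steps; ++s) {
        double dW = sqrt(dt) * curand_normal_double(&rng);
        double xp = x + drift(x, a) * dt + g * dW;                // predictor
        x = x + 0.5 * (drift(x, a) + drift(xp, a)) * dt + g * dW; // corrector
    }
    x_out[tid] = x;
}

int main()
{
    const int n_real = 1 << 16, n_steps = 10000;
    double* x_out;
    cudaMallocManaged(&x_out, n_real * sizeof(double));

    heun_sde<<<(n_real + 127) / 128, 128>>>(42ULL, n_real, n_steps,
                                            1e-3, 1.2, 0.1, x_out);
    cudaDeviceSynchronize();

    double mean = 0.0;                               // ensemble average on the host
    for (int i = 0; i < n_real; ++i) mean += x_out[i];
    printf("ensemble mean x(T) = %f\n", mean / n_real);
    return 0;
}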

16.
In this paper, we present the analysis and development of a cross-platform OpenCL implementation of the box-counting algorithm, which is one of the most widely-used methods for estimating the Fractal Dimension. The Fractal Dimension is a relevant image analysis method used in several disciplines, but computing it is in general a time-consuming process, especially when working with 3D images. Unlike parallel programming models that strictly depend on the hardware type and manufacturer, like CUDA, OpenCL allows us to provide an implementation suitable for execution on both GPUs and multi-core CPUs, whatever the hardware manufacturer. Sorting is a key part of the fast box-counting algorithm and the final speedup is highly conditioned by the efficiency of the sorting algorithm used. Our study reveals that current OpenCL implementations of sorting algorithms are clearly slower when compared with both CUDA for GPU and specific multi-core CPU implementations. Our OpenCL algorithm has been specifically optimized according to the type of the target device, and the results show an average speedup of up to 7.46× and 4× when executed on the GPU and the multi-core CPU respectively, both compared with the single-threaded (sequential) CPU implementation.
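The sorting-based box counting described above can be sketched with CUDA and Thrust (the paper's implementation is OpenCL): map each occupied point to the index of its covering box at a given scale, sort the indices, and count the distinct values; repeating over scales gives the data for the log-log fit of the fractal dimension. The point set and functor below are illustrative assumptions.

// CUDA/Thrust sketch of sorting-based box counting: occupied points are mapped to
// box indices at a given scale, the indices are sorted, and unique values counted.
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sort.h>
#include <thrust/unique.h>

struct BoxIndex {
    int grid, box;                                    // image size and box size (pixels)
    __host__ __device__ long long operator()(const int2& p) const {
        long long nb = grid / box;                    // boxes per dimension
        return (long long)(p.y / box) * nb + (p.x / box);
    }
};

int main()
{
    const int grid = 1024;
    // Toy point set: occupied pixels along the diagonal of the image.
    thrust::host_vector<int2> h_pts;
    for (int i = 0; i < grid; ++i) h_pts.push_back(make_int2(i, i));
    thrust::device_vector<int2> pts = h_pts;

    for (int box = 2; box <= 64; box *= 2) {
        thrust::device_vector<long long> idx(pts.size());
        thrust::transform(pts.begin(), pts.end(), idx.begin(), BoxIndex{grid, box});
        thrust::sort(idx.begin(), idx.end());
        long long n_boxes = (long long)(thrust::unique(idx.begin(), idx.end()) - idx.begin());
        // log N(s) versus log(1/s) yields the box-counting dimension estimate.
        printf("box size %2d -> occupied boxes %lld\n", box, n_boxes);
    }
    return 0;
}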

17.
Research on optimizing an OpenCL-based integral image algorithm
The integral image algorithm is widely used in fast feature detection, so accelerating it on GPUs is of significant practical value. However, because GPU hardware architectures are complex and differ between vendors, optimizing the integral image algorithm on a GPU and then achieving performance portability across GPU platforms is very difficult. Based on an analysis of the underlying hardware architectures of different GPU platforms, this work examines how different optimization methods affect performance on different GPU hardware from several angles, including off-chip memory bandwidth utilization, compute resource utilization, and data locality, and on this basis implements an OpenCL-based integral image algorithm. Experimental results show that the optimized algorithm achieves speedups of 11.26x and 12.38x on AMD and NVIDIA GPUs respectively, and the optimized GPU kernels outperform the corresponding functions in the NVIDIA NPP library by 55.01% and 65.17% respectively, confirming the effectiveness and performance portability of the proposed optimization methods.
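For reference, the integral image itself is a two-pass prefix-sum computation: a cumulative sum along every row followed by a cumulative sum down every column. The CUDA sketch below uses one thread per row/column for brevity; it is not the paper's optimized OpenCL kernels.

// CUDA sketch of the standard two-pass integral image computation: row-wise
// prefix sums followed by column-wise prefix sums, one thread per row/column.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scan_rows(const float* in, float* out, int w, int h)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= h) return;
    float sum = 0.0f;
    for (int x = 0; x < w; ++x) {                // cumulative sum along the row
        sum += in[y * w + x];
        out[y * w + x] = sum;
    }
}

__global__ void scan_cols(float* data, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= w) return;
    float sum = 0.0f;
    for (int y = 0; y < h; ++y) {                // cumulative sum down the column
        sum += data[y * w + x];
        data[y * w + x] = sum;
    }
}

int main()
{
    const int w = 1024, h = 768;
    float *img, *integral;
    cudaMallocManaged(&img, w * h * sizeof(float));
    cudaMallocManaged(&integral, w * h * sizeof(float));
    for (int i = 0; i < w * h; ++i) img[i] = 1.0f;   // constant test image

    scan_rows<<<(h + 127) / 128, 128>>>(img, integral, w, h);
    scan_cols<<<(w + 127) / 128, 128>>>(integral, w, h);
    cudaDeviceSynchronize();

    // For a constant image of ones, integral[y*w+x] == (x+1)*(y+1).
    printf("integral at bottom-right: %f (expected %d)\n", integral[w * h - 1], w * h);
    return 0;
}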

18.
Recent developments in graphics processing units (GPUs) provide high-performance general-purpose computing at low cost. The GPU-based CUDA (Compute Unified Device Architecture) and OpenCL (Open Computing Language) programming models give programmers a rich, C-like application programming interface (API) that makes it easy to exploit the GPU's parallel computing power. Using graphics hardware for accelerated computation, and guided by a new GPU processing model, the parallel time-space model, existing N-body implementations on GPUs are analyzed, and a new algorithm for fast N-body simulation on the GPU is proposed and implemented on an AMD HD Radeon 5850. Experimental results show a speedup of roughly 400x over a CPU implementation and 2x to 5x over existing GPU implementations.
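The baseline that such N-body work builds on is the all-pairs acceleration kernel with shared-memory tiling, shown below as a textbook CUDA sketch. It is not the paper's parallel time-space model implementation (which ran on an AMD GPU); all names and parameters are illustrative.

// Standard all-pairs N-body acceleration kernel: each thread owns one body and
// tiles of body positions are staged in shared memory and reused by the block.
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256
#define SOFT2 1e-6f                                  // softening to avoid r -> 0

__global__ void nbody_accel(const float4* pos, float3* acc, int n)
{
    __shared__ float4 tile[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi = pos[min(i, n - 1)];
    float3 ai = make_float3(0.f, 0.f, 0.f);

    for (int base = 0; base < n; base += BLOCK) {
        tile[threadIdx.x] = pos[base + threadIdx.x]; // stage a tile of bodies
        __syncthreads();
        for (int j = 0; j < BLOCK; ++j) {            // interact with the whole tile
            float dx = tile[j].x - pi.x;
            float dy = tile[j].y - pi.y;
            float dz = tile[j].z - pi.z;
            float r2 = dx * dx + dy * dy + dz * dz + SOFT2;
            float inv_r3 = rsqrtf(r2 * r2 * r2);
            float s = tile[j].w * inv_r3;            // w holds the mass
            ai.x += s * dx; ai.y += s * dy; ai.z += s * dz;
        }
        __syncthreads();
    }
    if (i < n) acc[i] = ai;
}

int main()
{
    const int n = 4096;                              // multiple of BLOCK
    float4* pos; float3* acc;
    cudaMallocManaged(&pos, n * sizeof(float4));
    cudaMallocManaged(&acc, n * sizeof(float3));
    for (int i = 0; i < n; ++i)
        pos[i] = make_float4(i % 64, (i / 64) % 64, i / 4096.f, 1.0f);

    nbody_accel<<<n / BLOCK, BLOCK>>>(pos, acc, n);
    cudaDeviceSynchronize();
    printf("acc[0] = (%f, %f, %f)\n", acc[0].x, acc[0].y, acc[0].z);
    return 0;
}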

19.
Research on a high-performance massive data processing platform based on Hadoop
High-performance computing on massive data holds enormous application value, but current cloud computing systems only provide massive data processing capacity, not sufficient high-performance computing power. By fusing the GPU's massive parallel computing capability with cloud computing, a heterogeneous high-performance cloud computing architecture based on CPU/GPU cooperation is proposed. Building on open-source Hadoop, the parts of MapReduce functions that need to be parallelized are marked with annotations. A custom GPU class loader converts the annotated code into CUDA code, which is dynamically compiled and run. The platform integrates the GPU's computing power into the MapReduce framework and can process massive data efficiently.

20.
Fourier methods have revolutionized many fields of science and engineering, such as astronomy, medical imaging, seismology and spectroscopy, and the fast Fourier transform (FFT) is a computationally efficient method of generating a Fourier transform. The emerging class of high performance computing architectures, such as GPUs, seeks to achieve much higher performance and efficiency by exposing a hierarchy of distinct memories to software. However, the complexity of GPU programming poses a significant challenge to developers. In this paper, we propose an automatic performance tuning framework for FFT on various OpenCL GPUs, and implement a high performance library named MPFFT based on this framework. For power-of-two length FFTs, our library substantially outperforms the clAmdFft library on AMD GPUs and achieves comparable performance to the CUFFT library on NVIDIA GPUs. Furthermore, our library also supports non-power-of-two sizes. For 3D non-power-of-two FFTs, our library runs 1.5x to 28x faster than FFTW with 4 threads and achieves a 20.01x average speedup over CUFFT 4.0 on a Tesla C2050.
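The heart of an auto-tuning framework of this kind is a search loop that times candidate kernel configurations on the target device and keeps the fastest. The CUDA sketch below illustrates only that loop with a toy kernel tuned over block size; MPFFT itself is OpenCL and searches FFT-specific parameters (radix, work-group size, and so on), so everything here is a simplified, assumed setup.

// Minimal auto-tuning loop: enumerate candidate launch configurations, time each
// with CUDA events, and keep the fastest. Toy SAXPY kernel, tuned over block size.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int candidates[] = {64, 128, 256, 512, 1024};
    int best_block = 0; float best_ms = 1e30f;

    for (int block : candidates) {
        saxpy<<<(n + block - 1) / block, block>>>(2.0f, x, y, n);   // warm-up
        cudaEventRecord(start);
        saxpy<<<(n + block - 1) / block, block>>>(2.0f, x, y, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block %4d: %.3f ms\n", block, ms);
        if (ms < best_ms) { best_ms = ms; best_block = block; }
    }
    printf("best configuration: block size %d (%.3f ms)\n", best_block, best_ms);
    return 0;
}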
