期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Providing Source Code Level Portability Between CPU and GPU with MapCG

洪春涛陈德颢陈羽北陈文光郑纬民林海波《计算机科学技术学报》2012,27(1):42-56

Graphics processing units (GPU) have taken an important role in the general purpose computing market in recent years.At present,the common approach to programming GPU units is to write GPU specific cod... 相似文献

2.

An OpenCL micro-benchmark suite for GPUs and CPUs

Xin Yan Xiaohua Shi Lina Wang Haiyan Yang 《The Journal of supercomputing》2014,69(2):693-713

Open computing language (OpenCL) is a new industry standard for task-parallel and data-parallel heterogeneous computing on a variety of modern CPUs, GPUs, DSPs, and other microprocessor designs. OpenCL is vendor independent and hence not specialized for any particular compute device. To develop efficient OpenCL applications for the particular platform, we still need a more profound understanding of architecture features on the OpenCL model and computing devices. For this purpose, we design and implement an OpenCL micro-benchmark suite for GPUs and CPUs. In this paper, we introduce the implementations of our OpenCL micro benchmarks, and present the measuring results of hardware and software features like performance of mathematical operations, bus bandwidths, memory architectures, branch synchronizations and scalability, etc., on two multi-core CPUs, i.e. AMD Athlon II X2 250 and Intel Pentium Dual-Core E5400, and two different GPUs, i.e. NVIDIA GeForce GTX 460se and AMD Radeon HD 6850. We also compared the measuring results with existing benchmarks to demonstrate the reasonableness and correctness of our benchmark suite. 相似文献

3.

Implementing molecular dynamics on hybrid high performance computers – short range forces

W. Michael Brown Peng Wang Steven J. Plimpton Arnold N. Tharrington 《Computer Physics Communications》2011,182(4):898-911

The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines – (1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory, (2) minimizing the amount of code that must be ported for efficient acceleration, (3) utilizing the available processing power from both multi-core CPUs and accelerators, and (4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS, however, the methods can be applied in many molecular dynamics codes. Specifically, we describe algorithms for efficient short range force calculation on hybrid high-performance machines. We describe an approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPUs and 180 CPU cores. 相似文献

4.

An application-centric evaluation of OpenCL on multi-core CPUs

Jie Shen Jianbin Fang Henk Sips Ana Lucia Varbanescu 《Parallel Computing》2013

Although designed as a cross-platform parallel programming model, OpenCL remains mainly used for GPU programming. Nevertheless, a large amount of applications are parallelized, implemented, and eventually optimized in OpenCL. Thus, in this paper, we focus on the potential that these parallel applications have to exploit the performance of multi-core CPUs. Specifically, we analyze the method to systematically reuse and adapt the OpenCL code from GPUs to CPUs. We claim that this work is a necessary step for enabling inter-platform performance portability in OpenCL. 相似文献

5.

Implementing molecular dynamics on hybrid high performance computers – Particle–particle particle-mesh

W. Michael Brown Axel Kohlmeyer Steven J. Plimpton Arnold N. Tharrington 《Computer Physics Communications》2012,183(3):449-459

The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with nodes containing more than one type of floating-point processor (e.g. CPU and GPU), are now becoming more prevalent due to these advantages. In this paper, we present a continuation of previous work implementing algorithms for using accelerators into the LAMMPS molecular dynamics software for distributed memory parallel hybrid machines. In our previous work, we focused on acceleration for short-range models with an approach intended to harness the processing power of both the accelerator and (multi-core) CPUs. To augment the existing implementations, we present an efficient implementation of long-range electrostatic force calculation for molecular dynamics. Specifically, we present an implementation of the particle–particle particle-mesh method based on the work by Harvey and De Fabritiis. We present benchmark results on the Keeneland InfiniBand GPU cluster. We provide a performance comparison of the same kernels compiled with both CUDA and OpenCL. We discuss limitations to parallel efficiency and future directions for improving performance on hybrid or heterogeneous computers. 相似文献

6.

Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms

Julien C. Thibault Inanc Senocak 《The Journal of supercomputing》2012,59(2):693-719

Graphics processor units (GPU) that are originally designed for graphics rendering have emerged as massively-parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can accelerate compute-intensive simulation science applications substantially. In this study, we describe the implementation of an incompressible flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels are implemented on different memory spaces on the GPU depending on their arithmetic intensity. The memory hierarchy specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially. 相似文献

7.

OpenCL-based optimization methods for utilizing forward DCT and quantization of image compression on a heterogeneous platform

Nasser Alqudami Shin-Dug Kim 《Journal of Real-Time Image Processing》2016,12(2):219-235

Recent computer systems and handheld devices are equipped with high computing capability, such as general purpose GPUs (GPGPU) and multi-core CPUs. Utilizing such resources for computation has become a general trend, making their availability an important issue for the real-time aspect. Discrete cosine transform (DCT) and quantization are two major operations in image compression standards that require complex computations. In this paper, we develop an efficient parallel implementation of the forward DCT and quantization algorithms for JPEG image compression using Open Computing Language (OpenCL). This OpenCL-based parallel implementation utilizes a multi-core CPU and a GPGPU to perform DCT and quantization computations. We demonstrate the capability of this design via two proposed working scenarios. The proposed approach also applies certain optimization techniques to improve the kernel execution time and data movements. We developed an optimal OpenCL kernel for a particular device using device-based optimization factors, such as thread granularity, work-items mapping, workload allocation, and vector-based memory access. We evaluated the performance in a heterogeneous environment, finding that the proposed parallel implementation was able to speed up the execution time of the DCT and quantization by factors of 7.97 and 8.65, respectively, obtained from 1024 × 1024 and 2084 × 2048 image sizes in 4:4:4 format. 相似文献

8.

An OpenCL implementation for the solution of the time-dependent Schrödinger equation on GPUs and CPUs

Cathal Ó Broin L.A.A. Nikolopoulos 《Computer Physics Communications》2012,183(10):2071-2080

Open Computing Language (OpenCL) is a parallel processing language that is ideally suited for running parallel algorithms on Graphical Processing Units (GPUs). In the present work we report on the development of a generic parallel single-GPU code for the numerical solution of a system of first-order ordinary differential equations (ODEs) based on the OpenCL model. We have applied the code in the case of the Time-Dependent Schrödinger Equation of atomic hydrogen in a strong laser field and studied its performance on NVIDIA and AMD GPUs against the serial performance on a CPU. We found excellent scalability and a significant speedup of the GPU over the CPU device. The speedup in the benchmark tended towards a value of about 40 with significant speedups expected against multi-core CPUs. Furthermore, though we do not present the detailed benchmarks here, we also have achieved speedup values of around 75 by performing a slight optimization of the described algorithm. 相似文献

9.

Vector data flow analysis for SIMD optimizations on OpenCL programs

Yu‐Te Lin Jenq‐Kuen Lee 《Concurrency and Computation》2016,28(5):1629-1654

Multi‐core systems equipped with micro processing units and accelerators such as digital signal processors (DSPs) and graphics processing units (GPUs) have become a major trend in processor design in recent years in attempts to meet ever‐increasing application performance requirements. Open Computing Language (OpenCL) is one of the programming languages that include new extensions proposed to exploit the computing power of these kinds of processors. Among the newly extended language features, the single‐instruction multiple‐data (SIMD) linguistics and vector types are added to OpenCL to exploit hardware features of the accelerators. The addition makes it necessary to consider how traditional compiler data flow analysis can be adopted to meet the optimization requirements of vector linguistics. In this paper, we propose a calculus framework to support the data flow analysis of vector constructs for OpenCL programs that compilers can use to perform SIMD optimizations. We model OpenCL vector operations as data access functions in the style of mathematical functions. We then show that the data flow analysis for OpenCL vector linguistics can be performed based on the data access functions. Based on the information gathered from data flow analysis, we illustrate a set of SIMD optimizations on OpenCL programs. The experimental results incorporating our calculus and our proposed compiler optimizations show that the proposed SIMD optimizations can provide average performance improvements of 22% on x86 CPUs and 4% on advanced micro devices GPUs. For the selected 15 benchmarks, 11 of them are improved on x86 CPUs, and six of them are improved on advanced micro devices GPUs. The proposed framework has the potential to be used to construct other SIMD optimizations on OpenCL programs. Copyright © 2015 John Wiley & Sons, Ltd. 相似文献

10.

SkelCL: a high-level extension of OpenCL for multi-GPU systems

Michel Steuwer Sergei Gorlatch 《The Journal of supercomputing》2014,69(1):25-33

Application development for modern high-performance systems with graphics processing units (GPUs) currently relies on low-level programming approaches like CUDA and OpenCL, which leads to complex, lengthy and error-prone programs. We present SkelCL—a high-level programming approach for systems with multiple GPUs and its implementation as a library on top of OpenCL. SkelCL makes three main enhancements to the OpenCL standard: (1) memory management is simplified using parallel container data types (vectors and matrices); (2) an automatic data (re)distribution mechanism allows for implicit data movements between GPUs and ensures scalability when using multiple GPUs; (3) computations are conveniently expressed using parallel algorithmic patterns (skeletons). We demonstrate how SkelCL is used to implement parallel applications, and we report experimental evaluation of our approach in terms of programming effort and performance. 相似文献

11.

Introducing and Implementing the Allpairs Skeleton for Programming Multi-GPU Systems

Michel Steuwer Malte Friese Sebastian Albers Sergei Gorlatch 《International journal of parallel programming》2014,42(4):601-618

Algorithmic skeletons simplify software development: they abstract typical patterns of parallelism and provide their efficient implementations, allowing the application developer to focus on the structure of algorithms, rather than on implementation details. This becomes especially important for modern parallel systems with multiple graphics processing units (GPUs) whose programming is complex and error-prone, because state-of-the-art programming approaches like CUDA and OpenCL lack high-level abstractions. We define a new algorithmic skeleton for allpairs computations which occur in real-world applications, ranging from bioinformatics to physics. We develop the skeleton’s generic parallel implementation for multi-GPU Systems in OpenCL. To enable the automatic use of the fast GPU memory, we identify and implement an optimized version of the allpairs skeleton with a customizing function that follows a certain memory access pattern. We use matrix multiplication as an application study for the allpairs skeleton and its two implementations and demonstrate that the skeleton greatly simplifies programming, saving up to 90 % of lines of code as compared to OpenCL. The performance of our optimized implementation is up to 6.8 times higher as compared with the generic implementation and is competitive to the performance of a manually written optimized OpenCL code. 相似文献

12.

Parallel probabilistic relaxation labelling based on Markov random fields for spectral-spatial hyperspectral image classification

Brajesh Kumar Onkar Dikshit 《International journal of remote sensing》2016,37(18):4356-4379

The large volume of data and computational complexity of algorithms limit the application of hyperspectral image classification to real-time operations. This work addresses the use of different parallel processing techniques to speed up the Markov random field (MRF)-based method to perform spectral-spatial classification of hyperspectral imagery. The Metropolis relaxation labelling approach is modified to take advantage of multi-core central processing units (CPUs) and to adapt it to massively parallel processing systems like graphics processing units (GPUs). The experiments on different hyperspectral data sets revealed that the implementation approach has a huge impact on the execution time of the algorithm. The results demonstrated that the modified MRF algorithm produced classification accuracy similar to conventional methods with greatly improved computational performance. With modern multi-core CPUs, good computational speed-up can be achieved even without additional hardware support. The CPU-GPU hybrid framework rendered the otherwise computationally expensive approach suitable for time-constrained applications. 相似文献

13.

Box-counting algorithm on GPU and multi-core CPU: an OpenCL cross-platform study

Jesús Jiménez Juan Ruiz de Miras 《The Journal of supercomputing》2013,65(3):1327-1352

In this paper, we present the analysis and development of a cross-platform OpenCL implementation of the box-counting algorithm, which is one of the most widely-used methods for estimating the Fractal Dimension. The Fractal Dimension is a relevant image analysis method used in several disciplines, but computing it is in general a time consuming process, especially when working with 3D images. Unlike parallel programming models that strictly depend on the hardware type and manufacturer, like CUDA, OpenCL allows us to provide an implementation suitable for execution on both GPUs and multi-core CPUs, whatever the hardware manufacturer. Sorting is a key part of the fast box-counting algorithm and the final speedup is highly conditioned by the efficiency of the sorting algorithm used. Our study reveals that current OpenCL implementations of sorting algorithms are clearly slower when compared with both CUDA for GPU and specific multi-core CPU implementations. Our OpenCL algorithm has been specifically optimized according the type of the target device and the results show an average speedup of up to 7.46× and 4×, when executed on the GPU and the multi-core CPU respectively, both compared with the single-threaded (sequential) CPU implementation. 相似文献

14.

Accelerating computation of Euclidean distance map using the GPU with efficient memory access

《International Journal of Parallel, Emergent and Distributed Systems》2013,28(5):383-406

Recent graphics processing units (GPUs), which have many processing units, can be used for general purpose parallel computation. To utilise the powerful computing ability, GPUs are widely used for general purpose processing. Since GPUs have very high memory bandwidth, the performance of GPUs greatly depends on memory access. The main contribution of this paper is to present a GPU implementation of computing Euclidean distance map (EDM) with efficient memory access. Given a two-dimensional (2D) binary image, EDM is a 2D array of the same size such that each element stores the Euclidean distance to the nearest black pixel. In the proposed GPU implementation, we have considered many programming issues of the GPU system such as coalesced access of global memory and shared memory bank conflicts, and so on. To be concrete, by transposing 2D arrays, which are temporal data stored in the global memory, with the shared memory, the main access from/to the global memory enables to be performed by coalesced access. In practice, we have implemented our parallel algorithm in the following three modern GPU systems: Tesla C1060, GTX 480 and GTX 580. The experimental results have shown that, for an input binary image with size of 9216 × 9216, our implementation can achieve a speedup factor of 54 over the sequential algorithm implementation. 相似文献

15.

一种适应GPU的混合OLAP查询处理模型

张宇张延松陈红王珊《软件学报》2016,27(5):1246-1265

通用GPU因其强大的并行计算能力成为新兴的高性能计算平台,并逐渐成为近年来学术界在高性能数据库实现技术领域的研究热点.但当前GPU数据库领域的研究沿袭的是ROLAP(relational OLAP)多维分析模型,研究主要集中在关系操作符在GPU平台上的算法实现和性能优化技术,以哈希连接的GPU并行算法研究为中心.GPU拥有数千个并行计算单元,但其逻辑控制单元较少,相对于CPU具有更强的并行计算能力,但逻辑控制和复杂内存管理能力较弱,因此并不适合需要复杂数据结构和复杂内存管理机制的内存数据库查询处理算法直接移植到GPU平台.提出了面向GPU向量计算特性的混合OLAP多维分析模型semi-MOLAP,将MOLAP(multidimensionalOLAP)模型的直接数组访问和计算特性与ROLAP模型的存储效率结合在一起,实现了一个基于完全数组结构的GPU semi-MOLAP多维分析模型,简化了GPU数据管理,降低了GPU semi-MOLAP算法复杂度,提高了GPU semi-MOLAP算法的代码执行率.同时,基于GPU和CPU计算的特点,将semi-MOLAP操作符拆分为CPU和GPU平台的协同计算,提高了CPU和GPU的利用率以及OLAP的查询整体性能. 相似文献

16.

dOCAL: high-level distributed programming with OpenCL and CUDA

Rasch Ari Bigge Julian Wrodarczyk Martin Schulze Richard Gorlatch Sergei 《The Journal of supercomputing》2020,76(7):5117-5138

In the state-of-the-art parallel programming approaches OpenCL and CUDA, so-called host code is required for program’s execution. Efficiently implementing host code is often a cumbersome task, especially when executing OpenCL and CUDA programs on systems with multiple nodes, each comprising different devices, e.g., multi-core CPU and graphics processing units; the programmer is responsible for explicitly managing node’s and device’s memory, synchronizing computations with data transfers between devices of potentially different nodes and for optimizing data transfers between devices’ memories and nodes’ main memories, e.g., by using pinned main memory for accelerating data transfers and overlapping the transfers with computations. We develop distributed OpenCL/CUDA abstraction layer (dOCAL)—a novel high-level C++ library that simplifies the development of host code. dOCAL combines major advantages over the state-of-the-art high-level approaches: (1) it simplifies implementing both OpenCL and CUDA host code by providing a simple-to-use, high-level abstraction API; (2) it supports executing arbitrary OpenCL and CUDA programs; (3) it allows conveniently targeting the devices of different nodes by automatically managing node-to-node communications; (4) it simplifies implementing data transfer optimizations by providing different, specially allocated memory regions, e.g., pinned main memory for overlapping data transfers with computations; (5) it optimizes memory management by automatically avoiding unnecessary data transfers; (6) it enables interoperability between OpenCL and CUDA host code for systems with devices from different vendors. Our experiments show that dOCAL significantly simplifies the development of host code for heterogeneous and distributed systems, with a low runtime overhead.

相似文献

17.

Parallel cube computation on modern CPUs and GPUs

Guoliang Zhou Hong Chen 《The Journal of supercomputing》2012,61(3):394-417

With the popularity of column-store databases, modern multi-core CPUs, and general-purpose computing on graphics processing units (GPGPUs), there will be radical changes in how processing is done in the online analytical processing (OLAP) and data warehousing fields. Cube computation is a core and time-consuming problem which has been researched extensively. However, most of the algorithms have been proposed without considering the prevalent multi-core architectures and column storage. This paper presents a new parallel cube algorithm that takes advantage of multi-core architectures. We first propose a cache-conscious bottom-up computation (BUC) algorithm called CC-BUC that adopts an integrated bottom-up and breadth-first partitioning order. Each dimension is separately stored and processed. In processing each dimension, breadth-first data scanning and results outputting reduce memory I/O and enhance cache locality. Cache misses are limited in a dimension scope, and translation lookaside buffer (TLB) misses are reduced. Based on CC-BUC, we give a multi-core architecture-based cube algorithm called MC-Cubing. Multiple partitions are processed simultaneously and multiple threads undergo parallel execution inside each partition. MC-Cubing is consistent with multi-core architectures and high parallelism. The layout and associated algorithms take advantage of single instruction, multiple data (SIMD) instructions and thread-level parallelism (TLP). We implement and demonstrate the effectiveness of MC-Cubing on two multi-core architectures: multi-core CPUs and GPUs. Experimental results show that the MC-Cubing algorithm can speed up nearly six times faster than BUC in real datasets. 相似文献

18.

Multi-core-CPU and GPU-accelerated radiative transfer models based on the discrete ordinate method

Dmitry S. Efremenko Diego G. Loyola Adrian Doicu Robert J.D. Spurr 《Computer Physics Communications》2014

The operational processing of remote sensing data usually requires high-performance radiative transfer model (RTM) simulations. To date, multi-core CPUs and also Graphical Processing Units (GPUs) have been used for highly intensive parallel computations. In this paper, we have compared multi-core and GPU implementations of an RTM based on the discrete ordinate solution method. To implement GPUs, the original CPU code has been redesigned using the C-oriented Compute Unified Device Architecture (CUDA) developed by NVIDIA. 相似文献

19.

Performance Evaluation of GPU-Accelerated Spatial Interpolation Using Radial Basis Functions for Building Explicit Surfaces

Zengyu Ding Gang Mei Salvatore Cuomo Nengxiong Xu Hong Tian 《International journal of parallel programming》2018,46(5):963-991

This paper focuses on evaluating the computational performance of parallel spatial interpolation with Radial Basis Functions (RBFs) that is developed by utilizing modern GPUs. The RBFs can be used in spatial interpolation to build explicit surfaces such as Discrete Elevation Models. When interpolating with large-size of data points and interpolated points for building explicit surfaces, the computational cost would be quite expensive. To improve the computational efficiency, we specifically develop a parallel RBF spatial interpolation algorithm on many-core GPUs, and compare it with the parallel version implemented on multi-core CPUs. Five groups of experimental tests are conducted on two machines to evaluate the computational efficiency of the presented GPU-accelerated RBF spatial interpolation algorithm. Experimental results indicate that: in most cases, the parallel RBF interpolation algorithm on many-core GPUs does not have any significant advantages over the parallel version on multi-core CPUs in terms of computational efficiency. This unsatisfied performance of the GPU-accelerated RBF interpolation algorithm is due to: (1) the limited size of global memory residing on the GPU, and (2) the need to solve a system of linear equations in each GPU thread to calculate the weights and prediction value of each interpolated point. 相似文献

20.

A high performance parallel DCT with OpenCL on heterogeneous computing environment

Cheong Ghil Kim Yong Soo Choi 《Multimedia Tools and Applications》2013,64(2):475-489

A noteworthy thing in desktop PCs is that they can provide a great opportunity to increase the performance of processing multimedia data by exploiting task- and data-parallelism with multi-core CPU and many-core GPU. This paper presents a high performance parallel implementation of 2D DCT on this heterogeneous computing environment. For this purpose, Intel TBB (threading building blocks) and OpenCL (Open Compute Language) are utilized for task- and data-parallelism, respectively. The simulation result shows that the parallel DCT implementations far the serial ones in processing speed. Especially, OpenCL implementation shows a linear speedup, a typical SIMD characteristic as the increase of 2D data sets. 相似文献