Similar Documents
20 similar documents found (search time: 406 ms)
1.
Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in distributed environments. To automate this process on shared memory systems, we establish a performance model using NVIDIA’s Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations are able to achieve a speedup no less than 95% of the optimal speedup.
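To make the ghost-zone idea concrete, here is a minimal CUDA sketch, not the paper's auto-tuning framework: each thread block loads its tile plus a surrounding ghost zone into shared memory and computes the halo values redundantly, so no inter-block exchange is needed within the kernel. The tile size and ghost width are illustrative assumptions; a wider ghost zone would allow several stencil steps per kernel launch at the cost of more redundant work, which is exactly the trade-off the paper's model navigates.

```cuda
// Minimal sketch of ghost-zone tiling for a 1-D 3-point stencil.
// TILE and HALO are illustrative; the paper's framework would select the
// ghost-zone size automatically from its performance model.
#include <cuda_runtime.h>

#define TILE 256
#define HALO 1   // ghost-zone width; width k would permit k steps per launch

__global__ void stencil_ghost(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2 * HALO];
    // Launched with TILE + 2*HALO threads per block: halo threads only load.
    int g = blockIdx.x * TILE + threadIdx.x - HALO;  // global index
    int l = threadIdx.x;                             // shared-memory index
    tile[l] = in[min(max(g, 0), n - 1)];             // clamped, redundant load
    __syncthreads();
    if (l >= HALO && l < TILE + HALO && g < n)       // interior threads write
        out[g] = 0.25f * tile[l - 1] + 0.5f * tile[l] + 0.25f * tile[l + 1];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    stencil_ghost<<<(n + TILE - 1) / TILE, TILE + 2 * HALO>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}
```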

2.
Recent graphics processing units (GPUs), which have many processing units, can be used for general-purpose parallel computation, and since they have very high memory bandwidth, GPU performance depends greatly on memory access patterns. The main contribution of this paper is to present a GPU implementation of the Euclidean distance map (EDM) with efficient memory access. Given a two-dimensional (2D) binary image, the EDM is a 2D array of the same size in which each element stores the Euclidean distance to the nearest black pixel. The proposed GPU implementation addresses many programming issues of the GPU system, such as coalesced access to global memory and shared-memory bank conflicts. Concretely, by transposing the 2D arrays of temporary data held in global memory through shared memory, the main accesses to and from global memory can be performed in a coalesced manner. In practice, we have implemented our parallel algorithm on three modern GPU systems: Tesla C1060, GTX 480 and GTX 580. The experimental results show that, for an input binary image of size 9216 × 9216, our implementation achieves a speedup factor of 54 over the sequential implementation.
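The transpose-through-shared-memory trick described above is a standard CUDA pattern; the following hypothetical kernel (not the authors' code) shows how staging a 32 × 32 tile in shared memory lets both the global-memory read and the transposed global-memory write be coalesced, with one padding column added to avoid shared-memory bank conflicts.

```cuda
// Tiled matrix transpose: reads and writes to global memory are both
// coalesced because consecutive threads always touch consecutive addresses;
// the transposition itself happens inside the shared-memory tile.
#define TDIM 32

__global__ void transpose_coalesced(const float* in, float* out, int n) {
    __shared__ float tile[TDIM][TDIM + 1];            // +1 avoids bank conflicts
    int x = blockIdx.x * TDIM + threadIdx.x;
    int y = blockIdx.y * TDIM + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * TDIM + threadIdx.x;              // swap block coordinates
    y = blockIdx.x * TDIM + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```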

3.
Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU‐GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double‐precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared‐memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared‐memory systems with multiple GPUs.

4.
Emerging many-core processors, like CUDA-capable nVidia GPUs, are promising platforms for regular parallel algorithms such as the Lattice Boltzmann Method (LBM). Since the global memory of graphics devices shows high latency and LBM is data intensive, the memory access pattern is an important issue for achieving good performance. Whenever possible, global memory loads and stores should be coalesced and aligned, but the propagation phase in LBM can lead to frequent misaligned memory accesses. Most previous CUDA implementations of 3D LBM addressed this problem by using low-latency on-chip shared memory. Instead, our CUDA implementation of LBM follows carefully chosen data transfer schemes in global memory. For the 3D lid-driven cavity test case, we obtained up to 86% of the maximal global memory throughput on nVidia’s GT200. We show that, as a consequence, highly efficient implementations of LBM on GPUs are possible, even for complex models.
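The misalignment issue mentioned above is easy to see in code. In this illustrative fragment (not the paper's kernels; layout and names are assumptions), propagating a distribution along +x makes every thread read its left neighbour, shifting each warp's load one element off the aligned segment, whereas propagating along y shifts by a whole row and stays aligned.

```cuda
// With a structure-of-arrays layout, streaming along +x produces loads
// shifted by one element (misaligned on older GPUs such as the GT200),
// while streaming along y keeps each warp's loads on aligned segments.
__global__ void stream_plus_x(const float* f_in, float* f_out, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y;
    if (y < ny && x > 0 && x < nx)
        f_out[y * nx + x] = f_in[y * nx + (x - 1)];  // misaligned read
}

__global__ void stream_plus_y(const float* f_in, float* f_out, int nx, int ny) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y;
    if (x < nx && y > 0 && y < ny)
        f_out[y * nx + x] = f_in[(y - 1) * nx + x];  // aligned read
}
```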

5.
In this paper, we propose a new parallel genome matching algorithm using graphics processing units (GPUs). Our approach is based on the Aho–Corasick algorithm and was designed around the architectural features of existing GPUs with a hundred or more cores. We provide an appropriate task partitioning method that runs on multiple threads and fully utilizes the cache memory and shared memory structures available in GPUs. In particular, we propose a tiled access method for rapid data transfer from global memory to shared memory. We also provide new models for a cache-friendly state transition table to improve the performance of pattern matching operations on GPUs. The maximum throughput we achieved in various experiments was 15.3 Gbps. Moreover, we showed that our proposed design outperformed an earlier approach with a 15.4% performance improvement.
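A hedged sketch of the tiled access pattern (all names, sizes, and the transition-table layout are assumptions, not the paper's implementation): a block cooperatively stages a chunk of the text in shared memory with coalesced loads, then each thread runs the Aho–Corasick automaton over its sub-segment plus an overlap so matches that straddle segment boundaries are not missed.

```cuda
#define CHUNK   1024   // bytes of text staged per block (assumption)
#define OVERLAP 15     // longest pattern length - 1 (assumption)

__global__ void ac_match(const unsigned char* text, int n,
                         const int* delta,     // transitions, [state][256]
                         const int* accepts,   // 1 if the state is accepting
                         int* hits) {
    __shared__ unsigned char buf[CHUNK + OVERLAP];
    int base = blockIdx.x * CHUNK;
    // Cooperative, coalesced copy of the chunk plus its overlap region.
    for (int i = threadIdx.x; i < CHUNK + OVERLAP; i += blockDim.x)
        if (base + i < n) buf[i] = text[base + i];
    __syncthreads();
    // Each thread scans one sub-segment from the root state; the overlap
    // covers matches that cross segment boundaries (a full implementation
    // would also filter the duplicates this can produce at the seams).
    int seg = CHUNK / blockDim.x;         // assumes blockDim.x divides CHUNK
    int s0 = threadIdx.x * seg;
    int state = 0;
    for (int i = s0; i < s0 + seg + OVERLAP && base + i < n; ++i) {
        state = delta[state * 256 + buf[i]];
        if (accepts[state]) atomicAdd(hits, 1);
    }
}
```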

6.
The most commonly used approach for solving reaction–diffusion systems relies upon stencil computations. Although stencil computations feature low compute intensity, they place high demands on memory bandwidth. Fortunately, GPU computing allows the heavy reliance of stencil computations on neighboring data points to be exploited to significantly increase simulation speeds by reducing these memory bandwidth demands. A review of previously published work shows a wide variety of efforts to optimize NVIDIA CUDA-based stencil computations. However, a critical aspect contributing to algorithm performance is commonly glossed over: the halo region loading technique used in conjunction with a given spatial blocking technique. This paper presents an in-depth examination of this aspect and the associated single-iteration performance impacts when using symmetric, nearest-neighbor 19-point stencils. This is accomplished by closely examining how the simulated space is partitioned into thread blocks and the balance between memory accesses, divergence, and computing threads. The resulting optimization strategy for accelerating 3-dimensional reaction–diffusion simulations offers up to 2.45 times speedup for single-precision floating-point numbers relative to the GPU-based speedups reported in the previously published work that this paper directly extends, and up to 8.69 times speedup relative to our multithreaded CPU-based implementation.
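As a concrete example of one halo-loading choice paired with one spatial-blocking choice, here is a simplified 2.5-D sketch for a 7-point stencil rather than the paper's 19-point one (tile sizes are assumptions, and nx and ny are assumed to be multiples of the tile dimensions): the block tiles x-y and marches along z, edge threads load the in-slice halo, and the z-neighbours stay in registers.

```cuda
#define BX 32
#define BY 8

// 2.5-D blocking for a 7-point stencil: the current x-y slice sits in shared
// memory with a one-cell halo; the slices above and below sit in registers.
__global__ void stencil3d(const float* in, float* out, int nx, int ny, int nz) {
    __shared__ float sl[BY + 2][BX + 2];
    int x = blockIdx.x * BX + threadIdx.x;          // assumes nx % BX == 0
    int y = blockIdx.y * BY + threadIdx.y;          // assumes ny % BY == 0
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;
    size_t slice = (size_t)nx * ny;
    float below, cur = in[(size_t)y * nx + x];
    float above = in[slice + (size_t)y * nx + x];
    for (int z = 1; z < nz - 1; ++z) {
        below = cur; cur = above;
        above = in[(z + 1) * slice + (size_t)y * nx + x];
        __syncthreads();                            // slice buffer is reused
        sl[ty][tx] = cur;
        // Edge threads additionally fetch the halo of the current slice.
        if (threadIdx.x == 0 && x > 0)
            sl[ty][0] = in[z * slice + (size_t)y * nx + x - 1];
        if (threadIdx.x == BX - 1 && x < nx - 1)
            sl[ty][BX + 1] = in[z * slice + (size_t)y * nx + x + 1];
        if (threadIdx.y == 0 && y > 0)
            sl[0][tx] = in[z * slice + (size_t)(y - 1) * nx + x];
        if (threadIdx.y == BY - 1 && y < ny - 1)
            sl[BY + 1][tx] = in[z * slice + (size_t)(y + 1) * nx + x];
        __syncthreads();
        if (x > 0 && x < nx - 1 && y > 0 && y < ny - 1)
            out[z * slice + (size_t)y * nx + x] =
                (below + above + cur + sl[ty][tx - 1] + sl[ty][tx + 1] +
                 sl[ty - 1][tx] + sl[ty + 1][tx]) / 7.0f;
    }
}
```

Having only edge threads fetch the halo keeps every thread computing, at the cost of some divergence at tile borders; the paper's study is precisely about weighing such halo-loading variants against each other.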

7.
This paper focuses on challenging applications that can be expressed as an iterative pipeline of multiple 3D stencil stages and explores their optimization space on GPUs. For this study, we selected a representative example from the field of digital signal processing, the Anisotropic Nonlinear Diffusion algorithm. An open issue for these applications is determining the optimal fission/fusion level of the involved stages and whether that combination benefits from data tiling. This implies exploring the large space of all possible fission/fusion combinations with and without tiling, which makes the process non-trivial. This study provides insights that reduce the optimization tuning space and programming effort of iterative multiple 3D stencils. Our results demonstrate that all combinations that fuse the bottleneck stencil with high halo-update cost (>25%; this percentage can be measured or estimated experimentally for each single stencil) and high register and shared-memory usage need not be considered in the exploration process. The optimal fission/fusion combination is up to 1.65× faster than the case in which we fully decompose our stencil without tiling, and 5.3× faster with respect to the fully fused version on the NVIDIA GPUs.

8.
We present a GPU‐based streaming algorithm to perform high‐resolution and accurate cloth simulation. We map all the components of the cloth simulation pipeline, including time integration, collision detection, collision response, and velocity updating, to GPU‐based kernels and data structures. Our algorithm handles intra‐object and inter‐object collisions, contacts and friction, and is able to accurately simulate folds and wrinkles. We describe the streaming pipeline and address many issues in terms of obtaining high throughput on many‐core GPUs. In practice, our algorithm can perform high‐fidelity simulation on a cloth mesh with 2M triangles using 3 GB of GPU memory. We highlight the parallel performance of our algorithm on three different generations of GPUs. On a high‐end NVIDIA Tesla K20c, we observe up to two orders of magnitude performance improvement over a single‐threaded CPU‐based algorithm, and about one order of magnitude improvement over a 16‐core CPU‐based parallel implementation.

9.
Floating-point fast Fourier transform (FFT) is widely demanded in scientific computing and high-resolution imaging applications owing to its wide dynamic range and high processing precision, but it suffers from high area and energy overhead in comparison to fixed-point implementations. To address these issues, this paper presents an area- and energy-efficient hybrid architecture for floating-point FFT computations. It minimizes the required arithmetic units and significantly reduces memory usage by combining three different parts. The serial radix-4 butterfly (SR4BF) is used in the single-path delay commutator (SDC) part to minimize the required arithmetic units, achieving a 100% adder utilization ratio. A modified single-path delay feedback (MSDF) architecture is proposed to trade off arithmetic resources against memory usage by using the new half radix-4 butterfly (HR4BF), which achieves a 50% adder utilization ratio; the intermediate caching buffer in the MSDF part is modified accordingly. By combining the reduction in arithmetic units with the optimization of memory usage across the different parts, area and power are optimized without loss of throughput. Logic synthesis results in a 65 nm CMOS technology show that the energy per FFT is about 331.5 nJ for 1024-point FFT computations at 400 MHz. The total hardware overhead is equivalent to 460k NAND2 gates.

10.
Global magnetohydrodynamic (MHD) models play a major role in investigating the solar wind–magnetosphere interaction; however, the huge computational requirements of global MHD simulations remain the main problem to be solved. With the recent development of modern graphics processing units (GPUs) and the Compute Unified Device Architecture (CUDA), it is possible to perform global MHD simulations much more efficiently. In this paper, we present a global MHD simulator on multiple GPUs using CUDA 4.0 with GPUDirect 2.0. Our implementation is based on the modified leapfrog scheme, a combination of the leapfrog scheme and the two-step Lax–Wendroff scheme. GPUDirect 2.0 is used in our implementation to drive multiple GPUs; all data transfers and kernel processing are managed with the CUDA 4.0 API instead of MPI or OpenMP. Performance measurements on a multi-GPU system with eight NVIDIA Tesla M2050 (Fermi architecture) graphics cards show that our multi-GPU implementation achieves a peak performance of 97.36 GFLOPS in double precision.

11.
In this paper we describe a new parallel Frequent Itemset Mining algorithm called “Frontier Expansion.” This implementation is optimized to achieve high performance on a heterogeneous platform consisting of a shared memory multiprocessor and multiple Graphics Processing Unit (GPU) coprocessors. Frontier Expansion is an improved data-parallel algorithm derived from the Equivalent Class Clustering (Eclat) method, in which a partial breadth-first search is utilized to exploit maximum parallelism while being constrained by the available memory capacity. In our approach, the vertical transaction lists are represented using a “bitset” representation and operated on using wide bitwise operations across multiple threads on a GPU. We evaluated our approach using four NVIDIA Tesla GPUs and observed a 6–30× speedup relative to state-of-the-art sequential Eclat and FPGrowth implementations executed on a multicore CPU.
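The core bitset operation is simple to sketch (hypothetical kernel, not the paper's code): with each itemset's transaction list stored as a bit vector, the support of a candidate itemset is the population count of the bitwise AND of its parents' bit vectors, accumulated here with atomics.

```cuda
// Support counting for one candidate itemset: popcount(a AND b) summed over
// the bit-vector words, one thread per 32-bit word.
__global__ void support_count(const unsigned int* a, const unsigned int* b,
                              int nwords, int* support) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nwords)
        atomicAdd(support, __popc(a[i] & b[i]));
}
```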

12.
Scientific computations have been using GPU-enabled computers successfully, often relying on distributed nodes to overcome the limitations of device memory. Only a handful of text mining applications benefit from such infrastructure. Since the initial steps of text mining are typically data intensive, and the ease of deployment of algorithms is an important factor in developing advanced applications, we introduce a flexible, distributed, MapReduce-based text mining workflow that performs I/O-bound operations on CPUs with industry-standard tools and then runs compute-bound operations on GPUs which are optimized to ensure coalesced memory access and effective use of shared memory. We have performed extensive tests of our algorithms on a cluster of eight nodes with two NVidia Tesla M2050s attached to each, and we achieve considerable speedups for random projection and self-organizing maps.

13.
The possibility of porting algorithms to graphics processing units (GPUs) raises significant interest among researchers. The natural next step is to employ multiple GPUs, but communication overhead may limit further performance improvement. In this paper, we investigate techniques for reducing this overhead on hybrid CPU–GPU platforms, including careful data layout, usage of GPU memory spaces, and non-blocking communication. In addition, we propose an accurate automatic load-balancing technique for heterogeneous environments. We validate our approach on a hybrid Jacobi solver for the 2D Laplace equation. Experiments carried out using various graphics hardware and types of connectivity confirm that the proposed data layout allows our fastest CUDA kernels to reach the analytical limit for memory bandwidth (up to 106 GB/s on NVidia GTX 480), and that non-blocking communication significantly reduces overhead, allowing almost linear speed-up even when communication is carried out over relatively slow networks.
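For reference, a textbook Jacobi sweep for the 2-D Laplace equation looks as follows (illustrative, not the authors' tuned kernel); since rows are contiguous, threads varying along x produce the coalesced accesses such kernels need to approach the memory-bandwidth limit.

```cuda
// One Jacobi iteration: each interior point becomes the average of its four
// neighbours; ping-pong u and u_new between launches until convergence.
__global__ void jacobi_step(const float* u, float* u_new, int nx, int ny) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1)
        u_new[j * nx + i] = 0.25f * (u[j * nx + i - 1] + u[j * nx + i + 1] +
                                     u[(j - 1) * nx + i] + u[(j + 1) * nx + i]);
}
```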

14.
Objective: In recent years, research in binocular stereo vision has increasingly focused on real-time strategies, and stereo cost aggregation is the most complex and time-consuming step in binocular vision. This paper therefore proposes a near-real-time stereo cost aggregation algorithm based on general-purpose GPU computing (GPGPU). Method: Linear stereo matching, a local algorithm whose matching accuracy approaches that of global matching algorithms, is adopted as the cost aggregation strategy; following the principles of linear cost aggregation, the computation flow of its main steps (cost computation, mean filtering, and coefficient solving) is parallelized and optimized accordingly. Results: For the same test samples, the proposed method computes the cost matrix in less time on an NVIDIA GTX 780 platform; compared with the original CPU implementation, the efficiency of cost aggregation improves by tens of times on average. Conclusion: This near-real-time stereo cost aggregation method provides an efficient and reliable way to obtain high-quality binocular depth information in real time on a general-purpose PC platform.
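One of the aggregation steps named above, the mean filter, is easy to illustrate (hypothetical kernel, not the paper's implementation): each thread averages a (2r+1)-wide horizontal window with clamping at the image border; a second pass along the columns of the result completes the 2-D box filter.

```cuda
// Horizontal mean-filter pass over a w-by-h image; a vertical pass over the
// output completes the 2-D box filter used in cost aggregation.
__global__ void mean_filter_rows(const float* in, float* out,
                                 int w, int h, int r) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float s = 0.0f;
    int cnt = 0;
    for (int k = -r; k <= r; ++k) {       // window clamped at the border
        int xx = x + k;
        if (xx >= 0 && xx < w) { s += in[(size_t)y * w + xx]; ++cnt; }
    }
    out[(size_t)y * w + x] = s / cnt;     // mean over the valid window
}
```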

15.
Modern graphics processing units (GPUs) have been widely utilized for magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems need to be explored for large-scale MHD simulations; however, the data transfer between GPUs bottlenecks the efficiency of simulations on such systems. In this paper we propose a novel GPU Direct–MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) we exploit GPU Direct 2.0 to speed up data transfers between multiple GPUs within a single node and reduce the total number of message passing interface (MPI) communications; (2) we design Compute Unified Device Architecture (CUDA) kernels instead of memory copies to speed up the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferred on distributed multi-GPU systems because of the low efficiency of this fragmented data exchange; our approach makes 3D decomposition practical on such systems, reducing the memory usage and computation time of each partition of the computational domain. Experimental results show twice the FLOPS compared with a common 2D-decomposition, MPI-only implementation. The proposed approach has been developed into an efficient implementation for MHD simulations on distributed multi-GPU systems, called the MGPU–MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single-GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements were conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for a problem with 1200³ grid points using 216 GPUs.

16.
Efficient parallel finite-difference stencil algorithms are important for solving large linear systems. To address the poor data locality and the high synchronization and communication overhead of parallel finite-difference stencil algorithms, we first improve the traditional finite-difference stencil algorithm and propose a multilayer symmetric traversal finite-difference stencil algorithm. We then give a serial algorithm whose execution order is the tile order of the iteration space: by time-skewed partitioning of the iteration space along the time axis, multiple iterations are computed within each iteration-space tile without changing the properties of the iterative algorithm, which improves the data locality of the algorithm. Finally, we propose a parallel algorithm based on iteration-space tiles, which partitions the iteration-space grid using an improved polyhedral model and reduces the cache miss rate and the number of communication startups and synchronizations by reordering the grid tiles. Theoretical analysis and experimental results show that this parallel model achieves better data locality, parallel efficiency, and scalability than the traditional domain-decomposition method and the red-black ordering parallel algorithm.
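A minimal host-side sketch of the time-skewed traversal described above (host C++ as would appear in a .cu file; the tile width B and the double-buffering scheme are assumptions, not the paper's multilayer algorithm): the (t, x) iteration space of a 1-D 3-point stencil is cut into parallelogram tiles that slide left by one cell per time step, so a tile can run all T steps on in-cache data while reading only values that earlier tiles, or its own previous step, have already produced.

```cuda
// Time-skewed (parallelogram) tiling of a 1-D 3-point stencil.
// Buffers a and b hold alternating time levels; the boundary cells a[0],
// a[n-1], b[0], b[n-1] are fixed and must be initialized in both buffers.
#include <algorithm>

void skewed_stencil(float* a, float* b, int n, int T, int B) {
    for (int tb = 0; tb < n + T; tb += B) {              // one tile per base
        for (int t = 0; t < T; ++t) {                    // all steps per tile
            const float* src = (t % 2 == 0) ? a : b;     // time level t
            float*       dst = (t % 2 == 0) ? b : a;     // time level t + 1
            int lo = std::max(1, tb - t);                // tile slides left
            int hi = std::min(n - 1, tb - t + B);
            for (int x = lo; x < hi; ++x)
                dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3.0f;
        }
    }
}
```

Each cell is still updated exactly once per time step; only the traversal order changes, which is why the skewing preserves the iterative algorithm's properties while improving locality.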

17.
The CORDIC algorithm, originally proposed using nonredundant radix-2 arithmetic, has been refined in terms of throughput and latency with the introduction of redundant arithmetic and higher-radix techniques. In this paper, we propose a pipelined architecture using signed-digit arithmetic for a VLSI-efficient implementation of the rotational radix-4 CORDIC algorithm, eliminating the z-path completely. A detailed comparison of the proposed architecture with the available radix-2 architectures shows the latency and hardware improvements. The proposed architecture achieves a latency improvement over the previously proposed radix-4 architecture with a relatively small hardware overhead. The proposed architecture for 16-bit precision was implemented in VHDL and extensive simulations were performed to validate the results. The functionally simulated netlist was synthesized for 16-bit precision with a 90 nm CMOS technology library, and the area-time measures are provided. The architecture was also implemented using Xilinx ISE 9.1 software and a Virtex device.
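For readers unfamiliar with the underlying iteration, this is the classic nonredundant radix-2 CORDIC rotation that the paper's radix-4, signed-digit architecture refines (a floating-point reference model in host C, not the hardware): each step rotates by ±atan(2^-i), and the accumulated gain K is divided out at the end. Eliminating the z-path means the direction sequence d_i is derived without explicitly updating the angle accumulator z shown here.

```cuda
// Reference radix-2 CORDIC in rotation mode: after n iterations,
// (x, y) / K approximates (cos(angle), sin(angle)) for |angle| within the
// CORDIC convergence range (about +/-1.74 rad).
#include <math.h>

void cordic_rotate(double angle, int n, double* cos_out, double* sin_out) {
    double x = 1.0, y = 0.0, z = angle, K = 1.0;
    for (int i = 0; i < n; ++i) {
        double d = (z >= 0.0) ? 1.0 : -1.0;   // rotation direction d_i
        double t = ldexp(1.0, -i);            // 2^-i (a shift in hardware)
        double xn = x - d * y * t;
        y += d * x * t;
        x = xn;
        z -= d * atan(t);                     // angle accumulator (z-path)
        K *= sqrt(1.0 + t * t);               // accumulated CORDIC gain
    }
    *cos_out = x / K;                         // undo the gain
    *sin_out = y / K;
}
```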

18.
Given a collection of documents residing on a disk, we develop a new strategy for processing these documents and building the inverted files extremely quickly. Our approach is tailored for a heterogeneous platform consisting of multicore CPUs and highly multithreaded GPUs. Our algorithm is based on a number of novel techniques, including a high-throughput pipelined strategy, a hybrid trie and B-tree dictionary data structure, dynamic work allocation to CPU and GPU threads, and optimized CUDA indexer implementation. We have performed extensive tests of our algorithm on a single node (two Intel Xeon X5560 Quad-core CPUs) with two NVIDIA Tesla C1060 GPUs attached to it, and were able to achieve a throughput of more than 262 MB/s on the ClueWeb09 dataset. Similar results were obtained for widely different datasets. The throughput of our algorithm is superior to the best known algorithms reported in the literature even when compared to those run on large clusters.

19.
Graphics Processing Units (GPUs), originally developed for computer games, now provide computational power for scientific applications. In this paper, we develop a general-purpose Lattice Boltzmann code that runs entirely on a single GPU. The results show that: (1) single-precision floating-point arithmetic is sufficient for LBM computation in comparison to double precision; (2) the implementation of LBM on GPUs allows us to achieve up to about one billion lattice updates per second using single-precision floating point; (3) GPUs provide an inexpensive alternative to large clusters for fluid dynamics prediction.

20.
Graphics processor units (GPUs), originally designed for graphics rendering, have emerged as massively parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can substantially accelerate compute-intensive simulation science applications. In this study, we describe the implementation of an incompressible-flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU, and use Pthreads to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels exploit different memory spaces on the GPU depending on their arithmetic intensity, and this memory-hierarchy-specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and compare single- and double-precision computational performance on the GPU. Using a quad-GPU platform for single-precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective, small-footprint parallel computing platform to substantially accelerate computational fluid dynamics (CFD) simulations.
