20 similar documents found; search took 31 ms
1.
Mathias Bourgoin Emmanuel Chailloux Jean-Luc Lamotte 《International journal of parallel programming》2014,42(4):583-600
General-purpose (GP)GPU programming demands coupling highly parallel computing units with classic CPUs to obtain high performance. Heterogeneous systems lead to complex designs that combine multiple paradigms and programming languages to manage each hardware architecture. In this paper, we present tools to harness GPGPU programming through the high-level OCaml programming language. We describe the SPOC library, which handles GPGPU subprograms (kernels) and data transfers between devices. We then present how SPOC expresses GPGPU kernels: through interoperability with common low-level extensions (from the Cuda and OpenCL frameworks) but also via an embedded DSL for OCaml. Using simple benchmarks as well as real-world HPC software, we show that SPOC can offer high performance while easing development. To allow better abstractions over tasks and data, we introduce parallel skeletons built upon SPOC, as well as composition constructs over those skeletons.
2.
The computing power of graphics processing units (GPU) has increased rapidly, and there has been extensive research on general-purpose computing on GPU (GPGPU) for cryptographic algorithms such as RSA, the Elliptic Curve Cryptosystem (ECC), NTRU, and the Advanced Encryption Standard. With the rise of GPGPU, commodity computers have become complex heterogeneous GPU+CPU systems. This new architecture poses new challenges and opportunities in high-performance computing. In this paper, we present high-speed parallel implementations of the rainbow method based on perfect tables, known as the most efficient time-memory trade-off, on a heterogeneous GPU+CPU system. We give a complete analysis of the effect of multiple checkpoints on reducing the cost of false alarms and take advantage of it for load balancing between the GPU and the CPU. For the GTX460, our implementation is about 1.86 and 3.25 times faster than other GPU-accelerated implementations, RainbowCrack and Cryptohaze, respectively, and for the GTX580, 1.53 and 2.40 times faster. Copyright © 2014 John Wiley & Sons, Ltd.
3.
4.
Cellular automata simulation of urban dynamics through GPGPU
In recent years, urban models based on Cellular Automata (CA) have become increasingly sophisticated and are being applied to real-world problems covering large geographical areas. As a result, they often require extended computing times. However, despite the improved availability of parallel computing facilities, applications in the field of urban and regional dynamics are almost always based on sequential algorithms. This paper contributes toward a wider use of high-performance computing techniques based on General-Purpose computing on Graphics Processing Units (GPGPU) in the field of geosimulation. In particular, we investigate the parallel speedup achieved by applying GPGPU to a popular constrained urban CA model. The major contribution of this work is the specific modeling we propose, which achieves significant gains in computing time while maintaining the most relevant features of the traditional sequential model.
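The constrained-CA idea (cell transitions ranked by neighbourhood potential, with a global constraint on how many cells may change per step) can be sketched as follows. The grid size, potential function, and demand value here are invented for illustration and are not the model studied in the paper:

```python
def step(grid, demand):
    """One constrained-CA step: rank undeveloped cells by the number of
    developed neighbours, then develop only the `demand` best-ranked cells
    (the global constraint typical of constrained urban CA)."""
    n = len(grid)

    def developed_neighbours(i, j):
        return sum(
            grid[a][b]
            for a in range(max(0, i - 1), min(n, i + 2))
            for b in range(max(0, j - 1), min(n, j + 2))
            if (a, b) != (i, j)
        )

    candidates = [
        (developed_neighbours(i, j), i, j)
        for i in range(n) for j in range(n) if grid[i][j] == 0
    ]
    candidates.sort(reverse=True)               # highest potential first
    new = [row[:] for row in grid]
    for _, i, j in candidates[:demand]:
        new[i][j] = 1
    return new

# seed: a single developed cell in the centre of a 9x9 grid
g = [[0] * 9 for _ in range(9)]
g[4][4] = 1
for _ in range(3):
    g = step(g, demand=4)                       # develop 4 cells per step
total = sum(map(sum, g))                        # 1 seed + 3 steps x 4 cells
```

The per-cell potential computation is independent across cells, which is precisely what a GPGPU version parallelises; only the ranking under the global constraint needs coordination.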
5.
6.
7.
Advances in semiconductor processes have steadily increased the number of transistors integrated on a chip, and the storage and computing capabilities of graphics processors have grown ever more powerful. The peak computing power of GPUs now far exceeds that of mainstream CPUs, and their potential in non-graphics computing, particularly high-performance computing, has attracted growing attention from researchers. This paper introduces the principles of using GPUs for general-purpose computation and surveys the latest academic and industrial research on GPGPU architectures and programming models.
8.
Mathias Bourgoin Emmanuel Chailloux Jean-Luc Lamotte 《International journal of parallel programming》2017,45(2):242-261
To increase software performance, it is now common to use hardware accelerators. Currently, GPUs are the most widespread accelerators that can handle general computations, which requires using GPGPU frameworks such as Cuda or OpenCL. Both are very low-level and make the benefits of GPGPU programming difficult to achieve. In particular, they require writing programs as a combination of two subprograms and manually managing devices and memory transfers, which increases the complexity of the overall software design. The idea we develop in this paper is to guarantee expressiveness and safety for CPU and GPU computations and memory management through high-level data structures and static type checking. We present how statically typed languages, compilers, and libraries help harness high-level GPGPU programming. In particular, we show how we added high-level user-defined data structures to a GPGPU programming framework based on a statically typed programming language: OCaml. We describe the introduction of records and tagged unions shared between the host program and GPGPU kernels, described via a domain-specific language, as well as a simple pattern-matching control structure to manage them. Examples, practical tests, and comparisons with state-of-the-art tools show that our solutions improve code design, productivity, and safety while providing a high level of performance.
9.
WENO (weighted essentially non-oscillatory) schemes are high-order numerical schemes widely used in computational fluid dynamics. Because of the complexity of both the algorithm itself and heterogeneous-computing programming, research on the automatic generation of heterogeneous computing code is needed to accelerate more applications. Based on Physis, a domain-specific language framework, this paper implements automatic heterogeneous code generation for an astronomical application using a three-dimensional fifth-order WENO computation. Test results on the "Yuan" supercomputer show that the automatically generated heterogeneous code scales well and reaches 72% of the performance of hand-optimized heterogeneous code, providing a reference for heterogeneous code generation in related fluid computations.
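As a reference for what such a generated kernel computes at each grid point, here is the classic fifth-order WENO reconstruction of Jiang and Shu in plain Python. This is a single-point sketch; the paper's generated code applies a computation of this kind across a whole 3D grid on the accelerator:

```python
def weno5(fm2, fm1, f0, fp1, fp2, eps=1e-6):
    """Fifth-order WENO reconstruction of the interface value f_{i+1/2}
    from the five cell values f_{i-2}..f_{i+2} (Jiang & Shu 1996)."""
    # candidate third-order reconstructions on the three sub-stencils
    q0 = (2 * fm2 - 7 * fm1 + 11 * f0) / 6
    q1 = (-fm1 + 5 * f0 + 2 * fp1) / 6
    q2 = (2 * f0 + 5 * fp1 - fp2) / 6
    # smoothness indicators: large where a sub-stencil is non-smooth
    b0 = 13/12 * (fm2 - 2*fm1 + f0)**2 + 1/4 * (fm2 - 4*fm1 + 3*f0)**2
    b1 = 13/12 * (fm1 - 2*f0 + fp1)**2 + 1/4 * (fm1 - fp1)**2
    b2 = 13/12 * (f0 - 2*fp1 + fp2)**2 + 1/4 * (3*f0 - 4*fp1 + fp2)**2
    # nonlinear weights built from the ideal linear weights (1/10, 6/10, 3/10)
    a0 = 0.1 / (eps + b0)**2
    a1 = 0.6 / (eps + b1)**2
    a2 = 0.3 / (eps + b2)**2
    s = a0 + a1 + a2
    return (a0 * q0 + a1 * q1 + a2 * q2) / s
```

On smooth data the nonlinear weights collapse to the ideal ones and the three third-order stencils combine into a fifth-order reconstruction; near a discontinuity the weight of the offending stencil is driven toward zero, which suppresses oscillations.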
10.
Zhiyong Yuan Weixin Si Xiangyun Liao Zhaoliang Duan Yihua Ding Jianhui Zhao 《The Journal of supercomputing》2012,61(1):84-102
Open Computing Language (OpenCL) is an open, royalty-free standard for general-purpose parallel programming across Central Processing Units (CPUs), Graphics Processing Units (GPUs), and other processors. This paper introduces OpenCL to implement real-time smoke simulation in a virtual surgery training system. First, Computational Fluid Dynamics (CFD) is adopted to construct the real-time smoke simulation model based on the Navier–Stokes (N-S) equations for an incompressible fluid at normal temperature and pressure. We then propose a parallel computing technique based on OpenCL to carry out the smoke simulation model on the CPU and GPU, respectively. Finally, we render the smoke in real time using a three-dimensional (3D) texture volume rendering method. Experimental results show that the proposed parallel computing technique achieves satisfactory image quality and rendering rates on both CPU and GPU.
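A typical building block of such incompressible N-S solvers is an implicit diffusion substep solved pointwise over the grid (in the style of Stam's "stable fluids"), and this per-cell relaxation is exactly the kind of computation that maps to one OpenCL work-item per cell. A minimal sequential Python sketch on an invented toy grid, not the paper's solver:

```python
def diffuse(field, nu_dt, iters=20):
    """Jacobi iterations for implicit diffusion: each interior cell relaxes
    toward the average of its four neighbours, blended with its source value.
    On a GPU, each cell update in an iteration is an independent work-item."""
    n = len(field)
    a = nu_dt * n * n                     # per-cell diffusion coefficient
    x = [row[:] for row in field]
    for _ in range(iters):
        nxt = [row[:] for row in x]
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                nxt[i][j] = (field[i][j] + a * (x[i-1][j] + x[i+1][j]
                                                + x[i][j-1] + x[i][j+1])) / (1 + 4 * a)
        x = nxt
    return x

grid = [[0.0] * 8 for _ in range(8)]
grid[4][4] = 1.0                          # a puff of smoke density
out = diffuse(grid, nu_dt=0.0001)         # density spreads to neighbours
```

The Jacobi form is preferred over Gauss-Seidel on GPUs precisely because every cell in an iteration reads only the previous iterate, so all work-items can run concurrently.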
11.
Data redistribution is an important means of achieving load balance in message-passing environments. This paper proposes a model problem for interleaved data distribution together with a parallel computing model for it, analyzes the implementation of the model problem in a message-passing environment, discusses its performance and conditions of applicability, presents the analysis results, and examines the overlapping of communication with computation. Applying the interleaved-redistribution load-balancing technique to the parallel solution of non-equilibrium stiff dynamical equation systems yields very good load-balancing results.
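Why interleaving balances load can be shown with a small sketch comparing block and cyclic ownership; the per-row cost model below is invented for illustration, mimicking a workload whose cost grows monotonically along the rows:

```python
def block_owner(i, n, p):
    """Owner of row i under a block distribution of n rows over p processes."""
    return i // -(-n // p)          # ceil(n/p) consecutive rows per process

def cyclic_owner(i, p):
    """Owner of row i under an interleaved (cyclic) distribution."""
    return i % p

def imbalance(costs, p, owner):
    """Max per-process load divided by the ideal (perfectly even) share."""
    loads = [0.0] * p
    for i, c in enumerate(costs):
        loads[owner(i)] += c
    return max(loads) / (sum(costs) / p)

# toy workload: cost grows along the rows, so a block distribution
# overloads the last process while cyclic interleaving evens things out
costs = [1 + i for i in range(64)]
p = 4
block = imbalance(costs, p, lambda i: block_owner(i, 64, p))   # ~1.74
cyclic = imbalance(costs, p, lambda i: cyclic_owner(i, p))     # ~1.05
```

An imbalance of 1.0 is perfect balance; the cyclic distribution sits close to it because each process samples rows from across the whole cost gradient.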
12.
This paper presents a general-purpose simulation approach integrating a set of technological developments and algorithmic methods in the cellular automata (CA) domain. The approach provides a general-purpose computing on graphics processing units (GPGPU) implementation for computing and multiple rendering of any direct-neighbor three-dimensional (3D) CA. The major contributions of this paper are: the CA processing and visualization of large 3D matrices computed in real time; an original method to encode and transmit large CA functions to the graphics processing units in real time; and clarification of the notions of top-down and bottom-up approaches to CA, which non-CA experts often confuse. Additionally, a practical technique to simplify the finding of CA functions is implemented using a 3D symmetric configuration on an interactive user interface with simultaneous inside and surface visualizations. The interactive user interface allows for testing the system with different project ideas and serves as a test bed for performance evaluation. To illustrate the flexibility of the proposed method, visual outputs from diverse areas are demonstrated. Computational performance data are also provided to demonstrate the method's efficiency. Results indicate that when large matrices are processed, computations on the GPU are two to three hundred times faster than the identical algorithms on the CPU.
13.
In recent years, the GPU (graphics processing unit) has evolved into an extremely powerful and flexible processor, and it now represents an attractive platform for general-purpose computation. Moreover, changes to the design and programmability of GPUs provide the opportunity to perform general-purpose computation on a GPU (GPGPU). Even though many programming languages, software tools, and libraries have been proposed to facilitate GPGPU programming, the unusual and specific programming model of the GPU remains a significant barrier to writing GPGPU programs. In this paper, we introduce a novel compiler-based approach to GPGPU programming. Compiler directives are used to label code fragments that are to be executed on the GPU. Our GPGPU compiler, Guru, converts the labeled code fragments into ISO-compliant C code that contains the appropriate OpenGL and Cg API calls. A native C compiler can then be used to compile it into executable code for the GPU. Our compiler is implemented on top of the Open64 compiler infrastructure. Preliminary experimental results on selected benchmarks show that our compiler produces significant performance improvements for programs that exhibit a high degree of data parallelism.
14.
Salvatore Di Gregorio Giuseppe Filippone William Spataro Giuseppe A. Trunfio 《Journal of Parallel and Distributed Computing》2013
In the field of wildfire risk management, so-called burn probability maps (BPMs) are increasingly used to estimate the probability of each point of a landscape being burned under certain environmental conditions. Such BPMs are usually computed through the explicit simulation of thousands of fires using fast and accurate models. However, even with the most optimized algorithms, building simulation-based BPMs for large areas is a highly intensive computational process that makes the use of high-performance computing mandatory. In this paper, General-Purpose computation on Graphics Processing Units (GPGPU) is applied, in conjunction with a wildfire simulation model based on the Cellular Automata approach, to the BPM building process. Using three different GPGPU devices, the paper illustrates several implementation strategies to speed up the overall mapping process and discusses numerical results obtained on a real landscape.
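The BPM construction itself is an embarrassingly parallel Monte Carlo aggregation, which is what makes it a natural GPGPU target. A tiny sequential sketch with a made-up stochastic spread rule standing in for the paper's CA fire model:

```python
import random

def simulate_fire(n, ignition, spread_prob, rng):
    """One stochastic fire on an n x n grid: burning cells ignite their
    4-neighbours with probability spread_prob (a toy spread rule)."""
    burned = {ignition}
    frontier = [ignition]
    while frontier:
        nxt = []
        for (i, j) in frontier:
            for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if 0 <= a < n and 0 <= b < n and (a, b) not in burned \
                        and rng.random() < spread_prob:
                    burned.add((a, b))
                    nxt.append((a, b))
        frontier = nxt
    return burned

def burn_probability_map(n, runs, spread_prob, seed=0):
    """Average the burn indicator over many simulated fires with random
    ignition points; each run is independent, hence trivially parallel."""
    rng = random.Random(seed)
    counts = [[0] * n for _ in range(n)]
    for _ in range(runs):
        ignition = (rng.randrange(n), rng.randrange(n))
        for (i, j) in simulate_fire(n, ignition, spread_prob, rng):
            counts[i][j] += 1
    return [[c / runs for c in row] for row in counts]

bpm = burn_probability_map(n=16, runs=200, spread_prob=0.3)
```

Because the runs never communicate, a GPGPU implementation can assign whole fire simulations (or groups of cells within one simulation) to independent threads and reduce the burn counts at the end.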
15.
The general-purpose graphics processing unit (GPGPU) is a popular accelerator for general applications such as scientific computing, because such applications are massively parallel and can exploit the significant parallel computing power of GPUs. However, distributing the workload among the large number of cores, i.e., choosing the execution configuration of a GPGPU kernel, is still a manual trial-and-error process: programmers try out a few configurations by hand and may settle for a sub-optimal one, leading to poor performance and/or high power consumption. This paper presents an auto-tuning approach for GPGPU applications based on performance and power models. First, a model-based analytic approach for estimating the performance and power consumption of kernels is proposed. Second, an auto-tuning framework is proposed for automatically obtaining a near-optimal configuration for a kernel computation. We formulate finding an optimal configuration as a constrained optimization problem and solve it using either simulated annealing (SA) or a genetic algorithm (GA). Experimental results show that the fidelity of the proposed models for performance and energy consumption is 0.86 and 0.89, respectively. Further, the optimization algorithms yield a normalized optimality offset of 0.94% and 0.79% for SA and GA, respectively.
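The simulated-annealing search over execution configurations can be sketched as below. The configuration space and the cost function are invented stand-ins for the paper's analytic performance/power models (here the "sweet spot" is arbitrarily placed at 256 threads × 8 blocks):

```python
import math
import random

# candidate execution configurations: (threads per block, number of blocks)
CONFIGS = [(t, b) for t in (32, 64, 128, 256, 512) for b in (1, 2, 4, 8, 16)]

def cost(cfg):
    """Invented stand-in for a model-based time/energy estimate:
    penalises configurations far from 256 threads x 8 blocks."""
    t, b = cfg
    return abs(math.log2(t) - 8) + abs(math.log2(b) - 3)

def anneal(configs, cost, steps=2000, t0=2.0, seed=1):
    """Simulated annealing: accept worse configurations with a probability
    that shrinks as the temperature cools, to escape local minima."""
    rng = random.Random(seed)
    cur = rng.choice(configs)
    best = cur
    for k in range(steps):
        temp = t0 * (1 - k / steps) + 1e-9
        cand = rng.choice(configs)          # neighbour: any other configuration
        d = cost(cand) - cost(cur)
        if d <= 0 or rng.random() < math.exp(-d / temp):
            cur = cand
        if cost(cur) < cost(best):
            best = cur
    return best

best = anneal(CONFIGS, cost)
```

With a real model in place of `cost`, the same loop searches the kernel's launch-configuration space without ever running the kernel, which is the point of the model-based approach.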
16.
Konstantinidis EI Frantzidis CA Pappas C Bamidis PD 《Computer methods and programs in biomedicine》2012,107(1):16-27
In this paper, the feasibility of adopting graphics processing units for real-time emotion-aware computing is investigated, with the aim of boosting the time-consuming computations employed in such applications. The proposed methodology was applied to the analysis of encephalographic and electrodermal data gathered while participants passively viewed emotionally evocative stimuli. The effectiveness of the GPU in processing electroencephalographic and electrodermal recordings is demonstrated by comparing the execution time of chaos/complexity analysis through nonlinear dynamics (multi-channel correlation dimension/D2) and of signal processing algorithms (computation of the skin conductance level/SCL) across several popular programming environments. Apart from the beneficial role of parallel programming, the adoption of careful design techniques for memory management may further enhance the speedup, which approaches a factor of 30 in comparison with ANSI C (single-core sequential execution). Therefore, the use of the GPU's parallel capabilities offers a reliable and robust solution for real-time sensing of the user's affective state.
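The D2 computation the GPU accelerates is built on the Grassberger-Procaccia correlation sum, whose all-pairs distance loop is naturally parallel. A minimal sketch over a toy scalar series; the delay-embedding parameters and the sine signal are invented for illustration, not taken from the paper's EEG pipeline:

```python
import math

def embed(series, dim, delay):
    """Takens delay embedding of a scalar series into dim-dimensional vectors."""
    n = len(series) - (dim - 1) * delay
    return [tuple(series[i + k * delay] for k in range(dim)) for i in range(n)]

def correlation_sum(points, r):
    """Grassberger-Procaccia C(r): fraction of distinct point pairs closer
    than r.  The all-pairs loop is the part that parallelises trivially
    on a GPU; D2 is then estimated from the slope of log C(r) vs log r."""
    n = len(points)
    close = 0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < r:
                close += 1
    return 2 * close / (n * (n - 1))

series = [math.sin(0.3 * t) for t in range(300)]
points = embed(series, dim=3, delay=2)
c_small = correlation_sum(points, 0.1)
c_large = correlation_sum(points, 1.0)       # C(r) is non-decreasing in r
```

On a GPU, each thread can evaluate one row (or tile) of the pairwise-distance matrix and the counts are combined with a parallel reduction.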
17.
With the continuing development of GPGPU computing technology, the architecture of HPC systems is quietly undergoing a transformation that offers a new direction for high-performance computing. CUDA is a C-language programming platform provided by NVIDIA for developing parallel computing applications on GPGPUs. With it, the high-performance computing capability of suitable graphics cards can be harnessed for large-scale high-performance computation, effectively improving the utilization of computer systems. This paper surveys the current state of GPU development and describes how to develop parallel computing software using CUDA.
18.
《Journal of Systems Architecture》2014,60(5):420-430
General-purpose graphics processing units (GPGPUs) play an important role in massively parallel computing nowadays. A GPGPU core typically holds thousands of threads, with hardware threads organized into warps. With the single instruction, multiple thread (SIMT) pipeline, GPGPUs can achieve high performance, but threads taking different branches within the same warp violate the SIMD execution style and cause branch divergence. To support divergent branches, a hardware stack is used to execute all branches sequentially, so branch divergence leads to performance degradation. This article represents the PDOM (post-dominator) stack as a binary tree, with each leaf corresponding to a branch target. We propose a new PDOM stack called PDOM-ASI, which can schedule all the tree leaves. The new stack can hide more long-operation latencies with more schedulable warps, without the problem of warp over-subdivision. Besides, a multi-level warp scheduling policy is proposed, which lets part of the warps run ahead and creates more opportunities to hide latencies. Simulation results show that our policies achieve a 10.5% performance improvement over the baseline policies with only 1.33% hardware area overhead.
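The divergence problem can be illustrated by simulating a warp with an active mask and a reconvergence stack. This is a toy interpreter for a single if/else (a baseline PDOM-style mechanism), not the article's PDOM-ASI hardware design:

```python
def run_warp(values):
    """Simulate a 4-thread warp on: `if v % 2 == 0: v += 10 else: v -= 10`.
    The stack serialises the two branch targets; `cycles` counts how many
    times the warp issues the branch-body instruction, so a divergent warp
    pays for both paths while a uniform warp pays for one."""
    mask = [True] * len(values)
    taken = [v % 2 == 0 for v in values]
    then_mask = [m and t for m, t in zip(mask, taken)]
    else_mask = [m and not t for m, t in zip(mask, taken)]

    # PDOM-style stack: each entry is (active mask, branch body as a delta)
    stack = []
    if any(then_mask):
        stack.append((then_mask, +10))
    if any(else_mask):
        stack.append((else_mask, -10))

    out = values[:]
    cycles = 0
    while stack:                      # branch targets execute one at a time
        active, delta = stack.pop()
        cycles += 1                   # the whole warp issues; only active lanes write
        out = [v + delta if a else v for v, a in zip(out, active)]
    return out, cycles

out, cycles = run_warp([1, 2, 3, 4])  # divergent: both paths issued
```

In this baseline model only the top stack entry is runnable; the article's contribution is making all leaves of the branch tree schedulable so that their latencies can overlap.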
19.
FuzzyCLIPS is a rule-based programming language well suited to developing fuzzy expert systems, but it usually requires much longer execution times than algorithmic languages such as C and Java. To address this problem, we propose a parallel version of FuzzyCLIPS that parallelizes the execution of a fuzzy expert system with data dependence on a cluster system. We have designed extended parallel syntax following the original FuzzyCLIPS style. To simplify the programming model of parallel FuzzyCLIPS, we hide, as much as possible, the tasks of parallel processing from programmers and implement them in the inference engine using MPI, the de facto standard for parallel programming on cluster systems. Furthermore, a load-balancing function has been implemented in the inference engine to adapt to the heterogeneity of computing nodes. It intelligently allocates different amounts of workload to different computing nodes according to the results of dynamic performance monitoring; the programmer only needs to invoke the function in the program for better load balancing. To verify our design and evaluate the performance, we have implemented a human-resource website. Experimental results show that the proposed parallel FuzzyCLIPS can achieve superlinear speedup and provide a more reasonable response time.
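The load-balancing function described above amounts to splitting work in proportion to measured node speeds. A sketch of such a proportional allocation (the speed figures are invented; the real system would obtain them from dynamic performance monitoring):

```python
def allocate(total, speeds):
    """Split `total` work units proportionally to node speeds, using
    largest-remainder rounding so every unit is assigned exactly once."""
    s = sum(speeds)
    shares = [total * sp / s for sp in speeds]
    alloc = [int(x) for x in shares]                 # floor of each share
    leftover = total - sum(alloc)
    by_remainder = sorted(range(len(speeds)),
                          key=lambda i: shares[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:                # hand out the remainder
        alloc[i] += 1
    return alloc

# three heterogeneous nodes: the node measured twice as fast
# receives roughly half of the facts/rules to process
alloc = allocate(100, speeds=[2.0, 1.0, 1.0])
```

Re-running the allocation whenever the monitored speeds change gives the adaptive behaviour the abstract describes, without the programmer managing node assignments by hand.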