20 similar documents found; search took 15 ms
1.
Load balancing and OpenMP implementation of nested parallelism (total citations: 1; self-citations: 0; citations by others: 1)
Many problems have multiple layers of parallelism. The outer level may consist of a few coarse-grained tasks. Each of these tasks may in turn be rich in parallelism and be split into a number of fine-grained tasks, which again may consist of even finer subtasks, and so on. Here we argue, and demonstrate by examples, that utilizing multiple layers of parallelism may give much better scaling than restricting oneself to only one level of parallelism. Two non-trivial issues for multi-level parallelism are load balancing and implementation. In this paper we provide an algorithm for finding good distributions of threads to tasks and discuss how to implement nested parallelism in OpenMP.
2.
Dieter an Mey Samuel Sarholz Christian Terboven 《International journal of parallel programming》2007,35(5):459-476
OpenMP is widely accepted as a de facto standard for shared memory parallel programming in Fortran, C and C++. Nested parallelization was included in the first OpenMP specification, but it took a few years until the first commercially available compilers supported this optional part of the specification. We employed nested parallelization using OpenMP in three production codes: a C++ code for content-based image retrieval, a C++ code for the computation of critical points in multi-block CFD datasets, and a multi-block Navier-Stokes solver written in Fortran90. In this paper we discuss the opportunities as well as the deficiencies of the nested parallelization support in OpenMP.
3.
Alan Morris Allen D. Malony Sameer S. Shende 《International journal of parallel programming》2007,35(4):417-436
Nested OpenMP parallelism allows an application to spawn teams of nested threads. This hierarchical nature of thread creation and usage poses problems for performance measurement tools that must determine thread context to properly maintain per-thread performance data. In this paper we describe the problem and a novel solution for identifying threads uniquely. Our approach has been implemented in the TAU performance system and has been successfully used in profiling and tracing OpenMP applications with nested parallelism. We also describe how extensions to the OpenMP standard can help tool developers uniquely identify threads.
4.
5.
Haoqiang Jin Barbara Chapman Lei Huang Dieter an Mey Thomas Reichstein 《International journal of parallel programming》2008,36(3):312-325
We describe a performance study of a multi-zone application benchmark implemented in several OpenMP approaches that exploit multi-level parallelism and deal with unbalanced workload. The multi-zone application was derived from the well-known NAS Parallel Benchmarks (NPB) suite that involves flow solvers on collections of loosely coupled discretization meshes. Parallel versions of this application have been developed using the Subteam concept and Workqueuing model as extensions to the current OpenMP. We examine the performance impact of these extensions to OpenMP and compare with hybrid and nested OpenMP approaches on several large parallel systems.
6.
Existing OpenMP cost models are relatively simple: they neither fully account for the execution details of OpenMP programs nor adapt to different parallel loop execution schemes. To address these problems, we extend the existing cost model in Open64, a state-of-the-art production-grade optimizing compiler, and build a cost model for profitability analysis in OpenMP auto-parallelization that takes a single candidate parallel loop as its unit of analysis. On top of an improved version of Open64's original DOALL parallel cost model, the model adds cost models for DOACROSS pipelined parallelism and for DSWP parallelism. Experimental results show that the proposed cost model captures the trend of parallel loop execution overhead well and provides effective support for profitability analysis in OpenMP auto-parallelization.
7.
WANG Jue, HU ChangJun, ZHANG JiLin & LI JianJiang (School of Information Engineering, University of Science and Technology Beijing, Beijing, China) 《Science China Information Sciences》2010,(5):932-944
OpenMP is an emerging industry standard for shared memory architectures. While OpenMP has advantages in its ease of use and incremental programming, message passing is today still the most widely used programming model for distributed memory architectures. How to effectively extend OpenMP to distributed memory architectures has been a hot research topic. This paper proposes an OpenMP system, called KLCoMP, for distributed memory architectures. Based on the partially replicating shared arrays memory model, we propose ...
8.
《Science China Information Sciences》2012,(9):1961-1971
In light of GPUs' powerful floating-point operation capacity, heterogeneous parallel systems incorporating general purpose CPUs and GPUs have become a highlight in the research field of high performance computing (HPC). However, due to the complexity of programming on GPUs, porting a large number of existing scientific computing applications to heterogeneous parallel systems remains a big challenge. The OpenMP programming interface is widely adopted on multi-core CPUs in the field of scientific computing. To effectively reuse existing OpenMP applications and reduce the porting cost, we extend OpenMP with a group of compiler directives, which explicitly divide tasks between the CPU and the GPU, and map time-consuming computing fragments to run on the GPU, thus dramatically simplifying the port. We have designed and implemented MPtoStream, a compiler for the extended OpenMP targeting AMD's stream processing GPUs. Our experimental results show that programming with the extended directives deviates from programming with OpenMP by less than 11% modification and achieves significant speedups ranging from 3.1 to 17.3 on a heterogeneous system, incorporating an Intel Xeon E5405 CPU and an AMD FireStream 9250 GPU, over execution on the Xeon CPU alone.
9.
Miloš Milovanović Roger Ferrer Vladimir Gajinov Osman S. Unsal Adrian Cristal Eduard Ayguadé Mateo Valero 《International journal of parallel programming》2008,36(3):326-346
Future generations of Chip Multiprocessors (CMP) will provide dozens or even hundreds of cores inside the chip. Writing applications that benefit from the massive computational power offered by these chips is not going to be an easy task for mainstream programmers who are used to sequential algorithms rather than parallel ones. This paper explores the possibility of using Transactional Memory (TM) in OpenMP, the industrial standard for writing parallel programs on shared-memory architectures, for C, C++ and Fortran. One of the major complexities in writing OpenMP applications is the use of critical regions (locks), atomic regions and barriers to synchronize the execution of parallel activities in threads. TM has been proposed as a mechanism that abstracts some of the complexities associated with concurrent access to shared data while enabling scalable performance. The paper presents a first proof-of-concept implementation of OpenMP with TM. Some language extensions to OpenMP are proposed to express transactions. These extensions are implemented in our source-to-source OpenMP Mercurium compiler and our Software Transactional Memory (STM) runtime system Nebelung that supports the code generated by Mercurium. Hardware Transactional Memory (HTM) or Hardware-assisted STM (HaSTM) are seen as possible paths to make the tandem TM-OpenMP more scalable. In the evaluation section we show the preliminary results. The paper finishes with a set of open issues that still need to be addressed, either in OpenMP or in the hardware/software implementations of TM.
10.
In this paper, an efficient unstructured mesh calculation method for OpenMP parallel computation on a multi-core processor is proposed. This is a new domain decomposition method with two characteristics. The first is to set the size of each sub-block of the computation domain according to the size of the cache memory of each core. The second is to reduce idle time by distributing the defined sub-blocks to the cores appropriately. Using the proposed method, a computation of compressible flow around an airplane achieved a speed-up of more than about 20% in comparison with a conventional method.
11.
Checkpoint/restart is one of the important approaches to software fault tolerance. This paper describes a hybrid system-level and application-level OpenMP checkpointing mechanism. The system-level support not only gives the checkpointing system good transparency, but also means that saving shared data is no longer done solely by the master thread, which yields good data locality. The application-level OpenMP protocol separates out the OpenMP-related protocol handling, improving the system's portability. Results on NPB3.2-OMP show that the time overhead of checkpointing and restart is small and can meet the practical needs of large-scale programs.
12.
Alejandro Duran Roger Ferrer Eduard Ayguadé Rosa M. Badia Jesus Labarta 《International journal of parallel programming》2009,37(3):292-305
Tasking in OpenMP 3.0 has been conceived to handle the dynamic generation of unstructured parallelism. New directives have been added allowing the user to identify units of independent work (tasks) and to define points to wait for the completion of tasks (task barriers). In this document we propose extensions to allow the runtime detection of dependencies between generated tasks, broadening the range of applications that can benefit from tasking and improving performance when load balancing or locality are critical issues for performance. The proposed extensions are evaluated on an SGI Altix multiprocessor architecture using a couple of small applications and a prototype runtime system implementation.
13.
OpenACC is a directive-based programming model which allows programmers to write graphic processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are natural choices to express nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of memory hierarchy. Parallel loops can be arbitrarily nested or be placed inside functions that would be (possibly recursively) called in other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory allocation routines. The PFACC runtime iteration-sharing routine is a two-level mechanism. Thread blocks dynamically organize loop iterations into batches and execute the batches in a depth-first order. Different thread blocks share iterations among one another with an iteration-stealing mechanism. PFACC generates CUDA programs with reasonable memory usage because of the depth-first execution order. The two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in terms of performance and code size on most benchmarks.
14.
Agent-based models, an emerging paradigm of simulation of complex systems, appear very suitable to parallel processing. However, during the parallelization of a simulator of financial markets, we found that some features of these codes highlight non-trivial issues of the present hardware/software platforms for parallel processing. Here we present the results of a series of tests, on different platforms, of simplified codes that reproduce such problems and can be used as a starting point in the search of a possible solution.
15.
The widespread use of multicore processors is not a consequence of significant advances in parallel programming. In contrast, multicore processors arise due to the complexity of building power-efficient, high-clock-rate, single-core chips. Automatic parallelization of sequential applications is the ideal solution for making parallel programming as easy as writing programs for sequential computers. However, automatic parallelization remains a grand challenge due to its need for complex program analysis and the existence of unknowns during compilation. This paper proposes a new method for converting a sequential application into a parallel counterpart that can be executed on current multicore processors. It hinges on an intermediate representation based on the concept of domain-independent kernel (e.g., assignment, reduction, recurrence). Such kernel-centric view hides the complexity of the implementation details, enabling the construction of the parallel version even when the source code of the sequential application contains different syntactic variations of the computations (e.g., pointers, arrays, complex control flows). Experiments that evaluate the effectiveness and performance of our approach with respect to state-of-the-art compilers are also presented. The benchmark suite consists of synthetic codes that represent common domain-independent kernels, dense/sparse linear algebra and image processing routines, and full-scale applications from SPEC CPU2000.
16.
A method and results of static and dynamic analysis of Pascal programs are described. In order to investigate characteristics of large systems programs developed by the stepwise refinement programming approach and written in Pascal, several Pascal compilers written in Pascal were analysed from both static and dynamic points of view. As a main conclusion, procedures play an important role in the stepwise refinement approach and implementors of a compiler and designers of high level language machines for Pascal-like languages should pay careful attention to this point. The set data structure is one of the characteristics of the Pascal language and statistics of set operations are also described.
17.
18.
Multi-core processors can improve the performance of multi-threaded programs, but the many pre-existing single-threaded programs cannot benefit from them, and programmers are accustomed to writing single-threaded code. Auto-parallelization is an important means of porting single-threaded programs to multi-core processors, but traditional auto-parallelization fails when loops contain indeterminable data dependences or complex control flow. For loops where traditional auto-parallelization fails, Ottoni et al. proposed the Decoupled Software Pipelining (DSWP) algorithm to achieve fine-grained instruction-level parallelism, but it requires deep knowledge of the processor architecture and hardware support for inter-core communication queues and special instructions, which limits its parallel performance and broad applicability. DSWP parallelization implemented on the OpenMP API does not depend on hardware support for inter-core communication queues or special instructions and is not restricted to one platform, but the existing OpenMP task scheduling mechanism cannot satisfy the scheduling requirements of DSWP parallelism. We extend the existing OpenMP task scheduling mechanism with a task-to-thread binding attribute, guaranteeing correct execution of OpenMP-based DSWP parallel programs. We extend GCC's OpenMP runtime library libgomp with the task-binding attribute clause; the extended GCC serves as the base compiler for OpenMP DSWP programs and provides support for auto-parallelization. Tests on the NPB3.3.1 benchmark suite show that loops where traditional auto-parallelization fails achieve an average speedup of at least 1.23 on a dual-core processor after OpenMP DSWP auto-parallelization; parallel programs generated by an Open64 compiler augmented with the OpenMP DSWP algorithm achieve average speedups 22% and 26% higher than programs produced by the Intel compiler and the Open64 compiler using only traditional auto-parallelization, respectively.
19.
20.
P. E. Hadjidoukas V. V. Dimakopoulos M. Delakis C. Garcia 《Concurrency and Computation》2009,21(15):1819-1837
We present the development of a novel high-performance face detection system using a neural network-based classification algorithm and an efficient parallelization with OpenMP. We discuss the design of the system in detail along with experimental assessment. Our parallelization strategy starts with one level of threads and moves to the exploitation of nested parallel regions in order to further improve, by up to 19%, the image-processing capability. The presented system is able to process images in real time (38 images/sec) by sustaining almost linear speedups on a system with a quad-core processor and a particular OpenMP runtime library. Copyright © 2009 John Wiley & Sons, Ltd.