期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Takashi Yanagawa Kenji Suehiro 《Parallel Computing》2004,30(12):1315-1327

The Earth Simulator (ES) is a large scale, distributed memory, parallel computer system consisting of 640 processor nodes (PN) with shared memory vector multi-processors (64GFLOPS/PN, 5120 APs in total, AP: arithmetic processor). All the nodes are connected via a high speed (16GB/s) single-stage crossbar network called the Interconnection Network (IN).

The operating system for the Earth Simulator is based on SUPER-UX, the UNIX operating system for the SX series scientific supercomputers. In order to realize high-performance parallel processing on the highly parallel machine, the operating system is enhanced for scalability.

The Earth Simulator system is managed as a two-level cluster system called the Super Cluster System. In the Super Cluster System, the Earth Simulator system is divided into 40 clusters (16 PNs/cluster). A single controller called Super Cluster Control Station (SCCS) manages all these clusters. This management system provides Single System Image (SSI) operation, management and job control for the large scale multi-node system.

The Job Scheduler (JS) and NQS running on the SCCS control all jobs of the system. They schedule the resources such as processing nodes and files which have not usually been treated as scheduling resources. This allows efficient scheduling of large scale jobs.

The MPI library (MPI/ES) and the HPF compiler (HPF/ES) are available for distributed parallel programming on the Earth Simulator. MPI/ES conforms to the MPI 2.0 standard and is optimized to exploit the hardware features. HPF/ES conforms to the core part of HPF 2.0 and supports some features of the HPF 2.0 approved extensions and HPF/JA 1.0 extensions. HPF/ES suitably handles the 3-level parallelism of the Earth Simulator system, that is, vectorization, shared-memory parallelization, and distributed-memory parallelization. Moreover, HPF/ES extends the language to easily handle irregular problems. 相似文献

2.

Compilation techniques for parallel systems

《Parallel Computing》1999,25(13-14):1741-1783

Over the past two decades tremendous progress has been made in both the design of parallel architectures and the compilers needed for exploiting parallelism on such architectures. In this paper we summarize the advances in compilation techniques for uncovering and effectively exploiting parallelism at various levels of granularity. We begin by describing the program analysis techniques through which parallelism is detected and expressed in form of a program representation. Next compilation techniques for scheduling instruction level parallelism (ILP) are discussed along with the relationship between the nature of compiler support and type of processor architecture. Compilation techniques for exploiting loop and task level parallelism on shared-memory multiprocessors (SMPs) are summarized. Locality optimizations that must be used in conjunction with parallelization techniques for achieving high performance on machines with complex memory hierarchies are also discussed. Finally we provide an overview of compilation techniques for distributed memory machines that must perform partitioning of both code and data for parallel execution. Communication optimization and code generation issues that are unique to such compilers are also briefly discussed. 相似文献

3.

基于SMP集群的多层次并行编程模型与并行优化技术* 总被引：4，自引：0，他引：4

单莹吴建平王正华《计算机应用研究》2006,23(10):254-256

详细描述了适用于SMP集群这种多层次并行体系结构的混合并行编程模型MPI／OpenMP,它提供了实现SMP节点间和节点内多层次并行的机制。在此基础上结合实用的性能评价方法,分别介绍了MPI,OpenMP和单处理器三个层次上的一些常用和有效的并行优化技术,并指出单处理器性能优化是提高并行程序性能一个不容忽视的问题。相似文献

4.

OpenMP for Networks of SMPs

Y. Charlie Hu Honghui Lu Alan L. Cox Willy Zwaenepoel 《Journal of Parallel and Distributed Computing》2000,60(12):160

In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared-memory (SDSM) system. In contrast to previous SDSM systems for SMPs, the modified TreadMarks system uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intranode hardware shared memory. We present performance results for seven applications (Barnes-Hut, CLU, and Water from SPLASH-2, 3D-FFT from NAS, Red-Black SOR, TSP, and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the thread implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes and consequently achieves speedups that are up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7–30% of the MPI versions. 相似文献

5.

MPI+TBB混合并行编程模型在分子动力学中的应用

白明泽赵文辉豆育升孙世新温迪《计算机应用研究》2012,29(5):1772-1774

为了提高分子动力学模拟在对称多处理(SMP)集群上的计算速度,在分子动力学并行方法中引入MPI+TBB的混合并行编程模型。基于该模型,在分子动力学软件LAMMPS中设计并实现混合并行算法,在节点间采用MPI及空间分解技术实施进程级并行,节点内采用TBB及临界区技术实施线程级并行。在SMP集群中的测试表明,该方法在体系较大以及节点数较多时可以明显减少通信时间,使加速比在纯MPI模型上提高45%。结果表明,MPI+TBB混合并行编程模型可促进分子动力学并行模拟且效率明显提升。相似文献

6.

A Library-Based Approach to Task Parallelism in a Data-Parallel Language

Ian Foster David R. Kohr Jr. Rakesh Krishnaiyer Alok Choudhary 《Journal of Parallel and Distributed Computing》1997,45(2):284

Pure data-parallel languages such as High Performance Fortran version 1 (HPF) do not allow efficient expression of mixed task/data-parallel computations or the coupling of separately compiled data-parallel modules. In this paper, we show how these common parallel program structures can be represented, with only minor extensions to the HPF model, by using a coordination library based on the Message Passing Interface (MPI). This library allows data-parallel tasks to exchange distributed data structures using calls to simple communication functions. We present microbenchmark results that characterize the performance of this library and that quantify the impact of optimizations that allow reuse of communication schedules in common situations. In addition, results from two-dimensional FFT, convolution, and multiblock programs demonstrate that the HPF/MPI library can provide performance superior to that of pure HPF. We conclude that this synergistic combination of two parallel programming standards represents a useful approach to task parallelism in a data-parallel framework, increasing the range of problems addressable in HPF without requiring complex compiler technology. 相似文献

7.

Parallel medical image reconstruction: from graphics processing units (GPU) to Grids

Maraike Schellmann Sergei Gorlatch Dominik Meiländer Thomas Kösters Klaus Schäfers Frank Wübbeling Martin Burger 《The Journal of supercomputing》2011,57(2):151-160

We present and compare a variety of parallelization approaches for a real-world case study on modern parallel and distributed computer architectures. Our case study is a production-quality, time-intensive algorithm for medical image reconstruction used in computer tomography (PET). We parallelize this algorithm for the main kinds of contemporary parallel architectures: shared-memory multiprocessors, distributed-memory clusters, graphics processing units (GPU) using the CUDA framework, the Cell processor and, finally, how various architectures can be accessed in a distributed Grid environment. The main contribution of the paper, besides the parallelization approaches, is their systematic comparison regarding four important criteria: performance, programming comfort, accessibility, and cost-effectiveness. We report results of experiments on particular parallel machines of different architectures that confirm the findings of our systematic comparison. 相似文献

8.

Optimizing OpenMP Programs on Software Distributed Shared Memory Systems

Min Seung-Jai Basumallik Ayon Eigenmann Rudolf 《International journal of parallel programming》2003,31(3):225-249

This paper describes compiler techniques that can translate standard OpenMP applications into code for distributed computer systems. OpenMP has emerged as an important model and language extension for shared-memory parallel programming. However, despite OpenMP's success on these platforms, it is not currently being used on distributed system. The long-term goal of our project is to quantify the degree to which such a use is possible and develop supporting compiler techniques. Our present compiler techniques translate OpenMP programs into a form suitable for execution on a Software DSM system. We have implemented a compiler that performs this basic translation, and we have studied a number of hand optimizations that improve the baseline performance. Our approach complements related efforts that have proposed language extensions for efficient execution of OpenMP programs on distributed systems. Our results show that, while kernel benchmarks can show high efficiency of OpenMP programs on distributed systems, full applications need careful consideration of shared data access patterns. A naive translation (similar to OpenMP compilers for SMPs) leads to acceptable performance in very few applications only. However, additional optimizations, including access privatization, selective touch, and dynamic scheduling, resulting in 31% average improvement on our benchmarks. 相似文献

9.

基于SMP集群系统的并行编程模式研究与分析

宋伟宋玉《微机发展》2007,17(2):164-167

并行计算技术是计算机技术发展的重要方向之一,SMP与集群是当前主流的并行体系结构。当前并行程序设计方法主要采用基于消息传递模型的MPI和基于共享存储模型的OpenMP,两种编程模式各有特点和适用范围。对SMP集群以及MPI和OpenMP的特点进行了分析,介绍了在SMP集群系统中利用MPI和OpenMP混合编程的可行性方法。相似文献

10.

VCluster: a thread‐based Java middleware for SMP and heterogeneous clusters with thread migration support

Hua Zhang Joohan Lee Ratan Guha 《Software》2008,38(10):1049-1071

Clusters, composed of symmetric multiprocessor (SMP) machines and heterogeneous machines, have become increasingly popular for high‐performance computing. Message‐passing libraries, such as message‐passing interface (MPI) and parallel virtual machine (PVM), are de facto parallel programming libraries for clusters that usually consist of homogeneous and uni‐processor machines. For SMP machines, MPI is combined with multithreading libraries like POSIX Thread and OpenMP to take advantage of the architecture. In addition to existing parallel programming libraries that are in C/C++ and FORTRAN programming languages, the Java programming language presents itself as another alternative with its object‐oriented framework, platform neutral byte code, and ever‐increasing performance. This paper presents a new parallel programming model and a library, VCluster, which implements this model. VCluster is based on migrating virtual threads instead of processes to support clusters of SMP machines more efficiently. The implementation uses thread migration, which can be used in dynamic load balancing. VCluster was developed in pure Java, utilizing the portability of Java to support clusters of heterogeneous machines. Several applications are developed to illustrate the use of this library and compare the usability and performance of VCluster with other approaches. Copyright © 2007 John Wiley & Sons, Ltd. 相似文献

11.

P3 L: A structured high-level parallel language,and its structured support

Bruno Bacci Marco Danelutto Salvatore Orlando Susanna Pelagatti Marco Vanneschi 《Concurrency and Computation》1995,7(3):225-255

The paper presents a parallel programming methodology that ensures easy programming, efficiency and portability of programs to different machines belonging to the class of the general-purpose, distributed-memory, MIMD architectures. The methodology is based on the definition of a new, high-level, explicitly parallel language, called P³ L, and of a set of static tools that automatically adapt the program features for each target architecture. P³ L does not require programmers to specify process activations, the actual parallelism degree, scheduling, or interprocess communications, i.e. all those features that need to be adjusted to harness each specific target machine. Parallelism is, on the other hand, expressed in a structured and qualitative way, by hierarchical composition of a restricted set of language constructs, corresponding to those forms of parallelism that are frequently encountered in parallel applications, and that can be efficiently implemented. The efficient portability of P³ L applications is guaranteed by the compiler along with the novel structure of the support. The compiler automatically adapts the program features for each specific architecture, using the costs (in terms of performance) of the low-level mechanisms exported by the architecture itself. In our methodology, these costs, along with other features of the architecture, are viewed through an abstract machine, whose interface is used by the compiler to produce the final object code. 相似文献

12.

Evaluating automatic parallelization in SUIF

Sungdo Moon Byoungro So Hall M.W. 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(1):36-49

This paper presents the results of an experiment to measure empirically the remaining opportunities for exploiting loop-level parallelism that are missed by the Stanford SUIF compiler, a state-of-the-art automatic parallelization system targeting shared-memory multiprocessor architectures. For the purposes of this experiment, we have developed a run-time parallelization test called the Extended Lazy Privatizing Doall (ELPD) test, which is able to simultaneously test multiple loops in a loop nest. The ELPD test identifies a specific type of parallelism where each iteration of the loop being tested accesses independent data, possibly by making some of the data private to each processor. For 29 programs in three benchmark suites, the ELPD test was executed at run time for each candidate loop left unparallelized by the SUIF compiler to identify which of these loops could safely execute in parallel for the given program input. The results of this experiment point to two main requirements for improving the effectiveness of parallelizing compiler technology: incorporating control flow tests into analysis and extracting low-cost run-time parallelization tests from analysis results 相似文献

13.

Portable, parallelizing Pascal compiler

Gabber E. Averbuch A. Yehudai A. 《Software, IEEE》1993,10(2):71-81

The structure and implementation of P³C, a compiler that uses relatively simple parallelization techniques to produce highly efficient code for both shared-memory and distributed-memory MIMD multiprocessors, are discussed. P³C is a fully automatic, portable parallelizing Pascal compiler for scientific code, which is characterized by loops operating on regular data structures. P³C compiles the same source code to all target machines without modification. P³C gives an accurate estimate about whether the parallel code will actually reduce execution time over serial code, taking into account the associated overhead. It derives the estimate by statically analyzing the program at compile time, referring to a table of the target machine's parameters. Speedups of up to 24 have been achieved on a Sequent Symmetry with 25 active processors. The estimated speedup was within 8% of the actual figure 相似文献

14.

基于OpenMP/MPI并行编程模型的N体问题的优化实现

祝永志续士强禹继国《计算机工程与应用》2016,52(5):16-21

多核集群的层次化并行编程模型一直是高性能计算的研究热点。以SMP集群为例,从硬件上可分为节点间和节点内的两层架构。阐述了层次化并行编程的实现技术,针对N体问题算法进行了基于Hybrid并行编程模型的并行化研究。提出了一种块同步MPI/OpenMP细粒度N体问题的优化算法。基于曙光TC5000A集群,将该算法与传统的N体并行算法进行了执行时间与加速比的比较,得出了几句总结性具体论述。相似文献

15.

Extending OpenMP to Survive the Heterogeneous Multi-Core Era

Eduard Ayguadé Rosa M. Badia Pieter Bellens Daniel Cabrera Alejandro Duran Roger Ferrer Marc Gonzàlez Francisco Igual Daniel Jiménez-González Jesús Labarta Luis Martinell Xavier Martorell Rafael Mayo Josep M. Pérez Judit Planas Enrique S. Quintana-Ortí 《International journal of parallel programming》2010,38(5-6):440-459

This paper advances the state-of-the-art in programming models for exploiting task-level parallelism on heterogeneous many-core systems, presenting a number of extensions to the OpenMP language inspired in the StarSs programming model. The proposed extensions allow the programmer to write portable code easily for a number of different platforms, relieving him/her from developing the specific code to off-load tasks to the accelerators and the synchronization of tasks. Our results obtained from the StarSs instantiations for SMPs, the Cell, and GPUs report reasonable parallel performance. However, the real impact of our approach in is the productivity gains it yields for the programmer. 相似文献

16.

Parallelization of a multi-blocked CFD code via three strategies for fluid flow and heat transfer analysis

Rongguang Jia Bengt Sundén 《Computers & Fluids》2004,33(1):57-80

This paper reports on a parallel implementation of a general 3D multi-block CFD code. The parallelization is achieved by using three strategies. Firstly, it is done on dual-processor PC-clusters where Windows NT systems are running. A multi-thread programming model is adopted for the multi-block code, where one thread corresponds to a block. Shared-memory is used for the exchange of inner-boundaries between neighboring blocks (threads) on the same node, while WinSockets are employed for those on different nodes. Secondly, the parallelization is extended to UNIX operating system. MPI is applied for all the message passing between different processors, including those on the same node. Thirdly, Pthreads (POSIX threads), a standardized application interface for threads, are adopted to take the advantage of the shared-memory feature of the SMP nodes, while MPI is only applied for the message passing between processors on different nodes. In all the strategies, a static load-balancing method is employed for equitable distribution of computational work to specified nodes. The parameters of the present code is studied in detail to facilitate the explanation of the speedup results. Two examples are provided to show the speedup and load balancing of the parallel calculation. Detailed comparison is made to evaluate the efficiency of different strategies. 相似文献

17.

p—HPF支持多范例并行计算的并行编译技术 总被引：1，自引：1，他引：0

胡长军余华山姜伟陆爱胜许卓群《计算机学报》2001,24(7):685-693

多范例并行是大规模并行应用系统的本质特征,实现p－HPF对多范例并行计算的编译支持不仅可以弥补数据并行示例本身的一些缺点,而且可以提高并行应用系统的效率,文中在论述cluster环境下Global,Local,Serial三种典型并行计算模型的基础上,给出了实现p－HPF对三种模型的典型代表F77＋MPI,ScaLAPACK调用的并行编译技术,包括参数重分布技术、存储转换技术、全局与局部信息交换技术以及局部数组参数的上下界处理技术等,给出了调用实例并分析了实现技术的正确性和有效性。相似文献

18.

访存密集型应用在SMP机群系统中的性能分析

顾丽红吴少刚《小型微型计算机系统》2006,27(7):1258-1261

SMP机群系统因其良好的性价比、卓越的可扩展性与可用性，逐渐成为当前高性能计算机领域的主流结构．这种结点内共享存储、结点间消息传递的两级混合结构是目前并行计算研究的热点,在单个SMP结点中，总线和内存带宽是否满足CPU和I／O的需求对于访存密集型应用的性能影响很大。本文针对访存密集型应用的特点测试分析了在SMP机群中访存冲突对系统性能的影响，结果表明我们的SMP结点存在性能瓶颈，这种量化分析对于设计大规模的基于SMP的机群系统有很好的指导意义．相似文献

19.

A compiler optimization algorithm for shared-memory multiprocessors

McKinley K.S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(8):769-787

This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel, Fortran programs operating over dense matrices. The programs initially were hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines 相似文献

20.

ParC—An Extension of C for Shared Memory Parallel Processing

YOSI BEN-ASHER DROR G. FEITELSON LARRY RUDOLPH 《Software》1996,26(5):581-612

ParC is an extension of the C programming language with block-oriented parallel constructs that allow the programmer to express fine-grain parallelism in a shared-memory model. It is suitable for the expression of parallel shared-memory algorithms, and also conducive for the parallelization of sequential C programs. In addition, performance enhancing transformations can be applied within the language, without resorting to low-level programming. The language includes closed constructs to create parallelism, as well as instructions to cause the termination of parallel activities and to enforce synchronization. The parallel constructs are used to define the scope of shared variables, and also to delimit the sets of activities that are influenced by termination or synchronization instructions. The semantics of parallelism are discussed, especially relating to the discrepancy between the limited number of physical processors and the potentially much larger number of parallel activities in a program. 相似文献