期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Partitioning and mapping of nested loops for linear array multicomputers

Jang-Ping Sheu Tzung-Shi Chen 《The Journal of supercomputing》1995,9(1-2):183-202

In distributed-memory multicomputers, minimizing interprocessor communication is the key to the efficient execution of parallel programs. In order to reduce the amount of communication overhead, parallel programs on multicomputers must be carefully scheduled by parallelizing compilers. This paper proposes some compilation techniques for partitioning and mapping nested loops with constant data dependences onto linear array multicomputers. First, a systematic partition strategy is proposed to project ann-dimensional computational structure, representing ann-nested loop, onto a line to form a one-dimensional projected structure with low communication overhead. Then, a mapping algorithm is proposed for mapping the partitioned loops onto linear arrays in a way that balances the workload and minimizes the communication cost among processors. Finally, parallel execution codes can be automatically generated for such linear array multicomputers. 相似文献

2.

Skewed Data Partition and Alignment Techniques for Compiling Programs on Distributed Memory Multicomputers 总被引：3，自引：0，他引：3

Chen Tzung-Shi Chang Chih-Yung 《The Journal of supercomputing》2002,21(2):191-211

Minimizing data communication over processors is the key to compile programs for distributed memory multicomputers. In this paper, we propose new data partition and alignment techniques for partitioning and aligning data arrays with a program in a way of minimizing communication over processors. We use skewed alignment instead of the dimension-ordered alignment techniques to align data arrays. By developing the skewed scheme, we can solve more complex programs with minimized data communication than that of the dimension-ordered scheme. Finally, we compare the proposed scheme with the dimension-ordered alignment one by experimental results. The experimental results show that our proposed scheme has more opportunities to align data arrays such that data communications over processors can be minimized. 相似文献

3.

Runtime and language support for compiling adaptive irregular programs on distributed-memory machines

Yuan-Shin Hwang Bongki Moon Shamik D. Sharma Ravi Ponnusamy Raja Das Joel H. Saltz 《Software》1995,25(6):597-621

In many scientific applications, arrays containing data are indirectly indexed through indirection arrays. Such scientific applications are called irregular programs and are a distinct class of applications that require special techniques for parallelization. This paper presents a library called CHAOS, which helps users implement irregular programs on distributed-memory message-passing machines, such as the Paragon, Delta, CM-5 and SP-1. The CHAOS library provides efficient runtime primitives for distributing data and computation over processors; it supports efficient index translation mechanisms and provides users high-level mechanisms for optimizing communication. CHAOS subsumes the previous PARTI library and supports a larger class of applications. In particular, it provides efficient support for parallelization of adaptive irregular programs where indirection arrays are modified during the course of computation. To demonstrate the efficacy of CHAOS, two challenging real-life adaptive applications were parallelized using CHAOS primitives: a molecular dynamics code, CHARMM, and a particle-in-cell code, DSMC. Besides providing runtime support to users, CHAOS can also be used by compilers to automatically parallelize irregular applications. This paper demonstrates how CHAOS can be effectively used in such a framework. By embedding CHAOS primitives in the Syracuse Fortran 90D/HPF compiler, kernels taken from the CHARMM and DSMC codes have been automatically parallelized. 相似文献

4.

An Interleaving Transformation for Parallelizing Reductions for Distributed-Memory Parallel Machines

Wu Jan-Jan 《The Journal of supercomputing》2000,15(3):321-339

Reduction operations frequently appear in algorithms. Due to their mathematical invariance properties (assuming that round-off errorscan be tolerated), it is reasonable to ignore ordering constraints on the computation of reductions in order to take advantage of the computing power of parallel machines.One obvious and widely-used compilation approach for reductions is syntactic pattern recognition. Either the source language includes explicit reduction operators, or certain specific loops are recognized as equivalent to known reductions. Once such patterns are recognized, hand optimized code for the reductions are incorporated in the target program. The advantage of this approach is simplicity. However, it imposes restrictions on the reduction loops—no data dependence other than that caused by the reduction operation itself is allowed in the reduction loops.In this paper, we present a parallelizing technique, interleaving transformation, for distributed-memory parallel machines. This optimization exploits parallelism embodied in reduction loops through combination of data dependence analysis and region analysis. Data dependence analysis identifies the loop structures and the conditions that can trigger this optimization. Region analysis divides the iteration domain into a sequential region and an order-insensitive region. Parallelism is achieved by distributing the iterations in the order-insensitive region among multiple processors. We use a triangular solver as an example to illustrate the optimization. Experimental results on various distributed-memory parallel machines, including the Connection Machines CM-5, the nCUBE, the IBM SP-2, and a network of Sun Workstations are reported. 相似文献

5.

A Computation+Communication Load Balanced Loop Partitioning Method for Distributed Memory Systems

Santosh Pande Tareq Bali 《Journal of Parallel and Distributed Computing》1999,58(3):251

Due to a significant communication overhead of sending and receiving data, the loop partitioning approaches on distributed memory systems must guarantee not just the computation load balance but computation+communication load balance. The previous approaches in loop partitioning have achieved a communication-free, computation load balanced iteration space partitioning solution for a limited subset of DOALL loops. But a large category of DOALL loops inevitably result in communication and the trade-offs between computation and communication must be carefully analyzed for these loops in order to balance out the combined computation time and communication overheads. In this work, we describe a partitioning approach based on the above motivation for the general cases of DOALL loops. Our goal is to achieve a computation+communication load balanced partitioning through static data and iteration space distribution. Our approach first performs partitioning of iteration and data spaces of a loop nest by analyzing communication and parallelism; it then performs architecture-dependent analysis to adjust the granularity of partitions, load balance each partition with respect to total computation+communication, and then performs mapping of partitions onto the available number of processors. This multiphase partitioning method works as follows. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions for reducing a larger degree of communication by trading a lesser degree of parallelism. The partitioning is carried out in the iteration space of the loop by cyclically following a set of direction vectors such that the data references are maximally localized and reused, eliminating a larger communication volume than parallelism. We then perform data space partitioning based on a new larger partition owns rule to minimize the communication overhead for a compute intensive partition by localizing its references relatively more than a smaller noncompute intensive partition. A partition interaction graph is then constructed which is used by the architecture-dependent analysis phase to merge the partitions to achieve granularity adjustment, computation+communication load balance, and mapping on the actual number of available processors. Relevant theory and algorithms are developed along with a performance evaluation on the Cray T3D. 相似文献

6.

Communication-Free Alignment for Array References with Linear Subscripts in Three Loop Index Variables or Quadratic Subscripts

Chang Weng-Long Chu Chih-Ping Wu Jia-Hwa 《The Journal of supercomputing》2001,20(1):67-83

相似文献

7.

Symbolic Communication Set Generation for Irregular Parallel Applications

Guo Minyi Pan Yi Liu Zhen 《The Journal of supercomputing》2003,25(3):199-214

Communication set generation significantly influences the performance of parallel programs. However, studies seldom give attention to the problem of communication set generation for irregular applications. In this paper, we propose communication optimization techniques for the situation of irregular array references in nested loops. In our methods, the local array distribution schemes are determined so that the total number of communication messages is minimized. Then, we explain how to support communication set generation at compile-time by introducing some symbolic analysis techniques. In our symbolic analysis, symbolic solutions of a set of symbolic expression are obtained by using certain restrictions. We introduce symbolic analysis algorithms to obtain the solutions in terms of a set of equalities and inequalities. Finally, experimental results on a parallel computer CM-5 are presented to validate our approach. 相似文献

8.

构造并行化系统交互环境的若干关键技术 总被引：5，自引：0，他引：5

杨博王鼎兴郑纬民《软件学报》2001,12(5):698-705

交互式并行化系统通过提供友好的交互功能并引入用户知识来协助程序的并行化,是解决自动并行化能力不足的一条有效途径.描述了一个并行化系统交互环境TIPSIE(interactive en vironment of Tsinghua interactive parallelizing system),并就构造该环境的性能预测、增量编译和数据相关查询等关键技术进行了讨论.实验结果表明,这些技术能够有效地提高系统的并行化能力和效率. 相似文献

9.

Analyzing reference patterns in automatic data distribution tools

Eduard Ayguadé Jesús Labarta Jordi Garcia Mercè Gironès Mateo Valero 《International journal of parallel programming》1995,23(6):515-535

相似文献

10.

两种并行化机制的分析

曹琳杨学军《计算机研究与发展》1993,30(9):34-38

相似文献

11.

一种面向异构众核处理器的并行编译框架 总被引：1，自引：0，他引：1

李雁冰赵荣彩韩林赵捷徐金龙李颖颖《软件学报》2019,30(4):981-1001

异构众核处理器是面向高性能计算领域处理器发展的重要趋势,但其更为复杂的体系结构使得编程难的问题更加突出.针对这一问题,基于开源编译器Open64,提出了一种面向异构众核处理器的并行编译框架,将程序自动转换为异构并行程序.该框架主要包括4个模块：任务划分模块用来识别适合进行加速计算的程序段,实现了嵌套循环的多维并行识别方法;数据布局模块完成数据在主存和SPM之间的布局,实现了数组边界分析和指针范围分析;传输优化模块实现了数据传输合并、传输外提、打包传输、数组转置等多种数据传输优化方法;收益评估模块在构建代价模型的基础上实现了一种动静结合的收益评估方法.并且,基于SW26010处理器,对该编译框架进行了实现,测试结果表明,该编译框架能够实现一些程序以面向异构众核结构的并行变换,且获得较好的加速效果. 相似文献

12.

Segmented Alignment: An Enhanced Model to Align Data Parallel Programs of HPF

Hwang Gwan-Hwan Chen Cheng-Wei Lee Jenq Kuen Dz-Ching Ju Roy 《The Journal of supercomputing》2003,25(1):17-41

In this paper, we propose a new automatic data alignment model called segmented alignment. The conventional data alignment model, such as that used in High-Performance Fortran (HPF), aligns arrays with the whole index domain. The principle of our proposed segmented alignment is to allow alignment relations within delimited index domains. We first provide motivating examples to illustrate how code fragments of HPF with EOSHIFT or CSHIFT operations, or produced by synthesis operations can benefit from our enhanced alignment scheme. Second, we show that this new model can be implemented in HPF-like languages by adding WHEN and IN constructs to them. In addition, we show that the new proposed schemes for WHEN and IN constructs can be emulated using standard HPF syntax. Finally, we address issues related to automatic data alignment for the new proposed model, and present an algorithm to automatically align programs using our segmented alignment scheme. Since the optimal algorithm to do this is NP-hard, a practical heuristic is also given. Our experiments were performed on a DEC Alpha Farm with HPF environments. Our experiments confirm our theory that our proposed alignment scheme can significantly enhance not only the performance of HPF code fragments with EOSHIFT or CSHIFT operations, but also that of codes produced by synthesis operations. 相似文献

13.

一类非规则并行应用问题的通信集生成算法

胡长军李静王珏姚广利李永红丁良李建江《计算机学报》2008,31(1):120-126

非规则计算是大规模并行应用中普遍存在和影响效率的关键问题.在基于分布式内存的数据并行范例中,如何针对非规则数组引用,有效地生成本地内存访问序列和通信集,是并行编译生成SPMD结点程序所必须解决的重要问题.文中针对两重嵌套循环中,下一层循环边界是上一层循环变量的线性或非线性函数,数组下标是两层循环变量的非线性函数这样一类包含非规则数组引用的并行应用问题,提出了一种在编译时生成通信集的代数算法.并且针对cyclic(k)数据分布和线性对齐模板,借助整数格概念,给出了编译时全局地址和本地地址之间的转换方法.文中还给出了相应的经过通信优化的SPMD结点程序.最后通过实例验证了算法的正确性.该算法的意义在于避免了传统Inspector/Executor非规则计算模型中的Inspector阶段,从而节省了运行时Inspector阶段通过穷举下标生成通信集的巨大开销. 相似文献

14.

A Framework for Efficient Data Redistribution on Distributed Memory Multicomputers 总被引：2，自引：2，他引：0

Guo Minyi Nakata Ikuo 《The Journal of supercomputing》2001,20(3):243-265

相似文献

15.

Efficient Symbolic Analysis for Parallelizing Compilers and Performance Estimators 总被引：1，自引：1，他引：1

Fahringer Thomas 《The Journal of supercomputing》1998,12(3):227-252

Symbolic analysis is of paramount importance for parallelizing compilers and performance estimators to examine symbolic expressions with program unknowns such as machine and problem sizes and to solve queries based on systems of constraints (equalities and inequalities). This paper describes novel techniques for counting the number of solutions to a system of constraints, simplifying systems of constraints, computing lower and upper bounds of symbolic expressions, and determining the relationship between symbolic expressions. All techniques target wide classes of linear and non-linearsymbolic expressions and systems of constraints. Our techniques have been implemented and are used as part of a parallelizing compiler and a performance estimator to support analysis and optimization of parallel programs. Various examples and experiments demonstrate the effectiveness of our symbolic analysis techniques. 相似文献

16.

Compiler Support for Array Distribution on NUMA Shared Memory Multiprocessors

Abdelrahman Tarek S. Wong Thomas N. 《The Journal of supercomputing》1998,12(4):349-371

Management of program data to improve data locality and reduce false sharing is critical for scaling performance on NUMA shared memory multiprocessors. We use HPF-like data decomposition directives to partition and place arrays in data-parallel applications on Hector, a shared-memory NUMA multiprocessor. We describe a compiler system for automating the partitioning and placement of arrays. The compiler exploits Hectors shared memory architecture to efficiently implement distributed arrays. Experimental results from a prototype implementation demonstrate the effectiveness of these techniques. They also demonstrate the magnitude of the performance improvement attainable when our compiler-based data management schemes are used instead of operating system data management policies; performance improves by up to a factor of 5. 相似文献

17.

A Dynamic Diffusion Optimization Method for Irregular Finite Element Graph Partitioning

Yang Don-Lin Chung Yeh-Ching Chen Chih-Chang Liao Ching-Jung 《The Journal of supercomputing》2000,17(1):91-110

To efficiently execute a finite element application program on a distributed memory multicomputer, we need to distribute nodes of a finite element graph to processors of a distributed memory multicomputer as evenly as possible and minimize the communication cost of processors. This partitioning problem is known to be NP-complete. Therefore, many heuristics have been proposed to find satisfactory sub-optimal solutions. Based on these heuristics, many graph partitioners have been developed. Among them, Jostle, Metis, and Party are considered as the best graph partitioners available up-to-date. For these three graph partitioners, in order to minimize the total cut-edges, in general, they allow 3% to 5% load imbalance among processors. This is a tradeoff between the communication cost and the computation cost of the partitioning problem. In this paper, we propose an optimization method, the dynamic diffusion method (DDM), to balance the 3% to 5% load imbalance allowed by these three graph partitioners while minimizing the total cut-edges among partitioned modules. To evaluate the proposed method, we compare the performance of the dynamic diffusion method with the directed diffusion method and the multilevel diffusion method on an IBM SP2 parallel machine. Three 2D and two 3D irregular finite element graphs are used as test samples. For each test sample, 3% and 5% load imbalance situations are tested. From the experimental results, we have the following conclusions. (1) The dynamic diffusion method can improve the partition results of these three partitioners in terms of the total cut-edges and the execution time of a Laplace solver in most test cases while the directed diffusion method and the multilevel diffusion method may fail in many cases. (2) The optimization results of the dynamic diffusion method are better than those of the directed diffusion method and the multilevel diffusion method in terms of the total cut-edges and the execution time of a Laplace solver for most test cases. (3) The dynamic diffusion method can balance the load of processors for all test cases. 相似文献

18.

Optimization Parallelizing for Discrete Programming Problems 总被引：1，自引：1，他引：0

I. V. Sergienko V. P. Shilo V. A. Roshchin 《Cybernetics and Systems Analysis》2004,40(2):184-189

Parallelizing of calculations is considered when discrete optimization problems are solved. A version of parallelizing of copies of an algorithm is proposed and analyzed. The results of computational experiments are given. 相似文献

19.

全局部分重复计算划分 总被引：1，自引：0，他引：1

王轶然陈莉冯晓兵张兆庆《计算机研究与发展》2006,43(12):2158-2165

并行化编译器常常采用拥有者计算规则来进行计算划分,为了提高性能和可扩展性,后来引入了部分重复计算划分的概念.这是一种针对并行程序节点间局部性的重要优化方法.以前的部分重复计算划分局限于一个循环套的范围,因此新提出了全局部分重复计算划分的问题,给出一个简化的性能模型和一个基于整数线性规划的全局部分重复计算划分框架.实验结果表明,其结果显著优于局限于单个循环套的部分重复计算划分,比以前提出的启发式方法有更好的适应性. 相似文献

20.

PROPERTIES OF SIMULATED ANNEALING AND GENETIC ALGORITHMS FOR MAPPING DATA TO MULTICOMPUTERS

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(4):279-296

We experimentally analyze some properties of simulated annealing algorithms (SA) and genetic algorithms (GA) for mapping data to multicomputers. These properties include sensitiviiy to user parameters, fault tolerance capability, and applicability to different multicomputer topologies. Some user parameters are included in the objective function and are architecture- or problem-dependent parameters. The others are used in the GA and SA algorithms. The fault tolerance capability is demonstrated by mapping data to a multicomputer with some faulty processors. We assume a hypercube multicomputer architecture in most experiments. However, mapping to mesh, array, ring, tree, and star graph topologies is also demonstrated. The experimental results show that the GA and SA are insensitive to user parameters in wide ranges, completely fault tolerant, and unbiased towards particular multicomputer topologies. These properties of flexibility and general applicability, which are lacking in other heuristic algorithms, make the GA and SA attractive for automatic parallelization systems. 相似文献