Similar Literature
20 similar documents were retrieved.
1.
Exploiting the parallelism available in loops   (Total citations: 1; self-citations: 0; citations by others: 1)
Lilja, D.J. Computer, 1994, 27(2): 13-26
Because a loop's body often executes many times, loops provide a rich opportunity for exploiting parallelism. This article explains scheduling techniques and compares results on different architectures. Since parallel architectures differ in synchronization overhead, instruction scheduling constraints, memory latencies, and implementation details, determining the best approach for exploiting parallelism can be difficult. To indicate their performance potential, this article surveys several architectures and compilation techniques using a common notation and consistent terminology. First we develop the critical dependence ratio to determine a loop's maximum possible parallelism, given infinite hardware. Then we look at specific architectures and techniques. Loops can provide a large portion of the parallelism available in an application program, since the iterations of a loop may be executed many times. To exploit this parallelism, however, one must look beyond a single basic block or a single iteration for independent operations. The choice of technique depends on the underlying architecture of the parallel machine and the characteristics of each individual loop.
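As a hedged illustration of why cross-iteration analysis matters (not taken from the article itself), the two C loops below contrast fully independent iterations with a loop-carried dependence that limits the parallelism a scheduler can extract.

```c
/* Illustrative only: independent vs. dependent loop iterations. */
void scale(float *a, const float *b, int n) {
    /* DOALL-style loop: iteration i touches only a[i] and b[i],
       so all n iterations could in principle run in parallel. */
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * b[i];
}

void prefix_sum(float *a, int n) {
    /* Loop-carried dependence: iteration i reads a[i-1] written by
       iteration i-1, forcing largely serial execution. */
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```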

2.
Clusters of SMPs are hybrid-parallel architectures that combine the main concepts of distributed-memory and shared-memory parallel machines. Although SMP clusters are widely used in the high performance computing community, there exists no single programming paradigm that allows exploiting the hierarchical structure of these machines. Most parallel applications deployed on SMP clusters are based on MPI, the standard API for distributed-memory parallel programming, and thus may miss a number of optimization opportunities offered by the shared memory available within SMP nodes. In this paper we present extensions to the data parallel programming language HPF and associated compilation techniques for optimizing HPF programs on clusters of SMPs. The proposed extensions enable programmers to control key aspects of distributed-memory and shared-memory parallelization at a high level of abstraction. Based on these language extensions, a compiler can adopt a hybrid parallelization strategy which closely reflects the hierarchical structure of SMP clusters by automatically exploiting shared-memory parallelism based on OpenMP within cluster nodes and distributed-memory parallelism utilizing MPI across nodes. We describe the implementation of these features in the VFC compiler and present experimental results which show the effectiveness of these techniques.
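The hybrid strategy described here (MPI across cluster nodes, OpenMP within a node) can be sketched directly in C. The fragment below is a minimal illustration of that execution model under generic assumptions, not the code the VFC compiler actually generates.

```c
/* Minimal hybrid MPI + OpenMP sketch: distributed-memory parallelism
   across nodes, shared-memory parallelism within each node. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each MPI process owns a block of the global iteration space ... */
    const long N = 1000000;
    long lo = rank * N / nprocs, hi = (rank + 1) * N / nprocs;

    double local = 0.0;
    /* ... and exploits shared-memory parallelism inside its node. */
    #pragma omp parallel for reduction(+:local)
    for (long i = lo; i < hi; i++)
        local += (double)i;

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum = %.0f\n", global);

    MPI_Finalize();
    return 0;
}
```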

3.
Lee, B.; Hurson, A.R. Computer, 1994, 27(8): 27-39
Contrary to initial expectations, implementing dataflow computers has presented a monumental challenge. Now, however, multithreading offers a viable alternative for building hybrid architectures that exploit parallelism. The eventual success of dataflow computers will depend on their programmability. Traditionally, they've been programmed in languages such as Id and SISAL (Streams and Iterations in a Single Assignment Language) that use functional semantics. These languages reveal high levels of concurrency and translate onto dataflow machines and conventional parallel machines via the Threaded Abstract Machine (TAM). However, because their syntax and semantics differ from those of imperative counterparts such as Fortran and C, they have been slow to gain acceptance in the programming community. An alternative is to explore the use of established imperative languages to program dataflow machines. The difficulty, however, lies in analyzing data dependencies and extracting parallelism from source code that contains side effects. Therefore, more research is still needed to develop compilers for conventional languages that can produce parallel code comparable to that of parallel functional languages.
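A small, hypothetical C fragment can illustrate the dependence-analysis difficulty mentioned above: side effects and possible pointer aliasing force a conservative compiler to assume a dependence and serialize the loop, whereas a side-effect-free functional formulation would expose the independence directly.

```c
/* Hypothetical example: possible aliasing defeats parallelization.
   If 'dst' and 'src' may overlap, iteration i can depend on an
   earlier iteration's store, so the compiler must be conservative. */
void copy_shift(float *dst, float *src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] + 1.0f;   /* parallel only if dst and src do not alias */
}

/* With C99 'restrict' the programmer asserts no aliasing, restoring
   the independence a side-effect-free version would have by construction. */
void copy_shift_restrict(float *restrict dst, const float *restrict src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] + 1.0f;
}
```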

4.
Global software pipelining is a complex but efficient compilation technique for exploiting instruction-level parallelism in loops with branches. This paper presents a novel global software pipelining technique, called Trace Software Pipelining, targeted at instruction-level parallel processors such as Very Long Instruction Word (VLIW) and superscalar machines. Trace software pipelining applies a global code scheduling technique to compact the original loop body. The resulting loop is called a trace software pipelined (TSP) code. The trace software pipelined code can be executed directly with special architectural support or can be transformed into a globally software pipelined loop for current VLIW and superscalar processors. Thus, exploiting parallelism across all iterations of a loop can be accomplished by compacting the original loop body with any global code scheduling technique. This makes our new technique very promising for practical compilers. Finally, we also present preliminary experimental results to support our new approach.
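As a hedged, source-level sketch (independent of the paper's trace-based algorithm), software pipelining of a two-stage loop body looks roughly like this: a prologue starts iteration 0, the kernel overlaps stage 2 of iteration i with stage 1 of iteration i+1, and an epilogue drains the pipeline.

```c
/* Original loop: two dependent stages per iteration. */
void orig(float *b, const float *a, const float *c, int n) {
    for (int i = 0; i < n; i++) {
        float t = a[i] * 3.0f;   /* stage 1 */
        b[i] = t + c[i];         /* stage 2 */
    }
}

/* Software-pipelined version (assumes n >= 1): stage 2 of iteration i
   overlaps stage 1 of iteration i+1, exposing parallelism to the scheduler. */
void pipelined(float *b, const float *a, const float *c, int n) {
    float t = a[0] * 3.0f;             /* prologue */
    for (int i = 0; i < n - 1; i++) {
        b[i] = t + c[i];               /* kernel: stage 2, iteration i   */
        t = a[i + 1] * 3.0f;           /*         stage 1, iteration i+1 */
    }
    b[n - 1] = t + c[n - 1];           /* epilogue */
}
```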

5.
Discovering and exploiting instruction-level parallelism in code will be key to future increases in microprocessor performance. What technical challenges must compiler writers meet to better use ILP? Instruction-level parallelism allows a sequence of instructions derived from a sequential program to be parallelized for execution on multiple pipelined functional units. If industry acceptance is a measure of importance, ILP has blossomed. It now profoundly influences the design of almost all leading-edge microprocessors and their compilers. Yet the development of ILP is far from complete, as research continues to find better ways to use more hardware parallelism over a broader class of applications.

6.
It is relatively clear how to map regular, repetitive, or grid-oriented computations onto SIMD architectures. It is not so clear, however, how to do this for irregular computations, even though there may be significant amounts of intrinsic parallelism in branch-free code. We study compilation techniques for this type of code when targeted to SIMD computers and illustrate their use on a simple model architecture. In this paper, we present one of the compilation techniques we have developed for SIMD computers, global register allocation, and demonstrate that it can effectively allocate registers for parallelizing irregular computations in branch-free code. This technique is an extension and modification of the register allocation via graph coloring approach used by sequential compilers. Our performance results validate our method.
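The abstract does not give the algorithm itself, but the sequential technique the authors extend, register allocation via graph coloring, can be sketched as a greedy coloring over an interference graph. The sketch below is illustrative; the sizes, graph, and names are made up.

```c
/* Minimal sketch of register allocation via graph coloring:
   greedy color assignment over an interference graph. Values whose
   live ranges overlap interfere and must get different registers. */
#include <stdbool.h>
#include <stdio.h>

#define NVALS 5
#define NREGS 3

int main(void) {
    /* interfere[i][j] = true if values i and j are live at the same time */
    bool interfere[NVALS][NVALS] = {
        {0,1,1,0,0},
        {1,0,1,0,0},
        {1,1,0,1,0},
        {0,0,1,0,1},
        {0,0,0,1,0},
    };
    int reg[NVALS];

    for (int v = 0; v < NVALS; v++) {
        bool used[NREGS] = {false};
        for (int u = 0; u < v; u++)            /* registers taken by already-colored neighbors */
            if (interfere[v][u] && reg[u] >= 0) used[reg[u]] = true;
        reg[v] = -1;                            /* -1 means "spill to memory" */
        for (int r = 0; r < NREGS; r++)
            if (!used[r]) { reg[v] = r; break; }
        if (reg[v] < 0) printf("value %d -> spill\n", v);
        else            printf("value %d -> r%d\n", v, reg[v]);
    }
    return 0;
}
```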

7.
Vienna Fortran, High Performance Fortran (HPF), and other data parallel languages have been introduced to allow the programming of massively parallel distributed-memory machines (DMMPs) at a relatively high level of abstraction, based on the SPMD paradigm. Their main features include directives to express the distribution of data and computations across the processors of a machine. In this paper, we use Vienna Fortran as a general framework for dealing with sparse data structures. We describe new methods for the representation and distribution of such data on DMMPs, and propose simple language features that permit the user to characterize a matrix as “sparse” and specify the associated representation. Together with the data distribution for the matrix, this enables the compiler and runtime system to translate sequential sparse code into explicitly parallel message-passing code. We develop new compilation and runtime techniques, which focus on achieving storage economy and reducing communication overhead in the target program. The overall result is a powerful mechanism for dealing efficiently with sparse matrices in data parallel languages and their compilers for DMMPs.
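As a hedged illustration (in C rather than the Vienna Fortran notation), the kind of sparse representation and distribution discussed here can be pictured as a compressed-sparse-row matrix whose row blocks are owned by different processors; the structure and function names below are assumptions for the example.

```c
/* Sketch: CSR (compressed sparse row) storage with a block-row
   distribution. The processor owning rows [lo, hi) performs the
   local part of y = A * x; a real compiler/runtime would also
   generate the communication needed for non-local x elements. */
typedef struct {
    int           nrows;
    const int    *rowptr;  /* nrows + 1 entries             */
    const int    *colind;  /* column index of each nonzero  */
    const double *val;     /* value of each nonzero         */
} CSRMatrix;

void spmv_local(const CSRMatrix *A, const double *x, double *y,
                int lo, int hi) {          /* rows owned by this processor */
    for (int i = lo; i < hi; i++) {
        double sum = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            sum += A->val[k] * x[A->colind[k]];
        y[i] = sum;
    }
}
```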

8.
Parallel Computing, 2014, 40(3-4): 1-33
There has been a renewed interest in dataflow computing models in recent years of technology scaling. The potential to exploit huge parallelism while requiring low power, simpler circuits, and less silicon area is the main characteristic of a dataflow model. The growing trend of housing large numbers of functional units in a single chip, the use of local clocks, reduced energy consumption, and the avoidance of global wires are the main reasons behind the resurgence of dataflow models. To program a dataflow machine, new architectures favor imperative languages rather than functional dataflow languages or parallel languages, because this is the most practical way to make the new architectures popular among the general community. Although scientists have been working for several decades on how imperative languages can be used efficiently in dataflow models, there has been no systematic review of that work. Existing reviews of the dataflow paradigm mainly focus on the architectures. Although a few papers review programming languages for dataflow architectures, their discussions are limited to dataflow languages and visual programming languages, which are fundamentally different from imperative languages. In this paper, we conduct a systematic review of the work that attempts to provide a way to use imperative languages in any type of dataflow architecture. Our survey of compilers and related architectures covers aspects such as translation mechanisms for program constructs, their optimization techniques, memory ordering methods, program allocation and scheduling, and special architectural features. We also present some observations and future research directions obtained by exploring the literature.

9.
Programming multiprocessor parallel architectures is a complex task. This paper describes a block-structured scientific programming language, BLAZE, designed to simplify this task. BLAZE contains array arithmetic, ‘forall’ loops, and APL-style accumulation operators, which allow natural expression of fine-grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse-grained parallelism using machine-specific program restructuring. Thus BLAZE should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow.

A central goal in the design of BLAZE is portability across a broad range of parallel architectures. The multiple levels of parallelism present in BLAZE code, in principle, allow a compiler to extract the types of parallelism appropriate for a given architecture, while neglecting the remainder. This paper describes the features of BLAZE and shows how the language would be used in typical scientific programming.
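BLAZE syntax is not shown in the abstract; as a rough analogue only, the fine-grained parallelism of a ‘forall’ loop and an accumulation operator can be expressed in C with OpenMP as follows (the function and variable names are illustrative assumptions).

```c
/* Rough C/OpenMP analogue of a data-parallel 'forall' and an
   APL-style accumulation (reduction); illustrative, not BLAZE syntax. */
#include <omp.h>

double saxpy_and_sum(double *y, const double *x, double a, int n) {
    /* forall i: y[i] = a*x[i] + y[i]  -- iterations are independent */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    /* accumulation: sum-reduce y, the kind of operator a compiler can
       parallelize because + is associative and commutative */
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += y[i];
    return total;
}
```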


10.
ILDJIT, a new-generation dynamic compiler and virtual machine designed to support parallel compilation, is introduced here. Our dynamic compiler targets the increasingly popular ECMA-335 specification. The goal of this project is twofold: on one hand, it aims at exploiting the parallelism exposed by multi-core architectures to hide dynamic compilation latencies by pipelining compilation and execution tasks; on the other hand, it provides a flexible, modular and adaptive framework for dynamic code optimization. The ILDJIT organization and the compiler design choices are presented and discussed, highlighting how adaptability and extensibility can be achieved. Thanks to the compilation latency masking effect of the pipeline organization, our dynamic compiler is able to mask most of the compilation delay when the underlying hardware exposes sufficient parallelism. Even when running on a single core, the ILDJIT adaptive optimization framework manages to speed up the computation with respect to other open-source implementations of ECMA-335. Copyright © 2010 John Wiley & Sons, Ltd.

11.
The lack of high-level languages and good compilers for parallel machines hinders their widespread acceptance and use. Programmers must address issues such as process decomposition, synchronization, and load balancing. We have developed a parallelizing compiler that, given a sequential program and a memory layout of its data, performs process decomposition while balancing parallelism against locality of reference. A process decomposition is obtained by specializing the program for each processor to the data that resides on that processor. If this analysis fails, the compiler falls back to a simple but inefficient scheme called run-time resolution. Each process's role in the computation is determined by examining the data required for execution at run-time. Thus, our approach to process decomposition is data-driven rather than program-driven. We discuss several message optimizations that address the issues of overhead and synchronization in message transmission. Accumulation reorganizes the computation of a commutative and associative operator to reduce message traffic. Pipelining sends a value as close to its computation as possible to increase parallelism. Vectorization of messages combines messages with the same source and the same destination to reduce overhead. Our results from experiments in parallelizing SIMPLE, a large hydrodynamics benchmark, for the Intel iPSC/2 show a speedup within 60% to 70% of handwritten code.
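The message optimizations named above can be pictured with standard MPI calls as a hedged analogy (not the authors' iPSC/2 code): accumulation replaces many point-to-point messages carrying partial results of an associative, commutative operator with one reduction, and vectorization packs several values bound for the same destination into one message.

```c
/* Sketch of two message optimizations using standard MPI calls:
   - accumulation: combine partial sums with one MPI_Reduce instead
     of many individual sends;
   - vectorization: send a whole buffer to one destination in a
     single message instead of element-by-element. */
#include <mpi.h>

void accumulate(double local_sum, double *global_sum) {
    MPI_Reduce(&local_sum, global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
}

void send_block(const double *buf, int count, int dest) {
    /* one message of 'count' doubles, rather than 'count' messages */
    MPI_Send(buf, count, MPI_DOUBLE, dest, /*tag=*/0, MPI_COMM_WORLD);
}
```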

12.
The widespread use of parallel machines has been hampered by the difficulty of mapping applications onto them effectively. The difficulty arises because current programming languages require the programmer to specify a problem to be solved at a low level of abstraction in an imperative form. Thus the programmer must immediately encode an architecture-specific algorithm detailing every communication and calculation. This process is prone to error and complicates the reuse of software.

An alternative approach is to specify the problem to be solved at a high-level in a functional language. Meaning-preserving program transformations can then be used to derive a parallel algorithm. Such algorithms can be run on parallel graph-reduction or dataflow machines which automatically exploit the implicit parallelism in a functional language program. Such automatic decomposition techniques, however, are not yet capable of fully yielding the extra performance offered by the parallel hardware.

We show how, by including an architecture specification with the problem specification, and extending the amount of transformation performed, it is possible to produce functional language code that explicitly expresses the calculations and communications to be performed by the processors. This simplifies compilation, yields faster programs and enables parallel software to be developed for a wide variety of parallel computer architectures.

A goal-seeking transformation methodology has been developed which enables a high-level functional specification of the problem and a high-level functional abstraction of the target computer architecture to be systematically manipulated to produce an efficient parallel algorithm tailored to the target architecture. As the transformations start from very high-level specifications, the discovery of new algorithms is facilitated.

A case study is used to demonstrate the effectiveness of the technique. We show how a high-level specification for sort can be transformed with a pipeline architecture specification to give a mergesort, and how the same specification with a dynamic-message-passing architecture specification can be transformed into a novel parallel quicksort.


13.
Dynamic programming (DP) is a popular technique used to solve combinatorial search and optimization problems. This paper focuses on one type of DP, called nonserial polyadic dynamic programming (NPDP). Owing to the nonuniform data dependencies of NPDP, it is difficult to exploit either parallelism or locality. Worse still, emerging multi/many-core architectures with small on-chip memory make these issues more challenging. In this paper, we address the challenges of exploiting the fine-grained parallelism and locality of NPDP on multicore architectures. We describe a latency-tolerant model and a percolation technique for programming on multicore architectures. At the algorithmic level, both parallelism and locality benefit from a specific data dependence transformation of NPDP. Next, we propose a parallel pipelining algorithm that decomposes computation operators and percolates data through the memory hierarchy to create just-in-time locality. To predict the execution time, we formulate an analytical performance model of the parallel algorithm. The parallel pipelining algorithm achieves not only high scalability on the 160-core IBM Cyclops64, but portable performance as well, across the 8-core Sun Niagara and the quad-core Intel Clovertown.
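The abstract does not spell out the recurrence. A commonly cited nonserial polyadic form (matrix-chain or optimal-binary-search-tree style DP) looks like the hedged C sketch below; its triangular, non-uniform dependences, in which each cell reads a whole row segment and column segment, are what make tiling and pipelining hard. The cost function and sizes are placeholders.

```c
/* Hedged sketch of a nonserial polyadic DP recurrence:
   c[i][j] = w(i, j) + min over i < k < j of (c[i][k] + c[k][j]). */
#include <float.h>

#define N 64
static double c[N][N];

static double w(int i, int j) { return (double)(j - i); }  /* placeholder cost */

void npdp(void) {
    for (int i = 0; i < N - 1; i++) c[i][i + 1] = 0.0;      /* base case */
    for (int len = 2; len < N; len++) {                     /* interval length */
        for (int i = 0; i + len < N; i++) {
            int j = i + len;
            double best = DBL_MAX;
            for (int k = i + 1; k < j; k++) {               /* non-uniform dependences */
                double cand = c[i][k] + c[k][j];
                if (cand < best) best = cand;
            }
            c[i][j] = w(i, j) + best;
        }
    }
}
```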

14.
In recent years, high performance computing has undergone a profound transformation. In this paper, we review the state of parallel computation, with detailed discussion of current and future research issues in the areas of parallel architectures and compilation methods, instruction-level parallelism, and optimization methods to improve the performance of the memory hierarchy.

15.
An Integrated Optimization Method in Parallelizing Compilation   (Total citations: 1; self-citations: 0; citations by others: 1)
孙彤, 李三立, 李晓明. 软件学报, 1996, 7(12): 705-713
This paper presents a parallelizing compilation method for distributed-memory multicomputer systems. Tailored to the characteristics of distributed-memory parallel systems, the basic optimization strategies adopted are: trading off parallelism against data reference locality, and reducing and hiding communication overhead. By analyzing the communication behavior induced by affine-function-based program decomposition, a parallelism exploitation method suited to the special requirements of distributed-memory architectures is obtained. To minimize the total volume of communicated data while preserving parallelism, a globally optimized program decomposition method based on solving systems of homogeneous linear equations is proposed. To better organize data communication and improve the efficiency of node code, a more practical communication optimization and node code generation method using systems of linear inequalities as a tool is also proposed.
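As a hedged reconstruction of the kind of condition involved (the paper's exact formulation is not reproduced here, and the symbols below are assumed notation), a communication-free decomposition based on affine mappings can be stated as follows: if iteration vector \(\vec{i}\) is mapped to a processor by \(\pi_c(\vec{i}) = C\vec{i} + \vec{c}\), array element \(\vec{d}\) by \(\pi_d(\vec{d}) = D\vec{d} + \vec{e}\), and a reference accesses element \(F\vec{i} + \vec{f}\), then no communication is needed for that reference when the two mappings agree on every iteration.

```latex
% Hedged sketch of a communication-free alignment condition for affine
% decompositions (illustrative notation, not the paper's own symbols).
\[
  \pi_c(\vec{i}) = \pi_d\bigl(F\vec{i} + \vec{f}\bigr)
  \quad \text{for all } \vec{i}
  \;\Longleftrightarrow\;
  C = D F \ \text{ and } \ \vec{c} = D\vec{f} + \vec{e}.
\]
% Imposing this for every array reference yields a homogeneous linear
% system in the unknown mapping coefficients; its nontrivial solutions
% give decompositions that preserve parallelism with zero communication.
```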

16.
陶小涵, 朱雨, 庞建民, 赵捷, 徐金龙. 软件学报, 2023, 34(4): 1570-1593
Heterogeneous architectures have gradually become the mainstream in high-performance computing, but compared with homogeneous multicore architectures, their hardware structures and memory hierarchies are more complex and their programs are harder to write. Advanced optimizing compilers can help developers produce more efficient code and reduce programming complexity. The polyhedral compilation model abstracts a program into a polyhedral representation through abstract analysis, combines multiple loop transformations with hardware mapping, and generates code for a specific target architecture. This paper designs and implements an automatic parallel code generation system for the domestic Sunway (申威) heterogeneous architecture, using a source-to-source compilation mode built on the polyhedral model. The system maps the program's computation onto the hardware according to the characteristics of the Sunway heterogeneous architecture, and automatically manages data transfers and memory space. Experiments were conducted on the linear algebra benchmarks of the PolyBench suite. The results show that the heterogeneous parallel code generated by the system runs correctly on the Sunway heterogeneous platform and effectively exploits the platform's performance, achieving an average speedup of 539.16x with 64 threads on the Sunway heterogeneous platform.
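For readers unfamiliar with the polyhedral model mentioned above, here is a hedged example of its core abstraction (illustrative only, not the system's actual internal representation): the iteration domain of a loop nest is represented as an integer polyhedron, and loop transformations become affine schedules on that domain.

```latex
% Illustrative polyhedral abstraction (not the paper's internal form).
% The triangular loop nest
%     for (i = 0; i < N; i++)
%       for (j = 0; j <= i; j++)
%         S(i, j);
% has the iteration domain
\[
  \mathcal{D}_S \;=\; \{\, (i, j) \in \mathbb{Z}^2 \mid 0 \le i < N,\; 0 \le j \le i \,\},
\]
% and a transformation such as loop interchange is the affine schedule
\[
  \theta_S(i, j) \;=\; (j,\; i),
\]
% under which code generation enumerates the same integer points in a new
% lexicographic order, letting loop transformations and hardware mapping
% be composed before target-specific code is emitted.
```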

17.
18.
Gabriel, a second-generation digital signal-processing (DSP) design environment, is described. The Gabriel experimental signal-processing software performs non-real-time algorithm simulations and code synthesis for real-time hardware based on programmable DSPs. Gabriel eases code development for architectures that are not easy targets for conventional compilers. The model of compilation used by Gabriel is discussed at length. Modification of the model to run in real time, target architectures, and scheduling are examined.

19.
Simultaneous Multithreading (SMT) is a processor architectural technique that promises to significantly improve the utilization and performance of modern wide-issue superscalar processors. An SMT processor is capable of issuing multiple instructions from multiple threads to a processor's functional units each cycle. Unlike shared-memory multiprocessors, SMT provides and benefits from fine-grained sharing of processor and memory system resources; unlike current uniprocessors, SMT exposes and benefits from inter-thread instruction-level parallelism when hiding long-latency operations. Compiler optimizations are often driven by specific assumptions about the underlying architecture and implementation of the target machine, particularly for parallel processors. For example, when targeting shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in order to decrease high-cost inter-processor communication. Therefore, optimizations that are appropriate for these conventional machines may be inappropriate for SMT, which can benefit from fine-grained resource sharing within the processor. This paper reexamines several compiler optimizations in the context of simultaneous multithreading. We revisit three optimizations in this light: loop-iteration scheduling, software speculative execution, and loop tiling. Our results show that all three optimizations should be applied differently in the context of SMT architectures: threads should be parallelized with a cyclic, rather than a blocked, algorithm; non-loop programs should not be software speculated; and compilers no longer need to be concerned about precisely sizing tiles to match cache sizes. By following these new guidelines, compilers can generate code that improves the performance of programs executing on SMT machines.
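The first guideline (cyclic rather than blocked iteration assignment) is easy to picture. The C sketch below contrasts the two ways of assigning loop iterations to threads; it is illustrative only and is not taken from the paper's benchmarks.

```c
/* Blocked vs. cyclic assignment of loop iterations to 'nthreads' threads.
   Blocked gives each thread one contiguous chunk; cyclic interleaves
   iterations across threads, which on an SMT processor lets co-resident
   threads share cache lines and functional units more evenly. */
void blocked(int tid, int nthreads, int n, void (*body)(int)) {
    int chunk = (n + nthreads - 1) / nthreads;
    int lo = tid * chunk;
    int hi = lo + chunk < n ? lo + chunk : n;
    for (int i = lo; i < hi; i++)
        body(i);                      /* thread tid: one contiguous block */
}

void cyclic(int tid, int nthreads, int n, void (*body)(int)) {
    for (int i = tid; i < n; i += nthreads)
        body(i);                      /* thread tid: every nthreads-th iteration */
}
```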

20.
Compiling scientific code using partial evaluation   (Total citations: 1; self-citations: 0; citations by others: 1)
Berlin, A.; Weise, D. Computer, 1990, 23(12): 25-37
The partial evaluation approach, which transforms a high-level program into a low-level program that is specialized for a particular application, exposing the parallelism inherent in the underlying numerical computation, is discussed. A prototype compiler that uses partial evaluation is described. Experiments with the compiler have shown that for an important class of numerical programs, partial evaluation can provide marked performance improvements: speedups over conventionally compiled code that range from seven times faster to 91 times faster have been measured. By coupling partial evaluation with parallel scheduling techniques, the low-level parallelism inherent in a computation can be exploited on heavily pipelined or parallel architectures. The approach has been demonstrated by applying a parallel scheduler to a partially evaluated program that simulates the motion of a nine-body solar system.
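A hedged, textbook-style illustration of partial evaluation (not the prototype compiler described above): specializing a general routine to a fixed parameter evaluates away the control overhead, leaving straight-line code whose independent subexpressions a parallel scheduler can then exploit.

```c
/* Textbook-style partial evaluation example (illustrative only).
   General code, parameterized by the exponent n: */
double power(double x, int n) {
    double r = 1.0;
    for (int i = 0; i < n; i++)
        r *= x;
    return r;
}

/* Residual program specialized for n == 4: the loop and the test on n
   are evaluated away at specialization time, leaving straight-line
   arithmetic that can be scheduled on pipelined or parallel hardware. */
double power4(double x) {
    double x2 = x * x;
    return x2 * x2;
}
```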
