期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Compilation techniques for parallel systems

《Parallel Computing》1999,25(13-14):1741-1783

Over the past two decades tremendous progress has been made in both the design of parallel architectures and the compilers needed for exploiting parallelism on such architectures. In this paper we summarize the advances in compilation techniques for uncovering and effectively exploiting parallelism at various levels of granularity. We begin by describing the program analysis techniques through which parallelism is detected and expressed in form of a program representation. Next compilation techniques for scheduling instruction level parallelism (ILP) are discussed along with the relationship between the nature of compiler support and type of processor architecture. Compilation techniques for exploiting loop and task level parallelism on shared-memory multiprocessors (SMPs) are summarized. Locality optimizations that must be used in conjunction with parallelization techniques for achieving high performance on machines with complex memory hierarchies are also discussed. Finally we provide an overview of compilation techniques for distributed memory machines that must perform partitioning of both code and data for parallel execution. Communication optimization and code generation issues that are unique to such compilers are also briefly discussed. 相似文献

2.

Compiling scientific code using partial evaluation 总被引：1，自引：0，他引：1

Berlin A. Weise D. 《Computer》1990,23(12):25-37

The partial evaluation approach, which transforms a high-level program into a low-level program that is specialized for a particular application, exposing the parallelism inherent in the underlying numerical computation, is discussed. A prototype compiler that uses partial evaluation is described. Experiments with the compiler have shown that for an important class of numerical programs, partial evaluation can provide marked performance improvements: speedups over conventionally compiled code that range from seven times faster to 91 times faster have been measured. By coupling partial evaluation with parallel scheduling techniques, the low-level parallelism inherent in a computation can be exploited on heavily pipelined or parallel architectures. The approach has been demonstrated by applying a parallel scheduler to a partially evaluated program that simulates the motion of a nine-body solar system 相似文献

3.

Loop Staggering,Loop Compacting:Restructuring Techniques for Thrashing Problem 总被引：1，自引：0，他引：1

下载免费PDF全文

Jin Guohua Yang Xuejun Chen Fujie 《计算机科学技术学报》1993,8(1):49-57

Parallel loops account for the greatest amount of parallelism in numerical programs.Executing nested loops in parallel wit low run-time overhead is thus very important for achieving high performance in paralleo processing systems.However,in parallel processing systems with caches of local memories in memory hierarchies,“thrashing problemmay” may arise when data move back and forth frequently between the caches or local memories in different processors.The techniques associated with parallel compiler to solve the problem are not completely developed.In this paper,we present two restructuring techniques called loopg staggering,loop staggering and compacting,with which we can not only eliminate the cache or local memory thrashing phemomena significantly,but also exploit the potential parallelism existing in outer serial loop.Loop staggering benefits the dynamic loop scheduling strategies,whereas loop staggering and compacting is good for static loop scheduling strategies,Our method especially benefits parallel programs,in which a parallel loop is enclosed by a serial loop and array elements are repeatedly used in the different iterations of the parallel loop. 相似文献

4.

一个新的多分支全局软件流水方法 总被引：1，自引：0，他引：1

下载免费PDF全文

汤志忠张赤红陈刚《软件学报》1996,7(1):16-24

在指令级并行性很高的体系结构中，为了得到比较好的并行优化效果，通常需要设置多个分支控制机构，本文提出一个新的支持多个分支操作并行执行的全局软件流水方法──ＧＰＭＢ．并用衡量全局软件流水方法性能的两个主要参数：时间开销和空间开销把我们的方法与其它几种全局软件流水方法进行了比较．模拟实验结果表明：ＧＰＭＢ方法的时间开销和空间开销都比较小，所需要的硬件支持也比较少．相似文献

5.

Exploiting Distributed-Memory and Shared-Memory Parallelism on Clusters of SMPs with Data Parallel Programs

Benkner Siegfried Sipkova Viera 《International journal of parallel programming》2003,31(1):3-19

Clusters of SMPs are hybrid-parallel architectures that combine the main concepts of distributed-memory and shared-memory parallel machines. Although SMP clusters are widely used in the high performance computing community, there exists no single programming paradigm that allows exploiting the hierarchical structure of these machines. Most parallel applications deployed on SMP clusters are based on MPI, the standard API for distributed-memory parallel programming, and thus may miss a number of optimization opportunities offered by the shared memory available within SMP nodes. In this paper we present extensions to the data parallel programming language HPF and associated compilation techniques for optimizing HPF programs on clusters of SMPs. The proposed extensions enable programmers to control key aspects of distributed-memory and shared-memory parallelization at a high-level of abstraction. Based on these language extensions, a compiler can adopt a hybrid parallelization strategy which closely reflects the hierarchical structure of SMP clusters by automatically exploiting shared-memory parallelism based on OpenMP within cluster nodes and distributed-memory parallelism utilizing MPI across nodes. We describe the implementation of these features in the VFC compiler and present experimental results which show the effectiveness of these techniques. 相似文献

6.

Carrying on the legacy of imperative languages in the future parallel computing era

《Parallel Computing》2014,40(3-4):1-33

There has been a renewed interest in dataflow computing models in recent years of technology scaling. Potentiality of exploiting huge parallelism, with the expense of low power, simpler circuit, less silicon area, is the main characteristic of a dataflow model. Growing trends in housing large number of functional units in a single chip, making use of local clocks, reducing energy consumptions, avoiding global wires are the main reasons behind the resurgence of dataflow models. To program a dataflow machine, new architectures suggest imperative languages rather than functional type dataflow languages or parallel languages because this is the right way to make the new architectures popular among the general community. Although for several decades scientists have been working on how imperative languages can be used in dataflow models efficiently, there is no systematic review on those works. Existing reviews on dataflow paradigm mainly focus on the architectures. Although few papers review programming languages of dataflow architectures, their discussions are limited to only dataflow languages and visual programming languages which are fundamentally different from imperative languages. In this paper, we conduct a systematic review on those works that attempt to provide a way to use imperative languages in any type of dataflow architectures. Our survey of compilers and related architectures cover the aspects like translation mechanisms of program construct, their optimization techniques, memory ordering methods, program allocation and scheduling and special architectural features. We also present some of our observations and future research directions obtained by exploring the literature. 相似文献

7.

Trace Software Pipelining

下载免费PDF全文

Wang Jian Andreas Krall M.Anton Ertl 《计算机科学技术学报》1995,10(6):481-490

Global software pipelining is a complex but efficient compilation technique to exploit instruction-level parallelism for loops with branches.This paper presents a novel global software pipelining technique,called Trace Software Pipelining,targeted to the instruction-level parallel processors such as Very Long Instruction Word (VLIW) and superscalar machines.Trace software pipelining applies a global code scheduling technique to compact the original loop body.The resulting loop is called a trace software pipelined (TSP) code.The trace softwrae pipelined code can be directly executed with special architectural support or can be transformed into a globally software pipelined loop for the current VLIW and superscalar processors.Thus,exploiting parallelism across all iterations of a loop can be completed through compacting the original loop body with any global code scheduling technique.This makes our new technique very promising in practical compilers.Finally,we also present the preliminary experimental results to support our new approach. 相似文献

8.

An introduction to compilation issues for parallel machines

Maya Gokhale William Carlson 《The Journal of supercomputing》1992,6(3-4):283-314

The exploitation of today's high-performance computer systems requires the effective use of parallelism in many forms and at numerous levels. This survey article discusses program analysis and restructuring techniques that target parallel architectures. We first describe various categories of architectures that are oriented toward parallel computation models: vector architectures, shared-memory multiprocessors, massively parallel machines, message-passing architectures, VLIWs, and multithreaded architectures. We then describe a variety of optimization techniques that can be applied to sequential programs to effectively utilize the vector and parallel processing units. After an overview of basic dependence analysis, we present restructuring transformations on DO loops targeted both to vectorization and to concurrent execution, interprocedural and pointer analysis, task scheduling, instruction-level parallelization, and compiler-assisted data placement. We conclude that although tremendous advances have been made in dependence theory and in the development of a toolkit of transformations, parallel systems are used most effectively when the programmer interacts in the optimization process. 相似文献

9.

Scalable communication architectures for massively parallel hardware multi-processors

Yahya Jan Lech Jóźwiak 《Journal of Parallel and Distributed Computing》2012

Modern complex embedded applications in multiple application fields impose stringent and continuously increasing functional and parametric demands. To adequately serve these applications, massively parallel multi-processor systems on a single chip (MPSoCs) are required. This paper is devoted to the design of scalable communication architectures of massively parallel hardware multi-processors for highly-demanding applications. We demonstrated that in the massively parallel hardware multi-processors the communication network influence on both the throughput and circuit area dominates the processors influence, while the traditionally used flat communication architectures do not scale well with the increase of parallelism. Therefore, we propose to design highly optimized application-specific partitioned hierarchical organizations of the communication architectures through exploiting the regularity and hierarchy of the actual information flows of a given application. We developed related communication architecture synthesis strategies and incorporated them into our quality-driven model-based multi-processor design methodology and related automated architecture exploration framework. Using this framework we performed a large series of architecture synthesis experiments. Some of the results of the experiments are presented in this paper. They demonstrate many features of the synthesized communication architectures and show that our method and related framework are able to efficiently synthesize well scalable communication architectures even for the high-end massively parallel multi-processors that have to satisfy extremely stringent computation demands. 相似文献

10.

The Cell Broadband Engine: Exploiting Multiple Levels of Parallelism in a Chip Multiprocessor 总被引：1，自引：0，他引：1

Michael Gschwind 《International journal of parallel programming》2007,35(3):233-262

As CMOS feature sizes continue to shrink and traditional microarchitectural methods for delivering high performance (e.g., deep pipelining) become too expensive and power-hungry, chip multiprocessors (CMPs) become an exciting new direction by which system designers can deliver increased performance. Exploiting parallelism in such designs is the key to high performance, and we find that parallelism must be exploited at multiple levels of the system: the thread-level parallelism that has become popular in many designs fails to exploit all the levels of available parallelism in many workloads for CMP systems. We describe the Cell Broadband Engine and the multiple levels at which its architecture exploits parallelism: data-level, instruction-level, thread-level, memory-level, and compute-transfer parallelism. By taking advantage of opportunities at all levels of the system, this CMP revolutionizes parallel architectures to deliver previously unattained levels of single chip performance. We describe how the heterogeneous cores allow to achieve this performance by parallelizing and offloading computation intensive application code onto the Synergistic Processor Element (SPE) cores using a heterogeneous thread model with SPEs. We also give an example of scheduling code to be memory latency tolerant using software pipelining techniques in the SPE. This paper is based in part on “Chip multiprocessing and the Cell Broadband Engine”, ACM Computing Frontiers 2006. 相似文献

11.

Improved parallelism and scheduling in multi-core software routers

Norbert Egi Gianluca Iannaccone Maziar Manesh Laurent Mathy Sylvia Ratnasamy 《The Journal of supercomputing》2013,63(1):294-322

Recent technological advances in commodity server architectures, with multiple multi-core CPUs, integrated memory controllers, high-speed interconnects, and enhanced network interface cards, provide substantial computational capacity, and thus an attractive platform for packet forwarding. However, to exploit this available capacity, we need a suitable software platform that allows effective parallel packet processing and resource management. In this paper, we at first introduce an improved forwarding architecture for software routers that enhances parallelism by exploiting hardware classification and multi-queue support, already available in recent commodity network interface cards. After evaluating the original scheduling algorithm of the widely-used Click modular router, we propose solutions for extending this scheduler for improved fairness, throughput, and more precise resource management. To illustrate the potential benefits of our proposal, we implement and evaluate a few key elements of our overall design. Finally, we discuss how our improved forwarding architecture and resource management might be applied in virtualized software routers. 相似文献

12.

A unified framework for optimizing locality, parallelism, andcommunication in out-of-core computations

Kandemir M. Choudhary A. Ramanujam J. Kandaswamy M.A. 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(7):648-668

This paper presents a unified framework that optimizes out-of-core programs by exploiting locality and parallelism, and reducing communication overhead. For out-of-core problems where the data set sizes far exceed the size of the available in-core memory, it is particularly important to exploit the memory hierarchy by optimizing the I/O accesses. We present algorithms that consider both iteration space (loop) and data space (file layout) transformations in a unified framework. We show that the performance of an out-of-core loop nest containing references to out-of-core arrays can be improved by using a suitable combination of file layout choices and loop restructuring transformations. Our approach considers array references one-by-one and attempts to optimize each reference for parallelism and locality. When there are references for which parallelism optimizations do not work, communication is vectorized so that data transfer can be performed before the innermost loop. Results from hand-compiles on IBM SP-2 and Inter Paragon distributed-memory message-passing architectures show that this approach reduces the execution times and improves the overall speedups. In addition, we extend the base algorithm to work with file layout constraints and show how it is useful for optimizing programs that consist of multiple loop nests 相似文献

13.

A scalable method for run-time loop parallelization

Lawrence Rauchwerger Nancy M. Amato David A. Padua 《International journal of parallel programming》1995,23(6):537-576

Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, well behaved, statically analyzable access patterns. However, they cannot extract a significant fraction of the avaialable, parallelism if the program has a complex and/or statically insufficiently defined access pattern, e.g., simulation programs with irregular domains and/or dynamically changing interactions. Since such programs represent a large fraction of all applications, techniques are needed for extracting their inherent parallelism at run-time. In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generatesinspector code that performas run-time preprocessing of the loop's access pattern, andscheduler code that schedules (and executes) the loop interations. The inspector is fully parallel, uses no sychronization, and can be applied to any loop (from which an inspector can be extracted). In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop:array privatization andreduction parallelization (elementwise). The ability to identify privatizable and reduction variables is very powerful since it eliminates the data dependences involving these variables and An abstract of this paper has been publsihed in Ref. 1. Research supported in part by Army contract #DABT63-92-C-0033. This work is not necessarily representative of the positions or policies of the Army of the Government. Research supported in part by Intel and NASA Graduate Fellowships. Research supported in part by an AT&T Bell Laboratoroies Graduate Fellowship and by the International Computer Science Institute, Berkeley, California. 相似文献

14.

The multiflow trace scheduling compiler 总被引：3，自引：0，他引：3

P. Geoffrey Lowney Stefan M. Freudenberger Thomas J. Karzes W. D. Lichtenstein Robert P. Nix John S. O'Donnell John C. Ruttenberg 《The Journal of supercomputing》1993,7(1-2):51-142

The Multiflow compiler uses the trace scheduling algorithm to find and exploit instruction-level parallelism beyond basic blocks. The compiler generates code for VLIW computers that issue up to 28 operations each cycle and maintain more than 50 operations in flight. At Multiflow the compiler generated code for eight different target machine architectures and compiled over 50 million lines of Fortran and C applications and systems code. The requirement of finding large amounts of parallelism in ordinary programs, the trace scheduling algorithm, and the many unique features of the Multiflow hardware placed novel demands on the compiler. New techniques in instruction scheduling, register allocation, memory-bank management, and intermediate-code optimizations were developed, as were refinements to reduce the overhead of trace scheduling. This article describes the Multiflow compiler and reports on the Multiflow practice and experience with compiling for instruction-level parallelism beyond basic blocks. 相似文献

15.

Using processor affinity in loop scheduling on shared-memorymultiprocessors

Markatos E.P. LeBlanc T.J. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(4):379-400

Loops are the single largest source of parallelism in many applications. One way to exploit this parallelism is to execute loop iterations in parallel on different processors. Previous approaches to loop scheduling attempted to achieve the minimum completion time by distributing the workload as evenly as possible while minimizing the number of synchronization operations required. The authors consider a third dimension to the problem of loop scheduling on shared-memory multiprocessors: communication overhead caused by accesses to nonlocal data. They show that traditional algorithms for loop scheduling, which ignore the location of data when assigning iterations to processors, incur a significant performance penalty on modern shared-memory multiprocessors. They propose a new loop scheduling algorithm that attempts to simultaneously balance the workload, minimize synchronization, and co-locate loop iterations with the necessary data. They compare the performance of this new algorithm to other known algorithms by using five representative kernel programs on a Silicon Graphics multiprocessor workstation, a BBN Butterfly, a Sequent Symmetry, and a KSR-1, and show that the new algorithm offers substantial performance improvements, up to a factor of 4 in some cases. The authors conclude that loop scheduling algorithms for shared-memory multiprocessors cannot afford to ignore the location of data, particularly in light of the increasing disparity between processor and memory speeds 相似文献

16.

Region-based parallelization of irregular reductions on explicitly managed memory hierarchies

Seonggun Kim Hwansoo Han Kwang-Moo Choe 《The Journal of supercomputing》2011,56(1):25-55

Multicore architectures are evolving with the promise of extreme performance for the classes of applications that require high performance and large bandwidth of memory. Irregular reduction is one of important computation patterns for many complex scientific applications, and it typically requires high performance and large bandwidth of memory. In this article, we propose region-based parallelization techniques for irregular reductions on multicore architectures with explicitly managed memory hierarchies. Managing memory hierarchy in software requires a lot of programming efforts and tends to be error-prone. The difficulties are even worse for applications with irregular data access patterns. To relieve the burden of memory management from programmers, we develop abstractions, particularly targeted to irregular reduction, for structuring parallel tasks, mapping the parallel tasks to processing units and scheduling data transfers between the memory hierarchies. Our framework employs iteration reordering based on regions of data along with dynamic scheduling of parallel tasks. We experimentally evaluate the effectiveness of our techniques for irregular reduction kernels on the Cell processor embedded in a Sony PlayStation3. Experimental results show the speedups of 8 to 14 on the six available SPEs. 相似文献

17.

Loop coalescing and scheduling for barrier MIMD architectures

O'Keefe M.T. Dietz H.G. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(9):1060-1064

Barrier MIMD's are asynchronous multiple instruction stream, multiple data stream architectures capable of parallel execution of variable execution time instructions and arbitrary control flow (e.g., while loops and calls); however, they differ from conventional MIMD's in that the need for run-time synchronization is significantly reduced. The authors consider the problem of scheduling nested loop structures on a barrier MIMD. The basic approach employs loop coalescing, a technique for transforming a multiply-nested loop into a single loop. Loop coalescing is extended to nested triangular loops, in which inner loop bounds are functions of outer loop indices. In addition, a more efficient scheme to generate the original loop indices from the coalesced index is proposed for the case of constant loop bounds. These results are general, and can be applied to extend previous work using loop coalescing techniques. The authors concentrate on using loop coalescing for scheduling barrier MIMDs, and show how previous work in loop transformations and linear scheduling theory can be applied to this problem 相似文献

18.

Speeding up high-speed protocol processors

Serpanos D.N. 《Computer》2004,37(9):108-111

Many network technologies aim to exploit the bandwidth of high-speed links, which now achieve data transfer rates up to several terabits per second. As packet interarrival times shrink to a few tens of nanoseconds, network systems must address a transmission-processing gap by providing extremely fast data paths as well as high-performance subsystems to implement such functions as protocol processing, memory management, and scheduling. Today, network processors are an important class of embedded processors, used all across the network systems space-from personal to local and wide area networks. Network processor architectures focus on exploiting parallelism to achieve high performance. They usually employ conventional architectural concepts to accelerate the processing required to switch packets between different protocol stacks. The architectures support the mechanisms that network protocols implement in a specific stack by providing efficient data paths and by executing many intelligent network or more homogeneous links - for example, a set of Ethernet links. Although network processors can also handle packets concurrently from different protocol stacks, we describe only single-stack processing here. However, the arguments and results extend to a multistack environment. 相似文献

19.

Using knowledge-based systems for research on parallelizing compilers

Chao-Tung Yang Shian-Shyong Tseng Yun-Woei Fann Ting-Ku Tsai Ming-Huei Hsieh Cheng-Tien Wu 《Concurrency and Computation》2001,13(3):181-208

相似文献

20.

Declustering: a new multiprocessor scheduling technique

Sih G.C. Lee E.A. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(6):625-637

The authors present a new compile-time scheduling heuristic called declustering, which schedules acyclic precedence graphs that fit the synchronous data flow (SDF) model onto multiprocessor architectures. This technique accounts for interprocessor communication (IPC) overheads and considers interconnection constraints in the architecture so that shared resource contention can be avoided. The algorithm initially invokes a new clustering method that uses graph-analysis techniques to isolate parallelism instances. When constructing an initial set of clusters, this procedure explicitly addresses the tradeoff between exploiting parallelism and incurring communication cost. By hierarchically combining these clusters and then systematically decomposing this hierarchy, the declustering method exposes parallelism instances in order of importance and attains a cluster granularity that fits the characteristics of the architecture. It is shown that declustering retains the clustering advantage of avoiding IPC, yet overcomes the inflexibility associated with traditional clustering approaches 相似文献