期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Code compaction for parallel architectures

Kasi Anantha Fred Long 《Software》1990,20(6):537-554

There are two principal methods used to exploit the parallelism available on a parallel machine: the program to be executed can be optimized by hand, or the program can be automatically converted to parallel machine code by a compiler. The first method usually derives parallelism at the procedure level; a parallel program is written in a high-level language and typically has various modules executing in parallel. By contrast, the compiler methodically transforms the program into parallel code using various transformations, such as code movement. The automatic conversion of a program to parallel code is called compaction or parallelization. This paper describes the evolution of a new compaction program and presents a new algorithm for determining legal code movements. A simulator of the target architecture was used to estimate the execution times of a sample suite of programs before and after compaction. The results verify that substantial advantages arise from applying this compaction technique. 相似文献

2.

A Partial Evaluator for A Parallel Lambda Language

下载免费PDF全文

Kuang Hairong Sun Yongqiang Lin Kai 《计算机科学技术学报》1997,12(5):441-457

This paper describes theoretical and practical aspects of a partial evaluator that treats a parallel lambda language.The parallel language presented is a combination of lambda calculus and message passing communication mechanism.This parallel language can be used to write a programming language‘s denotational semantics which extracts the paallelism in the program.From this denotational definition of the programming language,the partial evaluator can generate parallel compiler of the language by self-application. The key technique of partial evaluation is binding time analysis that determines in advance which parts of the source program can be evaluated during partial evaluation,and which parts cannot,A binding time analysis is described based upon type inference.A new type chcode in introduced into the type system,which denotes the type of those expressions containing residual channel operations.A well-formedness criterion is given which ensures that partial evaluation not only doesn‘t commit type errors but also doesn‘t change the sequence of channel operations.Before binding time analysis,channel analysis is used to analyze the communication relationship between send and receive processes. 相似文献

3.

一种改进数据流图的子字并行程序表示方法

唐文胜吴丹《计算机工程与应用》2006,42(29):103-106

论文致力于对图像处理算法的串行C程序进行子字并行分析,并重定向到带有多媒体扩展的通用处理器和多媒体专用嵌入式微处理器。图像处理算法的特点决定其是内在可并行的,这种并行粒度介于数据并行(DLP)和指令级并行(ILP)之间,称之为子字并行。但是,当前的编译技术很难充分挖掘和定位程序基本块内的子字并行,对此设计了一种基于流图程序表示的编译方法,能够从串行程序中显式地定位子字并行。扩展了编译器的功能,增加了特定的模式库,基于模式识别的控制流和数据流分析后,产生特定的子字并行流图(SWFG,Sub-WordFlowGraph),并将该图作为中间表示,提供给子字并行指令选择,进而实现有效的子字并行代码产生。相似文献

4.

Compiling for distributed memory architectures

Rogers A. Pingali K. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(3):281-298

The lack of high-level languages and good compilers for parallel machines hinders their widespread acceptance and use. Programmers must address issues such as process decomposition, synchronization, and load balancing. We have developed a parallelizing compiler that, given a sequential program and a memory layout of its data, performs process decomposition while balancing parallelism against locality of reference. A process decomposition is obtained by specializing the program for each processor to the data that resides on that processor. If this analysis fails, the compiler falls back to a simple but inefficient scheme called run-time resolution. Each process's role in the computation is determined by examining the data required for execution at run-time. Thus, our approach to process decomposition is data-driven rather than program-driven. We discuss several message optimizations that address the issues of overhead and synchronization in message transmission. Accumulation reorganizes the computation of a commutative and associative operator to reduce message traffic. Pipelining sends a value as close to its computation as possible to increase parallelism. Vectorization of messages combines messages with the same source and the same destination to reduce overhead. Our results from experiments in parallelizing SIMPLE, a large hydrodynamics benchmark, for the Intel iPSC/2, show a speedup within 60% to 70% of handwritten code 相似文献

5.

A performance debugging tool for high performance Fortran programs

Takashi Suzuoka Jaspal Subhlok Thomas Gross 《Concurrency and Computation》1997,9(10):927-945

Parallel languages allow the programmer to express parallelism at a high level. The management of parallelism and the generation of interprocessor communication is left to the compiler and the runtime system. This approach to parallel programming is particularly attractive if a suitable widely accepted parallel language is available. High Performance Fortran (HPF) has emerged as the first popular machine independent parallel language, and remarkable progress has been made towards compiling HPF efficiently. However, the performance of HPF programs is often poor and unpredictable, and obtaining adequate performance is a major stumbling block that must be overcome if HPF is to gain widespread acceptance. The programmer is often in the dark about how to improve the performance of an HPF program since poor performance can be attributed to a variety of reasons, including poor choice of algorithm, limited use of parallelism, or an inefficient data mapping. This paper presents a profiling tool that allows the programmer to identify the regions of the program that execute inefficiently, and to focus on the potential causes of poor performance. The central idea is to distinguish the code that is executing efficiently from the code that is executing poorly. Efficient code uses all processors of a parallel system to make progress, while inefficient code causes processors to wait, execute replicated code, idle, communicate, or perform compiler bookkeeping. We designate the latter code as non-scalable, since adding more processors generally does not lead to improved performance for such code. By analogy, the former code is called scalable. The tool presented here separates a program into scalable and non-scalable components and identifies the causes of non-scalability of different components. We show that compiler information is the key to dividing the execution times into logical categories that are meaningful to the programmer. We present the design and implementation of a profiler that is integrated with Fx, a compiler for a variant of HPF. The paper includes two examples that demonstrate how the data reported by the profiler are used to identify and resolve performance bugs in parallel programs. © 1997 John Wiley & Sons, Ltd. 相似文献

6.

Toward an automatic parallelization of sparse matrix computations

《Journal of Parallel and Distributed Computing》2005,65(3):313-330

In this paper, we propose a generic method of automatic parallelization for sparse matrix computation. This method is based on both a refinement of the data-dependence test proposed by Bernstein and an inspector–executor scheme which is specialized to each input program of the compiler. This analysis mixes compilation process and run-time process.The sparsity of underlying data-structure determines a specific parallelism which increases the degree of parallelism of an algorithm. Such a source of parallelism had already been applied to many numerical algorithms such as the usual Cholesky factorization or LU-decomposition algorithms considered as the gold standards of parallelization based on sparsity. The standard automatic parallelization method cannot tackle such source of parallelism because it is based on the value of cells arrays and not merely on the memory addressing function.Addressing the automatization of this parallelism requires to develop a mixed compile-time and runtime approach integrated in a inspector–executor process. The compilation step provides a dedicated inspector devoted to the analyzed program. The inspector computes the dependence graph at runtime which allows a dynamic parallelization of the execution.As expressed just before, the generic scheme developed in this paper follows the design principles which have been applied, but at each time in an ad hoc way, to many sparse parallelization of numerical algorithms such as Cholesky algorithm. As far as we know, no general formal framework has been proposed to automate such a method of sparse parallelization. In this paper, we propose a generic framework of sparse parallelization (i.e. numerical program independent) which can be applied to any numerical programs satisfying the usual syntactic constraints of parallelization. 相似文献

7.

A compiler optimization algorithm for shared-memory multiprocessors

McKinley K.S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(8):769-787

This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel, Fortran programs operating over dense matrices. The programs initially were hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines 相似文献

8.

Fast bio-inspired computation using a GPU-based systemic computer

Marjan Rouhipour Peter J. Bentley Hooman Shayani 《Parallel Computing》2010,36(10-11):591-617

Biology is inherently parallel. Models of biological systems and bio-inspired algorithms also share this parallelism, although most are simulated on serial computers. Previous work created the systemic computer – a new model of computation designed to exploit many natural properties observed in biological systems, including parallelism. The approach has been proven through two existing implementations and many biological models and visualizations. However to date the systemic computer implementations have all been sequential simulations that do not exploit the true potential of the model. In this paper the first ever parallel implementation of systemic computation is introduced. The GPU Systemic Computation Architecture is the first implementation that enables parallel systemic computation by exploiting the multiple cores available in graphics processors. Comparisons with the serial implementation when running two programs at different scales show that as the number of systems increases, the parallel architecture is several hundred times faster than the existing implementations, making it feasible to investigate systemic models of more complex biological systems. 相似文献

9.

基于分块数据结构的冲击问题并行计算 总被引：1，自引：0，他引：1

肖永浩黄清南《计算机辅助工程》2011,20(1):33-36

针对三维冲击问题,基于分块数据结构在共享内存并行机上实现OpenMP并行计算.分块数据结构不仅能有效利用计算机多层存储结构,而且增加OpenMP的并行粒度.数值实验表明:在使用分块数据结构后,串行程序的计算速度能提高3倍.通过柱体冲击平板数值模拟实验讨论并行程序的加速比和效率,表明并行程序能有效减少总计算时间. 相似文献

10.

Efficient Parallel Nonlinear Multigrid Relaxation Algorithms for Low-Level Vision Applications

《Journal of Parallel and Distributed Computing》1995,29(1):96-103

Multigrid techniques have been shown to significantly improve the convergence rate of the nonlinear relaxation algorithms used in computer vision for the extraction of low-level image features. It is also well known that the computations involved with relaxation algorithms are regular and local, and lead naturally to massive data parallelism. However, standard data parallelism does not exploit the large computing resources of the now available massively parallel 2D processor arrays when coarse image resolutions (i.e., small image grids) have to be processed, like in multigrid methods. In this research note, we present an algorithmic framework which enables us making a full use of the large potential of data parallelism for the implementation of nonlinear multigrid relaxation methods. The approach combines two different levels of parallelism: parallel updating of the image sites and concurrent explorations of the configuration space of the problem. The efficiency of the method is demonstrated on two different low-level vision applications: restoration of noisy images and optical flow computation. 相似文献

11.

谱方法求解水声传播问题的优化与并行

马现王勇献朱小谦屠厚旺李朋颜恺壮《计算机工程与科学》2022,44(3):381-389

水声传播数值计算的效率是各类水声学应用关心的核心因素之一,谱方法作为求解微分方程的一种数值方法,具有精度高、收敛速度快等优点,因此,近年来利用简正波-谱方法求解水声传播方程引起了许多学者的关注;然而,谱方法计算量更大,计算时间更长,在求解大范围海域声传播问题时,难以满足实时性的需求.因此,需要借助现代高性能计算机系统,... 相似文献

12.

Multiprocessor execution of functional programs 总被引：1，自引：0，他引：1

Benjamin Goldberg 《International journal of parallel programming》1988,17(5):425-473

Functional languages have recently gained attention as vehicles for programming in a concise and elegant manner. In addition, it has been suggested that functional programming provides a natural methodology for programming multiprocessor computers. This paper describes research that was performed to demonstrate that multiprocessor execution of functional programs on current multiprocessors is feasible, and results in a significant reduction in their execution times.Two implementations of the functional language ALFL were built on commercially available multiprocessors.Alfalfa is an implementation on the Intel iPSC hypercube multiprocessor, andBuckwheat is an implementation on the Encore Multimax shared-memory multiprocessor. Each implementation includes a compiler that performs automatic decomposition of ALFL programs and a run-time system that supports their execution. The compiler is responsible for detecting the inherent parallelism in a program, and decomposing the program into a collection of tasks, calledserial combinators, that can be executed in parallel.The abstract machine model supported by Alfalfa and Buckwheat is calledheterogeneous graph reduction, which is a hybrid of graph reduction and conventional stack-oriented execution. This model supports parallelism, lazy evaluation, and highe order functions while at the same time making efficient use of the processors in the system. The Alfalfa and Buckwheat runtime systems support dynamic load balancing, interprocessor communication (if required), and storage management. A large number of experiments were performed on Alfalfa and Buckwheat for a variety of programs. The results of these experiments, as well as the conclusions drawn from them, are presented.This research was supported in part by National Science Foundation grants DCR-8302018 and DCR-8521451, by a DARPA subcontract with SDC/Unisys, and by gifts from Burroughs Austin Research Center and the Intel Corporation. 相似文献

13.

A framework for exploiting task and data parallelism on distributedmemory multicomputers

Ramaswamy S. Sapatnekar S. Banerjee P. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(11):1098-1116

Distributed Memory Multicomputers (DMMs), such as the IBM SP-2, the Intel Paragon, and the Thinking Machines CM-5, offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications-the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster 相似文献

14.

P3 L: A structured high-level parallel language,and its structured support

Bruno Bacci Marco Danelutto Salvatore Orlando Susanna Pelagatti Marco Vanneschi 《Concurrency and Computation》1995,7(3):225-255

The paper presents a parallel programming methodology that ensures easy programming, efficiency and portability of programs to different machines belonging to the class of the general-purpose, distributed-memory, MIMD architectures. The methodology is based on the definition of a new, high-level, explicitly parallel language, called P³ L, and of a set of static tools that automatically adapt the program features for each target architecture. P³ L does not require programmers to specify process activations, the actual parallelism degree, scheduling, or interprocess communications, i.e. all those features that need to be adjusted to harness each specific target machine. Parallelism is, on the other hand, expressed in a structured and qualitative way, by hierarchical composition of a restricted set of language constructs, corresponding to those forms of parallelism that are frequently encountered in parallel applications, and that can be efficiently implemented. The efficient portability of P³ L applications is guaranteed by the compiler along with the novel structure of the support. The compiler automatically adapts the program features for each specific architecture, using the costs (in terms of performance) of the low-level mechanisms exported by the architecture itself. In our methodology, these costs, along with other features of the architecture, are viewed through an abstract machine, whose interface is used by the compiler to produce the final object code. 相似文献

15.

A transformation approach to derive efficient parallelimplementations

Rauber T. Runger G. 《IEEE transactions on pattern analysis and machine intelligence》2000,26(4):315-339

The construction of efficient parallel programs usually requires expert knowledge in the application area and a deep insight into the architecture of a specific parallel machine. Often, the resulting performance is not portable, i.e., a program that is efficient on one machine is not necessarily efficient on another machine with a different architecture. Transformation systems provide a more flexible solution. They start with a specification of the application problem and allow the generation of efficient programs for different parallel machines. The programmer has to give an exact specification of the algorithm expressing the inherent degree of parallelism and is released from the low-level details of the architecture. We propose such a transformation system with an emphasis on the exploitation of the data parallelism combined with a hierarchically organized structure of task parallelism. Starting with a specification of the maximum degree of task and data parallelism, the transformations generate a specification of a parallel program for a specific parallel machine. The transformations are based on a cost model and are applied in a predefined order, fixing the most important design decisions like the scheduling of independent multitask activations, data distributions, pipelining of tasks, and assignment of processors to task activations. We demonstrate the usefulness of the approach with examples from scientific computing 相似文献

16.

Data-parallel programming on multicomputers

Quinn M.J. Hatcher P.J. 《Software, IEEE》1990,7(5):69-76

The inadequacies of conventional parallel languages for programming multicomputers are identified. The C* language is briefly reviewed, and a compiler that translates C* programs into C programs suitable for compilation and execution on a hypercube multicomputer is presented. Results illustrating the efficiency of executing data-parallel programs on a hypercube multicomputer are reported. They show the speedup achieved by three hand-compiled C* programs executing on an N-Cube 3200 multicomputer. The first two programs, Mandelbrot set calculation and matrix multiplication, have a high degree of parallelism and a simple control structure. The C* compiler can generate relatively straightforward code with performance comparable to hand-written C code. Results for a C* program that performs Gaussian elimination with partial pivoting are also presented and discussed 相似文献

17.

基于嵌套循环分类的并行识别技术

赵捷赵荣彩丁锐黄品丰《软件学报》2012,23(10):2695-2704

传统的分布存储并行编译系统大多是在共享存储并行编译系统的基础上开发的.共享存储并行编译系统的并行识别技术适合OpenMP代码生成,实现方式是将所有嵌套循环都按照相同的识别方法进行处理,用于分布存储并行编译系统必然会导致无法高效发掘程序的并行性.分布存储并行编译系统应根据嵌套循环结构的特点进行分类处理,提出适合MPI代码生成的并行识别技术.为解决上述问题,根据嵌套循环的结构和MPI并行程序的特点,提出了一种新的嵌套循环分类方法,并针对不同的嵌套循环分别提出了相应的并行识别技术.实验结果表明,与采用传统并行识别技术的分布存储并行编译系统相比,按照所提方法对嵌套循环进行分类,采用相应并行识别技术的编译系统能够更高效地识别基准程序中的并行循环,自动生成的MPI并行代码其性能加速比提高了20%以上. 相似文献

18.

Partial evaluation in parallel

Charles Consel Olivier Danvy 《LISP and Symbolic Computation》1993,5(4):327-342

Partially evaluating a procedural program amounts to building a series of mutually-recursive specialized procedures. When a procedure call in the source program gets specialized into a residual call, the called procedure needs to be processed to occur in the residual program. Because the order of procedure definitions in the residual program is immaterial, it does not matter in which order these two events — building the residual call and building the residual procedure — are scheduled. Therefore, partial evaluation offers a basic opportunity for an MIMD type of parallelism with shared global memory where in essence, the mutually-recursive specialized procedures are built in parallel as specialization points are met and the relation binding source and residual procedures is globalized, to preserve its uniqueness.We have translated a sequential partial evaluator written in T (a dialect of Scheme) into Mul-T (a parallel extension of T) by adding one semaphore with each specialization point and one future to construct the residual procedure in parallel with the current specialization. The resulting parallel partial evaluator has been observed to be faster than the sequential one in proportion to the size of the source program and to the number of specialized procedures in the residual program.Our sequential partial evaluator is self-applicable. Because the semaphores and the future are run-time operations, our parallel partial evaluator is still self-applicable. In principle it can be and in practice we have used it to generate parallel compilers, i.e., specializers dedicated to an interpreter and processing its static and dynamic semantics in parallel, non trivially. Again, parallelism in dedicated specializers is determined by the size of the source program and the number of specialized procedures in the residual program.This work was supported by Darpa under Grant N00014-88-k-0573.This work was carried out during a summer visit to Yale University in 1990. This paper was written at Kansas State University and completed during a spring visit at Carnegie Mellon University in 1993. 相似文献

19.

Compiler technology for machine-indepenent parallel programming

Ken Kennedy 《International journal of parallel programming》1994,22(1):79-98

Historically, the principal achievement of compiler technology has been to make it possible to program in a high-level, machine-independent style. The absence of compiler technology to provide such a style for parallel computers is the main reason these systems have not found widespread acceptance. This paper examines the prospects for machine-independent parallel programming, concentrating on Fortran D and High Performance Fortran, which support machine-independent expression of “data parallelism.” 相似文献

20.

Automatic Parallelization: Executing Sequential Programs on a Task-Based Parallel Runtime

Alcides?Fonseca Email author Bruno?Cabral Jo?o?Rafael Ivo?Correia 《International journal of parallel programming》2016,44(6):1337-1358

There are billions of lines of sequential code inside nowadays’ software which do not benefit from the parallelism available in modern multicore architectures. Automatically parallelizing sequential code, to promote an efficient use of the available parallelism, has been a research goal for some time now. This work proposes a new approach for achieving such goal. We created a new parallelizing compiler that analyses the read and write instructions, and control-flow modifications in programs to identify a set of dependencies between the instructions in the program. Afterwards, the compiler, based on the generated dependencies graph, rewrites and organizes the program in a task-oriented structure. Parallel tasks are composed by instructions that cannot be executed in parallel. A work-stealing-based parallel runtime is responsible for scheduling and managing the granularity of the generated tasks. Furthermore, a compile-time granularity control mechanism also avoids creating unnecessary data-structures. This work focuses on the Java language, but the techniques are general enough to be applied to other programming languages. We have evaluated our approach on 8 benchmark programs against OoOJava, achieving higher speedups. In some cases, values were close to those of a manual parallelization. The resulting parallel code also has the advantage of being readable and easily configured to improve further its performance manually. 相似文献