Similar Documents
20 similar documents found (search time: 46 ms)
1.
Loop skewing is a new procedure to derive the wavefront method of execution of nested loops. The wavefront method is used to execute nested loops on parallel and vector computers when none of the loops can be done in vector mode. Loop skewing is a simple transformation of loop bounds and is combined with loop interchanging to generate the wavefront. This derivation is particularly suitable for implementation in compilers that already perform automatic detection of parallelism and generation of vector and parallel code, such as those available today. Loop normalization, a loop transformation used by several vectorizing translators, is related to loop skewing, and we show how loop normalization, applied blindly, can adversely affect the parallelism detected by these translators.
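A minimal sketch of the skew-and-interchange derivation in C, assuming a hypothetical two-dimensional recurrence (the array a and the bounds N, M are invented, not taken from the paper):

/* Loop skewing + interchange to expose a wavefront. */
#include <stdio.h>

#define N 64
#define M 64
#define MAX(x, y) ((x) > (y) ? (x) : (y))
#define MIN(x, y) ((x) < (y) ? (x) : (y))

double a[N][M];

/* Original loop: neither level can run in vector/parallel mode because each
 * element depends on its north and west neighbours (distances (1,0), (0,1)). */
void original(void) {
    for (int i = 1; i < N; i++)
        for (int j = 1; j < M; j++)
            a[i][j] = a[i - 1][j] + a[i][j - 1];
}

/* Skewed (k = i + j) and interchanged: for a fixed diagonal k the inner
 * i-iterations are mutually independent, so the inner loop is a wavefront
 * that can be vectorized or run in parallel. */
void wavefront(void) {
    for (int k = 2; k <= N + M - 2; k++)              /* serial over diagonals */
        for (int i = MAX(1, k - (M - 1)); i <= MIN(N - 1, k - 1); i++) {
            int j = k - i;
            a[i][j] = a[i - 1][j] + a[i][j - 1];
        }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = (i == 0 || j == 0) ? 1.0 : 0.0;
    wavefront();               /* computes the same values as original() */
    printf("%g\n", a[N - 1][M - 1]);
    return 0;
}

Both versions perform the same computation; skewing only changes the loop bounds so that interchanging becomes legal and exposes the parallel inner loop.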

2.
In this work we consider a special optimization problem involved in compiling compound loops (combining nested and consecutive sub-loops) to Verilog. Each sub-loop of the compound loop may require a different optimized hardware configuration (OHC) for optimized execution times. For example, one loop requires at least two memory ports and one multiplier for an optimized execution time, while another loop may require only one memory port but two multipliers, yet one OHC should be selected for both loops. The goal is to compute a minimal OHC which, based on the different heat levels (expected numbers of iterations) of the sub-loops, is a good compromise between the conflicting requirements of the sub-loops. Though synthesis of nested loops has been implemented in quite a few systems, this aspect has not been considered so far. We avoid the use of time-consuming integer linear programming (ILP) techniques and instead use a fast space exploration technique combined with an efficient variant of list scheduling. Another novel aspect of the proposed system is the observation that the real latencies of the hardware units should be considered as variables of the OHC rather than fixed values, as is usually done in high-level synthesis systems. Experimental results show a significant improvement in the OHC without a significant increase in execution time due to the use of this search procedure.
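A toy illustration of the heat-weighted compromise in C; the cost model, resource kinds, and constants below are invented for illustration and are not the paper's scheduler:

/* Pick one hardware configuration (OHC) shared by all sub-loops of a
 * compound loop by exhaustively exploring a tiny configuration space. */
#include <stdio.h>

struct subloop { double heat; int mem_ops; int mults; };

/* Crude per-iteration latency bound under `ports` memory ports and
 * `muls` multipliers: the busier resource class dominates. */
double iter_latency(struct subloop s, int ports, int muls) {
    int mem = (s.mem_ops + ports - 1) / ports;   /* ceil(mem_ops / ports) */
    int mul = (s.mults + muls - 1) / muls;       /* ceil(mults / muls)    */
    return mem > mul ? mem : mul;
}

int main(void) {
    struct subloop loops[] = { { 1e6, 4, 1 },    /* hot, memory-bound      */
                               { 1e3, 1, 3 } };  /* cool, multiplier-bound */
    int nloops = 2, best_p = 1, best_m = 1;
    double best = 1e30;

    for (int p = 1; p <= 4; p++)
        for (int m = 1; m <= 4; m++) {
            double t = 0.0;
            for (int i = 0; i < nloops; i++)
                t += loops[i].heat * iter_latency(loops[i], p, m);
            t += 10.0 * (p + m);   /* penalize larger configurations */
            if (t < best) { best = t; best_p = p; best_m = m; }
        }
    printf("chosen OHC: %d memory ports, %d multipliers (weighted cost %.0f)\n",
           best_p, best_m, best);
    return 0;
}

With these invented numbers the hot, memory-bound sub-loop dominates the weighted sum, so the search buys enough memory ports to satisfy it first and only then adds multipliers for the cooler sub-loop.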

3.
The author presents strategies for static loop decomposition and scheduling as well as computer-assisted run-time scheduling that take into account, in addition to the cost of performing operations, the overhead costs associated with a decomposition and schedule. An algorithm for static decomposition of multidimensional loops based on the operation execution costs, communication costs, and synchronization costs is discussed. Synchronization instructions are introduced to ensure correct program execution following program decomposition. An algorithm for determining the explicit synchronization instruction that should be introduced to ensure correct execution of a program with arbitrarily nested loops is presented. Techniques for reducing run-time scheduling and communication and synchronization costs due to self-scheduling of multidimensional loops are also presented. Experiments performed on the Encore multiprocessor system demonstrate that the techniques developed can reduce overhead costs.
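A toy version in C of the kind of trade-off such a decomposition weighs; the cost model and constants are invented and are not the algorithm from the paper:

/* Choose how many processors to give a decomposed loop when splitting it
 * also buys communication and synchronization overhead. */
#include <stdio.h>

#define MAXP 64

int main(void) {
    double work = 1.0e6;   /* total operation execution cost           */
    double comm = 200.0;   /* communication cost added per processor   */
    double sync = 500.0;   /* cost of the synchronization instructions */
    int best_p = 1;
    double best_t = 1.0e30;

    for (int p = 1; p <= MAXP; p++) {
        double t = work / p + comm * p + sync;   /* modelled parallel time */
        if (t < best_t) { best_t = t; best_p = p; }
    }
    printf("decompose onto %d processors, estimated time %.0f\n", best_p, best_t);
    return 0;
}

Past a certain point the per-processor communication term outweighs the gain from dividing the work, which is why the decomposition has to account for overheads and not only operation costs.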

4.
Many approaches have been described for the parallel loop scheduling problem for shared-memory systems, but little work has been done on the data-dependent loop scheduling problem (nested loops with loop-carried dependencies). In this paper, we propose a general model for the data-dependent loop scheduling problem on distributed as well as shared memory systems. In order to achieve load balancing and low runtime scheduling and communication overhead, our model is based on a loop task graph and the notion of critical path. In addition, we develop a heuristic algorithm based on our model and on genetic algorithms to test the reliability of the model. We test our approach on different scenarios and benchmarks. The results are very encouraging and suggest a future parallel compiler implementation based on our model.

5.
In a SIMD or VLIW machine, conceptual synchronizations are accomplished by using a static code schedule that does not require run-time synchronization. The lack of run-time synchronization overhead makes these machines very effective for fine-grain parallelism, but they cannot execute parallel code structures as general as those executed by MIMD architectures, and this limits their utility. In this paper we present a timing analysis that allows a compiler for a MIMD machine to eliminate a large fraction of the run-time synchronization by making efficient use of static code scheduling. Although these techniques can be adapted to be applied to most MIMD machines, this paper centers on the analysis and scheduling for barrier MIMD machines. Barrier MIMDs are asynchronous multiple instruction stream/multiple data stream architectures capable of parallel execution of variable execution-time instructions and arbitrary control flow (e.g., while loops and calls). However, they also incorporate a special hardware barrier synchronization mechanism that facilitates static scheduling by providing a mechanism which the compiler can use to enforce precise timing constraints. In other words, the compiler tracks relative timing between processors and uses static code scheduling until the timing imprecision becomes too large, at which point the compiler simply inserts a barrier to reduce that timing imprecision to zero (or a small constant). This paper describes new scheduling and barrier placement algorithms for barrier MIMDs that are based loosely on the list scheduling approach employed for VLIWs [Ellis 1985]. In addition, the experimental results from scheduling thousands of synthetic benchmark programs for a parameterized barrier MIMD machine are presented.

6.
This paper describes several loop transformation techniques for extracting parallelism from nested loop structures. Nested loops can then be scheduled to run in parallel so that execution time is minimized. One technique is called selective cycle shrinking, and the other is called true dependence cycle shrinking. It is shown how selective shrinking is related to linear scheduling of nested loops and how true dependence shrinking is related to conflict-free mappings of higher dimensional algorithms into lower dimensional processor arrays. Methods are proposed in this paper to find the selective and true dependence shrinkings with minimum total execution time by applying the techniques of finding optimal linear schedules and optimal and conflict-free mappings proposed by W. Shang and A.B. Fortes.
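A one-dimensional illustration of cycle shrinking in C (plain shrinking, not the selective or true-dependence variants; the loop and the distance are invented): a recurrence with dependence distance 3 can be rewritten as a serial loop over blocks of 3 iterations whose members are mutually independent.

/* Cycle shrinking: a[i] depends on a[i-3], so any 3 consecutive
 * iterations are independent of each other. */
#include <stdio.h>

#define N 30
#define D 3                      /* dependence distance */

double a[N + D];

void shrunk(void) {
    for (int ib = D; ib < N + D; ib += D)                  /* serial over blocks    */
        for (int i = ib; i < ib + D && i < N + D; i++)     /* parallel within block */
            a[i] = a[i - D] + 1.0;
}

int main(void) {
    for (int i = 0; i < D; i++) a[i] = 0.0;
    shrunk();
    printf("a[%d] = %g\n", N + D - 1, a[N + D - 1]);       /* prints 10 = N / D */
    return 0;
}

The serial outer loop runs N/D times instead of N, so the speedup from the parallel inner loop is bounded by the dependence distance; selective and true dependence shrinking extend this idea to nested loops with multiple dependences.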

7.
This paper presents a novel approach to the problem of generating tiled code for nested for-loops transformed by a tiling transformation. Tiling or supernode transformation has been widely used to improve locality in multilevel memory hierarchies, as well as to execute loops efficiently on parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler task, especially when nonrectangular tile shapes and iteration space bounds are concerned. Our method considerably enhances previous work on rewriting tiled loops by considering parallelepiped tiles and arbitrary iteration space shapes. In order to generate tiled code, we first enumerate all tiles containing points within the iteration space and, second, sweep all points within each tile. For the first subproblem, we refine upon previous results concerning the computation of new loop bounds of an iteration space that has been transformed by a nonunimodular transformation. For the second subproblem, we transform the initial parallelepiped tile into a rectangular one, in order to generate efficient code with the aid of a nonunimodular transformation matrix and its Hermite Normal Form (HNF). Experimental results show that the proposed method significantly accelerates the compilation process and generates much more efficient code.
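For contrast, the rectangular special case that needs no such machinery can be written directly; the parallelepiped tiles and arbitrary bounds handled by the paper require the enumerate-then-sweep scheme and the HNF-based transformation described above. A sketch with invented bounds and tile sizes:

/* Rectangular tiling of a two-level loop nest: the outer loop pair
 * enumerates tiles, the inner pair sweeps the points of one tile. */
#include <stdio.h>

#define N 1000
#define M 1000
#define TI 32
#define TJ 32
#define MIN(x, y) ((x) < (y) ? (x) : (y))

static double c[N][M];

int main(void) {
    for (int ii = 0; ii < N; ii += TI)                      /* enumerate tiles */
        for (int jj = 0; jj < M; jj += TJ)
            for (int i = ii; i < MIN(ii + TI, N); i++)      /* sweep one tile  */
                for (int j = jj; j < MIN(jj + TJ, M); j++)
                    c[i][j] = (double)i * (double)j;
    printf("%g\n", c[N - 1][M - 1]);
    return 0;
}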

8.
New methods are presented for bounding and approximating the mean execution time of a partitioning algorithm, and these methods are compared to previous approaches. Distribution-driven and program-driven simulations show that two of the methods are usually accurate to within 10% and give good estimates even when certain independence assumptions are violated. Asymptotic approximations and upper bounds are derived for the average execution time of multiphase algorithms when there is no contention for processors in the parallel phase. In addition, the authors bound the average execution time under static and dynamic scheduling policies and determine the optimum number of parallel tasks to be created to minimize the execution time bounds with constant scheduling overhead.

9.
Using runtime information of load distributions and processor affinity, the authors propose an adaptive scheduling algorithm and its variations from different control mechanisms. The proposed algorithm applies different degrees of aggressiveness to adjust loop scheduling granularities, aiming at improving the execution performance of parallel loops by making scheduling decisions that match the real workload distributions at runtime. They experimentally compared the performance of the algorithm and its variations with several existing scheduling algorithms on two parallel machines: the KSR-1 and the Convex Exemplar. The kernel application programs used for performance evaluation were carefully selected for different classes of parallel loops. The results show that using runtime information to adaptively adjust scheduling granularity is an effective way to handle loops with a wide range of load distributions when no prior knowledge of the execution can be used. The overhead caused by collecting runtime information is insignificant in comparison with the performance improvement. The experiments show that the adaptive algorithm and its five variations outperformed the existing scheduling algorithms.
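The abstract does not give the control formula, so the sketch below only shows the shape of such a scheme: chunks are carved from the remaining iterations, and an aggressiveness knob (here a guided-self-scheduling-like divisor, with an invented adaptation rule) trades chunk size against balance.

/* Adaptive granularity sketch: chunk = remaining / (k * P).  A real
 * runtime would adjust k from measured per-processor finish times and
 * affinity information; the adjustment below is a placeholder. */
#include <stdio.h>

#define P 8                          /* processors */

long next_chunk(long remaining, double k) {
    long c = (long)(remaining / (k * P));
    return c > 0 ? c : 1;
}

int main(void) {
    long n = 10000, done = 0;
    double k = 1.0;                  /* 1.0 ~ guided; larger k = finer chunks */
    int chunks = 0;

    while (done < n) {
        long c = next_chunk(n - done, k);
        done += c;
        chunks++;
        if (chunks % P == 0 && k < 4.0)   /* pretend imbalance was observed */
            k *= 1.1;
    }
    printf("scheduled %ld iterations in %d chunks\n", n, chunks);
    return 0;
}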

10.
Many computationally intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops that have a regular stencil of data dependences. Tiling is a well-known compiler optimization that improves performance on such loops, particularly for computers with a multilevel hierarchy of parallelism and memory. Most previous work on tiling is limited in at least one of the following ways: it handles only nested loops of depth two, orthogonal tiling, or rectangular tiles. In our work, we tile loop nests of arbitrary depth using polyhedral tiles. We derive a prediction formula for the execution time of such tiled loops, which can be used by a compiler to automatically determine the tiling parameters that minimize the execution time. We also explain the notion of rise, a measure of the relationship between the shape of the tiles and the shape of the iteration space generated by the loop nest. The rise is a powerful tool in predicting the execution time of a tiled loop. It allows us to reason about how the tiling affects the length of the longest path of dependent tiles, which is a measure of the execution time of a tiling. We use a model of the tiled iteration space that allows us to determine the length of the longest path of dependent tiles using linear programming. Using the rise, we derive a simple formula for the length of the longest path of dependent tiles in rectilinear iteration spaces, a subclass of the convex iteration spaces, and show how to choose the optimal tile shape.

11.
Linear transformations are widely used to vectorize and parallelize loops. A subset of these transformations are unimodular transformations. When a unimodular transformation is used, the exact bounds of the transformed loop nest are easily computed and the steps of the loops are equal to 1. Unimodular loop transformations have been widely used since they permit the implementation of many useful loop transformations. Recently, nonunimodular transformations have been proposed to reduce communication requirements or to use the memory hierarchy efficiently. The methods used for unimodular transformations do not work in the case of nonunimodular transformations, since they do not produce the exact bounds of the transformed loop nest. In this paper, we present a method for nested loop transformation which gives the exact bounds for both unimodular and nonunimodular transformations. The basic idea is to use the Hermite Normal Form (HNF) of the transformation matrix.

12.
Supporting Timing Analysis by Automatic Bounding of Loop Iterations
Static timing analyzers, which are used to analyze real-time systems, need to know the minimum and maximum number of iterations associated with each loop in a real-time program so accurate timing predictions can be obtained. This paper describes three complementary methods to support timing analysis by bounding the number of loop iterations. First, an algorithm is presented that determines the minimum and maximum number of iterations of loops with multiple exits. Even when the number of iterations cannot be exactly determined, it is desirable to know the lower and upper iteration bounds. Second, when the number of iterations is dependent on unknown values of variables, the user is asked to provide bounds for these variables. These bounds are used to determine the minimum and maximum number of iterations. Specifying the values of variables is less error prone than specifying the number of loop iterations directly. Finally, a method is given to tightly predict the execution time of inner loops whose number of iterations is dependent on counter variables of outer level loops. This is accomplished by formulating the total number of iterations of a loop in terms of summations and solving the resulting equation. These three methods have been successfully integrated in an existing timing analyzer that predicts the performance for optimized code on a machine that exploits caching and pipelining. The result is tighter timing analysis predictions and less work for the user.
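A concrete instance of the third method, with an invented loop: when the inner trip count depends on the outer counter, summing the trip counts gives a tight bound where the naive product of the individual maxima does not.

/* The naive bound is N * N = 10000 body executions; summing the inner
 * trip counts gives sum_{i=0}^{N-1} (N - i) = N*(N+1)/2 = 5050, which is
 * exactly what the nest executes. */
#include <stdio.h>

#define N 100

int main(void) {
    long body = 0;
    for (int i = 0; i < N; i++)
        for (int j = i; j < N; j++)   /* trip count N - i */
            body++;
    printf("body executed %ld times; formula gives %d\n", body, N * (N + 1) / 2);
    return 0;
}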

13.
Energy saving is becoming one of the major design issues in processor architectures with multiple functional units (FUs). Nested loops are usually the most critical part of multimedia and high-performance DSP systems. For nested loops there is a tradeoff between power saving and performance requirements such as timing constraints and code size. This paper studies how to minimize total energy while satisfying the performance requirements of applications with multidimensional nested loops. An algorithm, energy minimization with loop fusion and FU schedule (EMLFS), is proposed. We first use retiming and partitioning to fuse nested loops. Then we use novel FU scheduling algorithms to maximize energy saving without sacrificing performance. Experimental results show that the average improvement in energy saving from our EMLFS algorithm is significant.

14.
Parallel loops account for the greatest amount of parallelism in numerical programs. Executing nested loops in parallel with low run-time overhead is thus very important for achieving high performance in parallel processing systems. However, in parallel processing systems with caches or local memories in the memory hierarchy, a "thrashing problem" may arise when data move back and forth frequently between the caches or local memories of different processors. The compiler techniques for solving this problem are not yet fully developed. In this paper, we present two restructuring techniques: loop staggering, and loop staggering and compacting. With these techniques we can not only significantly reduce cache and local memory thrashing, but also exploit the potential parallelism in the outer serial loop. Loop staggering benefits dynamic loop scheduling strategies, whereas loop staggering and compacting suits static loop scheduling strategies. Our method especially benefits parallel programs in which a parallel loop is enclosed by a serial loop and array elements are reused across iterations of the parallel loop.

15.
Specification of software pipelining using Petri nets
This paper presents a flexible model for software pipelining using Petri nets. Our technique, called the Petri Net Pacemaker (PNP), can create near-optimal pipelines with less algorithmic effort than other techniques. The pacemaker is a novel idea which exploits the cyclic behavior of Petri nets to model the problem of scheduling operations of a loop body for software pipelining. A way of improving the performance of loops containing predicates is given. The PNP technique also shows how nested loops can be pipelined. A comparison with some of the other techniques is presented. This work was partially supported by the National Science Foundation under grants CDA-9100788 and CDA-9200371.

16.
翟娟  汤震浩  李彬  赵建华  李宣东 《软件学报》2017,28(5):1051-1069
Proving software correctness with formal methods is an effective way to guarantee software reliability, and the analysis and verification of loop statements is central to formal proofs; handling loops has long been a difficult problem in program analysis and verification. This paper proposes describing the effect of executing a loop by the memory locations the loop modifies and the new values stored in them, and defines this execution effect as a loop summary. We also propose a method that automatically generates loop summaries for loops, including nested loops, that operate on commonly used data structures. Furthermore, based on loop summaries, we can automatically generate loop specifications, including loop invariants, loop preconditions, and loop postconditions. We have implemented the automatic generation of loop summaries and loop specifications and integrated them into the verification tool Accumulator. Experiments show that our method effectively generates loop summaries and multiple kinds of specifications, thereby assisting the formal proof of software programs, improving the degree of automation and the efficiency of verification, and reducing the burden on verification engineers.
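To make the notions concrete, here is a small C loop with the kind of summary and specification discussed above written as comments; the annotation style is illustrative and is not the syntax used by Accumulator.

/* Zero the first n cells of a[].
 *
 * Loop summary (memory modified by the loop and the new values stored there):
 *   modifies:   a[0..n-1], i
 *   new values: a[k] == 0 for every 0 <= k < n; i == n on exit
 *
 * Specification derived from the summary:
 *   precondition:   n >= 0 and a points to at least n ints
 *   loop invariant: 0 <= i <= n and a[k] == 0 for every 0 <= k < i
 *   postcondition:  a[k] == 0 for every 0 <= k < n
 */
void zero_prefix(int *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] = 0;
}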

17.
Both reuse and concurrency are performance-critical for stream processors. When applying loop unrolling and software pipelining separately to stream-level loops, either reuse or concurrency or both may be inadequately exploited. In this paper, we optimize modulo scheduling to maximize stream reuse and improve concurrency for stream-level loops. The key insight is that an unrolled and software-pipelined stream-level loop can be described by a set of reuse equations. Guided by reuse equations, a reuse-aware modulo scheduling algorithm is developed to simultaneously optimize the two performance objectives, reuse and concurrency, for a loop in a unified framework. Moreover, we describe a code generation algorithm to automatically produce the optimized loop from a given loop. The experimental results obtained on FT64 and by simulation demonstrate the effectiveness of the proposed approach.

18.
容红波  汤志忠 《软件学报》2001,12(4):544-555
Software pipelining is an important technique for loop scheduling. Pipelining loops that contain branches remains a hard problem. Existing algorithms fall into four categories: loop linearization, path separation, global scheduling, and path selection. None of them harmoniously resolves two conflicting issues: minimizing transfer time and avoiding worst-case constraints. This paper proposes a software pipelining framework based on path grouping and data-dependence relaxation that attempts to resolve both issues without contradiction. The main ideas are: (1) path grouping, i.e., grouping paths according to their execution probabilities and branch probabilities in order to minimize transfer time; and (2) data-dependence relaxation, which seeks to avoid worst-case constraints: when a loop has multiple paths, some dependences may never have an instance during execution, and the ideal strategy is to enforce a dependence only when it actually has an instance. Preliminary experiments and qualitative analysis show that this…

19.
An efficient algorithm to remove redundant dependences in simple loops with constant dependences is presented. Dependences constrain the parallel execution of programs and are typically enforced by synchronization instructions. The synchronization instructions represent a significant part of the overhead in the parallel execution of a program. Some program dependences are redundant because they are covered by other dependences. It is shown that unlike with single loops, in the case of nested loops, a particular dependence may be redundant at some iterations but not redundant at others, so that the redundancy of a dependence may not be uniform over the entire iteration space. A sufficient condition for the uniformity of redundancy in a doubly nested loop is developed.
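A small C check of the phenomenon, with invented distances: in a doubly nested loop carrying d1 = (0,1), d2 = (1,-1) and d3 = (1,0), the dependence d3 = d2 + d1 is covered by the chain d2-then-d1 only when the chain's intermediate iteration lies inside the iteration space, so d3 is redundant at interior iterations but not along one boundary.

/* Enumerate the sinks of d3 and test whether the covering chain exists. */
#include <stdio.h>

#define LO 1
#define HI 8                       /* iteration space: LO <= i, j <= HI */

int inside(int i, int j) { return i >= LO && i <= HI && j >= LO && j <= HI; }

int main(void) {
    for (int i = LO; i <= HI; i++)
        for (int j = LO; j <= HI; j++) {
            int si = i - 1, sj = j;          /* source of d3 = (1,0)            */
            if (!inside(si, sj)) continue;   /* d3 has no instance at this sink */
            int mi = si + 1, mj = sj - 1;    /* intermediate point after d2     */
            if (!inside(mi, mj))             /* chain d2-then-d1 is broken      */
                printf("d3 is NOT redundant at sink (%d,%d)\n", i, j);
        }
    return 0;
}

Only the sinks with j == LO are reported, which is exactly the non-uniform behaviour the abstract describes: synchronization for d3 could be dropped everywhere except on that boundary.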

20.
Existing OpenMP scheduling policies usually handle linear loop structures in programs with dynamic scheduling, which suffers from load imbalance and high scheduling overhead. This paper proposes Nonlinear_static, a nonlinear static scheduling policy for loops whose workload increases or decreases linearly. It builds a scheduling model that combines the linear rate of change of the loop workload with the total workload, the peak workload, and the number of threads, and computes the mapping of loop iterations to threads so that chunk sizes grow or shrink nonlinearly. The workload of the linear loop is thereby distributed evenly across the threads, and the policy is implemented in the open-source OMPi compiler. Multithreaded scheduling experiments on the Adjoint Convolution, Compute Pots, Matrix Multiplication, and Mandelbrot Set programs show that, compared with static, dynamic, and guided scheduling, the Nonlinear_static policy shortens execution time by 5%-10% on linear loop structures and incurs no scheduling overhead.
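The closed form of the published model is not reproduced here; the sketch below only shows how nonlinearly sized chunks can be derived for a workload that grows linearly with the iteration index, using the back-of-the-envelope rule that each thread should receive an equal share of the cumulative work (boundaries at N*sqrt(t/P)).

/* Static partition of a triangular loop (cost of iteration i ~ i) into
 * P chunks of roughly equal work; chunk sizes shrink nonlinearly. */
#include <math.h>
#include <stdio.h>

#define P 4
#define N 1000

int main(void) {
    long bound[P + 1];
    bound[0] = 0;
    for (int t = 1; t < P; t++)
        bound[t] = (long)(N * sqrt((double)t / P) + 0.5);
    bound[P] = N;

    for (int t = 0; t < P; t++) {
        long work = 0;
        for (long i = bound[t]; i < bound[t + 1]; i++)
            work += i;                        /* modelled cost of iteration i */
        printf("thread %d: iterations [%ld, %ld), work %ld\n",
               t, bound[t], bound[t + 1], work);
    }
    return 0;
}

Because all chunk boundaries are computed once up front, the schedule is static and incurs no per-chunk scheduling overhead, while the unequal chunk sizes compensate for the linearly growing iteration cost.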
