Similar Literature
20 similar documents found (search time: 109 ms)
1.
2.
Optimizing compilers for data-parallel languages such as High Performance Fortran perform a complex sequence of transformations. However, the effects of many transformations are not independent, which makes it challenging to generate high quality code. In particular, some transformations introduce conditional control flow, while others make some conditionals unnecessary by refining program context. Eliminating unnecessary conditional control flow during compilation can reduce code size and remove a source of overhead in the generated code. This paper describes algorithms to compute symbolic constraints on the values of expressions used in control predicates and to use these constraints to identify and remove unnecessary conditional control flow. These algorithms have been implemented in the Rice dHPF compiler, and we show that they are effective in reducing the number of conditionals and the overall size of generated code. Finally, we describe a synergy between control flow simplification and data-parallel code generation based on loop splitting which achieves the effects of narrower data-parallel compiler optimizations such as vector message pipelining and the use of overlap areas.
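To make the loop-splitting idea concrete, here is a minimal hand-written sketch (our own example, not dHPF output): the predicate `i < b` partitions the iteration space, so a guarded data-parallel loop can be split into two unconditional loops, removing the per-iteration branch.

```c
/* A minimal sketch (not the dHPF algorithm itself) of how loop splitting
 * removes a conditional from a generated data-parallel loop. */
#include <stdio.h>

#define N 16

/* Before: every iteration pays for the branch. */
static void guarded(double *a, const double *halo, const double *local,
                    int n, int b) {
    for (int i = 0; i < n; i++) {
        if (i < b)
            a[i] = halo[i];    /* e.g. values from an overlap area */
        else
            a[i] = local[i];
    }
}

/* After: the constraint i < b is resolved by splitting the loop. */
static void split(double *a, const double *halo, const double *local,
                  int n, int b) {
    for (int i = 0; i < b; i++) a[i] = halo[i];
    for (int i = b; i < n; i++) a[i] = local[i];
}

int main(void) {
    double halo[N], local[N], a1[N], a2[N];
    for (int i = 0; i < N; i++) { halo[i] = i; local[i] = 100 + i; }
    guarded(a1, halo, local, N, 4);
    split(a2, halo, local, N, 4);
    for (int i = 0; i < N; i++)
        if (a1[i] != a2[i]) { puts("mismatch"); return 1; }
    puts("both versions agree");
    return 0;
}
```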

3.
Traditional vectorization and superword-level parallelism methods rely on data dependence analysis to identify parallelism in programs, but dependence analysis cannot handle unstructured control flow statements, and existing compilers have only limited ability to vectorize such statements. To address this, a vectorization method for exit-branch statements targeting SIMD extension architectures is presented; for exit branches within a single vector factor, the method performs automatic vectorization effectively. Test results show that the method both fully exploits the parallelism in the program's data flow and preserves the correctness of the control flow semantics.
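As a hedged illustration of the underlying idea (our own sketch, not the paper's algorithm), the loop below tests an exit predicate on a whole vector factor of four floats at once with SSE intrinsics, falling back to scalar code only inside the block where the exit fires:

```c
/* Requires SSE (x86). Find the first index with a[i] > limit, or n. */
#include <stdio.h>
#include <xmmintrin.h>

static int first_exceeding(const float *a, int n, float limit) {
    const __m128 vlimit = _mm_set1_ps(limit);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(a + i);
        int mask = _mm_movemask_ps(_mm_cmpgt_ps(v, vlimit));
        if (mask)                       /* exit branch taken in this block */
            for (int j = 0; j < 4; j++) /* scalar cleanup inside the block */
                if (a[i + j] > limit) return i + j;
    }
    for (; i < n; i++)                  /* scalar tail */
        if (a[i] > limit) return i;
    return n;
}

int main(void) {
    float a[10] = {1, 2, 3, 4, 5, 99, 7, 8, 9, 10};
    printf("first index over 50: %d\n", first_exceeding(a, 10, 50.0f));
    return 0;
}
```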

4.
On modern computers, the performance of programs is often limited by memory latency rather than by processor cycle time. To reduce the impact of memory latency, the restructuring compiler community has developed locality-enhancing program transformations such as loop permutation and tiling. These transformations work well for perfectly nested loops (loops in which all assignment statements are contained in the innermost loop), but their performance on codes such as matrix factorizations that contain imperfectly nested loops leaves much to be desired. In this paper, we propose an alternative approach called data-centric transformation. Instead of reasoning directly about the control structure of the program, a compiler using the data-centric approach chooses an order for the arrival of data elements in the cache, determines what computations should be performed when that data arrives, and generates the appropriate code. At runtime, program execution will automatically pull data into the cache in an order that corresponds approximately to the order chosen by the compiler; since statements that touch a data structure element are scheduled close together, locality is improved. The idea of data-centric transformation is very general, and in this paper, we discuss a particular transformation called data-shackling. We have implemented shackling in the SGI MIPSPro compiler which already has a sophisticated implementation of control-centric transformations for locality enhancement. We present experimental results on the SGI Octane comparing the performance of the two approaches, and show that for dense numerical linear algebra codes, data-shackling does better by factors of two to five.
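A minimal sketch of the data-centric viewpoint, using our own block sizes rather than anything from the MIPSPro implementation: the compiler picks the order in which blocks of C arrive in cache and schedules every statement instance touching a block while it is resident. For matrix multiplication the effect resembles blocking:

```c
/* An illustration of data-shackling's effect, not the MIPSPro code:
 * choose B-by-B blocks of C, and do all work touching each block. */
#include <stdio.h>

#define N 8
#define B 4   /* block edge, chosen so a C block fits in cache */

static void matmul_shackled(double C[N][N], double A[N][N], double Bm[N][N]) {
    for (int ib = 0; ib < N; ib += B)        /* choose the next C block */
        for (int jb = 0; jb < N; jb += B)
            /* perform every update that touches C[ib..][jb..] */
            for (int k = 0; k < N; k++)
                for (int i = ib; i < ib + B; i++)
                    for (int j = jb; j < jb + B; j++)
                        C[i][j] += A[i][k] * Bm[k][j];
}

int main(void) {
    double A[N][N], Bm[N][N], C[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; Bm[i][j] = 2.0; }
    matmul_shackled(C, A, Bm);
    printf("C[0][0] = %g (expect %g)\n", C[0][0], 2.0 * N);
    return 0;
}
```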

5.
Region scheduling, a technique applicable to both fine-grain and coarse-grain parallelism, uses a program representation that divides a program into regions consisting of source and intermediate level statements and permits the expression of both data and control dependencies. Guided by estimates of the parallelism present in regions, the region scheduler redistributes code, thus providing opportunities for parallelism in those regions containing insufficient parallelism compared to the capabilities of the executing architecture. The program representation and the transformations are applicable to both structured and unstructured programs, making region scheduling useful for a wide range of applications. The results of experiments conducted using the technique in the generation of code for a reconfigurable long instruction word architecture are presented. The advantages of region scheduling over trace scheduling are discussed.

6.
As vector lengths keep growing, SIMD extension units can exploit ever larger amounts of data-level parallelism, but the parallelism threshold a program must meet also rises. For existing auto-vectorizing compilers, if the analysis phase cannot extract enough data-level parallelism from the serial code to completely fill the vector registers, the corresponding vector code transformation phase is never entered and the code is not vectorized. Longer vector lengths thus deprive programs with insufficient parallelism of the opportunity to be vectorized, degrading performance. To make fuller use of the SIMD unit, this paper introduces ISLP, a basic-block-oriented non-full-load vectorization method. Based on the open-source GCC compiler, the design and implementation of ISLP are described in detail from three aspects: parallelism detection, code generation, and the cost model. Experimental results on standard benchmark suites show that the method can effectively vectorize programs with insufficient superword-level parallelism and improve execution efficiency. The selected test cases achieve an average speedup of 1.14 after vectorization, an 11.8% performance improvement over the conventional SLP method.
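As a hedged sketch of what non-full-load vectorization means (our own SSE example; ISLP itself works inside GCC's SLP pass), two isomorphic statements are packed into a four-lane register with the upper lanes unused, rather than giving up on vectorization entirely:

```c
/* Requires SSE (x86): an underfilled vector add covering two statements. */
#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    float a[2], b[2] = {1.0f, 2.0f}, c[2] = {10.0f, 20.0f};

    /* pack the two available operands; upper lanes are don't-care zeros */
    __m128 vb = _mm_setr_ps(b[0], b[1], 0.0f, 0.0f);
    __m128 vc = _mm_setr_ps(c[0], c[1], 0.0f, 0.0f);
    __m128 va = _mm_add_ps(vb, vc);     /* one op covers both statements */

    _mm_store_ss(&a[0], va);                          /* lane 0 */
    _mm_store_ss(&a[1], _mm_shuffle_ps(va, va, 1));   /* lane 1 */

    printf("a = {%g, %g}\n", a[0], a[1]);   /* {11, 22} */
    return 0;
}
```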

7.
We present a novel technique called comparison checking that helps optimizer writers debug optimizers by testing, for given inputs, that the semantics of a program are not changed by the application of optimizations. We have successfully applied comparison checking to a large class of program transformations that alter (1) the relative ordering in which values are computed by the intermediate code statements, (2) the form of the intermediate code statements, and (3) the control flow structure using code replication. We outline the key steps that lead to the automation of comparison checking. The application of comparison checking to test the implementations of high level loop transformations, low level code optimizations, and global register allocation for given program inputs is then described.
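The sketch below is a toy rendition of the comparison-checking idea under our own naming: the unoptimized version logs the values it computes at annotated points, and the transformed version checks that corresponding points yield the same values for the given input. The real system derives the correspondence between points automatically.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_TRACE 256
static long trace[MAX_TRACE];
static int n_logged, n_checked;

static void log_value(long v)   { assert(n_logged < MAX_TRACE); trace[n_logged++] = v; }
static void check_value(long v) { assert(n_checked < n_logged && trace[n_checked++] == v); }

/* Unoptimized: computes multiples of 3 by repeated addition. */
static void original(int n) {
    long s = 0;
    for (int i = 0; i < n; i++) { s += 3; log_value(s); }
}

/* "Optimized": strength-reduced form that must compute the same values. */
static void transformed(int n) {
    for (int i = 1; i <= n; i++) check_value(3L * i);
}

int main(void) {
    original(10);
    transformed(10);
    puts("comparison check passed: semantics preserved for this input");
    return 0;
}
```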

8.
姚远  赵荣彩 《计算机工程》2012,38(12):272-275
Because of the limits of its program analysis, a compiler may fail to vectorize loops automatically, or may vectorize them blindly. To address this, a pragma-directed vectorization method is proposed. By inserting vectorization pragmas into the code, the method guides the processing of the auto-vectorizing compilation tool and automatically generates efficient vector code. Test results show that the method effectively improves the run-time performance of the target code.
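The paper's pragma dialect is its own; as a standard analogue, OpenMP's `#pragma omp simd` (compile with `-fopenmp-simd` in GCC or Clang) tells the compiler a loop is safe to vectorize even when its own analysis cannot prove it:

```c
#include <stdio.h>

/* Without the pragma, the compiler must assume a, b, c may alias. */
void saxpy(float *a, const float *b, const float *c, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 2.0f * c[i];
}

int main(void) {
    float a[8], b[8], c[8];
    for (int i = 0; i < 8; i++) { b[i] = i; c[i] = 1.0f; }
    saxpy(a, b, c, 8);
    printf("a[3] = %g\n", a[3]);   /* 3 + 2*1 = 5 */
    return 0;
}
```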

9.
Rewrite rules with side conditions can elegantly express many classical compiler optimizations for imperative programming languages. In this paper, programs are written in an intermediate language and transformation-enabling side conditions are specified in a temporal logic suitable for describing program data flow. The purpose of this paper is to show how such transformations may be proven correct. Our methodology is illustrated by three familiar optimizations: dead code elimination, constant folding, and code motion. A transformation is correct if whenever it can be applied to a program, the original and transformed programs are semantically equivalent, i.e., they compute the same input-output function. The proofs of semantic equivalence inductively show that a transformation-specific bisimulation relation holds between the original and transformed program computations.
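A hand-worked instance, hedged as our own illustration: constant folding rewrites `x := 2 + 3` to `x := 5` with no side condition, while eliminating a dead assignment carries the temporal side condition that the variable is not used again before being redefined on any path. The before/after programs below exercise both rules and check input-output equivalence on sample inputs:

```c
#include <assert.h>
#include <stdio.h>

static int before(int a) {
    int x = 2 + 3;     /* foldable */
    int dead = a * 7;  /* never used again: DCE side condition holds */
    (void)dead;
    return x + a;
}

static int after(int a) {
    int x = 5;         /* folded */
    /* dead assignment eliminated */
    return x + a;
}

int main(void) {
    for (int a = -5; a <= 5; a++)
        assert(before(a) == after(a));   /* same input-output function */
    puts("before/after agree on all tested inputs");
    return 0;
}
```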

10.
This paper presents an intermediate program representation called the Hierarchical Task Graph (HTG), and argues that it is not only suitable as the basis for program optimization and code generation, but that it fully encapsulates program parallelism at all levels of granularity. As such, the HTG can be used as the basis for a variety of restructuring and optimization techniques, and hence as the target for front-end compilers as well as the input to source and code generators. Our implementation and testing of the HTG in the Parafrase-2 compiler has demonstrated its suitability and versatility as a potentially universal intermediate representation. In addition to encapsulating semantic information and data and control dependences, the HTG provides further information vital to efficient code generation and to optimizations related to parallel code generation. In particular, we introduce the notion of precedence between nodes of the structure, whose grain size can range from atomic operations to entire subprograms. This work was supported in part by the National Science Foundation under Grant No. NSF-CCR-89-57310, the U.S. Department of Energy under Grant No. DOE-DE-FG02-85ER25001, and a grant from Texas Instruments Inc.
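A rough sketch, with our own naming (the actual Parafrase-2 structures are richer), of the information an HTG node might carry: hierarchical children whose grain size ranges from single operations to whole subprograms, plus the precedence edges that expose parallelism between nodes with no path between them:

```c
#include <stdio.h>

typedef enum { HTG_OP, HTG_BLOCK, HTG_LOOP, HTG_SUBPROGRAM } HtgKind;

typedef struct HtgNode {
    const char *name;
    HtgKind kind;
    struct HtgNode *children[4];   /* hierarchy: regions contain regions */
    int n_children;
    struct HtgNode *precedes[4];   /* precedence edges (data/control) */
    int n_precedes;
} HtgNode;

int main(void) {
    HtgNode s1 = {"s1: a = f(x)", HTG_OP, {0}, 0, {0}, 0};
    HtgNode s2 = {"s2: b = g(y)", HTG_OP, {0}, 0, {0}, 0};
    HtgNode s3 = {"s3: c = a + b", HTG_OP, {0}, 0, {0}, 0};
    s1.precedes[s1.n_precedes++] = &s3;   /* s3 needs a */
    s2.precedes[s2.n_precedes++] = &s3;   /* s3 needs b */

    HtgNode block = {"block", HTG_BLOCK, {&s1, &s2, &s3}, 3, {0}, 0};
    /* s1 and s2 have no precedence edge between them, so they may run
       in parallel; s3 must wait for both. */
    printf("%s has %d children\n", block.name, block.n_children);
    return 0;
}
```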

11.
As programs grow in size and complexity, slicing them by directly analyzing source code becomes very difficult. Existing approaches that use compiler optimization techniques to improve program slicing fail to exploit compile-time information and the compiler's optimization machinery effectively, and their language support is incomplete. To address this, this paper analyzes the intermediate representation used by the GCC compiler at compile time and proposes, for the first time, a program slicing technique based on a key-variable data flow analysis algorithm in GCC. Working on the program's GIMPLE intermediate representation, at the granularity of basic blocks, it iteratively solves data flow equations to analyze key-variable data flow information within and across basic blocks. The technique extracts from the source program only the key variables and key statements relevant to a preset target function, reducing program size. Finally, experiments demonstrate the feasibility of the algorithm.
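The toy below, with our own program encoding rather than GCC's GIMPLE, shows the flavor of the iterative solution: starting from the slice criterion, key variables and key statements are propagated backwards until a fixpoint is reached, and unrelated statements drop out of the slice:

```c
#include <stdio.h>

/* variables a,b,c,d encoded as bit positions 0..3 */
typedef struct { int def; unsigned use_mask; const char *text; } Stmt;

int main(void) {
    /* a = input(); b = a + 1; c = 7; d = b * a */
    Stmt prog[] = {
        {0, 0u,              "a = input()"},
        {1, 1u << 0,         "b = a + 1"},
        {2, 0u,              "c = 7"},
        {3, (1u<<1)|(1u<<0), "d = b * a"},
    };
    int n = sizeof prog / sizeof prog[0];
    unsigned key_vars = 1u << 3;    /* slice criterion: d at program end */
    int key_stmt[16] = {0};

    int changed = 1;
    while (changed) {               /* iterate to a fixpoint */
        changed = 0;
        for (int i = n - 1; i >= 0; i--) {
            if ((key_vars & (1u << prog[i].def)) && !key_stmt[i]) {
                key_stmt[i] = 1;
                key_vars |= prog[i].use_mask;   /* its uses become key */
                changed = 1;
            }
        }
    }
    puts("slice (key statements):");
    for (int i = 0; i < n; i++)
        if (key_stmt[i]) printf("  %s\n", prog[i].text);
    return 0;   /* "c = 7" is correctly excluded */
}
```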

12.
This paper describes methods to adapt existing optimizing compilers for sequential languages to produce code for parallel processors. In particular it looks at targeting data-parallel processors using SIMD (single instruction multiple data) or vector processors where users need features similar to high-level control flow across the data-parallelism. The premise of the paper is that we do not want to write an optimizing compiler from scratch. Rather, a method is described that allows a developer to take an existing compiler for a sequential language and modify it to handle SIMD extensions. As well as modifying the front-end, the intermediate representation and the code generation to handle the parallelism, specific optimizations are described to target the architecture efficiently.

13.
The paper presents approaches to the validation of optimizing compilers. The emphasis is on aggressive and architecture-targeted optimizations which try to obtain the highest performance from modern architectures, in particular EPIC-like microprocessors. Rather than verify the compiler, the approach of translation validation performs a validation check after every run of the compiler, producing a formal proof that the produced target code is a correct implementation of the source code.

First we survey the standard approach to validation of optimizations which preserve the loop structure of the code (though they may move code in and out of loops and radically modify individual statements), present a simulation-based general technique for validating such optimizations, and describe a tool, VOC-64, which implements these techniques. For more aggressive optimizations which typically alter the loop structure of the code, such as loop distribution and fusion, loop tiling, and loop interchange, we present a set of permutation rules which establish that the transformed code satisfies all the implied data dependencies necessary for the validity of the considered transformation. We describe the necessary extensions to VOC-64 in order to validate these structure-modifying optimizations.

Finally, the paper discusses preliminary work on run-time validation of speculative loop optimizations, which uses run-time tests to ensure the correctness of loop optimizations whose correctness neither the compiler nor compiler-validation techniques can guarantee. Unlike compiler validation, run-time validation must not only determine when an optimization has generated incorrect code, but also recover from the optimization without aborting the program or producing an incorrect result. This technique has been applied to several loop optimizations, including loop interchange, loop tiling, and software pipelining, and appears to be quite promising.
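The pattern behind run-time validation can be sketched as loop versioning (our own example, not VOC-64): a run-time test guards the speculatively reordered loop, and when the test fails the original loop runs instead, so execution recovers without aborting or producing a wrong result:

```c
#include <stdio.h>

/* Original order: must run i ascending if dst and src overlap. */
static void copy_scaled(double *dst, const double *src, int n) {
    if (dst + n <= src || src + n <= dst) {
        /* run-time test passed: regions disjoint, the reordered
           "optimized" version is safe */
        for (int i = n - 1; i >= 0; i--)
            dst[i] = 2.0 * src[i];
    } else {
        /* recovery: original loop preserves the overlapping semantics */
        for (int i = 0; i < n; i++)
            dst[i] = 2.0 * src[i];
    }
}

int main(void) {
    double a[6] = {1, 2, 3, 4, 5, 6};
    copy_scaled(a + 1, a, 5);        /* overlapping: falls back */
    for (int i = 0; i < 6; i++) printf("%g ", a[i]);
    printf("\n");                    /* 1 2 4 8 16 32 */
    return 0;
}
```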

14.
韩林  徐金龙  李颖颖  王阳 《计算机科学》2017,44(2):70-74, 81
Many loops contain a few statements that cannot be vectorized alongside many that can; loop distribution can usually separate these statements into different loops, enabling partial vectorization of the loop. Mainstream optimizing compilers currently support only simple, aggressive loop distribution, which results in excessive loop overhead after vectorization and hurts register and cache reuse. To address these problems, a loop distribution and aggregation method oriented toward partial vectorization is proposed. First, two key problems of general loop distribution are analyzed: partitioning the statement sets and determining the loop execution order. Second, a condensation-graph node ordering method aimed at maximal aggregation is proposed to guide loop fusion, reducing loop overhead without sacrificing parallelism. Finally, the proposed method is validated experimentally. The results show that, on the test cases, the method generates correct vectorized code and significantly improves the execution efficiency of the vectorized programs.
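A hand-made before/after (not the paper's algorithm) showing why distribution enables partial vectorization: statement S2 carries a recurrence on `b`, so it is split into its own sequential loop, leaving S1 in a loop the compiler can vectorize; the aggregation step would then decide which of the resulting loops are worth fusing back together.

```c
#include <stdio.h>
#define N 8

static void original(double *a, double *b, const double *c) {
    for (int i = 1; i < N; i++) {
        a[i] = c[i] * 2.0;          /* S1: vectorizable */
        b[i] = b[i - 1] + a[i];     /* S2: recurrence on b */
    }
}

static void distributed(double *a, double *b, const double *c) {
    for (int i = 1; i < N; i++)     /* vectorizable loop */
        a[i] = c[i] * 2.0;
    for (int i = 1; i < N; i++)     /* sequential loop */
        b[i] = b[i - 1] + a[i];
}

int main(void) {
    double c[N] = {0, 1, 2, 3, 4, 5, 6, 7};
    double a1[N] = {0}, b1[N] = {0}, a2[N] = {0}, b2[N] = {0};
    original(a1, b1, c);
    distributed(a2, b2, c);
    for (int i = 0; i < N; i++)
        if (a1[i] != a2[i] || b1[i] != b2[i]) { puts("mismatch"); return 1; }
    puts("distribution preserved semantics");
    return 0;
}
```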

15.
This paper performs subword parallelism analysis on serial C programs implementing image processing algorithms and retargets them to general-purpose processors with multimedia extensions and to multimedia-specific embedded microprocessors. The characteristics of image processing algorithms make them inherently parallel, at a granularity between data-level parallelism (DLP) and instruction-level parallelism (ILP) known as subword parallelism. However, current compilation techniques struggle to fully uncover and locate subword parallelism within basic blocks. A compilation method based on a flow-graph program representation is therefore designed that can explicitly locate subword parallelism in serial programs. The compiler is extended with a dedicated pattern library; after pattern-recognition-based control flow and data flow analysis, it produces a Sub-Word Flow Graph (SWFG), which serves as the intermediate representation fed to subword-parallel instruction selection, enabling effective subword-parallel code generation.
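For a concrete instance of subword parallelism (our own example of the kind of basic-block pattern an SWFG is meant to expose), a typical image kernel brightens sixteen 8-bit pixels with one saturating SSE2 instruction:

```c
/* Requires SSE2 (x86). */
#include <stdio.h>
#include <emmintrin.h>

static void brighten(unsigned char *px, int n, unsigned char delta) {
    const __m128i vdelta = _mm_set1_epi8((char)delta);
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(px + i));
        v = _mm_adds_epu8(v, vdelta);          /* 16 saturating byte adds */
        _mm_storeu_si128((__m128i *)(px + i), v);
    }
    for (; i < n; i++)                          /* scalar tail */
        px[i] = (px[i] + delta > 255) ? 255 : (unsigned char)(px[i] + delta);
}

int main(void) {
    unsigned char img[20];
    for (int i = 0; i < 20; i++) img[i] = (unsigned char)(i * 13);
    brighten(img, 20, 40);
    printf("img[0]=%u img[19]=%u\n", img[0], img[19]);
    return 0;
}
```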

16.
Region-based compilation: Introduction, motivation, and initial experience
The most important task of a compiler designed to exploit instruction-level parallelism (ILP) is instruction scheduling. If higher levels of ILP are to be achieved, the compiler must use, as the unit of scheduling, regions consisting of multiple basic blocks—preferably those that frequently execute consecutively, and which capture cycles in the program’s execution. Traditionally, compilers have been built using the function as the unit of compilation. In this framework, function boundaries often act as barriers to the formation of the most suitable scheduling regions. Function inlining may be used to circumvent this problem by assembling strongly coupled functions into the same compilation unit, but at the cost of very large function bodies. Consequently, global optimizations whose compile time and space requirements are superlinear in the size of the compilation unit, may be rendered prohibitively expensive. This paper introduces a new approach, called region-based compilation, wherein the compiler, after inlining, repartitions the program into more desirable compilation units, termed regions. Region-based compilation allows the compiler to control problem size and complexity while exposing inter-procedural scheduling, optimization and code motion opportunities.

17.
Translation validation is an approach for validating the output of optimizing compilers. Rather than verifying the compiler itself, translation validation mandates that every run of the compiler generate a formal proof that the produced target code is a correct implementation of the source code. Speculative loop optimizations are aggressive optimizations which are only correct under certain conditions which cannot be validated at compile time. We propose using an automatic theorem prover together with the translation validation framework to automatically generate run-time tests for such speculative optimizations. This run-time validation approach must not only detect the conditions under which an optimization generates incorrect code, but also provide a way to recover from the optimization without aborting the program or producing an incorrect result. In this paper, we apply the run-time validation technique to a class of speculative reordering transformations and give some initial results of run-time tests generated by the theorem prover CVC.

18.
Pointer alias analysis is a key technique in data flow analysis, and its results underpin compiler optimization and program transformation. Building on research into vectorization methods and dynamic pointer alias analysis, a vectorization-oriented dynamic pointer alias analysis framework is designed. The framework extracts pointer alias information through dynamic instrumentation and trial runs, and feeds it back to the vectorization phase to guide vector code generation. The overall framework is analyzed and studied from three aspects: extracting the candidate alias analysis set, instrumentation and trial runs, and feedback optimization. It is implemented on Open64 and evaluated on the general test suite GCC-VECT and on typical applications. The results show that, compared with static pointer alias analysis, the framework produces more precise alias information, which effectively improves the speedup of vectorized programs.
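A simplified sketch of the dynamic side (our own code, not the Open64-based framework): the candidate loop is instrumented so a trial run records the address range each pointer touches, and the overlap verdict is what would be fed back to the vectorizer:

```c
#include <stdio.h>
#include <stdint.h>

typedef struct { uintptr_t lo, hi; } Range;

static void record(Range *r, const void *p, size_t bytes) {
    uintptr_t a = (uintptr_t)p;
    if (a < r->lo) r->lo = a;
    if (a + bytes > r->hi) r->hi = a + bytes;
}

static int ranges_overlap(Range x, Range y) {
    return x.lo < y.hi && y.lo < x.hi;
}

int main(void) {
    double a[8], b[8];
    Range ra = {UINTPTR_MAX, 0}, rb = {UINTPTR_MAX, 0};

    /* trial run of the candidate loop, with accesses instrumented */
    for (int i = 0; i < 8; i++) {
        record(&ra, &a[i], sizeof a[i]);   /* write to a[i] */
        record(&rb, &b[i], sizeof b[i]);   /* read of b[i] */
        a[i] = b[i] = 0.0;
    }
    printf("a and b %s in this run\n",
           ranges_overlap(ra, rb) ? "alias" : "do not alias");
    return 0;
}
```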

19.
Reference [1] proposed a data fusion optimization between arrays and evaluated its effect on an IA-32 server platform. The tests showed that on IA-32 machines, data fusion, governed by a performance cost model, can substantially improve the cache utilization of applications with non-contiguous data access patterns. How, then, does data fusion fare on the new-generation IA-64 architecture? This paper uses an Intel IA-32 server and an HP Itanium server as platforms, compiles and runs the programs before and after the data fusion transformation with the Intel Fortran compilers ifc and efc and with the free-software compiler g95, and collects execution times and related performance data on both platforms. The results show that source-level data fusion does not cooperate well with the advanced optimizations of the efc compiler on the IA-64 platform: under the O3 optimization switch, the effect of the optimization is negative. This further indicates that advanced compiler optimizations such as data prefetching, loop transformations, and data transformations must be considered jointly with the characteristics of the architecture to achieve a good global optimization result. The paper serves as a starting point for studying the performance portability of IA-32-oriented compiler optimization algorithms on the IA-64 architecture.
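A minimal illustration, under our own naming, of the source-level data fusion being evaluated: two arrays that are always accessed together are fused into one array of pairs, so each cache line brings in both values for an index:

```c
#include <stdio.h>
#define N 8

/* before fusion: x[i] and y[i] live a whole array apart */
static double dot_separate(const double *x, const double *y, int n) {
    double s = 0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

/* after fusion: x and y interleaved, one stream of contiguous pairs */
typedef struct { double x, y; } Pair;
static double dot_fused(const Pair *p, int n) {
    double s = 0;
    for (int i = 0; i < n; i++) s += p[i].x * p[i].y;
    return s;
}

int main(void) {
    double x[N], y[N];
    Pair p[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2; p[i].x = i; p[i].y = 2; }
    printf("%g %g\n", dot_separate(x, y, N), dot_fused(p, N));
    return 0;
}
```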

20.
Tuning compiler optimizations for rapidly evolving hardware makes porting and extending an optimizing compiler for each new platform extremely challenging. Iterative optimization is a popular approach to adapting programs to a new architecture automatically using feedback-directed compilation. However, the large number of evaluations required for each program has prevented iterative compilation from widespread take-up in production compilers. Machine learning has been proposed to tune optimizations across programs systematically, but it is currently limited to a few transformations and long training phases, and it critically lacks publicly released, stable tools. Our approach is to develop a modular, extensible, self-tuning optimization infrastructure to automatically learn the best optimizations across multiple programs and architectures based on the correlation between program features, run-time behavior and optimizations. In this paper we describe Milepost GCC, the first publicly-available open-source machine learning-based compiler. It consists of an Interactive Compilation Interface (ICI) and plugins to extract program features and exchange optimization data with the cTuning.org open public repository. It automatically adapts the internal optimization heuristic at function-level granularity to improve execution time, code size and compilation time of a new program on a given architecture. Part of the MILEPOST technology, together with the low-level ICI-inspired plugin framework, is now included in mainline GCC. We developed machine learning plugins based on probabilistic and transductive approaches to predict good combinations of optimizations. Our preliminary experimental results show that it is possible to automatically reduce the execution time of individual MiBench programs, some by more than a factor of 2, while also improving compilation time and code size. On average we are able to reduce the execution time of the MiBench benchmark suite by 11% for the ARC reconfigurable processor. We also present a realistic multi-objective optimization scenario for the Berkeley DB library using Milepost GCC, improving execution time by approximately 17% while reducing compilation time and code size by 12% and 7% respectively on an Intel Xeon processor.
