期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

J. Ramanujam 《The Journal of supercomputing》1995,9(4):365-389

This paper presents an approach to modeling loop transformations using linear algebra. Compound transformations are modeled as integer matrices. The nonsingular linear transformations presented here subsume the class of unimodular transformations. The loop transformations included are the unimodular transformations-reversal, skewing, and permutation- and a new transformation, namelystretching. Nonunimodular transformations (with determinant 1) create holes in the transformed iteration space, rendering code generation difficult. We solve this problem by suitably changing the step size of loops in order to skip these holes when traversing the transformed iteration space. For the class of nonunimodular loop transformations, we present algorithms for deriving the loop bounds, the array access expressions, and the step sizes of loops in the nest. To derive the step sizes, we compute the Hermite normal form of the transformation matrix; the step sizes are the entries on the diagonal of this matrix. We then use the theory of Hessenberg matrices in the derivation of exact loop bounds for nonunimodular transformations. We illustrate the use of this approach in several problems such as the generation of tile sets and distributed-memory code generation. This approach provides a framework for optimizing programs for a variety of architectures.Supported in part by an NSF Young Investigator Award CCR-9457768, an NSF grant CCR-9210422, and by the Louisiana Board of Regents through contract LEQSF (1991–94)-RD-A-09. 相似文献

2.

Translation and Run-Time Validation of Loop Transformations

Lenore?Zuck Email author Amir?Pnueli Benjamin?Goldberg Clark?Barrett Yi?Fang Ying?Hu 《Formal Methods in System Design》2005,27(3):335-360

This paper presents new approaches to the validation of loop optimizations that compilers use to obtain the highest performance from modern architectures. Rather than verify the compiler, the approach of translation validationperforms a validation check after every run of the compiler, producing a formal proof that the produced target code is a correct implementation of the source code. As part of an active and ongoing research project on translation validation, we have previously described approaches for validating optimizations that preserve the loop structure of the code and have presented a simulation-based general technique for validating such optimizations. In this paper, for more aggressive optimizations that alter the loop structure of the code—such as distribution, fusion, tiling, and interchange—we present a set of permutation ruleswhich establish that the transformed code satisfies all the implied data dependencies necessary for the validity of the considered transformation. We describe the extensions to our tool voc-64 which are required to validate these structure-modifying optimizations. This paper also discusses preliminary work on run-time validation of speculative loop optimizations. This involves using run-time tests to ensure the correctness of loop optimizations whose correctness cannot be guaranteed at compile time. Unlike compiler validation, run-time validation must not only determine when an optimization has generated incorrect code, but also recover from the optimization without aborting the program or producing an incorrect result. This technique has been applied to several loop optimizations, including loop interchange and loop tiling, and appears to be quite promising. This research was supported in part by NSF grant CCR-0098299, ONR grant N00014-99-1-0131, and the John von Neumann Minerva Center for Verification of Reactive Systems. 相似文献

3.

Unimodular transformations of non-perfectly nested loops

《Parallel Computing》1997,22(12):1621-1645

A framework is described in which a class of imperfectly nested loops can be restructured using unimodular transformations. In this framework, an imperfect loop nest is converted to a perfect loop nest using Abu-Sufah's Non-Basic-to-Basic-Loop transformation. Conditions for the legality of this transformation and techniques for their verification are discussed. An iteration space, which extends the usual concept so as to represent explicitly the executions of individual statements, is proposed to model the converted loop nest. Since the converted loop nest is a perfect loop nest, data dependences can be extracted and optimal transformations can be selected for parallelism and/or locality in the normal manner. To generate the restructured code for a unimodular transformation, a code generation method is provided that produces the restructured code that is free of if statements by construction. 相似文献

4.

Optimized Unrolling of Nested Loops

Vivek Sarkar 《International journal of parallel programming》2001,29(5):545-581

Loop unrolling is a well known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include (i) a more detailed cost model that includes register locality, instruction-level parallelism and instruction-cache considerations; (ii) a new code generation algorithm that generates more compact code than the unroll-and-jam transformation; and (iii) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2× speedup on matrix multiply, and an average 1.08× speedup on seven of the SPEC95fp benchmarks (with a 1.2× speedup for two benchmarks). Larger performance improvements can be expected on processors that have larger numbers of registers and larger degrees of instruction-level parallelism than the processor used for our measurements (PowerPC 604). 相似文献

5.

Iterative type analysis and extended message splitting: Optimizing dynamically-typed object-oriented programs

Craig Chambers David Ungar 《LISP and Symbolic Computation》1991,4(3):283-310

Object-oriented languages have suffered from poor performance caused by frequent and slow dynamically-bound procedure calls. The best way to speed up a procedure call is to compile it out, but dynamic binding of object-oriented procedure calls without static receiver type information precludes inlining.Iterative type analysis andextended message splitting are new compilation techniques that extract much of the necessary type information and make it possible to hoist run-time type tests out of loops.Our system compiles code on-the-fly that is customized to the actual data types used by a running program. The compiler constructs a control flow graph annotated with type information by simultaneously performing type analysis and inlining. Extended message splitting preserves type information that would otherwise be lost by a control-flow merge by duplicating all the code between the merge and the place that uses the information. Iterative type analysis computes the types of variables used in a loop by repeatedly recompiling the loop until the computed types reach a fix-point. Together these two techniques enable our SELF compiler to split off a copy of an entire loop, optimized for the common-case types.By the time our SELF compiler generates code for the graph, it has eliminated many dynamically-dispatched procedure calls and type tests. The resulting machine code is twice as fast as that generated by the previous SELF compiler, four times faster than ParcPlace Systems Smalltalk-80, the fastest commercially available dynamically-typed object-oriented language implementation, and nearly half the speed of optimized C. Iterative type analysis and extended message splitting have cut the performance penalty for dynamically-typed object-oriented languages in half.This work has been generously supported by National Science Foundation Presidential Young Investigator Grant #CCR-8657631, and by Sun Microsystems, IBM, Apple Computer, Tandem Computers, NCR, Texas Instruments, the Powell Foundation, and DEC.This paper was originally published in theProceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation (SIGPLAN Notices, 25, 6 (1990) 150–160). 相似文献

6.

An efficient code generation technique for tiled iteration spaces

Goumas G. Athanasaki M. Koziris N. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(10):1021-1034

This paper presents a novel approach for the problem of generating tiled code for nested for-loops, transformed by a tiling transformation. Tiling or supernode transformation has been widely used to improve locality in multilevel memory hierarchies, as well as to efficiently execute loops onto parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler work, especially when nonrectangular tile shapes and iteration space bounds are concerned. Our method considerably enhances previous work on rewriting tiled loops, by considering parallelepiped tiles and arbitrary iteration space shapes. In order to generate tiled code, we first enumerate all tiles containing points within the iteration space and, second, sweep all points within each tile. For the first subproblem, we refine upon previous results concerning the computation of new loop bounds of an iteration space that has been transformed by a nonunimodular transformation. For the second subproblem, we transform the initial parallelepiped tile into a rectangular one, in order to generate efficient code with the aid of a nonunimodular transformation matrix and its Hermite Normal Form (HNF). Experimental results show that the proposed method significantly accelerates the compilation process and generates much more efficient code. 相似文献

7.

Robust hierarchical image representation using non-negative matrix factorisation with sparse code shrinkage preprocessing

B.?Szatmáry Email author G.?Szirtes A.?L?rincz J.?Eggert E.?K?rner 《Pattern Analysis & Applications》2003,6(3):194-200

When analysing patterns, our goals are (i) to find structure in the presence of noise, (ii) to decompose the observed structure into sub-components, and (iii) to use the components for pattern completion. Here, a novel loop architecture is introduced to perform these tasks in an unsupervised manner. The architecture combines sparse code shrinkage with non-negative matrix factorisation, and blends their favourable properties: sparse code shrinkage aims to remove Gaussian noise in a robust fashion; non-negative matrix factorisation extracts substructures from the noise filtered inputs. The loop architecture performs robust pattern completion when organised into a two-layered hierarchy. We demonstrate the power of the proposed architecture on the so-called bar-problem and on the FERET facial database. 相似文献

8.

EIGENVECTORS-BASED PARALLELISATION OF NESTED LOOPS WITH AFFINE DEPENDENCES

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(3):227-248

Abstract

This paper presents a method for parallelising nested loops with affine dependences. The data dependences of a program are represented exactly using a dependence matrix rather than an imprecise dependence abstraction. By a careful analysis of the eigenvectors and eigenvalues of the dependence matrix, we detect the parallelism inherent in the program, partition the iteration space of the program into sequential and parallel regions, and generate parallel code to execute these regions. For a class of programs considered in the paper, the proposed method can expose more coarse-grain and fine-grain parallelism than a hyperplane-based loop transformation. 相似文献

9.

Non‐Local Image Inpainting Using Low‐Rank Matrix Completion

Wei Li Lei Zhao Zhijie Lin Duanqing Xu Dongming Lu 《Computer Graphics Forum》2015,34(6):111-122

In this paper, we propose a highly accurate inpainting algorithm which reconstructs an image from a fraction of its pixels. Our algorithm is inspired by the recent progress of non‐local image processing techniques following the idea of ‘grouping and collaborative filtering’. In our framework, we first match and group similar patches in the input image, and then convert the problem of estimating missing values for the stack of matched patches to the problem of low‐rank matrix completion, and finally obtain the result by synthesizing all the restored patches. In our algorithm, how to accurately perform patch matching process and solve the low‐rank matrix completion problem are key points. For the first problem, we propose a robust patch matching approach, and for the second task, the alternating direction method of multipliers is employed. Experiments show that our algorithm has superior advantages over existing inpainting techniques. Besides, our algorithm can be easily extended to handle practical applications including rendering acceleration, photo restoration and object removal. 相似文献

10.

含有析取语义循环的不变式生成改进方法

潘建东陈立前黄达明孙浩曾庆凯《软件学报》2016,27(7):1741-1756

抽象解释为程序不变式的自动化生成提供了通用的框架,但是该框架下的大多数已有数值抽象域只能表达几何上是凸的约束集.因此,对于包含(所对应的约束集是非凸的)析取语义的特殊程序结构,采用传统数值抽象域会导致分析结果不精确.针对显式和隐式含有析取语义的循环结构,提出了基于循环分解和归纳推理的不变式生成改进方法,缓解了抽象解释分析中出现的语义损失问题.实验结果表明:相比已有方法,该方法能为这种包含析取语义的循环结构生成更加精确的不变式,并且有益于一些安全性质的推理. 相似文献

11.

Run-Time Validation of Speculative Optimizations using CVC.

Clark Barrett Benjamin Goldberg Lenore Zuck 《Electronic Notes in Theoretical Computer Science》2003,89(2):89

Translation validation is an approach for validating the output of optimizing compilers. Rather than verifying the compiler itself, translation validation mandates that every run of the compiler generate a formal proof that the produced target code is a correct implementation of the source code. Speculative loop optimizations are aggressive optimizations which are only correct under certain conditions which cannot be validated at compile time. We propose using an automatic theorem prover together with the translation validation framework to automatically generate run-time tests for such speculative optimizations. This run-time validation approach must not only detect the conditions under which an optimization generates incorrect code, but also provide a way to recover from the optimization without aborting the program or producing an incorrect result. In this paper, we apply the run-time validation technique to a class of speculative reordering transformations and give some initial results of run-time tests generated by the theorem prover CVC. 相似文献

12.

Validation of GCC optimizers through trace generation

Aditya Kanade Amitabha Sanyal Uday P. Khedker 《Software》2009,39(6):611-639

The translation validation approach involves establishing semantics preservation of individual compilations. In this paper, we present a novel framework for translation validation of optimizers. We identify a comprehensive set of primitive program transformations that are commonly used in many optimizations. For each primitive, we define soundness conditions that guarantee that the transformation is semantics preserving. This framework of transformations and soundness conditions is independent of any particular compiler implementation and is formalized in PVS. An optimizer is instrumented to generate the trace of an optimization run in terms of the predefined transformation primitives. The validation succeeds if (1) the trace conforms to the optimization and (2) the soundness conditions of the individual transformations in the trace are satisfied. The first step eliminates the need to trust the instrumentation. The soundness conditions are defined in a temporal logic and therefore the second step involves model checking. Thus the scheme is completely automatable. We have applied this approach to several intraprocedural optimizations of RTL intermediate code in GNU Compiler Collection (GCC) v4.1.0, namely, loop invariant code motion, partial redundancy elimination, lazy code motion, code hoisting, and copy and constant propagation for sample programs written in a subset of the C language. The validation does not require information about program analyses performed by GCC. Therefore even though the GCC code base is quite large and complex, instrumentation could be achieved easily. The framework requires an estimated 21 lines of instrumentation code and 140 lines of PVS specifications for every 1000 lines of the GCC code considered for validation. Copyright © 2009 John Wiley & Sons, Ltd. 相似文献

13.

Translation and Run-Time Validation of Optimized Code

Lenore Zuck Amir Pnueli Yi Fang Benjamin Goldberg Ying Hu 《Electronic Notes in Theoretical Computer Science》2002,70(4):179-200

The paper presents approaches to the validation of optimizing compilers. The emphasis is on aggressive and architecture-targeted optimizations which try to obtain the highest performance from modern architectures, in particular EPIC-like micro-processors. Rather than verify the compiler, the approach of translation validation performs a validation check after every run of the compiler, producing a formal proof that the produced target code is a correct implementation of the source code.First we survey the standard approach to validation of optimizations which preserve the loop structure of the code (though they may move code in and out of loops and radically modify individual statements), present a simulation-based general technique for validating such optimizations, and describe a tool, VOC-64, which implements these technique. For more aggressive optimizations which, typically, alter the loop structure of the code, such as loop distribution and fusion, loop tiling, and loop interchanges, we present a set of permutation rules which establish that the transformed code satisfies all the implied data dependencies necessary for the validity of the considered transformation. We describe the necessary extensions to the VOC-64 in order to validate these structure-modifying optimizations.Finally, the paper discusses preliminary work on run-time validation of speculative loop optimizations, that involves using run-time tests to ensure the correctness of loop optimizations which neither the compiler nor compiler-validation techniques can guarantee the correctness of. Unlike compiler validation, run-time validation has not only the task of determining when an optimization has generated incorrect code, but also has the task of recovering from the optimization without aborting the program or producing an incorrect result. This technique has been applied to several loop optimizations, including loop interchange, loop tiling, and software pipelining and appears to be quite promising. 相似文献

14.

Efficient Craig interpolation for linear Diophantine (dis)equations and linear modular equations

Himanshu Jain Edmund M. Clarke Orna Grumberg 《Formal Methods in System Design》2009,35(1):6-39

The use of Craig interpolants has enabled the development of powerful hardware and software model checking techniques. Efficient algorithms are known for computing interpolants in rational and real linear arithmetic. We focus on subsets of integer linear arithmetic. Our main results are polynomial time algorithms for obtaining interpolants for conjunctions of linear Diophantine equations, linear modular equations (linear congruences), and linear Diophantine disequations. We also present an interpolation result for conjunctions of mixed integer linear equations. We show the utility of the proposed interpolation algorithms for discovering modular/divisibility predicates in a counterexample guided abstraction refinement (CEGAR) framework. This has enabled verification of simple programs that cannot be checked using existing CEGAR based model checkers. This paper is an extended version of [14]. This research was sponsored by the Gigascale Systems Research Center (GSRC), Semiconductor Research Corporation (SRC), the National Science Foundation (NSF), the Office of Naval Research (ONR), the Naval Research Laboratory (NRL), the Defense Advanced Research Projects Agency (DARPA), the Army Research Office (ARO), and the General Motors Collaborative Research Lab at CMU. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of GSRC, SRC, NSF, ONR, NRL, DARPA, ARO, GM, or the U.S. government. 相似文献

15.

The BonaFide C Analyzer: automatic loop-level characterization and coverage measurement

Sergio Aldea Diego R. Llanos Arturo Gonzalez-Escribano 《The Journal of supercomputing》2014,68(3):1378-1401

The advent of multicore technologies has increased the interest in parallelization techniques for existing sequential applications. These techniques include the need of detecting loops that are good candidates for parallelization, and classifying all variables of these loops according to their use, a task surprisingly hard to be carried out manually. In this paper, we introduce the BonaFide C Analyzer, an XML-based framework that combines static analysis of source code with profiling information to generate complete reports regarding all loops in a C application, including loop coverage, loop suitability for parallelization, a classification of all variables inside loops based on their accesses, and other hurdles that restrict the parallelization. This information allows to analyze how particular language constructs are used in real-world applications, and helps the programmer to parallelize the code. To show the features of the framework, we present the results of an in-depth loop characterization of C applications that are part of the SPEC CPU2006 benchmark suite. Our study shows that 47.72 % of loops present in the applications analyzed are potentially parallelizable with existent parallel programming models such as OpenMP, while an additional 37.7 % of loops could be run in parallel with the help of runtime speculative parallelization techniques. 相似文献

16.

Fine-grain compilation for pipelined machines

Alexandru Nicolau Keshav Pingali Alexander Aiken 《The Journal of supercomputing》1988,2(3):279-295

Computer architecture design requires careful attention to the balance between the complexity of code scheduling problems and the cost and feasibility of building a machine. In this paper, we show that recently developed software pipelining algorithms produce optimal or near-optimal code for a large class of loops when the target architecture is a clean pipelined parallel machine. The important feature of these machines is the absence of structural hazards. We argue that the robustness of the scheduling algorithms and relatively simple hardware make these machines realistic and cost-effective. To illustrate the delicate balance between architecture and scheduling complexity, we show that scheduling with structural hazards is NP-hard, and that there are machines with simple structural hazards for which vectorization and the software pipelining techniques generate poor code.Supported in part by NSF Grants DCR-8502884, CCR-8704367, ONR Grant N00014-86-K-0215, and the Cornell NSF Supercomputing Center.Supported by NSF Grant CCR-8702668 and an IBM Faculty Development Award. 相似文献

17.

Fixing Letrec: A Faithful Yet Efficient Implementation of Scheme's Recursive Binding Construct

Oscar Waddell Dipanwita Sarkar R. Kent Dybvig 《Higher-Order and Symbolic Computation》2005,18(3-4):299-326

A Scheme letrec expression is easily converted into more primitive constructs via a straightforward transformation given in the Revised⁵ Report. This transformation, unfortunately, introduces assignments that can impede the generation of efficient code. This article presents a more judicious transformation that preserves the semantics of the revised report transformation and also detects invalid references and assignments to left-hand-side variables, yet enables the compiler to generate efficient code. A variant of letrec that enforces left-to-right evaluation of bindings is also presented and shown to add virtually no overhead. A preliminary version of this article was presented at the 2002 Workshop on Scheme and Functional Programming [15]. 相似文献

18.

基于链码的分水岭变换算法 总被引：11，自引：1，他引：11

孙涵任明武《中国图象图形学报》2004,9(9):1025-1031

为了快速准确地进行图像分割，通过对现有分水岭变换算法的分析，并借鉴图像处理中常用的链码思想，提出了基于链码的分水岭变换算法，并首先扩展了传统链码的定义，将其分为指出链码和指入链码；然后提出并阐述了利用链码实现分水岭变换的两个性质；最后给出了基于链码的分水岭变换算法的具体描述，并详细分析了新算法的时间和空间复杂度。实验结果表明，新算法具有较低的时间和空间复杂度，且变换结果更有利于后续的图像理解。相似文献

19.

Using affine closure to find legal reordering transformations

Wayne Kelly William Pugh 《International journal of parallel programming》1995,23(4):303-325

We present a unified framework for applying iteration reordering transformations. This framework is able to represent traditional transformations such as loop interchange, loop skewing and loop distribution as well as compositions of these transformations. Using a unified framework rather than a sequence of ad-hoc transformations makes it easier to analyze and predict the effects of these transformations. Our framework is based on the idea that all reordering transformations can be represented as a mapping from the original iteration space to a new iteration space. An optimizing compiler would use our framework by finding a mapping that both corresponds to a legal transformation and produces efficient code. We present the mapping selection problem as a search problem by decomposing it into a sequence of smaller choices. We then characterize the set of all legal mappings by defining a search tree. As part of this process we use a new operation called affine closure. 相似文献

20.

Precise Data Locality Optimization of Nested Loops

Vincent Loechner Benoît Meister Philippe Clauss 《The Journal of supercomputing》2002,21(1):37-76

A significant source for enhancing application performance and for reducing power consumption in embedded processor applications is to improve the usage of the memory hierarchy. In this paper, a temporal and spatial locality optimization framework of nested loops is proposed, driven by parameterized cost functions. The considered loops can be imperfectly nested. New data layouts are propagated through the connected references and through the loop nests as constraints for optimizing the next connected reference in the same nest or in the other ones. Unlike many existing methods, special attention is paid to TLB (Translation Lookaside Buffer) effectiveness since TLB misses can take from tens to hundreds of processor cycles. Our approach only considers active data, that is, array elements that are actually accessed by a loop, in order to prevent useless memory loads and take advantage of storage compression and temporal locality. Moreover, the same data transformation is not necessarily applied to a whole array. Depending on the referenced data subsets, the transformation can result in different data layouts for a same array. This can significantly improve the performance since a priori incompatible references can be simultaneously optimized. Finally, the process does not only consider the innermost loop level but all levels. Hence, large strides when control returns to the enclosing loop are avoided in several cases, and better optimization is provided in the case of a small index range of the innermost loop. 相似文献