期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Hardware-Based Profiling: An Effective Technique for Profile-Driven Optimization

Thomas M. Conte Burzin A. Patel Kishore N. Menezes J. Stan Cox 《International journal of parallel programming》1996,24(2):187-206

Profile-based optimization can be used for instruction scheduling, loop scheduling, data preloading, function in-lining, and instruction cache performance enhancement. However, these techniques have not been embraced by software vendors because programs instrumented for profiling run significantly slower, an awkward compile-run-recompile sequence is required, and a test input suite must be collected and validated for each program. This paper introduces hardware-based profiling that uses traditional branch handling hardware to generate profile information in real time. Techniques are presented for both one-level and two-level branch hardware organizations. The approach produces high accuracy with small slowdown in execution (0.4%–4.6%). This allows a program to be profiled while it is used, eliminating the need for a test input suite. With contemporary processors driven increasingly by compiler support, hardware-based profiling is important for high-performance systems. 相似文献

2.

OpenMP Implementation of SPICE3 Circuit Simulator

Tien-Hsiung Weng Ruey-Kuen Perng Barbara Chapman 《International journal of parallel programming》2007,35(5):493-505

In this paper, we describe our experience of creating an OpenMP implementation of the SPICE3 circuit simulator program. Given the irregular patterns of access to dynamic data structures in the SPICE code, a parallelization using current standard OpenMP directives is impossible without major rewriting of the original program. The aim of this work is to present a case study showing the development of a shared memory parallel code with minimum effort. We present two implementations, one with minimal code modification and one without modification to the original SPICE3 program using Intel’s taskq construct. We also discuss the results of the case study in terms of what future compiler tools may be needed to help OpenMP application developers with similar porting goals. Our experiments using SPICE3, based on SRAM model simulation, were compiled by the SUN compiler running on a SunFire V880 UltraSPARC-III 750 MHz and by the Intel icc compiler running on both an IBM Itanium with four CPUs and Intel Xeon of two processors machines. The results are promising. 相似文献

3.

Profile-assisted instruction scheduling

William Y. Chen Scott A. Mahlke Nancy J. Warter Sadun Anik Wen-Mei W. Hwu 《International journal of parallel programming》1994,22(2):151-181

Instruction schedulers for superscalar and VLIW processors must expose sufficient instruction-level parallelism to the hardware in order to achieve high performance. Traditional compiler instruction scheduling techniques typically take into account the constraints imposed by all execution scenarios in the program. However, there are additional opportunities to increase instruction-level parallelism for the frequent execution scenarios at the expense of the less freuent ones. Profile information identifies these important execution scenarios in a program. In this paper, two major categories of profile information are studied: control-flow and memory-dependence. Profile-assisted code scheduling techniques have been incorporated into the IMPACT-I compiler. These techniques are acyclic global scheduling and software pipelining. This paper describes the scheduling algorithms, highlights the modifications required to use profile information, and explains the hardware and compiler support for dealing with hazards that arise from aggressive use of profile information. The effectiveness of these profile-based scheduling techniques is evaluated for a range of superscalar and VLIW processors. 相似文献

4.

Compiler Controlled Prefetching for Multiprocessors Using Low-Overhead Traps and Prefetch Engines

《Journal of Parallel and Distributed Computing》2000,60(5):585-615

In this paper we propose and evaluate a new data-prefetching technique for cache coherent multiprocessors. Prefetches are issued by a functional unit called a prefetch engine which is controlled by the compiler. We let second-level cache misses generate cache miss traps and start the prefetch engine in a trap handler. The trap handler is fast (40–50 cycles) and does not normally delay the program beyond the memory latency of the miss. Once started, the prefetch engine executes on its own and causes no instruction overhead. The only instruction overhead in our approach is when a trap handler completes after data arrives. The advantages of this technique are (1) it exploits static compiler analysis to determine what to prefetch, which is hard to do in hardware, (2) it uses prefetching with very little instruction overhead, which is a limitation for traditional software-controlled prefetching, and (3) it is accurate in the sense that it generates very little useless traffic while maintaining a high prefetching coverage. We also study whether one could emulate the prefetch engine in software, which would not require any additional hardware beyond support for generating cache miss traps and ordinary prefetch instructions. In this paper we present the functionality of the prefetch engine and a compiler algorithm to control it. We evaluate our technique on six parallel scientific and engineering applications using an optimizing compiler with our algorithm and a simulated multiprocessor. We find that the prefetch engine removes up to 67% of the memory access stall time at an instruction overhead less than 0.42%. The emulated prefetch engine removes in general less stall time at a higher instruction overhead. 相似文献

5.

Construction of speculative optimization algorithms

A. A. Belevantsev S. S. Gaisaryan V. P. Ivannikov 《Programming and Computer Software》2008,34(3):138-153

In modern processors, instructions to perform operations are often produced before it becomes known that this is required. Such an expedient, which is called speculative execution, helps to reveal parallelism at the instruction level. In the EPIC architectures, the speculative execution is completely controlled by the compiler, which makes it possible to avoid using complex hardware mechanisms for supporting speculative instruction production. Moreover, the idea of the speculative execution can be used by the compiler in machine-independent optimizations. The paper describes a scheme of construction of the speculative optimization that is based on the selection of properties of the control flow and data flow that are important from the optimization standpoint and on the estimation of the probabilities of their fulfillment. The probabilities found are used for searching and constructing advantageous speculative and bookkeeping transformations. For optimizations that include only speculative movements of instructions upwards along the control flow graph, on the basis of the suggested scheme, a method has been developed that includes algorithms for finding probabilities of data and control dependences, for estimating benefit of speculative movements, and for constructing a recovery code. On the basis of this method, an algorithm for the speculative scheduling of instructions for the Intel Itanium architecture has been developed and implemented. Specific features of its implementation and experimental results are described. 相似文献

6.

CoDBT: A multi-source dynamic binary translator using hardware–software collaborative techniques

Haibing Guan Bo Liu Zhengwei Qi Yindong Yang Hongbo Yang Alei Liang 《Journal of Systems Architecture》2010,56(10):500-508

For implementing a dynamic binary translation system, traditional software-based solutions suffer from significant runtime overhead and are not suitable for extra complex optimization. This paper proposes using hardware–software collaboration techniques to create an high efficient dynamic binary translation system, CoDBT, which emulates several heterogeneous ISAs (Instruction Set Architectures) on a host processor without changing to the existing processor. We analyze the major performance bottlenecks via evaluating overhead of a pure software-solution DBT. Guidelines are provided for applying a suitable hardware–software partition process to CoDBT, as are algorithms for designing hardware-based binary translator and code cache management. An intermediate instruction set is introduced to make multi-source translation more practicable and scalable. Meantime, a novel runtime profiling strategy is integrated into the infrastructure to collect program hot spots information to supporting potential future optimizations. The advantages of using co-design as an implementation approach for DBT system are assessed by several SPEC benchmarks. Our results demonstrate that significant performance improvements can be achieved with appropriate hardware support choices. CoDBT could be an efficient and cost-effective solution for situations where the usual methods of performance acceleration for dynamic binary translation are inappropriate. 相似文献

7.

配置流驱动计算体系结构指导下的ASIP设计 总被引：1，自引：0，他引：1

李勇王志英赵学秘岳虹《计算机研究与发展》2007,44(4):714-721

为了兼顾嵌入式处理器设计中的灵活性与高效性,提出配置流驱动计算体系结构.在体系结构设计中将软/硬件界面下移,使功能单元之间的互连网络对编译器可见,并由编译器来完成传输路由,从而支持复杂但更为高效的互连网络.在该体系结构指导下,提出一种支持段式可重构互连网络的专用指令集处理器(ASIP)设计方法.该方法应用到密码领域的3类ASIP设计中表明,与简单总线互连相比,在不影响性能的前提下,可平均节约53%的互连功耗和38.7%的总线数量,从而达到减少总线数量、降低互连功耗的目的. 相似文献

8.

非线性规律访存操作的数据预取技术

吴佳骏冯晓兵张兆庆《计算机研究与发展》2007,44(2):355-360

编译器在静态分析方式下很难对程序的非线性规律访存操作进行正确的数据预取 .但采用profiling技术可以得到程序运行时候的访存规律,利用这些信息可以精确地插入数据预取指令 .基于stride profiling技术,提出了新的信息收集类型stride iterative,更精确地反映程序执行时访存指令的实际行为,并结合别名分析的结果调整对同一cache行的数据预取,得到比普通数据预取更好的预取性能 .安腾2上运行CPU2000的12个整型测试例子平均有8.54%的性能提升,其中mcf性能提升达到了77.87%. 相似文献

9.

编译器中的edge profiling设计和实现 总被引：4，自引：0，他引：4

董希谦张兆庆《计算机科学》2003,30(1):46-48

1.简介这篇论文是在从事ORC(Open Research Compiler)编译器的过程中完成的。ORC是中科院计算所同Intel公司的合作项目,为Intel的新一代64位芯片开发开放源代码的先进编译器。ORC的实现目标:为研究机构和团体提供一个可靠稳定的、开放源代码的编译基础结构,方便其在ORC的基础上进行编译器和体系结构的研究;与当前其他IA-64编译器相比,要有比较高的性能。代码优化的一个重要目标是为了减少程序的执行时间。相似文献

10.

面向IXP网络处理器的内联优化 总被引：1，自引：0，他引：1

汤伟吴承勇张兆庆《计算机科学》2006,33(4):250-252

内联优化是一种有效的编译优化技术，它通过将函数体直接嵌入到调用点来消除函数调用开销。然而，网络处理器特殊的体系结构对内联优化提出了新的要求，需要新的技术辅助传统内联优化来更好地适应这种特殊的体系结构。本文描述了如何利用关键路径提取技术和迭代编译技术对传统内联优化技术进行扩充和改造，来更好地适应IXP体系结构。实验数据表明，改进后的内联优化能够有效地提高网络系统的性能。相似文献

11.

基于值剖视的编译优化

下载免费PDF全文

孔凡金黄春《计算机工程》2011,37(6):58-60

介绍在GCC编译器中利用值剖视识别收集变量的不变特征信息并指导代码优化工作的方法。NPB基准测试程序的测试结果表明，GCC基于值剖视的优化引入的开销小，与边剖视一起使用时能获得较好的优化效果，在不同程序间显示出一定的优化针对性和局限性，值剖视信息的类型与数量、优化种类等存在较大的改进空间。相似文献

12.

指导cache静态划分的程序性能profiling优化技术

贾耀仓武成岗张兆庆《计算机研究与发展》2012,49(1):93-102

对于共享cache的多核处理器,如何管理好各个核对cache的利用,对于充分发挥多核处理器性能是很关键的问题.目前采用的cache替换方法程序间会出现性能干扰,cache静态划分技术则是通过为同时运行的程序分配不同的空间来解决性能干扰问题.为了给程序分配合适大小的cache空间,需要对程序进行性能profiling,即事先多遍运行收集程序在各种cache容量下的性能数据,这种性能profiling方法开销巨大,影响实用.为了解决性能profiling需要多遍运行程序的问题,提出了只需单遍运行的程序性能profiling优化技术.该技术利用在线的phase分析技术识别程序的运行阶段,避免对相同阶段的重复profiling;同时分析程序各phase的性能同cache容量变化的关系趋势,对于性能不敏感的容量变化则不进行profiling,降低开销.在程序运行结束后通过程序各phase在cache各种容量下的性能来估计程序在各容量下的整体性能,以指导cache静态划分.实验表明,该技术的开销仅为7%,而该方法指导的cache划分比未划分时有8%的性能改进,同多遍运行的程序性能profiling指导的cache划分性能相比仅有1%的下降. 相似文献

13.

保证Java精确异常的软件流水线技术 总被引：1，自引：0，他引：1

倪奇智张为华臧斌宇朱传琪《计算机应用与软件》2008,25(2):21-23

Java对精确异常的支持严重限制了JIT编译器的动态优化的能力.目前已经有不少在精确异常存在下的优化技术,但它们都是针对代码块内部顺序指令的调度算法,依然没有在软件流水线这样循环级别做带精确异常的优化的算法.针对存在精确异常要求的Java程序,提出了一种软件流水线的算法,并以安腾作为底层平台对该算法进行了测试,实验结果显示该算法在保证Java精确异常要求的情况下能够大幅度提高Java程序的性能. 相似文献

14.

动态二进制翻译中动态优化的成本与收益分析

孙光辉王丽娟《计算机时代》2010,(2):4-5

传统的静态编译器优化存在着各种限制,为此,提出了一种运行期动态优化的对策。在程序的执行过程中,持续检测程序运行的profile信息,并根据这些信息对程序代码进行优化变换,创建并运行程序代码的优化版本。这种运行期动态优化操作是直接针对程序的二进制代码的,不针对程序语言或编译器。这不仅带来优化的透明性,还使得老版本的源代码即遗留代码也可以从优化技术中获得性能提升。相似文献

15.

Runtime scheduling of dynamic parallelism on accelerator-based multi-core systems

《Parallel Computing》2007,33(10-11):700-719

We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. We investigate user-level schedulers that dynamically “rightsize” the dimensions and degrees of parallelism on the cell broadband engine. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. We evaluate recently introduced schedulers for event-driven execution and utilization-driven dynamic multi-grain parallelization on Cell. We also present a new scheduling scheme for dynamic multi-grain parallelism, S-MGPS, which uses sampling of dominant execution phases to converge to the optimal scheduling algorithm. We evaluate S-MGPS on an IBM Cell BladeCenter with two realistic bioinformatics applications that infer large phylogenies. S-MGPS performs within 2–10% of the optimal scheduling algorithm in these applications, while exhibiting low overhead and little sensitivity to application-dependent parameters. 相似文献

16.

A framework for exploiting task and data parallelism on distributedmemory multicomputers

Ramaswamy S. Sapatnekar S. Banerjee P. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(11):1098-1116

Distributed Memory Multicomputers (DMMs), such as the IBM SP-2, the Intel Paragon, and the Thinking Machines CM-5, offer significant advantages over shared memory multiprocessors in terms of cost and scalability. Unfortunately, the utilization of all the available computational power in these machines involves a tremendous programming effort on the part of users, which creates a need for sophisticated compiler and run-time support for distributed memory machines. In this paper, we explore a new compiler optimization for regular scientific applications-the simultaneous exploitation of task and data parallelism. Our optimization is implemented as part of the PARADIGM HPF compiler framework we have developed. The intuitive idea behind the optimization is the use of task parallelism to control the degree of data parallelism of individual tasks. The reason this provides increased performance is that data parallelism provides diminishing returns as the number of processors used is increased. By controlling the number of processors used for each data parallel task in an application and by concurrently executing these tasks, we make program execution more efficient and, therefore, faster 相似文献

17.

Rewriting executable files to measure program behavior

James R. Larus Thomas Ball 《Software》1994,24(2):197-218

Inserting instrumentation code in a program is an effective technique for detecting, recording, and measuring many aspects of a program's performance. Instrumentation code can be added at any stage of the compilation process by specially-modified system tools such as a compiler or linker or by new tools from a measurement system. For several reasons, adding instrumentation code after the compilation process—by rewriting the executable file—presents fewer complications and leads to more complete measurements. This paper describes the difficulties in adding code to executable files that arose in developing the profiling and tracing tools qp and qpt. The techniques used by these tools to instrument programs on MIPS and SPARC processors are applicable in other instrumentation systems running on many processors and operating systems. In addition, many difficulties could have been avoided with minor changes to compilers and executable file formats. These changes would simplify this approach to measuring program performance and make it more generally useful. 相似文献

18.

反投机技术研究

孙维新赵荣彩苏铭齐宁《计算机工程与应用》2007,43(1):74-78

投机优化技术作为一种先进的现代编译技术,有效地提高了指令执行的并行性。然而,在逆向工程中,有时要实现代码的跨平台移植,而投机优化技术又受硬件平台的制约;有时需要优化代码的结构,使程序的逻辑结构易于理解;这些都要求消除这种与硬件息息相关的优化技术。论文基于IA-64平台,提出了一种反投机处理算法,对ICC编译器编译后的可执行二进制代码进行处理,消除代码中的投机优化,将其转换成等价的没有投机优化的指令序列,这样使反投机后的代码更容易理解,而且在逆向工程中摆脱了硬件的限制。测试表明该反投机技术可以对ICC编译后的代码进行有效处理。相似文献

19.

Compiler techniques for the distribution of data and computation

Navarro A. Zapata E. Padua D. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(6):545-562

This paper presents a new method that can be applied by a parallelizing compiler to find, without user intervention, the iteration and data decompositions that minimize communication and load imbalance overheads in parallel programs targeted at NUMA architectures. One of the key ingredients in our approach is the representation of locality as a locality-communication graph (ICG) and the formulation of the compiler technique as a mixed integer nonlinear programming (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. This paper summarizes the process of how the compiler extracts the locality information from a nonannotated code and focuses on how this compiler can derive the optimization problem, solve it, and generate the parallel code with the automatically selected iteration and data distributions. In addition, we include a discussion about our model and the solutions - the decompositions - that it provides. The approach presented in the paper is evaluated using several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines. 相似文献

20.

Profile-based dynamic pipeline scaling

Kuan-Wei Cheng Tzong-Yen Lin Rong-Guey Chang 《The Journal of supercomputing》2009,48(2):210-226

Low power has played an increasingly important role for embedded systems. To save power, lowering voltage and frequency is very straightforward and effective; therefore, dynamic voltage scaling (DVS) has become a prevalent low-power technique. However, DVS makes no effect on power saving when the voltage reaches a lower bound. Fortunately, a technique called dynamic pipeline scaling (DPS) can overcome this limitation by switching pipeline modes at low-voltage level. Approaches proposed in previous work on DPS were based on hardware support. From viewpoint of compiler, little has been addressed on this issue. This paper presents a DPS optimization technique at compiler time to reduce power dissipation. The useful information of an application is exploited to devise an analytical model to assess the cost of enabling DPS mechanism. As a consequence, we can determine the switching timing between pipeline modes at compiler time without causing significant run-time overhead. The experimental result shows that our approach is effective in reducing energy consumption.

Rong-Guey Chang (Corresponding author)Email:

相似文献