期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Programming for different memory consistency models

Kourosh Gharachorloo Sarita V. Adve Anoop Gupta John L. Hennessy Mark D. Hill 《Journal of Parallel and Distributed Computing》1992,15(4)

The memory consistency model, or memory model, supported by a shared-memory multiprocessor directly affects its performance. The most commonly assumed memory model is sequential consistency (SC). While SC provides a simple model for the programmer, it imposes rigid constraints on the ordering of memory accesses and restricts the use of common hardware and compiler optimizations. To remedy the shortcomings of SC, several relaxed memory models have been proposed in the literature. These include processor consistency (PC), weak ordering (WO), release consistency (RCsc/RCpc), total store ordering (TSO), and partial store ordering (PSO). While the relaxed models provide the potential for higher performance, they present a more complex model for programmers when compared to SC. Our previous research has addressed this tradeoff by taking a programmer-centric approach. We have proposed memory models (DRF0, DRF1, PL) that allow the programmer to reason with SC, but require certain information about the memory accesses. This information is used by the system to relax the ordering among memory accesses while still maintaining SC for the programmer. Our previous models formalized the information that allowed optimizations associated with WO and RCsc to be used. This paper extends the above approach by defining a new model, PLpc, that allows optimizations of the TSO, PSO, PC, and RCpc models as well. Thus, PLpc provides a unified programming model that maintains the ease of reasoning with SC while providing for efficiency and portability across a wide range of proposed system designs. 相似文献

2.

Implementing sequentially consistent programs on processor consistent platforms

Lisa Higham Jalal Kawash 《Journal of Parallel and Distributed Computing》2008

Designers of distributed algorithms typically assume strong memory consistency guarantees, but system implementations provide weaker guarantees for better performance and scalability. This motivates the study of how to implement programs designed for sequential consistency on platforms with weaker consistency models. Typically, such implementations are impossible using only read and write operations to shared variables. One variant of processor consistency originally proposed by Goodman and called here PC-G is an exception because it provides just enough consistency to implement mutual exclusion using only reads and writes. This paper investigates the existence of compilers to convert arbitrary programs that use shared read/write variables with sequentially consistent memory semantics, to programs that use read/write variables with PC-G consistency semantics. We first provide a simple program transformation, and prove that it correctly compiles any 2-process program to a PC-G memory system, while preserving wait-freedom. We next prove that even a substantial generalization of this transformation cannot be a compiler for even a very restricted class of 3-process programs. Even though our program transformation is not a general compiler for three or more processes, it does correctly transform some specific n-process programs. In particular, for the special case of the (necessarily randomized) Test&Set algorithm of Tromp and Vitanyi, our transformation extends to any number of processes, thus providing the first algorithm for expected wait-free Test&Set on any weak memory system, using only read/write variables. 相似文献

3.

Verification of STM on relaxed memory models

Rachid Guerraoui Thomas A. Henzinger Vasu Singh 《Formal Methods in System Design》2011,39(3):297-331

相似文献

4.

A unified formalization of four shared-memory models

Adve S.V. Hill M.D. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(6):613-624

The authors present a data-race-free-1, shared-memory model that unifies four earlier models: weak ordering, release consistency (with sequentially consistent special operations), the VAX memory model, and data-race-free-0. Data-race-free-1 unifies the models of weak ordering, release consistency, the VAX, and data-race-free-0 by formalizing the intuition that if programs synchronize explicitly and correctly, then sequential consistency can be guaranteed with high performance in a manner that retains the advantages of each of the four models. Data-race-free-1 expresses the programmer's interface more explicitly and formally than weak ordering and the VAX, and allows an implementation not allowed by weak ordering, release consistency, or data-race-free-0. The implementation proposal for data-race-free-1 differs from earlier implementations by permitting the execution of all synchronization operations of a processor even while previous data operations of the processor are in progress. To ensure sequential consistency, two sychronizing processors exchange information to delay later operations of the second processor that conflict with an incomplete data operation of the first processor 相似文献

5.

A Framework of Memory Consistency Models 总被引：2，自引：1，他引：2

下载免费PDF全文

Hu Weiwu Shi Weisong Tang Zhimin 《计算机科学技术学报》1998,13(2):110-124

相似文献

6.

EDGE结构上一种通过超块重组加速单线程应用的方法

魏学超安虹毛梦捷《小型微型计算机系统》2012,(10):2249-2254

Explicit Data Graph Execution(EDGE)ISA是一种专门为类数据流驱动的分片式众核处理器而设计的指令集体系结构.相较于传统的采用控制流驱动的处理器,EDGE结构以超块(Hyperblock)而不是单个指令作为其执行单位,在超块内部实现数据流执行,超块之间按照推测序保持控制流执行,有利于挖掘指令级并行性.但是,EDGE编译器按照程序的串行执行顺序组织超块,超块间和超块内部受限于数据依赖,削弱了整个程序运行时的潜在数据级并行性和线程级并行性,不利于发挥EDGE分片式结构的优势.本文通过分析EDGE编译器超块组织的特点,结合EDGE结构特有的执行模型,提出一种普适性的超块组织框架来模拟EDGE结构上多线程运行的效果,进一步挖掘EDGE结构运行串行单线程程序时的指令级并行性.本文选用TRIPS微处理器作为EDGE结构的实例处理器,利用矩阵乘法等三个实验验证了我们所提出的框架的可行性,实验结果表明这些应用在TRIPS上获得了较好的性能提升. 相似文献

7.

Support for Efficient Programming on the SB-PRAM

Thomas Grün Thomas Rauber Jochen Röhrig 《International journal of parallel programming》1998,26(3):209-240

The SB-PRAM is a shared-memory parallel computer that has been designed according to the PRAM model from theoretical computer science. The SB-PRAM realizes a concurrent-read, concurrent-write PRAM where each processor can access the global memory in unit time. This article describes the programming environment of the SB-PRAM that enables a programmer to develop efficient and portable programs without dealing with architectural details of the machine. In particular, we discuss compiler and operating system issues and show that the runtime functions of the P4 environment and several parallel data structures can be implemented very efficiently by using special features of the SB-PRAM. In contrast to other parallel machines, the synchronization of processors and the management of concurrent accesses to the global memory only require a few machine instructions independent of the number of processors participating in the operation. This efficient implementation of the runtime system is the basis for good performance of many challenging applications. 相似文献

8.

Exploiting parallelism on a fine-grained MIMD architecture based upon channel queues

Rajiv Gupta Sunah Lee 《International journal of parallel programming》1992,21(3):169-192

We present techniques for exploiting fine-grained parallelism extracted from sequential programs on a fine-grained MIMD system. The system exploits fine-grained parallelism through parallel execution of instructions on multiple processors as well as pipelined nature of individual processors. The processors can communicate data values via globally shared registers as well as dedicated channel queues. Compilation techniques are presented to utilize these mechanisms. A scheduling algorithm has been developed to distribute operations among the processors in a manner that reduces communication among the processors. The compiler identifies data dependencies which require synchronization and enforces them using channel queues. Delays that may result by attempting write operations to a full channel queue are avoided by spilling values from channels to local registers. If an interprocessor data dependency does not require synchronization, then the data value is passed through a shared register or shared memory.Partially supported by National Science Foundation Presidential Young Investigator Award CCR-9157371 (CCR-9249143) to the University of Pittsburgh. 相似文献

9.

Multiprocessors should support simple memory consistency models

Hill M.D. 《Computer》1998,31(8):28-34

In the future, many computers will contain multiple processors, in part because the marginal cost of adding a few additional processors is so low that only minimal performance gain is needed to make the additional processors cost effective. Intel, for example, now makes cards containing four Pentium Pro processors that can easily be incorporated into a system. Multiple processor cards like Intel's will help multiprocessing spread from servers to the desktop. But how will these multiprocessors be programmed? The evolution of the programming model is already under way. One important function of the programming model is to describe how memory operates. For a multiprocessor, a reasonable model is sequential consistency (SC), which makes a multiprocessor behave like a multitasking uniprocessor. Nevertheless, many commercial multiprocessors support more relaxed memory models. The author argues that multiprocessors should support SC because-with speculative execution, relaxed models do not provide sufficient additional performance to justify exposing their complexity to the authors of low level software 相似文献

10.

Natural instruction level parallelism-aware compiler for high-performance QueueCore processor architecture

Ben Abdallah Abderazek Masashi Masuda Arquimedes Canedo Kenichi Kuroda 《The Journal of supercomputing》2011,57(3):314-338

This work presents a static method implemented in a compiler for extracting high instruction level parallelism for the 32-bit QueueCore, a queue computation-based processor. The instructions of a queue processor implicitly read and write their operands, making instructions short and the programs free of false dependencies. This characteristic allows the exploitation of maximum parallelism and improves code density. Compiling for the QueueCore requires a new approach since the concept of registers disappears. We propose a new efficient code generation algorithm for the QueueCore. For a set of numerical benchmark programs, our compiler extracts more parallelism than the optimizing compiler for an RISC machine by a factor of 1.38. Through the use of QueueCore’s reduced instruction set, we are able to generate 20% and 26% denser code than two embedded RISC processors. 相似文献

11.

Data-race and concurrent-write freedom are undecidable

《Computer Languages, Systems and Structures》2003,29(1-2):1-13

In a distributed shared memory system, sequential consistency is often assumed as the model for the memory, because it is a natural extension from multitasking in uniprocessor systems. Weaker consistency models allow greater concurrency, but programming is harder, because programs may produce unexpected results.Data-race-free (DRF) and concurrent-write-free (CWF) programs have the same set of possible executions both under a sequentially consistent memory and under some other, weaker model, memories. They can be written for a sequential memory and run unchanged under such a weaker-model memory. Since the sets of possible executions are the same, the run will only produce results that are possible under sequential consistency.This article proves the undecidability of both classes of concurrent programs in a language with if statements, loops, barriers, dynamic process creation, dynamic storage, and recursive data structures, under many models weaker than sequential consistency. Moreover, the article also proves that methods that only add synchronization statements to programs written for sequential consistency must produce some conservatively DRF or CWF programs. 相似文献

12.

A protocol-centric approach to on-the-fly race detection

Perkovic D. Keleher P.J. 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(10):1058-1072

We present the design and evaluation of a new data-race-detection technique. Our technique executes at runtime rather than post-mortem, and handles unmodified shared-memory applications that run on top of CVM, a software distributed shared memory system. We do not assume explicit associations between synchronization and shared data, and require neither compiler support nor program source. Instead, we use a binary code re-writer to instrument instructions that may access shared memory. The most novel aspect of our system is that we are able to use information from the underlying memory system implementation in order to reduce the number of comparisons made at runtime. We present an experimental evaluation of our techniques by using our system to look for data races in five common shared-memory programs. We quantify the effect of several optimizations to the basic technique: data flow analysis, instrumentation batching, runtime code modification, and instrumentation inlining. Our system correctly found races in three of the five programs, including two from a standard benchmark suite. The slowdown of this debugging technique averages less than 2.5 for our applications. 相似文献

13.

Design and implementation of a queue compiler

Arquimedes Canedo Ben A. Abderazek Masahiro Sowa 《Microprocessors and Microsystems》2009,33(2):129-138

Queue processors are a viable alternative for high performance embedded computing and parallel processing. We present the design and implementation of a compiler for a queue-based processor. Instructions of a queue processor implicitly reference their operands making the programs free of false dependencies. Compiling for a queue machine differs from traditional compilation methods for register machines. The queue compiler is responsible for scheduling the program in level-order manner to expose natural parallelism and calculating instructions relative offset values to access their operands. This paper describes the phases and data structures used in the queue compiler to compile C programs into assembly code for the QueueCore, an embedded queue processor. Experimental results demonstrate that our compiler produces good code in terms of parallelism and code size when compared to code produced by a traditional compiler for a RISC processor. 相似文献

14.

Compiler optimizations for processors with SIMD instructions

Ivan Pryanishnikov Andreas Krall Nigel Horspool 《Software》2007,37(1):93-113

To achieve maximum efficiency, modern embedded processors for media applications exploit single instruction multiple data (SIMD) instructions. SIMD instructions provide a form of vectorization where a large machine word is viewed as a vector of subwords and the same operation is performed on all subwords in parallel. Systematic usage of SIMD instructions can significantly improve program performance. With C becoming the dominant language for programming embedded devices, there is a clear need for C compilers that use SIMD instructions whenever appropriate. However, SIMD instructions typically require each memory access to be aligned with the instruction's data access size. Therefore an important problem in designing the compiler is to determine whether a C pointer is aligned, i.e. whether it refers to the beginning of a machine word. In this paper, we describe our SIMD generation algorithm and present an analysis method which determines the alignment of pointers at compile time. The alignment information is used to reduce the number of dynamic alignment checks and the overhead incurred by them. Our method uses an interprocedural analysis which propagates pointer alignment information in function bodies and through function calls. The effectiveness of our method is supported by experimental results which show that in typical programs the alignments of about 50% of the pointers can be statically determined. Copyright © 2006 John Wiley & Sons, Ltd. 相似文献

15.

奇异数据类型的编译支持

下载免费PDF全文

胡定磊陈书明刘春林《计算机工程》2007,33(3):29-31

专用处理器，如DSP等，因主要支持特定应用，其指令集往往只支持有限的数据类型。在采用高级语言为其编程时，若采用了处理器不支持的奇异数据类型，编译器必须在保持语义的前提下将其转化为处理器支持的一段指令。该文提出了一种在VLIW DSP编译器中实现对奇异数据类型的处理的方法，包括对含有奇异数据类型的中间代码的注释、调度依赖关系的计算、寄存器分配的改进。该类方法对编译器的改动相对较小，效率较高。相似文献

16.

Imagine流处理器上流的优化组织方法

杨学军曾丽芳邓宇唐玉华《计算机学报》2008,31(7)

流应用的特点以及传统处理器在处理流应用上的不足,使得支持数据并行的流处理器的设计成为当前体系结构研究领域的一个热点.文中针对Imagine流处理器体系结构的特点,提出了流分割和流压缩两种流的优化组织方法.模拟结果表明,流分割和流压缩使得流应用程序能充分利用Imagine的并行结构、流水结构和多级带宽存储结构,从而减少流程序的执行时间. 相似文献

17.

Compilation techniques for a reconfigurable LIW architecture

Rajiv Gupta Mary Lou Soffa 《The Journal of supercomputing》1989,3(4):271-304

Matching an application to an architecture in structure and size is a way of achieving higher computation speed. This paper presents a combination of a compiler and a reconfigurable long instruction word (RLIW) architecture as an approach to the matching problem. Configurations suitable for the execution of different parts of a program are determined by a compiler, and code is generated for both reconfiguring the hardware and performing the computation. The RLIW machine, consisting of multiple processing and global data memory modules, effectively utilizes the fine-grained parallelism detected in programs by a compiler. The long word instructions control the operation of processing and memory modules in the system. To reduce the data transfer between processing modules and data memory modules, we provide reconfigurable interconnections among the processing modules which permit direct communication. The compiler uses new techniques, including region scheduling, generation of code for reconfiguration of the system, and memory allocation techniques, to achieve improved performance. Algorithms for packing operations into long word instructions and techniques for effectively assigning memory modules to the operands required by an instruction are developed. Results of the experiments to evaluate the system indicate that speedups of 60–300% can be obtained for both scientific and nonscientific programs. The reconfigurable architecture is responsible for much of the speedup. Also, the results indicate that the major problem of memory bottleneck faced in designing parallel systems is successfully attacked.This paper represents work done while the author was at the University of Pittsburgh 相似文献

18.

Static scheduling for barrier MIMD architectures

Henry G. Dietz Abderrazek Zaafrani Matthew T. O'Keefe 《The Journal of supercomputing》1992,5(4):263-289

In a SIMD or VLIW machine, conceptual synchronizations are accomplished by using a static code schedule that does not require run-time synchronization. The lack of run-time synchronization overhead makes these machines very effective for fine-grain parallelism, but they cannot execute parallel code structures as general as those executed by MIMD architectures, and this limits their utility.In this paper we present a timing analysis that allows a compiler for a MIMD machine to eliminate a large fraction of the run-time synchronization by making efficient use of static code scheduling. Although these techniques can be adapted to be applied to most MIMD machines, this paper centers on the analysis and scheduling for barrier MIMD machines. Barrier MIMDs are asynchronous multiple instruction stream/multiple data stream architectures capable of parallel execution of variable execution-time instructions and arbitrary control flow (e.g., while loops and calls). However, they also incorporate a special hardware barrier synchronization mechanism that facilitates static scheduling by providing a mechanism which the compiler can use to enforce precise timing constraints. In other words, the compiler tracks relative timing between processors and uses static code scheduling until the timing imprecision becomes too large, at which point the compiler simply inserts a barrier to reduce that timing imprecision to zero (or a small constant).This paper describes new scheduling and barrier placement algorithms for barrier MIMDs that are based loosely on the list scheduling approach employed for VLIWs [Ellis 1985]. In addition, the experimental results from scheduling thousands of synthetic benchmark programs for a parameterized barrier MIMD machine are presented. 相似文献

19.

What is Itanium Memory Consistency from the Programmer's Point of View?

Lisa Higham LillAnne Jackson Jalal Kawash 《Electronic Notes in Theoretical Computer Science》2007,174(9):63

相似文献

20.

A scalable implementation of barrier synchronization using an adaptive combining tree

Rajiv Gupta Charles R. Hill 《International journal of parallel programming》1989,18(3):161-180

Barrier synchronization is commonly used for synchronizing processors prior to a join operation and to enforce data dependencies during the execution of parallelized loops. Simple software implementations of barrier synchronization can result in memory hot-spots, especially in large scale shared-memory multiprocessors containing hundreds of processors and memory modules communicating through an interconnection network. A software combining tree can be used to substantially reduce memory contention due to hot-spots. However, such an implementation results inO(logn) latency in recognition of barrier synchronization, wheren is the number of processors. In this paper anadaptive software combining tree is used to implement a scalable barrier withO(1) recognition latency. The processors that arrive early at the barrier adapt the combining tree so that it has a structure appropriate for reducing the latency for the processors that arrive later. We also show how adaptive combining trees can be used to implement the fuzzy barrier. The fuzzy barrier mechanism reduces the idling of processors at the barriers by allowing the processors to execute useful instructions while they are waiting at the barrier. 相似文献