期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Najjar W. Jezouin J.-L. Gaudiot J.-L. 《Design & Test of Computers, IEEE》1987,4(6):41-44

The availability of commercial multiprocessors gives a new importance to distributed simulation. However, while existing techniques exploit some levels of parallelism in the discrete-event simulation algorithm, they are usually targeted to specific applications and architectures. In the general scheme presented here, the algorithm is executed in parrallel using a compilation strategy that can be implemented on shared-memory multiprocessors. 相似文献

2.

Special issue on data-flow computing Guest editors' introduction

Lubomir Bic Jean-Luc Gaudiot 《Journal of Parallel and Distributed Computing》1990,10(4)

相似文献

3.

Pinned OS/Services: A Case Study of XML Parsing on Intel SCC

下载免费PDF全文

唐洁 Pollawat Thanarungroj 刘晨刘少山古志民 Jean-Luc Gaudiot 《计算机科学技术学报》2013,28(1):3-13

Nowadays,we are heading towards integrating hundreds to thousands of cores on a single chip.However,traditional system software and middleware are not well suited to manage and provide services at such large scale.To improve the scalability and adaptability of operating system and middleware services on future many-core platform,we propose the pinned OS/services.By porting each OS and runtime system(middleware) service to a separate core(special hardware acceleration),we expect to achieve maximal performance gain and energy efficiency in many-core environments.As a case study,we target on XML(Extensible Markup Language),the commonly used data transfer/store standard in the world.We have successfully implemented and evaluated the design of porting XML parsing service onto Intel 48-core Single-Chip Cloud Computer(SCC) platform.The results show that it can provide considerable energy saving.However,we also identified heavy performance penalties introduced from memory side,making the parsing service bloated.Hence,as a further step,we propose the memory-side hardware accelerator for XML parsing.With specified hardware design,we can further enhance the performance gain and energy efficiency,where the performance can be improved by 20% with 12.27% energy reduction. 相似文献

4.

Design and Implementation of an Agent Home Scheme Strategy for Prefetch-Based DSM Systems

Hsiao-Hsi Wang Kuan-Ching Li Ssu-Hsuan Lu Chun-Chieh Yang Jean-Luc Gaudiot 《International journal of parallel programming》2008,36(6):521-542

In recent years, cluster computing has been widely investigated and there is no doubt that it can provide a cost-effective computing infrastructure by aggregating computational power, communication, and storage resources. Moreover, it is also considered to be a very attractive platform for low-cost supercomputing. Distributed shared memory (DSM) systems utilize the physical memory of each computing node interconnected in a private network to form a global virtual shared memory. Since this global shared memory is distributed among the computing nodes, accessing the data located in remote computing nodes is an absolute necessity. However, this action will result in significant remote memory access latencies which are major sources of overhead in DSM systems. For these reasons, in order to increase overall system performance and decrease this overhead, a number of strategies have been devised. Prefetching is one such approach which can reduce latencies, although it always increases the workload in the home nodes. In this paper, we propose a scheme named Agent Home Scheme. Its most noticeable feature, when compared to other schemes, is that the agent home distributes the workloads of each computing nodes when sending data. By doing this, we can reduce not only the workload of the home nodes by balancing the workload for each node, but also the waiting time. Experimental results show that the proposed method can obtain about 20% higher performance than the original JIAJIA, about 18% more than History Prefetching Strategy (HPS), and about 10% higher than Effective Prefetch Strategy (EPS). 相似文献

5.

The Impact of Speculative Execution on SMT Processors

Dongsoo Kang Chen Liu Jean-Luc Gaudiot 《International journal of parallel programming》2008,36(4):361-385

By executing two or more threads concurrently, Simultaneous MultiThreading (SMT) architectures are able to exploit both Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP) from the increased number of in-flight instructions that are fetched from multiple threads. However, due to incorrect control speculations, a significant number of these in-flight instructions are discarded from the pipelines of SMT processors (which is a direct consequence of these pipelines getting wider and deeper). Although increasing the accuracy of branch predictors may reduce the number of instructions so discarded from the pipelines, the prediction accuracy cannot be easily scaled up since aggressive branch prediction schemes strongly depend on the particular predictability inherently to the application programs. In this paper, we present an efficient thread scheduling mechanism for SMT processors, called SAFE-T (Speculation-Aware Front-End Throttling): it is easy to implement and allows an SMT processor to selectively perform speculative execution of threads according to the confidence level on branch predictions, hence preventing wrong-path instructions from being fetched. SAFE-T provides an average reduction of 57.9% in the number of discarded instructions and improves the instructions per cycle (IPC) performance by 14.7% on average over the ICOUNT policy across the multi-programmed workloads we simulate. This paper is an extended version of the paper, “Speculation Control for Simultaneous Multithreading,” which appeared in the Proceedings of the 18th International Parallel and Distributed Processing Symposium, Santa Fe, New Mexico, April 2004. 相似文献

6.

Tolerating Radiation-Induced Transient Faults in Modern Processors

Xiaobin Li Jean-Luc Gaudiot 《International journal of parallel programming》2010,38(2):85-116

As MOS device sizes continue shrinking, lower charges, for example those charges carried by single ionizing particles of naturally occurring radiation, are sufficient to upset the functioning of complex modern microprocessors. In order to handle these inevitable errors, designs should include fault-tolerant features so that the processors can continue to correctly perform despite the occurrence of errors. The main goal of this work is to develop architecture mechanisms to protect processors against the effect of such radiation-induced transient faults. It should first be noted that, from a program execution perspective, many faults manifest themselves as control flow errors that cause processors to violate the correct sequencing of instructions. We present here at first a basic compile-time signature assignment algorithm and describe a novel approach to improve the fault detection coverage of the basic algorithm. Moreover, to allow the processor to efficiently check the run-time sequence and detect control flow errors, we introduce an on-chip assigned-signature checker which is capable of executing three additional instructions (SIC, SIJ, SIJC). Second, since the very concept of simultaneous multi-threading (SMT) provides the necessary redundancy, some proposals have been made to run two copies of the same thread on top of SMT platforms in order to detect and correct soft errors. This allows, upon detection of an error, the rolling back of the processor state to a known safe point, and then a retry of the instructions, thereby effecting a completely error-free execution. This paper has focused on two crucial implementation issues introduced by this scheme: (1) the design trade-off between the fault detection coverage versus design costs; (2) the possible occurrence of deadlock situations. 相似文献

7.

Block scheduling of iterative algorithms and graph-level priorityscheduling in a simulated data-flow multiprocessor

Evripidou P. Gaudiot J.-L. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(4):398-413

Iterative methods for solving linear systems are discussed. Although these methods are inherently highly sequential, it is shown that much parallelism could be exploited in a data-flow system by scheduling the iterative part of the algorithms in blocks and by looking ahead across several iterations. This approach is general and will apply to other iterative and loop-based problems. It is also demonstrated by simulation that relying solely on data-driven scheduling of parallel and unrolled loops results in low resource utilization and poor performance. A graph-level priority scheduling mechanism has been developed that greatly improves resource utilization and yields higher performance 相似文献

8.

SMT layout overhead and scalability

Burns J. Gaudiot J.-L. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(2):142-155

Simultaneous Multi-Threading (SMT) is a hardware technique that increases processor throughput by issuing instructions simultaneously from multiple threads. However, while SMT can be added to an existing microarchitecture with relatively low overhead, this additional chip area could be used for other resources such as more functional units, larger caches, or better branch predictors. How large is the SMT overhead and at what point does SMT no longer pay off for maximum throughput compared to adding other architecture features? This paper evaluates the silicon overhead of SMT by performing a transistor/interconnect-level analysis of the layout. We discuss microarchitecture issues that impact SMT implementations and show how the Instruction Set Architecture (ISA) and microarchitecture can have a large effect on the SMT overhead and performance. Results show that SMT yields large performance gains with small to moderate area overhead 相似文献

9.

Implementing regularly structured neural networks on the DREAMmachine

Shams S. Gaudiot J.-L. 《Neural Networks, IEEE Transactions on》1995,6(2):407-421

High-throughput implementations of neural network models are required to transfer the technology from small prototype research problems into large-scale "real-world" applications. The flexibility of these implementations in accommodating for modifications to the neural network computation and structure is of paramount importance. The performance of many implementation methods today is greatly dependent on the density and the interconnection structure of the neural network model being implemented. A principal contribution of this paper is to demonstrate an implementation method which exploits maximum amount of parallelism from neural computation, without enforcing stringent conditions on the neural network interconnection structure, to achieve this high implementation efficiency. We propose a new reconfigurable parallel processing architecture, the Dynamically Reconfigurable Extended Array Multiprocessor (DREAM) machine, and an associated mapping method for implementing neural networks with regular interconnection structures. Details of the system execution rate calculation as a function of the neural network structure are presented. Several example neural network structures are used to demonstrate the efficiency of our mapping method and the DREAM machine architecture on implementing diverse interconnection structures. We show that due to the reconfigurable nature of the DREAM machine, most of the available parallelism of neural networks can be efficiently exploited. 相似文献

10.

Design and evaluation of a hierarchical decoupled architecture

Won W. Ro Stephen P. Crago Alvin M. Despain Jean-Luc Gaudiot 《The Journal of supercomputing》2006,38(3):237-259

The speed gap between processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer from frequent cache misses and lose many CPU cycles due to pipeline stalling. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them strongly rely on the predictability for future accesses and often fail when memory accesses do not contain much locality. To solve the long latency problem of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). The motivation for the design originated from the traditional decoupled architecture concept and its limits. The HiDISC approach implements an additional prefetching processor on top of a traditional access/execute architecture. Our design aims at providing low memory access latency by separating and decoupling otherwise sequential pieces of code into three streams and executing each stream on three dedicated processors. The three streams act in concert to mask the long access latencies by providing the necessary data to the upper level on time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated processors. Detailed hardware design and performance evaluation are performed with development of an architectural simulator and compiling tools. Our performance results show that the proposed HiDISC model reduces 19.7% of the cache misses and improves the overall IPC (Instructions Per Cycle) by 15.8%. With a slower memory model assuming 200 CPU cycles as memory access latency, our HiDISC improves the performance by 17.2%. 相似文献