期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

The Cell Broadband Engine: Exploiting Multiple Levels of Parallelism in a Chip Multiprocessor 总被引：1，自引：0，他引：1

Michael Gschwind 《International journal of parallel programming》2007,35(3):233-262

As CMOS feature sizes continue to shrink and traditional microarchitectural methods for delivering high performance (e.g., deep pipelining) become too expensive and power-hungry, chip multiprocessors (CMPs) become an exciting new direction by which system designers can deliver increased performance. Exploiting parallelism in such designs is the key to high performance, and we find that parallelism must be exploited at multiple levels of the system: the thread-level parallelism that has become popular in many designs fails to exploit all the levels of available parallelism in many workloads for CMP systems. We describe the Cell Broadband Engine and the multiple levels at which its architecture exploits parallelism: data-level, instruction-level, thread-level, memory-level, and compute-transfer parallelism. By taking advantage of opportunities at all levels of the system, this CMP revolutionizes parallel architectures to deliver previously unattained levels of single chip performance. We describe how the heterogeneous cores allow to achieve this performance by parallelizing and offloading computation intensive application code onto the Synergistic Processor Element (SPE) cores using a heterogeneous thread model with SPEs. We also give an example of scheduling code to be memory latency tolerant using software pipelining techniques in the SPE. This paper is based in part on “Chip multiprocessing and the Cell Broadband Engine”, ACM Computing Frontiers 2006. 相似文献

2.

A comparative simulation study on the power–performance of multi-core architecture

Vijayalakshmi Saravanan Alagan Anpalagan D. P. Kothari Isaac Woungang Mohammad S. Obaidat Fellow of IEEE Fellow of SCS 《The Journal of supercomputing》2014,70(1):465-487

Nowadays, multi-core processor is the main technology used in desktop PCs, laptop computers and mobile hardware platforms. As the number of cores on a chip keeps increasing, it adds up the complexity and impacts more on both power and performance of a processor. In multi-processors, the number of cores and various parameters, such as issue-width, number of instructions and execution time, are key design factors to balance the amount of thread-level parallelism and instruction-level parallelism. In this paper, we perform a comprehensive simulation study that aims to find the optimum number of processor cores in desktop/laptop computing processor models with shallow pipeline depth. This paper also explores the trade-off between the number of cores and different parameters used in multi-processors in terms of power–performance gains and analyzes the impact of 3D stacking on the design of simultaneous multi-threading and chip multiprocessing. Our analysis shows that the optimum number of cores varies with different classes of workloads, namely: SPEC2000, SPEC2006 and MiBench. Simulation study is presented using architectures with shorter pipeline depth, showing that (1) the optimum number of cores for power–performance is 8, (2) the optimum number of threads in the range [2, 4], and (3) for beyond 32 cores, multi-core processors are no longer efficient in terms of performance benefits and overall power consumption. 相似文献

3.

A single-chip multiprocessor

Nayfeh B.A. Olukotun K. 《Computer》1997,30(9):79-85

Presents the case for billion-transistor processor architectures that will consist of chip multiprocessors (CMPs): multiple (four to 16) simple, fast processors on one chip. In their proposal, each processor is tightly coupled to a small, fast, level-one cache, and all processors share a larger level-two cache. The processors may collaborate on a parallel job or run independent tasks (as in the SMT proposal). The CMP architecture lends itself to simpler design, faster validation, cleaner functional partitioning, and higher theoretical peak performance. However for this architecture to realize its performance potential, either programmers or compilers will have to make code explicitly parallel. Old ISAs will be incompatible with this architecture (although they could run slowly on one of the small processors) 相似文献

4.

CCNoC: Cache-Coherent Network on Chip for Chip Multiprocessors 总被引：1，自引：1，他引：0

下载免费PDF全文

王惊雷薛一波王海霞李崇民汪东升《计算机科学技术学报》2010,25(2):257-266

As the number of cores in chip multiprocessors(CMPs) increases,cache coherence protocol has become a key issue in integration of chip multiprocessors.Supporting cache coherence protocol in large chip multiprocessors still faces three hurdles:design complexity,performance and scalability.This paper proposes Cache Coherent Network on Chip(CCNoC),a scheme that decouples cache coherency maintenance from processors and shared L2 caches and implements it completely in network on chip to free up processors and ... 相似文献

5.

Nahalal: Cache Organization for Chip Multiprocessors

Guz Z. Keidar I. Kolodny A. Weiser U.C. 《Computer Architecture Letters》2007,6(1):21-24

This paper addresses cache organization in chip multiprocessors (CMPs). We show that in CMP systems it is valuable to distinguish between shared data, which is accessed by multiple cores, and private data accessed by a single core. We introduce Nahalal, an architecture whose novel floorplan topology partitions cached data according to its usage (shared versus private data), and thus enables fast access to shared data for all processors while preserving the vicinity of private data to each processor. Nahalal exhibits significant improvements in cache access latency compared to a traditional cache design. 相似文献

6.

The Stanford Hydra CMP 总被引：5，自引：0，他引：5

Hammond L. Hubbert B.A. Siu M. Prabhu M.K. Chen M. Olukolun K. 《Micro, IEEE》2000,20(2):71-84

The Hydra chip multiprocessor (CMP) integrates four MIPS-based processors and their primary caches on a single chip together with a shared secondary cache. A standard CMP offers implementation and performance advantages compared to wide-issue superscalar designs. However, it must be programmed with a more complicated parallel programming model to obtain maximum performance. To simplify parallel programming, the Hydra CMP supports thread-level speculation and memory renaming, a paradigm that allows performance similar to a uniprocessor of comparable die area on integer programs. This article motivates the design of a CMP, describes the architecture of the Hydra design with a focus on its speculative thread support, and describes our prototype implementation. Chip multiprocessors offer an economical, scalable architecture for future microprocessors. Thread-level speculation support allows them to speed up past software 相似文献

7.

多核Cache稀疏目录性能提升方法综述

吴健虢陈海燕刘胜邓让钰陈俊杰《计算机工程与科学》2019,41(3):385-392

受限于功耗,十多年前通用微处理器就停止追求更高的主频转而向集成更多处理器核的方向发展;同时,随着晶体管密度按摩尔定律不断提高,单片可集成的处理器核数成倍增长,片上多核、众核处理器已成为高性能微处理器发展的主流。未来千核级通用众核处理器支持共享存储编程模型是一种必然趋势,但传统的Cache一致性目录结构面临着查找延迟高、目录项替换频繁以及硬件代价和功耗可扩展性有限等问题。稀疏目录实现了传统目录结构硬件开销与一致性维护效率的折衷,被认为是众核处理器维护Cache一致性的一种高能效、可扩展结构。综述了近年来提高稀疏目录性能的相关研究与方法,并对其在面积、访问延迟、功耗和实现复杂性等方面进行分析,归纳出这些方法各自的优点和存在的不足,对创新设计未来高性能众核处理器共享存储体系结构具有一定的参考价值。相似文献

8.

REPLICA MBTAC: multithreaded dual-mode processor

Martti Forsell Jussi Roivainen Ville Leppänen 《The Journal of supercomputing》2018,74(5):1911-1933

Prevailing trend in design of chip multiprocessors (CMP) has been that single-core processors are replicated. Therefore, they typically define asynchronous computational model, require heavily locality-aware memory allocation, and present high overheads in intercommunication. This kind of properties make parallel programming very challenging and prone to errors. We introduce our new dual-mode MultiBunched/Threaded Architecture with Chaining (MBTAC) processor core, the main building block of the REPLICA CMP. It provides a modern, sophisticated way for writing general purpose parallel programs backed up by native execution capabilities/realization of key concepts. These include support for cost-efficient machine instruction-level synchronization and uniform shared global memory for enabling easy-to-program memory allocation of data structures and data movement. MBTAC makes use of low-overhead thread-context switching solution; it has parallel computing savvy functional unit organization to exploit inter-thread instruction-level parallelism and highly efficient multioperations. To evaluate the goodness of our proposal, we implemented three MBTAC constellations featuring up to 2048 parallel threads on FPGA, compared it with respect to DLX and Intel’s Core i7 processors. The results point toward high performance in communication-intensive problems, simplified parallel programmability, and regular, implementation-friendly structure. 相似文献

9.

一种基于容量复用的异构CMP Cache

高翔章隆兵胡伟武《计算机研究与发展》2008,45(5):877-885

多核环境下的Cache设计技术受到线延时和应用等多方面因素影响,私有和共享方案都存在各自的不足.提出了一种异构的CMP Cache结构,采用两类具有不同Cache层次的结点组成多核芯片,设计了基于间接索引的Cache容量复用等技术,提供了容量有效且访问迅速的片上存储层次.在全系统环境下对SPEC CPU2000,SPLASH2等程序的评测结果表明,异构CMP Cache结构能够适应各类应用的需要,对单进程和多线程应用平均性能提高分别可达16%和9%.异构CMP Cache同时具有硬件设计简单的特点,具有较好的工程可实现性,其设计思想将应用在未来的龙芯多核处理器设计中. 相似文献

10.

The Power PC 601 microprocessor

Becker M.C. Allen M.S. Moore C.R. Muhich J.S. Tuttle D.P. 《Micro, IEEE》1993,13(5):54-68

The PowerPC 601 microprocessor, the first of a family of processors based on the PowerPC architecture, is described. The general-purpose processor contains a 32-Kb cache and a superscalar machine organization that allows dispatch and execution of up to three instructions each clock cycle. The bus interface and storage control mechanisms can be configured for a wide range of system designs, from low-cost desktop personal computers to high-performance multi-processor systems. The PowerPC architecture, machine organization, chip packaging technology, and performance are discussed 相似文献

11.

A continuation-based noninterruptible multithreading processor architecture

Satoshi Amamiya Makoto Amamiya Ryuzo Hasegawa Hiroshi Fujita 《The Journal of supercomputing》2009,47(2):228-252

Current trend of research on multithreading processors is toward the chip multithreading (CMT), which exploits thread level parallelism (TLP) and improves performance of softwares built on traditional threading components, e.g., Pthread. There exist commercially available processors that support simultaneous multithreading (SMT) on multicore processors. But they are basically based on the conventional sequential execution model, and execute multiple threads in parallel under the control of OS that handles interruptions. Moreover, there exist few languages or programming techniques to utilize the multicore processors effectively. We are taking another approach to develop a multithreading processor, which is dedicated to TLP. Our processor, named Fuce, is based on the continuation-based multithreading. A thread is defined as a block of sequentially ordered instructions which are executed without interruption. Every thread execution is triggered only by the event called continuation. This paper first introduces the continuation-based multithread execution model and its processor architecture then gives multithreaded programming techniques and the continuation-based multithreading language system CML. Last, the performance of the Fuce processor is evaluated by means of the clock-level software simulation. 相似文献

12.

Wimpy or brawny cores: A throughput perspective

Xiangyang Liang Minh Nguyen Hao Che 《Journal of Parallel and Distributed Computing》2013

In this paper, we conduct a coarse-granular comparative analysis of wimpy (i.e., simple) fine-grain multicore processors against brawny (i.e., complex) simultaneous multithreaded (SMT) multicore processors for server applications with strong request-level parallelism. We explore a large design space along multiple dimensions, including the number of cores, the number of threads, and a wide range of workloads. 相似文献

13.

High-level dataflow design of signal processing systems for reconfigurable and multicore heterogeneous platforms

Endri Bezati Richard Thavot Ghislain Roquier Marco Mattavelli 《Journal of Real-Time Image Processing》2014,9(1):251-262

The potential computational power of today multicore processors has drastically improved compared to the single processor architecture. Since the trend of increasing the processor frequency is almost over, the competition for increased performance has moved on the number of cores. Consequently, the fundamental feature of system designs and their associated design flows and tools need to change, so that, to support the scalable parallelism and the design portability. The same feature can be exploited to design reconfigurable hardware, such as FPGAs, which leads to rethink the mapping of sequential algorithms to HDL. The sequential programming paradigm, widely used for programming single processor systems, does not naturally provide explicit or implicit forms of scalable parallelism. Conversely, dataflow programming is an approach that naturally provides parallelism and the potential to unify SW and HDL designs on heterogeneous platforms. This study describes a dataflow-based design methodology aiming at a unified co-design and co-synthesis of heterogeneous systems. Experimental results on the implementation of a JPEG codec and a MPEG 4 SP decoder on heterogeneous platforms demonstrate the flexibility and capabilities of this design approach. 相似文献

14.

An analytical study of resource division and its impact on power and performance of multi-core processors

Saravanan Vijayalakshmi Alagan Anpalagan D. P. Kothari Isaac Woungang Mohammad S. Obaidat 《The Journal of supercomputing》2014,68(3):1265-1279

The study and development of chip multi-processors (CMPs) are of utmost importance for the creation of future technologies. Devising a theoretical analysis of the micro-architecture model for the power/performance on CMPs is still a challenge. This paper addresses this problem by (1) introducing an analytical model for measuring the power and performance of a processor quantitatively, (2) analyzing the effects of resource division on power consumption and performance when executing a given benchmark, and (3) predicting the optimum number of cores to run the benchmark on. Our proposed analytically derived results show that in order to achieve power/performance gains, the optimum number of cores must be between 8 and 16. 相似文献

15.

Accelerating Sequential Applications on CMPs Using Core Spilling

Cong J. Han Guoling Jagannathan A. Reinman G. Rutkowski K. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(8):1094-1107

Chip multiprocessors (CMPs) provide a scalable means of exploiting thread-level parallelism for multitasking or multithreaded applications. However, single-threaded applications will have difficulty dynamically leveraging the statically partitioned resources in a CMP. Such sequential applications may be difficult to statically decompose into threads or may simply be-a legacy code where recompilation is not possible or cost-effective. We present a novel approach to dynamically accelerate the performance of sequential application(s) on multiple cores. Execution is allowed to spill from one core to another when resources on one core have been exhausted. We propose two techniques to enable low-overhead migration between cores: prespilling and locality-based filtering. We develop and analyze an arbitration mechanism to intelligently allocate cores among a set of sequential applications on a CMP. On average, core spilling on an eight-core CMP can accelerate single-threaded performance by 35 percent. We further explore an eight-core CMP running a multiple application workload composed of the entire SPEC 2000 benchmark suite in various combinations and arrival times. Using core spilling to accelerate the current set of running applications in cases where there are idle cores, we achieve up to a 40 percent improvement in performance. 相似文献

16.

Performance models for network processor design

Wolf T. Franklin M.A. 《Parallel and Distributed Systems, IEEE Transactions on》2006,17(6):548-561

To provide a variety of new and advanced communications services, computer networks are required to perform increasingly complex packet processing. This processing typically takes place on network routers and their associated components. An increasingly central component in router design is a chip-multiprocessor (CMP) referred to as "network processor" or NP. In addition to multiple processors, NPs have multiple forms of on-chip memory, various network and off-chip memory interfaces, and other specialized logic components such as CAMs (content addressable memories). The design space for NPs (e.g., number of processors, caches, cache sizes, etc.) is large due to the diverse workload, application requirements, and system characteristics. System design constraints relate to the maximum chip area and the power consumption that are permissible while achieving defined line rates and executing required packet functions. In this paper, an analytic performance model that captures the processing performance, chip area, and power consumption for a prototypical NP is developed and used to provide quantitative insights into system design trade offs. The model, parameterized with a networking application benchmark, provides the basis for the design of a scalable, high-performance network processor and presents insights into how best to configure the numerous design elements associated with NPs. 相似文献

17.

How many cores do we need to run a parallel workload: A test drive of the Intel SCC platform?

Chen Liu Pollawat Thanarungroj Jean-Luc Gaudiot 《Journal of Parallel and Distributed Computing》2014

As semiconductor manufacturing technology continues to improve, it is possible to integrate more and more transistors onto a single processor. Many-core processor design has resulted in part from the search to utilize this enormous transistor real estate. The Single-Chip Cloud Computer (SCC) is an experimental many-core processor created by Intel Labs. In this paper we present a study in which we analyze this innovative many-core system by running several workloads with distinctive parallelism characteristics. We investigate the effect on system performance by monitoring specific hardware performance counters. Then, we experiment on varying different hardware configuration parameters such as number of cores, clock frequency and voltage levels. We execute the chosen workloads and collect the timing, power consumption and energy consumption information on such a many-core research platform. Thus, we can comprehensively analyze the behavior and scalability of the Intel SCC system with the introduced workload in terms of performance and energy consumption. Our results show that the profiled parallel workload execution has a communication bottleneck on the Intel SCC system. Moreover, our results indicate that we should carefully choose the number of cores to execute different workloads in order to yield a balance between execution performance and energy efficiency for different applications. 相似文献

18.

Continual flow pipelines: achieving resource-efficient latency tolerance

《Micro, IEEE》2004,24(6):62-73

With the natural trend toward integration, microprocessors are increasingly supporting multiple cores on a single chip. To keep design effort and costs down, designers of these multicore microprocessors frequently target an entire product range, from mobile laptops to high-end servers. This article discusses a continual flow pipeline (CFP) processor. Such processor architecture can sustain a large number of in-flight instructions (commonly referred to as the instruction window and comprising all instructions renamed but not retired) without requiring the cycle-critical structures to scale up. By keeping these structures small and making the processor core tolerant of memory latencies, a CFP mechanism enables the new core to achieve high single-thread performance, and many of these new cores can be placed on a chip for high throughput. The resulting large instruction window reveals substantial instruction-level parallelism and achieves memory latency tolerance, while the small size of cycle-critical resources permits a high clock frequency 相似文献

19.

Resources Snapshot Model for Concurrent Transactions in Multi-Core Processors

下载免费PDF全文

赵雷杨季文《计算机科学技术学报》2013,28(1):106-118

Transaction parallelism in database systems is an attractive way of improving transaction performance.There exists two levels of transaction parallelism,inter-transaction level and intra-transaction level.With the advent of multicore processors,new hopes of improving transaction parallelism appear on the scene.The greatest execution efficiency of concurrent transactions comes from the lowest dependencies of them.However,the dependencies of concurrent transactions stand in the way of exploiting parallelism.In this paper,we present Resource Snapshot Model(RSM) for resource modeling in both levels.We propose a non-restarting scheduling algorithm in the inter-transaction level and a processor assignment algorithm in the intra-transaction level in terms of multi-core processors.Through these algorithms,execution performance of transaction streams will be improved in a parallel system with multiple heterogeneous processors that have different number of cores. 相似文献

20.

High Performance General-Purpose Microprocessors: Past and Future

下载免费PDF全文

Wei-Wu Hu Rui Hou Jun-Hua Xiao and Long-Bin Zhang 《计算机科学技术学报》2006,21(5):631-640

It can be observed from looking backward that processor architecture is improved through spirally shifting from simple to complex and from complex to simple. Nowadays we are facing another shifting from complex to simple, and new innovative architecture will emerge to utilize the continuously increasing transistor budgets. The growing importance of wire delays, changing workloads, power consumption, and design/verification complexity will drive the forthcoming era of Chip Multiprocessors （CMPs）. Furthermore, typical CMP projects both from industries and from academics are investigated. Through going into depths for some primary theoretical and implementation problems of CMPs, the great challenges and opportunities to future CMPs are presented and discussed. Finally, the Godson series microprocessors designed in China are introduced. 相似文献