期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

黄光奇李子木周兴铭窦勇《计算机学报》2001,24(12):1318-1323

随着半导体工艺技术的飞速发展,单芯片多处理器（Single-Chip Multiprocessor,SCMP)结构将是一条提高处理器性能的有效途径。该文在分析SCMP结构的特点的基础上,提出了SCMP的一种结构实现：共享多端口数据Cache结构（Shared Multi-Ported Data Cache Architecture,SMPDCA).SMPDCA结构具有三个突出的优点：最小的通信延迟、没有Cache一致性维护开销和数据Cache命中率提高。模拟结果表明,与数据Cache私有的结构相比,SMPDCA结构的煅出优点使得应用程序的性能得到了明显的提高,特别是对于改善处理器之间的通信与交互比较多的应用程序的性能具有最为明显的效果。相似文献

2.

SCMP: A Single-Chip Message-Passing Parallel Computer

Baker James M. Gold Brian Bucciero Mark Bennett Sidney Mahajan Rajneesh Ramachandran Priyadarshini Shah Jignesh 《The Journal of supercomputing》2004,30(2):133-149

As technology improves and transistor feature sizes continue to shrink, the effects of on-chip interconnect wire latencies on processor clock speeds will become more important. In addition, as we reach the limits of instruction-level parallelism that can be extracted from application programs, there will be an increased emphasis on thread-level parallelism. To continue to improve performance, computer architects will need to focus on architectures that can efficiently support thread-level parallelism while minimizing the length of on-chip interconnect wires. The SCMP (Single-Chip Message-Passing) parallel computer system is one such architecture. The SCMP system includes up to 64 processors on a single chip, connected in a 2-D mesh with nearest neighbor connections. Memory is included on-chip with the processors and the architecture includes hardware support for communication and the execution of parallel threads. Since there are no global signals or shared resources between the processors, the length of the interconnect wires will be determined by the size of the individual processors, not the size of the entire chip. Avoiding long interconnect wires will allow the use of very high clock frequencies, which, when coupled with the use of multiple processors, will offer tremendous computational power. 相似文献

3.

Compressed tag architecture for low-power embedded cache systems

Jong Wook Kwak Young Tae Jeon 《Journal of Systems Architecture》2010,56(9):419-428

Processors in embedded systems mostly employ cache architectures in order to alleviate the access latency gap between processors and memory systems. Caches in embedded systems usually occupy a major fraction of the implemented chip area. The power dissipation of cache system thus constitutes a significant fraction of the power dissipated by the entire processor in embedded systems. In this paper, we propose the compressed tag architecture to reduce the power dissipation of the tag store in cache systems. We introduce a new tag-matching mechanism by using a locality buffer and a tag compression technique. The main power reduction feature of our proposal is the use of small tag space matching instead of full tag matching, with modest additional hardware costs. The simulation results show that the proposed model provides a power and energy-delay product reduction of up to 27.8% and 26.5%, respectively, while still providing a comparable level of system performance to regular cache systems. 相似文献

4.

A consistency-free memory architecture for sort-last parallel rendering processors

《Journal of Systems Architecture》2007,53(5-6):272-284

Current rendering processors are aiming to process triangles as fast as possible and they have the tendency of equipping with multiple rasterizers to be capable of handling a number of triangles in parallel for increasing polygon rendering performance. However, those parallel architectures may have the consistency problem when more than one rasterizer try to access the data at the same address. This paper proposes a consistency-free memory architecture for sort-last parallel rendering processors, in which a consistency-free pixel cache architecture is devised and effectively associated with three different memory systems consisting of a single frame buffer, a memory interface unit, and consistency-test units. Furthermore, the proposed architecture can reduce the latency caused by pixel cache misses because the rasterizer does not wait until cache miss handling is completed when the pixel cache miss occurs. The experimental results show that the proposed architecture can achieve almost linear speedup upto four rasterizers with a single frame buffer. 相似文献

5.

嵌入式处理器中的Sector Cache:性能与面积的折衷

左琦付宇卓鲁欣《小型微型计算机系统》2006,27(1):166-170

Sector Cache曾经被用于一些最早使用Cache技术的计算机系统中．虽然Sector Cache的性能略差于普通Cache，但同样Cache容量下Sector结构所需的标记位明显少于普通结构．由于嵌入式处理器对芯片面积的要求非常严格，Sector Cache的优点在嵌入式处理器中就更为明显．本文用基于仿真的方法详细分析了Sector结构的Cache在嵌入式应用环境下的性能．仿真结果表明，合理使用Sector结构可以以较小的性能代价有效地减少标记位数量．因此，采用Sector Cache就可以在满足性能要求的前提下尽可能减小Cache控制器的面积．本文认为Sector Cache是嵌入式处理器设计者进行性能／面积折衷有效手段．相似文献

6.

A NUCA Substrate for Flexible CMP Cache Sharing 总被引：1，自引：0，他引：1

Jaehyuk Huh J. Changkyu Kim C. Shafi H. Lixin Zhang L. Burger D. Keckler S.W. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(8):1028-1040

We propose an organization for the on-chip memory system of a chip multiprocessor in which 16 processors share a 16-Mbyte pool of 64 level-2 (L2) cache banks. The L2 cache is organized as a nonuniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support a spectrum of degrees of sharing: unshared, in which each processor owns a private portion of the cache, thus reducing hit latency, and completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We measure the optimal degree of sharing for different cache bank mapping policies and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of 2 or 4 works best across a suite of commercial and scientific parallel workloads. We demonstrate that migratory dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased complexity, especially as per-application cache partitioning strategies are applied. We also evaluate the energy efficiency of each design point in terms of network traffic, bank accesses, and external memory accesses. 相似文献

7.

Nahalal: Cache Organization for Chip Multiprocessors

Guz Z. Keidar I. Kolodny A. Weiser U.C. 《Computer Architecture Letters》2007,6(1):21-24

This paper addresses cache organization in chip multiprocessors (CMPs). We show that in CMP systems it is valuable to distinguish between shared data, which is accessed by multiple cores, and private data accessed by a single core. We introduce Nahalal, an architecture whose novel floorplan topology partitions cached data according to its usage (shared versus private data), and thus enables fast access to shared data for all processors while preserving the vicinity of private data to each processor. Nahalal exhibits significant improvements in cache access latency compared to a traditional cache design. 相似文献

8.

A superlinear speedup region for matrix multiplication

Marjan Gusev Sasko Ristov 《Concurrency and Computation》2014,26(11):1847-1868

The realization of modern processors is based on a multicore architecture with increasing number of cores per processor. Multicore processors are often designed such that some level of the cache hierarchy is shared among cores. Usually, last level cache is shared among several or all cores (e.g., L3 cache) and each core possesses private low level caches (e.g., L1 and L2 caches). Superlinear speedup is possible for matrix multiplication algorithm executed in a shared memory multiprocessor due to the existence of a superlinear region. It is a region where cache requirements for matrix storage of the sequential execution incur more cache misses than in parallel execution. This paper shows theoretically and experimentally that there is a region, where the superlinear speedup can be achieved. We provide a theoretical proof of existence of a superlinear speedup and determine boundaries of the region where it can be achieved. The experiments confirm our theoretical results. Therefore, these results will have impact on future software development and exploitation of parallel hardware on the basis of a shared memory multiprocessor architecture. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

9.

An architecture for high-performance scalable shared-memory multiprocessors exploiting on-chip integration

Acacio M.E. Gonzalez J. Garcia J.M. Duato J. 《Parallel and Distributed Systems, IEEE Transactions on》2004,15(8):755-768

Recent technology improvements allow multiprocessor designers to put some key components inside the processor chip, such as the memory controller, the coherence hardware, and the network interface/router. In this paper, we exploit such integration scale, presenting a novel node architecture aimed at reducing the long L2 miss latencies and the memory overhead of using directories that characterize cc-NUMA machines and limit their scalability. Our proposal replaces the traditional directory with a novel three-level directory architecture, as well as it adds a small shared data cache to each of the nodes of a multiprocessor system. Due to their small size, the first-level directory and the shared data cache are integrated into the processor chip in every node, which enhances performance by saving accesses to the slower main memory. Scalability is guaranteed by having the second and third-level directories out of the processor chip and using compressed data structures. A taxonomy of the L2 misses, according to the actions performed by the directory to satisfy them, is also presented. Using execution-driven simulations, we show that significant latency reductions can be obtained by using the proposed node architecture, which translates into reductions of more than 30 percent in several cases in the application execution time. 相似文献

10.

A single-chip multiprocessor for multimedia: the MVP 总被引：2，自引：0，他引：2

Guttag K. Gove R.J. Van Aken J.R. 《Computer Graphics and Applications, IEEE》1992,12(6):53-64

The multimedia video processor (MVP) architecture, which incorporates a variety of parallel processing techniques to deliver very high performance to a wide range of imaging and graphics applications, is described. The MVP combines, on a single semiconductor chip, multiple fully programmable processors with multiple data streams connected to shared RAMs through a crossbar network. Each of the independent processors can execute many operations in parallel every cycle. The architecture is scalable and supports different numbers of processors to meet the cost and performance requirements of different markets. MVP's target environment and the development of MVP are outlined 相似文献

11.

The Power PC 601 microprocessor

Becker M.C. Allen M.S. Moore C.R. Muhich J.S. Tuttle D.P. 《Micro, IEEE》1993,13(5):54-68

The PowerPC 601 microprocessor, the first of a family of processors based on the PowerPC architecture, is described. The general-purpose processor contains a 32-Kb cache and a superscalar machine organization that allows dispatch and execution of up to three instructions each clock cycle. The bus interface and storage control mechanisms can be configured for a wide range of system designs, from low-cost desktop personal computers to high-performance multi-processor systems. The PowerPC architecture, machine organization, chip packaging technology, and performance are discussed 相似文献

12.

SoMMA: A software-managed memory architecture for multi-issue processors

《Microprocessors and Microsystems》2020

Embedded processors rely on the efficient use of instruction-level parallelism to answer the performance and energy needs of modern applications. Though improving performance is the primary goal for processors in general, it might lead to a negative impact on energy consumption, a particularly critical constraint for current systems. In this paper, we present SoMMA, a software-managed memory architecture for embedded multi-issue processors that can reduce energy consumption and energy-delay product (EDP), while still providing an increase in memory bandwidth. We combine the use of software-managed memories (SMM) with the data cache, and leverage the lower energy access cost of SMMs to provide a processor with reduced energy consumption and EDP. SoMMA also provides a better overall performance, as memory accesses can be performed in parallel, with no cost in extra memory ports. Compiler-automated code transformations minimize the programmer's effort to benefit from the proposed architecture. The approach shows average speedups of 1.118x and 1.121x, while consuming up to 11% and 12.8% less energy when comparing two modified ρVEX processors and their baselines, at full-system level comparisons. SoMMA also shows reduction of up to 41.5% on full-system EDP, maintaining the same processor area as baseline processors. 相似文献

13.

A software approach to avoiding spatial cache collisions inparallel processor systems

Wong D.C. Davis E.W. Young J.O. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(6):601-608

In parallel processor systems, the performance of individual processors is a key factor in overall performance. Processor performance is strongly affected by the behavior of cache memory in that high hit rates are essential for high performance. Hit rates are lowered when collisions on placing lines in the cache force a cache line to be replaced before it has been used to best effect. Spatial cache collisions occur if data structures and data access patterns are misaligned. We describe a mathematical scheme to improve alignment and enhance performance in applications which have moderate-to-large numbers of arrays, where various dimensionalities are involved in localized computation and array access patterns are sequential. These properties are common in many computational modeling applications. Furthermore, the scheme provides a single solution when an application is targeted to run on various numbers of processors in power-of-two sizes. The applicability of the proposed scheme is demonstrated on testbed code for an air quality modeling problem 相似文献

14.

A star network approach in heterogeneous multiprocessors system on chip

Chao Wang Xi Li Junneng Zhang Xuehai Zhou Aili Wang 《The Journal of supercomputing》2012,62(3):1404-1424

Multiprocessor System on Chip (MPSoC) platform plays a vital role in parallel processor architecture design. However, with the growing number of processors, interconnect on chip is becoming one of the major bottlenecks of MPSoC architecture. In this paper, we propose a star network based on peer to peer links on FPGA. The star network utilizes fast simplex links (FSL) as basic structure to connect the scheduler with heterogeneous processing elements, including processors and hardware IP cores. Blocking and nonblocking application interfaces are provided for high level programming. We built a prototype system on FPGA to evaluate the transfer time and hardware cost of the proposed star network architecture. Experiment results demonstrated that the average transfer time for each word could be reduced to 7 cycles, which achieves 14× speedup against state-of-the-art shared memory literatures. Moreover, the star network cost only 1.2?% Flip Flops and 2.45?% LUTs of a single FPGA. 相似文献

15.

Pixel processing in a memory controller

Donovan W. Sabella P. Kabir I. Hsieh M.M. 《Computer Graphics and Applications, IEEE》1995,15(1):51-61

The SX-a programmable pixel processor implemented in a workstation memory controller chip-aims to perform as well as low-end 2D and 3D graphics processors and to surpass low-end imaging accelerators. The following features help accomplish this goal: large internal register set; vectorized RISC-like instruction set; fast access to both main and video memory; fast pixel operations; free operations; unpolluted cache; and single-chip solution. We describe the workstation configuration we used for our tests and the SX processor architecture, followed by the SX instruction set and sample algorithms. Then we present SX performance results for a wide range of operations 相似文献

16.

嵌入式处理器中降低Cache缺失代价设计方法研究 总被引：2，自引：0，他引：2

黄海林许彤范东睿唐志敏《小型微型计算机系统》2006,27(11):2077-2081

以龙芯1号处理器为研究对象，探讨了嵌入式处理器中降低Cache缺失代价的设计方法．通过分析处理器的结构特征，本文实现了在关键字优先基础上一次缺失下命中的非阻塞数据Cache，可以将处理器平均性能提高3．9％,同时利用局部性原理，在关键字优先非阻塞数据Cache的基础上，本文提出了一种类非阻塞的指令Cache设计方法，可以降低指令Cache的缺失代价，以较小的实现代价进一步将处理器平均性能提高7．7％．通过本文的工作，可以同时降低指令Cache和数据Cache的缺失代价，处理器的平均性能提高了11．6％．相似文献

17.

The Stanford Hydra CMP 总被引：5，自引：0，他引：5

Hammond L. Hubbert B.A. Siu M. Prabhu M.K. Chen M. Olukolun K. 《Micro, IEEE》2000,20(2):71-84

The Hydra chip multiprocessor (CMP) integrates four MIPS-based processors and their primary caches on a single chip together with a shared secondary cache. A standard CMP offers implementation and performance advantages compared to wide-issue superscalar designs. However, it must be programmed with a more complicated parallel programming model to obtain maximum performance. To simplify parallel programming, the Hydra CMP supports thread-level speculation and memory renaming, a paradigm that allows performance similar to a uniprocessor of comparable die area on integer programs. This article motivates the design of a CMP, describes the architecture of the Hydra design with a focus on its speculative thread support, and describes our prototype implementation. Chip multiprocessors offer an economical, scalable architecture for future microprocessors. Thread-level speculation support allows them to speed up past software 相似文献

18.

Architectural support for thread communications in multi-core processors

Sevin Varoglu Stephen Jenks 《Parallel Computing》2011,37(1):26-41

In the ongoing quest for greater computational power, efficiently exploiting parallelism is of paramount importance. Architectural trends have shifted from improving single-threaded application performance, often achieved through instruction level parallelism (ILP), to improving multithreaded application performance by supporting thread level parallelism (TLP). Thus, multi-core processors incorporating two or more cores on a single die have become ubiquitous. To achieve concurrent execution on multi-core processors, applications must be explicitly restructured to exploit parallelism, either by programmers or compilers. However, multithreaded parallel programming may introduce overhead due to communications among threads. Though some resources are shared among processor cores, current multi-core processors provide no explicit communications support for multithreaded applications that takes advantage of the proximity between cores. Currently, inter-core communications depend on cache coherence, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we explore two approaches to improve communications support for multithreaded applications. Prepushing is a software controlled data forwarding technique that sends data to destination’s cache before it is needed, eliminating cache misses in the destination’s cache as well as reducing the coherence traffic on the bus. Software Controlled Eviction (SCE) improves thread communications by placing shared data in shared caches so that it can be found in a much closer location than remote caches or main memory. Simulation results show significant performance improvement with the addition of these architecture optimizations to multi-core processors. 相似文献

19.

双簇结构DSP的数据Cache优化

下载免费PDF全文

马鹏勇陈书明孙锁林《计算机工程与科学》2008,30(9):119-121

数字信号处理常常包含大量数据运算,这使得数据Cache成为影响其性能的关键因素。特别是对于我们研制的双簇VLIW结构YHFrDSP系列处理器,Cache的失效会导致整个内核八条流水线同时停顿。所以,减小Cache失效延迟能给处理器性能带来显著的提升。本文研究的主要问题是如何针对一级数据Cache的读失效操作进行优化,从四个方面进行, 分别为提前发读请求、请求字优先、合并并行失效读和后台处理Snooping。模拟结果表明,采用这些优化措施后,处理器的性能提高了8．36％。相似文献

20.

Large scale scientific computing - future directions

G.S. Patterson 《Computer Physics Communications》1982,26(3-4):217-225

Every new generation of scientific computers has opened up new areas of science for exploration through the use of more realistic numerical models or the ability to process ever larger amounts of data. Concomitantly, scientists, because of the success of past models and the wide range of physical phenomena left unexplored, have pressed computer designers to strive for the maximum performance that current technology will permit. This encompasses not only increased processor speed, but also substantial improvements in processor memory, I/O bandwidth, secondary storage and facilities to augment the scientist's ability both to program and to understand the results of a computation. Over the past decade, performance improvements for scientific calculations have come from algoeithm development and a major change in the underlying architecture of the hardware, not from significantly faster circuitry. It appears that this trend will continue for another decade. A future archetectural change for improved performance will most likely be multiple processors coupled together in some fashion. Because the demand for a significantly more powerful computer system comes from users with single large applications, it is essential that an application be efficiently partitionable over a set of processors; otherwise, a multiprocessor system will not be effective. This paper explores some of the constraints on multiple processor architecture posed by these large applications. In particular, the trade-offs between large numbers of slow processors and small numbers of fast processors is examined. Strategies for partitioning range from partitioning at the language statement level (in-the-small) and at the program module level (in-the-large). Some examples of partitioning in-the-large are given and a strategy for efficiently executing a partitioned program is explored. 相似文献