期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Exploring grouped coherence for clustered hierarchical cache

Sensen Hu Feng Shi Weixing Ji Xu Chen Shahnawaz Talpur 《The Journal of supercomputing》2017,73(9):4137-4157

The industry trends for processors are toward integrating an increasing number of cores into a single chip. Researchers have to deal with frequent data migration across network-on-chip and the increasing on-chip traffic. The innovation from flat to hierarchy is probably a natural design methodology for scalable systems (Martin et al. in Commun ACM, 55(7):78–89, 2012. doi: 10.1145/2209249.2209269). Unfortunately, the alternative of hierarchical directory protocol inevitably leads to on-chip traffic overhead, protocol complexity and access latency. In this paper, we target hierarchical cache coherence protocol to overcome the potentially high cost of maintaining cache coherence in current multicore processors. We propose a novel vertical caching protocol combined with grouped coherence, in which the coherence domain expand on demand. More specifically, its design philosophy is to provide a ‘best-effort’ single-copy delivery which allows the shared data only in the first common shared level. Compared to the previous hierarchical protocol, our proposal is able to achieve the performance improvement of 9.9% in the 16-core system and 13.4% in the 64-core system as well as an on-chip traffic reduction of about 10.8% in the 16-core system and 15.9% in the 64-core system, respectively. 相似文献

2.

DEAM: Decoupled,Expressive, Area-Efficient Metadata Cache

下载免费PDF全文

Peng Liu Lei Fang Michael C. Huang 《计算机科学技术学报》2014,29(4):679-691

Chip multiprocessor presents brand new opportunities for holistic on-chip data and coherence management solutions. An intelligent protocol should be adaptive to the fine-grain accessing behavior. And in terms of storage of metadata, the size of conventional directory grows as the square of the number of processors, making it very expensive in large-scale systems. In this paper, we propose a metadata cache framework to achieve three goals: 1) reducing the latency of data access and coherence activities, 2) saving the storage of metadata, and 3) providing support for other optimization techniques. The metadata is implemented with compact structures and tracks the dynamically changing access pattern. The pattern information is used to guide the delegation and replication of decoupled data and metadata to allow fast access. We also use our metadata cache as a building block to enhance stream prefetching. Using detailed execution-driven simulation, we demonstrate that our protocol achieves an average speedup of 1.12X compared with a shared cache protocol with 1/5 of the storage of metadata. 相似文献

3.

多核处理器Cache一致性协议关键技术研究

黄安文张民选《计算机工程与科学》2009,31(Z1)

多核处理器规模的不断扩大和核间通信机制的日益复杂,使得Cache一致性维护变得更加困难。本文从多核处理器Cache一致性问题的产生背景出发,分析监听协议、目录协议、Token协议和Hammer协议的实现机制以及在多核环境中的优缺点,分别从一致性协议与片上互连结构协同设计、面向低功耗应用的协议优化策略、Cache一致性协议验证及容错机制等角度考虑,对未来多核处理器Cache一致性协议设计的发展趋势和技术挑战进行详细分析与讨论。相似文献

4.

Two proposals for the inclusion of directory information in the last-level private caches of glueless shared-memory multiprocessors

Alberto Ros Ricardo Fernández-Pascual Manuel E. Acacio José M. García 《Journal of Parallel and Distributed Computing》2008

In glueless shared-memory multiprocessors where cache coherence is usually maintained using a directory-based protocol, the fast access to the on-chip components (caches and network router, among others) contrasts with the much slower main memory. Unfortunately, directory-based protocols need to obtain the sharing status of every memory block before coherence actions can be performed. This information has traditionally been stored in main memory, and therefore these cache coherence protocols are far from being optimal. In this work, we propose two alternative designs for the last-level private cache of glueless shared-memory multiprocessors: the lightweight directory and the SGluM cache. Our proposals completely remove directory information from main memory and store it in the home node’s L2 cache, thus reducing both the number of accesses to main memory and the directory memory overhead. The main characteristics of the lightweight directory are its simplicity and the significant improvement in the execution time for most applications. Its drawback, however, is that the performance of some particular applications could be degraded. On the other hand, the SGluM cache offers more modest improvements in execution time for all the applications by adding some extra structures that cope with the cases in which the lightweight directory fails. 相似文献

5.

Two-level caches tuning technique for energy consumption in reconfigurable embedded MPSoC

《Journal of Systems Architecture》2013,59(8):656-666

In order to meet the ever-increasing computing requirement in the embedded market, multiprocessor chips were proposed as the best way out. In this work we investigate the energy consumption in these embedded MPSoC systems. One of the efficient solutions to reduce the energy consumption is to reconfigure the cache memories. This approach was applied for one cache level/one processor architecture, but has not yet been investigated for multiprocessor architecture with two level caches. The main contribution of this paper is to explore two level caches (L1/L2) multiprocessor architecture by estimating the energy consumption. Using a simulation platform, we first built a multiprocessor architecture, and then we propose a new algorithm that tunes the two-level cache memory hierarchy (L1 and L2). The tuning caches approach is based on three parameters: cache size, line size, and associativity. To find the best cache configuration, the application is divided into several execution intervals. And then, for each interval, we generate the best cache configuration. Finally, the approach is validated using a set of open source benchmarks; Spec 2006, Splash-2, MediaBench and we discuss the performance in terms of speedup and energy reduction. 相似文献

6.

Filtering directory lookups in CMPs

A. Bosque V. Viñals P. Ibáñez J.M. Llaber?´aAuthor vitae 《Microprocessors and Microsystems》2011,35(8):695-707

Coherence protocols consume an important fraction of power to determine which coherence action to perform. Specifically, on CMPs with shared cache and directory-based coherence protocol implemented as a duplicate of local caches tags, we have observed that a big fraction of directory lookups cause a miss, because the block looked up is not allocated in any local cache. To reduce the number of directory lookups and therefore the power consumption, we propose to add a filter before the directory access.We introduce two filter implementations. In the first one, filtering information is explicitly kept in the shared cache for every block. In the second one, filtering information is decoupled from the shared cache organization, so the filter size does not depend on the shared cache size.We evaluate our filters in a CMP with 8 in-order processors with 4 threads each and a memory hierarchy with write-through local caches and a shared cache. We show that, for SPLASH2 benchmarks, the proposed filters reduce the number of directory lookups performed by 60% while power consumption is reduced by ∼28%. For Specweb2005, the number of directory lookups performed is reduced by 68% (44%), while directory power consumption is reduced by 19% (9%) using the first (second) filter implementation. 相似文献

7.

环连接CMP的缓存一致性协议

曹非刘志勇《计算机研究与发展》2009,46(Z2)

片上多核处理器(CMP)已经成为处理器发展的方向,处理器设计的重点也转到了互连网络和存储层次结构方面,其中的一个关键问题是如何维护各处理器各级缓存(Cache)的一致性,该问题在传统的共享存储多处理器中使用Cache一致性协议来解决,而CMP相对于传统的多处理器结构具有更高的片上互连带宽和速度,给Cache一致协议提出了新的要求,也提供了新的改进机会.传统的总线侦听协议存在可扩展性不足和不必要的广播、侦听过多的缺点,而目录协议则存在失效间接延时大和复杂度高、验证困难等问题.环形连接的可扩展性好于总线结构,而其实现复杂度也远小于通常目录协议所使用的包交换点到点网络.将基于环的侦听协议应用于CMP;并考虑利用环的顺序性取消原有协议中冲突引起的重发操作,消除可能的饥饿、死锁和活锁等情况,增加协议的稳定性,同时减少消息流量和功耗;利用片上互连延时短的特点,将侦听结果和侦听请求同时传播,使得处理器可以根据侦听结果来对侦听请求进行选择性的侦听操作,可减少不必要的侦听操作,降低功耗. 相似文献

8.

A Lock-Based Cache Coherence Protocol for Scope Consistency 总被引：5，自引：2，他引：5

下载免费PDF全文

Hu Weiwu Shi Weisong Tang Zhimin Li Ming 《计算机科学技术学报》1998,13(2):97-109

Directory protocols are widely adopted to maintain cache coherence of distributed shared memory multiprocessors.Although scalable to a certain extent,directory protocols are complex enough to prevent it from being used in very large scale multiprocessors with tens of thousands of nodes.his paper proposes a lock-based cache coherence protocol for scope consistency.In does not rely on directory information to maintain cache coherence.Instead,cache coherence is maintained through requiring the releasing processor of a lock to stroe all write-notices generated in the associated critical section to the lock and the acquiring processor invalidates or updates its locally cached data copies according to the write notices of the lock.To evaluate the performance of the lock-based cache coherence protocol,a software SDM system named JIAJIA is built on network of workstations.Besides the lock-based cache coherence protocol,JIAJIA also characterizes itself with its shared memory organization scheme which combines the physical memories of multiple workstations to form a large shared space.Performance measurements with SPLASH2 program suite and NAS benchmarks indicate that,compared to recent SVM systems such as CVM,higher speedup is achieved by JIAJIA.Besides,JIAJIA can solve large scale problems that cannot be solved by other SVM systems due to memory size limitation. 相似文献

9.

To be silent or not: on the impact of evictions of clean data in cache-coherent multicores

Ricardo Fernández-Pascual Alberto Ros Manuel E. Acacio 《The Journal of supercomputing》2017,73(10):4428-4443

Maintaining coherence across hundreds or even thousands of cores is not an easy task. Among all of the proposed solutions until now, directory-based cache coherence has been advocated as the most feasible way of beating the scalability hurdles that arise at such large scale. Thanks to the knowledge accumulated during the last four decades, there is general consensus on the impact of most of the design aspects of directory coherence on performance, energy consumption and cost. However, there is one subtle design point for which we have observed some divergences in contemporary research works on cache-coherent multicores. Specifically, while some recent works assume a silent replacement policy for evictions of clean data in the last-level private caches, others implement just the opposite that we call a noisy replacement policy, and even others do not mention how these evictions are managed. In this work, we put this important aspect into the spotlight, demonstrating that the way in which evictions of clean data are managed can have important influence on the performance and energy consumption of a directory-based cache coherence protocol. We show that the noisy replacement policy leads to a significant increase in the total traffic (around 20% in several cases, 9.6% on average) compared with the silent policy. Given the important fraction of the total power budget that the on-chip interconnection network of future manycores is expected to consume, assuming the silent replacement policy for clean data will lead to non-negligible energy savings. Moreover, and what is more important, we have observed that depending on the particular directory structure used, assuming silent replacements could affect performance or not. This means that the use of noisy replacements is not justified in all cases, since it would increase unnecessarily network traffic without leading to any performance advantages. 相似文献

10.

Synergistic design of an application-oriented sparse directory on many-core embedded systems

《Journal of Systems Architecture》2017

As many-core embedded systems are evolving from single-memory based designs to systems-on-a-chip running on an on-chip network, implementing a cache coherence mechanism in large-scale many-core embedded systems turns out to be a technical challenge. However, existing coherence mechanisms are difficult to scale beyond tens of cores, which require either excessive area or energy, complex hierarchical protocols, or inexact representations of sharer sets. In this paper, we present a hardware-software synergistic design of a cache coherence mechanism by considering OS-level application allocation and hardware-level coherence operations. The proposed application-oriented sparse directory (AoSD) cooperates with a contiguous allocation algorithm to isolate cache coherence traffic and thereby reduce interferences among applications. The proposed micro-architecture of sharer set representations is area-efficient; moreover, it can also be configured dynamically to track a flexible and exact sharer set. We verify our design by analyzing memory requirements of different cache organizations and implementing our design on a popular simulator Graphite to evaluate cache coherence traffic improvement. The results show that our design is both area-efficient and efficient with improvements in memory network performance by 11.74%–28.72%. It is also indicated that our design is feasible to scale up to work well in thousands-of-cores embedded systems. 相似文献

11.

分布式文件系统中Cache一致性的验证

王建勇祝明发《计算机学报》1999,22(5):460-466

基于信念逻辑,分析了曙光超级服务器单一映象文件系统中所采用的基于目录的无效使能Ｃａｃｈｅ一致性协议,首先介绍了Ｃａｃｈｅ一致性协议的目标,并为之建立了系统模型及逻辑,然后以基于目录的无效使能协议为例演示了运用信念逻辑对Ｃａｃｈｅ一致性进行正确性证明的过程。相似文献

12.

多核Cache稀疏目录性能提升方法综述

吴健虢陈海燕刘胜邓让钰陈俊杰《计算机工程与科学》2019,41(3):385-392

受限于功耗,十多年前通用微处理器就停止追求更高的主频转而向集成更多处理器核的方向发展;同时,随着晶体管密度按摩尔定律不断提高,单片可集成的处理器核数成倍增长,片上多核、众核处理器已成为高性能微处理器发展的主流。未来千核级通用众核处理器支持共享存储编程模型是一种必然趋势,但传统的Cache一致性目录结构面临着查找延迟高、目录项替换频繁以及硬件代价和功耗可扩展性有限等问题。稀疏目录实现了传统目录结构硬件开销与一致性维护效率的折衷,被认为是众核处理器维护Cache一致性的一种高能效、可扩展结构。综述了近年来提高稀疏目录性能的相关研究与方法,并对其在面积、访问延迟、功耗和实现复杂性等方面进行分析,归纳出这些方法各自的优点和存在的不足,对创新设计未来高性能众核处理器共享存储体系结构具有一定的参考价值。相似文献

13.

A scalable organization for distributed directories

Alberto Ros Manuel E. Acacio José M. García 《Journal of Systems Architecture》2010,56(2-3):77-87

Although directory-based cache-coherence protocols are the best choice when designing chip multiprocessors with tens of cores on-chip, the memory overhead introduced by the directory structure may not scale gracefully with the number of cores. Many approaches aimed at improving the scalability of directories have been proposed. However, they do not bring perfect scalability and usually reduce the directory memory overhead by compressing coherence information, which in turn results in extra unnecessary coherence messages and, therefore, wasted energy and some performance degradation. In this work, we present a distributed directory organization based on duplicate tags for tiled CMP architectures whose size is independent on the number of tiles of the system up to a certain number of tiles. We demonstrate that this number of tiles corresponds to the number of sets in the private caches. Additionally, we show that the area overhead of the proposed directory structure is 0.56% with respect to the on-chip data caches. Moreover, the proposed directory structure keeps the same information than a non-scalable full-map directory. Finally, we propose a mechanism that takes advantage of this directory organization to remove the network traffic caused by replacements. This mechanism reduces total traffic by 15% for a 16-core configuration compared to a traditional directory-based protocol. 相似文献

14.

Design and formal verification of a hierarchical cache coherence protocol for NoC based multiprocessors

Hemangee K. Kapoor Praveen Kanakala Malti Verma Shirshendu Das 《The Journal of supercomputing》2013,65(2):771-796

Advancement in semiconductor technology is allowing to pack more and more processing cores on a single die and scalable directory based protocols are needed for maintaining cache coherence. Most of the currently available directory based protocols are designed for mesh based topology and have the problem of delay and scalability. Cluster based coherence protocol is a better option than flat directory based protocol but the problem of mesh based topology is still exits. On the other hand, tree based topology takes fewer hop counts compared to mesh based topology. In this paper we give a hierarchical cache coherence protocol based on tree based topology. We divide the processing cores into clusters and each cluster shares a higher-level cache. At the next level we form clusters of caches connected to yet another higher-level cache. This is continued up to the top level cache/memory. We give various architectural placements that can benefit from the protocol; hop-count comparison; and memory overhead requirements. Finally, we formally verify the protocol using the Mur? tool. 相似文献

15.

CMP中基于目录的协作Cache设计方案

下载免费PDF全文

赵小雨吴俊敏隋秀峰王庆波唐轶轩《计算机工程》2010,36(21):283-285

片上多处理器中二级Cache的设计和管理是影响其性能的关键因素之一。在私有二级Cache的基础上,提出一种基于集中式一致性目录的协作Cache设计方案,通过有效地管理片上存储资源来优化处理器的性能,从而使该协作Cache具有平均访存延迟小、Cache缺失率低、可扩展性好等优点。实验结果显示,与共享二级Cache设计相比,协作Cache可以将4核处理器的吞吐量平均提高13.5%,而其硬件开销约为8.1%。相似文献

16.

Multiple-bus shared-memory system: Aquarius project

Carlton M. Despain A. 《Computer》1990,23(6):80-83

A multiple-bus architecture called a multi-multi is presented. The architecture is designed to handle several dimensions with a moderate number of processors per bus. It provides scaling to a large number of processors in a system. A key characteristic of the architecture is the large amount of bandwidth it provides. Each node in the architecture contains a microprocessor, memory, and a cache. The cache-coherence protocol for the multi-multi architecture combines features of snooping cache schemes, to provide consistency on individual buses, with features of directory schemes, to provide consistency between buses. The snooping cache component can take advantage of the low-latency communication possible on shared buses for efficiency, yet the complete protocol will support many more processors than a single bus can. The resulting protocol naturally extends cache coherence from a multi to a multi-multi. Cache and directory states are described. Concepts that allow efficient performance, namely, local sharing, root node, and bus addresses in the directory, are discussed 相似文献

17.

Improving memory utilization in cache coherence directories

Lilja D.J. Yew P.-C. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(10):1130-1146

Efficiently maintaining cache coherence is a major problem in large-scale shared memory multiprocessors. Hardware directory coherence schemes have very high memory requirements, while software-directed schemes must rely on imprecise compile-time memory disambiguation. Recently proposed dynamically tagged directory schemes allocate pointers to blocks only as they are referenced, which significantly reduces their memory requirements, but they still allocate pointers to blocks that do not need them. The authors present two compiler optimizations that exploit the high-level sharing information available to the compiler to further reduce the size of a tagged directory by allocating pointers only when necessary. Trace-driven simulations are used to show that the performance of this combined hardware-software approach is comparable to other coherence schemes, but with significantly lower memory requirements. In addition, these simulations suggest that this approach is less sensitive to the quality of the memory disambiguation and interprocedural analysis performed by the compiler than software-only coherence schemes 相似文献

18.

片上多核处理器Cache一致性协议优化研究综述

胡森森计卫星王一拙陈旭付文飞石峰《软件学报》2017,28(4):1027-1047

现代晶体管技术在单芯片上集成多个处理器已经成为现实.近年来,随着多核处理器集成核数的不断增加,高速缓存的一致性问题凸显出来,已成为多核处理器的性能瓶颈之一,亟待解决.本文介绍了片上多核处理器一致性问题的由来.总结了多核时代高速缓存一致性协议设计的关键问题,综述了近年来学术界对一致性的研究.从程序访存行为模式、目录组织结构、一致性粒度、一致性协议流量、目录协议的可扩展性等方面,阐述了近年来缓存一致性协议性能优化的方向.对目前片上多核处理器缓存一致性协议设计中存在的问题进行了讨论,并指出了未来进一步研究的方向. 相似文献

19.

An on-chip cache compression technique to reduce decompression overhead and design complexity

Jang-Soo Won-Kee Shin-Dug 《Journal of Systems Architecture》2000,46(15):1365-1382

This research explores a compressed memory hierarchy model which can increase both the effective memory space and bandwidth of each level of memory hierarchy. It is well known that decompression time causes a critical effect to the memory access time and variable-sized compressed blocks tend to increase the design complexity of the compressed memory systems. This paper proposes a selective compressed memory system (SCMS) incorporating the compressed cache architecture and its management method. To reduce or hide decompression overhead, this SCMS employs several effective techniques, including selective compression, parallel decompression and the use of a decompression buffer. In addition, fixed memory space allocation method is used to achieve efficient management of the compressed blocks. Trace-driven simulation shows that the SCMS approach can not only reduce the on-chip cache miss ratio and data traffic by about 35% and 53%, respectively, but also achieve a 20% reduction in average memory access time (AMAT) over conventional memory systems (CMS). Moreover, this approach can provide both lower memory traffic at a lower cost than CMS with some architectural enhancement. Most importantly, the SCMS is a more attractive approach for future computer systems because it offers high performance in cases of long DRAM latency and limited bus bandwidth. 相似文献

20.

Characterization and Evaluation of Cache Hierarchies for Web Servers

Iyer Ravi 《World Wide Web》2004,7(3):259-280

As Internet usage continues to expand rapidly, careful attention needs to be paid to the design of Internet servers for achieving high performance and end-user satisfaction. Currently, the memory system continues to remain a significant performance bottleneck for Internet servers employing multi-GHz processors. In this paper, our aim is two-fold: (1) to characterize the cache/memory performance of web server workloads and (2) to propose and evaluate cache design alternatives for future web servers. We chose SPECweb99 as the representative web server workload and our entire characterization and evaluation methodology is based on our CASPER simulation framework. We begin by exploring the processor cache design space for single and dual-processor servers. Based on our observations, we then evaluate other cache hierarchy alternatives such as chipset caches, coherence filters and decompressed page stores. We show the sensitivity of these components to basic organization parameters such as cache size, line size and degree of associativity. We also present the performance implications of routing memory requests initiated by I/O devices through these caches. Based on detailed simulation data and its implications on system level performance, this paper shows that chipset caches have significant potential for improving future web server performance. 相似文献