期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

基于新型Cache一致性协议的共享虚拟存储系统 总被引：11，自引：2，他引：9

胡伟武施巍松唐志敏《计算机学报》1999,22(5):467-475

介绍了一个基于新型Ｃａｃｈｅ一致性协议的共享虚拟存储系统ＪＩＡＪＩＡ,与目前国际上具有代表性的共享虚拟存储系统相比,ＪＩＡＪＩＡ采用了基于ＵＮＭＡ的结构,能够把多个机器的物理地址空间组织成一个更大的共享虚拟地址空间,此外,ＪＩＡＪＩＡ实现了一种基于锁的新型一致性协议,通过附带在锁上的ｗｒｉｔｅ－ｎｏｔｉｃｅ来维护一致性,从而避免了传统的目录协议中由目录引起的存储开销和系统复杂度,利用一些被广泛使用相似文献

2.

Two proposals for the inclusion of directory information in the last-level private caches of glueless shared-memory multiprocessors

Alberto Ros Ricardo Fernández-Pascual Manuel E. Acacio José M. García 《Journal of Parallel and Distributed Computing》2008

In glueless shared-memory multiprocessors where cache coherence is usually maintained using a directory-based protocol, the fast access to the on-chip components (caches and network router, among others) contrasts with the much slower main memory. Unfortunately, directory-based protocols need to obtain the sharing status of every memory block before coherence actions can be performed. This information has traditionally been stored in main memory, and therefore these cache coherence protocols are far from being optimal. In this work, we propose two alternative designs for the last-level private cache of glueless shared-memory multiprocessors: the lightweight directory and the SGluM cache. Our proposals completely remove directory information from main memory and store it in the home node’s L2 cache, thus reducing both the number of accesses to main memory and the directory memory overhead. The main characteristics of the lightweight directory are its simplicity and the significant improvement in the execution time for most applications. Its drawback, however, is that the performance of some particular applications could be degraded. On the other hand, the SGluM cache offers more modest improvements in execution time for all the applications by adding some extra structures that cope with the cases in which the lightweight directory fails. 相似文献

3.

环连接CMP的缓存一致性协议

曹非刘志勇《计算机研究与发展》2009,46(Z2)

片上多核处理器(CMP)已经成为处理器发展的方向,处理器设计的重点也转到了互连网络和存储层次结构方面,其中的一个关键问题是如何维护各处理器各级缓存(Cache)的一致性,该问题在传统的共享存储多处理器中使用Cache一致性协议来解决,而CMP相对于传统的多处理器结构具有更高的片上互连带宽和速度,给Cache一致协议提出了新的要求,也提供了新的改进机会.传统的总线侦听协议存在可扩展性不足和不必要的广播、侦听过多的缺点,而目录协议则存在失效间接延时大和复杂度高、验证困难等问题.环形连接的可扩展性好于总线结构,而其实现复杂度也远小于通常目录协议所使用的包交换点到点网络.将基于环的侦听协议应用于CMP;并考虑利用环的顺序性取消原有协议中冲突引起的重发操作,消除可能的饥饿、死锁和活锁等情况,增加协议的稳定性,同时减少消息流量和功耗;利用片上互连延时短的特点,将侦听结果和侦听请求同时传播,使得处理器可以根据侦听结果来对侦听请求进行选择性的侦听操作,可减少不必要的侦听操作,降低功耗. 相似文献

4.

基于目录的Cache一致性协议的可扩展性研究

下载免费PDF全文

潘国腾窦强谢伦国《计算机工程与科学》2008,30(6):131-133

基于CC-NUMA结构的DSM多处理器系统是大规模高性能并行计算机的一个实现方式,由于比监听协议具有更好的扩展性,系统多采用基于目录的Cache一致性协议。但是,随着系统规模的不断扩大,目录协议同样面临着可扩展性的问题。本文在分析影响目录协议可扩展性因素的基础上,对当前比较典型的几种目录组织形式从存储开销方面进行了讨论,最后提出了基于目录Cache的两级目录组织方案。相似文献

5.

On the Correctness of Program Execution When Cache Coherence Is Maintained Locally at Data-Sharing Boundaries in Distributed Shared Memory Multiprocessors

H. Sarojadevi S. K. Nandy S. Balakrishnan 《International journal of parallel programming》2004,32(5):415-446

Emerging multiprocessor architectures such as chip multiprocessors, embedded architectures, and massively parallel architectures, demand faster, more efficient, and more scalable cache coherence schemes. In devising more cost-efficient schemes, formal insights into a system model is deemed useful. We, in this paper, build formalisms for execution in cache based Distributed shared-memory multiprocessors (DSM) obeying Release Consistency model, and derive conditions for cache coherence. A cost-efficient cache coherence scheme without directories is designed. Our approach relies on processor directed coherence actions, which are early in nature. The scheme exploits sharing information provided by a programmer-centric framework. Per-processor coherence buffers (CB) are employed to impose coherence on live shared variables between consecutive release points in the execution. Simulation of 8 entry 4-way associative CB based system achieves a speedup of 1.07–4.31 over full-map 3-hop directory scheme for six of the SPLASH-2 benchmarks. 相似文献

6.

DP&TB: a coherence filtering protocol for many-core chip multiprocessors

Fengkai Yuan Zhenzhou Ji 《The Journal of supercomputing》2013,66(1):249-261

Future many-core chip multiprocessors (CMPs) will integrate hundreds of processor cores on chip. Two cache coherence protocols are the mainstream applied to current CMPs. The token-based protocol (Token) provides high performance, but it generates a prohibitive amount of network traffic, which translates into excessive power consumption. The directory-based protocol (Directory) reduces network traffic, yet trades off with the storage overhead of the directory as well as entails comparatively low performance caused by indirection limiting its applicability for many-core CMPs. In this work, we present DP&TB, a novel cache coherence protocol particularly suited to future many-core CMPs. In DP&TB, cache coherence is maintained at the granularity of a page, facilitating to filter out either unnecessary coherence inspections for blocks inside private pages or network traffic for blocks inside shared pages. We employ Directory to detect private and shared pages and Token to maintain the coherence of the blocks inside shared pages. DP&TB inherits the merit of Directory and Token and overcome their problems. Experimental results show that DP&TB comprehensively beyond Directory and Token with improvement by 9.1 % in performance over Token and by 13.8 % in network traffic over Directory. In addition, the storage overhead of DP&TB is less than half of that of Directory. Our proposal can fulfill the requirement of many-core CMPs to achieve high performance, power and area efficiency. 相似文献

7.

CCNoC: Cache-Coherent Network on Chip for Chip Multiprocessors 总被引：1，自引：1，他引：0

下载免费PDF全文

王惊雷薛一波王海霞李崇民汪东升《计算机科学技术学报》2010,25(2):257-266

As the number of cores in chip multiprocessors(CMPs) increases,cache coherence protocol has become a key issue in integration of chip multiprocessors.Supporting cache coherence protocol in large chip multiprocessors still faces three hurdles:design complexity,performance and scalability.This paper proposes Cache Coherent Network on Chip(CCNoC),a scheme that decouples cache coherency maintenance from processors and shared L2 caches and implements it completely in network on chip to free up processors and ... 相似文献

8.

Integrated Memory Controllers with Parallel Coherence Streams

Chaudhuri M. Heinrich M. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(8):1159-1173

Previous work in scalable hardware distributed shared memory (DSM) multiprocessors has established the critical and dominant role that protocol processing bandwidth (or its inverse, occupancy) plays in determining overall performance in architectures with standalone memory/coherence controllers. However, with recent architectural trends toward integrated (on-chip) memory controllers and the well-known fact that processor frequency is increasing more rapidly than memory systems, we must ask whether parallel coherence processing engines (either multiple integrated protocol processors/cores or multiple protocol threads) are needed in DSM machines constructed from modern processor architectures and, if so, when. We construct a useful analytical model to give the designer insight into when parallel coherence streams will improve performance and verify our model via detailed simulation on 64-threaded microbenchmarks and parallel applications and on single-node multiprogrammed workloads. Surprisingly, and contrary to related work, we find that, in these architectures, adding a second coherence engine has almost no impact on performance. Further, for less-tuned applications that suffer from hot spots (contentious requests to the same memory line), additional engines offer no benefit whatsoever. Even with double the memory bandwidth (or channels), an additional coherence processing stream yields only slight performance improvement. Only for a special class of DSM machines employing directoryless broadcast protocols over unordered interconnects does parallel "snoop" processing offer reasonable performance improvement for communication-intensive applications. Overall, given the architectural trends, this is good news for DSM designers who want to minimize the resources necessary (protocol threads or integrated protocol processor cores for maintaining internode coherence, respectively) to create SMTp-based or multi-CMP-based scalable DSM machines using directory protocols. 相似文献

9.

多核处理器Cache一致性协议关键技术研究

黄安文张民选《计算机工程与科学》2009,31(Z1)

多核处理器规模的不断扩大和核间通信机制的日益复杂,使得Cache一致性维护变得更加困难。本文从多核处理器Cache一致性问题的产生背景出发,分析监听协议、目录协议、Token协议和Hammer协议的实现机制以及在多核环境中的优缺点,分别从一致性协议与片上互连结构协同设计、面向低功耗应用的协议优化策略、Cache一致性协议验证及容错机制等角度考虑,对未来多核处理器Cache一致性协议设计的发展趋势和技术挑战进行详细分析与讨论。相似文献

10.

硬件结构支持的基于同步的高速缓存一致性协议

黄河刘磊宋风龙马啸宇《计算机学报》2009,32(8)

共享存储系统中如何高效地实现高速缓存一致性是体系结构设计面临的一个关键问题和难点问题.已有的基于目录的协议存在难于实现、验证复杂和存储空间开销大等问题.面向片上众核处理器,文中提出一种由硬件结构支持、基于同步的高速缓存一致性协议.该方案不使用目录,而是通过使用bloom-filter表示一致性信息,并在并行程序中的同步点维护高速缓存一致性.与现有的基于目录的高速缓存一致性协议相比,该方案可以降低目录协议的实现、验证复杂度.用SPLASH一2测试程序集评估表明,基于同步的协议可以获得与基于目录的协议相当的性能. 相似文献

11.

多核Cache稀疏目录性能提升方法综述

吴健虢陈海燕刘胜邓让钰陈俊杰《计算机工程与科学》2019,41(3):385-392

受限于功耗,十多年前通用微处理器就停止追求更高的主频转而向集成更多处理器核的方向发展;同时,随着晶体管密度按摩尔定律不断提高,单片可集成的处理器核数成倍增长,片上多核、众核处理器已成为高性能微处理器发展的主流。未来千核级通用众核处理器支持共享存储编程模型是一种必然趋势,但传统的Cache一致性目录结构面临着查找延迟高、目录项替换频繁以及硬件代价和功耗可扩展性有限等问题。稀疏目录实现了传统目录结构硬件开销与一致性维护效率的折衷,被认为是众核处理器维护Cache一致性的一种高能效、可扩展结构。综述了近年来提高稀疏目录性能的相关研究与方法,并对其在面积、访问延迟、功耗和实现复杂性等方面进行分析,归纳出这些方法各自的优点和存在的不足,对创新设计未来高性能众核处理器共享存储体系结构具有一定的参考价值。相似文献

12.

存储模型仿真器的设计与实现 总被引：2，自引：1，他引：1

吴俊敏杨超陈国良张淼辉门珂《计算机研究与发展》2005,42(3):394-403

存储一致性问题和高速缓存一致性问题是共享存储并行计算机中两个最关键的问题,通过仿真器对它们进行了量化研究,设计并实现了一个存储模型仿真器MMS．基于MMS仿真了不同并行机结构模型下多种存储一致性模型的行为;针对不同类型的计算问题比较了不同的存储一致性模型,并对实验结果进行了分析;实现了几个不同的高速缓存一致性协议,并比较了它们的性能．相似文献

13.

Godson-T缓存一致性协议的Murphi建模和验证

周琰《计算机系统应用》2013,22(10):124-128

Godson-T缓存一致性协议是用于Godson-T众核处理器的缓存一致性协议．在Godson-T协议中,缓存一致性协议和存储一致性模型存在紧密的紧耦合关系,分析协议的一致性时发现该协议满足的缓存一致性不是强一致性,不满足传统意义上缓存透明的一致性要求．我们选取了Murphi模型检测工具作为我们建模的语言和验证工具．在对Godson-T缓存一致性协议建模的时候,由于协议的上述特点,我们需要对处理器核结点,高速缓存和内存作为一个整体建模,并成功地验证了协议的相关性质．相似文献

14.

M<Emphasis Type="SmallCaps">osaic</Emphasis>: A Scalable Coherence Protocol

Lucia G. Menezo Valentin Puente Pablo Abad Jose-Angel Gregorio 《International journal of parallel programming》2018,46(6):1110-1138

The coherence protocol presented in this work, denoted Mosaic, introduces a new approach to face the challenges of complex multilevel cache hierarchies in future many-core systems. The essential aspect of the proposal is to eliminate the condition of inclusiveness through the different levels of the memory hierarchy while maintaining the complexity of the protocol limited. Cost reduction decisions taken to reduce this complexity may introduce artificial inefficiencies in the on-chip cache hierarchy, especially when the number of cores and private cache size is large. Our approach trades area and complexity for on-chip bandwidth, employing an integrated broadcast mechanism in a directory structure. In energy terms, the protocol scales like a conventional directory coherence protocol, but relaxes the shared information inclusiveness. This allows the performance implications of directory size and associativity reduction to be overcome. As it is even simpler than a conventional directory, the results of our evaluation show that the approach is quite insensitive, in terms of performance and energy expenditure, to the size and associativity of the directory. 相似文献

15.

Filtering directory lookups in CMPs

A. Bosque V. Viñals P. Ibáñez J.M. Llaber?´aAuthor vitae 《Microprocessors and Microsystems》2011,35(8):695-707

Coherence protocols consume an important fraction of power to determine which coherence action to perform. Specifically, on CMPs with shared cache and directory-based coherence protocol implemented as a duplicate of local caches tags, we have observed that a big fraction of directory lookups cause a miss, because the block looked up is not allocated in any local cache. To reduce the number of directory lookups and therefore the power consumption, we propose to add a filter before the directory access.We introduce two filter implementations. In the first one, filtering information is explicitly kept in the shared cache for every block. In the second one, filtering information is decoupled from the shared cache organization, so the filter size does not depend on the shared cache size.We evaluate our filters in a CMP with 8 in-order processors with 4 threads each and a memory hierarchy with write-through local caches and a shared cache. We show that, for SPLASH2 benchmarks, the proposed filters reduce the number of directory lookups performed by 60% while power consumption is reduced by ∼28%. For Specweb2005, the number of directory lookups performed is reduced by 68% (44%), while directory power consumption is reduced by 19% (9%) using the first (second) filter implementation. 相似文献

16.

众核处理器Cache一致性研究综述

韩立敏安建峰高德远樊晓桠任向隆《计算机应用研究》2012,29(11):4011-4016

以瓦片结构众核处理器一致性协议的设计为主线,综述了国内外近年来关于众核处理器cache一致性的相关研究;介绍了不同NUCA结构对一致性协议的影响;分析和对比了几种传统目录一致性协议的特性及其存在的问题;归纳了最新几个面向众核结构一致性协议的设计思想和特性。最后为设计具备应用程序适应性和可扩展性的cache一致性协议指出了几个关键的设计方向。相似文献

17.

基于共享总线的多处理器cache一致性的硬件实现*

李均晓张盛兵沈绪榜《计算机应用研究》2008,25(6):1890-1893

龙腾R2微处理器是西北工业大学航空微电子中心设计的采用PowerPC体系结构,具有自主知识产权的R ISC微处理器。为了扩展其多处理器的功能,采用总线侦听的方法来维护多处理器环境下的cache一致性。首先介绍了共享总线侦听技术以及侦听协议,然后详细介绍了龙腾R2微处理器的总线侦听部件的实现方案,对几类cache一致性的实现方案以及性能进行了评析。FPGA实验结果表明,总线侦听部件能高效而准确地保证多处理器系统的cache一致性。相似文献

18.

分布式文件系统中Cache一致性的验证

王建勇祝明发《计算机学报》1999,22(5):460-466

基于信念逻辑,分析了曙光超级服务器单一映象文件系统中所采用的基于目录的无效使能Ｃａｃｈｅ一致性协议,首先介绍了Ｃａｃｈｅ一致性协议的目标,并为之建立了系统模型及逻辑,然后以基于目录的无效使能协议为例演示了运用信念逻辑对Ｃａｃｈｅ一致性进行正确性证明的过程。相似文献

19.

Design and formal verification of a hierarchical cache coherence protocol for NoC based multiprocessors

Hemangee K. Kapoor Praveen Kanakala Malti Verma Shirshendu Das 《The Journal of supercomputing》2013,65(2):771-796

Advancement in semiconductor technology is allowing to pack more and more processing cores on a single die and scalable directory based protocols are needed for maintaining cache coherence. Most of the currently available directory based protocols are designed for mesh based topology and have the problem of delay and scalability. Cluster based coherence protocol is a better option than flat directory based protocol but the problem of mesh based topology is still exits. On the other hand, tree based topology takes fewer hop counts compared to mesh based topology. In this paper we give a hierarchical cache coherence protocol based on tree based topology. We divide the processing cores into clusters and each cluster shares a higher-level cache. At the next level we form clusters of caches connected to yet another higher-level cache. This is continued up to the top level cache/memory. We give various architectural placements that can benefit from the protocol; hop-count comparison; and memory overhead requirements. Finally, we formally verify the protocol using the Mur? tool. 相似文献

20.

Improving memory utilization in cache coherence directories

Lilja D.J. Yew P.-C. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(10):1130-1146

Efficiently maintaining cache coherence is a major problem in large-scale shared memory multiprocessors. Hardware directory coherence schemes have very high memory requirements, while software-directed schemes must rely on imprecise compile-time memory disambiguation. Recently proposed dynamically tagged directory schemes allocate pointers to blocks only as they are referenced, which significantly reduces their memory requirements, but they still allocate pointers to blocks that do not need them. The authors present two compiler optimizations that exploit the high-level sharing information available to the compiler to further reduce the size of a tagged directory by allocating pointers only when necessary. Trace-driven simulations are used to show that the performance of this combined hardware-software approach is comparable to other coherence schemes, but with significantly lower memory requirements. In addition, these simulations suggest that this approach is less sensitive to the quality of the memory disambiguation and interprocedural analysis performed by the compiler than software-only coherence schemes 相似文献