期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

刘松鹤宋焕生亓淑敏李文敏《计算机应用研究》2013,30(7):2064-2067

“存储墙”问题是高性能处理器设计必须跨越的障碍之一, 高效、智能的Cache系统是处理器存储体系的关键因素。具有分支预测能力的处理器在猜测执行分支路径上访存指令时取回的存储器数据所导致的Cache污染会显著影响Cache和处理器性能。分析了猜测执行和Cache数据污染对处理器性能的影响, 在此基础上结合分支预测机制的特征提出了一种基于分支路径跟踪的Cache污染控制技术——Contra, 通过构建分支路径跟踪表对猜测路径写入Cache的数据进行跟踪, 并对这些数据的存储、访问和替换过程进行控制, 有效地避免了污染数据对Cache效率的影响, 提升了处理器存储系统的性能。仿真结果表明, Contra技术相对于baseline结构来说, L1 D-Cache命中率提升幅度为0. 03%～6. 69%, 平均提升为1. 80%; IPC的提升幅度为0. 01%～6. 60%, 平均提升为2. 56%。相似文献

2.

面向微处理器猜测执行过程中预载入数据的Cache污染控制方法

张骏《小型微型计算机系统》2012,33(5):987-994

“存储墙”问题已经成为处理器性能提升的主要障碍,而处理器内核猜测执行预测路径上访存指令时预载入的存储器数据所导致Cache污染会严重影响处理器性能.本文提出一种针对猜测执行过程中预载入数据的Cache污染控制方法CSDA.首先,利用置信度评估技术从所有预测路径中分离出错误概率较大的路径.然后,根据低置信度污染型访存指令识别历史表将低置信度预测路径上的访存指令划分为预取型和污染型,为污染型的访存指令建立低优先级Load/Store队列,并采用污染数据Cache存储污染数据.仿真结果表明,在双核模式下,CSDA策略相对于baseline结构来说,L1 D-Cache缺失率降低幅度从9％-23％,平均降低了17％;L2 Cache缺失率的下降范围从1.02％-14.39％,平均为5.67％;IPC的提升幅度从0.19％ -5.59％,平均为2.21％. 相似文献

3.

面向非一致Cache的智能多跳提升技术

吴俊杰潘晓辉杨学军《计算机学报》2009,32(10)

非一致Cache体系结构(Non-Uniform Cache Architecture,NUCA)几乎已经成为未来片上大容量Cache的设计趋势.非一致Cache中,数据提升技术通过将经常访问的数据放置在距离处理器较近的Cache bank中减少处理器对该数据访问的等待时间,对NUCA的性能有着重要影响.然而,目前已有的数据提升技术使用固定的提升策略,投有考虑所要提升到目标bank的实际状态,容易将目标bank中更有用的数据"挤"得远离处理器,从而产生Cache污染问题,严重制约了提升技术的性能发挥.针对这一问题,文中提出智能多跳提升技术.智能多跳提升技术能够感知候选目标bank的状态,为被提升的数据动态地选择合适的目标bank,从而提高了提升效率,减少了Cache污染.同时,智能多跳提升技术的设计巧妙地利用了处理器访问的反向路径,只是简单地扩充了处理器访问报文的格式,并没有增加对Cache bank的额外访问.最后使用全系统模拟器对来自NAS Parallel Benchmark和Livermore Benchmark的15个基准测试程序进行了详细测试,智能多跳提升技术单位提升操作节省的时钟周期数是已有提升技术的1.50倍,最多达到2.61倍;系统的IPC性能平均提高了6.24%,最高达到19.03%. 相似文献

4.

支持多机环境的片上Cache的设计与实现

邹代红高德远张盛兵《计算机工程》2007,33(23):249-251

Cache是高性能微处理器解决CPU和存储器速度差异问题的有效措施之一。在共享存储器的多机环境下，共享数据在多个处理器的片上Cache中分布，Cache间维持数据一致性成为关键。该文讨论了32位嵌入式微处理器“龙腾R2”的Cache的设计和实现和支持多机环境的Cache一致性实现方法，并给出了实现的结果。相似文献

5.

基于节点预测的直接Cache一致性协议

张骏　田泽梅魁志赵季中《计算机学报》2014,(3):3700-3720

处理器性能的提升依赖于对存储系统性能的挖掘.随着片上集成内核数量的不断增大和特征尺寸的持续缩小,延迟、存储可扩展的Cache一致性协议已经成为提升访存效率的关键性因素.文中提出一种基于节点预测的直接Cache一致性协议-NPP协议,研究一致性交互延迟隐藏和目录存储开销减少技术.针对读、写缺失中存在的间接性问题和现有解决方案破坏已有数据局部性、无法获得最近数据副本等问题,分别提出节点挂起技术和直接写缺失处理技术,有效隐藏了目录访问延迟.为了实现准确的节点预测,作者还提出基于"签名"回收的历史信息更新算法,避免了冗余更新和不完整更新.使用SPLASH-2测试程序集,在基于2D MESH NoC互联的64核CMP下,相对于全映射目录协议,NPP协议的平均执行时间降幅为21.78%~31.11%;平均读缺失延迟降低14.22%~18.9%;平均写缺失延迟降低17.89%~21.13%.而获得上述性能提升的代价是网络流量平均增加6.62%~7.28%. 相似文献

6.

动态可配置片上数据存储单元设计

《计算机测量与控制》2014,(3):869-871

作为嵌入式处理器的关键部件,片上Cache的功耗能占到整个处理器功耗的50%以上;一个设计良好的片上数据存储单元能有效降低处理器功耗,并且提高整个系统的性能;便签式存储器(Scratchpad memory,SPM)具有占用片上面积少、功耗低和访问时延确定等优点,因此成为嵌入式系统领域的研究热点;以SPM为基础,介绍了一种动态可配置片上数据存储单元的设计方法,并提出SPM操作函数,方便应用程序开发;实验结果表明,该片上数据存储单元能耗降低超过35%,测试程序运行时间平均减少了20.3%。相似文献

7.

一种片上众核结构共享Cache动态隐式隔离机制研究 总被引：2，自引：0，他引：2

宋风龙刘志勇范东睿张军超余磊《计算机学报》2009,32(10)

访存带宽是限制众核处理器件能提升的关键,将片上最后一级Cache设计为所有处理器核共享是必要的.在共享Cache中隔离放置冲突的数据,是提高共享Cache性能的关键.文中提出了缓存块链接的硬件方法,用于隔离共享Cache中不同线程之间的数据.文中基于时钟精准的片上众核结构模拟器,使用Splash2程序组和生物信息学中的仟务,对所提机制进行了评估.实验结果表明,与传统共享Cache相比,使用缓存块链接机制时,使得共享Cache的冲突性缺失率降低约20%,而使得IPC平均提高了约10%. 相似文献

8.

基于预缓冲机制的低功耗指令Cache

下载免费PDF全文

王冶张盛兵王党辉《计算机工程》2012,38(1):268-269,272

为降低微处理器中片上Cache的能耗,设计一种基于预缓冲机制的指令Cache。通过预缓冲控制部件的预测,使处理器需要的指令尽可能在缓冲区命中,从而避免访问指令Cache所造成的功耗。对7个测试程序的仿真结果表明,预缓冲机制能节省23.23%的处理器功耗,程序执行性能平均提升7.53%。相似文献

9.

Cache自适应写分配策略 总被引：1，自引：0，他引：1

郇丹丹李祖松胡伟武刘志勇《计算机研究与发展》2007,44(2):348-354

处理器所能提供的有效带宽是目前制约处理器性能提高的关键因素 .通过对Cache写失效行为的分析,提出了一种新的提高处理器带宽利用率的Cache写失效处理策略--Cache自适应写分配策略 .该策略在访存失效队列中收集全修改Cache块,对全修改Cache块采用非写分配策略,并能够自适应地切换为写分配策略 .与传统的Cache写失效处理策略相比,Cache自适应写分配策略硬件代价小,避免了不必要的数据传输,降低Cache污染,减少存储管理队列阻塞的频率 .结果表明,采用Cache自适应写分配策略,STREAM基准测试程序带宽平均提高62.6%,SPEC CPU2000程序的IPC值平均提高5.9% . 相似文献

10.

基于软硬件的协同支持在众核上对1-DFFT算法的优化研究 总被引：2，自引：0，他引：2

周永彬张军超张帅张浩《计算机学报》2008,31(11)

随着高性能计算需求的日益增加,片上众核(many-core)处理器成为未来处理器架构的发展方向.快速傅立叶变换(FFT)作为高性能计算中的重要应用,对计算能力和通信带宽都有较高的要求.因此基于众核处理器平台,实现高效、可扩展的FFT算法是算法和体系结构设计者共同面临的挑战.文中在众核处理器Godson-T平台上对1-D FFT算法进行了优化和评估,在节省几乎三分之一L2 Cache存储开销的情况下,通过隐藏矩阵转置,计算与通信重叠等优化策略,使得优化后的1-D FFT算法达到3倍以上的性能提升.并通过片上网络拥塞状况的实验分析,发现对于像FFT这样访存带宽受限的应用,增加L2 Cache的访问带宽,可以缓解因为爆发式读写带给片上网络和L2 Cache的压力,进一步提高程序的性能和扩展性. 相似文献

11.

Using the First-Level Caches as Filters to Reduce the Pollution Caused by Speculative Memory References

Onur?Mutlu Email author Hyesoon?Kim David?N.?Armstrong Yale?N.?Patt 《International journal of parallel programming》2005,33(5):529-559

High-performance processors employ aggressive branch prediction and prefetching techniques to increase performance. Speculative memory references caused by these techniques sometimes bring data into the caches that are not needed by correct execution. This paper proposes the use of the first-level caches as filters that predict the usefulness of speculative memory references. With the proposed technique, speculative memory references bring data only into the first-level caches rather than all levels in the cache hierarchy. The processor monitors the use of the cache blocks in the first-level caches and decides which blocks to keep in the cache hierarchy based on the usefulness of cache blocks. It is shown that a simple implementation of this technique usually outperforms inclusive and exclusive baseline cache hierarchies commonly used by today’s processors and results in IPC performance improvements of up to 10% on the SPEC CPU2000 integer benchmarks. 相似文献

12.

Designing the TFP microprocessor

Hsu P.Y.-T. 《Micro, IEEE》1994,14(2):23-33

Designed to efficiently support large, real-world, floating-point-intensive applications, the TFP (short for Tremendous Floating-Point) microprocessor is a superscalar implementation of the Mips Technologies architecture. This floating-point, computation-oriented processor uses a superscalar machine organization that dispatches up to four instructions each clock cycle to two floating-point execution units, two memory load/store units, and two integer execution units. Its split-level cache structure reduces cache misses by directing integer data references to a 16-Kbyte on-chip cache, while channeling floating-point data references off chip to a 4 Mbyte cache 相似文献

13.

Compressed tag architecture for low-power embedded cache systems

Jong Wook Kwak Young Tae Jeon 《Journal of Systems Architecture》2010,56(9):419-428

Processors in embedded systems mostly employ cache architectures in order to alleviate the access latency gap between processors and memory systems. Caches in embedded systems usually occupy a major fraction of the implemented chip area. The power dissipation of cache system thus constitutes a significant fraction of the power dissipated by the entire processor in embedded systems. In this paper, we propose the compressed tag architecture to reduce the power dissipation of the tag store in cache systems. We introduce a new tag-matching mechanism by using a locality buffer and a tag compression technique. The main power reduction feature of our proposal is the use of small tag space matching instead of full tag matching, with modest additional hardware costs. The simulation results show that the proposed model provides a power and energy-delay product reduction of up to 27.8% and 26.5%, respectively, while still providing a comparable level of system performance to regular cache systems. 相似文献

14.

An instruction-systolic programmable shader architecture for multi-threaded 3D graphics processing

Jung-Wook Park Hoon-Mo Yang Gi-Ho Park Shin-Dug Kim Charles C. Weems 《Journal of Parallel and Distributed Computing》2010

In order to guarantee both performance and programmability demands in 3D graphics applications, vector and multithreaded SIMD architectures have been employed in recent graphics processing units. This paper introduces a novel instruction-systolic array architecture, which transfers an instruction stream in a pipelined fashion to efficiently share the expensive functional resources of a graphics processor. Specifically, cache misses and dynamic branches can cause additional latencies and complicated management in these parallel architectures. To address this problem, we combine a systolic execution scheme with on-demand warp activation that handles cache miss latency and branch divergence efficiently without significantly increasing hardware resources, either in terms of logic or register space. Simulation indicates that the proposed architecture offers 25% better performance than a traditional SIMD architecture with the same resources, and requires significantly fewer resources to match the performance of a typical modern vector multi-threaded GPU architecture. 相似文献

15.

嵌入式处理器中降低Cache缺失代价设计方法研究 总被引：2，自引：0，他引：2

黄海林许彤范东睿唐志敏《小型微型计算机系统》2006,27(11):2077-2081

以龙芯1号处理器为研究对象，探讨了嵌入式处理器中降低Cache缺失代价的设计方法．通过分析处理器的结构特征，本文实现了在关键字优先基础上一次缺失下命中的非阻塞数据Cache，可以将处理器平均性能提高3．9％,同时利用局部性原理，在关键字优先非阻塞数据Cache的基础上，本文提出了一种类非阻塞的指令Cache设计方法，可以降低指令Cache的缺失代价，以较小的实现代价进一步将处理器平均性能提高7．7％．通过本文的工作，可以同时降低指令Cache和数据Cache的缺失代价，处理器的平均性能提高了11．6％．相似文献

16.

共享多端口数据Cache结构：SMPDCA

黄光奇李子木周兴铭窦勇《计算机学报》2001,24(12):1318-1323

随着半导体工艺技术的飞速发展,单芯片多处理器（Single-Chip Multiprocessor,SCMP)结构将是一条提高处理器性能的有效途径。该文在分析SCMP结构的特点的基础上,提出了SCMP的一种结构实现：共享多端口数据Cache结构（Shared Multi-Ported Data Cache Architecture,SMPDCA).SMPDCA结构具有三个突出的优点：最小的通信延迟、没有Cache一致性维护开销和数据Cache命中率提高。模拟结果表明,与数据Cache私有的结构相比,SMPDCA结构的煅出优点使得应用程序的性能得到了明显的提高,特别是对于改善处理器之间的通信与交互比较多的应用程序的性能具有最为明显的效果。相似文献

17.

基于指令行为的Cache可靠性评估研究

周学海余洁李曦王志刚《计算机研究与发展》2007,44(4):553-559

软错误由高能粒子撞击所产生,对处理器的可靠性产生很大的损害.随着处理器设计目标转向低功耗、高性能和低供电电压,软错误的发生日益频繁,处理器的可靠性研究也随之受到越来越多的关注.针对传统的基于注错仿真的可靠性评估方法效率低的缺陷,提出了一套系统的cache可靠性评估方法,以可靠性指标之一--体系结构易受损因子(architectural vulnerability factor,AVF))--为研究对象,一方面,基于指令行为分析应用程序运行过程中对最终结果不产生影响的指令,从而确定对cache的AVF产生作用的指令;另一方面,根据cache的存储类型、所采取的写策略,结合cache中数据/指令阵列和地址标识阵列的特点,对cache上的各种相邻操作组合对AVF的影响进行了研究,从而完成AVF评估所需的信息分析.实验部分对PISA体系结构指令cache中的指令阵列进行了AVF评估,说明了该方法的有效性. 相似文献

18.

The Mips R10000 superscalar microprocessor

Yeager K.C. 《Micro, IEEE》1996,16(2):28-41

The Mips R10000 is a dynamic, superscalar microprocessor that implements the 64-bit Mips 4 instruction set architecture. It fetches and decodes four instructions per cycle and dynamically issues them to five fully-pipelined, low-latency execution units. Instructions can be fetched and executed speculatively beyond branches. Instructions graduate in order upon completion. Although execution is out of order, the processor still provides sequential memory consistency and precise exception handling. The R10000 is designed for high performance, even in large, real-world applications with poor memory locality. With speculative execution, it calculates memory addresses and initiates cache refills early. Its hierarchical, nonblocking memory system helps hide memory latency with two levels of set-associative, write-back caches 相似文献

19.

Exploring cache performance in multithreaded processors

Dimitris Lioupis Sotiris MiliosAuthor vitae 《Microprocessors and Microsystems》1997,20(10):230

Multithreading is a well known technique to hide latency in a non-blocking cache architecture. By switching execution from one thread to another, the CPU can perform useful work, while waiting for pending requests to be processed by the main memory. In this paper we examine the effects of varying the associativity and block size on cache performance in a reduced locality of reference environment, due to multithreading. We find that for associativity equal to the number of threads, the cache produces very low miss rate even for small sizes. Also by taking into account the increase in cycle time due to larger cache size or associativity we find that the optimum cache configuration for best processor performance is 16Kbytes direct mapped. Finally, with a constant main memory bandwidth, increasing the block size to more than 32 bytes, reduces the miss rate, but degrades processor performance. 相似文献