1.

贾耀仓武成岗张兆庆《计算机研究与发展》2012,49(1):93-102

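For multicore processors with a shared cache, managing how each core uses the cache is key to realizing the processor's full performance. Current cache replacement policies allow programs to interfere with one another's performance, whereas static cache partitioning resolves this interference by allocating separate cache space to concurrently running programs. To give each program a suitably sized partition, the program must be profiled: it is run multiple times in advance to collect performance data at each cache capacity. This profiling is so expensive that it limits practical use. To eliminate the multiple runs, the paper proposes a profiling optimization that needs only a single run. The technique uses online phase analysis to identify a program's execution phases and avoids re-profiling phases it has already seen; it also analyzes how each phase's performance trends with cache capacity and skips profiling for capacity changes to which performance is insensitive, reducing overhead. After the run ends, the program's overall performance at each capacity is estimated from the per-phase performance at each capacity, and this estimate guides static cache partitioning. Experiments show that the technique's overhead is only 7%, that partitioning guided by it improves performance by 8% over no partitioning, and that it loses only 1% relative to partitioning guided by multi-run profiling.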
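The final estimation step described in this abstract can be illustrated with a minimal sketch. It assumes that "performance" is measured as CPI and that each phase contributes in proportion to its instruction count; neither assumption comes from the abstract, and the function name and dictionary keys are hypothetical.

```python
def estimate_overall_cpi(phases, capacities):
    """Estimate whole-program CPI at each cache capacity from per-phase data.

    phases: list of dicts with hypothetical keys
        "instructions": instructions executed in the phase
        "cpi": {capacity: CPI of the phase at that capacity}
    """
    total_instr = sum(p["instructions"] for p in phases)
    overall = {}
    for cap in capacities:
        # Total cycles at this capacity, summed over all phases.
        cycles = sum(p["instructions"] * p["cpi"][cap] for p in phases)
        overall[cap] = cycles / total_instr
    return overall


# Illustrative numbers only; capacities are in arbitrary units.
phases = [
    {"instructions": 600, "cpi": {1: 2.0, 2: 1.5, 4: 1.0}},
    {"instructions": 400, "cpi": {1: 1.0, 2: 1.0, 4: 1.0}},  # capacity-insensitive phase
]
overall = estimate_overall_cpi(phases, [1, 2, 4])
```

A capacity-insensitive phase, like the second one here, is the kind for which a profiler could record one value and reuse it across capacities.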

2.

Design and implementation of a dynamic cache partitioning algorithm for virtual machine environments

李家文沈立《计算机科学与探索》2012,6(1):58-66

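To improve cache isolation in virtualized systems and raise overall system performance, the paper designs and implements a dynamic cache partitioning algorithm for virtualized environments. The algorithm is based on page coloring: it partitions the cache by assigning each virtual machine pages of private colors, and it dynamically adjusts each virtual machine's cache capacity according to its cache demand. The algorithm is implemented in the Xen virtualization environment. Experimental results show that it significantly improves the global performance of concurrent programs across multiple virtual machines at low overhead.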
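Page coloring, the idea this abstract builds on, can be sketched as follows. The sketch assumes a physically indexed, set-associative cache; the cache geometry and helper names are illustrative and are not taken from the paper or from Xen.

```python
def num_colors(cache_bytes, ways, page_bytes):
    """Number of page colors: bytes covered by one cache way divided by the page size."""
    way_bytes = cache_bytes // ways
    return way_bytes // page_bytes


def page_color(phys_addr, cache_bytes, ways, page_bytes=4096):
    """Color of the physical page that contains phys_addr."""
    frame = phys_addr // page_bytes
    return frame % num_colors(cache_bytes, ways, page_bytes)


def frames_for_vm(vm_colors, total_frames, cache_bytes, ways, page_bytes=4096):
    """Physical frame numbers a VM may use if it is restricted to vm_colors."""
    n = num_colors(cache_bytes, ways, page_bytes)
    return [f for f in range(total_frames) if f % n in vm_colors]


# Assumed geometry: 2 MiB, 16-way cache with 4 KiB pages, giving 32 colors.
CACHE = 2 * 1024 * 1024
colors = num_colors(CACHE, 16, 4096)
color = page_color(0x21000, CACHE, 16)
frames = frames_for_vm({0, 1}, 70, CACHE, 16)
```

Pages of different colors map to disjoint groups of cache sets, so giving two virtual machines disjoint color sets keeps them from evicting each other's lines; changing a machine's share means changing its color set.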

3.

A low-power dynamically reconfigurable cache scheme Total citations: 1, self-citations: 0, citations by others: 1

赵欢苏小昆李仁发《计算机应用》2009,29(5):1446-1451

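In embedded systems, processor power consumption is a major concern; studies show that cache memory accounts for 30% to 60% of total processor power in embedded systems. The paper therefore proposes Tournament cache, a low-power, dynamically reconfigurable cache scheme. It adds three counters and one register to a conventional cache structure and, while the program runs, dynamically adjusts the cache's associativity according to the counter statistics, switching among 1, 2, and 4 ways to suit different program sections and thereby lower system power. Experimental results show that, compared with a conventional four-way set-associative cache, the scheme saves more than 40% of cache energy with almost negligible performance loss.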

4.

Optimizing in-memory database sorting on shared-cache multicore processors

邓亚丹吴京熊伟景宁《计算机研究与发展》2009,46(Z2)

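Targeting current mainstream multicore processors, the paper proposes a shared-cache-sensitive multithreaded sorting framework for databases (SCS-MSF). It first analyzes the performance bottlenecks that multithreaded QuickSort faces on shared-cache multicore processors. Based on the data access characteristics of each SCS-MSF processing stage, it then proposes a multithreaded parallel execution mode for each stage and applies several optimizations to improve the threads' cache performance, in particular reducing conflicts when multiple threads access the shared cache. In the experiments, SCS-MSF is implemented on the in-memory database EaseDB. The results show that SCS-MSF has good cache access behavior, which raises the efficiency of multithreaded execution; performance is stable, and database sorting performance improves considerably.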

5.

A cache management policy based on semantic information

周勇蒋泽军王丽芳宋玲玲王斌《微处理机》2011,32(6):87-90

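Conventional caches do not check the state of the blocks they prefetch, which causes unnecessary I/O and lowers the cache hit rate. To address this, the paper proposes a cache management policy based on semantic information. The policy first collects semantic information so that the disk knows the file system's on-disk data layout and whether each disk block is live or dead, and derives the liveness of the blocks in each disk partition. Using this information, it does not prefetch dead blocks, raises the prefetch parameter on partitions with high liveness, and skips the write-back to disk when a dead block is evicted from the cache. Experimental results show that the policy improves the cache hit rate and, in turn, system throughput.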

6.

A study of the L1 cache design space for embedded applications

胡荣群《计算机光盘软件与应用》2010,(8):27-28

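In the embedded domain, a single application or a single class of applications typically runs repeatedly on the processor. In this setting, the cache parameters can be configured to obtain a performance-optimal cache. By varying three parameters, the set capacity s, the block capacity a, and the block size b, one can find the cache configuration with the minimum total access time. The paper proposes two cache simulation algorithms that reduce the time complexity of determining cache hits and misses.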

7.

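An optimization method for avoiding cache penalties in modulo scheduling Total citations: 1, self-citations: 0, citations by others: 1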

刘利李文龙郭振宇李胜梅汤志忠《软件学报》2005,16(10):1842-1852

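Software pipelining speeds up loop execution, and modulo scheduling is a widely used heuristic for software pipelining. To improve the memory system, caches are organized hierarchically, but this also introduces extra memory latency: the cache penalty. The paper proves that modulo scheduling can incur cache penalties and proposes PCPMS (prevent cache penalty in modulo scheduling), an algorithm that avoids them. Experimental results show that PCPMS avoids the cache penalty in modulo scheduling and improves program performance.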

8.

A stride-adaptive L2 cache prefetching mechanism

靳强郭阳鲁建壮《计算机工程与应用》2011,47(29):56-59

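With the rapid development of integrated circuit manufacturing processes, large on-chip caches have become feasible. This greatly lowers the cache miss rate, but the cost of a miss in a large cache also becomes more pronounced. By analyzing cache miss behavior, the paper designs a new stride-adaptive prefetching mechanism for the L2 cache. The mechanism exploits the fact that instruction addresses are not visible to the L2 cache and uses the miss address as the index into the prefetch table. Suitable structural parameters are chosen by analyzing test results, and cache performance improves effectively.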

9.

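A cache organization that trades off latency and capacity in chip multiprocessors Total citations: 1, self-citations: 0, citations by others: 1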

肖俊华冯子军章隆兵《计算机研究与发展》2009,46(1)

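L2 cache design in chip multiprocessors (CMPs) faces a conflict between latency and capacity, which cannot both be satisfied: a private organization has low hit latency but reduces effective cache capacity, while a shared organization increases effective capacity but has longer hit latency. The paper proposes a cache organization for CMPs that trades off latency and capacity (TCLC). It is a hybrid of the private and shared designs. The core idea is to dynamically identify the sharing type of each cache block and optimize each type differently: private blocks are migrated, shared read-only blocks are replicated, and shared read-write blocks are placed centrally. The aim is access latency close to that of the private organization and effective capacity close to that of the shared organization, which mitigates wire delay and reduces average memory access latency. Full-system simulation shows that TCLC improves performance by 13.7% on average over the private organization and by 12% on average over the shared organization.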

10.

Structural design of a two-level cache for a 32-bit DSP

杨向峰张惠国陶建中《微计算机信息》2008,24(17)

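A cache for a 32-bit DSP is designed using a top-down flow. The cache has two levels: the first level uses a Harvard architecture and the second a Princeton architecture. The paper describes the cache's structural design and the algorithms it uses in detail.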

11.

Heterogeneous-aware cache partitioning: Improving the fairness of shared storage cache

《Parallel Computing》2014,40(10):710-721

In this paper, we investigate the problem of fair storage cache allocation among multiple competing applications with diverse access rates. Commonly used cache replacement policies such as LRU and most LRU variants are inherently unfair in cache allocation for heterogeneous applications. They implicitly give more cache to applications with high access rates and less cache to applications with low access rates. However, applications with high access rates do not always gain higher performance from the additional cache blocks. In contrast, slow applications suffer poor performance with a reduced cache size. It is beneficial in terms of both performance and fairness to allocate cache blocks by their utility. We propose a partition-based cache management algorithm for a shared cache. The goal of our algorithm is to find an allocation such that all heterogeneous applications achieve a specified degree of fairness with as little performance degradation as possible. To achieve this goal, we present an adaptive partition framework, which partitions the shared cache among competing applications and dynamically adjusts the partition sizes based on predicted utility for both fairness and performance. We implement our algorithm in a storage simulator and evaluate fairness and performance with various workloads. Experimental results show that, compared with LRU, our algorithm achieves a large improvement in fairness and a slight improvement in performance.

12.

Dynamic cache partitioning based on hot page migration

Xiaolin WANG Xiang WEN Yechen LI Zhenlin WANG Yingwei LUO Xiaoming LI 《Frontiers of Computer Science》2012,6(4):363-372

Static cache partitioning can reduce inter-application cache interference and improve the composite performance of a cache-polluted application and a cache-sensitive application when they run on cores that share the last-level cache in the same multi-core processor. In a virtualized system, since different applications might run on different virtual machines (VMs) at different times, it is impractical to partition the cache statically in advance. This paper proposes a dynamic cache partitioning scheme that uses hot page detection and page migration to improve the composite performance of co-hosted virtual machines dynamically, according to prior knowledge of cache-sensitive applications. Experimental results show that the overhead of our page migration scheme is low and that, in most cases, composite performance improves over free composition.

13.

Precise control of page cache for containers

Kun WANG Song WU Shengbang LI Zhuo HUANG Hao FAN Chen YU Hai JIN 《Frontiers of Computer Science》2024,18(2):182102

Container-based virtualization is becoming increasingly popular in cloud computing due to its efficiency and flexibility. Resource isolation is a fundamental property of containers. Existing work has shown that weak resource isolation can cause significant performance degradation for containerized applications, and has enhanced resource isolation. However, current studies have rarely discussed the isolation problems of the page cache, which is a key resource for containers. Containers leverage the memory cgroup to control page cache usage. Unfortunately, the existing policy introduces two major problems in a container-based environment. First, containers can use more memory than their cgroup limit allows, effectively breaking memory isolation. Second, the OS kernel has to evict page cache to make space for newly arrived memory requests, slowing down containerized applications. This paper performs an empirical study of these problems and demonstrates their performance impact on containerized applications. We then propose pCache (precise control of page cache) to address the problems by dividing the page cache into private and shared parts and controlling both kinds separately and precisely. To do so, pCache leverages two new techniques: fair account (f-account) and evict on demand (EoD). F-account splits the charging of shared page cache according to each container's share, preventing containers from using memory for free and enhancing memory isolation. EoD reduces unnecessary page cache evictions to avoid the performance impact. The evaluation results demonstrate that our system effectively enhances memory isolation for containers and achieves substantial performance improvement over the original page cache management policy.

14.

Cache performance analysis and optimization on a fused CPU-GPU architecture

孙传伟安虹孙荪陈俊仕《计算机工程与应用》2017,53(2):47-52

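CPU and GPU development has hit new bottlenecks, and combining the two on a single chip has become a new trend. This heterogeneous architecture puts pressure on the management of shared on-chip resources, and management of the shared last-level cache (LLC) is critical to performance. Because CPU programs and GPU programs have different characteristics, managing an LLC shared between CPU and GPU poses new challenges. By analyzing the memory access characteristics of GPU programs and drawing on earlier cache management schemes, the paper proposes equal static partitioning and optimal static partitioning of the LLC in a fused CPU-GPU system. Experimental results show that cache partitioning effectively avoids interference between CPU and GPU programs. Compared with the conventional LRU policy, equal static partitioning and optimal static partitioning improve overall system performance by 7.68% and 11.62%, respectively.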

15.

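Weighted shared cache partitioning for multithreaded multiprogrammed workloads Total citations: 5, self-citations: 1, citations by others: 4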

所光杨学军《计算机学报》2008,31(11)

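When parallel applications run on a multicore processor with a shared cache, conflicting accesses to the shared cache cause performance degradation and unpredictable execution time. Shared cache partitioning, which allocates the shared cache exclusively among multiple processes, is an effective solution. Because threads share data, applications with different thread counts utilize the shared cache differently, yet conventional partitioning algorithms that minimize miss rate (for example, UCP) do not distinguish applications by thread count. The paper designs Weighted Cache Partitioning (WCP), a weighted shared-cache partitioning framework for multithreaded multiprogrammed workloads, consisting of an application-oriented miss rate monitor and a weighted cache partitioning algorithm. The monitor dynamically tracks, per process, the application's miss rate at different cache capacities. The weighted algorithm extends the conventional miss-rate-optimal partitioning algorithm by assigning each application a weight based on its thread count during partitioning, so that applications with more threads receive more shared cache, improving overall system performance. Experimental results show that although weighted partitioning raises the miss rate somewhat, it improves IPC throughput, weighted speedup, and fairness. On multiprogrammed test cases composed of scientific and engineering computing applications, the IPC throughput of WCP-1 is up to 10.8% higher, and on average 5.5% higher, than that of the shared-cache partitioning algorithm whose objective is the lowest miss rate.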
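The effect of weighting a miss-rate-driven partitioner can be shown with a deliberately simplified greedy sketch. It is not the WCP algorithm from the paper, nor UCP's exact allocation procedure; the miss curves, the weights, and the function name are all hypothetical.

```python
def weighted_greedy_partition(miss_curves, weights, total_ways):
    """Assign cache ways one at a time to the app with the largest weighted miss reduction.

    miss_curves[app][w] = misses of app when it owns w ways (w = 0..total_ways)
    weights[app]        = hypothetical weight, e.g. derived from the thread count
    """
    alloc = {app: 0 for app in miss_curves}

    def gain(app):
        # Weighted misses saved by giving this app one more way.
        w = alloc[app]
        return weights[app] * (miss_curves[app][w] - miss_curves[app][w + 1])

    for _ in range(total_ways):
        best = max(alloc, key=gain)
        alloc[best] += 1
    return alloc


# Illustrative miss curves for a 4-way shared cache.
curves = {
    "A": [100, 60, 40, 30, 25],
    "B": [100, 82, 67, 56, 50],
}
unweighted = weighted_greedy_partition(curves, {"A": 1, "B": 1}, 4)
weighted = weighted_greedy_partition(curves, {"A": 1, "B": 3}, 4)
```

With equal weights the two applications split the ways evenly; giving B a weight of 3 (standing in for a higher thread count) shifts one way from A to B, even though that raises the total miss count, which mirrors the trade-off the abstract reports.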

16.

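ARP: an adaptive runtime partitioning mechanism for the shared cache in simultaneous multithreading processors Total citations: 1, self-citations: 1, citations by others: 0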

隋秀峰吴俊敏陈国良《计算机研究与发展》2008,45(7)

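Simultaneous multithreading (SMT) is a latency-tolerant architecture with a shared L2 cache that can execute multiple instructions from multiple threads in each cycle, which increases pressure on the memory hierarchy. The paper studies the partitioning of the shared cache among concurrently executing threads in an SMT processor, in particular fairness in cache sharing and its relationship to throughput. The conventional LRU policy implicitly partitions the shared cache according to thread demand, giving more cache space to threads with higher demand. This makes cache management unfair and can cause thread starvation, priority inversion, and similar problems. The paper implements an adaptive runtime partitioning mechanism (ARP) to manage the shared cache. ARP uses fairness as the partitioning metric and a dynamic partitioning algorithm to optimize it; the algorithm is easy to implement and needs little profiling. In hardware, classic monitors collect each thread's stack distance information, at a storage overhead of less than 0.25%. Experimental results show that, compared with LRU-based cache partitioning, ARP improves the fairness of a 2-way SMT processor by a factor of 2.26 and raises throughput by 14.75% on average.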
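The stack distance information mentioned in this abstract can be illustrated in software. The sketch below computes LRU stack distances from a block access trace and derives the miss count for any capacity; it assumes a fully associative LRU cache and is not a model of the hardware monitors used by ARP.

```python
def stack_distances(trace):
    """LRU stack distance of each access; None marks a first (cold) access."""
    stack = []  # most recently used block is at the end
    out = []
    for block in trace:
        if block in stack:
            pos = stack.index(block)
            # Number of distinct blocks touched since this block's last use.
            out.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            out.append(None)
        stack.append(block)
    return out


def misses_for_capacity(distances, capacity):
    """Misses in a fully associative LRU cache holding `capacity` blocks."""
    return sum(1 for d in distances if d is None or d >= capacity)


trace = ["a", "b", "c", "a", "b", "d", "a"]
dists = stack_distances(trace)
misses_at_2 = misses_for_capacity(dists, 2)
misses_at_3 = misses_for_capacity(dists, 3)
```

A single pass over the trace yields the miss count at every capacity, which is why per-thread stack distance profiles are a convenient input for partitioning decisions.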

17.

An L2 cache partitioning mechanism for multithreaded array many-core processors

陈逸飞朱蕾李宏亮《计算机工程与科学》2019,41(3):400-408

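Array many-core processors are widely used in high-performance computing because of their high computing performance and energy efficiency. To build processors for future high-performance computing systems, the severe "memory wall" challenge and the problem of core cooperation must be solved. The cores of typical array processors are mostly single-threaded to reduce overhead, but this places high demands on memory access. The paper introduces hardware simultaneous multithreading and, to address the low L2 cache utilization observed in experiments with multithreaded single cores, proposes a shared L2 cache partitioning mechanism. In simulation, the optimized shared L2 cache partitioning mechanism lowers the L2 instruction cache miss rate by 18.59% and the data cache miss rate by 6.60%, and improves overall CPI performance by 10.1%.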

18.

A shared multi-ported data cache architecture: SMPDCA

黄光奇李子木周兴铭窦勇《计算机学报》2001,24(12):1318-1323

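With the rapid development of semiconductor process technology, the single-chip multiprocessor (SCMP) architecture is an effective way to improve processor performance. Based on an analysis of the characteristics of SCMP, the paper proposes one implementation of it: the Shared Multi-Ported Data Cache Architecture (SMPDCA). SMPDCA has three notable advantages: minimal communication latency, no cache coherence maintenance overhead, and a higher data cache hit rate. Simulation results show that, compared with an architecture with private data caches, these advantages markedly improve application performance, with the most pronounced effect on applications with heavy inter-processor communication and interaction.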

19.

Lightweight dynamic partitioning for last-level cache of multicore processor on real system

Ludan Zhang Yi Liu Rui Wang Depei Qian 《The Journal of supercomputing》2014,69(2):547-560

With the rapid development of multi/many-core processors, contention in the shared cache has become increasingly serious and restricts the performance improvement of parallel programs. Recent research has employed page coloring to realize cache partitioning on real systems and to reduce contention in the shared cache. However, page coloring-based cache partitioning has side effects: page coloring restricts the memory space an application can allocate, which may lead to memory pressure, and changing the cache partition dynamically requires massive page copying, which incurs large overhead. To make page coloring-based cache partitioning more practical, this paper proposes a malloc allocator-based dynamic cache partitioning mechanism with page coloring. Memory allocated by our malloc allocator can be dynamically partitioned among different applications according to the partitioning policy. Coloring only the dynamically allocated pages relieves memory pressure and reduces the page copying overhead caused by re-coloring, compared with all-page coloring. To further reduce the overhead, we introduce a minimum-distance page copying strategy and a lazy flush strategy. We conduct experiments on a real system to evaluate these strategies, and the results show that they work well for reducing cache misses and re-coloring overhead.

20.

A coarse-grained application-level cache partitioning method for multicore processors

所光《计算机工程与科学》2009,31(Z1)

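Cache partitioning is an important way to resolve conflicts in shared cache access, but existing techniques suffer from high overhead and from difficulty in deciding when to repartition. The paper proposes an application-oriented cache partitioning framework (ACP). ACP uses boundary information for the application's outermost loop, supplied by the programmer, to obtain better miss rate information. The partitioning algorithm is therefore more accurate, which lowers the partitioning frequency and in turn improves system performance. Experimental results show that ACP performs better than conventional fixed-interval cache partitioning.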