Similar Documents
20 similar documents found
1.
2.
方娟  郭媚  杜文娟  雷鼎 《计算机应用》2013,33(9):2404-2409
A low-power design scheme (LPD) is proposed for the shared level-2 cache (L2 Cache) in multicore processors. LPD reduces cache power through three algorithms, while keeping system performance good: a low-power hybrid partitioning algorithm for the shared cache (LPHP), a reconfigurable cache algorithm (CRA), and a way-prediction algorithm based on cache partitioning (WPP-L2). In LPHP and CRA, idle cache columns are dynamically shut down at run time, saving the power of accessing idle columns. In WPP-L2, way prediction supplies the predicted way information before the cache access; on a correct prediction, the access completes with the shortest latency and the least power, while on a misprediction the cache-partitioning policy limits the extra power overhead caused by the failed prediction. Validated with the SPEC2000 benchmarks and compared with a traditional shared L2 Cache using least-recently-used (LRU) replacement, the three proposed algorithms slightly affect program execution time but save 20.5%, 17%, and 64.6% of average L2 Cache access power, respectively, and even improve system throughput. The experiments show that the proposed methods can significantly reduce multicore processor power while maintaining system performance.
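The way-prediction idea in this abstract can be illustrated with a minimal sketch (all names and probe costs here are illustrative assumptions, not the paper's WPP-L2 algorithm): predict the most-recently-used way of a set, and fall back to probing the remaining ways on a misprediction.

```python
# Illustrative MRU-based way prediction for one set of a set-associative
# cache. Probe counts (1 on a correct prediction, 2 otherwise) are
# stand-ins for the latency/power difference described in the abstract.

class WayPredictedSet:
    """One cache set: predict the MRU way; on a mispredict, probe the rest."""

    def __init__(self, ways=4):
        self.tags = [None] * ways          # tag stored in each way
        self.mru = 0                       # predicted (most recently used) way
        self.hits = self.mispredicts = 0

    def access(self, tag):
        if self.tags[self.mru] == tag:     # predicted way hits: single probe
            self.hits += 1
            return 1
        self.mispredicts += 1
        for w, t in enumerate(self.tags):  # fall back: probe remaining ways
            if t == tag:
                self.mru = w
                return 2                   # extra probe on a misprediction
        victim = (self.mru + 1) % len(self.tags)   # naive fill policy
        self.tags[victim] = tag
        self.mru = victim
        return 2

s = WayPredictedSet(ways=4)
probes = [s.access(t) for t in ["A", "A", "B", "B", "A"]]
# repeated accesses to the same line hit the predicted way cheaply
```

The repeated accesses ("A", "A") and ("B", "B") show where prediction pays off: the second access in each pair needs only one probe.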

3.
When a chip multiprocessor runs several different programs, allocating an appropriate amount of cache space to each application is a hard problem. Cache partitioning is an effective way to solve it, and most existing partitioning schemes target the shared last-level cache. The private cache partitioning (PCP) method instead organizes multiple private caches together with a distributed coherence engine (DCE); a hardware profiling unit collects each program's hit distribution across the cache ways to guide the partitioning algorithm, and each DCE then partitions the cache space according to the algorithm's result. Experimental results show that PCP lowers the miss rate and improves program performance.

4.
Research progress on shared cache optimization for multicore processors
As technology advances, the number of cores and the size of on-chip caches in chip multiprocessors keep growing, and caches occupy most of the chip area, making on-chip caches the dominant contributor to power dissipation in the memory subsystem. Optimizing the on-chip cache is therefore the main way to improve memory-system efficiency and to enhance the computing performance of chip multiprocessors. This paper surveys the mainstream optimization techniques for shared caches, covering shared-cache management and coherence, and discusses future research directions.

5.
Modern microprocessor design relies heavily on detailed full-chip performance simulations to evaluate complex trade-offs. Typically, different design alternatives are tried out for a specific sub-system or component, while keeping the rest of the system unchanged. We observe that full-chip simulations for such studies are overkill. This paper introduces mesoscale simulation, which employs high-level modeling for the unchanged parts of a design and uses detailed cycle-accurate simulations for the components being modified. This combination of high-level and low-level modeling enables accuracy on par with detailed full-chip modeling while achieving much higher simulation speeds than detailed full-chip simulations. Consequently, mesoscale models can be used to quickly explore vast areas of the design space with high fidelity. We describe a proof-of-concept mesoscale implementation of the memory subsystem of the Cell/B.E. processor and discuss results from running various workloads.

6.
As demands on radar signal processing grow rapidly, reducing power consumption and size while meeting processing requirements has become a design challenge. A radar signal processor based on TI's multicore SoC 66AK2L06 was designed and implemented. Using the digital up/down-conversion modules and the JESD204B interface integrated in the 66AK2L06, the system realizes a multicore-SoC plus high-speed ADC/DAC architecture; compared with the traditional DSP+FPGA+high-speed AD/DA architecture, power consumption is reduced by about 40% and board area is greatly reduced, while the multiple cores and the FFT coprocessor of the 66AK2L06 also strengthen its computing capability.

7.
We present a novel architecture to develop Virtual Environments (VEs) for multicore CPU systems. An object-centric method provides a uniform representation of VEs. The representation enables VEs to be processed in parallel using a multistage, dual-frame pipeline. Dynamic work distribution and load balancing are accomplished using a thread migration strategy with minimal overhead. This paper describes our approach and shows through performance experiments that it is efficient and scalable. Near-linear speed-ups have been observed in experiments involving up to 1,000 deformable objects on a six-core i7 CPU. The approach's practicality is demonstrated with the development of a medical simulation trainer for a craniotomy procedure.

8.
Cache locking is often utilized to guarantee a tighter prediction of Worst-Case Execution Time (WCET), one of the most important performance metrics for embedded systems. However, in Multi-Processor Systems-on-Chip (MPSoC) with multiple tasks, the Level 2 (L2) cache is often shared among different tasks and cores, which leads to extended unpredictability of the cache. Task assignment has inherent relevance for cache behavior, while cache behavior also affects the efficiency of task assignment. Task assignment and cache behavior have dramatic influences on the overall WCET of an MPSoC. This paper proposes joint task assignment and cache partitioning techniques to minimize the overall WCET for MPSoC systems. Cache locking is applied to each task to guarantee a precise WCET. We prove that the joint problem is NP-hard and propose several efficient algorithms. Experimental results show that the proposed algorithms can consistently reduce the overall WCET compared to previous techniques.
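Since the joint problem is NP-hard, heuristics are the practical route. A minimal sketch of one such heuristic, purely illustrative and not one of the paper's algorithms: with cache ways divided evenly per core and each task's locked-cache WCET known, greedily place the longest tasks on the least-loaded core so the slowest core (which determines the overall WCET) stays as fast as possible.

```python
# Illustrative greedy heuristic for joint task assignment under a fixed
# per-core cache-way split. task_wcets maps task -> {ways: wcet}; the
# names, numbers, and even split are assumptions for the sketch.

def greedy_assign(task_wcets, n_cores, ways_per_core):
    """Place longest tasks first on the currently least-loaded core."""
    load = [0] * n_cores
    placement = {}
    # longest-processing-time-first order limits the worst-core load
    for task in sorted(task_wcets,
                       key=lambda t: -task_wcets[t][ways_per_core]):
        core = min(range(n_cores), key=lambda c: load[c])
        placement[task] = core
        load[core] += task_wcets[task][ways_per_core]
    return placement, max(load)        # overall WCET = slowest core's load

# hypothetical WCETs (cycles) when each task gets 2 locked ways
wcets = {"A": {2: 10}, "B": {2: 8}, "C": {2: 6}}
placement, overall_wcet = greedy_assign(wcets, n_cores=2, ways_per_core=2)
```

Here task A occupies one core alone while B and C share the other, giving an overall WCET of 14 rather than the 16 a naive A+C pairing would produce.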

9.
《Parallel Computing》2014,40(10):710-721
In this paper, we investigate the problem of fair storage cache allocation among multiple competing applications with diversified access rates. Commonly used cache replacement policies like LRU and most LRU variants are inherently unfair in cache allocation for heterogeneous applications: they implicitly give more cache to applications with high access rates and less cache to applications with slow access rates. However, applications with fast access rates do not always gain higher performance from the additional cache blocks, while the slow applications suffer poor performance with a reduced cache size. It is beneficial in terms of both performance and fairness to allocate cache blocks by their utility. We propose a partition-based cache management algorithm for a shared cache. The goal of our algorithm is to find an allocation such that all heterogeneous applications achieve a specified degree of fairness with as little performance degradation as possible. To achieve this goal, we present an adaptive partition framework, which partitions the shared cache among competing applications and dynamically adjusts the partition sizes based on predicted utility for both fairness and performance. We implement our algorithm in a storage simulator and evaluate fairness and performance with various workloads. Experimental results show that, compared with LRU, our algorithm achieves a large improvement in fairness and a slight improvement in performance.
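The utility-driven allocation described above can be sketched in a few lines (an illustrative greedy allocator, not the paper's adaptive framework; the miss curves are made-up numbers): repeatedly give the next cache block to whichever application's miss count drops the most from one more block.

```python
# Illustrative utility-based cache partitioning: allocate blocks by
# marginal utility (miss reduction per extra block), not by access rate.

def partition_by_utility(miss_curves, total_blocks):
    """miss_curves[a][n] = misses of application a with n cache blocks."""
    alloc = {a: 0 for a in miss_curves}
    for _ in range(total_blocks):
        # marginal utility of one extra block for each application
        gain = {a: miss_curves[a][alloc[a]] - miss_curves[a][alloc[a] + 1]
                for a in miss_curves}
        best = max(gain, key=gain.get)   # highest-utility app wins the block
        alloc[best] += 1
    return alloc

curves = {"fast": [100, 95, 92, 91, 90, 90],   # high rate, low utility
          "slow": [80, 50, 30, 20, 15, 14]}    # low rate, high utility
alloc = partition_by_utility(curves, total_blocks=4)
```

Note the outcome reverses what LRU would do: the low-rate application, whose misses fall steeply with extra cache, receives most of the blocks.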

10.
Interconnection has become one of the main concerns in current and future microprocessor designs, from both performance and power-consumption perspectives. Three-dimensional integration technology, with its capability to shorten wire length, is a promising method to mitigate interconnection-related issues. In this paper we implement a novel high-performance processor architecture based on a 3D on-chip cache to show the potential performance and power benefits achievable through 3D integration. We separate the logic modules from the cache module and stack the 3D cache with the processor, which reduces global interconnect and power consumption and improves access speed. The performance of the 3D processor and 3D cache at different technology nodes is simulated using 3D Cacti tools and theoretical algorithms. The results show that, compared with 2D, the power consumption of the storage system is reduced by about 50%, while access time and cycle time of the processor increase by 18.57% and 21.41%, respectively. The reduction in critical-path delay is up to 81.17%.

11.
Static cache partitioning can reduce inter-application cache interference and improve the composite performance of a cache-polluting application and a cache-sensitive application when they run on cores that share the last-level cache in the same multicore processor. In a virtualized system, since different applications may run on different virtual machines (VMs) at different times, partitioning the cache statically in advance is inapplicable. This paper proposes a dynamic cache partitioning scheme that uses hot-page detection and page migration to improve the composite performance of co-hosted virtual machines dynamically, according to prior knowledge of cache-sensitive applications. Experimental results show that the overhead of our page migration scheme is low, while in most cases the composite performance is an improvement over free composition.

12.
Using reconfigurable caches and dynamic voltage scaling, a phase-based low-energy algorithm, PBLEA (phase-based low energy algorithm), is proposed for the processor and its cache. The algorithm uses a phase-monitoring state machine built on instruction working-set signatures to judge whether a program phase has changed, and decides on adjustments to cache capacity and to CPU voltage and frequency. Within a phase, cache capacity and then CPU voltage and frequency are adjusted using a capacity-tuning state machine and a computed frequency scaling factor β. The algorithm was implemented on the Simpanalyzer simulator. Tests on the MiBench benchmark suite show that…

13.
In the future, multicore processors with hundreds of cores will collaborate on a single chip. More advanced network-on-chip (NoC) topologies will then be needed than today's shared busses for dual-core processors. Multistage interconnection networks, which are already used in parallel computers, seem a promising alternative. In this paper, a new network topology is introduced that particularly suits multicast traffic in multicore systems and parallel computers. These multilayer multistage interconnection networks are described by defining the main parameters of such a topology. Performance and cost of the new architecture are determined and compared to other network topologies. Network traffic consisting of constant-size packets and of varying-size packets is investigated. It is shown that all kinds of multicast traffic particularly benefit from the new topology.

14.
Fairness is a key optimization goal: when a system lacks fairness, problems such as thread starvation and priority inversion arise. Taking fairness optimization as the research objective, this work analyzes the current criteria for evaluating the fairness of shared cache partitioning, identifies the shortcomings of their evaluation metrics and partitioning policies, and proposes a new shared cache partitioning scheme. By introducing a new multithreaded fairness metric and improving an existing fair partitioning policy, the fairness of multithreaded execution is improved. Experimental results show that the proposed shared cache partitioning scheme significantly improves system fairness, and system throughput also increases.

15.
The growing influence of wire delay in cache design means that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessor (CMP) architectures to reduce requests to off-chip memory, because of the significant speed gap between processor and memory. Therefore, a bank replacement policy that efficiently manages the NUCA cache is desirable. However, the decentralized nature of NUCA has undermined the effectiveness of replacement policies, because banks operate independently of each other, and hence their replacement decisions are restricted to a single NUCA bank. In this paper, we propose three different techniques to deal with replacements in NUCA caches.

16.
To meet the security requirements of modern high-bandwidth Internet applications, an IPSec VPN system architecture based on the Tilera GX36 multicore network processing platform is proposed. Following SDN principles, the functional modules of the system's control plane and data plane are designed to enable flexible control of network traffic security. To meet the performance requirements of security-policy lookup, a hash-based three-level security-policy flow-table structure is proposed, and a policy lookup method is designed that uses the security-association flow tables in each Tile CPU's cache as the fast lookup data source. Test results show that for the typical small-, medium-, and large-packet scenarios on the Internet, the system achieves nearly 40 Gb/s of processing throughput.

17.
We consider the problem of power and performance management for a multicore server processor in a cloud computing environment through optimal server configuration for a specific application environment. The motivation of the study is that such optimal virtual server configuration is important for dynamic resource provisioning in a cloud computing environment, to optimize the power-performance tradeoff for certain specific types of applications. Our strategy is to treat a multicore server processor as an M/M/m queueing system with multiple servers. The system performance measures are the average task response time and the average power consumption. Two core speed and power consumption models are considered, namely the idle-speed model and the constant-speed model. Our investigation includes justification of centralized management of computing resources, server-speed-constrained optimization, power-constrained performance optimization, and performance-constrained power optimization. Our main results are: (1) cores should be managed in a centralized way to provide the highest performance without consuming more energy in cloud computing; (2) for a given server speed constraint, fewer high-speed cores perform better than more low-speed cores; furthermore, there is an optimal selection of server size and core speed, obtainable analytically, such that a multicore server processor consumes the minimum power; (3) for a given power consumption constraint, there is an optimal selection of server size and core speed, obtainable numerically, such that the best performance is achieved, i.e., the average task response time is minimized; (4) for a given task response time constraint, there is an optimal selection of server size and core speed, obtainable numerically, such that minimum power consumption is achieved while the given performance guarantee is maintained.
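Result (2), that fewer fast cores beat more slow cores at equal aggregate speed, can be checked with the standard M/M/m mean-response-time formula (Erlang C). This is a back-of-envelope sketch with made-up rates, not the paper's analysis.

```python
# M/M/m mean response time via Erlang C, used to compare two server
# configurations with the same aggregate speed. All rates are
# illustrative assumptions.
from math import factorial

def mm_m_response_time(lam, mu, m):
    """M/M/m queue: arrival rate lam, per-core service rate mu, m cores."""
    rho = lam / (m * mu)
    assert rho < 1, "system must be stable"
    a = lam / mu                       # offered load in Erlangs
    p0 = 1 / (sum(a**k / factorial(k) for k in range(m))
              + a**m / (factorial(m) * (1 - rho)))
    # Erlang C: probability an arriving task has to queue
    pq = (a**m / (factorial(m) * (1 - rho))) * p0
    wq = pq / (m * mu - lam)           # mean waiting time in queue
    return wq + 1 / mu                 # plus mean service time

# same aggregate speed (4 units), same arrival rate
t_fast = mm_m_response_time(lam=3.0, mu=2.0, m=2)  # 2 cores at speed 2
t_slow = mm_m_response_time(lam=3.0, mu=1.0, m=4)  # 4 cores at speed 1
```

With these numbers the two-fast-core configuration responds in about 1.14 time units versus about 1.51 for four slow cores, consistent with result (2).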

18.
Many-core array processors are widely used in high-performance computing thanks to their high computing performance and energy efficiency. Building processors for future high-performance computing systems requires overcoming the severe "memory wall" challenge as well as core-coordination problems. Array processors usually adopt single-threaded cores to reduce overhead, which places heavy demands on memory access. Hardware simultaneous multithreading is therefore introduced; to address the low utilization of the shared L2 cache observed experimentally with multithreaded single cores, a shared L2 cache partitioning mechanism is proposed. Simulation shows that with this optimized shared L2 cache partitioning mechanism, the L2 instruction cache miss rate drops by 18.59%, the L2 data cache miss rate drops by 6.60%, and overall CPI performance improves by 10.1%.

19.
A concept for a processor is described which scans a vector display file and fills the picture information into a 512-bit line buffer in real time.

20.
In order to satisfy the need for increasing computer processing power, there are significant changes in the design process of modern computing systems. Major chip vendors are deploying multicore or manycore processors in their product lines. Multicore architectures offer a tremendous amount of processing power; at the same time, they bring challenges for embedded systems, which suffer from limited resources. Various cache memory hierarchies have been proposed to satisfy the requirements of different embedded systems. Normally, a level-1 cache (CL1) is dedicated to each core, while the level-2 cache (CL2) can be shared (as in Intel Xeon and IBM Cell) or distributed (as in AMD Athlon). In this paper, we investigate the impact of the CL2 organization type (shared vs. distributed) on the performance and power consumption of homogeneous multicore embedded systems. We use the VisualSim and Heptane tools to model and simulate the target architectures running FFT, MI, and DFT applications. Experimental results show that by replacing a single-core system with an 8-core system, reductions in mean delay per core of 64% for distributed CL2 and 53% for shared CL2 are possible with little additional power (15% for distributed CL2 and 18% for shared CL2) for FFT. Results also reveal that the distributed CL2 hierarchy outperforms the shared CL2 hierarchy for all three applications considered, and for other applications with similar code characteristics.
