期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Proactive Use of Shared L3 Caches to Enhance Cache Communications in Multi-Core Processors

Fide Sevin Jenks Stephen 《Computer Architecture Letters》2008,7(2):57-60

The software and hardware techniques to exploit the potential of multi-core processors are falling behind, even though the number of cores and cache levels per chip is increasing rapidly. There is no explicit communications support available, and hence inter-core communications depend on cache coherence protocols, resulting in demand-based cache line transfers with their inherent latency and overhead. In this paper, we present Software Controlled Eviction (SCE) to improve the performance of multithreaded applications running on multi-core processors by moving shared data to shared cache levels before it is demanded from remote private caches. Simulation results show that SCE offers significant performance improvement (8-28%) and reduces L3 cache misses by 88-98%. 相似文献

2.

CCNoC: Cache-Coherent Network on Chip for Chip Multiprocessors 总被引：1，自引：1，他引：0

下载免费PDF全文

王惊雷薛一波王海霞李崇民汪东升《计算机科学技术学报》2010,25(2):257-266

As the number of cores in chip multiprocessors(CMPs) increases,cache coherence protocol has become a key issue in integration of chip multiprocessors.Supporting cache coherence protocol in large chip multiprocessors still faces three hurdles:design complexity,performance and scalability.This paper proposes Cache Coherent Network on Chip(CCNoC),a scheme that decouples cache coherency maintenance from processors and shared L2 caches and implements it completely in network on chip to free up processors and ... 相似文献

3.

多核架构下的数据处理算法优化策略综述

下载免费PDF全文

陈伟杜凌霞陈红《计算机科学与探索》2011,5(12):1057-1075

多核处理器,尤其是单芯片多处理器(chip multi-processor,CMP)能够提供强大的共享内存的并行资源,然而单核处理器上的程序和算法并不能充分利用多核架构提供的并行计算资源,因此必须针对多核体系架构特点,对算法进行改进优化,提高算法的执行性能。以优化程序局部性、减少cache访问冲突、提高线程并行度、充分利用单指令多数据流(single instruction multipledata,SIMD)并行和带宽优化等几方面为出发点,归纳和分析了多核处理器上数据处理算法的相关优化策略,并对多核算法进行了总结评述。最后阐述了该领域亟待解决的诸多问题,展望了未来的研究发展方向。相似文献

4.

Cache vulnerability mitigation using an adaptive cache coherence protocol

Mohammad Maghsoudloo Hamid R. Zarandi 《The Journal of supercomputing》2014,68(3):1048-1067

This paper proposes an adaptive cache coherence protocol to improve the reliability of caches against soft errors in shared-memory multi-core processors. The proposed protocol is conducted based-on a comprehensive study and analysis intended to determine the effects of cache coherence protocols on the characteristics of cache memories. The outcomes of this analysis indicate that differences in handling dirty data items play an important role to make distinction in favor of or against a cache coherence protocol. Based on the primary results, the proposed protocol tries to enhance the reliability of caches by means of sharing management. Sharing is dynamically adjusted according to the operational mode of CPU. The experimental results show that proposed protocol leads to about 16 % improvements in MTTF, with no performance degradation and with negligible bandwidth and cache energy consumption overheads compared to previous works. 相似文献

5.

Shared write buffer to boost applications on SpMT architecture

Ming Chen John Ye Tianzhou Chen Hongjun Dai 《The Journal of supercomputing》2017,73(8):3508-3525

With the trend of growing number of integrated processing cores on Chip Multiprocessors, researchers are working hard to increase the available parallelism of software programs so as to efficiently harness the growing computing power. One noticeable direction among these efforts is speculative multi-threading (SpMT), a.k.a thread level speculation, which aims to extract thread level parallelism by splitting a sequential execution thread into several finer ones and execute them on parallel. A SpMT thread is in speculative status before it “knows” all its input data are correct. A speculative thread needs to write to the L1 cache, but its output might be discarded if the speculation eventually fails. However, another speculative thread may have already read in such speculative output. Therefore, some mechanism is needed to support speculative read and write. And because the SpMT threads are extracted from a single thread, they usually share lots of data, thus there might be intense data coherence among the L1 caches. It would be very complicated to support data coherence and speculation together. This Paper proposes a shared write buffer among the SpMT cores. We are able to confine the speculative read and write in the SWB, thus the speculation will not interference with coherence, and the L1 cache design could be drastically simplified. Experiments show that the SWB can capture a big portion of inter-core data sharing, reduce cache coherence, and drastically improve data access performance of SpMT threads. 相似文献

6.

Improving Adaptability and Per-Core Performance of Many-Core Processors Through Reconfiguration

Tameesh Suri Aneesh Aggarwal 《International journal of parallel programming》2010,38(3-4):203-224

Increasing the number of cores in a multi-core processor can only be achieved by reducing the resources available in each core, and hence sacrificing the per-core performance. Furthermore, having a large number of homogeneous cores may not be effective for all the applications. For instance, threads with high instruction level parallelism will under-perform considerably in the resource-constrained cores. In this paper, we propose a core architecture that can be adapted to improve a single thread’s performance or to execute multiple threads. In particular, we integrate Reconfigurable Hardware Unit (RHU) in the resource-constrained cores of a many-core processor. The RHU can be reconfigured to execute the frequently encountered instructions from a thread in order to increase the core’s overall execution bandwidth, thus improving its performance. On the other hand, if the core’s resources are sufficient for a thread, then the RHU can be configured to executed instructions from a different thread to increase the thread level parallelism. The RHU has low area overhead, and hence has minimal impact on scalability of the number of cores. To further limit the area overhead of this mechanism, generation of the reconfiguration bits for the RHUs of multiple cores is delegated to a single core. In this paper, we present the results for using the RHU to improve a single thread’s performance. Our experiments show that the proposed architecture improves the per-core performance by an average of about 23% across a wide range of applications. 相似文献

7.

Performance tradeoffs in multithreaded processors 总被引：1，自引：0，他引：1

Agarwal A. 《Parallel and Distributed Systems, IEEE Transactions on》1992,3(5):525-539

An analytical performance model for multithreaded processors that includes cache interference, network contention, context-switching overhead, and data-sharing effects is presented. The model is validated through the author's simulations and by comparison with previously published simulation results. The results indicate that processors can substantially benefit from multithreading, even in systems with small caches, provided sufficient network bandwidth exists. Caches that are much larger than the working-set sizes of individual processes yield close to full processor utilization with as few as two to four contexts. Smaller caches require more contexts to keep the processor busy, while caches that are comparable in size to the working-sets of individual processes cannot achieve a high utilization regardless of the number of contexts. Increased network contention due to multithreading has a major effect on performance. The available network bandwidth and the context-switching overhead limits the best possible utilization 相似文献

8.

Runtime scheduling of dynamic parallelism on accelerator-based multi-core systems

《Parallel Computing》2007,33(10-11):700-719

We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. We investigate user-level schedulers that dynamically “rightsize” the dimensions and degrees of parallelism on the cell broadband engine. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. We evaluate recently introduced schedulers for event-driven execution and utilization-driven dynamic multi-grain parallelization on Cell. We also present a new scheduling scheme for dynamic multi-grain parallelism, S-MGPS, which uses sampling of dominant execution phases to converge to the optimal scheduling algorithm. We evaluate S-MGPS on an IBM Cell BladeCenter with two realistic bioinformatics applications that infer large phylogenies. S-MGPS performs within 2–10% of the optimal scheduling algorithm in these applications, while exhibiting low overhead and little sensitivity to application-dependent parameters. 相似文献

9.

A queueing theoretic approach for performance evaluation of low-power multi-core embedded systems

Arslan Munir Ann Gordon-Ross Sanjay Ranka Farinaz Koushanfar 《Journal of Parallel and Distributed Computing》2014

With Moore’s law supplying billions of transistors on-chip, embedded systems are undergoing a transition from single-core to multi-core to exploit this high transistor density for high performance. However, the optimal layout of these multiple cores along with the memory subsystem (caches and main memory) to satisfy power, area, and stringent real-time constraints is a challenging design endeavor. The short time-to-market constraint of embedded systems exacerbates this design challenge and necessitates the architectural modeling of embedded systems to reduce the time-to-market by expediting target applications to device/architecture mapping. In this paper, we present a queueing theoretic approach for modeling multi-core embedded systems that provides a quick and inexpensive performance evaluation both in terms of time and resources as compared to the development of multi-core simulators and running benchmarks on these simulators. We verify our queueing theoretic modeling approach by running SPLASH-2 benchmarks on the SuperESCalar simulator (SESC). Results reveal that our queueing theoretic model qualitatively evaluates multi-core architectures accurately with an average difference of 5.6% as compared to the architectures’ evaluations from the SESC simulator. Our modeling approach can be used for performance per watt and performance per unit area characterizations of multi-core embedded architectures, with varying number of processor cores and cache configurations, to provide a comparative analysis. 相似文献

10.

Reliability improvement in private non-uniform cache architecture using two enhanced structures for coherence protocols and replacement policies

Mohammad Maghsoudloo Hamid R. Zarandi 《Microprocessors and Microsystems》2014

In this paper, a comprehensive study is first conducted to investigate the effects of cache coherence protocols and cache replacement policies on the characteristics of NUCA in current many-core processors. The main focus of this study is to analyze the effects of coherence protocols and replacement policies on the vulnerability of caches. The outcomes of this analysis indicate two facts: (i) Differences in handling write operations play an important role to make distinction in favor of or against a cache coherence protocol; (ii) Near-optimal solutions for replacement problem, aimed at enhancing the performance, can also make positive influence on reduction of cache vulnerability factor. Based on the results of first step, two schemes are introduced to enhance the reliability of caches by applying some modification on the structures of cache coherence protocols and cache replacement policies. The first scheme tries to manage sharing of the dirty data items among different same-level caches. The second helps to give priority and more opportunity to old dirty blocks than clean blocks for replacement. The proposed schemes reveal about 18% improvement in MTTF, with negligible performance, bandwidth and energy consumption overhead compared to previous cache structures. 相似文献

11.

An Evaluation of an OS-Based Coherence Scheme for Tiled CMPs

Christian Fensch Marcelo Cintra 《International journal of parallel programming》2011,39(3):271-295

The interconnect mechanisms (shared bus or crossbar) used in current chip-multiprocessors (CMPs) are expected to become a bottleneck that prevents these architectures from scaling to a larger number of cores. Tiled CMPs offer better scalability by integrating relatively simple cores with a lightweight point-to-point interconnect. However, such interconnects make snooping impractical and, thus, require alternative solutions to cache coherence. In this article, we investigate a novel, cost-effective mechanism to support shared-memory parallel applications that forgoes hardware maintained cache coherence. This mechanism is based on the key ideas that mapping of lines to physical caches is done at the page level with OS support and that hardware supports remote cache accesses. We extend our previous work by investigating in detail the impact of system design parameters and extending the system to support multi-level cache hierarchies. Results show that the choice of implementation of multi-level cache hierarchies can have a significant impact on performance. 相似文献

12.

Amdahl’s law for multithreaded multicore processors

Hao Che Minh Nguyen 《Journal of Parallel and Distributed Computing》2014

In this paper, we conduct performance scaling analysis of multithreaded multicore processors (MMPs) for parallel computing. We propose a thread-level closed-queuing network model covering a fairly large design space, accounting for hardware scaling models, coarse-grain, fine-grain, and simultaneous multithreading (SMT) cores, shared resources, including cache, memory, and critical sections. We then derive a closed-form solution for this model in terms of speedup performance measure. This solution makes it possible to analyze performance scaling properties of MMPs along multiple dimensions. In particular, we show that for the parallelizable part of the workload, the speedup, in the absence of resource contention, is no longer just a linear function of parallel processing unit counts, as predicted by Amdahl’s law, but also a strong function of workload characteristics, ranging from strong memory-bound to strong CPU-bound workloads. We also find that with core multithreading, super linear speedup, higher than that predicted by Amdahl’s law, may be achieved for the parallelizable part of the workload, if core threads exhibit strong cache affinity and the workload is strongly memory-bound. Then, we derive a tight speedup upper bound in the presence of both memory resource contention and critical section for multicore processors with single-threaded cores. This speedup upper bound indicates that with resource contention among threads, whether it is due to shared memory or critical section, a sequential term is guaranteed to emerge from the parallelizable part of the workload, fundamentally limiting the scalability of multicore processors for parallel computing, in addition to the sequential part of the workload, as dictated by Amdahl’s law. As a result, to improve speedup performance for MMPs, one should strive to enhance memory parallelism and confine critical sections as locally as possible, e.g., to the smallest possible number of threads in the same core. 相似文献

13.

指导cache静态划分的程序性能profiling优化技术

贾耀仓武成岗张兆庆《计算机研究与发展》2012,49(1):93-102

对于共享cache的多核处理器,如何管理好各个核对cache的利用,对于充分发挥多核处理器性能是很关键的问题.目前采用的cache替换方法程序间会出现性能干扰,cache静态划分技术则是通过为同时运行的程序分配不同的空间来解决性能干扰问题.为了给程序分配合适大小的cache空间,需要对程序进行性能profiling,即事先多遍运行收集程序在各种cache容量下的性能数据,这种性能profiling方法开销巨大,影响实用.为了解决性能profiling需要多遍运行程序的问题,提出了只需单遍运行的程序性能profiling优化技术.该技术利用在线的phase分析技术识别程序的运行阶段,避免对相同阶段的重复profiling;同时分析程序各phase的性能同cache容量变化的关系趋势,对于性能不敏感的容量变化则不进行profiling,降低开销.在程序运行结束后通过程序各phase在cache各种容量下的性能来估计程序在各容量下的整体性能,以指导cache静态划分.实验表明,该技术的开销仅为7%,而该方法指导的cache划分比未划分时有8%的性能改进,同多遍运行的程序性能profiling指导的cache划分性能相比仅有1%的下降. 相似文献

14.

The DASH prototype: Logic overhead and performance 总被引：1，自引：0，他引：1

Lenoski D. Laudon J. Joe T. Nakahira D. Stevens L. Gupta A. Hennessy J. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(1):41-61

The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. The hardware overhead of directory-based cache coherence in a 48-processor is examined. The data show that the overhead is only about 10-15%, which appears to be a small cost for the ease of programming offered by coherent caches and the potential for higher performance. The performance of the system is discussed, and the speedups obtained by a variety of parallel applications running on the prototype are shown. Using a sophisticated hardware performance monitor, the effectiveness of coherent caches and the relationship between an application's reference behavior and its speedup are characterized. The optimizations incorporated in the DASH protocol are evaluated in terms of their effectiveness on parallel applications and on atomic tests that stress the memory system 相似文献

15.

A superlinear speedup region for matrix multiplication

Marjan Gusev Sasko Ristov 《Concurrency and Computation》2014,26(11):1847-1868

The realization of modern processors is based on a multicore architecture with increasing number of cores per processor. Multicore processors are often designed such that some level of the cache hierarchy is shared among cores. Usually, last level cache is shared among several or all cores (e.g., L3 cache) and each core possesses private low level caches (e.g., L1 and L2 caches). Superlinear speedup is possible for matrix multiplication algorithm executed in a shared memory multiprocessor due to the existence of a superlinear region. It is a region where cache requirements for matrix storage of the sequential execution incur more cache misses than in parallel execution. This paper shows theoretically and experimentally that there is a region, where the superlinear speedup can be achieved. We provide a theoretical proof of existence of a superlinear speedup and determine boundaries of the region where it can be achieved. The experiments confirm our theoretical results. Therefore, these results will have impact on future software development and exploitation of parallel hardware on the basis of a shared memory multiprocessor architecture. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

16.

多核处理机系统Cache管理技术研究现状 总被引：1，自引：0，他引：1

下载免费PDF全文

所光杨学军《计算机工程与科学》2010,32(7):65-68

多核处理器的Cache结构设计和管理是微处理器设计领域的重要问题。当前主流的商用微处理器均采用共享最后一级Cache的系统结构,而片上最后一级Cache的性能通常对处理器的性能影响较大,因此共享Cache的管理问题成为当前研究热点。本文首先介绍当前主流多核处理器及其设计问题,然后介绍了共享Cache管理的三项重要技术:线程调度、NUCA和Cache划分,最后给出多核处理器Cache管理技术的发展方向。相似文献

17.

基于Simics的分布式一致性协议仿真

郑志硕郑存陆曹宏徙《计算机与现代化》2011,(9):105-108

用于多种计算机系统和指令系统仿真的Virtutech Simics只提供一个简单的顺序扁平侦听式高速缓存一致性（Snoo-ping Cache Coherence Protocol）模型支持MESI协议,从而制约了可仿真的并行处理器个数。以下将基于目录的分布式高速缓存一致性协议（Distributed Directory-based Cache Coherence Protocol）模型应用于Simics中并给出基于Simics的分布式一致性协议的仿真结果。这一结果证实分布式协议能降低事件总数,减少网络中的事件。本文提出一个简单的基于目录的分布式高速缓存一致性协议,从而解决制约Simics的可扩放性问题。相似文献

18.

Data-type specific cache compression in GPGPUs

Ehsan Atoofian Sean Rea 《The Journal of supercomputing》2018,74(4):1609-1635

In this paper, we evaluate compressibility of L1 data caches and L2 cache in general-purpose graphics processing units (GPGPUs). Our proposed scheme is geared toward improving performance and power of GPGPUs through cache compression. GPGPUs are throughput-oriented devices which execute thousands of threads simultaneously. To handle working set of this massive number of threads, modern GPGPUs exploit several levels of caches. GPGPU design trend shows that the size of caches continues to grow to support even more thread level parallelism. We propose using cache compression to increase effective cache capacity, improve performance, and reduce power consumption in GPGPUs. Our work is motivated by the observation that the values within a cache block are similar, i.e., the arithmetic difference of two successive values within a cache block is small. To reduce data redundancy in L1 data caches and L2 cache, we use low-cost and implementation-efficient base-delta-immediate (BDI) algorithm. BDI replaces a cache block with a base and an array of deltas where the combined size of the base and deltas is less than the original cache block. We also study locality of fields in integer and floating-point numbers. We found that entropy of fields varies across different data types. Based on entropy, we offer different BDI compression schemes for integer and floating-point numbers. We augment a simple, yet effective, predictor that determines type of values dynamically in hardware and without the help of a compiler or a programmer. Evaluation results show that on average, cache compression improves performance by 8% and saves energy of caches by 9%. 相似文献

19.

硬件结构支持的基于同步的高速缓存一致性协议

黄河刘磊宋风龙马啸宇《计算机学报》2009,32(8)

共享存储系统中如何高效地实现高速缓存一致性是体系结构设计面临的一个关键问题和难点问题.已有的基于目录的协议存在难于实现、验证复杂和存储空间开销大等问题.面向片上众核处理器,文中提出一种由硬件结构支持、基于同步的高速缓存一致性协议.该方案不使用目录,而是通过使用bloom-filter表示一致性信息,并在并行程序中的同步点维护高速缓存一致性.与现有的基于目录的高速缓存一致性协议相比,该方案可以降低目录协议的实现、验证复杂度.用SPLASH一2测试程序集评估表明,基于同步的协议可以获得与基于目录的协议相当的性能. 相似文献

20.

支持多核并行程序确定性重放的高效访存冲突记录方法 总被引：2，自引：0，他引：2

刘磊黄河唐志敏《计算机研究与发展》2012,49(1):64-75

多核系统中并行程序执行过程的不确定性给程序调试带来了很大的困难.准确记录初始执行中冲突访存的次序是并行程序确定性重放的基础.提出了通过建立精确happens-before关系记录访存冲突的方法.此方法利用简洁高效的地址冲突检测机制确定冲突访存操作在执行中所处happens-before序关系的位置,可以抑制部分记录信息的产生,从而有效减少记录信息.与其他方式方法相比,可以进一步压缩17%的记录条数.采用逻辑向量时钟描述冲突访存操作间的happens-before关系,与采用标量时钟相比,可以避免happens-before关系的误识,降低重放执行时并行度的损失. 相似文献