期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Locality-aware data replication in the last-level cache for large scale multicores

Farrukh Hijaz Qingchuan Shi George Kurian Srinivas Devadas Omer Khan 《The Journal of supercomputing》2016,72(2):718-752

相似文献

2.

面向多核处理器的共享cache优化研究进展 总被引：1，自引：0，他引：1

陈占龙张丁文吴亮臧英《计算机应用研究》2014,31(10)

由于技术的发展,片上多核处理器上的核数量和片上缓存的大小一直在增长,且缓存占据了芯片的大部分面积,使得片上缓存所消耗的能量成为存储器子系统中功率损耗的主要贡献者,因此对片上缓存进行优化是提高存储器系统效率的主要途径,增强了片上多核处理器的运算性能.针对共享缓存的管理、一致性等方面介绍了共享缓存的主流优化技术,并探讨了未来的研究方向. 相似文献

3.

Mesoscale performance simulation of multicore processor systems

Peter Altevogt Tibor Kiss Mike Kistler Ram Rangan 《Software and Systems Modeling》2013,12(4):731-744

Modern microprocessor design relies heavily on detailed full-chip performance simulations to evaluate complex trade-offs. Typically, different design alternatives are tried out for a specific sub-system or component, while keeping the rest of the system unchanged. We observe that full-chip simulations for such studies is overkill. This paper introduces mesoscale simulation, which employs high-level modeling for the unchanged parts of a design and uses detailed cycle-accurate simulations for the components being modified. This combination of high-level and low-level modeling enables accuracy on par with detailed full-chip modeling while achieving much higher simulation speeds than detailed full-chip simulations. Consequently, mesoscale models can be used to quickly explore vast areas of the design space with high fidelity. We describe a proof-of-concept mesoscale implementation of the memory subsystem of the Cell/B.E. processor and discuss results from running various workloads. 相似文献

4.

基于多核SoC的雷达信号处理机设计

《微型机与应用》2017,(12)

随着雷达信号处理需求的快速增加,在满足处理需求的同时,降低功耗和缩小体积成为设计的难点。设计和实现了基于TI公司多核SoC芯片66AK2L06的雷达信号处理机系统。该系统利用66AK2L06集成的数字上/下变频模块和JESD204B接口,实现了多核SoC+高速ADC/DAC的处理架构,相较于传统的DSP+FPGA+高速AD/DA架构,功耗降低了40%左右,布板面积也大为减小,同时利用66AK2L06的多核心及FFT协处理器,运算能力也得到了增强。相似文献

5.

A pipeline virtual environment architecture for multicore processor systems

Eric Acosta Alan Liu 《The Visual computer》2012,28(11):1099-1114

We present a novel architecture to develop Virtual Environments (VEs) for multicore CPU systems. An object-centric method provides a uniform representation of VEs. The representation enables VEs to be processed in parallel using a multistage, dual-frame pipeline. Dynamic work distribution and load balancing is accomplished using a thread migration strategy with minimal overhead. This paper describes our approach, and shows it is efficient and scalable with performance experiments. Near linear speed-ups have been observed in experiments involving up to 1,000 deformable objects on a six-core i7 CPU. This approach’s practicality is demonstrated with the development of a medical simulation trainer for a craniotomy procedure. 相似文献

6.

Joint task assignment and cache partitioning with cache locking for WCET minimization on MPSoC

Tiantian Liu Yingchao Zhao Minming Li Chun Jason XueAuthor vitae 《Journal of Parallel and Distributed Computing》2011,71(11):1473-1483

Cache locking technique is often utilized to guarantee a tighter prediction of Worst-Case Execution Time (WCET) which is one of the most important performance metrics for embedded systems. However, in Multi-Processor Systems-on-Chip (MPSoC) systems with multi-tasks, Level 2 (L2) cache is often shared among different tasks and cores, which leads to extended unpredictability of cache. Task assignment has inherent relevancy for cache behavior, while cache behavior also affects the efficiency of task assignment. Task assignment and cache behavior have dramatic influences on the overall WCET of MPSoC. This paper proposes joint task assignment and cache partitioning techniques to minimize the overall WCET for MPSoC systems. Cache locking is applied to each task to guarantee a precise WCET. We prove that the joint problem is NP-hard and propose several efficient algorithms. Experimental results show that the proposed algorithms can consistently reduce the overall WCET compared to previous techniques. 相似文献

7.

High-performance processor design based on 3D on-chip cache

《Microprocessors and Microsystems》2016

Interconnection becomes one of main concerns in current and future microprocessor designs from both performance and consumption. Three-dimensional integration technology, with its capability to shorten the wire length, is a promising method to mitigate the interconnection related issues. In this paper we implement a novel high-performance processor architecture based 3D on-chip cache to show the potential performance and power benefits achievable through 3D integration technology. We separate other logic module and cache module and stack 3D cache with the processor which reduces the global interconnection, power consumption and improves access speed. The performance of 3D processor and 3D cache at different node is simulated using 3D Cacti tools and theoretical algorithms. The results show that comparing with 2D, power consumption of the storage system is reduced by about 50%, access time and cycle time of the processor increase 18.57% and 21.41%, respectively. The reduced percentage of the critical path delay is up to 81.17%. 相似文献

8.

Heterogeneous-aware cache partitioning: Improving the fairness of shared storage cache

《Parallel Computing》2014,40(10):710-721

In this paper, we investigate the problem of fair storage cache allocation among multiple competing applications with diversified access rates. Commonly used cache replacement policies like LRU and most LRU variants are inherently unfair in cache allocation for heterogeneous applications. They implicitly give more cache to the applications that has high access rate and less cache to the applications of slow access rate. However, applications of fast access rate do not always gain higher performance from the additional cache blocks. In contrast, the slow application suffer poor performance with a reduced cache size. It is beneficial in terms of both performance and fairness to allocate cache blocks by their utility.In this paper, we propose a partition-based cache management algorithm for a shared cache. The goal of our algorithm is to find an allocation such that all heterogeneous applications can achieve a specified fairness degree as least performance degradation as possible. To achieve this goal, we present an adaptive partition framework, which partitions the shared cache among competing applications and dynamically adjusts the partition size based on predicted utility on both fairness and performance. We implement our algorithm in a storage simulator and evaluate the fairness and performance with various workloads. Experimental results show that, compared with LRU, our algorithm achieves large improvement in fairness and slightly in performance. 相似文献

9.

MLMIN: A multicore processor and parallel computer network topology for multicast

Dietmar Tutsch Günter Hommel 《Computers & Operations Research》2008,35(12):3807

In future, multicore processors with hundreds of cores will collaborate on a single chip. Then, more advanced network-on-chip (NoC) topologies will be needed than today's shared busses for dual core processors. Multistage interconnection networks, which are already used in parallel computers, seem to be a promising alternative. In this paper, a new network topology is introduced that particularly applies to multicast traffic in multicore systems and parallel computers. Those multilayer multistage interconnection networks are described by defining the main parameters of such a topology. Performance and costs of the new architecture are determined and compared to other network topologies. Network traffic consisting of constant size packets and of varying size packets is investigated. It is shown that all kinds of multicast traffic particularly benefit from the new topology. 相似文献

10.

Replacement techniques for dynamic NUCA cache designs on CMPs

Javier Lira Carlos Molina Ryan N. Rakvic Antonio González 《The Journal of supercomputing》2013,64(2):548-579

The growing influence of wire delay in cache design has meant that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessors (CMP) architectures to reduce requests to the offchip memory, because of the significant speed gap between processor and memory. Therefore, a bank replacement policy that efficiently manages the NUCA cache is desirable. However, the decentralized nature of NUCA has eliminated the effectiveness of replacement policies because banks operate independently of each other, and hence their replacement decisions are restricted to a single NUCA bank. In this paper, we propose three different techniques to deal with replacements in NUCA caches. 相似文献

11.

片上多核处理器共享Cache划分的公平性研究

方娟蒲江张欣《计算机工程与设计》2010,31(15)

公平性是一个关键的优化问题,当系统缺乏公平时,会出现线程饿死和优先级反转等问题.以公平性优化作为研究目标,分析当前共享Cache划分公平性的评价标准,找出了其评价参数和划分策略的不足,提出了一种新的共享Cache划分方案.通过提出一个新的多线程公平性评价指标并改进了已有的公平划分策略,从而提高多线程运行的公平性.实验结果表明,该共享Cache划分方案显著提高了系统公平性,并且系统吞吐量也有提高. 相似文献

12.

Optimal configuration of a multicore server processor for managing the power and performance tradeoff

Keqin Li 《The Journal of supercomputing》2012,61(1):189-214

We consider the problem of power and performance management for a multicore server processor in a cloud computing environment by optimal server configuration for a specific application environment. The motivation of the study is that such optimal virtual server configuration is important for dynamic resource provision in a cloud computing environment to optimize the power and performance tradeoff for certain specific type of applications. Our strategy is to treat a multicore server processor as an M/M/m queueing system with multiple servers. The system performance measures are the average task response time and the average power consumption. Two core speed and power consumption models are considered, namely, the idle-speed model and the constant-speed model. Our investigation includes justification of centralized management of computing resources, server speed constrained optimization, power constrained performance optimization, and performance constrained power optimization. Our main results are (1) cores should be managed in a centralized way to provide the highest performance without consumption of more energy in cloud computing; (2) for a given server speed constraint, fewer high-speed cores perform better than more low-speed cores; furthermore, there is an optimal selection of server size and core speed which can be obtained analytically, such that a multicore server processor consumes the minimum power; (3) for a given power consumption constraint, there is an optimal selection of server size and core speed which can be obtained numerically, such that the best performance can be achieved, i.e., the average task response time is minimized; (4) for a given task response time constraint, there is an optimal selection of server size and core speed which can be obtained numerically, such that minimum power consumption can be achieved while the given performance guarantee is maintained. 相似文献

13.

A processor for real time TV raster scan conversion

Rolf Lindner 《Computers & Graphics》1979,4(1):23-28

A concept for a processor is described, which scans a vector display file and fills the picture information into a 512 bit line buffer in real time. 相似文献

14.

Impact of level-2 cache sharing on the performance and power requirements of homogeneous multicore embedded systems

Abu Fadi N. Manira 《Microprocessors and Microsystems》2009,33(5-6):388-397

In order to satisfy the needs for increasing computer processing power, there are significant changes in the design process of modern computing systems. Major chip-vendors are deploying multicore or manycore processors to their product lines. Multicore architectures offer a tremendous amount of processing speed. At the same time, they bring challenges for embedded systems which suffer from limited resources. Various cache memory hierarchies have been proposed to satisfy the requirements for different embedded systems. Normally, a level-1 cache (CL1) memory is dedicated to each core. However, the level-2 cache (CL2) can be shared (like Intel Xeon and IBM Cell) or distributed (like AMD Athlon). In this paper, we investigate the impact of the CL2 organization type (shared Vs distributed) on the performance and power consumption of homogeneous multicore embedded systems. We use VisualSim and Heptane tools to model and simulate the target architectures running FFT, MI, and DFT applications. Experimental results show that by replacing a single-core system with an 8-core system, reductions in mean delay per core of 64% for distributed CL2 and 53% for shared CL2 are possible with little additional power (15% for distributed CL2 and 18% for shared CL2) for FFT. Results also reveal that the distributed CL2 hierarchy outperforms the shared CL2 hierarchy for all three applications considered and for other applications with similar code characteristics. 相似文献

15.

On the effectiveness of cache partitioning in hard real-time systems

Sebastian Altmeyer Roeland Douma Will Lunniss Robert I. Davis 《Real-Time Systems》2016,52(5):598-643

相似文献

16.

Design and implementation of dual processor block with shared external cache memory

Soo-Won Kim Hanseok Ko Woo-Jong Hahn Jong-Sik HahmAuthor vitae 《Microprocessors and Microsystems》1997,20(10):595-605

The availability of low cost, high performance microprocessors has led to various designs of shared memory multiprocessor systems. As a result, commercial products which are based on shared memory have been proliferated. Such a multiprocessor system is heavily influenced by the structure of memory system and it is not difficult to find that most configurations include local cache memories. The more processors a system carries, the larger local cache memory is needed to maintain the traffic to and from the shared memory at reasonable level. The implementation of local cache memories, however, is not a simple task because of environmental limitations. In particular, the general lack of board space availability presents a formidable problem. A cache memory system usually needs space mostly to support its complex control logic circuits for the cache itself and network interfaces like snooping logic circuits for shared bus. Although packaging can be made denser to reduce system size, there are still multiple processors per board. It requires a more area-efficient cache memory architecture. This paper presents a design of shared cache for dual processor board of bus-based symmetric multiprocessors. The design and implementation issues are described first and then the evaluation and measurement results are discussed. The shared cache proposed in this paper has been determined to be quite area-efficient without the significant loss of throughput and scalability. It has been implemented as a plug-in unit for TICOM, a prevalent commercial multiprocessor system. 相似文献

17.

A dynamic associative processor for machine vision applications

Herrmann F.P. Sodini C.G. 《Micro, IEEE》1992,12(3):31-41

The use of massively parallel associative processors as coprocessors for accelerating machine vision applications is considered. They achieve very fine granularity, as every word of memory functions as a simple processing element. A dense, dynamic, content-addressable memory cell supports fully parallel operation, and pitch-matched word logic improves arithmetic performance with minimal area cost. An asynchronous reconfigurable mesh network handles interprocessor communication and image input/output, and an area-efficient pass-transistor circuit counts and prioritizes responders. Some applications are discussed 相似文献

18.

Complete high-performance cache system for the 80386

Mammad Safai Robert Stodieck 《Microprocessors and Microsystems》1990,14(10):664-675

The 80386 is a high-performance third-generation microprocessor that is now standard in most top-of-the-range PCs. Like all similar processors operating at clock rates above 30 MHz, the 80386 must use cache memory if it is to operate efficiently. Without cache memory, the user must either pay a very high price for very fast RAM or employ slower memory by introducing wait states. This application note describes the 80386 bus interface and demonstrates how it can be interfaced to IDT cache tag RAMs to create a cache system. Although the report describes a relatively basic cache system, it covers all design considerations ranging from system timing to the programming of the PALs needed to implement the interface. A.C. 相似文献

19.

Chain grouping: a method for partitioning loops onto mesh-connected processor arrays

《Parallel and Distributed Systems, IEEE Transactions on》2000,11(9):941-955

This paper presents Chain Grouping, a new low complexity method for the problem of partitioning the loop iteration space into groups with little intercommunication requirements, for mapping onto mesh-connected architectures. First, the iterations are scheduled in time, according to the hyperplane method, taking into consideration the minimum time displacement. Then, the iteration space is divided into discrete groups of related iterations, which are assigned to different processors, while preserving the optimal completion time. Chain Grouping is based on clustering together neighboring uniform chains of iterations, formed by a particular dependence vector. This vector will be proven as the best among all to reduce the total communication requirements. Inside every group, the optimal hyperplane scheduling is preserved and references to intragroup iterations are considerably increased. The partitioned groups are afterward assigned to meshes of processors. The resulting space mapping maximizes processor utilization and cuts down overall communication delays while preserving the optimal hyperplane time schedule. 相似文献

20.

DFTS: A dynamic fault-tolerant scheduling for real-time tasks in multicore processors

Mohammad H. Mottaghi Hamid R. Zarandi 《Microprocessors and Microsystems》2014

This paper presents a dynamic scheduling for real-time tasks in multicore processors to tolerate single and multiple transient faults. The scheduling is performed based on three important issues: (1) current released tasks, (2) current available processor cores, and (3) consideration of the number of faults and their occurrences. Using tasks utilization along with a defined criticality threshold in the proposed scheduling method, current ready tasks are divided into critical- and noncritical ones. Based on whether a task is critical or noncritical, an appropriate fault-tolerance policy is exploited. Moreover, scheduling decisions are made to fulfill two key goals: (1) increasing scheduling feasibility and (2) decreasing the total tasks execution time. Several simulation experiments are carried out to compare the proposed method with two well-known methods, called checkpointing with rollback recovery and hardware replication. Experimental results reveal that in the presence of multiple transient faults, the feasibility rate of the proposed method is considerably higher than the other well-known fault-tolerance methods. Moreover, the average timing overhead of this method is lower than the traditional methods. 相似文献