Similar Documents
 17 similar documents found (search time: 171 ms)
1.
Multicore processors increase computing capability by adding processor cores. Although processor resources can be utilized by running multiple programs at the same time, the real success of multicore processors depends on solving the difficulties of parallel application development; co-development of processor architecture and programming models is therefore necessary. As the number of cores grows, the traditionally used software simulators, being inherently serial, perform worse and worse and cannot support such hardware/software co-development. The inherent parallelism of FPGAs gives them high simulation performance and high scalability when simulating multicore processors, making them an ideal tool for processor architecture research. This paper presents RAMP-Pink, an FPGA-based multicore simulation system. Built on HASim, it supports both transactional memory and thread-level speculation and is intended for hardware/software co-development of these two techniques. The simulation system can be configured for different FPGA development platforms and can also run in pure software simulation mode.

2.
路放, 安虹, 梁博, 任建. 《计算机科学》, 2006, 33(1): 158-163
Simultaneous multithreading (SMT) is one of the current research hotspots in microprocessor architecture. To support in-depth research on SMT and on chip multiprocessor (CMP) architectures built from SMT cores, we developed OpenSMT, an SMT architecture simulator, on top of the widely used superscalar architecture simulator SimpleScalar by suitably abstracting the key features of SMT structures. This paper describes the simulator's main design ideas and implementation methods, including the representation of multiple thread contexts, the simulation of each pipeline stage of the superscalar core, and several key problems that must be solved in designing and implementing the simulator. Preliminary application studies show that, compared with freely available research SMT simulators, this simulator better balances the three basic design goals of simulation performance, flexibility, and accuracy, and satisfies design requirements such as execution-driven simulation, an easily extensible instruction set architecture, a good user interface, a flexible software structure, and suitability for evaluating a broader SMT architecture design space.

3.
To address the simulation-experiment problem in transactional memory research, a simulation environment dedicated to hardware transactional memory systems was implemented. The environment uses execution-driven simulation and supports full-system simulation. It relies on the architectural simulator Simics and the multicore extension package GEMS for functional and performance simulation of the components of a multicore processor, and on this basis extends the modeling and simulation of the components of hardware transactional memory systems, supporting simulation experiments and performance evaluation of multiple transactional memory systems in a modular way. After analyzing transactional memory and architectural simulation techniques, the paper discusses the design ideas and approach of the simulation environment, presents its overall structure, and tests the environment experimentally with a target transactional memory system architecture and a set of benchmark programs.

4.
In a multicore processor, the cores can perform off-chip memory accesses concurrently, providing memory-level parallelism that a uniprocessor cannot. For loops in irregular applications, traditional parallelization methods have difficulty identifying parallelism and cannot fully exploit a multicore processor's memory-level parallelism and computing capability. This paper discusses software-based exploitation of memory-level parallelism on multicore processors and proposes a speculative parallel multithreading algorithm, LLSM (loop-level speculative multithreading), which parallelizes loops in irregular applications. Test data on a multicore processor show that the algorithm effectively exploits the memory-level parallelism and computing capability of the processor; the results also indicate that the memory-level parallelism formula in a multicore environment must take thread synchronization overhead into account.
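A simplified sketch of the memory-level parallelism idea: if an irregular (pointer-chasing) loop is split across threads, each core keeps its own cache misses in flight and the misses overlap. This is not the LLSM algorithm itself (which adds speculation and recovery); the `Node` type and chunking scheme below are hypothetical.

```cpp
// Illustrative only: splitting an irregular reduction loop across threads so
// that cache misses from different cores overlap (memory-level parallelism).
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

struct Node { int64_t value; Node* next; };

int64_t parallel_sum(const std::vector<Node*>& chunk_heads, unsigned num_threads) {
    std::atomic<int64_t> total{0};
    std::vector<std::thread> workers;
    // Each thread walks its own linked-list chunk; its misses proceed
    // independently of the other threads' misses.
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            int64_t local = 0;
            for (Node* n = chunk_heads[t]; n != nullptr; n = n->next)
                local += n->value;
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    return total.load();
}
```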

5.
A User-Level Simulator for a Tiled Multicore Processor (total citations: 1, self: 0, others: 1)
黄琨, 马可, 曾洪博, 张戈, 章隆兵. 《软件学报》, 2008, 19(4): 1069-1080
As on-chip transistor resources increase and interconnect delays grow, tiled multicore microprocessors have become a new direction in multicore processor design. To enable in-depth architectural research and design-space exploration of this new kind of processor, a user-level multicore performance simulator for tiled multicore processors was designed and implemented. Built on the Godson-2 (龙芯2号) single processor core, the simulator completely models a directory-based cache coherence protocol and a store-and-forward on-chip interconnection network, and captures in detail the timing behavior caused by out-of-order handling of requests and replies and by conflicts among requests. It can evaluate the important performance metrics of a multicore processor by running various serial or parallel workloads, providing a fast, flexible, and efficient research platform for multicore architecture design.

6.
Some digital signal processing programs exhibit strong data dependences. When partitioning such programs onto a multicore DSP, fine-grained parallelism must be exploited, and exploiting fine-grained parallelism requires fast inter-core communication support. This paper proposes a new fast inter-core communication mechanism for multicore DSPs: the tagged shared register file (TSRF). The TSRF is shared by all DSP cores, and each register in the file is associated with a valid tag bit that provides synchronization for inter-core communication. A cycle-accurate simulator of a multicore DSP prototype integrating the TSRF mechanism was built; the prototype contains four DSP cores. Through detailed simulation with digital signal processing algorithms that have strong data dependences, IIR filtering and ADPCM encoding/decoding, the performance of the TSRF mechanism was evaluated; compared with a single-core DSP, it achieves speedups of roughly 1.8, 1.2, and 1.9, respectively.
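A minimal functional sketch of the tagged-register idea described in this abstract: each shared register carries a valid tag; a producer core writes a value and sets the tag, a consumer core waits on the tag, reads the value, and clears it. This models only the synchronization semantics, not the hardware; all class and method names are hypothetical.

```cpp
// Functional model of a tagged shared register file (TSRF)-style mechanism.
#include <array>
#include <atomic>
#include <cstdint>

struct TaggedRegister {
    std::atomic<bool> valid{false};  // the tag bit used for synchronization
    uint32_t value{0};
};

class TaggedSharedRegisterFile {
public:
    // Producer side: wait until the register is empty, then publish a value.
    void put(unsigned idx, uint32_t v) {
        auto& r = regs_[idx];
        while (r.valid.load(std::memory_order_acquire)) { /* register busy */ }
        r.value = v;
        r.valid.store(true, std::memory_order_release);
    }
    // Consumer side: wait until the register is full, then consume the value.
    uint32_t get(unsigned idx) {
        auto& r = regs_[idx];
        while (!r.valid.load(std::memory_order_acquire)) { /* not ready yet */ }
        uint32_t v = r.value;
        r.valid.store(false, std::memory_order_release);
        return v;
    }
private:
    std::array<TaggedRegister, 64> regs_;  // register-file size is arbitrary here
};
```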

7.
杨华, 崔刚, 吴智博, 刘宏伟. 《计算机工程》, 2007, 33(19): 251-252
Chip multithreading (CMT) is the direction of future high-performance processors, and software simulation is an indispensable technique in processor architecture research and design. Based on the SimpleScalar toolset, this paper designs and implements OpenSimCMT, a cycle-level CMT simulator that supports the design and evaluation of CMT architectures. OpenSimCMT has the following features: (1) it supports simulation of both simultaneous multithreading (SMT) and chip multiprocessors (CMP); (2) it has an open architecture and flexible configuration, and can be extended at any time according to specific research goals by adding new simulated content and related statistics; (3) it is functionally comprehensive, simulating inter-thread resource contention and sharing, all functional units, pipeline stages, branch prediction, multi-level caches, and so on, with accurate results.

8.
Kernel Analysis and Applications of the M5 Simulator (total citations: 1, self: 0, others: 1)
The M5 simulator, released by the University of Michigan, is a modular simulation platform for research on computer system-level architecture. Besides supporting the simulation of uniprocessor architectures, it provides powerful capabilities for simulating multi-system architectures containing multiple processors. This paper analyzes M5's simulation kernel, simulation mechanisms, and basic models in detail, and uses a memory scheduling algorithm as an example to illustrate the simulator's complete support for processor modeling.
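For readers unfamiliar with this style of simulation kernel, the sketch below shows a toy discrete-event kernel: events sit in a time-ordered queue, and the kernel repeatedly pops the earliest event and executes it, advancing simulated time. This is a generic illustration of the event-driven approach, not M5's actual EventQueue interface.

```cpp
// Toy discrete-event simulation kernel (generic illustration).
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

struct Event {
    uint64_t tick;                    // simulated time at which the event fires
    std::function<void()> action;     // what the event does when it fires
    bool operator>(const Event& o) const { return tick > o.tick; }
};

class EventQueue {
public:
    void schedule(uint64_t tick, std::function<void()> action) {
        pq_.push({tick, std::move(action)});
    }
    // Run until no events remain (models stop scheduling new ones).
    void run() {
        while (!pq_.empty()) {
            Event e = pq_.top();
            pq_.pop();
            now_ = e.tick;            // advance simulated time to the event
            e.action();               // the action may schedule further events
        }
    }
    uint64_t now() const { return now_; }
private:
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> pq_;
    uint64_t now_ = 0;
};
```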

9.
A simulator is a software system that runs on a host machine and models the behavior of a target-architecture machine; it can interpret and execute programs built for the target machine while providing runtime records of instructions and events, as well as performance statistics for the target machine. A system-level architecture simulator is a software system that can run as a virtual target machine, providing functional simulation of subsystems such as single or multiple processors, the memory system, caches, and peripheral devices. Based on the structural characteristics of multicore processors, this paper studies design methods for architecture simulators and benchmark programs. Using an architecture simulator, it analyzes the off-chip memory access demands of multicore processors with different structures and discusses how off-chip memory access bandwidth affects computing performance. It summarizes the mechanisms and requirements of off-chip memory access in multicore systems and the relationship between off-chip accesses and program characteristics.

10.
A Survey of Real-Time Microprocessor Architectures (total citations: 1, self: 0, others: 1)
Real-time applications have become a rapidly rising class of typical embedded applications. As the core component of real-time systems, real-time microprocessor architecture is an important research direction in the microprocessor field. Unlike general-purpose processors, which pursue maximum throughput, real-time processors require a tight and computable worst-case execution time. Traditional real-time processors therefore often adopt relatively simple structures to avoid the execution-time uncertainty introduced by complex structures. As real-time applications demand ever higher performance, real-time processors are gradually moving toward multithreaded and multicore structures. In multithreaded and multicore processors, contention for shared resources degrades the determinism of the real-time system, posing greater challenges for real-time processor architecture. This paper surveys real-time microprocessor architecture: it first analyzes traditional real-time processors from the aspects of instruction set, microarchitecture, memory, I/O, and task scheduling; it then analyzes high-performance real-time processors that adopt multithreaded and multicore structures; finally, it compares several commercial real-time processor architectures and summarizes the current state and future trends of real-time processor development.

11.
As CMOS feature sizes continue to shrink and traditional microarchitectural methods for delivering high performance (e.g., deep pipelining) become too expensive and power-hungry, chip multiprocessors (CMPs) become an exciting new direction by which system designers can deliver increased performance. Exploiting parallelism in such designs is the key to high performance, and we find that parallelism must be exploited at multiple levels of the system: the thread-level parallelism that has become popular in many designs fails to exploit all the levels of available parallelism in many workloads for CMP systems. We describe the Cell Broadband Engine and the multiple levels at which its architecture exploits parallelism: data-level, instruction-level, thread-level, memory-level, and compute-transfer parallelism. By taking advantage of opportunities at all levels of the system, this CMP revolutionizes parallel architectures to deliver previously unattained levels of single-chip performance. We describe how the heterogeneous cores make this performance achievable by parallelizing and offloading computation-intensive application code onto the Synergistic Processor Element (SPE) cores using a heterogeneous thread model. We also give an example of scheduling code to be memory-latency tolerant on the SPE using software-pipelining techniques. This paper is based in part on "Chip multiprocessing and the Cell Broadband Engine", ACM Computing Frontiers 2006.
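The software-pipelining idea mentioned in the abstract is commonly realized as double buffering: while one buffer is being processed, the DMA transfer for the next buffer is already in flight. The sketch below illustrates that pattern; `dma_get_async`, `dma_wait`, and `process` are hypothetical stand-ins (stubbed here), not the actual Cell SDK MFC intrinsics.

```cpp
// Double-buffered streaming to hide memory latency (illustrative sketch).
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for DMA primitives and the compute kernel.
void dma_get_async(void* /*local*/, uint64_t /*remote*/, size_t /*bytes*/, int /*tag*/) {}
void dma_wait(int /*tag*/) {}
void process(const float* /*buf*/, size_t /*n*/) {}

void pipelined_stream(uint64_t src, size_t total_elems, size_t chunk /* <= 4096 */) {
    static float buf[2][4096];                 // two local buffers
    int cur = 0;
    dma_get_async(buf[cur], src, chunk * sizeof(float), cur);   // prime the pipeline
    for (size_t done = 0; done < total_elems; done += chunk) {
        int nxt = cur ^ 1;
        size_t next_off = done + chunk;
        if (next_off < total_elems)            // start fetching the next chunk early
            dma_get_async(buf[nxt], src + next_off * sizeof(float),
                          chunk * sizeof(float), nxt);
        dma_wait(cur);                         // wait only for the current chunk
        process(buf[cur], chunk);              // compute overlaps the next transfer
        cur = nxt;
    }
}
```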

12.
Future chip multiprocessors (CMPs) will integrate many cores interconnected with a high-bandwidth, low-latency, scalable network-on-chip (NoC). However, the potential that this approach offers at the transport level needs to be paired with an analogous paradigm shift at the higher levels. In particular, the standard shared-memory programming model fails to address the scalability requirements of the many-core era. Fast data exchange among the cores and low-latency synchronization are desirable but hard to achieve in practice due to the memory hierarchy. The message-passing paradigm instead permits direct data communication and synchronization between the cores, while shared memory could still be used for instruction fetch. Hence, we propose a hybrid approach that combines shared memory and message passing in a single general-purpose CMP architecture, allowing efficient execution of applications developed with either parallel programming approach. Cores fetch instructions from a hierarchical memory and exchange their data through the same memory, for compatibility with existing software, or directly through the fast NoC. We developed a fast SystemC-based cycle-accurate simulator for design-space exploration, which we used to evaluate performance with real benchmarks. The various components have been RTL-coded and mapped to a CMOS 45 nm technology to build a silicon area model that we used to select the best architectural configurations.

13.
Chip multiprocessors (CMPs) provide a scalable means of exploiting thread-level parallelism for multitasking or multithreaded applications. However, single-threaded applications have difficulty dynamically leveraging the statically partitioned resources in a CMP. Such sequential applications may be difficult to statically decompose into threads, or may simply be legacy code for which recompilation is not possible or cost-effective. We present a novel approach to dynamically accelerate the performance of sequential applications on multiple cores. Execution is allowed to spill from one core to another when resources on one core have been exhausted. We propose two techniques to enable low-overhead migration between cores: prespilling and locality-based filtering. We also develop and analyze an arbitration mechanism to intelligently allocate cores among a set of sequential applications on a CMP. On average, core spilling on an eight-core CMP can accelerate single-threaded performance by 35 percent. We further explore an eight-core CMP running a multiple-application workload composed of the entire SPEC 2000 benchmark suite in various combinations and arrival times. Using core spilling to accelerate the current set of running applications when there are idle cores, we achieve up to a 40 percent improvement in performance.

14.
On-chip distributed memory has emerged as a promising memory organization for future many-core systems, since it efficiently exploits memory-level parallelism and can lighten the load on each memory module by providing a number of memory interfaces comparable to the number of on-chip cores. The packet-based distributed memory access (PDMA) model provides a scalable and flexible solution for distributed memory management, but suffers from complicated and costly on-chip network protocol translation and massive interference among packets, which leads to unpredictable performance. In this paper we propose a direct distributed memory access (DDMA) model, in which remote memory can be accessed directly by local cores via remote-to-local virtualization, without network protocol translation. From the perspective of local cores, remote memory controllers (MCs) can be manipulated directly by accessing the local agent MC, which is responsible for accessing remote memory through high-performance inter-tile communication. We further discuss detailed architectural support for the DDMA model, including the memory interface design, the work flow, and the protocols involved. Simulation results from the PARSEC benchmarks show that our DDMA architecture outperforms PDMA in both average memory access latency and IPC, by 17.8% and 16.6% on average, respectively. In addition, DDMA handles congested memory traffic better: a reduction in bandwidth when running memory-intensive SPEC2006 workloads incurs only an 18.9% performance penalty, compared with 38.3% for PDMA.
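A rough functional model of the direct-access idea behind DDMA: a global physical address is decoded into a home tile and a local offset, and the local agent memory controller forwards the request to the home tile's controller, with no packet-protocol translation in software. This only illustrates the address-decode/forwarding concept from the abstract; the class and field names are hypothetical.

```cpp
// Functional model of agent-forwarded distributed memory access (sketch only).
#include <cstdint>
#include <vector>

struct MemoryController {
    std::vector<uint8_t> dram;                 // this tile's local memory
    uint8_t read(uint64_t offset) const { return dram[offset]; }
};

class AgentMC {
public:
    AgentMC(std::vector<MemoryController>* tiles, unsigned bits_per_tile)
        : tiles_(tiles), shift_(bits_per_tile) {}

    // Interleave the global address space across tiles: high bits pick the
    // home tile, low bits are the offset within that tile's memory.
    uint8_t load(uint64_t global_addr) const {
        uint64_t tile   = global_addr >> shift_;
        uint64_t offset = global_addr & ((1ULL << shift_) - 1);
        return (*tiles_)[tile].read(offset);   // "remote" access via the agent
    }
private:
    std::vector<MemoryController>* tiles_;
    unsigned shift_;
};
```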

15.
This study focuses on the importance of quantifying the effect of prefetching on the interconnection network of a multiprocessor chip. Microarchitectural effects of this kind are often quantified using simulators. However, if prefetching traffic in a CMP (chip multiprocessor) system is to be evaluated accurately, simulators must simulate not only the memory hierarchy module and the multicore system, but also the network-on-chip. Unfortunately, no open-source simulator is able to evaluate all these elements at the same time. This paper describes how to develop a prefetching module for the gem5 CMP simulator and how to integrate it into the Ruby memory system. Moreover, using the infrastructure developed in this study, the paper shows the importance of taking the network effect into account in prefetching-related studies in order to obtain accurate results: not doing so may lead to mistaken conclusions. For this purpose, we have carried out a detailed analysis of the behavior of three different prefetching engines, providing not only the typical statistics for instructions per cycle and miss rate, but also specific network and prefetching statistics.
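To make the prefetching-engine discussion concrete, here is a generic stride-prefetcher model of the kind that might be plugged into a cache simulator: it tracks the last miss address per load PC, detects a repeating stride, and issues prefetches a few strides ahead. This is an illustrative sketch, not gem5's actual prefetcher classes or interfaces.

```cpp
// Generic stride prefetcher model (illustrative, not gem5 API).
#include <cstdint>
#include <functional>
#include <unordered_map>

class StridePrefetcher {
public:
    explicit StridePrefetcher(std::function<void(uint64_t)> issue_prefetch,
                              int degree = 2)
        : issue_(std::move(issue_prefetch)), degree_(degree) {}

    // Called by the cache model on every demand miss.
    void on_miss(uint64_t pc, uint64_t addr) {
        Entry& e = table_[pc];
        int64_t stride = static_cast<int64_t>(addr) - static_cast<int64_t>(e.last_addr);
        if (e.valid && stride != 0 && stride == e.last_stride) {
            // Stride seen twice in a row: prefetch `degree_` blocks ahead.
            for (int i = 1; i <= degree_; ++i)
                issue_(addr + static_cast<uint64_t>(i * stride));
        }
        e.last_stride = stride;
        e.last_addr = addr;
        e.valid = true;
    }

private:
    struct Entry { uint64_t last_addr = 0; int64_t last_stride = 0; bool valid = false; };
    std::unordered_map<uint64_t, Entry> table_;   // indexed by load PC
    std::function<void(uint64_t)> issue_;         // hands prefetch addresses to the cache
    int degree_;
};
```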

16.
《Parallel Computing》, 1999, 25(13-14): 1741-1783
Over the past two decades, tremendous progress has been made in both the design of parallel architectures and the compilers needed for exploiting parallelism on such architectures. In this paper we summarize the advances in compilation techniques for uncovering and effectively exploiting parallelism at various levels of granularity. We begin by describing the program analysis techniques through which parallelism is detected and expressed in the form of a program representation. Next, compilation techniques for scheduling instruction-level parallelism (ILP) are discussed, along with the relationship between the nature of compiler support and the type of processor architecture. Compilation techniques for exploiting loop- and task-level parallelism on shared-memory multiprocessors (SMPs) are summarized. Locality optimizations that must be used in conjunction with parallelization techniques to achieve high performance on machines with complex memory hierarchies are also discussed. Finally, we provide an overview of compilation techniques for distributed-memory machines, which must partition both code and data for parallel execution. Communication optimization and code generation issues that are unique to such compilers are also briefly discussed.
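As a concrete example of the locality optimizations this survey refers to, the sketch below tiles a matrix multiplication so that the working set of the inner loops fits in cache. This is a standard textbook transformation offered as illustration, not code from the paper.

```cpp
// Loop tiling (blocking) for cache locality in matrix multiplication.
#include <algorithm>
#include <cstddef>

void matmul_tiled(const double* A, const double* B, double* C, size_t n) {
    const size_t T = 64;  // tile size; in practice tuned to the cache size
    for (size_t ii = 0; ii < n; ii += T)
        for (size_t kk = 0; kk < n; kk += T)
            for (size_t jj = 0; jj < n; jj += T)
                // The inner loops now touch only T x T tiles of A, B, and C,
                // so data is reused while it is still resident in cache.
                for (size_t i = ii; i < std::min(ii + T, n); ++i)
                    for (size_t k = kk; k < std::min(kk + T, n); ++k) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < std::min(jj + T, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```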

17.
Traditional shared-bus multicore chips run into a bottleneck as the number of cores increases. In the structural design of the newer tiled CMP (chip multiprocessor), the on-chip core interconnection network plays an important role in improving scalability and execution efficiency. To achieve low-latency, high-bandwidth inter-core communication, simulation of on-chip multicore interconnect structures based on high-speed point-to-point networks has become a research hotspot. Abstracting a tiled 16-core on-chip functional-unit structure, we designed and implemented the SimTile simulator, which provides a flexibly configurable on-chip multicore processor design with a complete set of functional units, and supports an efficient globally shared cache and a high-speed on-chip routing structure. The simulator adopts a modular component configuration: the number of on-chip cores, the interconnection network structure, the data coherence protocol, and the global register communication and cache sharing modes can all be adjusted through a small set of parameters. Experiments show that the simulator runs efficiently, providing a flexible, efficient, and extensible new platform for on-chip multicore research.


