
1.

Weng Yufen, Xu Chuanfu, Che Yonggang, Fang Jianbin, Wang Zhenghua 《计算机工程与科学》 (Computer Engineering & Science) 2009,31(Z1)

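Based on a new hierarchical performance model for multicore SMP clusters, this paper implements Sim-MSC, a trace-driven parallel performance simulator for multicore SMP clusters, on top of the BigSim parallel performance simulator. Tests with the jacobi3D program on a host platform consisting of an InfiniBand multicore SMP cluster show that Sim-MSC can simulate the execution characteristics of MPI message-passing parallel applications on multicore SMP clusters and accurately predict system and application performance.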

2.

A Trace-Based Task Mapping Method for Parallel Performance Simulation

Fang Jianbin, Xu Chuanfu, Che Yonggang, Weng Yufen, Wang Zhenghua 《计算机工程》 (Computer Engineering) 2010,36(12):269-271

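To address trace-driven parallel performance simulation, this paper proposes CO-LP3M, a mapping method guided by trace information. CO-LP3M uses the communication characteristics of the target application, extracted from the trace, to generate a mapping of parallel simulation tasks onto the host. Its goal is to minimize the number of communications between physical processes on the host while also balancing the computational load. Experiments with the HPL program show that CO-LP3M effectively improves parallel simulation performance, by up to 14.7% over common mapping schemes. Building on this, the paper presents SCO-LP3M, an extension of CO-LP3M.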

3.

An Efficient LP-to-PP Mapping Method for Parallel Performance Simulation

Fang Jianbin, Xu Chuanfu, Che Yonggang, Weng Yufen, Wang Zhenghua 《计算机工程与科学》 (Computer Engineering & Science) 2009,31(Z1)

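Efficient mapping of LPs to PPs is one of the key techniques for accelerating parallel performance simulation. For parallel applications with regular interactions, the mapping generation method A2-LP3M extracts the interaction pattern among LPs from the trace and selects a suitable mapping from among block-cyclic mappings, with the goal of minimizing communication between physical processes on the host while also balancing the computational load. Experiments show that A2-LP3M reduces parallel simulation time by up to 16.2% compared with conventional mapping methods.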
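
The selection step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' A2-LP3M implementation: the function names, the per-pair message-count dictionary standing in for the trace, and the use of LP count per PP as the load measure are all assumptions made for illustration.

```python
def block_cyclic_map(num_lps, num_pps, block):
    """Block-cyclic mapping: LP i goes to PP (i // block) % num_pps."""
    return [(i // block) % num_pps for i in range(num_lps)]


def inter_pp_messages(comm, mapping):
    """Total messages whose sender and receiver LPs sit on different PPs."""
    return sum(n for (src, dst), n in comm.items() if mapping[src] != mapping[dst])


def load_imbalance(mapping, num_pps):
    """Difference between the most and least loaded PP (load = LP count, an assumption)."""
    loads = [0] * num_pps
    for pp in mapping:
        loads[pp] += 1
    return max(loads) - min(loads)


def choose_mapping(comm, num_lps, num_pps, max_imbalance=0):
    """Try every block size and keep the balanced mapping with the fewest inter-PP messages."""
    best = None
    for block in range(1, num_lps // num_pps + 1):
        mapping = block_cyclic_map(num_lps, num_pps, block)
        if load_imbalance(mapping, num_pps) > max_imbalance:
            continue  # reject mappings that violate the load-balance constraint
        cost = inter_pp_messages(comm, mapping)
        if best is None or cost < best[0]:
            best = (cost, block, mapping)
    return best


def ring_trace_counts(num_lps, msgs):
    """Hypothetical trace summary: each LP exchanges `msgs` messages with both ring neighbours."""
    comm = {}
    for i in range(num_lps):
        j = (i + 1) % num_lps
        comm[(i, j)] = msgs
        comm[(j, i)] = msgs
    return comm


if __name__ == "__main__":
    comm = ring_trace_counts(8, 10)
    cost, block, mapping = choose_mapping(comm, num_lps=8, num_pps=2)
    print(cost, block, mapping)  # 40 4 [0, 0, 0, 0, 1, 1, 1, 1]
```

For this ring pattern, block size 1 puts every neighbour pair on different PPs (160 inter-PP messages), while block size 4 leaves only two cut edges (40 messages), which is why the search prefers it.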

4.

Research on a Co-Design-Based Parallel I/O Simulator for Cluster Computer Systems

Li Li 《网友世界》 2014,(21):10-10

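This paper aims to implement a co-design-based parallel I/O simulator for cluster computer systems. The main idea is to optimize the selection of the simulator's parameters on top of the original parallel I/O simulator design method, which improves the original simulator's performance, and then, guided by co-design theory, to make designers' knowledge uniform across the team, which enhances the functionality of the original system.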

5.

SimHPC: An Execution-Driven Simulator for Large-Scale Parallel Systems (cited by: 1; self-citations: 0; citations by others: 1)

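Liu Yi, Zhi Yuzhe, Zhang Xin, Li He, Jiao Lin, Zhang Peng, Su Yangming, Ni Zehui, Qian Depei 《计算机学报》 (Chinese Journal of Computers) 2013,36(4)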

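Simulation experiments are important for the performance evaluation and design optimization of high-performance computer systems. Because target systems are very large, however, traditional architecture simulators struggle to deliver adequate simulation performance. This paper proposes SimHPC, a simulator dedicated to high-performance computing systems. It uses execution-driven full-system simulation and supports simulated execution of the operating system and applications. By using nodes isomorphic to the target system as host nodes and by simulating in parallel, it achieves much higher simulation performance than traditional architecture simulators. Compared with several existing large-scale parallel system simulators, SimHPC also has certain advantages in generality and simulation performance.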

6.

Parallel Simulation of Many-Core Processors and Many-Core Clusters

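Lü Huiwei, Cheng Yuan, Bai Lu, Chen Mingyu, Fan Dongrui, Sun Ninghui 《计算机研究与发展》 (Journal of Computer Research and Development) 2013,50(5)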

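Simulators are important tools in computer architecture research. In recent years, the development of parallel computer architectures has posed great challenges for computer simulation. On the one hand, as architectures move toward multicore and many-core processors, the scale of the target systems to be simulated keeps growing as the number of simulated cores increases at the pace of Moore's law. On the other hand, serial simulation speed has stagnated because clock-frequency gains on the host machines that run the simulators have slowed. Together, these two factors mean that traditional serial simulation cannot meet the scale and speed requirements of simulating emerging architectures. Taking many-core processors and many-core clusters as examples, the paper shows that parallel simulation is both necessary and feasible for simulating parallel computer architectures. For many-core processors, parallel discrete event simulation is used for acceleration, raising simulation speed by a factor of 10.9 with no loss of simulation accuracy. For many-core clusters, the simulated target system reaches 1024 cores in total and supports a hybrid MPI/Pthreads programming runtime environment.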

7.

Design and Implementation of a Shared-Memory Board Between a Host and a Parallel Processor

Fu Yong 《小型微型计算机系统》 (Journal of Chinese Computer Systems) 2001,22(2):254-256

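This paper describes data exchange between a host and a parallel processor (MPP) through a shared-memory board, and designs and implements a shared-memory board with a degree of generality. The board uses a master-slave bus-switching control method. Its memory organization is configurable, so it can provide different access bandwidths and support data sharing for different host and coprocessor systems.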

8.

An Optimization Method for Memory System Simulators Based on Improved Trace Accuracy

Lu Tianyue, Chen Licheng, Chen Mingyu 《计算机研究与发展》 (Journal of Computer Research and Development) 2014,(Z1)

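As computer systems keep growing in scale, architecture research on how to use the performance of each component more effectively becomes increasingly important. Because the systems under study are so large, testing with a simulator is a common approach. With a full-system simulator, however, simulating the entire system lowers experimental efficiency and makes the simulator hard to maintain. Trace-driven simulators have therefore become a common way to improve simulator efficiency, but because a trace cannot faithfully represent the runtime behavior of some parts of the system, some simulation error is unavoidable. To address this, the paper proposes an optimization method for memory system simulators based on improved trace accuracy. By adding content to the trace, raising its accuracy, and implementing the corresponding support mechanisms in the memory system simulator, the method improves the accuracy of the memory system simulator without affecting its running efficiency.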

9.

An Improved Basic-Block-Based Trace Cache

Li Haiquan, Guan Haibing 《小型微型计算机系统》 (Journal of Chinese Computer Systems) 2007,28(4):765-767

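The trace cache is a promising technique for addressing instruction fetch bandwidth. The SimpleScalar simulator is an important software tool for simulating and studying CPU architectures. After introducing CPU simulators and trace cache technology, this paper proposes an improved trace cache constructed from basic blocks, implements it in the SimpleScalar simulator, and reports experimental results on that platform.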

10.

Availability Evaluation of a Parallel I/O System Based on the SANs Model

Zheng Xiao, Li Hongliang, Zheng Fang, Zheng Xiang, Chen Zuoning 《计算机工程与应用》 (Computer Engineering and Applications) 2008,44(19):67-71

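The parallel I/O system is an important part of a high-performance computer system, and its availability strongly affects the performance the whole machine can deliver. Using the SANs (Stochastic Activity Networks) model and its supporting tool Mobius, this paper builds an availability evaluation model for a large-scale parallel I/O system and solves it by simulation. The simulation results show how changes in parameters such as the number of global file systems, the minimum number of available OSTs (Object Storage Targets) within a single file system, and the system repair time affect the availability of the whole system. They offer a useful reference for the design and maintenance of large-scale parallel I/O systems.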

11.

Two‐phase trace‐driven simulation (TPTS): a fast multicore processor architecture simulation approach

Hyunjin Lee Lei Jin Kiyeon Lee Socrates Demetriades Michael Moeng Sangyeun Cho 《Software》2010,40(3):239-258

Simulation is indispensable in computer architecture research. Researchers increasingly resort to detailed architecture simulators to identify performance bottlenecks, analyze interactions among different hardware and software components, and measure the impact of new design ideas on the system performance. However, the slow speed of conventional execution‐driven architecture simulators is a serious impediment to obtaining desirable research productivity. This paper describes a novel fast multicore processor architecture simulation framework called Two‐Phase Trace‐driven Simulation (TPTS), which splits detailed timing simulation into a trace generation phase and a trace simulation phase. Much of the simulation overhead caused by uninteresting architectural events is only incurred once during the cycle‐accurate simulation‐based trace generation phase and can be omitted in the repeated trace‐driven simulations. We report our experiences with tsim, an event‐driven multicore processor architecture simulator that models detailed memory hierarchy, interconnect, and coherence protocol based on the TPTS framework. By applying aggressive event filtering, tsim achieves an impressive simulation speed of 146 million simulated instructions per second when running 16‐thread parallel applications. Copyright © 2010 John Wiley & Sons, Ltd.

12.

Design and Implementation of a Distributed I/O Log Collection System

Zhan Ling, Ma Jun, Chen Bojiang, Chen Weiliang, Lü Rui 《计算机工程与应用》 (Computer Engineering and Applications) 2010,46(36):88-90

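With the rapid development of storage systems and the increasingly stringent demands that real applications place on them, a distributed I/O log collection system was designed and implemented to study the runtime behavior of the I/O subsystem of storage systems. From a master console, the system controls multiple nodes of a distributed system at once and collects the distributed system's I/O logs in parallel, providing useful data for analyzing and replaying those logs. The paper describes the design and implementation of the system in detail.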

13.

A superscalar simulation employing poisson distributed stalls

Christopher B. Smith, Eugene John 《Computers & Electrical Engineering》2008,34(3):192-201

This paper presents a statistical approach to estimating the performance of a superscalar processor. Traditional trace-driven simulators can take a large amount of time to conduct a performance evaluation of a machine, especially as the number of instructions increases. The result of this type of simulation is typically tied to the particular trace that was run. Elements such as dependencies, delays, and stalls are all a direct result of the particular trace being run, and can differ from trace to trace. This paper describes a model designed to separate simulation results from a specific trace. Rather than running a trace-driven simulation, a statistical model is employed, more specifically a Poisson distribution, to predict how these types of delay affect performance. Through the use of this statistical model, a performance evaluation can be conducted using a general code model, with specific stall rates, rather than a particular code trace. This model allows simulations to quickly run tens of millions of instructions and evaluate the performance of a particular micro-architecture while at the same time allowing the flexibility to change the structure of the architecture.
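
The idea of replacing a concrete trace with Poisson-distributed stalls can be sketched in a few lines of Python. This is only an illustration under assumptions of my own, not the model from the paper: instructions are issued in full groups of `issue_width` per cycle, and each stall category (the names below are hypothetical) adds a Poisson-distributed number of stall cycles per issue group.

```python
import math
import random


def poisson_sample(rng, mean):
    """Draw one Poisson variate with Knuth's multiplication method (suitable for small means)."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def estimate_cpi(num_instructions, issue_width, stall_means, seed=0):
    """Estimate cycles per instruction from stall rates instead of a code trace.

    stall_means maps a stall category to its mean stall cycles per issue group
    (an assumed parameterisation, chosen for illustration).
    """
    rng = random.Random(seed)
    groups = math.ceil(num_instructions / issue_width)
    cycles = groups  # one cycle per issue group when nothing stalls
    for _ in range(groups):
        for mean in stall_means.values():
            cycles += poisson_sample(rng, mean)
    return cycles / num_instructions


if __name__ == "__main__":
    ideal = estimate_cpi(100_000, 4, {})
    stalled = estimate_cpi(100_000, 4, {"dcache_miss": 0.3, "branch_mispredict": 0.2}, seed=1)
    print(f"ideal CPI {ideal:.3f}, CPI with stalls {stalled:.3f}")
```

Changing the issue width or the stall means re-runs the estimate without generating a new trace, which is the flexibility the abstract emphasizes.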

14.

A New Efficiently Shared Parallel I/O System

Guo Yufeng, Li Qiong, Liu Guangming, Xiao Liquan 《计算机工程》 (Computer Engineering) 2006,32(12):244-246

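How to resolve the I/O bottleneck effectively has long been a key open problem for high-performance parallel computers. This paper proposes HPPIO, an efficiently shared parallel I/O system based on a CC-NUMA parallel system architecture that adopts a series of efficient-sharing and parallel I/O techniques. The paper describes, among other aspects, its efficiently shared parallel I/O architecture, which combines distributed and centralized organization, and the design of its PCI Express-based high-performance I/O controller.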

15.

Scalable mpNoC for massively parallel systems – Design and implementation on FPGA

M. Baklouti Y. Aydi Ph. Marquet J.L. Dekeyser M. Abid 《Journal of Systems Architecture》2010,56(7):278-292

The high chip-level integration enables the implementation of large-scale parallel processing architectures with 64 and more processing nodes on a single chip or on an FPGA device. These parallel systems require a cost-effective yet high-performance interconnection scheme to provide the needed communications between processors. The massively parallel Network on Chip (mpNoC) was proposed to address the demand for parallel irregular communications for massively parallel processing System on Chip (mppSoC). Targeting FPGA-based design, an efficient mpNoC low level RTL implementation is proposed taking into account design constraints. The proposed network is designed as an FPGA based Intellectual Property (IP) able to be configured in different communication modes. It can communicate between processors and also perform parallel I/O data transfer which is clearly a key issue in an SIMD system. The mpNoC RTL implementation presents good performances in terms of area, throughput and power consumption which are important metrics targeting an on chip implementation. mpNoC is a flexible architecture that is suitable for use in FPGA-based parallel systems. This paper introduces the basic mppSoC architecture. It mainly focuses on the mpNoC flexible IP based design and its implementation on FPGA. The integration of mpNoC in mppSoC is also described. Implementation results on a Stratix II FPGA device are given for three data-parallel applications run on mppSoC. The obtained good performances justify the effectiveness of the proposed parallel network. It is shown that the mpNoC is a lightweight parallel network making it suitable for both small and large FPGA-based parallel systems.

16.

Reducing and manipulating complex trace data

Hervé Touati Alan Jay Smith 《Software》1991,21(6):639-655

In performance analysis of computer systems, trace-driven simulation techniques have the important advantage of credibility and accuracy. Unfortunately, traces are usually difficult to obtain, and little work has been done to provide efficient tools to help in the process of reducing and manipulating them. This paper presents TRAMP, a tool for the data reduction and data analysis phases of trace-driven simulation studies. TRAMP has three main advantages: it accepts a variety of common trace formats; it provides a programmable user interface in which many actions can be directly specified; and it is easy to extend. TRAMP is particularly helpful for reducing and analysing complex trace data, such as traces of file system or database activity. This paper presents the design principles and implementation techniques of TRAMP and provides a few concrete examples of the use of this tool.

17.

Design and Implementation of a Distributed I/O Log Replay System

Zhan Ling, Ma Jun, Chen Bojiang, Chen Weiliang, Lü Rui 《计算机工程与应用》 (Computer Engineering and Applications) 2010,46(36):91-94

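With the rapid development of storage systems, the performance of different storage systems needs to be evaluated. Testing storage performance in a real application environment on the basis of I/O logs is more objective and accurate. This paper proposes a log replay system for distributed environments. Through a central controller, it can conveniently control multiple nodes and test the performance of a distributed storage system on all of them simultaneously. The design and implementation of the system are described in detail.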

18.

Massively parallel algorithms for trace-driven cache simulations

Nicol D.M. Greenberg A.G. Lubachevsky B.D. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(8):849-859

Considers the use of massively parallel architectures to execute a trace-driven simulation of a single cache set. A method is presented for the least-recently-used (LRU) policy, which, regardless of the set size C, runs in time O(log N) using N processors on the EREW (exclusive read, exclusive write) parallel model. A simpler LRU simulation algorithm is given that runs in O(C log N) time using N/log N processors. We present timings of this algorithm's implementation on the MasPar MP-1, a machine with 16384 processors. A broad class of reference-based line replacement policies are considered, which includes LRU as well as the least-frequently-used (LFU) and random replacement policies. A simulation method is presented for any such policy that, on any trace of length N directed to a C line set, runs in O(C log N) time with high probability using N processors on the EREW model. The algorithms are simple, have very little space overhead, and are well suited for SIMD implementation.
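
For orientation, the problem being parallelized here, a trace-driven simulation of one LRU cache set, can be written sequentially in a few lines of Python. This sketch is only the serial baseline, not the parallel EREW algorithms of the paper; the function name and the use of short strings as line tags are illustrative assumptions.

```python
from collections import OrderedDict


def simulate_lru_set(trace, num_lines):
    """Replay a reference trace against one LRU cache set; return a hit/miss flag per reference."""
    lines = OrderedDict()  # keys are resident tags, ordered from least to most recently used
    hits = []
    for tag in trace:
        if tag in lines:
            lines.move_to_end(tag)  # refresh recency on a hit
            hits.append(True)
        else:
            if len(lines) >= num_lines:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = None
            hits.append(False)
    return hits


if __name__ == "__main__":
    trace = ["A", "B", "A", "C", "B", "D", "A"]
    print(sum(simulate_lru_set(trace, 2)))  # 1 hit with a 2-line set
    print(sum(simulate_lru_set(trace, 3)))  # 2 hits with a 3-line set
```

This serial loop takes time proportional to the trace length N, which is the cost the O(log N) and O(C log N) parallel methods in the abstract are designed to reduce.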

19.

HARTS: a distributed real-time architecture

Shin K.G. 《Computer》1991,24(5):25-35

The design, implementation, and evaluation of a distributed real-time architecture called HARTS (hexagonal architecture for real-time systems) are discussed, emphasizing its support of time-constrained, fault-tolerant communications and I/O (input/output) requirements. HARTS consists of shared-memory multiprocessor nodes, interconnected by a wrapped hexagonal mesh. This architecture is intended to meet three main requirements of real-time computing: high performance, high reliability, and extensive I/O. The high-level and low-level architecture is described. The evaluation of HARTS, using modeling and simulation with actual parameters derived from its implementation, is reported. Fault-tolerant routing, clock synchronization and the I/O architecture are examined.

20.

An I/O-Limited Parallel Speedup Model and a Scalable I/O Architecture

Li Qiong, Du Yunfei, Yang Xuejun 《计算机工程与科学》 (Computer Engineering & Science) 2011,33(3):28

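The I/O bottleneck can be attacked from six directions: applications, scalable algorithms, compilers and languages, runtime libraries, operating systems, and architecture. Among these, the I/O architecture is the key foundation for all the other approaches. Current parallel I/O performance analysis lacks a sound theoretical model that could give I/O architecture design a theoretical basis. Addressing the scalability of parallel computer systems, this paper studies how I/O load affects their scalability, establishes an I/O-limited parallel speedup performance model, and analyzes the scalability of three I/O architectures commonly used in today's large-scale parallel computer systems. On this theoretical basis, it proposes a scalable parallel I/O architecture for high-performance computing. It also proposes several strategies that effectively reduce the service time of I/O operations and thereby enhance system scalability, laying a foundation for follow-up research.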