1.

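林新华, 王一超, 秦强, 李硕, 文敏华, 松岡聡 《计算机工程与科学》 2016, 38(9): 1741-1747

The IMCI instruction set of the Intel Xeon Phi coprocessor introduces a hardware-implemented vgather instruction, intended to help 512-bit SIMD registers access data at non-contiguous memory addresses. Experimental results show, however, that vgather is likely to become one of the key performance bottlenecks for applications on the Xeon Phi coprocessor. Performance modeling of vgather can therefore help users gain a deeper understanding of the performance characteristics of the Xeon Phi coprocessor. Unlike existing approaches that collect statistics by embedding assembly code in program segments, this work uses performance analysis tools such as PAPI to collect hardware counter statistics directly as the experimental data for the model. The performance model is built from the number of AGI events and the average latency caused by vgather, which is estimated from the number of VPU_DATA_READ events. The model can predict the total latency caused by vgather in Xeon Phi application code. Finally, to verify prediction accuracy, the model is applied to a 3D 7-point stencil code; the prediction shows that vgather accounts for about 40% of total computation time. This result is then compared against the computation time measured after vgather is removed using intrinsics, and the comparison shows that the model's prediction is accurate. Based on these findings, a performance model for vgather on the Xeon Phi coprocessor is built from hardware counter statistics. In addition, by comparison with vgather on other platforms, the authors consider the model also applicable to Intel CPU platforms that provide vgather.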
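The abstract above describes a model that combines an AGI event count with an estimated average vgather latency to predict total vgather latency. The sketch below shows only that arithmetic, under the assumption that the total is a simple product of the two; the abstract does not give the exact formula, and the function name, parameter names, and all numeric inputs are hypothetical placeholders rather than values from the paper.

```python
def vgather_latency_share(agi_events, avg_latency_cycles, total_cycles):
    """Estimate cycles lost to vgather and their share of the run.

    Assumption (not stated in the abstract): total vgather latency is
    the AGI event count multiplied by the average latency per event.
    """
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    vgather_cycles = agi_events * avg_latency_cycles
    return vgather_cycles, vgather_cycles / total_cycles


# Hypothetical counter readings, chosen only to exercise the function.
cycles, share = vgather_latency_share(
    agi_events=1_000_000,
    avg_latency_cycles=30.0,
    total_cycles=120_000_000,
)
print(f"vgather cycles: {cycles:.0f}, share of run time: {share:.0%}")
```

In practice the counter values would come from a tool such as PAPI rather than being typed in by hand.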

2.

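Performance Study of a Class of Stencil Applications on Many-Core NUMA Architectures

高凌云, 勾文进, 刘夏真, 袁武, 张鉴, 陆忠华 《数据与计算发展前沿》 2023, (6): 58-66

[Background] Stencil computation is a typical algorithm in scientific computing fields such as CFD (Computational Fluid Dynamics), and its memory access performance attracts attention. Owing to its good scalability, the NUMA architecture is widely used on ARM platforms, of which the Kunpeng 920 processor is representative. [Methods] Performance analysis tools and benchmark programs are used to test the memory access and communication subsystems of the Kunpeng platform. Hotspot analysis and performance testing are carried out on CCFD V3.0, a typical stencil application, and a Roofline model is built. [Results] Relying on its many-core NUMA architecture, the Kunpeng 920 processor outperforms the Intel Xeon E5-2680v2 and a Chinese domestic processor in single-node floating-point performance, peak memory bandwidth, and communication latency. On a single node, CCFD V3.0 runs about 2 to 3 times as fast on the Kunpeng platform as on the Intel platform, and 1.5 to 2 times as fast as on the domestic processor. [Conclusions] Porting applications to the ARM-based Kunpeng platform is straightforward, and its NUMA architecture is advantageous for memory-intensive applications such as stencil computation.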
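The abstract above mentions building a Roofline model. As a reminder of what that model computes, here is a minimal sketch of the standard Roofline bound: attainable performance is the smaller of peak compute and peak bandwidth times arithmetic intensity. The machine numbers in the usage lines are hypothetical and are not measurements of the Kunpeng 920 or any other processor named in the abstract.

```python
def attainable_gflops(peak_gflops, peak_bandwidth_gbs, flops_per_byte):
    """Standard Roofline bound: min(compute roof, memory roof)."""
    if min(peak_gflops, peak_bandwidth_gbs, flops_per_byte) <= 0:
        raise ValueError("all inputs must be positive")
    return min(peak_gflops, peak_bandwidth_gbs * flops_per_byte)


def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Arithmetic intensity (flop/byte) at which the two roofs meet."""
    return peak_gflops / peak_bandwidth_gbs


# Hypothetical machine: 1000 GFLOP/s peak, 200 GB/s memory bandwidth.
PEAK, BW = 1000.0, 200.0
for intensity in (0.5, 5.0, 10.0):
    bound = attainable_gflops(PEAK, BW, intensity)
    kind = "memory-bound" if intensity < ridge_point(PEAK, BW) else "compute-bound"
    print(f"{intensity:>4} flop/byte -> {bound:.0f} GFLOP/s ({kind})")
```

A kernel whose intensity falls left of the ridge point is limited by memory bandwidth, which is the usual situation for stencil codes.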

3.

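An Adaptive Tiling-Parameter Algorithm for 3D Stencils on the SW26010 Processor

朱雨, 庞建民, 徐金龙, 陶小涵, 王军 《计算机科学》 2021, 48(6): 10-18

Stencil computation is an important class of computation in scientific applications, and tiling is a key technique for improving its data locality. Existing 3D stencil optimizations on the SW26010 processor lack temporal tiling and require tiling parameters to be tuned by hand. To address these problems, this work introduces temporal tiling and proposes an adaptive tiling-parameter algorithm for 3D stencils on the SW26010 processor. By building a performance analysis model and taking into account the hardware's compute capability and ...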
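To make the idea of tiling a 3D stencil concrete, the sketch below applies one sweep of a 7-point stencil to a cubic grid twice: once with plain nested loops and once with the loops split into tiles. It shows spatial tiling only. It is not the temporal tiling or the adaptive parameter selection proposed in the paper, and the averaging coefficients, function names, and grid contents are assumptions made for illustration.

```python
def stencil7(src, i, j, k):
    """Average of a grid point and its six face neighbours."""
    return (src[i][j][k]
            + src[i - 1][j][k] + src[i + 1][j][k]
            + src[i][j - 1][k] + src[i][j + 1][k]
            + src[i][j][k - 1] + src[i][j][k + 1]) / 7.0


def copy_grid(src):
    return [[row[:] for row in plane] for plane in src]


def sweep_untiled(src):
    """One sweep over the interior of an n x n x n grid."""
    n = len(src)
    dst = copy_grid(src)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                dst[i][j][k] = stencil7(src, i, j, k)
    return dst


def sweep_tiled(src, tile):
    """Same sweep, visiting the interior in tile x tile x tile blocks."""
    n = len(src)
    dst = copy_grid(src)
    for ii in range(1, n - 1, tile):
        for jj in range(1, n - 1, tile):
            for kk in range(1, n - 1, tile):
                for i in range(ii, min(ii + tile, n - 1)):
                    for j in range(jj, min(jj + tile, n - 1)):
                        for k in range(kk, min(kk + tile, n - 1)):
                            dst[i][j][k] = stencil7(src, i, j, k)
    return dst


def make_grid(n):
    """Hypothetical test data: each point holds its linear index."""
    return [[[float(i * n * n + j * n + k) for k in range(n)]
             for j in range(n)] for i in range(n)]


grid = make_grid(6)
print(sweep_tiled(grid, 2) == sweep_untiled(grid))
```

Tiling changes only the order in which points are visited, so both sweeps produce the same grid. In interpreted Python the reordering brings no real cache benefit; the sketch is meant to show the loop structure, which a compiled implementation would tune through the tile size.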

4.

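Performance Evaluation of Intel Persistent Memory for CFD Applications

文敏华, 陈江, 胡广超, 韦建文, 王一超, 林新华 《计算机工程与科学》 2022, 44(9): 1550-1556

In scientific computing, data sizes grow rapidly as the accuracy requirements of numerical simulation rise. Traditional designs that use DRAM as main memory are hard to scale in capacity because of cost, and persistent memory, which has drawn increasing attention in recent years, is expected to address this problem. Persistent memory is a complementary tier between DRAM and SSD; compared with DRAM it offers larger capacity and better cost-effectiveness, but lower performance. To test the application performance of persistent memory, Intel persistent memory is evaluated for computational fluid dynamics (CFD), an important area of scientific computing. In the experiments, persistent memory is used in Memory Mode, the easiest mode to adopt, which requires no source code changes. The test programs cover memory benchmarks and three common CFD algorithms. The results show that, in Memory Mode and across the different CFD algorithms, introducing persistent memory incurs some performance loss relative to a DRAM-only configuration, and that this loss grows with data size. On the other hand, deploying persistent memory lets a single server support numerical simulations with very large data sizes.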

5.

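A Software Performance Analysis Method Based on Dynamic Instruction Compilation

程克非, 张聪, 张勤, 汪林林 《计算机科学》 2006, 33(4): 292-294

Accurate software performance analysis requires inserting measurement and control code into the program and dynamically checking several different parameters according to the actual run-time state. However, for code written in a statically typed language such as C, the processing logic cannot be changed once the code has been compiled and linked. Dynamic performance analysis is therefore very difficult for applications whose source code is unavailable, or for which recompiling and restarting is costly. This paper introduces a dynamic instruction compilation technique that inserts monitoring points while the software is running, making it possible to monitor software in those situations. The method is based on the Dyninst API and PAPI. Experiments show that, although it removes the dependence on source code, the method achieves the same collection efficiency as inserting monitoring points at the source level, greatly widens the applicability of software monitoring based on hardware performance counters, and yields good performance analysis results.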

6.

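Design and Implementation of a Dedicated-Instruction Coprocessor for SM2

王腾飞, 张海峰, 许森 《计算机工程与应用》 2022, 58(2): 102-109

SM2, China's national commercial cryptographic algorithm, is a public-key cryptographic protocol based on elliptic curve cryptography (ECC) and has been adopted as an international standard by the International Organization for Standardization (ISO). In practice, the complexity of SM2's computation leads to low implementation efficiency, and implementations can also leak key-dependent side-channel information. To address these problems, a dedicated-instruction hardware coprocessor for SM2 is designed. The coprocessor contains interface logic...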

7.

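Performance Evaluation of P2P Streaming Media in Wireless Mesh Networks

刘婷婷, 杨维, 王玉柱 《计算机工程与应用》 2013, 49(16): 71-76

A P2P video-on-demand streaming testbed based on a wireless Mesh network was built to test four factors that affect P2P streaming performance in wireless Mesh networks: the codec, the encoding rate, the choice of data forwarding path, and the hop count. The experimental results show the following. The H.264 codec standard is better suited to streaming over wireless Mesh networks. The encoding rate must not exceed the network connection rate if high video quality is to be obtained. P2P can withstand the effect of 10% packet loss on video quality, giving on average 3 dB higher video quality than non-P2P delivery over the first 1,000 frames. Inter-flow interference introduced by P2P lowers video quality by 6 dB after the first 1,000 frames, seriously degrading streaming performance. The transmission capacity of the wireless Mesh network weakens as the hop count increases, yet streaming quality does not fall as the hop count increases.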

8.

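Research on Cache Reliability Evaluation Based on Instruction Behavior

周学海, 余洁, 李曦, 王志刚 《计算机研究与发展》 2007, 44(4): 553-559

Soft errors are caused by strikes from high-energy particles and do great harm to processor reliability. As processor design goals shift toward low power, high performance, and low supply voltage, soft errors occur more and more frequently, and processor reliability research receives growing attention. To address the low efficiency of traditional reliability evaluation based on fault-injection simulation, a systematic cache reliability evaluation method is proposed, taking one reliability metric, the architectural vulnerability factor (AVF), as the object of study. On the one hand, instruction behavior is used to identify the instructions that do not affect the final result during program execution, which determines the instructions that contribute to the cache's AVF. On the other hand, based on the cache's storage type and write policy, together with the characteristics of the data/instruction array and the address tag array, the effect on AVF of the various combinations of adjacent operations on the cache is studied, completing the information analysis needed for AVF evaluation. In the experiments, AVF is evaluated for the instruction array of the instruction cache under the PISA architecture, demonstrating the effectiveness of the method.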

9.

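Research on Single-Core Instruction-Level Optimization of Scientific Computing Applications

罗红兵, 张晓霞, 王伟, 武林平 《计算机研究与发展》 2014, 51(6): 1263-1269

Although the performance of high-performance computers is rising ever faster, it is difficult for scientific computing applications to achieve matching gains. Improving their execution performance requires optimization targeted at the characteristics of the high-performance computer architecture, and single-core instruction-level optimization is one important aspect of that work. Taking an Euler program implemented on the JASMIN (J adaptive structured meshes applications infrastructure) framework as an example, the paper discusses the specific performance problems of scientific computing applications on the Intel Xeon microprocessor platform and methods for optimizing instruction-level parallel performance, and substantially improves the single-core performance of the Euler program. After optimization, the total run time of the 2D and 3D physical models is 21% to 34% lower than before, and the execution time of the core module Gas1dapproxy is reduced by more than 50%. The optimization experiments show that pipeline efficiency has become an important factor in the computational efficiency of real scientific computing applications, and that it should be improved by measures such as reducing dependences between computation statements and reducing the number of long-latency computations.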

10.

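An Energy-Efficient Structurally Asymmetric Instruction Cache

刘骁, 高红光, 陈芳园, 丁亚军 《计算机工程与科学》 2017, 39(3): 443-450

In modern microprocessors, reading and comparing tags in the instruction cache consumes a large share of the instruction cache's energy. This paper proposes a low-energy instruction cache based on inference: the asymmetric instruction cache. Exploiting the low proportion of jump instructions, the structure treats jump instructions and sequential instructions differently and uses simplified tag management bits that do not correspond one-to-one with the data. The structure adopts two novel techniques, hit inference and variable-length instruction fetch: hit inference addresses the excessive tag comparisons on instruction cache hits, and variable-length instruction fetch raises the hit rate of sequential instruction blocks. Experimental results show that, for the selected SPEC2006 benchmarks, the asymmetric instruction cache reduces fetch energy by 40% to 60% relative to a conventional L1 instruction cache and by 9% relative to the tagless instruction cache structure TH IC. In fetch ED2P, it improves on the conventional L1 instruction cache by about 50% and on TH IC by about 17%.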

11.

Engineering order‐preserving pattern matching with SIMD parallelism

Tamanna Chhabra Simone Faro M. Oğuzhan Külekci Jorma Tarhio 《Software》2017,47(5):731-739

The order‐preserving pattern matching problem has gained attention in recent years. It consists in finding all substrings in the text, which have the same length and relative order as the input pattern. Typically, the text and the pattern consist of numbers. Since recent times, there has been a tendency to utilize the ability of the word RAM model to increase the efficiency of string matching algorithms. This model works on computer words, reading and processing blocks of characters at once, so that usual arithmetic and logic operations on words can be performed in one unit of time. In this paper, we present a fast order‐preserving pattern matching algorithm, which uses specialized word‐size packed string matching instructions, grounded on the single instruction multiple data instruction set architecture. We show with experimental results that the new proposed algorithm is more efficient than the previous solutions. ©2016 The Authors. Software: Practice and Experience Published by John Wiley & Sons Ltd.

12.

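A Visualization Tool for Performance Evaluation of MPI-Based Parallel Programs  Cited by: 1 (self-citations: 0, citations by others: 1)

刘华, 徐炜民, 孙强 《计算机工程》 2004, 30(10): 82-84

This paper introduces a performance monitoring and analysis tool for the MPI programming environment. The tool collects data on the relevant hardware system resources while a program runs and provides both real-time and post-run visualization views, so that programmers can monitor programs in real time and analyze performance afterwards. This helps them find and remove performance bottlenecks and improve the performance of parallel programs.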

13.

Parallel Distributive Join Algorithm on the Intel Paragon

Chung Soon M. Chatterjee Arindam 《The Journal of supercomputing》1999,13(2):151-169

In this paper, we analyze the performance of the parallel Distributive Join algorithm that we proposed in Chung and Yang 1995. We implemented the algorithm on an Intel Paragon machine and analyzed the effect of the number of processors and the join selectivity on the performance of the algorithm. We also compared the performance of the Distributive Join (DJ) algorithm with that of the Hybrid-Hash(HH) join algorithm. Our results show that the DJ performs comparably with the HH over the entire range of number of processors used and different join selectivities. A big advantage of the parallel DJ algorithm over the HH join algorithm is that it can easily support non-equijoin operations. The results can also be used to estimate the performance of file I/O intensive applications to be implemented on the Intel Paragon machine.

14.

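Performance Portability Analysis of OpenACC on the Intel Knights Corner and NVIDIA Kepler Architectures

王一超, 秦强, 施忠伟, 林新华 《计算机科学》 2015, 42(1): 75-78

OpenACC is a directive-based parallel programming standard. By adding directives that conform to the standard to their code and compiling with an OpenACC compiler, programmers can port serial code to accelerators or coprocessors in parallel form and obtain the speedup that heterogeneous accelerators provide. OpenACC differs from heterogeneous parallel programming technologies such as CUDA and OpenCL in that it aims to spare programmers from considering the underlying hardware architecture of the accelerator or coprocessor during porting, which lowers programming difficulty. It also offers good cross-platform portability, since a single code base can run on different hardware platforms. OpenACC is therefore a parallel programming standard worth studying. Today's heterogeneous accelerator hardware is increasingly diverse. Tianhe-2, ranked first on the November 2013 Top500 list, uses 48,000 coprocessors built on the Intel Knights Corner architecture. Meanwhile, NVIDIA's recently released Kepler-architecture GPUs have quickly gained a sizable user base thanks to years of accumulation in the GPU market. For those porting applications without pursuing peak performance, balancing application performance against ease of porting is an important question. OpenACC, which needs only one code base to run on both hardware platforms, meets users' need for ease of porting. Once ease of porting is settled, how the same application performs on different hardware platforms becomes what users most want to know. Through experiments and a performance model, this paper shows the performance portability of applications ported with OpenACC on Intel Knights Corner and NVIDIA Kepler hardware.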

15.

Modeling message-passing programs with a Performance Evaluating Virtual Parallel Machine

D.A. P.D. 《Performance Evaluation》2005,60(1-4):165-187

We present a new performance modeling system for message-passing parallel programs that is based around a Performance Evaluating Virtual Parallel Machine (PEVPM). We explain how to develop PEVPM models for message-passing programs using a performance directive language that describes a program’s serial segments of computation and message-passing events. This is a novel bottom-up approach to performance modeling, which aims to accurately model when processing and message-passing occur during program execution. The times at which these events occur are dynamic, because they are affected by network contention and data dependencies, so we use a virtual machine to simulate program execution. This simulation is done by executing models of the PEVPM performance directives rather than executing the code itself, so it is very fast. The simulation is still very accurate because enough information is stored by the PEVPM to dynamically create detailed models of processing and communication events. Another novel feature of our approach is that the communication times are sampled from probability distributions that describe the performance variability exhibited by communication subject to contention. These performance distributions can be empirically measured using a highly accurate message-passing benchmark that we have developed. This approach provides a Monte Carlo analysis that can give very accurate results for the average and the variance (or even the probability distribution) of program execution time. In this paper, we introduce the ideas underpinning the PEVPM technique, describe the syntax of the performance modeling language and the virtual machine that supports it, and present some results for example parallel programs to show the power and accuracy of the methodology.
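The sampling idea in the abstract above, drawing communication times from an empirical distribution and repeating the simulated run to estimate the mean and variance of execution time, can be sketched in a few lines. This is a deliberately simplified single-process illustration, not the PEVPM system or its directive language: it assumes each serial segment is followed by exactly one message, and the function names and all timing values are hypothetical.

```python
import random
import statistics


def simulate_run(compute_times, message_samples, rng):
    """One simulated execution: each serial segment, then one message.

    The message time is drawn at random from previously measured
    samples, standing in for an empirical distribution.
    """
    total = 0.0
    for segment in compute_times:
        total += segment
        total += rng.choice(message_samples)
    return total


def monte_carlo(compute_times, message_samples, trials=1000, seed=0):
    """Return (mean, variance) of simulated run time over many trials."""
    rng = random.Random(seed)
    runs = [simulate_run(compute_times, message_samples, rng)
            for _ in range(trials)]
    return statistics.mean(runs), statistics.variance(runs)


# Hypothetical inputs: two serial segments (seconds) and a handful of
# message timings that a benchmark might have recorded.
mean, var = monte_carlo([1.0, 2.0], [0.10, 0.12, 0.15, 0.40], trials=2000)
print(f"mean = {mean:.3f} s, variance = {var:.5f}")
```

Seeding the generator keeps a given experiment repeatable; a fuller model would also track dependencies between processes, which is where contention effects arise.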

16.

Evaluating the impact of locality on the performance of large-scale SCI multiprocessors

M. J. K. L. 《Performance Evaluation》2001,46(4)

Hierarchical ring-based multiprocessor systems are attractive and enjoy several advantages over other types of systems. They ensure unique paths between nodes, simple node interfaces and simple cross-ring connections. Furthermore, employing point-to-point links allows the system to run at a high clock rate, which increases bandwidth and decreases latency. This paper investigates the performance of hierarchical ring-based shared-memory multiprocessors. Rings in the hierarchy are composed of point-to-point, unidirectional links and apply the Scalable Coherent Interface (SCI) protocol. We place special emphasis on the impact of locality on processor and interconnection design issues such as the number of outstanding requests and ring topology. We find that in order to exploit the power of hierarchical multiprocessors an accurate and appropriate model of locality must be used. Hierarchical multiprocessors that are well balanced (uniform) tend to provide lower latency and higher system throughput. For non-uniform systems, a high degree of locality is required for the hierarchies to perform well. However, restricting the number of outstanding transactions per processor is important in decreasing packet latency and avoiding network contention.

17.

Evaluating expressions with a queue

Jan L.A. Van de Snepscheut 《Information Processing Letters》1985,20(2):65-66

18.

Evaluating performance in the development of software-intensive products

《Information and Software Technology》2014,56(5):516-526

Context: Organizational performance measurements in software product development have received a lot of attention in the literature. Still, there is a general discontent regarding the way performance is evaluated in practice, with few studies really focusing on why this is the case. This paper reports on research focusing on the context of developing software-intensive products in large established multi-national organizations. Objective: The purpose of this research is to investigate performance measurement practices related to software product development activities. More specifically, the focus is on exploring how managers engaged in software product development activities perceive and evaluate performance in large organizations from a managerial perspective. Method: The research approach pursued in this research consists of exploratory multiple case studies. Data is collected mainly through 54 interviews in five case studies in large international organizations developing software-intensive products in Sweden. Focused group interviews with senior managers from eight companies have also been used in the data collection. Results: The results of this research indicate that managers within software product development in general are dissatisfied with their current way of evaluating performance. Performance measurements and the perception of performance are today focused on cost, time, and quality, i.e. what is easily measurable and not necessarily what is important. The dimensions of value creation and learning are missing. Moreover, measurements tend to be result oriented, rather than process oriented, making it difficult to integrate these measurements in the management practices. Conclusion: Managers that are dissatisfied with their performance measurement system and want to improve the current situation should not start by focusing on the current measurements directly; instead they should focus on how the organization perceives performance and how important performance criteria are being developed. By developing relevant performance criteria the first step in developing an effective performance measurement system is made. Moreover, it is concluded that managers' perception of performance is affected by the currently used measurements, hence limiting the scope of the performance criteria. Thus, a change in the way managers perceive performance is necessary before there can be any changes in the way performance is evaluated.

19.

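Building High-Performance, Highly Scalable J2EE Applications  Cited by: 2 (self-citations: 0, citations by others: 2)

宋善德, 王鹏飞 《计算机应用研究》 2002, 19(11): 41-43

For J2EE applications, performance and scalability are important issues that must be considered. Starting from the J2EE application architecture, this paper explores the root causes of these performance problems and proposes some principles for improving the performance and scalability of J2EE applications.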