1.

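姚远 赵荣彩 《计算机工程》 2012, 38(12): 272-275

Because their program analysis is limited, compilers may fail to vectorize loops automatically, or may vectorize them blindly. To address this, a vectorization method based on compiler directives is proposed. Directives inserted into the source code guide the auto-vectorizing compiler so that it generates efficient vector code. Test results show that the method effectively improves the run-time performance of the generated code.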
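As a rough illustration of directive-guided vectorization, the sketch below marks a loop with the standard OpenMP `simd` directive. The abstract does not say which directives the paper defines, so the choice of OpenMP, the function name `saxpy_simd`, and the kernel itself are assumptions made only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: y[i] = a * x[i] + y[i].
 * `restrict` tells the compiler that x and y do not overlap, and the
 * OpenMP `simd` directive asks it to vectorize the loop. The directive is
 * guarded so that a build without OpenMP (no -fopenmp) compiles the same
 * loop as ordinary scalar code with the same result. */
void saxpy_simd(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
#ifdef _OPENMP
    #pragma omp simd
#endif
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The directive carries information the compiler could not prove on its own (here, that the iterations are independent), which is the general idea of steering the vectorizer from the source code.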

2.

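Progressive Intelligent Backtracking Method for Vectorized Code Tuning

赵博 赵荣彩 徐金龙 高伟 《计算机科学》 2015, 42(1): 50-53, 58

To exploit the computing power of high-performance computers, ease the burden on programmers of designing and writing parallel programs, and enlarge the pool of usable software, a method was designed and implemented that uses an interactive interface to uncover vectorizable statements in a program, optimize the vectorized statements in the generated code, and improve the efficiency of that code. The method matters for exploiting the computing power of high-performance computers, improving system usability, and broadening the range of applications, and it also provides effective auxiliary means and tool support. The progressive intelligent backtracking framework for vectorized code tuning analyzes and transforms the serial program submitted by the user: it applies serial program analysis, data dependence analysis, and vectorization analysis, transforms and optimizes the program according to the results, and automatically generates the final vectorized code. By analyzing the latent parallelism in a serial program and automatically transforming it into an equivalent vectorized form, the method greatly simplifies the programmer's work.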

3.

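Array Regrouping and Alignment Optimization for SIMD

魏帅 赵荣彩 姚远 侯永生 《计算机科学》 2012, 39(2): 305-310

With the spread of multimedia applications, more and more processors integrate SIMD extensions, but non-contiguous or unaligned memory accesses can prevent vectorization or cause performance loss. For the non-contiguous array references that occur in real applications, a mathematical model is proposed that describes the memory access pattern of an array and its data regrouping scheme, in order to decide whether array transposition can make the references contiguous. Alignment is then optimized using interprocedural array padding, loop peeling, and SLP-based vector code generation. Tests on the SPEC2000 benchmark suite show that the method effectively improves the execution efficiency of vectorized programs.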
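Loop peeling, one of the alignment techniques named above, can be sketched as follows. This is a hand-written illustration rather than the paper's algorithm; the 16-byte vector width (`VEC_BYTES`) and the function name `add_one_peeled` are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16u  /* assumed SIMD register width in bytes */

/* Adds 1.0f to every element of a[0..n-1] and returns how many
 * iterations were peeled off before the aligned main loop. */
size_t add_one_peeled(float *a, size_t n)
{
    assert(a != NULL || n == 0);
    size_t i = 0;

    /* Prologue: scalar iterations until &a[i] is VEC_BYTES-aligned. */
    while (i < n && ((uintptr_t)(a + i) % VEC_BYTES) != 0) {
        a[i] += 1.0f;
        ++i;
    }
    size_t peeled = i;

    /* Main loop: starts at an aligned address, so a vectorizer can use
     * aligned vector loads and stores for the remaining elements. */
    for (; i < n; ++i) {
        a[i] += 1.0f;
    }
    return peeled;
}
```

With 4-byte floats, a pointer that sits 4 bytes past a 16-byte boundary needs three peeled iterations before the main loop starts on an aligned address.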

4.

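GPU Performance Optimization Method for Sparse Convolutional Neural Networks

董晓 刘雷 李晶 冯晓兵 《软件学报》 2020, 31(9): 2944-2964

In recent years, deep convolutional neural networks have shown remarkable capability on many tasks and are used in applications such as object detection, autonomous driving, and machine translation. These models, however, tend to have very large numbers of parameters and impose a heavy computational burden. Model pruning identifies and removes parameters that have little effect on accuracy, reducing the parameter count and the theoretical amount of computation and creating an opportunity for efficient execution. Yet pruned sparse models are hard to run efficiently on GPUs; they can even perform worse than the dense models before pruning, so pruning rarely yields a real gain in execution performance. This paper proposes a sparsity-aware code generation method that produces efficient sparse convolution programs for GPUs. First, operator templates are designed for the convolution operator, and the template code is optimized in several ways according to GPU characteristics. The source code in a template is compiled and analyzed into an operator intermediate-representation template, and a sparse code generation method then combines the pruned sparse parameters with that template to generate the corresponding sparse convolution code. In addition, the data access characteristics of neural network execution are used to optimize data access and placement, which effectively raises memory throughput. Finally, the positions of the sparse parameters are encoded implicitly in the generated code, so no extra index structure is needed and memory access demand drops. Experiments show that, compared with existing methods for executing sparse neural networks on GPUs, the proposed method effectively improves the performance of sparse convolutional neural networks.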

5.

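Structure Optimization for Automatic Vectorization

于海宁 韩林 李鹏远 《计算机科学》 2016, 43(2): 210-215

Structures are widely used in scientific computing and other applications, and the non-contiguous and unaligned memory accesses that arise when vectorizing arrays of structures seriously degrade the vectorization result. To reduce such accesses during SIMD vectorization of structure arrays, a structure splitting model is proposed that combines field access affinity with field data type, eliminating the memory "gaps" between stored fields. In addition, an address mapping from a structure array to a two-dimensional array is used to satisfy the contiguity and alignment requirements of vectorization and to lower the cache miss rate, improving application performance. On the SW-VEC automatic vectorization system, relevant test cases were selected from the gcc-vec, spec2000, and spec2006 benchmark suites; the results show that, relative to the corresponding serial programs, the method raises the speedup of the test programs by more than 8%.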
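The effect of splitting a structure can be seen in a small hand-written example. It shows only a plain array-of-structures to structure-of-arrays split; the affinity-based model and the two-dimensional address mapping from the paper are not reproduced, and the `particle` type and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original layout: fields of different types are interleaved, so reading
 * p[i].x for consecutive i is a strided access, and the int field leaves
 * a padding gap before the next double on typical ABIs. */
struct particle {
    double x;
    int    id;
    double v;
};

/* Split layout: one contiguous array per field. */
struct particles_split {
    double *x;
    double *v;
    int    *id;
};

/* Copies n records from the interleaved layout into the split layout. */
void split_particles(const struct particle *in, size_t n,
                     struct particles_split *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out->x[i]  = in[i].x;
        out->v[i]  = in[i].v;
        out->id[i] = in[i].id;
    }
}

/* x[i] += dt * v[i]: both operands are now unit-stride arrays, which is
 * the access pattern SIMD loads and stores handle best. */
void advance(struct particles_split *p, size_t n, double dt)
{
    assert(p != NULL);
    for (size_t i = 0; i < n; ++i) {
        p->x[i] += dt * p->v[i];
    }
}
```

After the split, the loop in `advance` touches only the two `double` arrays it needs, so no cache space or vector lanes are spent on the unused `id` field.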

6.

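Research on and Improvement of SIMD Vectorization Algorithms for Non-Multimedia Programs (total citations: 3; self-citations: 0; citations by others: 3)

李玉祥 施慧 陈莉 《小型微型计算机系统》 2009, 30(10)

Vectorizing non-multimedia programs with a microprocessor's multimedia extensions has become an important way to improve program performance, yet the results that almost all current commercial compilers obtain on such programs fail to demonstrate effective vectorization capability. By analyzing typical non-multimedia programs, namely the SPEC CPU2000 floating-point programs, this paper summarizes the SIMD vectorization characteristics of non-multimedia programs and, on that basis, proposes several new vectorization methods (vectorization with local data regrouping, outer-loop vectorization, and partial-statement SLP vectorization) together with related optimization techniques. A comparison with the Intel compiler's vectorization performance on SPEC CPU2000 shows that the proposed improvements effectively increase the degree of vectorization.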

7.

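Vectorization Method for Indirect Array Indexing

姚金阳 赵荣彩 王琦 李颖颖 《计算机科学》 2018, 45(9): 220-223, 236

Existing compilers cannot vectorize indirect array indexing efficiently, so programs containing this form of memory access cannot make use of SIMD extension units; the problem is an active topic in vectorization research. To use SIMD extensions efficiently and fully exploit the vectorization potential of programs, a new method for vectorizing indirect array indexing is proposed, together with a performance-benefit analysis method that is applied to each kind of indirect array index. Experimental results show that the vectorization method significantly improves program execution efficiency.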
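For readers unfamiliar with the access pattern, the sketch below shows a loop with an indirect index, `a[idx[i]]`, restructured into a gather step followed by a unit-stride step. It is a generic illustration, not the method of the paper; the chunk size `CHUNK` and the function name `add_indirect` are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK 4  /* assumed vector length in elements */

/* Computes y[i] += a[idx[i]] for i in [0, n). */
void add_indirect(size_t n, const double *a, const size_t *idx, double *y)
{
    assert(a != NULL && idx != NULL && y != NULL);
    size_t i = 0;

    for (; i + CHUNK <= n; i += CHUNK) {
        double tmp[CHUNK];

        /* Gather: collect the indirectly indexed operands into a
         * contiguous temporary (a hardware gather, where available,
         * could do this in one instruction). */
        for (size_t k = 0; k < CHUNK; ++k) {
            tmp[k] = a[idx[i + k]];
        }

        /* Compute: every operand is now unit-stride, so this inner loop
         * maps directly onto one vector add. */
        for (size_t k = 0; k < CHUNK; ++k) {
            y[i + k] += tmp[k];
        }
    }

    /* Scalar epilogue for the remaining n % CHUNK elements. */
    for (; i < n; ++i) {
        y[i] += a[idx[i]];
    }
}
```

Whether the split pays off depends on the cost of the gather relative to the arithmetic it enables, which is the kind of question a performance-benefit analysis has to answer.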

8.

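Template-Based Automatic Code Generation Based on Verb Attributes

汪畅 王铮 张胜歧 《计算机技术与发展》 2010, 20(5): 104-107

This paper introduces a method for automatic code generation. It proposes a verb-centered, attribute-based theory of semantic processing. Guided by this theory, a knowledge base and a library of semantic processing rules are built, and the semantic processing of words in restricted natural-language sentences is studied in detail. Finally, restricted natural-language understanding is applied to automatic code generation: the verbs in normalized restricted Chinese sentences are classified and assigned attribute concepts; the sentences are analyzed semantically using the knowledge base and rule library and converted into an intermediate language; and, combined with customizable templates, code is generated automatically in the program generation engine.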

9.

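ICD-Based Automatic Code Generation for Interface Communication Programs

胡希秀 顾逸东 吕从民 《微计算机信息》 2008, 24(30)

To speed up the development of interface communication programs in embedded applications and improve their efficiency and reliability, this paper proposes an ICD-based automatic code generation technique. The form of the target code is determined first; a code generator is then implemented by designing the system structure and establishing the driver model and the structure of the code generation units. The generator automatically produces ANSI C compliant interface communication programs from an ICD database, and it has been applied and validated in the integrated simulation test system of a space engineering project.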

10.

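A Compiler Optimization for Accelerating Memory Address Computation

高秀武 姜军 白书敬 黄亮明 《计算机工程》 2023, 49(1): 173-180

In China's domestically developed Sunway high-performance multicore server systems, the base compiler does not take the processor's instruction characteristics into account when generating code for memory accesses, so the address computation code it generates is inefficient and limits processor performance. To exploit the processor's high-performance computing capability, a compiler optimization that accelerates memory address computation is proposed. The optimization relies on the processor's support for arithmetic instructions with a scaling factor. In the compiler back end's legality check for memory address expressions, a check for multiply-add address expressions is added: multiply-add operations in an address expression are recognized automatically and checked for legality, and qualifying expressions are matched at code generation time to scaled arithmetic instructions that compute the address quickly. This speeds up the issue and execution of memory access instructions and address generation in applications, improving memory access efficiency. Evaluation with the industry-standard SPEC CPU2006 benchmark suite shows that, relative to the unoptimized compiler, average performance on the SPECspeed Integer and SPECspeed Float Point subsets improves by 2.53% and 1.50%, respectively.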

11.

A Compile/Run-time Environment for the Automatic Transformation of Linked List Data Structures

H. L. A. van der Spek S. Groot E. M. Bakker H. A. G. Wijshoff 《International journal of parallel programming》2008,36(6):592-623

Irregular access patterns are a major problem for today’s optimizing compilers. In this paper, a novel approach will be presented that enables transformations that were designed for regular loop structures to be applied to linked list data structures. This is achieved by linearizing access to a linked list, after which further data restructuring can be performed. Two subsequent optimization paths will be considered: annihilation and sublimation, which are driven by the occurring regular and irregular access patterns in the applications. These intermediate codes are amenable to traditional compiler optimizations targeting regular loops. In the case of sublimation, a run-time step is involved which takes the access pattern into account and thus generates a data instance specific optimized code. Both approaches are applied to a sparse matrix multiplication algorithm and an iterative solver: preconditioned conjugate gradient. The resulting transformed code is evaluated using the major compilers for the x86 platform, GCC and the Intel C compiler.
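The basic idea of linearizing a linked list can be illustrated by hand: copy the payload into an array once, then run a regular counted loop over the array. The sketch below does only that; it does not implement the annihilation or sublimation paths, and the `node` type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    double       value;
    struct node *next;
};

/* Walks the list once and copies up to `cap` payload values into `out`.
 * Returns the number of values copied. */
size_t linearize(const struct node *head, double *out, size_t cap)
{
    assert(out != NULL || cap == 0);
    size_t n = 0;
    for (const struct node *p = head; p != NULL && n < cap; p = p->next) {
        out[n++] = p->value;
    }
    return n;
}

/* A regular loop over a dense array: the trip count is known up front and
 * the accesses are unit-stride, so standard loop optimizations apply. */
double sum_array(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += a[i];
    }
    return s;
}
```

The copy costs one pointer-chasing traversal, so it only pays off when the linearized data is reused across several regular loops.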

12.

Improving Memory Traffic by Assembly-Level Exploitation of Reuses for Vector Registers (total citations: 1; self-citations: 0; citations by others: 1)

Chang Chih-Yung Chen Tzung-Shi Sheu Jang-Ping 《The Journal of supercomputing》2000,17(2):187-204

In this paper, we propose a compilation scheme to analyze and exploit the implicit reuses of vector register data. According to the reuse analysis, we present a translation strategy that translates the vectorized loops into assembly vector codes with exploitation of vector reuses. Experimental results show that our compilation technique can improve the execution time and traffic between shared memory and vector registers. Techniques discussed here are simple, systematic, and easy to implement in conventional vector compilers or translators to enhance the data locality of vector registers.

13.

Runtime and language support for compiling adaptive irregular programs on distributed-memory machines

Yuan-Shin Hwang Bongki Moon Shamik D. Sharma Ravi Ponnusamy Raja Das Joel H. Saltz 《Software》1995,25(6):597-621

In many scientific applications, arrays containing data are indirectly indexed through indirection arrays. Such scientific applications are called irregular programs and are a distinct class of applications that require special techniques for parallelization. This paper presents a library called CHAOS, which helps users implement irregular programs on distributed-memory message-passing machines, such as the Paragon, Delta, CM-5 and SP-1. The CHAOS library provides efficient runtime primitives for distributing data and computation over processors; it supports efficient index translation mechanisms and provides users high-level mechanisms for optimizing communication. CHAOS subsumes the previous PARTI library and supports a larger class of applications. In particular, it provides efficient support for parallelization of adaptive irregular programs where indirection arrays are modified during the course of computation. To demonstrate the efficacy of CHAOS, two challenging real-life adaptive applications were parallelized using CHAOS primitives: a molecular dynamics code, CHARMM, and a particle-in-cell code, DSMC. Besides providing runtime support to users, CHAOS can also be used by compilers to automatically parallelize irregular applications. This paper demonstrates how CHAOS can be effectively used in such a framework. By embedding CHAOS primitives in the Syracuse Fortran 90D/HPF compiler, kernels taken from the CHARMM and DSMC codes have been automatically parallelized.

14.

An approach for analyzing auto-vectorization potential of emerging workloads

《Microprocessors and Microsystems》2017

This paper presents an analytical study of the PARSEC benchmark suite that examines the auto-vectorization potential of emerging workloads under the ICC and GCC compilers. To investigate this potential, we analyzed the numbers of vectorized and non-vectorized loops and the number of vector instructions in each application. We found that most of the time-consuming loops in the applications were not vectorized. We then modified the applications and profiled them again. The modifications considerably increase the number of vectorized loops, but the instruction count does not fall as far as expected because of the limited SIMD width of current processors. Consequently, in addition to source-level techniques that help compilers auto-vectorize (such as loop unrolling, splitting large loops, redefining data structures, replacing function calls in loops with the function bodies, and removing control flow from loops where possible), increasing the SIMD width of CPU vector extensions is important for improving speed and performance.

15.

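Research and Implementation of Hybrid Raster-Vector Editing Technology (total citations: 1; self-citations: 0; citations by others: 1)

王凤 张立东 《小型微型计算机系统》 1996, 17(12): 53-56

This paper presents a practical engineering drawing processing system, EDD/AutoCAD. The system loads scanned raster drawings into the AutoCAD editing environment, where the raster image can be edited directly or converted to vector graphics with vectorization tools. New solutions are proposed and discussed for the key techniques involved, including hybrid raster-vector editing and vectorization.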

16.

An integrated runtime and compile-time approach for parallelizing structured and block structured applications

Agrawal G. Sussman A. Saltz J. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(7):747-754

In compiling applications for distributed memory machines, runtime analysis is required when data to be communicated cannot be determined at compile-time. One such class of applications requiring runtime analysis is block structured codes. These codes employ multiple structured meshes, which may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, we present runtime and compile-time analysis for compiling such applications on distributed memory parallel machines in an efficient and machine-independent fashion. We have designed and implemented a runtime library which supports the runtime analysis required. The library is currently implemented on several different systems. We have also developed compiler analysis for determining data access patterns at compile time and inserting calls to the appropriate runtime routines. Our methods can be used by compilers for HPF-like parallel programming languages in compiling codes in which data distribution, loop bounds and/or strides are unknown at compile-time. To demonstrate the efficacy of our approach, we have implemented our compiler analysis in the Fortran 90D/HPF compiler developed at Syracuse University. We have experimented with a multiblock Navier-Stokes solver template and a multigrid code. Our experimental results show that our primitives have low runtime communication overheads and the compiler-parallelized codes perform within 20% of the codes parallelized by manually inserting calls to the runtime library.

17.

Towards a more efficient implementation of OpenMP for clusters via translation to global arrays

Lei Huang Barbara Chapman Zhenying Liu 《Parallel Computing》2005,31(10-12):1114

This paper discusses a novel approach to implementing OpenMP on clusters. Traditional approaches to do so rely on Software Distributed Shared Memory systems to handle shared data. We discuss these and then introduce an alternative approach that translates OpenMP to Global Arrays (GA), explaining the basic strategy. GA requires a data distribution. We do not expect the user to supply this; rather, we show how we perform data distribution and work distribution according to the user-supplied OpenMP static loop schedules. An inspector–executor strategy is employed for irregular applications in order to gather information on accesses to potentially non-local data, group non-local data transfers and overlap communications with local computations. Furthermore, a new directive INVARIANT is proposed to provide information about the dynamic scope of data access patterns. This directive can help us generate efficient codes for irregular applications using the inspector–executor approach. We also illustrate how to deal with some hard cases containing reshaping and strided accesses during the translation. Our experiments show promising results for the corresponding regular and irregular GA codes.

18.

Vc: A C++ library for explicit vectorization

Matthias Kretz Volker Lindenstruth 《Software》2012,42(11):1409-1430

It is an established trend that CPU development takes advantage of Moore's Law to improve in parallelism much more than in scalar execution speed. This results in higher hardware thread counts (MIMD) and improved vector units (SIMD), of which the MIMD developments have received the focus of library research and development in recent years. To make use of the latest hardware improvements, SIMD must receive a stronger focus of API research and development because the computational power can no longer be neglected and often auto-vectorizing compilers cannot generate the necessary SIMD code, as will be shown in this paper. Nowadays, the SIMD capabilities are sufficiently significant to warrant vectorization of algorithms requiring more conditional execution than was originally expected for Streaming SIMD Extension to handle. The Vc library ( http://compeng.uni‐frankfurt.de/?vc ) was designed to support developers in the creation of portable vectorized code. Its capabilities and performance have been thoroughly tested. Vc provides portability of the source code, allowing full utilization of the hardware's SIMD capabilities, without introducing any overhead. Copyright © 2011 John Wiley & Sons, Ltd.

19.

Evaluation of Fortran vector compilers and preprocessors

Glenn Luecke Waqar Haque James Hoekstra Howard Jespersen James Coyle 《Software》1991,21(9):891-905

Many scientific codes can achieve significant performance improvement when executed on a computer equipped with a vector processor. Vector constructs in source code should be recognized by a vectorizing compiler or preprocessor. This paper discusses, from a general point of view, how a vectorizing compiler/preprocessor can be evaluated. The areas discussed include data dependence analysis, IF loop analysis, nested loops, loop interchanging, loop collapsing, indirect addressing, use of temporary storage, and order of arithmetic. The ideas presented are based on vectorization of over a million lines of production codes and an extensive test suite developed to evaluate preprocessors under varying degrees of code complexity. Areas for future research are also discussed.

20.

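A Novel Improved Superword-Level Parallelism Algorithm

张素平 韩林 丁丽丽 王鹏翔 《计算机应用》 2017, 37(2): 450-456

Superword-level parallelism (SLP) algorithms handle large programs poorly when the proportion of parallel code is small and when vectorizable code contains parts that are detrimental to vectorization. To address this, a new improved SLP algorithm, NSLPO, is proposed. First, non-isomorphic statements that cannot be vectorized are transformed into isomorphic form to locate the vectorization opportunities that SLP misses. Next, a maximal common subgraph is built by adding redundant nodes, and optimizations such as redundancy elimination yield the supplementary SLP graph after isomorphization, increasing the parallelism of the code. Finally, a throttling method excludes code that is harmful to vectorization and leaves it scalar, so that only code with a vectorization benefit is vectorized and program efficiency improves as far as possible. Experiments on a widely used set of kernel benchmarks show that NSLPO outperforms the SLP algorithm, reducing execution time by 9.1% on average.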
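The isomorphization step can be pictured with a toy example: four statements that differ in shape are padded with a redundant `+ 0` so that all four perform the same operation and can be packed into one vector add. This is a hand-made illustration of the general idea, not the NSLPO algorithm; the function names are hypothetical, and integers are used so that adding zero is exactly neutral.

```c
#include <assert.h>
#include <stddef.h>

/* Before: statements 0 and 2 are additions, statements 1 and 3 are plain
 * copies. The four statements are not isomorphic, so an SLP packer
 * cannot combine them into a single vector operation. */
void update_scalar(int *a, const int *b, const int *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    a[0] = b[0] + c[0];
    a[1] = b[1];
    a[2] = b[2] + c[2];
    a[3] = b[3];
}

/* After: the copies are rewritten as "b[k] + 0". All four statements now
 * have the form a[k] = b[k] + pad[k], which maps onto one 4-wide add. */
void update_isomorphic(int *a, const int *b, const int *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    const int pad[4] = {c[0], 0, c[2], 0};  /* redundant zeros added */
    for (int k = 0; k < 4; ++k) {
        a[k] = b[k] + pad[k];
    }
}
```

The padded version does two additions that the original did not need, which is why a cost check of the kind the abstract calls throttling is needed to decide whether packing is worthwhile.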