Similar literature
 19 similar documents found (search time: 974 ms)
1.
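On a heterogeneous parallel architecture that combines a multi-core central processing unit (CPU) with a graphics processing unit (GPU), a protein molecular dynamics simulation program based on the AMBER force field was implemented using OpenMP and Compute Unified Device Architecture (CUDA) programming. By sensibly partitioning the program into CPU single-threaded, CPU multi-threaded, and GPU multi-threaded parts, the machine's processing capacity was used efficiently. Performance tests show that, relative to optimized serial CPU computation, the multi-core CPU-GPU heterogeneous parallel model has a strong performance advantage. In particular, porting the force computation, which accounts for 90% of the program's total execution time, to the GPU yielded a speedup of up to 12 times.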
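One way to read the figures in this abstract is through Amdahl's law. The sketch below is an illustration only and is not taken from the cited work. It assumes, hypothetically, that the 12x figure is the speedup of the force computation alone and that the remaining 10% of the runtime is unchanged; the function names are invented for this example.

```python
def amdahl_speedup(accelerated_fraction, part_speedup):
    """Overall speedup when a fraction of the runtime is sped up by part_speedup."""
    untouched_fraction = 1.0 - accelerated_fraction
    return 1.0 / (untouched_fraction + accelerated_fraction / part_speedup)


def amdahl_limit(accelerated_fraction):
    """Upper bound on overall speedup as the accelerated part becomes infinitely fast."""
    return 1.0 / (1.0 - accelerated_fraction)


if __name__ == "__main__":
    # Assumed reading: 90% of the runtime (force computation) runs 12x faster.
    print(f"overall speedup: {amdahl_speedup(0.90, 12.0):.2f}")
    print(f"upper bound:     {amdahl_limit(0.90):.2f}")
```

Under that reading the whole program would run about 5.7 times faster, and no kernel speedup could push it past 10 times. The abstract does not say which quantity the 12x figure refers to.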

2.
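MPI Parallelization of the MCNP-4C Multi-Particle Transport Monte Carlo Program   Total citations: 1, self-citations: 0, other citations: 1
The three-dimensional, continuous-cross-section, multi-particle transport Monte Carlo program MCNP-4C was parallelized with MPI. Using a segmented random number generator, the parallel runs reproduce the serial results exactly. On 500 processors the computation is 460 times faster than the serial run, a parallel efficiency of 92%, and the code can handle multi-particle transport problems, including criticality calculations.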
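The idea of a segmented random number generator can be sketched as follows. This is a minimal illustration under stated assumptions, not the MCNP-4C implementation: the generator is a 64-bit linear congruential generator with Knuth's MMIX constants (an arbitrary choice), every particle history is assumed to own a fixed-length segment of the stream, and all function names are hypothetical. Because a worker can jump straight to the start of any history's segment, the per-history results do not depend on how histories are distributed.

```python
A, C, M = 6364136223846793005, 1442695040888963407, 2**64  # illustrative LCG constants


def lcg_next(x):
    """Advance the generator state by one step."""
    return (A * x + C) % M


def skip_ahead(x, k):
    """Advance state x by k steps in O(log k) by composing affine maps."""
    a_acc, c_acc = 1, 0   # identity map
    a_cur, c_cur = A, C   # map for 2**i steps, starting with one step
    while k > 0:
        if k & 1:
            a_acc, c_acc = (a_cur * a_acc) % M, (a_cur * c_acc + c_cur) % M
        a_cur, c_cur = (a_cur * a_cur) % M, (a_cur * c_cur + c_cur) % M
        k >>= 1
    return (a_acc * x + c_acc) % M


def history_tally(state, draws):
    """Stand-in for one particle history: consume `draws` random numbers."""
    total = 0
    for _ in range(draws):
        state = lcg_next(state)
        total += state >> 40
    return total


def run_serial(seed, n_hist, stride, draws):
    """Histories run one after another; each starts `stride` steps after the last."""
    results, state = [], seed
    for _ in range(n_hist):
        results.append(history_tally(state, draws))
        for _ in range(stride):
            state = lcg_next(state)
    return results


def run_segmented(seed, n_hist, stride, draws, n_workers):
    """Histories dealt round-robin to workers; each jumps to its own segment."""
    results = [None] * n_hist
    for worker in range(n_workers):
        for i in range(worker, n_hist, n_workers):
            results[i] = history_tally(skip_ahead(seed, i * stride), draws)
    return results


if __name__ == "__main__":
    serial = run_serial(12345, 20, 1000, 50)
    segmented = run_segmented(12345, 20, 1000, 50, 4)
    print(serial == segmented)
```

The workers are simulated by a plain loop here; in an MPI code each rank would run its own share of the outer loop.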

3.
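Research on Multi-Core-Based Multithreaded Program Optimization   Total citations: 1, self-citations: 1, other citations: 0
As mainstream chip vendors have promoted them heavily, multi-core processors have become increasingly common. Traditional serial programming methods can no longer make full use of multi-core CPU resources, and how to exploit the computing performance of multi-core processors efficiently has become a new challenge for software developers. Building on traditional multithreaded programming, and drawing on the characteristics of the Intel processor microarchitecture and the CPU binding facility provided by the Linux kernel, this paper applies cache optimization and CPU affinity optimization. These eliminate cache-line contention and false sharing among local threads in a multi-core environment, reduce thread scheduling overhead, and improve the running efficiency of multithreaded programs.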

4.
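OpenMP Parallelization of Finite Element Element-Computation Subroutines   Total citations: 3, self-citations: 1, other citations: 2
With the release of dual-core and even quad-core processors from Intel and AMD, parallel computing has reached the PC. To take full advantage of multiple cores, existing programs need to be converted to multithreading so that they benefit from the performance gains of multi-core processing. This paper uses OpenMP, the industry standard for shared-memory programming, to parallelize the element computation subroutine used in the finite element method. Tests on a dual-CPU SMP node of a cluster show that shared-memory parallelization doubled the performance of the element subroutine.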

5.
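This paper analyzes in depth the performance differences between the Phytium (飞腾) processor FT 1500A and the commercial Intel XEON processor. At the micro-benchmark level, it evaluates the maximum attainable performance of the two platforms (floating-point performance, memory access latency, and memory bandwidth). At the application level, it selects a typical numerical ocean forecasting package, studies how to port open-source code to the Phytium and commercial processors, examines the software's single-core and multi-core performance on both platforms, analyzes the causes of the performance differences, and proposes corresponding optimizations. The paper concludes that FT 1500A already has a sound ecosystem (operating system, compilers, and toolchain), which makes porting typical scientific computing programs simple and feasible. Although the Phytium processor still lags the commercial platform in performance, its advantage in power consumption makes it a very promising platform.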

6.
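With the release of dual-core and even quad-core processors from Intel and AMD, parallel computing has reached the PC. To take full advantage of multiple cores, existing programs need to be converted to multithreading so that they benefit from the performance gains of multi-core processing. This paper uses OpenMP, the industry standard for shared-memory programming, to parallelize the element computation subroutine used in the finite element method. Tests on a dual-CPU SMP node of a cluster show that shared-memory parallelization doubled the performance of the element subroutine.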

7.
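The national meteorological administration's quantitative precipitation estimation system for networked weather radars involves a large amount of computation and high data throughput, and it also has strict real-time requirements, so shortening its execution time would bring substantial benefits. Given these characteristics, VTune Amplifier XE was used to perform hotspot and parallelism analysis on the serial program. The analysis showed substantial thread-level parallelism, and a parallelization plan was drawn up accordingly. The program was then parallelized on an Intel quad-core processor platform using two techniques, Win32 multithreading and OpenMP. The program consists mainly of single-station processing and network processing. Because of limited computing resources, the parallelized single-station processing gained only about 10% in performance, whereas the network processing achieved a near-linear improvement. After the computational load was adjusted, the parallel version reached a speedup of 5.5. The authors conclude that this parallelization approach suits applications that are compute-intensive and have high data throughput.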

8.
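This paper analyzes the computational flow of a double-loop iterative image restoration algorithm for turbulence-degraded images, converts the algorithm from a serial to a parallel computation mode, and proposes a multi-node parallel processing method based on that parallel mode. The parallel computation mode was mapped effectively onto a multi-node structure, and the program code was ported and run successfully on a multi-node system. Experimental results show that the proposed parallel correction method is effective and increases the frame rate of image restoration.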

9.
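In large-scale, long-duration numerical computation, the cumulative effect of floating-point rounding errors can make numerical results unreliable. Summation and dot product are the most basic operations in floating-point numerical computation; they are called frequently in large-scale scientific computing, so the accuracy of their results is critical. Targeting the domestic Phytium processor and building on OpenBLAS, this work uses error-free transformations to design efficient assembly kernel functions, and implements and optimizes high-precision summation and dot-product algorithms. Numerical experiments show that the results are as accurate as those the original algorithms produce in twice the working precision, which validates the algorithms. In the single-threaded case, the running times are 1.57 and 1.76 times those of the original algorithms, respectively, so the accuracy gain comes without a marked loss in efficiency. In the multithreaded case, the running time is almost the same as that of the original algorithms, which demonstrates their efficiency. A theoretical error analysis further confirms the reliability of the algorithms.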
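Error-free transformations can be illustrated with a short sketch. The code below is not the assembly kernel described in the abstract; it is a plain Python rendering of the well-known TwoSum and TwoProduct building blocks (the latter via Veltkamp splitting), combined into compensated summation and dot product in the style usually called Sum2 and Dot2. Whether the cited work uses exactly these variants is an assumption, and the function names are chosen for this example.

```python
def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and s + e == a + b exactly."""
    s = a + b
    b_virtual = s - a
    e = (a - (s - b_virtual)) + (b - b_virtual)
    return s, e


def split(a):
    """Veltkamp split of a double into high and low halves (a == hi + lo)."""
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi


def two_prod(a, b):
    """Return (p, e) with p = fl(a * b) and p + e == a * b exactly (barring overflow)."""
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = a_lo * b_lo - (((p - a_hi * b_hi) - a_lo * b_hi) - a_hi * b_lo)
    return p, e


def naive_sum(xs):
    """Ordinary left-to-right floating-point summation."""
    total = 0.0
    for x in xs:
        total += x
    return total


def sum2(xs):
    """Compensated summation: accumulate rounding errors separately."""
    s, c = 0.0, 0.0
    for x in xs:
        s, e = two_sum(s, x)
        c += e
    return s + c


def dot2(xs, ys):
    """Compensated dot product built from two_prod and two_sum."""
    p, c = 0.0, 0.0
    for x, y in zip(xs, ys):
        h, r = two_prod(x, y)
        p, q = two_sum(p, h)
        c += q + r
    return p + c


if __name__ == "__main__":
    data = [1e16, 1.0, -1e16]
    print(naive_sum(data), sum2(data))
    print(dot2([1e8, 1.0, -1e8], [1e8, 1.0, 1e8]))
```

The explicit `naive_sum` loop is used for the comparison because the built-in `sum` in recent Python versions already applies a compensated scheme to floats.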

10.
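Current general-purpose processors typically support 64-bit floating-point arithmetic. In large-scale, long-duration numerical computation, the cumulative effect of floating-point rounding errors can make numerical results unreliable, so it is essential to control the error effectively and to design high-precision, efficient, and reliable floating-point algorithms. A high-precision algorithm library was implemented on the SCILAB software platform using error-free transformations and the double-double data format. Numerical experiments on the evaluation of polynomials in the power, Bernstein, and Chebyshev bases were carried out on an Intel platform and on the domestic Phytium processor platform, and the results confirm the effectiveness of this high-performance numerical library. The multi-precision library has independent intellectual property rights, can be applied effectively on domestic, independently controllable processor platforms, and provides technical support for major national research projects.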

11.
Spherical harmonic transforms (SHT) are at the heart of many scientific and practical applications ranging from climate modelling to cosmological observations. In many of these areas, new cutting-edge science goals have been recently proposed requiring simulations and analyses of experimental or observational data at very high resolutions and of unprecedented volumes. Both these aspects pose a formidable challenge for the currently existing implementations of the transforms. This paper describes parallel algorithms for computing SHT with two variants of intra-node parallelism appropriate for novel supercomputer architectures, multi-core processors and Graphic Processing Units (GPU). It also discusses their performance, alone and embedded within a top-level, Message Passing Interface-based parallelisation layer ported from the S2HAT library, in terms of their accuracy, overall efficiency and scalability. We show that our inverse SHT run on GeForce 400 Series GPUs equipped with the latest Compute Unified Device Architecture architecture (Fermi) outperforms the state-of-the-art implementation for a multi-core processor executed on a current Intel Core i7-2600K. Furthermore, we show that a Message Passing Interface/Compute Unified Device Architecture version of the inverse transform run on a cluster of 128 Nvidia Tesla S1070 is as much as 3 times faster than the hybrid Message Passing Interface/OpenMP version executed on the same number of quad-core Intel Nehalem processors for problem sizes motivated by our target applications. Performance of the direct transforms is however found to be at best comparable in these cases. We discuss in detail the algorithmic solutions devised for the major steps involved in the transforms calculation, emphasising those with a major impact on their overall performance, and elucidate the sources of the dichotomy between the direct and the inverse operations. Copyright © 2013 John Wiley & Sons, Ltd.

12.
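Constrained by power consumption and temperature, the performance of traditional single-core processors is hard to improve, and multi-core computing has become the new processor model. However, existing multithreaded program design evolved on single-core processors and cannot use multiple cores efficiently to improve performance. Based on OpenMP, this work optimizes programs for multithreading so that threads run in parallel on a multi-core processor, and validates the approach with the classic N-Queens problem.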
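A common way to parallelize N-Queens is to treat each placement of the first-row queen as an independent task. The sketch below illustrates only that decomposition and is not the cited work's code, which uses OpenMP. It uses Python's `concurrent.futures.ThreadPoolExecutor` as a stand-in for a parallel loop; because of CPython's global interpreter lock, these threads show the task structure rather than a real speedup. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_from(n, row, cols, diag_left, diag_right):
    """Count completions of a partial placement using bitmask backtracking."""
    if row == n:
        return 1
    total = 0
    free = ~(cols | diag_left | diag_right) & ((1 << n) - 1)
    while free:
        bit = free & -free  # lowest free column in this row
        free -= bit
        total += count_from(
            n,
            row + 1,
            cols | bit,
            (diag_left | bit) << 1,
            (diag_right | bit) >> 1,
        )
    return total


def count_queens_parallel(n, workers=4):
    """Split the search by the first-row column; each column is one task."""

    def subtree(col):
        bit = 1 << col
        return count_from(n, 1, bit, bit << 1, bit >> 1)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(subtree, range(n)))


if __name__ == "__main__":
    print(count_queens_parallel(8))
```

The subtrees are independent and share no mutable state, so the only synchronization needed is the final sum, which corresponds to a reduction in an OpenMP version.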

13.
This paper presents the design and implementation of a parallelization framework and OpenMP runtime support in Intel® C++ & Fortran compilers for exploiting nested parallelism in applications using OpenMP pragmas or directives. We conduct the performance evaluation of two multimedia applications parallelized with OpenMP pragmas and compiled with the Intel C++ compiler on Hyper-Threading Technology (HT) enabled multiprocessor systems. The performance results show that the multithreaded code generated by the Intel compiler achieved a speedup of up to 4.69 on 4 processors with HT enabled for five different input video sequences for the H.264 encoder workload, and a 1.28 speedup on an HT-enabled single-CPU system and a 1.99 speedup on an HT-enabled dual-CPU system for the audio-visual speech recognition workload. The performance gain due to exploiting nested parallelism for leveraging Hyper-Threading Technology is up to 70% for two multimedia workloads under different multiprocessor system configurations. These results demonstrate that hyper-threading benefits can be achieved by exploiting nested parallelism through Intel compiler and runtime system support for OpenMP programs.

14.
Goodacre, J.; Sloss, A.N. Computer, 2005, 38(7): 42-50
Over the past few years, the ARM reduced-instruction-set computing (RISC) processor has evolved to offer a family of chips that range up to a full-blown multiprocessor. Embedded applications' demand for increasing levels of performance and the added efficiency of key new technologies have driven the ARM architecture's evolution. Throughout this evolutionary path, the ARM team has used a full range of techniques known to computer architecture for exploiting parallelism. The performance and efficiency methods that ARM uses include variable execution time, subword parallelism, digital signal processor-like operations, thread-level parallelism and exception handling, and multiprocessing. Leveraging parallelism on several levels, ARM's new chip designs could change how people access technology. With sales growing rapidly and more than 1.5 billion ARM processors already sold each year, software writers now have a huge range of markets in which their ARM code can be used.

15.
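Mainstream general-purpose processors now provide both multi-core parallelism and SIMD parallelism within each core. Although the GCC compiler implements automatic vectorization for SIMD parallelism, its automatic vectorization of OpenMP parallel programs is still far from satisfactory. For multithreaded OpenMP programs, and based on GCC's OpenMP implementation, this work extends the directives with a data-alignment attribute so that the compiler can judge more accurately whether data is aligned during automatic vectorization, thereby improving GCC's automatic vectorization.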

16.
We have designed Particle-in-Cell algorithms for emerging architectures. These algorithms share a common approach, using fine-grained tiles, but different implementations depending on the architecture. On the GPU, there were two different implementations, one with atomic operations and one with no data collisions, using CUDA C and Fortran. Speedups of up to about 50 compared to a single core of the Intel i7 processor have been achieved. There was also an implementation for traditional multi-core processors using OpenMP which achieved high parallel efficiency. We believe that this approach should work for other emerging designs such as the Intel Phi coprocessor from the Intel MIC architecture.

17.
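On the Sunway (神威) high-performance multi-core server, the automatic parallelizing compiler, in identifying and declaring the parallelism in a program, generates OpenMP programs that are not fully optimized: they use a simple fork-join model and contain many nested parallel loops, which leads to low running efficiency. To improve the efficiency of these OpenMP programs, a parallel region reconstruction optimization is proposed. By merging parallel regions and extending the scope of parallel regions in nested loops, it reduces the number of parallel regions in an OpenMP program, lowers control overheads such as the frequent creation and joining of thread teams, and converts a simple fork-join OpenMP program into a more efficient single-program-multiple-data (SPMD) OpenMP program. Experiments on SW1621, the new-generation Sunway high-performance multi-core server platform, show that parallel region reconstruction improves running efficiency by 10.77% on the NPB3.3-OMP benchmark suite and by 7.94% on the SPEC OMP2012 benchmark suite, effectively improving the execution efficiency of the OpenMP programs that the automatic parallelizing compiler produces.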

18.
Recently, a series of parallel loop self-scheduling schemes have been proposed, especially for heterogeneous cluster systems. However, they employed the MPI programming model to construct the applications without considering whether the computing node has a multicore architecture. As a result, every processor core has to communicate directly with the master node to request new tasks, even though the processor cores on the same node can communicate with each other through the underlying shared memory. To address this higher communication overhead, in this paper we propose to adopt a hybrid MPI and OpenMP programming model to design two-level parallel loop self-scheduling schemes. In the first level, each computing node runs an MPI process for inter-node communications. In the second level, each processor core runs an OpenMP thread to execute the iterations assigned to its resident node. Experimental results show that our method outperforms previous work.

19.
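A Method for Porting Multithreaded Computing Programs to Clusters   Total citations: 3, self-citations: 0, other citations: 3
The user interfaces and programming styles of cluster-parallel applications are so varied that they often deter users. This paper describes in detail a porting technique that starts from a program's object code. Based on relocating the PLT entries of ELF executables, it exploits the concurrency and synchronization characteristics inherent in a multithreaded program to distribute the computational load of its threads across the nodes of a cluster, providing users with a transparent cluster parallelism mechanism. The corresponding Master-Worker (Task-Farming) computation and communication model and its scheduling strategy are proposed and discussed. Finally, the technique is implemented, and the results of running a ported BLAS-based multithreaded matrix multiplication program are analyzed to verify the feasibility and efficiency of the model.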
