期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

顾坚刘伟《计算机科学》2014,41(6):113-118

代数多重网格(AMG)是众多数值模拟应用的核心算法,在基于多核的NUMA架构的机群系统上,AMG的并行扩展性暴露了新的问题。通过设计感知NUMA架构的内存分配器,将划分给多个线程的数据分割并绑定到运行对应线程的CPU所属的NUMA存储节点上,从而改善了OpenMP多线程并行的数据局部性,使BoomerAMG程序在大规模多核计算平台上具有更好的并行扩展性。在单节点和小规模机群的测试中,使用NAAlloc分配器分别获得了最高16%和60%的性能提升。相似文献

2.

基于OpenMP的分子动力学并行算法的性能分析与优化

白明泽程丽豆育升孙世新《计算机应用》2012,32(1):163-166

为提高分子动力学模拟在共享内存式服务器上的计算速度,对基于OpenMP的分子动力学并行算法(Critical方法)进行了性能分析与优化。通过在多核服务器上的测试,以及加速比和并行效率的计算分析了Critical方法的并行性能,进而提出优化的三角形方法。所提方法中每个线程所计算的粒子数固定,且粒子数目呈阶梯状上升,使得各线程能够错时到达临界区。从而使程序在临界区的闲置时间比Critical方法减半,加速比明显提高。相似文献

3.

多核构架下OpenMP多线程应用运行性能的研究 总被引：4，自引：0，他引：4

下载免费PDF全文

徐磊徐莹张丹丹《计算机工程与科学》2009,31(11)

多核平台下,OpenMP线程在核间的动态迁移在一定程度上会导致应用程序性能的下降,如果将线程绑定在固定的核上运行,使其不再迁移,这种方法将有可能提升应用程序性能,达到充分利用多核平台的计算能力的目的。本文将介绍如何使用主流的编译器绑定接口以及Linux内核API的方式实现OpenMP线程与核之间的绑定,使用STREAM Benchmark和NPB在上海超级计算中心的"魔方"超级计算机刀片上测试、比较绑定前后的应用程序的性能。结果证明,使用绑定方案将有可能提升OpenMP应用程序的性能。相似文献

4.

声波数值模拟中的多核并行方法研究

曹丹平《计算机工程与应用》2012,48(36):9-13

波动方程数值模拟普遍存在计算量大的问题,如何根据波动方程有限差分方法的特点开展并行化方法研究是适应微机多核发展的必然趋势。结合波动方程数值模拟中的多层循环嵌套问题和OpenMP的特点,通过确定循环体并行顺序、减少串行环节、合并循环体、准确设置制导语句以及线程绑定优化等方法有助于实现微机多核的高效并行。针对波动方程特点的多核并行不仅有助于提高单机计算效率,对于提高计算机集群上常用的MPI+OpenMP混合并行效率也具有重要意义。相似文献

5.

Xeon Phi平台上基于模板优化的3D-GVF场计算加速

齐金李宽杨灿群杜云飞《计算机工程与科学》2014,36(8):1435-1440

3D梯度向量流场（3D GVF field）广泛应用于多种3D图像分析算法中,其计算需要多次迭代,计算量大,如何提高其计算速度具有重要的研究意义。面向Intel Xeon Phi众核集成架构,首次进行了3D GVF场计算的加速优化。首先,挖掘3D图像像素点间存在的天然并行性,发挥众核架构优势,尝试线程级并行（多核）和数据级并行（SIMD）。其次,3D GVF场的计算过程是一种典型的3D 7点模板运算,结合Xeon Phi架构的L2 缓存规格,提出一种高效的数据分块策略,充分挖掘数据的时/空局部性,有效缓解模板计算引起的缓存缺失,提升了计算性能。实验结果表明,引入模板优化技术能显著提升3D GVF场的计算速度,在图像维度为5123时,所提方法在57核Xeon Phi平台上的性能相比在2.6GHz 8核16线程的Intel Xeon E5 2670 CPU上的性能,加速比可达2.77。相似文献

6.

GPU加速的分段Top-k查询算法

黄玉龙邹循进刘奎苏本跃《计算机应用》2014,34(11):3112-3116

现有Top-k查询优化算法无法充分利用图形处理器(GPU)强大的并行吞吐量及时获取查询结果,为此提出了一种基于统一计算设备架构(CUDA)模型的大规模分段查询算法。通过划分查询过程以及采用分段并行处理策略,该算法可最大限度地提升查询过程中的计算和比较效率。实验结果表明,与4线程多核优化算法相比,所提算法具有明显的性能优势,当有序列表数量为6,遍历步长为120时,性能达到最优,此时比多核算法快40倍。相似文献

7.

基于混合粒子群优化的CMP线程调度方法

下载免费PDF全文

李静梅张博《计算机工程》2012,38(20):113-115

为提高片上多核处理器(CMP)架构中线程调度的执行效率,发挥CMP的并行性能,提出一种基于混合粒子群优化算法的线程调度方法.根据设计的线程调度模型,利用有向无环图表述线程及线程间的相互依赖关系,并采用改进的混合粒子群算法对其进行合理调度.实验结果表明,该方法的执行效率优于现有的遗传算法,能有效地降低任务的执行时间,充分发挥多核架构的优势. 相似文献

8.

分形计算的并行设计及TBB实现 总被引：1，自引：0，他引：1

陈荣鑫陈维斌廖湖声《计算机应用》2011,31(3):839-842

线程构建模块(TBB)基于模板的特点简化了并行化设计,适合高效地实现多核并行设计。针对分形计算具有计算密集和高耗时的特点,结合TBB并行化设计,以充分利用多核计算资源。对影响并行性能的计算负载不平衡问题,提出了基于采样估算的平衡优化方法,通过采样执行时间来估算工作负载,据此进行均衡的任务划分,利用TBB任务调度实现并行处理。实验结果表明,采样估算精度高,耗时比率低,能有效实现负载均衡;基于TBB的实现可获得较好加速比。相似文献

9.

基于多核的多线程程序优化研究 总被引：1，自引：1，他引：0

施惠丰袁道华《计算机技术与发展》2010,20(6):70-73

随着主流芯片厂商的大力推广,多核处理器已经变得越来越普及.以往串行化的程序设计方法在多核环境下已经不能充分利用多核CPU的资源.怎样高效地利用多核处理器的计算性能,已经成为软件开发者面临的新的课题.文中在传统的多线程编程基础上,根据Intel处理器的微架构(Microarchitecture)特点,以及Linux内核提供的CPU绑定技术,通过采用Cache优化和CPU亲和力(CPU affinity)优化,消除了多核环境下局部多线程Cache行竞争和伪共享,减少了线程的调度开销,提高了多线程程序的运行效率. 相似文献

10.

IA-64的并行架构及其寄存器文件

下载免费PDF全文

邓晴莺张民选蒋江《计算机工程》2008,34(12):13-15

同时多线程能在同一时钟周期执行不同线程的指令,并且指令级并行和线程级并行。显式并行指令计算关注于编译器和硬件的相互协作。寄存器文件的设计在高性能处理器设计中十分重要,寄存器栈和寄存器栈引擎是提高其性能的重要手段。该文设计和实现一套并行环境,其中包括并行编译器OpenUH和基于IA-64的同时多线程体系结构EDSMT,实验表明,该并行架构适用于大多数并行应用,针对NAS的并行测试程序,该架构相对于SMTSIM平均有12.48%的性能提升。相似文献

11.

Efficient 3D stencil computations using CUDA

Marcin Krotkiewski Marcin Dabrowski 《Parallel Computing》2013

We present an efficient implementation of 7-point and 27-point stencils on high-end Nvidia GPUs. A new method of reading the data from the global memory to the shared memory of thread blocks is developed. The method avoids conditional statements and requires only two coalesced instructions to load the tile data with the halo (ghost zone). Additional optimizations include storing only one XY tile of data at a time in the shared memory to lower shared memory requirements, common subexpression elimination to reduce the number of instructions, and software prefetching to overlap arithmetic and memory instructions, and enhance latency hiding. The efficiency of our implementation is analyzed using a simple stencil memory footprint model that takes into account the actual halo overhead due to the minimum memory transaction size on the GPUs. Through experiments we demonstrate that in our implementation the memory overhead due to the halos is largely eliminated by good reuse of the halo data in the memory caches, and that our method of reading the data is close to optimal in terms of memory bandwidth usage. Detailed performance analysis for single precision stencil computations, and performance results for single and double precision arithmetic on two Tesla cards are presented. Our stencil implementations are more efficient than any other implementation described in the literature to date. On Tesla C2050 with single and double precision arithmetic our 7-point stencil achieves an average throughput of 12.3 and 6.5 Gpts/s, respectively (98 GFLOP/s and 52 GFLOP/s, respectively). The symmetric 27-point stencil sustains a throughput of 10.9 and 5.8 Gpts/s, respectively. 相似文献

12.

Evaluating ARM HPC clusters for scientific workloads

Jahanzeb Maqbool Sangyoon Oh Geoffrey C. Fox 《Concurrency and Computation》2015,27(17):5390-5410

The power consumption of modern high‐performance computing (HPC) systems that are built using power hungry commodity servers is one of the major hurdles for achieving Exascale computation. Several efforts have been made by the HPC community to encourage the use of low‐powered system‐on‐chip (SoC) embedded processors in large‐scale HPC systems. These initiatives have successfully demonstrated the use of ARM SoCs in HPC systems, but there is still a need to analyze the viability of these systems for HPC platforms before a case can be made for Exascale computation. The major shortcomings of current ARM‐HPC evaluations include a lack of detailed insights about performance levels on distributed multicore systems and performance levels for benchmarking in large‐scale applications running on HPC. In this paper, we present a comprehensive evaluation of results that covers major aspects of server and HPC benchmarking for ARM‐based SoCs. For the experiments, we built an unconventional cluster of ARM Cortex‐A9s that is referred to as Weiser and ran single‐node benchmarks (STREAM, Sysbench, and PARSEC) and multi‐node scientific benchmarks (High‐performance Linpack (HPL), NASA Advanced Supercomputing (NAS) Parallel Benchmark, and Gadget‐2) in order to provide a baseline for performance limitations of the system. Based on the experimental results, we claim that the performance of ARM SoCs depends heavily on the memory bandwidth, network latency, application class, workload type, and support for compiler optimizations. During server‐based benchmarking, we observed that when performing memory intensive benchmarks for database transactions, x86 performed 12% better for multithreaded query processing. However, ARM performed four times better for performance to power ratios for a single core and 2.6 times better on four cores. We noticed that emulated double precision floating point in Java resulted in three to four times slower performance as compared with the performance in C for CPU‐bound benchmarks. Even though Intel x86 performed slightly better in computation‐oriented applications, ARM showed better scalability in I/O bound applications for shared memory benchmarks. We incorporated the support for ARM in the MPJ‐Express runtime and performed comparative analysis of two widely used message passing libraries. We obtained similar results for network bandwidth, large‐scale application scaling, floating‐point performance, and energy‐efficiency for clusters in message passing evaluations (NBP and Gadget 2 with MPJ‐Express and MPICH). Our findings can be used to evaluate the energy efficiency of ARM‐based clusters for server workloads and scientific workloads and to provide a guideline for building energy‐efficient HPC clusters. Copyright © 2015 John Wiley & Sons, Ltd. 相似文献

13.

基于双口RAM的ARM与DSP通信接口设计

操虹李臻贾洪钢《工矿自动化》2012,38(3):72-74

提出了一种基于双口RAM的ARM与DSP通信接口的设计方案。该接口以ARM为主处理器、DSP为协处理器,ARM通过在Linux系统上建立的DSP任务管理线程实现DSP任务的管理和调度工作,DSP完成ARM下发的数据计算和处理工作,两者通过双口RAM交换数据。实际应用表明,该接口充分利用了两个处理器的功能特性,数据传输速度快,适用于ARM与DSP间需要进行大量数据交换的场合。相似文献

14.

模板操作在GPU上的实现与优化 总被引：1，自引：0，他引：1

方旭东唐玉华王桂彬唐滔《计算机工程与科学》2011,33(3):41

随着GPU的快速发展,使用GPU来加速科学计算应用已成为必然趋势。本文抽取了SPEC2000中富含模板操作的Mgrid的两个典型子程序Rprj3和Interp,使用Brook+语言把它们移植到AMD GPU上运行。采用Brook+语言提供的线程调节机制,我们实现了不同线程粒度下的程序版本,并分析了加速比不同的原因,总结了线程粒度调节对模板程序移植的指导意义。我们使用AMD RadeonHD4870 GPU作为实验平台,对比Intel Xeon E5405 CPU上的运行结果发现,在最大规模下,Rprj3获得的相对于CPU版本的加速比为5.37×,Interp获得的相对于CPU版本的加速比为12.8×。相似文献

15.

生化分析仪的ARM-SoC控制系统设计

张海江黎海文吴一辉于正林《自动化仪表》2012,33(3):21-23,27

为满足小型全自动生化分析仪系统的多任务、高实时性以及数据运算复杂等要求,提出了一种基于ARM-双单片机系统的设计方法。该系统采用S3C2440A(ARM)为上位机、C8051F060单片机为下位机的系统架构,ARM与单片机之间采用串口通信方式。ARM负责任务分配和数据存贮,主单片机负责试剂盘的定位和数据采集,从单片机负责加样臂的旋转和样品加样,实现了多任务并行快速处理;系统采用嵌入式Linux和QT/E(QT/Embedded)开发软件。试验结果表明,系统具有运动控制实时性好、运算处理能力强和性能稳定可靠等优点,完全满足实际应用的需求。相似文献

16.

非线性扩散方程的显式并行计算

下载免费PDF全文

迟利华刘杰《计算机工程》2010,36(21):25-27

在分布共享的多核集群系统中,提出一种求解非线性扩散方程的显式数据分布OpenMP并行计算方法。将数据进行分布式划分后分配到每个OpenMP线程,通过数据拷贝实现同步计算,并设计全局归约算法减少障碍同步次数。性能分析和测试结果表明,该方法在 4核Xeon处理器构成的分布共享集群系统上可扩展到1 024个CPU核,相对于64个CPU核,其加速比为7.06。相似文献

17.

基于Spark框架的乘潮水位计算与可视化平台

秦勃朱勇秦雪《计算机工程与科学》2015,37(12):2216-2221

乘潮水位计算是海洋环境信息处理的重要组成部分,具有计算量大、计算复杂度高、计算时间长等特性。采用传统集群计算模式实现乘潮水位计算业务,存在计算成本高、计算伸缩性和交互性差的问题。针对以上问题,提出一种基于Spark框架的乘潮水位计算和可视化平台。结合对Spark任务调度算法的研究,设计和实现了一种基于节点计算能力的任务调度算法,实现了长时间序列的多任务乘潮水位数据的检索、获取、数值计算、特征可视化的并行处理,达到了海量海洋环境数据计算和可视化处理的目的。实验结果表明,提出的基于Spark的乘潮水位计算和可视化平台可以有效地提高海量乘潮水位数据的分布式并行处理的效率,为更加快速和高效的乘潮水位计算提供了一种新的方法。相似文献

18.

一种基于Matrix的QR分解向量化方法

鲁庆男刘仲《计算机工程与科学》2016,38(2):210-216

提出一种基于Matrix的Givens旋转的QR分解向量化方法。针对Matrix的体系结构特点,对向量数据访存和计算进行优化,使计算均衡分布到各个向量处理单元;设计双缓冲DMA的数据传输策略,使得内核的计算与DMA数据搬移的时间完全重迭,内核始终处于峰值计算,从而取得最佳的计算效率。实验结果表明,该方法能够取得较高的计算效率和性能加速比。相似文献

19.

Asynchronous migration for parallel genetic programming on a computer cluster with multi-core processors

Shingo Kurose Kunihito Yamamori Masaru Aikawa Ikuo Yoshihara 《Artificial Life and Robotics》2012,16(4):533-536

An island model is a typical implementation of genetic programming on parallel computers with distributed memory. The island model has a migration facility that sends/receives some individuals in an island to/from another island to maintain diversity. The island model requires synchronization to migrate same-generation individuals between islands, and this synchronization causes an increase in computation time. This article proposes a new parallel genetic programming implementation based on the island model with asynchronous migration. Most recent computers are equipped with one or more multi-core processors, and are suitable for multi-threading. Therefore we employ a communication thread for migration between islands. The communication thread on a processor communicates with the communication thread on another processor to migrate individuals at appropriate intervals. Since the migration and other genetic operations can be independently processed on each core, and since we allow the exchange of individuals of different generations, no synchronization is needed in our implementation. In addition, a fitness calculation is also executed in parallel by the remaining cores. Experimental results show that the proposed method can reduce the computation time to about 17% in serial GP by using 40 threads. 相似文献