1.
Nowadays, NVIDIA's CUDA is a general-purpose scalable parallel programming model for writing highly parallel applications. It provides several key abstractions – a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP, and MPI programming, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to an MPI process are processed in parallel by CUDA, run by the processor cores of the same computational node.
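The abstract describes the partitioning scheme only in prose; the following is a minimal, hypothetical C sketch of how loop iterations might be split across MPI processes and then threaded with OpenMP, with a plain CPU loop standing in for the CUDA kernel that the paper launches on each GPU node (all names and sizes are assumptions).

```c
/* Minimal sketch of partitioning loop iterations across MPI processes,
 * with OpenMP threads handling the per-node share. In the paper the
 * per-node share is offloaded to a CUDA device; here a CPU loop stands in.
 * Hypothetical build: mpicc -fopenmp hybrid_loop.c -o hybrid_loop */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000  /* total loop iterations (assumed problem size) */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Partition iterations evenly by the number of MPI processes
     * (one process per GPU node in the paper's setup). */
    int chunk = (N + size - 1) / size;
    int begin = rank * chunk;
    int end   = begin + chunk < N ? begin + chunk : N;

    double local = 0.0;
    /* Per-node share; the paper would launch a CUDA kernel here. */
    #pragma omp parallel for reduction(+:local)
    for (int i = begin; i < end; i++)
        local += (double)i * 0.5;

    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```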
2.
The rapidly increasing number of cores in modern microprocessors is pushing current high performance computing (HPC) systems into the petascale and exascale era. The hybrid nature of these systems - distributed memory across nodes and shared memory with non-uniform memory access within each node - poses a challenge to application developers. In this paper, we study a hybrid approach to programming such systems - a combination of two traditional programming models, MPI and OpenMP. We present the performance of standard benchmarks from the multi-zone NAS Parallel Benchmarks and of two full applications using this approach on several multi-core-based systems, including an SGI Altix 4700, an IBM p575+, and an SGI Altix ICE 8200EX. We also present new data locality extensions to OpenMP to better match the hierarchical memory structure of multi-core architectures.
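As a rough companion to the abstract above, the sketch below shows the smallest possible MPI + OpenMP hybrid: one process per node, several threads inside each process, each reporting its placement. It illustrates the programming model only and is not code from the paper.

```c
/* Hybrid "hello" sketch: one MPI process per node, OpenMP threads inside.
 * Often the first test when mapping a hybrid code onto NUMA nodes.
 * Hypothetical build: mpicc -fopenmp hybrid_hello.c -o hybrid_hello */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    /* Ask for an MPI threading level compatible with OpenMP regions. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    #pragma omp parallel
    {
        /* Each thread reports where it runs, useful to verify that the
         * process/thread layout matches the node's memory hierarchy. */
        printf("rank %d, thread %d of %d on %s\n",
               rank, omp_get_thread_num(), omp_get_num_threads(), host);
    }

    MPI_Finalize();
    return 0;
}
```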
3.
Modern computer systems become increasingly distributed and heterogeneous by comprising multi-core CPUs, GPUs, and other accelerators. Current programming approaches for such systems usually require the application developer to use a combination of several programming models (e.g., MPI with OpenCL or CUDA) in order to exploit the system’s full performance potential. In this paper, we present dOpenCL (distributed OpenCL)—a uniform approach to programming distributed heterogeneous systems with accelerators. dOpenCL allows the user to run unmodified existing OpenCL applications in a heterogeneous distributed environment. We describe the challenges of implementing the OpenCL programming model for distributed systems, as well as its extension for running multiple applications concurrently. Using several example applications, we compare the performance of dOpenCL with MPI + OpenCL and standard OpenCL implementations.
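Since dOpenCL's premise is that unmodified OpenCL host code keeps working when the devices sit on remote nodes, a plain OpenCL enumeration loop like the following (standard OpenCL API only, nothing dOpenCL-specific) is the kind of code that would run unchanged.

```c
/* Minimal OpenCL host-side sketch: enumerate platforms and devices.
 * Hypothetical build: gcc list_devices.c -lOpenCL -o list_devices */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platforms[8];
    cl_uint nplat = 0;
    clGetPlatformIDs(8, platforms, &nplat);

    for (cl_uint p = 0; p < nplat; p++) {
        cl_device_id devices[16];
        cl_uint ndev = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &ndev);
        for (cl_uint d = 0; d < ndev; d++) {
            char name[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            printf("platform %u, device %u: %s\n", p, d, name);
        }
    }
    return 0;
}
```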
4.
On modern multi-core, many-core, and heterogeneous architectures, floating-point computations, especially reductions, may become non-deterministic and, therefore, non-reproducible mainly due to the non-associativity of floating-point operations. We introduce an approach to compute the correctly rounded sums of large floating-point vectors accurately and efficiently, achieving deterministic results by construction. Our multi-level algorithm consists of two main stages: first, a filtering stage that relies on fast vectorized floating-point expansion; second, an accumulation stage based on superaccumulators in a high-radix carry-save representation. We present implementations on recent Intel desktop and server processors, Intel Xeon Phi co-processors, and both AMD and NVIDIA GPUs. We show that numerical reproducibility and bit-perfect accuracy can be achieved at no additional cost for large sums that have dynamic ranges of up to 90 orders of magnitude by leveraging arithmetic units that are left underused by standard reduction algorithms.
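The paper's multi-level algorithm is not reproduced here; the sketch below only shows the classic TwoSum error-free transformation on which floating-point-expansion filtering of this kind is built, applied to a small cancellation-prone sum.

```c
/* Not the paper's algorithm; just Knuth's TwoSum error-free transformation:
 * a + b is represented exactly as hi + lo, the building block of
 * floating-point expansions. */
#include <stdio.h>

/* Compute hi = fl(a + b) and lo such that a + b == hi + lo exactly
 * (round-to-nearest binary floating point, barring overflow). */
static void two_sum(double a, double b, double *hi, double *lo) {
    double s  = a + b;
    double bb = s - a;
    *hi = s;
    *lo = (a - (s - bb)) + (b - bb);
}

int main(void) {
    /* Accumulate into a tiny two-term expansion (sum + running error). */
    double x[4] = {1e16, 1.0, -1e16, 1.0};
    double sum = 0.0, err = 0.0;
    for (int i = 0; i < 4; i++) {
        double lo;
        two_sum(sum, x[i], &sum, &lo);
        err += lo;               /* rounding error kept separately */
    }
    printf("high part = %.17g, compensated sum = %.17g\n", sum, sum + err);
    return 0;
}
```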
5.
The Steered Response Power with Phase Transform (SRP-PHAT) algorithm is a well-known method for sound source localization due to its robust performance in noisy and reverberant environments. This algorithm is used in a large number of acoustic applications such as automatic camera steering systems, human–machine interaction, video gaming, and audio surveillance. SRP-PHAT implementations require handling a high number of signals coming from a microphone array and a huge search grid that influences the localization accuracy of the system. In this context, high performance in the localization process can only be achieved by using massively parallel computational resources. Different types of multi-core machines based either on multiple CPUs or on GPUs are commonly employed in diverse fields of science for accelerating a number of applications, mainly using OpenMP and CUDA as programming frameworks, respectively. This implies the development of multiple source codes, which limits portability and application possibilities. In contrast, OpenCL has emerged as an open standard for parallel programming that is nowadays supported by a wide range of architectures. In this work, we evaluate an OpenCL-based implementation of the SRP-PHAT algorithm on two state-of-the-art CPU and GPU platforms. Results demonstrate that OpenCL achieves close-to-CUDA performance on the GPU (considered as an upper bound) and outperforms the OpenMP-based implementation in most of the CPU configurations.
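As a hypothetical illustration of why the search grid parallelizes so well, the sketch below scans a candidate grid and keeps the best-scoring point; the score function is only a placeholder, whereas a real SRP-PHAT score sums GCC-PHAT values over microphone pairs at each candidate's time differences of arrival.

```c
/* Sketch of the SRP-style grid scan only: evaluate a score for every
 * candidate source position and keep the maximum. */
#include <math.h>
#include <omp.h>
#include <stdio.h>

#define GRID 64   /* assumed grid resolution per axis */

/* Placeholder score, standing in for the SRP-PHAT functional. */
static double srp_score(double x, double y, double z) {
    return -(x - 0.3) * (x - 0.3) - (y + 0.1) * (y + 0.1) - z * z;
}

int main(void) {
    double best = -INFINITY, bx = 0, by = 0, bz = 0;

    /* Grid points are independent, which is what makes the search amenable
     * to OpenMP, CUDA, or OpenCL parallelization. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < GRID; i++) {
        for (int j = 0; j < GRID; j++) {
            for (int k = 0; k < GRID; k++) {
                double x = -1.0 + 2.0 * i / (GRID - 1);
                double y = -1.0 + 2.0 * j / (GRID - 1);
                double z = -1.0 + 2.0 * k / (GRID - 1);
                double s = srp_score(x, y, z);
                /* A tuned version would keep per-thread maxima and merge. */
                #pragma omp critical
                if (s > best) { best = s; bx = x; by = y; bz = z; }
            }
        }
    }
    printf("estimated source at (%.2f, %.2f, %.2f), score %.3f\n", bx, by, bz, best);
    return 0;
}
```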
6.
Multicore clusters, which have become the most prominent form of High Performance Computing (HPC) systems, challenge the performance of MPI applications with non-uniform memory accesses and shared cache hierarchies. Recent advances in MPI collective communications have alleviated the performance issues exposed by deep memory hierarchies by carefully considering the mapping between the collective topology and the hardware topology, as well as the use of single-copy kernel-assisted mechanisms. However, in distributed environments, a single-level approach cannot encompass the extreme variations not only in bandwidth and latency capabilities, but also in the capability to support duplex communications or operate multiple concurrent copies. This calls for a collaborative approach between multiple layers of collective algorithms, dedicated to extracting the maximum degree of parallelism from the collective algorithm by consolidating the intra- and inter-node communications.
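A minimal example of the intra-/inter-node split discussed above is a two-level broadcast built from standard MPI-3 calls, as sketched below; it shows the basic shape of hierarchy-aware collectives, not the paper's framework.

```c
/* Simplified two-level broadcast: split the world into node-local
 * communicators, broadcast between node leaders, then inside each node.
 * Hypothetical build: mpicc hier_bcast.c -o hier_bcast */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    /* Communicator of ranks sharing the same node (MPI-3). */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, wrank,
                        MPI_INFO_NULL, &node);
    int nrank;
    MPI_Comm_rank(node, &nrank);

    /* Communicator containing one leader (node-rank 0) per node. */
    MPI_Comm leaders;
    MPI_Comm_split(MPI_COMM_WORLD, nrank == 0 ? 0 : MPI_UNDEFINED, wrank, &leaders);

    int value = (wrank == 0) ? 42 : -1;
    if (leaders != MPI_COMM_NULL)           /* inter-node stage */
        MPI_Bcast(&value, 1, MPI_INT, 0, leaders);
    MPI_Bcast(&value, 1, MPI_INT, 0, node); /* intra-node stage */

    printf("world rank %d received %d\n", wrank, value);

    if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}
```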
7.
The finite difference method is one of the most important tools for solving partial differential equations, and solving difference equations in parallel makes it possible to tackle engineering problems quickly. This paper studies a hybrid parallel computing method based on MPI and OpenMP and uses it to test the computation of a steady-state heat transfer problem. The results show that the hybrid parallel model considerably helps to improve computing speed.
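As an illustrative sketch (not the paper's code), the node-local part of such a solver can be written as OpenMP-threaded Jacobi sweeps for the steady-state heat equation; in the hybrid version, MPI would additionally split the domain into slabs and exchange halo rows between sweeps.

```c
/* Node-local piece only: Jacobi sweeps for the steady-state heat (Laplace)
 * equation, threaded with OpenMP. Grid size, boundary values, and iteration
 * count are illustrative assumptions. */
#include <omp.h>
#include <stdio.h>
#include <string.h>

#define NX 256
#define NY 256
#define ITERS 500

static double u[NX][NY], unew[NX][NY];

int main(void) {
    /* Boundary condition: hot top edge, cold elsewhere. */
    for (int j = 0; j < NY; j++) u[0][j] = 100.0;

    for (int it = 0; it < ITERS; it++) {
        #pragma omp parallel for
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);
        /* Copy interior rows back; boundaries stay fixed. */
        memcpy(&u[1][0], &unew[1][0], (NX - 2) * NY * sizeof(double));
    }
    printf("u[NX/2][NY/2] = %f\n", u[NX/2][NY/2]);
    return 0;
}
```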
8.
This paper describes the characteristics of parallel computing with MPI and with OpenMP, and builds a hybrid programming platform based on both in Visual Studio 2010. Programs executed on this platform can combine multi-process parallelism with multi-threading inside each process. A parallel matrix multiplication algorithm based on data partitioning is designed and implemented: the data are split into two parts handled by two compute nodes, and within each node the data are further partitioned and processed simultaneously by multiple threads. Comparison with non-parallel matrix multiplication and with MPI-only and OpenMP-only matrix multiplication shows that the algorithm can effectively exploit the processing power of the computers.
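A hypothetical sketch of the partitioning described above: rows of A are scattered across MPI processes, B is broadcast, and each process multiplies its row block with OpenMP threads (matrix sizes and values are illustrative only).

```c
/* Hybrid MPI + OpenMP matrix multiplication sketch.
 * Hypothetical build: mpicc -fopenmp mm_hybrid.c -o mm_hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 512   /* assumed to be divisible by the number of processes */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;
    double *A = NULL, *C = NULL;
    double *B  = malloc(N * N * sizeof(double));
    double *Al = malloc(rows * N * sizeof(double));
    double *Cl = malloc(rows * N * sizeof(double));

    if (rank == 0) {                       /* root initializes full matrices */
        A = malloc(N * N * sizeof(double));
        C = malloc(N * N * sizeof(double));
        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(A, rows * N, MPI_DOUBLE, Al, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each process multiplies its row block; threads split the rows again. */
    #pragma omp parallel for
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++) s += Al[i * N + k] * B[k * N + j];
            Cl[i * N + j] = s;
        }

    MPI_Gather(Cl, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("C[0] = %f (expected %f)\n", C[0], 2.0 * N);
    MPI_Finalize();
    return 0;
}
```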
9.
Direct volume visualization is an important method in many areas, including computational fluid dynamics and medicine. Achieving interactive rates for direct volume rendering of large unstructured volumetric grids is a challenging problem, but parallelizing direct volume rendering algorithms can help achieve this goal. Using the Compute Unified Device Architecture (CUDA), we propose a GPU-based volume rendering algorithm that is based on a cell projection-based ray-casting algorithm designed for CPU implementations. We also propose a multicore-parallelized version of the cell-projection algorithm using OpenMP. In both algorithms, we favor image quality over rendering speed. Our algorithm has a low memory footprint, allowing us to render large datasets, and supports progressive rendering. We compared the GPU implementation with the serial and multicore implementations and observed significant speed-ups that, together with progressive rendering, enable reaching interactive rates for large datasets.
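The cell-projection algorithm itself is not reproduced here; the sketch below only shows the outer structure that both the CUDA and the OpenMP versions exploit, namely that every image pixel casts its ray independently. A trivial structured volume stands in for the unstructured grid.

```c
/* Per-pixel ray casting skeleton with OpenMP; "rendering" is a
 * front-to-back sum over a toy scalar field. */
#include <omp.h>
#include <stdio.h>

#define W 320
#define H 240
#define STEPS 128

static float image[H][W];

/* Placeholder scalar field standing in for interpolated volume samples. */
static float sample(float x, float y, float z) {
    return x * x + y * y + z * z < 1.0f ? 1.0f : 0.0f;
}

int main(void) {
    /* Rows vary in cost, so dynamic scheduling helps balance threads. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int py = 0; py < H; py++) {
        for (int px = 0; px < W; px++) {
            float x = 2.0f * px / W - 1.0f;
            float y = 2.0f * py / H - 1.0f;
            float acc = 0.0f;
            for (int s = 0; s < STEPS; s++)          /* march along +z */
                acc += sample(x, y, 2.0f * s / STEPS - 1.0f) / STEPS;
            image[py][px] = acc;
        }
    }
    printf("center pixel opacity: %f\n", image[H / 2][W / 2]);
    return 0;
}
```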
11.
To study the general-purpose computing capability of GPUs and a programming model suited to SMP clusters, this paper proposes, for the first time, a multi-granularity hybrid parallel programming method based on MPI+CUDA: MPI provides coarse-grained parallelism across nodes, while CUDA provides fine-grained parallelism within each node. Using this method, the parallel computing capability for a large-scale matrix multiplication problem is tested on a three-node SMP cluster. The experimental results show that the method significantly improves parallel efficiency, and demonstrate that the MPI+CUDA hybrid programming model can fully exploit the distributed memory between nodes and the shared memory within each node of an SMP cluster, providing an effective parallel strategy for SMP clusters equipped with CUDA-enabled GPUs.
12.
This paper presents real-time image processing applications using multicore and multiprocessing technologies. To this end, parallel image segmentation was performed on many images covering the entire surface of the same metallic, cylindrical moving objects. Experimental results on a multicore CPU with the OpenMP platform showed that, by increasing the chunk size, the execution time decreases to approximately one quarter of that of serial computing. The same experiments were implemented on GPGPUs using four techniques: (1) single image transmission with single pixel processing; (2) single image transmission with multiple pixel processing; (3) multiple image transmission with single pixel processing; and (4) multiple image transmission with multiple pixel processing. All techniques were implemented on GeForce, Tesla K20, and Tesla K40 cards. Experimental results on GPUs with the CUDA platform showed that speedup increases with the number of cores. The Tesla K40 gave the best results: improvements over serial computing of 35 and 12 times (first technique), 36 and 13 times (second technique), 54 and 16 times (third technique), and 71 and 17 times (fourth technique), without and with data transmission time, respectively. As a result, users are advised to use the Tesla K40 GPU with multiple image transmission and multiple pixel processing to obtain maximum performance.
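The "chunk size" mentioned above corresponds to the chunk argument of OpenMP's schedule clause; a hypothetical thresholding kernel written that way looks like the following (synthetic image and threshold, not the paper's segmentation code).

```c
/* Row-wise image thresholding with an explicit OpenMP chunk size. */
#include <omp.h>
#include <stdio.h>

#define W 1024
#define H 768
#define CHUNK 32          /* rows handed to a thread at a time */

static unsigned char img[H][W], mask[H][W];

int main(void) {
    /* Fill a synthetic image with a gradient. */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            img[y][x] = (unsigned char)((x + y) % 256);

    /* Larger chunks cut scheduling overhead; too large hurts load balance. */
    #pragma omp parallel for schedule(dynamic, CHUNK)
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            mask[y][x] = img[y][x] > 128 ? 255 : 0;

    long on = 0;
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            if (mask[y][x]) on++;
    printf("segmented pixels: %ld of %d\n", on, W * H);
    return 0;
}
```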
13.
Addressing the current state of, and open problems in, the analysis of intense electromagnetic pulse coupling effects at home and abroad, this paper explains the concept of electromagnetic soft kill and proposes a parallel synchronous algorithm for the finite-difference time-domain/general-purpose circuit simulation program (FDTD-SPICE) based on the shared-memory OpenMP standard and message-passing MPI. A dynamic coupling model under electromagnetic pulse excitation is established, solving the difficult problem of dynamically assessing the electromagnetic soft kill of aircraft. Taking a military aircraft as an example, parallel synchronous simulation experiments and protection assessments are carried out for the electromagnetic field coupled into the cockpit by an electromagnetic weapon and for the FDTD-SPICE equivalent circuit. The results show that the algorithm parallelizes and synchronizes well; when the aircraft is powered on, the peak amplitudes are far higher than when it is unpowered, so vulnerability increases and the protection effect weakens; a shielded cockpit can greatly reduce the impact of electromagnetic weapons on airborne circuits.
15.
Compared to Beowulf clusters and shared-memory machines, GPUs and FPGAs are emerging alternative architectures that provide massive parallelism and great computational capability. These architectures can be used to run compute-intensive algorithms that analyze ever-growing datasets and provide scalability. In this paper, we present four implementations of the K-means data clustering algorithm for different high performance computing platforms: a CUDA implementation for GPUs, a Mitrion C implementation for FPGAs, an MPI implementation for Beowulf compute clusters, and an OpenMP implementation for shared-memory machines. We present comparative analyses of the cost of each platform, the difficulty of programming for each platform, and the performance of each implementation.
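As a rough illustration of what all four versions parallelize, the sketch below implements the assignment step of one K-means iteration with OpenMP; the data are random 2-D points, and none of this is the paper's code.

```c
/* Assignment step of K-means: each point independently finds its nearest
 * centroid, which is the dominant, embarrassingly parallel part. */
#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NPTS 10000
#define K 4
#define DIM 2

int main(void) {
    static double pts[NPTS][DIM], cent[K][DIM];
    static int label[NPTS];

    srand(1);
    for (int i = 0; i < NPTS; i++)
        for (int d = 0; d < DIM; d++) pts[i][d] = (double)rand() / RAND_MAX;
    for (int k = 0; k < K; k++)
        for (int d = 0; d < DIM; d++) cent[k][d] = (double)rand() / RAND_MAX;

    /* Points are independent, so the loop maps directly to threads
     * (or to CUDA threads / MPI ranks in the other implementations). */
    #pragma omp parallel for
    for (int i = 0; i < NPTS; i++) {
        double best = INFINITY;
        for (int k = 0; k < K; k++) {
            double d2 = 0.0;
            for (int d = 0; d < DIM; d++) {
                double diff = pts[i][d] - cent[k][d];
                d2 += diff * diff;
            }
            if (d2 < best) { best = d2; label[i] = k; }
        }
    }
    printf("point 0 assigned to cluster %d\n", label[0]);
    return 0;
}
```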
16.
Important components of molecular modeling applications are estimation and minimization of the internal energy of a molecule. For macromolecules such as proteins and amino acids, energy estimation is performed using empirical equations known as force fields. Over the past several decades, much effort has been directed towards improving the accuracy of these equations, and the resulting increased accuracy has come at the expense of greater computational complexity. For example, the interactions between a protein and surrounding water molecules have been modeled with improved accuracy using the generalized Born solvation model, which increases the computational complexity to O(n^3). Fortunately, many force-field calculations are amenable to parallel execution. This paper describes the steps that were required to transform the Born calculation from a serial program into a parallel program suitable for parallel execution in both the OpenMP and MPI environments. Measurements of the parallel performance on a symmetric multiprocessor reveal that the Born calculation scales well for up to 144 processors. In some cases the OpenMP implementation scales better than the MPI implementation, but in other cases the MPI implementation scales better than the OpenMP implementation. However, in all cases the OpenMP implementation performs better than the MPI implementation, and requires less programming effort as well.
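The generalized Born equations are not reproduced here; the sketch below only shows the pairwise-sum structure that such force-field terms share, with the outer loop threaded by OpenMP and the partial energies combined through a reduction (toy coordinates and charges).

```c
/* Pairwise energy term with an OpenMP reduction; a Coulomb-like expression
 * stands in for the actual solvation/force-field terms. */
#include <math.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NATOMS 2000

static double x[NATOMS], y[NATOMS], z[NATOMS], q[NATOMS];

int main(void) {
    srand(7);
    for (int i = 0; i < NATOMS; i++) {
        x[i] = rand() / (double)RAND_MAX;
        y[i] = rand() / (double)RAND_MAX;
        z[i] = rand() / (double)RAND_MAX;
        q[i] = (i % 2) ? 1.0 : -1.0;       /* toy charges */
    }

    double energy = 0.0;
    /* Pairwise term; dynamic scheduling evens out the triangular loop. */
    #pragma omp parallel for schedule(dynamic) reduction(+:energy)
    for (int i = 0; i < NATOMS; i++)
        for (int j = i + 1; j < NATOMS; j++) {
            double dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
            double r = sqrt(dx * dx + dy * dy + dz * dz) + 1e-9;
            energy += q[i] * q[j] / r;     /* Coulomb-like stand-in term */
        }
    printf("toy pairwise energy: %f\n", energy);
    return 0;
}
```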
17.
The dynamic and unpredictable workloads of multicore systems often lead to high power consumption and long latency; agile run-time task allocation can effectively address these problems. To this end, an approximate model is proposed for the task allocation problem in multicore systems to estimate the number of available nodes around any given node; a hill-climbing search heuristic (SHiC) is then used to quickly find the best first node among all available nodes; finally, tasks are allocated efficiently using the CoNA algorithm. Simulations under different network sizes and network parameter settings show that SHiC achieves significant performance gains, improving network latency and power consumption compared with the latest existing studies.
18.
The three-dimensional wedge-shaped underwater acoustic propagation model has an analytical solution, which can be used to verify models such as the FOR3D propagation model under certain conditions. However, solving a complex three-dimensional underwater sound-field problem involves intensive computation and long calculation times. In this paper, we exploit a hybrid parallel programming model combining MPI and OpenMP to accelerate the computation, design various optimization methods to improve overall performance, and carry out performance and optimization analysis on the Tianhe-2 platform. Experiments show that the optimized implementation of the three-dimensional wedge-shaped underwater acoustic propagation model achieves a 46.5x speedup over the original serial program, a substantial performance improvement. We also carried out scalability tests and parallel optimization experiments on large-scale practical examples.
20.
Technological advancements in the silicon industry, as predicted by Moore’s law, have resulted in an increasing number of processor cores on a single chip, giving rise to multicore, and subsequently many-core, architectures. This work focuses on identifying key architecture and software optimizations to attain high performance from tiled many-core architectures (TMAs)—an architectural innovation in multicore technology. Although embedded systems design is traditionally power-centric, there has been a recent shift toward high-performance embedded computing due to the proliferation of compute-intensive embedded applications. TMAs are suitable for these embedded applications due to the low-power design features in many of these TMAs. We discuss performance optimizations on a single tile (processor core) as well as parallel performance optimizations, such as application decomposition, cache locality, tile locality, memory balancing, and horizontal communication for TMAs. We elaborate on compiler-based optimizations that are applicable to TMAs, such as function inlining, loop unrolling, and feedback-based optimizations. We present a case study with optimized dense matrix multiplication algorithms for Tilera’s TILEPro64 to experimentally demonstrate the performance and performance-per-watt optimizations on TMAs. Our results quantify the effectiveness of algorithmic choices, cache blocking, compiler optimizations, and horizontal communication in attaining high performance and performance per watt on TMAs.
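As a small, generic illustration of the cache-blocking optimization named in the case study (not the TILEPro64-tuned code), a blocked matrix multiplication in C looks like the following; the block size is an assumption.

```c
/* Cache-blocked matrix multiplication: iterate over blocks so the working
 * set stays resident in a tile's cache. */
#include <stdio.h>
#include <string.h>

#define N 512
#define BS 64            /* block chosen to fit the cache (assumed size) */

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }
    memset(C, 0, sizeof(C));

    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += a * B[k][j];
                    }

    printf("C[0][0] = %f (expected %f)\n", C[0][0], 2.0 * N);
    return 0;
}
```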