Similar literature
1.
Current parallel systems, which combine multicore/manycore processors with GPUs, are becoming more complex because of their heterogeneous nature. The programmability barrier inherent to parallel systems rises with almost every new architecture. Libraries, languages, and tools that allow easy and efficient use of these systems are therefore essential. Among the proposals addressing this problem, skeletal programming has emerged as a natural way to ease the programming of parallel systems in general and of GPUs in particular. In this paper, we develop a programming skeleton for Dynamic Programming on MultiGPU systems. The skeleton, implemented in CUDA, allows the user to execute parallel codes on MultiGPU systems just by providing sequential C++ specifications of her problems. The performance and ease of use of this skeleton have been tested on several optimization problems. The experimental results obtained on a cluster of Nvidia Fermi GPUs show the advantages of the approach.

2.
The Bees Algorithm is a population-based, computation-bound method inspired by the natural behavior of honey bees that finds near-optimal solutions to search problems. Recently, many parallel swarm-based algorithms have been developed to run on GPUs (Graphics Processing Units), so developing a parallel Bees Algorithm for the GPU has become important. In this paper, we extend the Bees Algorithm to run on CUDA (Compute Unified Device Architecture); we call the result CUBA (CUDA-based Bees Algorithm). We evaluate the performance of CUBA through experiments on a number of well-known optimization problems. Results show that CUBA significantly outperforms the standard Bees Algorithm on numerous optimization problems.

3.
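Design and implementation of a CUDA-based parallel cuckoo search algorithm   Total citations: 1, self-citations: 0, citations by others: 1
Cuckoo search (CS) is an intelligent metaheuristic algorithm developed in recent years that has been applied successfully to a variety of optimization problems. To address the excessive computation time of CS on big-data and large-scale complex problems, a parallel cuckoo search algorithm based on the Compute Unified Device Architecture (CUDA) is proposed. The parallel implementation combines task parallelism with data parallelism: graphics processing unit (GPU) thread blocks are mapped to cuckoo individuals and threads to the individual dimensions of each cuckoo, so that nest position update, individual fitness evaluation, nest rebuilding, and the search for the best individual all run in parallel. The entire optimization loop of CS runs on the GPU, which reduces CPU-GPU communication overhead during computation. Simulation experiments on four classic benchmark functions show that, with convergence consistent with the standard CS algorithm, the CUDA-based parallel CS algorithm achieves a speedup of up to 110x.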

4.
We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify matrix-matrix multiplication as a first natural entry point for a minimally invasive integration of GPUs. We investigate the performance on the NVIDIA GeForce 8800 multicore chip, initially designed for intensive gaming applications. We exploit the architectural features of the GeForce 8800 GPU to design an efficient GPU-parallel sparse matrix solver. A prototype approach that leverages the bandwidth and computing power of GPUs for these matrix kernel operations is demonstrated, resulting in an overall performance of over 110 GFlop/s on the desktop for large matrices and over 38 GFlop/s for sparse matrices arising in real applications. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications.

5.
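Growing demand for applications of large-scale, high-resolution digital terrain data poses new challenges for compute-intensive digital terrain analysis algorithms such as flow accumulation. Targeting the characteristics of CPU/GPU (Graphics Processing Unit) heterogeneous computing platforms, a parallelization strategy for the multiple-flow-direction flow accumulation algorithm based on OpenCL (Open Computing Language) is proposed; it offers better platform independence and portability and simplifies parallel application design on CPU/GPU heterogeneous platforms. The parallel flow accumulation algorithm consists of two stages, a spatially and temporally independent flow distribution stage and a spatially dependent inflow accumulation stage; both are defined as OpenCL kernels and executed in parallel on OpenCL devices. The inflow accumulation stage is converted from a recursive to an iterative form by means of a flow transfer matrix so that it can be computed in parallel. Compared with the parallel algorithm based on the flow transfer matrix, the parallel algorithm based on a cell in-degree matrix reduces redundant computation during iteration, but it requires high-latency atomic operations and more iterations; with limited GPU computing resources, the performance difference between the two algorithms is not significant. Experimental results show that the parallel flow accumulation algorithm achieves good speedup on an NVIDIA GeForce GT 650M GPU and that the speedup grows with grid size: flow distribution attains a speedup of about 50x to 70x and inflow accumulation 10x to 20x, demonstrating the potential of OpenCL for large-scale digital terrain analysis on GPUs and other parallel computing devices.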
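
The recursive-to-iterative conversion described in this abstract can be illustrated with a small CPU-side sketch in Python. It is not the paper's OpenCL kernel: the function name `accumulate_flow`, the dictionary layout of `fractions`, and the unit of flow assigned to every cell are assumptions made for illustration. Each sweep pushes the flow that is still moving one step downstream, and every cell's new value depends only on the previous sweep, which is the property a per-cell GPU kernel would exploit.

```python
def accumulate_flow(fractions, n_cells, tol=1e-12):
    """Iterative flow accumulation on an acyclic flow graph (illustrative).

    fractions[i] maps each downstream neighbour j of cell i to the share of
    i's flow that j receives (hypothetical layout, not from the paper).
    """
    # Gather form: for every cell, list its (upstream cell, share) pairs.
    inflow = [[] for _ in range(n_cells)]
    for i, targets in fractions.items():
        for j, share in targets.items():
            inflow[j].append((i, share))

    moving = [1.0] * n_cells  # flow still travelling; each cell starts with 1 unit
    total = [1.0] * n_cells   # flow accumulated at each cell so far
    for _ in range(n_cells):  # an acyclic graph drains within n_cells sweeps
        # One sweep: each cell gathers from its upstream neighbours only,
        # so all cells could be processed concurrently.
        nxt = [sum(moving[i] * s for i, s in inflow[j]) for j in range(n_cells)]
        if max(nxt) <= tol:
            break
        total = [t + x for t, x in zip(total, nxt)]
        moving = nxt
    return total


# Cell 0 splits its flow evenly between cells 1 and 2, which both drain to 3.
example = {0: {1: 0.5, 2: 0.5}, 1: {3: 1.0}, 2: {3: 1.0}}
print(accumulate_flow(example, 4))
```

The gather formulation avoids concurrent writes to the same cell, at the cost of repeating work in every sweep; this is the redundancy the abstract contrasts with the in-degree variant, which needs atomic operations instead.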

6.
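The FFT (fast Fourier transform) is an efficient algorithm for computing the DFT (discrete Fourier transform) and is widely used in many fields of science and engineering. Since the FFT appeared, optimizations have been studied extensively, from early work on reducing complexity to recent work on large-scale parallel FFT computation. In parallel computing, the spread of programmable, parallel GPUs, and in particular the arrival of the general-purpose parallel computing architecture CUDA, has greatly strengthened GPU computing capability and brought marked improvements in programming and optimization. Accordingly, based on an analysis of FFT implementations, this paper studies a parallel FFT method suited to GPUs and implements it on the GPU with CUDA. In theory, with data transfer excluded, the method reduces the time complexity of a one-dimensional FFT from O(N log_2 N) to O((N/r) log_2 N). Validation shows that the proposed CUDA-based parallel FFT achieves good speedup and that its numerical accuracy meets practical requirements, which demonstrates the correctness and effectiveness of the method.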
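
The complexity claim can be made concrete with a small CPU-side sketch in Python; it is a textbook iterative radix-2 FFT, not the paper's CUDA implementation. A transform of length N has log_2 N stages of N/2 butterflies, and the butterflies within one stage touch disjoint pairs of elements, so they can run concurrently. Reading r as the number of operations executed at once is an assumption, since the abstract does not define r. The names `bit_reverse` and `fft_radix2` are illustrative.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2(x):
    """Iterative radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    a = [0j] * n
    for i in range(n):
        a[bit_reverse(i, bits)] = complex(x[i])

    size = 2
    while size <= n:  # log2(n) stages
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        # The n/2 butterflies of this stage work on disjoint pairs, so a GPU
        # could assign one thread to each of them.
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w
                a[start + k] = u + v
                a[start + k + half] = u - v
                w *= w_step
        size *= 2
    return a


print(fft_radix2([1, 0, 0, 0]))
```

The stages themselves stay sequential, because each one reads the output of the previous one; only the work inside a stage is divided among threads.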

7.
The use of Graphics Processing Units (GPUs) for high‐performance computing has gained growing momentum in recent years. Unfortunately, GPU‐programming platforms like Compute Unified Device Architecture (CUDA) are complex, user unfriendly, and increase the complexity of developing high‐performance parallel applications. In addition, runtime systems that execute those applications often fail to fully utilize the parallelism of modern CPU‐GPU systems. Typically, parallel kernels run entirely on the most powerful device available, leaving other devices idle. These observations sparked research in two directions: (1) high‐level approaches to software development for GPUs, which strike a balance between performance and ease of programming; and (2) task partitioning to fully utilize the available devices. In this paper, we propose a framework, called PSkel, that provides a single high‐level abstraction for stencil programming on heterogeneous CPU‐GPU systems, while allowing the programmer to partition and assign data and computation to both CPU and GPU. Our current implementation uses parallel skeletons to transparently leverage Intel Threading Building Blocks (Intel Corporation, Santa Clara, CA, USA) and NVIDIA CUDA (Nvidia Corporation, Santa Clara, CA, USA). In our experiments, we observed that parallel applications with task partitioning can improve average performance by up to 76% and 28% compared with CPU‐only and GPU‐only parallel applications, respectively. Copyright © 2015 John Wiley & Sons, Ltd.

8.
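MRRR (Multiple Relatively Robust Representations) is one of the most efficient and accurate algorithms for the eigenvalue problem of symmetric tridiagonal matrices. Based on an analysis of the MRRR algorithm and the CUDA (Compute Unified Device Architecture) parallel architecture, and exploiting the algorithm's parallelism, a CUDA-based parallel MRRR is implemented in single-instruction, multiple-thread fashion and further optimized with respect to its storage structure. Experimental results show that, compared with the serial MRRR implementation in the LAPACK library, the parallel method attains a 20x speedup while preserving accuracy, which shows in terms of both accuracy and running time that MRRR is well suited to parallel execution on GPUs.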

9.
Graphics Processing Units (GPUs) have evolved into highly parallel and fully programmable architectures over the past five years, and the advent of CUDA has facilitated their use in many real-world applications. In this paper, we deal with a GPU implementation of Ant Colony Optimization (ACO), a population-based optimization method which comprises two major stages: tour construction and pheromone update. Because of its inherently parallel nature, ACO is well-suited to GPU implementation, but it also poses significant challenges due to irregular memory access patterns. Our contribution within this context is threefold: (1) a data parallelism scheme for tour construction tailored to GPUs, (2) novel GPU programming strategies for the pheromone update stage, and (3) a new mechanism called I-Roulette to replicate the classic roulette wheel while improving GPU parallelism. Our implementation leads to factor gains exceeding 20x for either of the two stages of the ACO algorithm as applied to the TSP when compared to its sequential counterpart running on a similar single-threaded high-end CPU. Moreover, an extensive discussion focused on different implementation paths on GPUs shows the way to deal with parallel graph connected components. This, in turn, suggests a broader area of inquiry, where algorithm designers may learn to adapt similar optimization methods to GPU architecture.
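
For readers unfamiliar with the classic roulette wheel mentioned in this abstract, the following Python sketch shows the sequential version used during tour construction. It does not reproduce I-Roulette, whose details are not given here. The function names and the parameter values `alpha=1.0` and `beta=2.0` are assumptions for illustration; the weight formula is the usual ACO proportional rule (pheromone raised to alpha times inverse distance raised to beta).

```python
import random


def roulette_select(weights, rng=random):
    """Pick index i with probability weights[i] / sum(weights)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    threshold = rng.random() * total
    running = 0.0
    last_positive = None
    for i, w in enumerate(weights):
        if w <= 0:
            continue
        running += w
        last_positive = i
        if threshold < running:
            return i
    return last_positive  # guards against floating-point round-off


def next_city(current, visited, pheromone, distance,
              alpha=1.0, beta=2.0, rng=random):
    """Choose the next city of a tour; `current` must already be in `visited`."""
    weights = [
        0.0 if j in visited
        else (pheromone[current][j] ** alpha) * ((1.0 / distance[current][j]) ** beta)
        for j in range(len(distance))
    ]
    return roulette_select(weights, rng)


dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
tau = [[1.0] * 3 for _ in range(3)]
print(next_city(0, {0, 1}, tau, dist))  # only city 2 is still unvisited
```

The running sum makes the selection inherently sequential over the candidate cities, which is the kind of dependency a GPU-oriented replacement has to remove.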

10.
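Zhang Jiakang (张佳康), Chen Qingkui (陈庆奎). Computer Engineering (《计算机工程》), 2010, 36(15): 179-181
To examine the suitability of the GPU, a stream-processor device with high floating-point performance, for neural networks, a parallel recognition algorithm for convolutional neural networks is proposed. It uses Compute Unified Device Architecture (CUDA) technology, defines parallel data structures on top of it, and describes the mechanism for mapping computational tasks to CUDA. Experimental results show that the parallel recognition algorithm implemented on a GPU with the GTX200 hardware architecture raises the average peak floating-point performance by nearly 60x over the serial CPU algorithm, making it better suited to neural-network applications.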

11.
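Prestack reverse time migration (RTM) is currently the most accurate seismic data imaging method in seismic exploration. It extrapolates the wavefield with the two-way acoustic wave equation and can accurately image media with complex structures. This paper uses a cross-correlation imaging condition that correlates the source wavefield with the receiver wavefield at the same time step. To address RTM's heavy computational cost, graphics processing units (GPUs) are introduced into the RTM computation: the many-core structure of the GPU is fully exploited, and a CUDA-based parallel algorithm replaces traditional serial CPU computation to accelerate wavefield extrapolation and cross-correlation imaging, the more time-consuming parts of reverse time migration. Tests on a complex model show that, while preserving RTM imaging accuracy, the GPU-parallel algorithm greatly improves computational efficiency compared with conventional CPU computation, enabling efficient, high-accuracy imaging of complex media with GPU-accelerated prestack reverse time migration.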

12.
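With the rapid growth of image data, computing devices built on traditional single-processor or multiprocessor architectures can no longer meet real-time data processing requirements. Heterogeneous parallel computing is attracting wide attention and use because of its high computational efficiency and its capacity for real-time parallel data processing. Exploiting the GPU's strengths in parallel graphics and image processing, a parallel design method for the JPEG compression algorithm based on OpenCL is proposed. The JPEG algorithm is decomposed into several kernels whose execution order is controlled through events passed between them, and the parallel algorithm is verified by simulation on a heterogeneous GPU+CPU platform. Experimental results show that, compared with serial CPU processing, the proposed parallel algorithm improves execution efficiency and greatly reduces execution time at the same image quality, and that the efficiency gain becomes more pronounced as image size increases.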

13.
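A fine-grained parallel ant colony algorithm based on GPU acceleration   Total citations: 1, self-citations: 0, citations by others: 1
To improve the performance of the ant colony algorithm on large-scale traveling salesman problems, a fine-grained parallel ant colony algorithm accelerated by a graphics processing unit (GPU) is proposed. The solution process of the parallel ant colony algorithm is converted into the parallel execution of Compute Unified Device Architecture thread blocks, so that the ant colony algorithm is accelerated on the GPU. Experimental results show that the algorithm improves global search capability and increases the number of ants that the fine-grained parallel ant colony algorithm can handle, thereby improving its speed.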

14.
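To address parallel decoding of H.264 video streams, a cooperative CPU/GPU computing algorithm is proposed. With the Compute Unified Device Architecture (CUDA) language as the GPU programming model, the inverse DCT and intra prediction are accelerated on the GPU. While maintaining high numerical accuracy, mixed CUDA programming is used to improve the computing performance of the system. Using the CUDA language provided by NVIDIA, the inverse DCT and intra prediction are executed in parallel on the GPU during decoding; the parallel algorithm is compared with a CPU-only implementation, and the speedup of parallel decoding is verified with different numbers of video streams. Experimental results show that the algorithm greatly improves the encoding and decoding efficiency of video streams, with an average speedup of 10x over the CPU-only implementation.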

15.
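Remote sensing image registration is an important processing step in remote sensing applications. As the data volume of remote sensing images and the computational complexity of registration algorithms grow, registration faces a processing-speed challenge. In recent years GPU computing power has increased greatly, and GPUs have developed rapidly for general-purpose computing. Combining these strengths with the speed problem of registration, this work studies GPU-accelerated remote sensing image registration. A mutual-information registration algorithm based on wavelet decomposition, which is computationally heavy but highly accurate, is selected for GPU parallel design, and a GPU parallel design model is proposed. Commonly used memory-level optimization strategies for GPU programs are applied to the registration program, and experiments are carried out on an NVIDIA Tesla M2050 GPU using CUDA (compute unified device architecture). The results show that the proposed parallel design model and memory-level optimization strategies suit remote sensing image registration well, with a maximum speedup of 19.9x. The study indicates that general-purpose GPU computing has broad prospects in remote sensing image processing.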

16.
In this paper we propose an improved algorithm to search for optimal solutions to flow shop scheduling problems with fuzzy processing times and fuzzy due dates. A longest common substring method is combined with the random key method. Numerical simulation shows that the longest common substring method combined with the rearranging mating method improves the search efficiency of the genetic algorithm on this problem. For large problems, we also enhance the modified algorithm with CUDA-based parallel computation. Numerical experiments show that the performance of the CUDA program on the GPU compares favorably with that of traditional programs on the CPU. With the modified algorithm and the CUDA scheme, we can find satisfactory solutions to fuzzy flow shop scheduling problems with high performance.

17.
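A CUDA-based Wu-Manber multi-pattern matching algorithm   Total citations: 1, self-citations: 0, citations by others: 1
Multi-pattern matching is one of the most fundamental problems in computer science; it is used in many fields and can be time-consuming in some settings. GPUs offer greater parallel computing power than CPUs, and with the introduction of the CUDA architecture, parallel programming of GPUs for general-purpose computing has become easier. A Wu-Manber multi-pattern matching algorithm based on the CUDA architecture is implemented; experimental results show that it achieves a speedup of more than 10x over the traditional serial algorithm.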

18.
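In nonlinear systems, a particle filter needs a large number of particles to guarantee accurate state estimation, which degrades the real-time performance of the algorithm and leads to poor accuracy and timeliness in fault diagnosis. To address this problem, a parallel particle swarm optimized particle filter (PSOPF) algorithm on the GPU platform is proposed. By analyzing the parallelism of PSOPF, a parallel PSOPF algorithm based on the CUDA parallel computing architecture is designed and implemented, using a large number of GPU threads for acceleration. To overcome the low efficiency caused by non-coalesced accesses to GPU global memory in rejection resampling, the parallel rejection resampling algorithm is modified so that the threads of a warp resample particles from the same memory segment, which improves its efficiency. The algorithm is validated on fault diagnosis of a wind turbine pitch system. Experimental results show that the method meets the accuracy and real-time requirements of fault diagnosis.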
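
The rejection resampling step named in this abstract can be sketched on the CPU in Python. This is a generic rejection resampler, not the paper's modified warp-local version; the function name `rejection_resample` and its interface are assumptions. Each particle draws its ancestor independently of all others, which is what makes the scheme attractive for one-thread-per-particle execution, while the random index `j` is also the source of the scattered memory reads that the abstract sets out to reduce.

```python
import random


def rejection_resample(weights, rng=random):
    """Return one ancestor index per particle (generic rejection resampler).

    Particle i first proposes itself, then uniformly random particles; a
    proposal j is accepted with probability weights[j] / max(weights).
    """
    n = len(weights)
    w_max = max(weights)
    if w_max <= 0:
        raise ValueError("at least one weight must be positive")
    ancestors = []
    for i in range(n):  # iterations are independent: one GPU thread each
        j = i
        while rng.random() * w_max >= weights[j]:
            j = rng.randrange(n)  # random read: non-coalesced on a GPU
        ancestors.append(j)
    return ancestors


print(rejection_resample([0.0, 0.0, 1.0, 0.0]))  # every ancestor is particle 2
```

No prefix sum over the weights is needed, only their maximum, so the threads do not have to synchronize beyond a single reduction.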

19.
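Real-time volume rendering of massive spatial data based on CUDA   Total citations: 1, self-citations: 0, citations by others: 1
To meet the need for detailed, real-time 3D rendering of massive space science data, a parallel ray-casting volume rendering acceleration algorithm based on the CUDA language is proposed and implemented. It exploits the parallelizable nature of ray casting among traditional volume rendering algorithms and the fast texture lookups of the GPU, achieves accurate sampling of irregularly sampled data through a function that converts real coordinates to texture coordinates, and completes the CUDA parallelization of the rendering algorithm. Using the parallel computing power of the GPU through CUDA, it achieves real-time 3D ray-casting rendering of massive spatial data.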

20.
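The Laplacian edge detection algorithm is often used to remove cosmic-ray noise from CCD astronomical images, but its serial version has high computational complexity. The parallelism of the algorithm is therefore analyzed, and a CUDA-based graphics processing unit (GPU) parallel Laplacian edge detection algorithm is proposed in the Compute Unified Device Architecture (CUDA) parallel programming environment. The astronomical image is split into several sub-images, the Block and Grid sizes are set according to the GPU hardware configuration, the sub-images are transferred in turn to the graphics card for parallel computation, and the results are copied back to main memory and stitched into the complete output image. Experimental results show that the larger the image, the greater the speed advantage of the parallel algorithm over the serial one, with speedups of more than 10x.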
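
The split, compute, and stitch steps in this abstract can be sketched on the CPU in Python. The sketch assumes the standard 4-neighbour Laplacian stencil and horizontal strips with a one-row overlap (halo); the abstract specifies neither, and the names `laplacian`, `laplacian_tiled`, and `tile_rows` are illustrative. The halo is what lets each strip be filtered on its own and still match the whole-image result after stitching.

```python
def laplacian(img):
    """4-neighbour Laplacian of a 2D list; border pixels are left at 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = (img[y - 1][x] + img[y + 1][x] +
                         img[y][x - 1] + img[y][x + 1] - 4 * img[y][x])
    return out


def laplacian_tiled(img, tile_rows):
    """Filter the image strip by strip and stitch the strips back together."""
    h = len(img)
    out = []
    for top in range(0, h, tile_rows):
        bottom = min(top + tile_rows, h)
        lo = max(top - 1, 0)     # one halo row above the strip, if any
        hi = min(bottom + 1, h)  # one halo row below the strip, if any
        strip = laplacian(img[lo:hi])  # stands in for one transfer + kernel launch
        out.extend(strip[top - lo: top - lo + (bottom - top)])  # drop halo rows
    return out


# A single bright pixel, as a cosmic-ray hit might appear.
image = [[0] * 5 for _ in range(5)]
image[2][2] = 5
assert laplacian_tiled(image, 2) == laplacian(image)
```

Choosing `tile_rows` trades transfer granularity against halo overhead: smaller strips fit in less device memory but re-send proportionally more overlapping rows.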
