Similar Documents
19 similar documents found (search time: 78 ms)
1.
Cai Yong, Li Sheng. Journal of Computer Applications, 2016, 36(3): 628-632
To address the high hardware cost and low development efficiency of conventional parallel approaches to fast structural topology optimization, a full-pipeline parallel computing strategy for the Bi-directional Evolutionary Structural Optimization (BESO) method based on Matlab and the Graphics Processing Unit (GPU) was proposed. First, the advantages, disadvantages, and applicable scope of three ways of realizing GPU parallel computing in the Matlab environment were examined. Then, vector and dense-matrix operations in the topology optimization algorithm were parallelized directly with built-in functions; the sparse finite element equation systems were solved quickly by calling the CUSOLVER library through MEX functions; and element sensitivity analysis and other optimization decisions were parallelized with Parallel Thread Execution (PTX) code. Numerical examples show that developing GPU parallel programs directly in Matlab is not only highly productive but also avoids precision discrepancies between programming languages, so the GPU program achieves a considerable speedup while leaving the computed results unchanged.
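The PTX route the abstract mentions amounts to compiling a CUDA kernel to PTX and launching it from Matlab (e.g., via parallel.gpu.CUDAKernel) on gpuArray data. Below is a minimal sketch of what such an element-sensitivity kernel might look like; the strain-energy sensitivity alpha_e = 0.5 * ue' * K0 * ue for a uniform 8-DOF element is a standard BESO choice assumed here, not taken from the paper:

```cuda
// element_sensitivity.cu -- hypothetical BESO sensitivity kernel.
// Compiled with `nvcc -ptx element_sensitivity.cu`, the resulting PTX could be
// loaded from Matlab via parallel.gpu.CUDAKernel (the route the paper names).
extern "C" __global__
void elementSensitivity(const double *U,      // global displacement vector
                        const int    *edof,   // nElem x 8 element dof indices
                        const double *K0,     // 8x8 reference stiffness, row-major
                        double       *alpha,  // output: one sensitivity per element
                        int nElem)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nElem) return;

    double ue[8];
    for (int i = 0; i < 8; ++i)            // gather element displacements
        ue[i] = U[edof[e * 8 + i]];

    double s = 0.0;                        // s = ue' * K0 * ue
    for (int i = 0; i < 8; ++i) {
        double row = 0.0;
        for (int j = 0; j < 8; ++j)
            row += K0[i * 8 + j] * ue[j];
        s += ue[i] * row;
    }
    alpha[e] = 0.5 * s;                    // element strain-energy sensitivity
}
```

From Matlab the compiled PTX would be loaded once and relaunched each BESO iteration on gpuArray inputs, so intermediate data never leaves the GPU.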

2.
To overcome the high computational complexity and poor real-time performance of high-quality contour extraction, an efficient parallel method for the Pb (probability boundary) extraction algorithm was proposed on a GPU parallel computing architecture. The focus is on accelerating the most time-consuming part, gradient computation, with multiple compute units. The parallel multi-orientation histogram statistics mechanism and the memory-access-conflict avoidance mechanism in the parallel χ² computation are described in detail. Comparative experiments show that contour extraction with this parallel method on the GPU is markedly faster than the traditional CPU approach, and the speedup grows with image resolution; for example, a 160× speedup is obtained on 1024×1024 images. The method also preserves the accuracy of the original algorithm on the Berkeley benchmark dataset, providing a fast, real-time contour extraction approach for intelligent analysis of large-scale image data.
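For reference, the χ² distance used by Pb typically compares two half-disc histograms g and h as chi^2(g, h) = (1/2) * sum_i (g_i - h_i)^2 / (g_i + h_i). A common way to realize conflict-reduced histogram statistics on a GPU, in the spirit of the paper's mechanism, is per-block privatization in shared memory; a minimal self-contained sketch (the bin count, a single histogram instead of one per orientation, and the synthetic input are our simplifications):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BINS 32   // assumed number of intensity bins per histogram

// Each block builds a private histogram in shared memory (fast atomics, no
// global-memory contention), then merges it into the global histogram once.
__global__ void histogram(const unsigned char *pix, int n, unsigned int *gHist)
{
    __shared__ unsigned int sHist[BINS];
    for (int i = threadIdx.x; i < BINS; i += blockDim.x) sHist[i] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&sHist[pix[i] * BINS / 256], 1u);   // map 0..255 to a bin
    __syncthreads();

    for (int i = threadIdx.x; i < BINS; i += blockDim.x)
        atomicAdd(&gHist[i], sHist[i]);               // one global atomic per bin
}

int main()
{
    const int n = 1 << 20;
    unsigned char *d_pix;  unsigned int *d_hist;
    cudaMalloc(&d_pix, n);
    cudaMemset(d_pix, 127, n);                        // dummy mid-gray image
    cudaMalloc(&d_hist, BINS * sizeof(unsigned int));
    cudaMemset(d_hist, 0, BINS * sizeof(unsigned int));

    histogram<<<128, 256>>>(d_pix, n, d_hist);

    unsigned int h[BINS];
    cudaMemcpy(h, d_hist, sizeof(h), cudaMemcpyDeviceToHost);
    printf("bin 15 = %u\n", h[15]);                   // all pixels fall in bin 15
    cudaFree(d_pix); cudaFree(d_hist);
    return 0;
}
```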

3.
Research on a GPU-based parallel method for computing correlation coefficients over multiple data streams*   (total citations: 1; self-citations: 1; other citations: 1)
To meet the real-time requirements of multi-data-stream processing, a four-layer sliding-window model spanning the PCIe bus and a GPU-based parallel processing framework for multiple data streams are proposed. Under this framework, the sliding real-time statistics of a very large number of data streams can be maintained in parallel, while the correlation coefficient between any two streams is computed in parallel with an exact method. Compared with a CPU-only implementation in the same experimental environment, the new method delivers significantly better real-time performance.
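One way to realize the exact pairwise computation is to keep per-window sufficient statistics (sum x, sum x^2, and pairwise sum xy) on the device and assign one thread to each stream pair; the Pearson coefficient then follows from r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2)(n*Syy - Sy^2)). A kernel sketch under that assumption (host-side maintenance of the four-layer sliding windows is omitted, and all names are ours):

```cuda
// One thread per stream pair (i, j): exact Pearson correlation from the
// per-window sufficient statistics. Layouts and names are illustrative.
__global__ void pairCorr(const float *sx,   // per-stream sum(x), length nStreams
                         const float *sxx,  // per-stream sum(x^2)
                         const float *sxy,  // pairwise sum(x*y), nStreams x nStreams
                         float *r, int nStreams, int n /* window length */)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nStreams || j >= nStreams || j <= i) return;  // upper triangle only

    float num = n * sxy[i * nStreams + j] - sx[i] * sx[j];
    float di  = n * sxx[i] - sx[i] * sx[i];
    float dj  = n * sxx[j] - sx[j] * sx[j];
    r[i * nStreams + j] = num * rsqrtf(di * dj);
}
```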

4.
A parallel method for computing data-stream quantiles using GPU technology   (total citations: 1; self-citations: 0; other citations: 1)
Zhou Yong, Wang Hao, Cheng Chuntian. Journal of Computer Applications, 2010, 30(2): 543-546
Data streams arrive in real time, continuously, and at high speed, so they must also be processed in real time. Quantiles are often used to summarize the statistics of low-dimensional data streams. Exploiting the massive computing power and high memory bandwidth of the Graphics Processing Unit (GPU) to compute data-stream quantiles, this paper proposes a data-stream processing model based on the Compute Unified Device Architecture (CUDA) and a parallel quantile computation method built on that model. Experiments show that, while providing accuracy no lower than that of a pure-CPU quantile algorithm, the method significantly increases the real-time throughput of quantile computation.
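The simplest exact-quantile step on the GPU sorts the current window and reads the order statistics directly; a self-contained sketch using Thrust (a generic illustration of the idea, not the paper's CUDA model):

```cuda
#include <cstdio>
#include <initializer_list>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

int main()
{
    const int n = 1 << 20;
    thrust::host_vector<float> h(n);                  // synthetic window
    for (int i = 0; i < n; ++i)
        h[i] = (float)((1664525u * i + 1013904223u) % 1000);

    thrust::device_vector<float> d = h;   // window arrives on the device
    thrust::sort(d.begin(), d.end());     // parallel sort on the GPU

    // the phi-quantile is (approximately) element phi*n - 1 of the sorted window
    for (float phi : {0.25f, 0.5f, 0.99f}) {
        int idx = (int)(phi * n) - 1;
        printf("q(%.2f) = %.1f\n", phi, (float)d[idx]);
    }
    return 0;
}
```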

5.
To reduce the high solution-time complexity that modern optimization algorithms face on relatively complex problems, GPU-based parallel processing is introduced. The parallel programming model of the Compute Unified Device Architecture (CUDA) is first explained from a high level, and then CUDA-based parallel implementations of five representative modern optimization algorithms (simulated annealing, tabu search, the genetic algorithm, particle swarm optimization, and artificial neural networks) are presented in the GPU environment. Based on a comparative analysis of experimental results collected in different environments, the advantages of the GPU's single-instruction multiple-thread (SIMT) parallel optimization strategy and its future development trends are discussed.
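As one concrete instance of the SIMT strategy, the particle swarm update maps naturally to one thread per particle; a kernel sketch with common PSO constants (fitness evaluation and the global-best reduction would be separate kernels, and the per-thread random generator is a stand-in for a proper library such as cuRAND):

```cuda
#define DIM 2   // assumed problem dimensionality

__device__ float frand(unsigned int &s)        // tiny per-thread LCG in [0,1)
{
    s = 1664525u * s + 1013904223u;
    return (s >> 8) * (1.0f / 16777216.0f);
}

// One SIMT thread per particle: the standard velocity/position update.
__global__ void psoUpdate(float *x, float *v, const float *pbest,
                          const float *gbest, unsigned int *seed, int nPart)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPart) return;

    const float w = 0.729f, c1 = 1.49f, c2 = 1.49f;   // common PSO constants
    unsigned int s = seed[p];
    for (int d = 0; d < DIM; ++d) {
        int i = p * DIM + d;
        v[i] = w * v[i]
             + c1 * frand(s) * (pbest[i] - x[i])       // pull toward own best
             + c2 * frand(s) * (gbest[d] - x[i]);      // pull toward swarm best
        x[i] += v[i];
    }
    seed[p] = s;
}
```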

6.
Objective: Geometric correction (also called geocoding) is an important and computationally demanding step in the Synthetic Aperture Radar (SAR) image processing chain, and it requires a geometric positioning model. For spaceborne SAR images, this paper adopts the Rational Polynomial Coefficient (RPC) positioning model and proposes a massively parallel geometric correction method supported by the Graphics Processing Unit (GPU). Method: The method exploits the GPU's abundant computing resources and the fact that every pixel goes through identical processing steps: a large batch of pixels is transferred to the GPU at a time, one thread is assigned to each pixel, and each thread performs the computationally expensive steps such as rational function evaluation, projection transformation, and interpolation resampling. Parallel performance is further improved by tuning the dimGrid and dimBlock launch parameters. Large SAR scenes are handled through block-wise processing, and multiple block sizes are supported. Result: Experiments show a computational speedup of 38-44×. To analyze GPU parallel processing comprehensively and objectively, the overall speedup was also measured; several experiments identify the factors limiting overall speedup, and reading and writing in large blocks is proposed to improve I/O performance. Conclusion: The method is simple in form, general, applicable to nearly all spaceborne SAR images and image sizes, and delivers clear acceleration.
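The per-pixel structure the abstract describes can be sketched as follows; a placeholder affine transform stands in for the RPC rational-polynomial evaluation so that the locate-transform-resample pipeline stays visible (all names are ours):

```cuda
// One thread per output pixel, mirroring the paper's mapping. The true RPC
// model evaluates ratios of cubic polynomials in (lat, lon, h); here a
// placeholder affine transform stands in for it.
__global__ void geocode(const float *src, int sw, int sh,   // slant-range image
                        float *dst, int dw, int dh,         // geocoded output
                        float a0, float a1, float a2,       // placeholder for the
                        float b0, float b1, float b2)       // RPC rational model
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dw || y >= dh) return;

    float u = a0 + a1 * x + a2 * y;        // source column (RPC would go here)
    float v = b0 + b1 * x + b2 * y;        // source row
    int u0 = (int)floorf(u), v0 = (int)floorf(v);
    float fu = u - u0, fv = v - v0;

    float val = 0.0f;                      // bilinear resampling with bounds check
    if (u0 >= 0 && v0 >= 0 && u0 + 1 < sw && v0 + 1 < sh) {
        val = (1 - fv) * ((1 - fu) * src[v0 * sw + u0] + fu * src[v0 * sw + u0 + 1])
            +      fv  * ((1 - fu) * src[(v0 + 1) * sw + u0] + fu * src[(v0 + 1) * sw + u0 + 1]);
    }
    dst[y * dw + x] = val;
}
// Launch: dim3 block(16,16); dim3 grid((dw+15)/16, (dh+15)/16); tuning these
// is exactly the dimGrid/dimBlock optimization the paper describes.
```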

7.
The growing demand for large-extent, high-resolution digital terrain data poses new challenges to compute-intensive digital terrain analysis algorithms such as flow accumulation. Targeting heterogeneous CPU/GPU (Graphics Processing Unit) platforms, a parallelization strategy for a multiple-flow-direction flow accumulation algorithm based on OpenCL (Open Computing Language) is proposed; it offers better platform independence and portability and simplifies parallel application development on heterogeneous CPU/GPU platforms. The parallel flow accumulation algorithm comprises two stages: flow partitioning, which is independent in space and time, and flow accumulation, which is spatially dependent. Both are defined as OpenCL kernels and executed in parallel on an OpenCL device, and the accumulation stage is converted from a recursive to an iterative formulation with the help of a flow-transfer matrix. Compared with the flow-transfer-matrix-based parallel algorithm, a parallel algorithm based on a cell in-degree matrix reduces redundant computation in the iterations, but it requires high-latency atomic operations and more iterations, so under limited GPU resources the two algorithms perform similarly. Experiments on an NVIDIA GeForce GT 650M GPU show good speedups that grow with grid size: about 50-70× for flow partitioning and 10-20× for flow accumulation, demonstrating the potential of OpenCL on GPUs and other parallel devices for large-scale digital terrain analysis.
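The spatially independent flow-partitioning stage is the easy half to parallelize: each cell splits its flow among its lower neighbors in proportion to slope. A sketch of that kernel follows; the paper implements this as an OpenCL kernel, and CUDA is used here only for consistency with the other sketches in this list (the slope exponent p = 1.1 is a common MFD choice, not necessarily the paper's):

```cuda
// One thread per DEM cell: compute the fraction of flow sent to each of the
// eight neighbors (one row of the flow-transfer matrix). Border cells and the
// spatially dependent accumulation stage are omitted.
__global__ void flowPartition(const float *dem, float *w8, int W, int H)
{
    const int dx[8] = {-1, 0, 1, -1, 1, -1, 0, 1};
    const int dy[8] = {-1, -1, -1, 0, 0, 1, 1, 1};
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= W - 1 || y >= H - 1) return;

    int i = y * W + x;
    float s[8], total = 0.0f;
    for (int k = 0; k < 8; ++k) {
        float drop = dem[i] - dem[(y + dy[k]) * W + (x + dx[k])];
        float dist = (dx[k] && dy[k]) ? 1.41421356f : 1.0f;   // diagonal step
        s[k] = drop > 0.0f ? powf(drop / dist, 1.1f) : 0.0f;  // downhill only
        total += s[k];
    }
    for (int k = 0; k < 8; ++k)            // normalized flow fractions
        w8[i * 8 + k] = total > 0.0f ? s[k] / total : 0.0f;
}
```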

8.
Image processing algorithms are becoming ever more complex and place growing demands on CPU performance that traditional CPU-based image processing cannot meet. This paper therefore studies and implements image processing algorithms on the Graphics Processing Unit (GPU) with the Compute Unified Device Architecture (CUDA). Exploiting the GPU's outstanding parallel processing capability, the algorithms are implemented in C++ using CUDA. Parallel GPU workflows are designed for Gaussian blur, color negative, and transparent blending, and performance comparisons against CPU implementations producing identical results demonstrate the efficiency of the GPU-based algorithms.
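Two of the three named algorithms reduce to one-thread-per-pixel kernels; a minimal sketch (the 3×3 binomial weights for the Gaussian case are a common choice, not necessarily the paper's kernel size):

```cuda
// One thread per pixel applying a 3x3 Gaussian kernel (weights 1-2-1 / 16).
__global__ void gauss3x3(const unsigned char *in, unsigned char *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1) return;  // skip border

    const int k[3][3] = { {1, 2, 1}, {2, 4, 2}, {1, 2, 1} };   // binomial weights
    int acc = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            acc += k[dy + 1][dx + 1] * in[(y + dy) * w + (x + dx)];
    out[y * w + x] = (unsigned char)(acc >> 4);                // divide by 16
}

// The color negative is even simpler: one thread per channel byte.
__global__ void negative(const unsigned char *in, unsigned char *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];
}
```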

9.
Parallel explicit finite element computation with the central difference scheme on a general-purpose GPU computing platform   (total citations: 3; self-citations: 0; other citations: 3)
The explicit finite element method is effective for planar nonlinear dynamic problems. Because the explicit algorithm is only conditionally stable, solving large finite element problems requires long computation times. As a highly parallel general-purpose processor, the Graphics Processing Unit (GPU) is well suited to accelerating large-scale scientific computation, and the Compute Unified Device Architecture (CUDA) provides an efficient, convenient way to program it. A parallel explicit finite element method with the central difference scheme was therefore built on the GPU general-purpose computing platform. Tailored to the characteristics of GPU computing, the serial workflow was optimized and restructured, and a one-to-one mapping between threads and elements or nodes fully parallelizes the iteration. Numerical examples show that, with identical accuracy, the method substantially improves computational efficiency on an NVIDIA GTX 460 card, making it an efficient and convenient numerical approach to planar nonlinear dynamic problems.
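With a lumped (diagonal) mass matrix, the central-difference update decouples by degree of freedom, which is what the one-to-one thread mapping exploits; a sketch of the time-integration kernel (internal-force assembly would be a separate per-element kernel; names and layout are illustrative):

```cuda
// One thread per degree of freedom: the central-difference time step
// u(t+dt) = 2*u(t) - u(t-dt) + dt^2 * M^-1 * (f_ext - f_int)
// with a lumped mass matrix, so no linear system is solved.
__global__ void centralDifferenceStep(const float *m,      // lumped nodal mass
                                      const float *fext,   // external force
                                      const float *fint,   // assembled internal force
                                      const float *uPrev,  // u at t - dt
                                      const float *uCurr,  // u at t
                                      float *uNext,        // u at t + dt (output)
                                      float dt, int nDof)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nDof) return;
    float acc = (fext[i] - fint[i]) / m[i];                 // nodal acceleration
    uNext[i] = 2.0f * uCurr[i] - uPrev[i] + dt * dt * acc;
}
```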

10.
For total variation (TV) based image denoising, where the central processing unit (CPU) is comparatively slow, a parallel computing method on the graphics processing unit (GPU) is proposed. The dual model of the TV minimization problem is considered, the relationship between the primal and dual variables is established, and the dual variable is solved with a gradient projection algorithm. Numerical experiments were run on both the GPU and the CPU. The results show that the dual algorithm for the TV denoising model executes more efficiently on the GPU than on the CPU, and the advantage of GPU parallel computing grows with image size.
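A Chambolle-style gradient-projection step for the dual TV model alternates w = div(p) - g/lambda with a projected update of p; a two-kernel sketch using the usual forward-difference gradient and backward-difference divergence (a generic rendering of the dual algorithm, not the paper's code):

```cuda
// Kernel 1: w = div(p) - g/lambda (backward-difference divergence).
__global__ void divMinusData(const float *px, const float *py, const float *g,
                             float lambda, float *w, int W, int H)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    int i = y * W + x;
    float d = px[i] - (x > 0 ? px[i - 1] : 0.0f)
            + py[i] - (y > 0 ? py[i - W] : 0.0f);
    w[i] = d - g[i] / lambda;
}

// Kernel 2: p <- proj(p + tau * grad(w)), projected onto pointwise unit balls.
__global__ void dualUpdate(float *px, float *py, const float *w,
                           float tau, int W, int H)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    int i = y * W + x;
    float gx = (x < W - 1 ? w[i + 1] - w[i] : 0.0f); // forward-difference grad
    float gy = (y < H - 1 ? w[i + W] - w[i] : 0.0f);
    float nx = px[i] + tau * gx, ny = py[i] + tau * gy;
    float s = fmaxf(1.0f, sqrtf(nx * nx + ny * ny)); // project onto |p| <= 1
    px[i] = nx / s;  py[i] = ny / s;
}
// After convergence the denoised image is u = g - lambda * div(p).
```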

11.
A survey of CPU/GPU cooperative parallel computing   (total citations: 3; self-citations: 3; other citations: 3)
With strong computing power, a high performance-to-price ratio, and low energy consumption, heterogeneous CPU/GPU hybrid parallel systems have become a new class of high-performance computing platform, but their complex architecture poses great challenges for parallel computing research. CPU/GPU cooperative parallel computing is an emerging, open research area. This survey classifies CPU/GPU cooperative parallel computing research into three categories according to the scale of the computing resources involved, then reviews several hybrid computing projects in terms of their rationale, research content, and research methods, and identifies directions for further research, offering a reference for domain scientists working on cooperative parallel computing.

12.
The Support Vector Machine (SVM) is an efficient machine learning tool with high accuracy. However, to achieve the best accuracy, n-fold cross validation is commonly used to identify the best hyperparameters for the SVM. This is a weak point of SVM because of the extremely long training time across the candidate hyperparameters of different kernel functions. In this paper, a novel parallel SVM training implementation is proposed that accelerates the cross-validation procedure by running multiple training tasks simultaneously on a Graphics Processing Unit (GPU). All tasks, with their different hyperparameters, share the same cache memory storing the kernel matrix of the support vectors, which greatly reduces redundant computation of kernel values across training tasks. Since kernel-value computation is the most time-consuming operation in SVM training, the total cost of the cross-validation procedure decreases significantly. Experimental tests indicate that the time cost of multitask cross-validation training is very close to that of the slowest task trained alone. Comparison tests show the proposed method to be 10 to 100 times faster than the state-of-the-art LIBSVM tool.
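The piece shared across tasks is the cache of kernel values. As a minimal sketch of that idea, the RBF Gram matrix K[i][j] = exp(-gamma * ||xi - xj||^2) can be computed once, one thread per entry, and then read by every cross-validation task that shares the same kernel parameters (a simplification of the paper's support-vector cache):

```cuda
// One thread per (i, j) entry of the RBF Gram matrix. Tasks that differ only
// in the regularization parameter C can all read this one matrix.
__global__ void rbfGram(const float *X,   // n x d training samples, row-major
                        int n, int d, float gamma, float *K)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || j >= n) return;

    float dist2 = 0.0f;
    for (int k = 0; k < d; ++k) {                 // squared Euclidean distance
        float diff = X[i * d + k] - X[j * d + k];
        dist2 += diff * diff;
    }
    K[i * n + j] = expf(-gamma * dist2);
}
```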

13.
Packet classification is a basic processing pattern in network devices, typically realized by a packet filtering system that classifies every packet. Traditional packet classification struggles with today's ever-higher network traffic: classification is slower than the rate at which packets arrive at the network interface, making real-time analysis impossible. This paper therefore proposes classifying large packet sets in parallel on the GPU, using its thread-level parallelism to raise classification throughput, and analyzes the performance and optimization methods in detail. Experiments show that the GPU-accelerated Linear Search and RFC packet classification algorithms achieve speedups of 4.4-132.5× over execution on a pure-CPU system.
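The Linear Search variant maps one thread to one packet, each walking the rule table in priority order; a sketch with a simplified two-field prefix rule (real classifiers match full 5-tuples):

```cuda
// Simplified classification rule: source/destination prefixes plus an action.
struct Rule { unsigned int saddr, smask, daddr, dmask; int action; };

// One thread per packet: record the first (highest-priority) matching rule.
__global__ void classify(const unsigned int *src, const unsigned int *dst,
                         int nPkts, const Rule *rules, int nRules, int *out)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPkts) return;

    int action = -1;                               // default: no rule matched
    for (int r = 0; r < nRules; ++r) {
        if ((src[p] & rules[r].smask) == rules[r].saddr &&
            (dst[p] & rules[r].dmask) == rules[r].daddr) {
            action = rules[r].action;
            break;
        }
    }
    out[p] = action;
}
```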

14.
A parallel implementation via CUDA of the dynamic programming method for the knapsack problem on an NVIDIA GPU is presented. A GTX 260 card with 192 cores (1.4 GHz) is used for the computational tests, and the processing times obtained with the parallel code are compared with those of the sequential code on a CPU with an Intel Xeon 3.0 GHz. The results show a speedup factor of 26 for large problem instances. Furthermore, to limit communication between the CPU and the GPU, a compression technique is presented that significantly decreases memory occupancy.
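The DP recurrence is independent across capacities within one item row, which is what makes it GPU-friendly; a self-contained sketch with double buffering (a generic rendering of the method; the paper's compression technique for reducing CPU-GPU transfers is not shown):

```cuda
#include <cstdio>
#include <utility>
#include <cuda_runtime.h>

// One thread per capacity c for the current item row:
// next[c] = max(prev[c], prev[c - w] + v). Double buffering keeps it race-free.
__global__ void knapsackRow(const int *prev, int *next, int W, int w, int v)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c > W) return;
    int best = prev[c];                            // skip item
    if (c >= w) best = max(best, prev[c - w] + v); // take item
    next[c] = best;
}

int main()
{
    const int W = 1000, n = 4;
    int wt[n] = {200, 350, 400, 150}, val[n] = {30, 65, 70, 20};

    int *d_a, *d_b;
    cudaMalloc(&d_a, (W + 1) * sizeof(int));
    cudaMalloc(&d_b, (W + 1) * sizeof(int));
    cudaMemset(d_a, 0, (W + 1) * sizeof(int));     // zero items -> zero value

    for (int i = 0; i < n; ++i) {                  // one kernel launch per item
        knapsackRow<<<(W + 256) / 256, 256>>>(d_a, d_b, W, wt[i], val[i]);
        std::swap(d_a, d_b);                       // flip the double buffer
    }
    int best;
    cudaMemcpy(&best, d_a + W, sizeof(int), cudaMemcpyDeviceToHost);
    printf("best value = %d\n", best);             // expect 165 (items 1+2+3)
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```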

15.
Contemporary many-core processors such as the GeForce 8800 GTX enable application developers to utilize various levels of parallelism to enhance the performance of their applications. However, iterative optimization for such a system may settle at a local performance maximum because of the system's complexity. We propose program optimization carving, a technique that begins with a complete optimization space and prunes it down to a set of configurations likely to contain the global maximum. The remaining configurations can then be evaluated to determine the one with the best performance. The technique can reduce the number of configurations to be evaluated by as much as 98% and is successful at finding a near-best configuration. For some applications, we show that this approach is significantly superior to random sampling of the search space.

16.
Cheng Binyang, Wang Maozhi, Luo Yaohua, Guo Ke. Software, 2012(8): 144-146
As spatial and spectral resolution keep increasing, the massive data volume of hyperspectral remote sensing images makes parallel processing a development trend in remote sensing image processing. Based on CUDA and the GPU environment, this paper designs and implements a parallel SCM algorithm for hyperspectral alteration mapping. Experimental results show a parallel speedup of up to 25, and the parallel SCM algorithm effectively improves the algorithm's performance.
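The SCM (Spectral Correlation Mapper) score is the Pearson correlation between each pixel's spectrum and a reference (alteration) spectrum, computed independently per pixel; a kernel sketch (the band layout and all names are assumptions):

```cuda
// One thread per pixel: SCM as mean-removed correlation against a reference
// spectrum; a score of 1 indicates a perfect spectral match.
__global__ void scm(const float *img,   // nPix x nBands, pixel-major layout
                    const float *ref,   // reference spectrum, length nBands
                    float refMean,      // precomputed mean of the reference
                    float *score, int nPix, int nBands)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPix) return;

    const float *s = img + p * nBands;
    float mean = 0.0f;
    for (int b = 0; b < nBands; ++b) mean += s[b];
    mean /= nBands;

    float num = 0.0f, ds = 0.0f, dr = 0.0f;
    for (int b = 0; b < nBands; ++b) {
        float a = s[b] - mean, c = ref[b] - refMean;
        num += a * c;  ds += a * a;  dr += c * c;
    }
    score[p] = num * rsqrtf(ds * dr);   // SCM score in [-1, 1]
}
```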

17.
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.

Program summary

Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel computing clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell processors, and 1-32 NVIDIA GPUs.
RAM: Tested on problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of the Shallow Water equations using a high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.

18.
General-purpose computing on graphics processing units (GP-GPU) has emerged as a new, cost-effective parallel computing paradigm in high-performance computing that enables large amounts of data to be processed in parallel. Large-scale, data-intensive scientific applications play an important role in modern high-performance computing research. A common access pattern in such scientific data analysis is the multi-dimensional range query, yet little research has addressed multi-dimensional range queries on the GPU. Inherently multi-dimensional index trees such as R-trees are not well suited to the GPU environment because of their irregular tree traversal: traversing irregular search paths makes it hard to fully utilize massively parallel architectures. In this paper, we propose MPTS (Massively Parallel Three-phase Scanning), a novel R-tree traversal algorithm for multi-dimensional range queries that converts recursive access to tree nodes into sequential access. Our extensive experimental study shows that the MPTS R-tree traversal algorithm on an NVIDIA Tesla M2090 GPU consistently outperforms the traditional recursive R-tree search algorithm on Intel Xeon E5506 processors.
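The core primitive behind scanning-based traversal is laying the nodes of one tree level out flat and testing every bounding box in a coalesced, divergence-free pass; a sketch of that idea (a compaction over the hit flags would then build the next level's work list; this is our generic rendering, not the authors' MPTS code):

```cuda
// 2-D bounding box, kept to two dimensions for brevity.
struct MBR { float lo[2], hi[2]; };

// One thread per node of the current level: test overlap with the query
// window and write a 0/1 hit flag with coalesced accesses.
__global__ void rangeTest(const MBR *nodes, int n, MBR q, int *hit)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const MBR b = nodes[i];
    bool overlap = true;
    for (int d = 0; d < 2; ++d)          // boxes overlap iff they do in every dim
        overlap &= (b.lo[d] <= q.hi[d]) && (q.lo[d] <= b.hi[d]);
    hit[i] = overlap ? 1 : 0;
}
```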

19.
As software and hardware evolve, graphics processing units (GPUs) have come into wide use for general-purpose computing, serving as accelerators that help the CPU speed up program execution. In pursuit of high performance, a GPU often contains hundreds or thousands of cores; this dense computing resource gives it far higher performance than a CPU but also higher power consumption, so power has become one of the key constraints on GPU development. This paper analyzes the power consumed by parallel programs running on the GPU and proposes a method for evaluating the power consumption of parallel algorithms on the GPU, which is then elaborated and analyzed in detail through a parallel prefix-sum algorithm. In the experimental section, the correctness and sensitivity of the method are demonstrated and analyzed with a practical sparse matrix-vector multiplication application. The results show that, for a given program, provided performance requirements are met, the optimal number of thread blocks, the memory-access pattern, and the task-assignment order are the key factors affecting system power consumption.
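For reference, the parallel prefix sum the paper uses as its running example can be written as a Hillis-Steele scan; a self-contained single-block sketch (multi-block scans add a block-sums pass), whose launch configuration is exactly the kind of block-count choice the paper identifies as a power factor:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256   // one block of N threads

// Hillis-Steele inclusive scan in shared memory. Double buffering avoids
// read/write races between iterations.
__global__ void scan(const int *in, int *out)
{
    __shared__ int buf[2][N];
    int t = threadIdx.x;
    buf[0][t] = in[t];
    __syncthreads();

    int src = 0;
    for (int off = 1; off < N; off <<= 1) {
        buf[1 - src][t] = buf[src][t] + (t >= off ? buf[src][t - off] : 0);
        src = 1 - src;
        __syncthreads();
    }
    out[t] = buf[src][t];
}

int main()
{
    int h[N], *d_in, *d_out;
    for (int i = 0; i < N; ++i) h[i] = 1;          // scan of ones -> 1,2,3,...
    cudaMalloc(&d_in, sizeof(h));  cudaMalloc(&d_out, sizeof(h));
    cudaMemcpy(d_in, h, sizeof(h), cudaMemcpyHostToDevice);
    scan<<<1, N>>>(d_in, d_out);
    cudaMemcpy(h, d_out, sizeof(h), cudaMemcpyDeviceToHost);
    printf("last = %d (expect %d)\n", h[N - 1], N);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```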
