Similar literature
19 similar documents found.
1.
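Mesh reordering is an important means of improving the parallel efficiency of computational fluid dynamics on CPUs and GPUs. Unstructured meshes store their data irregularly, so indirect data access causes memory-access latency; in GPU parallel computing in particular, indirect access leads to unaligned memory accesses, which amplifies the effect of that latency. To address this, the Reverse Cuthill-McKee mesh reordering method is used to improve the data locality of unstructured meshes, and a numbering-oriented reordering method (面向编号重排序方法) is designed. Test cases show that mesh reordering does not affect the final computed results. The effect of mesh reordering on the performance of an unstructured solver on the CPU and GPU is compared and analyzed: on the CPU, the run time of some hotspot functions drops by about 20% and the overall run time by 15%~20%; on the GPU, the run time of most hotspot functions drops by 35%~60% and the overall program run time by about 40%.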
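
As a rough illustration of the reordering idea in this abstract, the following Python sketch applies a Reverse Cuthill-McKee ordering to a cell adjacency list. It is a serial, minimal sketch and not the paper's implementation: the names `reverse_cuthill_mckee` and `bandwidth`, the adjacency-list input and the minimum-degree start rule are all assumptions made for the example.

```python
from collections import deque


def reverse_cuthill_mckee(adjacency):
    """Return an RCM ordering: order[new_index] = old cell index."""
    n = len(adjacency)
    degree = [len(nbrs) for nbrs in adjacency]
    visited = [False] * n
    order = []
    # Cover every connected component, starting each from a low-degree cell.
    for start in sorted(range(n), key=lambda i: degree[i]):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            order.append(cell)
            # Breadth-first search, visiting neighbors by increasing degree.
            for nbr in sorted(adjacency[cell], key=lambda j: degree[j]):
                if not visited[nbr]:
                    visited[nbr] = True
                    queue.append(nbr)
    order.reverse()  # the "reverse" step of Reverse Cuthill-McKee
    return order


def bandwidth(adjacency, order):
    """Largest index distance between two neighboring cells under `order`."""
    position = {old: new for new, old in enumerate(order)}
    return max(
        (abs(position[i] - position[j])
         for i, nbrs in enumerate(adjacency) for j in nbrs),
        default=0,
    )


# A chain of five cells whose numbering is scrambled: 0-3-1-4-2.
chain = [[3], [3, 4], [4], [0, 1], [1, 2]]
rcm_order = reverse_cuthill_mckee(chain)
```

On this scrambled five-cell chain the index distance between neighboring cells falls from 3 under the original numbering to 1 under the RCM order, which is the kind of locality gain the abstract attributes to reordering.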

2.
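To quickly extract high-quality quadrilateral meshes from implicit surfaces, a GPU-based method for high-quality quadrangulation of implicit surfaces is proposed. The method has two stages: initial mesh extraction and mesh optimization. Exploiting the GPU's parallelism, it first quickly extracts a coarse quadrilateral mesh, then iteratively optimizes that mesh in terms of both geometry (vertex positions, normals) and regularity (vertex distribution, adjacency). Experimental results show that the method greatly improves the efficiency of implicit-surface quadrangulation and produces high-quality quadrilateral meshes.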

3.
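Differential-domain mesh deformation methods preserve the local detail features of a mesh model well, but they are time-consuming to compute. Drawing on the high-speed parallel computing capability of the GPU, a GPU-based differential-domain mesh deformation algorithm is designed and implemented. The GPU computes the mesh's differential coordinates, the Cholesky factorization of the linear system's coefficient matrix, and the solution of the linear system, so that the encoding and decoding of local mesh detail and the rendering of the deformation result are carried out entirely on the GPU. Experimental results show that the algorithm effectively accelerates both the computation and the rendering of differential-domain mesh deformation.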

4.
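Ni Hong (倪鸿), Liu Xin (刘鑫). Computer Engineering (《计算机工程》), 2019, 45(6): 45-51
To address the scattered memory-access problem of unstructured meshes in high-performance computing, a general many-core optimization algorithm based on sorting is proposed to reduce random memory accesses in unstructured-mesh computation. The work targets the domestically developed Sunway TaihuLight supercomputer and follows the architectural characteristics of its heterogeneous many-core processor, the SW26010. Based on the principle of mesh partitioning, the nonzero elements of the generated sparse matrix are reordered in parallel in O(n) time. An internal mapping is used to extend or transform the computation vectors, converting fine-grained memory accesses into coarse-grained accesses free of write conflicts. Many-core optimization is applied to the flux computation of several real application cases; the results show that, compared with the serial algorithm on the master core, the algorithm achieves an average speedup of more than 10x.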
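
To make the linear-time reordering step more concrete, here is a hedged Python sketch that sorts scattered sparse-matrix nonzeros by row with a counting sort and then multiplies row by row. It is a serial stand-in chosen for illustration, not the paper's SW26010 algorithm: the COO input layout and the names `reorder_by_row` and `spmv` are assumptions.

```python
def reorder_by_row(n_rows, rows, cols, vals):
    """Counting sort of COO nonzeros by row, O(nnz + n_rows) time.

    Returns (row_ptr, col_idx, data), i.e. a CSR-style layout in which the
    nonzeros of row r occupy data[row_ptr[r]:row_ptr[r + 1]].
    """
    row_ptr = [0] * (n_rows + 1)
    for r in rows:                      # histogram of nonzeros per row
        row_ptr[r + 1] += 1
    for r in range(n_rows):             # prefix sum gives row start offsets
        row_ptr[r + 1] += row_ptr[r]
    cursor = row_ptr[:-1]               # slicing copies: next free slot per row
    col_idx = [0] * len(vals)
    data = [0.0] * len(vals)
    for r, c, v in zip(rows, cols, vals):
        slot = cursor[r]
        col_idx[slot] = c
        data[slot] = v
        cursor[r] += 1
    return row_ptr, col_idx, data


def spmv(row_ptr, col_idx, data, x):
    """y = A x; each output entry is written by exactly one row loop."""
    return [
        sum(data[k] * x[col_idx[k]] for k in range(row_ptr[r], row_ptr[r + 1]))
        for r in range(len(row_ptr) - 1)
    ]


# Nonzeros of [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in scrambled order.
row_ptr, col_idx, data = reorder_by_row(
    3, rows=[2, 0, 1, 2, 0], cols=[2, 2, 1, 0, 0], vals=[5, 2, 3, 4, 1]
)
y = spmv(row_ptr, col_idx, data, [1, 2, 3])
```

After the sort, each row owns one contiguous block of nonzeros, so blocks of rows can be handed to different cores without two cores writing the same output entry.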

5.
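To achieve real-time, realistic simulation of small-scale fluid scenes, water is modeled with the weakly compressible SPH method and a hybrid CPU-GPU computing scheme for fluid computation is proposed. Because the neighbor-particle search limits the efficiency of fluid computation, the whole simulation domain is divided into a uniform 3D spatial grid, and neighbor particles are found using a parallel prefix sum and a parallel counting sort. Finally, a CUDA-accelerated Marching Cubes algorithm extracts the fluid surface, and environment mapping renders the reflection and refraction of the fluid to shade the surface. Experimental results show that the proposed modeling and simulation algorithms achieve real-time computation and rendering of small-scale fluids, reproducing dynamic effects such as waves, overturning water and a wooden block rocking in the water; with 1 048 576 particles, the GPU parallel method achieves a speedup of 60.7 over the CPU method.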
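
The neighbor search described here (uniform grid, prefix sum, counting sort) can be sketched serially in Python as follows. This is only an illustration under stated assumptions: the function names, the requirement that the search radius not exceed the cell size, and the assumption that all particles lie inside the grid are mine, and the paper's version runs these steps in parallel on the GPU.

```python
from itertools import accumulate


def build_cell_table(positions, cell_size, grid_dim):
    """Counting-sort particles by cell; returns (starts, sorted_particles)."""
    nx, ny, nz = grid_dim
    ids = []
    for x, y, z in positions:
        ix, iy, iz = int(x // cell_size), int(y // cell_size), int(z // cell_size)
        ids.append((iz * ny + iy) * nx + ix)
    counts = [0] * (nx * ny * nz)
    for c in ids:
        counts[c] += 1
    # Prefix sum: cell c owns sorted_particles[starts[c]:starts[c + 1]].
    starts = [0] + list(accumulate(counts))
    cursor = starts[:-1]
    sorted_particles = [0] * len(positions)
    for particle, c in enumerate(ids):
        sorted_particles[cursor[c]] = particle
        cursor[c] += 1
    return starts, sorted_particles


def neighbors(i, positions, cell_size, grid_dim, starts, sorted_particles, radius):
    """Particles within `radius` of particle i (assumes radius <= cell_size)."""
    nx, ny, nz = grid_dim
    px, py, pz = positions[i]
    cx, cy, cz = int(px // cell_size), int(py // cell_size), int(pz // cell_size)
    found = []
    # Only the 27 cells around the particle's own cell need to be scanned.
    for iz in range(max(cz - 1, 0), min(cz + 2, nz)):
        for iy in range(max(cy - 1, 0), min(cy + 2, ny)):
            for ix in range(max(cx - 1, 0), min(cx + 2, nx)):
                c = (iz * ny + iy) * nx + ix
                for j in sorted_particles[starts[c]:starts[c + 1]]:
                    qx, qy, qz = positions[j]
                    d2 = (qx - px) ** 2 + (qy - py) ** 2 + (qz - pz) ** 2
                    if j != i and d2 <= radius * radius:
                        found.append(j)
    return sorted(found)


points = [(0.1, 0.1, 0.1), (0.9, 0.1, 0.1), (1.1, 0.1, 0.1), (3.5, 3.5, 3.5)]
starts, table = build_cell_table(points, 1.0, (4, 4, 4))
near_1 = neighbors(1, points, 1.0, (4, 4, 4), starts, table, 0.5)
```

In the example, particle 1 finds particle 2 even though the two sit in different cells, because the adjacent cell is scanned as well.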

6.
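The FFT (fast Fourier transform) is an efficient algorithm for computing the DFT (discrete Fourier transform) and is widely used in many fields of science and engineering. Since the FFT appeared, optimized variants have been studied extensively, from the early focus on reducing complexity to large-scale parallel FFT computation in recent years. In parallel computing, the spread of programmable, parallel GPUs, and especially the arrival of the general-purpose parallel computing architecture CUDA, has greatly enhanced the GPU's computing capability and brought marked improvements in programming and optimization. Accordingly, building on an analysis of FFT implementations, this paper studies a parallel FFT computation method suited to GPUs and implements the FFT on the GPU using the CUDA architecture. In theory, ignoring data transfer, the method reduces the time complexity of a one-dimensional FFT from O(N log₂N) to O((N/r) log₂N). Verification shows that the proposed CUDA-based parallel FFT achieves good speedup and that its numerical accuracy meets practical requirements, demonstrating the correctness and effectiveness of the method.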
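
For readers unfamiliar with the structure being parallelized, the following Python sketch shows a textbook radix-2 decimation-in-time FFT alongside a direct DFT for comparison. It is a serial illustration, not the paper's CUDA method; the function names and the power-of-two length restriction are assumptions of the sketch.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    # The n/2 butterflies of one stage do not depend on each other,
    # which is what a GPU version can exploit (one thread per butterfly).
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


signal = [1, 2, 3, 4, 0, 0, 0, 0]
spectrum = fft(signal)
reference = dft(signal)
```

Each of the log₂N stages performs N/2 independent butterflies, so spreading the butterflies of a stage over several workers shortens the per-stage time while the number of stages stays the same.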

7.
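Fast H.264 deblocking filter algorithm on the CUDA architecture   Total citations: 1, self-citations: 0, citations by others: 1
Liu Hu (刘虎), Sun Zhaomin (孙召敏), Chen Qimei (陈启美). Journal of Computer Applications (《计算机应用》), 2010, 30(12): 3252-3254
To address the high computational complexity and heavy time cost of the deblocking filter in the H.264/AVC video coding standard, a fast parallel H.264 deblocking filter algorithm based on the NVIDIA Compute Unified Device Architecture (CUDA) platform is proposed. The hardware structure and software development flow of the CUDA platform are introduced. In line with the concurrent structure of the graphics processing unit (GPU), the boundary strength (BS) decision and the filtering computation are optimized for parallel execution, reducing algorithmic complexity, and shared memory is used to speed up data access, realizing parallel processing of the deblocking filter. Experimental results show that, with image quality essentially unchanged, the GPU algorithm markedly increases computing speed, with an average speedup of about 20x.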

8.
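GPU-accelerated algorithms for extracting the Normalized Difference Vegetation Index (NDVI) usually use the GPU multithreaded parallel model, which suffers from problems such as the time lost between weakly dependent computations and in data transfer between the CPU and GPU, limiting further speedup. To address these problems, and based on the characteristics of the NDVI extraction algorithm, this paper proposes an NDVI extraction algorithm built on the GPU multi-stream concurrent parallel model. Using CUDA streams and the Hyper-Q feature, this model lets data transfer overlap with weakly dependent computations, and weakly dependent computations overlap with one another, further raising the algorithm's parallelism and GPU resource utilization. The NDVI extraction algorithm is first optimized with the GPU multithreaded parallel model, and the optimized computation is decomposed to identify the parts that contain data transfer and weakly dependent computations. These parts are then restructured and optimized with the multi-stream concurrent parallel model so that weakly dependent computations overlap with one another and with data transfer. Finally, remote sensing images captured by the Gaofen-1 satellite are used as experimental data to validate the two GPU-based NDVI extraction algorithms. The results show that, compared with the conventional NDVI extraction algorithm based on the GPU multithreaded parallel model, the proposed algorithm achieves an average speedup of about 1.5x for images larger than 12000×12000 pixels, and about 260x over the serial extraction algorithm, giving better speedup and parallelism.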
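
The per-pixel computation behind this abstract is the standard NDVI formula, NDVI = (NIR - Red) / (NIR + Red). The Python sketch below computes it for a small image and then again in independent row chunks, which is the property a multi-stream scheme relies on when it assigns chunks to different streams. It is a CPU illustration only: the function names, the chunking by rows and the zero-denominator convention are assumptions, and no CUDA streams are used.

```python
def ndvi(nir, red):
    """NDVI of one pixel from its near-infrared and red reflectance."""
    total = nir + red
    # Assumption: report 0.0 where both bands are zero to avoid dividing by zero.
    return 0.0 if total == 0 else (nir - red) / total


def ndvi_rows(nir_rows, red_rows):
    """NDVI for a block of image rows."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_rows, red_rows)
    ]


def ndvi_chunked(nir_rows, red_rows, chunk_rows):
    """Process the image chunk by chunk; chunks do not depend on each other."""
    out = []
    for start in range(0, len(nir_rows), chunk_rows):
        stop = start + chunk_rows
        out.extend(ndvi_rows(nir_rows[start:stop], red_rows[start:stop]))
    return out


nir_band = [[3, 1], [0, 4], [2, 2]]
red_band = [[1, 1], [0, 0], [6, 2]]
whole = ndvi_rows(nir_band, red_band)
chunked = ndvi_chunked(nir_band, red_band, chunk_rows=2)
```

Because every chunk yields the same values as the whole-image pass, the upload of one chunk could in principle proceed while another chunk is being computed.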

9.
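A GPU-based parallel elevation interpolation algorithm is proposed that accelerates the rendering of massive numbers of discrete points on 3D terrain. The elevation data of the 3D terrain mesh are organized as an elevation texture that serves as the basis for rendering the discrete points, and GPU shader programs written in GLSL dynamically control the graphics rendering pipeline to implement a view-dependent parallel elevation interpolation algorithm. Experimental results show that, compared with the conventional in-memory interpolation algorithm, the proposed GPU-based algorithm raises the number of discrete points that can be rendered on 3D terrain from the order of millions to tens of millions.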
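
The abstract does not state which interpolation scheme is used. As one plausible reading, the Python sketch below samples a height grid with bilinear interpolation, which is what a filtered texture lookup typically does. The function name `sample_height`, the grid layout `height[row][col]`, the clamping at the border and the choice of bilinear weights are all assumptions made for illustration.

```python
def sample_height(height, x, y):
    """Bilinearly interpolate a height grid at fractional position (x, y).

    `height[row][col]` holds elevations on an integer grid of at least 2x2
    samples; x runs along columns and y along rows.
    """
    rows, cols = len(height), len(height[0])
    # Clamp the query point to the grid (assumed border handling).
    x = min(max(x, 0.0), cols - 1.0)
    y = min(max(y, 0.0), rows - 1.0)
    x0 = min(int(x), cols - 2)
    y0 = min(int(y), rows - 2)
    fx, fy = x - x0, y - y0
    top = height[y0][x0] * (1 - fx) + height[y0][x0 + 1] * fx
    bottom = height[y0 + 1][x0] * (1 - fx) + height[y0 + 1][x0 + 1] * fx
    return top * (1 - fy) + bottom * fy


terrain = [[0, 10], [20, 30]]
center = sample_height(terrain, 0.5, 0.5)
```

Each query point is interpolated independently of the others, which is why this step maps naturally onto one shader invocation per point.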

10.
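Design and implementation of a CUDA-based parallel particle swarm optimization algorithm   Total citations: 1, self-citations: 0, citations by others: 1
To address the excessive computation time of the particle swarm optimization (PSO) algorithm when processing large amounts of data and solving large-scale complex problems, a fine-grained parallel PSO algorithm on the graphics processing unit (GPU) is studied. Based on an analysis of the conventional PSO algorithm and on the widely used GPU-based parallel computing technology, a parallel PSO method is designed and implemented. The method runs on the Compute Unified Device Architecture (CUDA) and uses a large number of GPU threads to process the search of each particle in parallel, speeding up the convergence of the whole swarm. The program makes full use of the mathematical libraries that ship with CUDA, which keeps it stable and easy to write. Solving several benchmark optimization test functions shows that, with consistent convergence behavior, the CUDA-based parallel PSO method achieves a speedup of up to 90x over the serial CPU-based method.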
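
The per-particle update that such a method distributes over GPU threads can be sketched serially in Python as below. This is a generic global-best PSO, not the paper's implementation: the parameter values (`w`, `c1`, `c2`), the search bound, the fixed seed and the function names are assumptions chosen for the example.

```python
import random


def pso(objective, dim, n_particles=30, iterations=100,
        w=0.7, c1=1.5, c2=1.5, bound=5.0, seed=0):
    """Minimize `objective` with a basic global-best particle swarm."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-bound, bound) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: best_val[i])
    gbest_pos, gbest_val = best_pos[g][:], best_val[g]

    for _ in range(iterations):
        # Each particle's update only reads the shared global best, so a GPU
        # version can assign one thread to each particle.
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (gbest_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
        # Reduction step: refresh the global best after all particles moved.
        g = min(range(n_particles), key=lambda i: best_val[i])
        if best_val[g] < gbest_val:
            gbest_pos, gbest_val = best_pos[g][:], best_val[g]
    return gbest_pos, gbest_val


def sphere(p):
    """Simple benchmark function with its minimum of 0 at the origin."""
    return sum(c * c for c in p)


start_pos, start_val = pso(sphere, dim=2, iterations=0)
final_pos, final_val = pso(sphere, dim=2, iterations=100)
```

Updating the global best only once per iteration, after every particle has moved, mirrors a synchronous GPU scheme in which a reduction follows the per-particle kernel.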

11.
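Unstructured-mesh preprocessing is one of the key techniques in parallel CFD computation on unstructured meshes. A fast search algorithm based on a buffered data structure is proposed to build the global cell adjacency graph; the algorithm has low complexity and markedly reduces the storage required for unstructured-mesh preprocessing. To raise the memory-access hit rate of the core computation, a cell reordering algorithm is proposed that improves the efficiency of the core computation and applies generally to unstructured-mesh problems. Experimental results show that the preprocessing technique still gives satisfactory results on complex computational domains with large mesh sizes.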

12.
This paper presents the porting of 2D and 3D Navier–Stokes equations solvers for unstructured grids from the CPU to the graphics processing unit (GPU; NVIDIA's GeForce GTX 280 and 285), using the CUDA language. The performance of the GPU implementations, with single, double or mixed precision arithmetic operations, is compared to that of the CPU code. Issues regarding the optimal handling of the unstructured grid topology on the GPU, particularly for vertex-centered CFD algorithms, are discussed. Restructuring the existing codes was necessary in order to maximize the parallel efficiency of the GPU implementations. The mixed precision implementation, in which the left-hand-side operators are computed with single precision, was shown to bridge the gap between the single and double precision speed-ups. Based on the different speed-ups and prediction accuracy of the aforementioned GPU implementations of the Navier–Stokes equations solver, a hierarchical optimization method suitable for GPUs is proposed and demonstrated in inviscid and turbulent 2D flow problems. The search for the optimal solution(s) splits into two levels, both relying upon evolutionary algorithms (EAs) but each with a different evaluation tool. The low level EA uses the very fast single precision GPU implementation with relaxed convergence criteria for the inexpensive evaluation of candidate solutions. Promising solutions are regularly broadcast to the high level EA, which uses the mixed precision GPU implementation of the same flow solver. Single- and two-objective aerodynamic shape optimization problems are solved using the developed software.

13.
In this study, an efficient numerical method is proposed for unifying the structured and unstructured grid approaches for solving potential flows. The new method, named "alternating cell directions implicit" (ACDI), handles structured and unstructured grid configurations equally well. In effect, it applies a line implicit method similar to the Line Gauss-Seidel scheme to complex unstructured grids, including those that mix quadrilateral and triangular cells. To this end, designated alternating directions are taken along chains of contiguous cells, i.e. "cell directions", and an ADI-like sweep updates these cells using a Line Gauss-Seidel-like scheme. The algorithm ensures that the entire flow field is updated by traversing each cell twice at each time step for unstructured quadrilateral grids that may contain triangular cells. In this study, a cell-centered finite volume formulation of the ACDI method is demonstrated. Solutions are obtained for incompressible potential flows around a circular cylinder and a forward step. The results are compared with analytical solutions and with numerical solutions from the implicit ADI and the explicit Runge-Kutta methods on single- and multi-block structured and unstructured grids. The results demonstrate that the present ACDI method is unconditionally stable, easy to use and has the same computational performance in terms of convergence, accuracy and run times for both structured and unstructured grids.

14.
The discontinuous finite element discrete ordinates method, which inverts an operator by iteratively sweeping across a mesh from multiple directions, is commonly used to solve the time-dependent particle transport equation. Graphics Processing Units (GPUs) offer great capability for scientific applications, but particle transport on unstructured grids poses several challenges when implemented on a GPU. This paper presents an efficient implementation of particle transport on unstructured grids in a 2D cylindrical Lagrangian coordinate system on a GPU platform with fine-grained data-level parallelism, from three aspects. The first is determining the sweep order of elements for different angular directions. The second is mapping the sweep calculation onto the GPU thread execution model. The last is using the on-chip memory efficiently to improve performance. To the authors' knowledge, this is the first implementation of a general purpose particle transport simulation with unstructured grids on a GPU. Experimental results show that the speedup of an NVIDIA M2050 GPU with double precision floating-point operations ranges from 11.03 to 17.96 compared with the serial implementation on Intel Xeon X5355 and Core Q6600.

15.
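Lighting plays an important role in improving the quality of volume rendering, and gradient computation is the key to volume lighting. Compared with structured grids, unstructured grids have complex topology, which makes vertex gradient estimation difficult and sample-point gradient computation expensive; it is also hard to accelerate on the GPU, which hinders real-time performance. As a result, most unstructured-grid volume rendering has not yet incorporated volume lighting. This paper proposes a high-accuracy vertex gradient computation method for unstructured grids: Green's formula is first used to estimate the cell...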

16.
We introduce efficient, large-scale fluid simulation on GPU hardware using the fluid‐implicit particle (FLIP) method over a sparse hierarchy of grids represented in NVIDIA® GVDB Voxels. Our approach handles tens of millions of particles within a virtually unbounded simulation domain. We describe novel techniques for parallel sparse grid hierarchy construction and fast incremental updates on the GPU for moving particles. In addition, our FLIP technique introduces sparse, work-efficient parallel data gathering from particle to voxel, and a matrix‐free GPU‐based conjugate gradient solver optimized for sparse grids. Our results show that our method can achieve up to an order of magnitude faster simulations on the GPU as compared to FLIP simulations running on the CPU.

17.
We present two parallel multilevel methods for solving large-scale discretized partial differential equations on unstructured 2D/3D grids. The presented methods combine three powerful numerical algorithms: overlapping domain decomposition, multigrid method and adaptivity. As the foundation of the methods we propose an algorithm for generating and partitioning a hierarchy of adaptively refined unstructured grids, so that adaptivity can be incorporated up to a certain grid level. We ensure that the resulting subgrid hierarchies are well balanced and no inter-processor communication is needed across different grid levels, thus obtaining high parallel efficiency. Numerical experiments show that the parallel multilevel methods converge almost as fast as their sequential multigrid counterpart, and that the resulting implementation has reasonably good scalability. Received: 4 December 1998 / Accepted: 12 January 2000

18.
J. Xu. Computing, 1996, 56(3): 215-235
An abstract framework of the auxiliary space method is proposed and, as an application, an optimal multigrid technique is developed for general unstructured grids. The auxiliary space method is a (nonnested) two level preconditioning technique based on a simple relaxation scheme (smoother) and an auxiliary space (that may be roughly understood as a nonnested coarser space). An optimal multigrid preconditioner is then obtained for a discretized partial differential operator defined on an unstructured grid by using an auxiliary space defined on a more structured grid, in which a further nested multigrid method can be naturally applied. This new technique makes it possible to apply multigrid methods to general unstructured grids without much more programming effort than traditional solution methods. Some simple examples are given to illustrate the abstract theory; for instance, the Morley finite element space is used as an auxiliary space to construct a preconditioner for the Argyris element for biharmonic equations. Some numerical results are also given to demonstrate the efficiency of using a structured grid as the auxiliary space to precondition unstructured grids.

19.
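Recent developments in the graphics processing unit (GPU) make it possible to deliver high-performance general-purpose computing at low cost. The GPU-based programming models CUDA (compute unified device architecture) and OpenCL (open computing language) give programmers an ample C-like application programming interface (API), making it easier to exploit the GPU's parallel computing power. Using graphics hardware to accelerate computation, existing N-body implementations on the GPU are analyzed with a new GPU processing model, the parallel time-space model. On that basis, a new algorithm for fast simulation of the N-body problem on the GPU is proposed and implemented on an AMD Radeon HD 5850. Experimental results show a speedup of about 400x over a CPU implementation and of 2x to 5x over existing GPU implementations.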
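
As background for the workload this abstract accelerates, here is a serial Python sketch of the direct-sum acceleration step of an N-body simulation. It assumes gravitational interaction with an optional softening length; the function name, the parameters `G` and `softening`, and the all-pairs formulation are assumptions for illustration and do not reproduce the paper's parallel time-space model.

```python
def accelerations(pos, mass, G=1.0, softening=0.0):
    """All-pairs gravitational accelerations, O(N^2) work for N bodies."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    # The outer loop iterations are independent of one another, so a GPU
    # version can give each body i to its own thread.
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + softening ** 2
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += G * mass[j] * dx[k] * inv_r3
    return acc


# Two unit masses one unit apart pull on each other with unit acceleration.
pair = accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0])

# For any set of bodies the mass-weighted accelerations should cancel.
bodies = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
masses = [1.0, 2.0, 3.0]
acc3 = accelerations(bodies, masses, softening=0.1)
net = [sum(m * a[k] for m, a in zip(masses, acc3)) for k in range(3)]
```

The quadratic number of pair interactions is what makes the problem expensive on a CPU and attractive for massively parallel hardware.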
