1.

Development of parallel explicit finite element sheet forming simulation system based on GPU architecture

Yong Cai Guangyao Li Hu Wang Gang Zheng Sen Lin 《Advances in Engineering Software》2012,45(1):370-379

Sheet forming simulation is very important for vehicle body design. As CAE models grow in complexity and scale, the trade-off between accuracy and efficiency has become the bottleneck in applications. Therefore, a parallel explicit finite element (FE) simulation system for sheet forming is developed on the graphics processing unit (GPU) architecture. Implementation details with the Compute Unified Device Architecture (CUDA) are considered in this work. A pre-index strategy is suggested for parallelizing the nodal force assembly. A parallel reduction method is introduced to calculate the global time step. To ensure the reliability and accuracy of the GPU-based program, double-precision floating-point arithmetic and intrinsic functions are used for the explicit FE computation. Simulation results show that a commercial NVIDIA GTX285 device achieves a speedup of about 27X over an Intel Q8200 CPU, which demonstrates the efficiency of the parallel sheet forming simulation system.
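
The abstract mentions a parallel reduction for the global time step. In an explicit FE code the global step is typically the minimum of the per-element stable steps, so a minimal sketch of the reduction pattern might look like the following. It is plain Python that runs serially and only mirrors the pairwise structure a GPU would execute in parallel; the function name `tree_min_reduce` and the sample values are assumptions for illustration, not the authors' CUDA code.

```python
def tree_min_reduce(values):
    """Pairwise (tree) min reduction; each while-pass mirrors one parallel step."""
    if not values:
        raise ValueError("need at least one element time step")
    buf = list(values)
    n = len(buf)
    while n > 1:
        half = (n + 1) // 2
        # On a GPU, every i in this loop would be handled by its own thread.
        for i in range(n // 2):
            buf[i] = min(buf[i], buf[i + half])
        n = half
    return buf[0]


# Hypothetical per-element stable time steps, in seconds.
element_dt = [2.0e-6, 1.5e-6, 3.1e-6, 9.0e-7, 2.2e-6]
global_dt = tree_min_reduce(element_dt)
```

The number of passes grows with the logarithm of the element count, which is why this pattern suits hardware with many threads.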

2.

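Direct position determination based on a GPU-accelerated genetic algorithm

任衍青 逯志宇 王大鸣 《计算机应用研究》2019,36(4):1084-1087

To address the long execution time and poor real-time performance of the genetic direct position determination algorithm on large-scale data, a GPU-accelerated parallel genetic direct position determination algorithm is proposed. Based on the characteristics of the direct position determination cost function, a high-speed GPU-parallel genetic evolution architecture is designed. Parallelizing the fitness function computation and the genetic operations of selection, crossover, and mutation shortens the execution time and improves the efficiency of the algorithm. Simulation experiments show that a well-designed GPU thread structure significantly increases the execution speed of the genetic direct position determination algorithm, so that position estimates are obtained faster.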

3.

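Firefly 2: a hardware architecture for a polymorphic parallel machine

李涛 杨婷 易学渊 蒲林 钱博文 黄光新 黄虎才 韩俊刚 《计算机工程与科学》2014,36(2):191-200

A new polymorphic, high-efficiency parallel array machine architecture, the Firefly 2 array machine, is proposed. Its processing elements can run in both SIMD and MIMD modes, support an asynchronous execution mechanism, and can also perform distributed instruction-level parallel processing. A hardware multithread manager and an efficient communication mechanism enable the array machine to achieve highly efficient thread-level, data-level, and distributed instruction-level parallel computation. Notably, the stream-processing performance of the array machine is comparable to that of application-specific integrated circuits. The architecture also supports static and dynamic dataflow computation effectively and can efficiently carry out graphics, image, and digital signal processing tasks.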

4.

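Research on GPU-based string matching algorithms (total citations: 7; self-citations: 0; citations by others: 7)

张庆丹 戴正华 冯圣中 孙凝晖 《计算机应用》2006,26(7):1735-1737

The BF algorithm is the most basic string matching algorithm, but it is a serial algorithm and is not suited to the architecture of the graphics processing unit (GPU). Taking the GPU's special architecture into account, the data access pattern and computation strategy are improved to make full use of the GPU's parallel processing capability, and the BF algorithm is thereby implemented on the GPU. Experimental results show that the GPU-based parallel algorithm achieves a good speedup; they also identify the bottlenecks to implementing general-purpose computation efficiently on existing GPU architectures.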
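
To illustrate why BF matching can be recast for a data-parallel device, here is a minimal Python sketch, assuming "BF" denotes brute-force matching. Every candidate alignment is tested independently, so each call to the hypothetical helper `match_at` could in principle map to one GPU thread. The abstract does not describe the paper's actual data layout or access pattern, so none of this reproduces it.

```python
def match_at(text, pattern, start):
    """Work of one hypothetical GPU thread: test a single alignment."""
    return text[start:start + len(pattern)] == pattern


def bf_search_all(text, pattern):
    """Return every start index at which pattern occurs in text."""
    if not pattern:
        return []
    last = len(text) - len(pattern)
    # Each start position is independent of the others, so this
    # comprehension is the part that a GPU could evaluate in parallel.
    return [s for s in range(last + 1) if match_at(text, pattern, s)]


hits = bf_search_all("abracadabra", "abra")
```

Because no alignment depends on the result of another, the serial loop of the textbook algorithm disappears; the cost is that no work is skipped after a mismatch.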

5.

Comparative investigation of GPU-accelerated triangle-triangle intersection algorithms for collision detection

Xiao Lei Mei Gang Cuomo Salvatore Xu Nengxiong 《Multimedia Tools and Applications》2022,81(3):3165-3180

Efficient collision detection is critical in 3D geometric modeling. In this paper, we first implement three parallel triangle-triangle intersection algorithms on a GPU and then compare the computational efficiency of these three GPU-accelerated algorithms in an application that detects collisions between triangulated models. The presented GPU-based parallel collision detection method for triangulated models has two stages: first, we propose a straightforward and efficient parallel approach, based on AABBs, to reduce the number of potentially intersecting triangle pairs; second, we conduct intersection tests on the remaining triangle pairs in parallel using three triangle-triangle intersection algorithms, i.e., Möller’s algorithm, Devillers’ and Guigue’s algorithm, and Shen’s algorithm. To evaluate the performance of the presented method, we conduct four groups of benchmarks. The experimental results show the following: (1) the time required to detect collisions for a triangulated model consisting of approximately 1.5 billion triangle pairs is less than 0.5 s; (2) the speedup of the GPU-based parallel collision detection method over the corresponding serial version is 50x to 60x; and (3) Devillers’ and Guigue’s algorithm is, on balance, the best of the three GPU-based parallel triangle-triangle intersection algorithms. The presented GPU-accelerated method is capable of efficiently detecting potential collisions between triangulated models. Overall, the GPU-accelerated parallel Devillers’ and Guigue’s triangle-triangle intersection algorithm is recommended for practical collision detection between large triangulated models.
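
The first stage described above, culling triangle pairs by AABB, can be sketched as follows. This is a plain-Python illustration under stated assumptions: triangles are tuples of three (x, y, z) points, every pair is tested exhaustively, and the function names are hypothetical. The abstract does not say how the paper organizes this stage on the GPU.

```python
def aabb(tri):
    """Axis-aligned bounding box of a triangle given as three (x, y, z) points."""
    lo = tuple(min(p[k] for p in tri) for k in range(3))
    hi = tuple(max(p[k] for p in tri) for k in range(3))
    return lo, hi


def aabbs_overlap(a, b):
    """True when two boxes overlap (touching counts) on all three axes."""
    (alo, ahi), (blo, bhi) = a, b
    return all(alo[k] <= bhi[k] and blo[k] <= ahi[k] for k in range(3))


def candidate_pairs(model_a, model_b):
    """Index pairs (i, j) whose boxes overlap; only these need an exact test."""
    boxes_a = [aabb(t) for t in model_a]
    boxes_b = [aabb(t) for t in model_b]
    return [(i, j)
            for i, ba in enumerate(boxes_a)
            for j, bb in enumerate(boxes_b)
            if aabbs_overlap(ba, bb)]


model_a = [((0, 0, 0), (1, 0, 0), (0, 1, 0))]
model_b = [((0.5, 0.5, -1), (0.5, 0.5, 1), (2, 2, 0)),
           ((5, 5, 5), (6, 5, 5), (5, 6, 5))]
pairs = candidate_pairs(model_a, model_b)
```

Each pair test is independent, so the double loop is the natural unit of parallel work; the exact triangle-triangle test then runs only on the surviving pairs.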

6.

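A survey of research on applications of graphics processing units in data management (total citations: 1; self-citations: 0; citations by others: 1)

周国亮 冯海军 何国明 陈红 《计算机科学与探索》2010,4(4):289-303

This survey compares the architectures of central processing units and graphics processing units, and briefly introduces the latest general-purpose computing platforms for graphics processing units and the differences among parallel algorithms on different architectures. It describes in detail the technical characteristics of graphics processing unit applications in spatial databases, relational databases, data streams, data mining, and information retrieval; discusses various GPU-based internal and external memory sorting algorithms and their performance; describes GPU-based data structures and indexing techniques; and reviews work on optimizing algorithms for graphics processing units. Finally, it looks ahead to the prospects for applying graphics processing units to data management and analyzes the challenges this field will face.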

7.

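Crashworthiness optimization of beam cross-sections based on explicit finite element techniques

侯淑娟 李青 龙述尧 《数值计算与计算机应用》2006,27(4):271-280

Using explicit finite element techniques and the response surface method, this paper performs shape optimization of thin-walled metal beams with square cross-sections. The specific energy absorption of the structure is taken as the objective function, with the aim of improving the crashworthiness of energy-absorbing components. Numerical analysis yields the variation of the specific energy absorption of square-section beams with wall thickness and section side length. These relationships can be used in the design of practical energy-absorbing components and lay a foundation for further research.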

8.

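A GPU-based parallel elevation interpolation algorithm for massive discrete points

王智广 张腾畅 吴相锦 鲁强 《计算机工程与科学》2021,43(4):614-619

A GPU-based parallel elevation interpolation algorithm is proposed that achieves parallel accelerated rendering of massive discrete points on a 3D terrain surface. Elevation textures organize the elevation data of the 3D terrain mesh as the basis for rendering the discrete points, and GPU shader programs written in GLSL dynamically control the graphics rendering pipeline to implement a view-dependent parallel elevation interpolation algorithm. Experimental results show that, compared with traditional in-memory interpolation, the proposed algorithm raises the number of discrete points that can be rendered on the 3D terrain surface from the order of millions to the order of tens of millions.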

9.

Development of parallel 3D RKPM meshless bulk forming simulation system

《Advances in Engineering Software》2007,38(2):87-101

A parallel computational implementation of a modern meshless system is presented for explicit 3D bulk forming simulation problems. The system is implemented using the reproducing kernel particle method (RKPM). Aspects of a coarse-grained parallel paradigm, the domain decomposition method, are detailed for a Lagrangian formulation using model partitioning. Integration cells are uniquely assigned to each processing element, and particles overlap in boundary zones. A multilevel recursive spectral bisection approach is applied as the partitioning scheme. The parallel contact search algorithm is also presented. Explicit Message Passing Interface (MPI) statements are used for all communication among partitions on different processors. The parallel 3D system is developed and applied to 3D bulk metal forming problems, and the simulation results demonstrate the efficiency of the developed parallel RKPM system.

10.

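Parallel optimization of nested loops based on a multi-core array architecture

杨子煜 严明 赵鹏 《计算机工程与科学》2009,31(Z1)

Multi-core processors are widely used in high-performance computing, yet how to convert traditional serial programs into parallel code effectively and reduce the time spent in nested loops remains a challenging problem in the field. This paper first uses the polyhedral model to analyze the dependence characteristics of nested loops and to perform tiling, and on this basis automatically generates coarse-grained parallel code. Targeting the structural characteristics of multi-core array processors, a genetic algorithm generates a communication-optimized sequence of tile tasks, on which an effective task scheduling model is built. Finally, the method is applied to LU decomposition; the results show that, compared with traditional scheduling algorithms, it is more effective at increasing data locality and achieving load balance.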

11.

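CUDA-based modeling and simulation of weakly compressible SPH fluids

段兴锋 任鸿翔 神和龙 《计算机工程与科学》2018,40(8):1375-1382

To achieve real-time, realistic simulation of small-scale fluid scenes, the weakly compressible SPH method is used to model water, and a hybrid CPU-GPU computing method for fluid computation is proposed. Because the neighbor particle search limits the efficiency of fluid computation, a 3D spatial grid divides the whole simulation domain into uniform cells, and neighbor particles are found using a parallel prefix sum and a parallel counting sort. Finally, a CUDA-accelerated Marching Cubes algorithm extracts the fluid surface, and environment mapping renders the fluid's reflection and refraction to shade the surface. Experimental results show that the proposed modeling and simulation algorithms achieve real-time computation and rendering of small-scale fluids, reproducing dynamic effects such as the undulation and rolling of water and the rocking of a wooden block in water. With 1,048,576 particles, the GPU parallel method achieves a speedup of 60.7 over the CPU method.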
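
The grid-based neighbor search described above can be sketched serially as follows: particles are binned into uniform cells, a counting sort with an exclusive prefix sum groups particle indices by cell, and a query scans the 27 surrounding cells. This is a minimal Python sketch under stated assumptions (all coordinates lie inside the grid, the search radius does not exceed the cell size, and all function names are hypothetical); it is not the paper's CUDA implementation.

```python
from itertools import accumulate


def cell_coords(p, cell):
    """Integer cell coordinates of point p for cubic cells of edge length cell."""
    return tuple(int(c // cell) for c in p)


def build_grid(points, cell, dims):
    """Group particle indices by cell with a counting sort and a prefix sum."""
    nx, ny, nz = dims
    ids = []
    for p in points:
        ix, iy, iz = cell_coords(p, cell)
        ids.append((iz * ny + iy) * nx + ix)
    counts = [0] * (nx * ny * nz)
    for c in ids:
        counts[c] += 1
    # starts[c] is the first slot of cell c; starts[-1] is the particle count.
    starts = [0] + list(accumulate(counts))
    cursor = starts[:-1]  # a copy, so starts itself is left unchanged
    order = [0] * len(points)
    for i, c in enumerate(ids):
        order[cursor[c]] = i
        cursor[c] += 1
    return order, starts


def neighbors(i, points, cell, dims, order, starts, radius):
    """Indices of particles within radius of particle i (requires radius <= cell)."""
    nx, ny, nz = dims
    cx, cy, cz = cell_coords(points[i], cell)
    found = []
    for z in range(max(cz - 1, 0), min(cz + 2, nz)):
        for y in range(max(cy - 1, 0), min(cy + 2, ny)):
            for x in range(max(cx - 1, 0), min(cx + 2, nx)):
                c = (z * ny + y) * nx + x
                for j in order[starts[c]:starts[c + 1]]:
                    d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                    if j != i and d2 <= radius * radius:
                        found.append(j)
    return sorted(found)


# Hypothetical particle positions in a 4 x 4 x 4 grid of unit cells.
points = [(0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (1.4, 0.1, 0.1), (3.5, 3.5, 3.5)]
dims = (4, 4, 4)
order, starts = build_grid(points, 1.0, dims)
near_1 = neighbors(1, points, 1.0, dims, order, starts, 1.0)
```

The counting and scatter loops are the parts that correspond to the parallel counting sort and prefix sum; after sorting, the particles of each cell occupy one contiguous slice of `order`.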

12.

A contribution to the real-time simulation of coupled finite element models of machine tools – A numerical comparison

S. Hoher S. Röck 《Simulation Modelling Practice and Theory》2011,19(7):1627-1639

In this paper, the real-time simulation of finite element (FE) models of machine tools on a multi-processor architecture is presented. The simulation model is based on several FE component models that are connected by non-linear couplings. These couplings allow relative motions of the components over a wide range. The coupled linear FE models are decomposed at the non-linear coupling nodes, and each component is solved locally. The linear structure of the components can be exploited by efficient simulation methods, and the components can be distributed across several processors for parallel computation. Methods that differ in numerical accuracy and stability, computational effort, and real-time capability are presented. A complex example illustrates that a parallel, stable computation can be realized time-deterministically.

13.

Comparing Parallel Functional Languages: Programming and Performance

H.-W. Loidl F. Rubio N. Scaife K. Hammond S. Horiguchi U. Klusik R. Loogen G.J. Michaelson R. Peña S. Priebe Á.J. Rebón P.W. Trinder 《Higher-Order and Symbolic Computation》2003,16(3):203-251

This paper presents a practical evaluation and comparison of three state-of-the-art parallel functional languages. The evaluation is based on implementations of three typical symbolic computation programs, with performance measured on a Beowulf-class parallel architecture. We assess three mature parallel functional languages: PMLS, a system for implicitly parallel execution of ML programs; GPH, a mainly implicit parallel extension of Haskell; and Eden, a more explicit parallel extension of Haskell designed for both distributed and parallel execution. While all three languages employ a completely implicit approach to communication, each language takes a different approach to specifying and controlling parallelism, ranging from explicit identification of processes as language constructs (Eden) through annotation of potential parallelism (GPH) to automatic detection of parallel skeletons in sequential code (PMLS). We present detailed performance measurements of all three systems on a widely available parallel architecture: a Beowulf cluster of low-cost commodity workstations. We use three representative symbolic applications: a matrix multiplication algorithm, an exact linear system solver, and a simple ray-tracer. Our results show how moderate speedups can be achieved with few or no changes to the sequential code, and that parallel performance can be significantly improved even within our high-level model of parallel functional programming by controlling key aspects of the program such as load distribution and thread granularity.

14.

Kernel Polynomial Method on GPU

Shixun Zhang Shinichi Yamagiwa Masahiko Okumura Seiji Yunoki 《International journal of parallel programming》2013,41(1):59-88

The simulation of lattice model systems for quantum materials is one of the most important approaches to understanding the quantum properties of matter in condensed matter physics. The main task in the simulation is to diagonalize a Hamiltonian matrix for the system and evaluate the electronic density of energy states. The kernel polynomial method (KPM) is one of the promising simulation methods. Because KPM contains a fine-grained recursive part, it is hard to parallelize with thread-level parallelism such as that of a supercomputer or a cluster computer. This paper focuses on methods to parallelize KPM in the massively parallel environment of a GPU, aiming to achieve high parallelism and greater speedups than recent CPUs. It proposes two implementation methods, called the full map and the sliding window methods, and evaluates their performance on a recent GPU platform. To enlarge the available simulation sizes and at the same time enhance performance, the paper also describes additional optimization techniques that depend on the GPU architecture.

15.

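Research progress in simulation methods for detailed neural networks

张祎晨 黄铁军 《中国图象图形学报》2023,28(2):358-371

Dendrites play an important role in enabling neurons in the brain to perform different information-processing functions. A detailed neuron model is one that models the information processing of a neuron's dendrites and ion channels in fine detail, helping scientists explore the properties of dendritic information processing beyond the limits of experimental conditions. Detailed neural network models composed of detailed neurons can simulate the brain's information processing, and are important for understanding the mechanisms of dendritic information processing and the computational principles underlying the functions of the brain's neural networks. However, simulating detailed neural networks requires a large amount of computation, and how to simulate them efficiently is a challenging research problem. This paper reviews simulation methods for detailed neural networks, introducing the mainstream simulation platforms and core simulation algorithms as well as efficient methods that can further improve simulation efficiency. Representative efficient simulation methods are grouped, by development history and core idea, into three categories: network-level parallel methods, neuron-level parallel methods, and GPU (graphics processing unit)-based parallel simulation methods. The core ideas of each category are summarized, and the details of representative work in each are analyzed. The strengths and weaknesses of the categories are then compared, and some classic methods are summarized. Finally, based on the development trends of efficient simulation methods, future research directions are discussed.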

16.

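A partitioning algorithm based on representative elements

张为华 王鹏 臧斌宇 朱传琪 《计算机学报》2008,31(3):400-410

Partitioning is an optimization technique that assigns the different computations and data of a program to different processors of a parallel processing system, in order to make full use of the system's computing resources and increase the program's processing speed. The quality of the partitioning has a crucial effect on how efficiently the program executes on the parallel system, so partitioning has long been an active research topic in parallel computing. However, certain properties of application programs pose great obstacles to solving the partitioning problem effectively: non-perfectly nested loops, overlap among multiple references to a non-read-only array within one statement, and references to the same array with different strides in different statements. Existing partitioning algorithms cannot automatically partition programs with these features. Although some intuitive partitioning strategies exist for hand-optimizing such programs, they cannot be applied in a compiler to guide automatic partitioning. Based on the characteristics of this class of programs, this paper proposes a partitioning algorithm based on representative elements. The algorithm partitions a program by taking the array references that actually affect the partitioning computation as representative elements, from which the constraints of the various partitions are constructed. In addition, by searching for the maximally consistent data partitioning direction, it effectively reduces the data reorganization communication incurred during partitioning. The algorithm has been implemented in AFT2004 and achieves good results on application programs.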

17.

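Design and study of a parallel map matching algorithm based on physical features of images (total citations: 1; self-citations: 0; citations by others: 1)

付光远 叶雪梅 张文君 《计算机工程与应用》2001,37(11):98-100

This paper introduces the concepts of the mass and the center of gravity of an image and gives the corresponding definitions and computation methods. In real-time map matching applications, the physical or geometric features of the image are first used for coarse correlation matching of the map, which identifies a set of candidate matching points as the search space and greatly narrows the search range. Fine correlation matching of the image is then performed pixel by pixel within this small matching region. A parallel two-stage-search map matching algorithm based on physical image features and suited to SIMD architectures is designed.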
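
The coarse-then-fine search described above can be illustrated with a short Python sketch. The abstract does not give the paper's definitions, so the sketch assumes that the "mass" of an image is the sum of its gray values and that its center of gravity is the intensity-weighted mean pixel coordinate. The coarse stage here filters windows by mass only, and the fine stage uses a sum of absolute differences; both choices, and all function names, are assumptions for illustration.

```python
def mass_and_centroid(img):
    """Assumed definitions: mass = sum of gray values; centroid = weighted mean (x, y)."""
    mass = sum(sum(row) for row in img)
    if mass == 0:
        return 0, (0.0, 0.0)
    cx = sum(x * v for row in img for x, v in enumerate(row)) / mass
    cy = sum(y * v for y, row in enumerate(img) for v in row) / mass
    return mass, (cx, cy)


def window(img, top, left, h, w):
    """Sub-image of size h x w whose upper-left corner is (top, left)."""
    return [row[left:left + w] for row in img[top:top + h]]


def coarse_candidates(scene, template, mass_tol):
    """Keep only window positions whose mass is close to the template's mass."""
    h, w = len(template), len(template[0])
    t_mass, _ = mass_and_centroid(template)
    out = []
    for top in range(len(scene) - h + 1):
        for left in range(len(scene[0]) - w + 1):
            m, _ = mass_and_centroid(window(scene, top, left, h, w))
            if abs(m - t_mass) <= mass_tol:
                out.append((top, left))
    return out


def fine_match(scene, template, candidates):
    """Pixel-wise comparison (sum of absolute differences) over candidates only."""
    h, w = len(template), len(template[0])

    def sad(pos):
        win = window(scene, pos[0], pos[1], h, w)
        return sum(abs(a - b) for rw, rt in zip(win, template) for a, b in zip(rw, rt))

    return min(candidates, key=sad)


scene = [[0, 0, 0, 0],
         [0, 1, 2, 0],
         [0, 3, 4, 0],
         [0, 0, 0, 0]]
template = [[1, 2], [3, 4]]
cands = coarse_candidates(scene, template, mass_tol=3)
best = fine_match(scene, template, cands)
```

In this toy scene the coarse stage cuts nine window positions down to two, so the pixel-wise comparison runs on a small fraction of the search space.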

18.

Grex: An efficient MapReduce framework for graphics processing units

Can Basaran Kyoung-Don Kang 《Journal of Parallel and Distributed Computing》2013

In this paper, we present a new MapReduce framework, called Grex, designed to leverage general-purpose graphics processing units (GPUs) for parallel data processing. Grex provides several new features. First, it supports a parallel split method to tokenize input data of variable sizes, such as words in e-books or URLs in web documents, in parallel using GPU threads. Second, Grex evenly distributes data to map/reduce tasks to avoid data partitioning skews. In addition, Grex provides a new memory management scheme that enhances performance by exploiting the GPU memory hierarchy. Notably, all these capabilities are supported via careful system design without requiring any locks or atomic operations for thread synchronization. The experimental results show that our system is up to 12.4× and 4.1× faster than two state-of-the-art GPU-based MapReduce frameworks for the tested applications.

19.

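A parallel architecture for fast solution of binary linear systems

张博为 吴艳霞 顾国昌 孙霖 《计算机工程》2012,38(11):281-283,286

For the problem of solving systems of linear equations over GF(2), the existing Gaussian elimination algorithm is improved and a hardware parallel architecture for quickly solving for the unknown vector is proposed. Time complexity is reduced by performing elimination and cyclic row shifts in parallel, and the whole algorithm is mapped to hardware by interconnecting basic cells modeled on “smart memory”. Performance analysis shows that, for an n-th order binary augmented matrix whose density is much greater or much less than 0.5, the average computation time of the parallel architecture is about 2n clock cycles, far less than that of the software algorithm (n³/4). Implementation results on non-sparse binary augmented matrices of order 3 to 50 show that the architecture improves performance by about two orders of magnitude over a software implementation.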
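
For reference, the software baseline that such hardware is compared against, Gaussian elimination over GF(2), can be sketched in a few lines of Python. Each augmented row is packed into one integer so that adding two rows is a single XOR. The bit layout and the function name `solve_gf2` are assumptions for illustration; this is the textbook algorithm, not the paper's improved variant or its hardware mapping.

```python
def solve_gf2(rows, n):
    """Solve an n x n linear system over GF(2) by Gauss-Jordan elimination.

    Each row is an int: bit j (0 <= j < n) is the coefficient of x_j and
    bit n is the right-hand side. Returns the solution as a list of n bits,
    or None when the system has no unique solution.
    """
    rows = list(rows)
    if len(rows) != n:
        return None
    for col in range(n):
        # Find a row at or below the diagonal with a 1 in this column.
        pivot = next((r for r in range(col, n) if (rows[r] >> col) & 1), None)
        if pivot is None:
            return None
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col and (rows[r] >> col) & 1:
                rows[r] ^= rows[col]  # XOR is both addition and subtraction in GF(2)
    return [(rows[i] >> n) & 1 for i in range(n)]


# x0 + x1 = 1, x1 = 1, x0 + x1 + x2 = 0 (all arithmetic modulo 2)
system = [0b1011, 0b1010, 0b0111]
solution = solve_gf2(system, 3)
```

The inner elimination loop touches every row for every column, which is where the cubic software cost comes from and what a row-parallel architecture can collapse into a constant number of steps per column.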

20.

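Design and implementation of a GPU-based parallel minimum spanning tree algorithm

郭绍忠 王伟 王磊 《计算机应用研究》2011,28(5):1682-1684

To address the low efficiency of current parallel Prim minimum spanning tree algorithms, this paper analyzes existing parallel Prim algorithms, proposes a compressed adjacency list graph representation suited to the GPU architecture, and develops a GPU-based min-reduction data-parallel primitive. On this basis, a parallel minimum spanning tree algorithm built on the idea of Prim's algorithm is designed and implemented on an NVIDIA GPU. The algorithm uses the primitive to shorten the search time in the key step and thereby achieves high efficiency. Experiments show that it has a clear performance advantage over both the traditional CPU implementation and the algorithm without the primitive.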
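
A serial Python sketch can show where a min-reduction fits into Prim's algorithm. The graph is stored in a CSR-style layout (`offsets`, `targets`, `weights`), which is an assumption about what a compressed adjacency list looks like; the paper's exact format and its GPU primitive are not described in the abstract. The `min(...)` call marks the step that a data-parallel min-reduction would replace.

```python
import math


def prim_mst_weight(offsets, targets, weights, root=0):
    """Total MST weight of a connected undirected graph in CSR-style form."""
    n = len(offsets) - 1
    key = [math.inf] * n      # cheapest known edge linking each vertex to the tree
    in_tree = [False] * n
    key[root] = 0.0
    total = 0.0
    for _ in range(n):
        # Min-reduction over all vertices not yet in the tree.
        u = min((v for v in range(n) if not in_tree[v]), key=key.__getitem__)
        if key[u] == math.inf:
            raise ValueError("graph is not connected")
        in_tree[u] = True
        total += key[u]
        # Relax the edges of u; each edge could be handled by its own thread.
        for e in range(offsets[u], offsets[u + 1]):
            v = targets[e]
            if not in_tree[v] and weights[e] < key[v]:
                key[v] = weights[e]
    return total


# Hypothetical graph with edges 0-1 (1), 0-2 (4), 1-2 (2), 1-3 (5), 2-3 (3).
offsets = [0, 2, 5, 8, 10]
targets = [1, 2, 0, 2, 3, 0, 1, 3, 2, 1]
weights = [1, 4, 1, 2, 5, 4, 2, 3, 3, 5]
mst_weight = prim_mst_weight(offsets, targets, weights)
```

Selecting the next vertex is the only step that looks at every vertex in each iteration, so it dominates the running time on dense inputs and is the natural target for a parallel primitive.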