Similar Documents
20 similar documents found (search time: 953 ms)
1.
In this paper we present a streaming compression scheme for gigantic point sets, including per-point normals. The scheme extends our previous Duodecim approach [21] in two ways. First, we show how to use the approach for compressing and rendering high-resolution iso-surfaces in volumetric data sets. Second, we use deferred shading of point primitives to considerably improve rendering quality. Iso-surface reconstruction is performed in a hexagonal close packing (HCP) grid, into which the initial data set is resampled. Normals are resampled from the initial domain using volumetric gradients. With incremental encoding, only slightly more than 3 bits per surface point and 5 bits per surface normal are required at high fidelity. The compressed data stream can be decoded on the graphics processing unit (GPU). Decoded point positions are saved in graphics memory and then sent through the GPU again to render point primitives. In this way, gigantic data sets can be rendered at high quality directly from their compressed representation in local GPU memory at interactive frame rates (see Fig. 1).

2.
A particle system for interactive visualization of 3D flows
We present a particle system for interactive visualization of steady 3D flow fields on uniform grids. For the number of particles we target, particle integration needs to be accelerated, and per-frame transfers of these sets for rendering must be avoided. To fulfill these requirements, we exploit features of recent graphics accelerators to advect particles in the graphics processing unit (GPU), saving particle positions in graphics memory and then sending these positions through the GPU again to obtain images in the frame buffer. This approach allows interactive streaming and rendering of millions of particles, and it enables virtual exploration of high-resolution fields in a way similar to real-world experiments. The ability to display the dynamics of large particle sets using visualization options like shaded points or oriented texture splats provides an effective means for visual flow analysis that is far beyond existing solutions. For each particle, flow quantities like vorticity magnitude and λ2 are computed and displayed. Built upon a previously published GPU implementation of a sorting network, visibility sorting of transparent particles is implemented. To provide additional visual cues, the GPU constructs and displays visualization geometry like particle lines and stream ribbons.
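As a concrete illustration of the advection step described above, here is a minimal CUDA sketch of one way to integrate particles entirely on the GPU. The kernel name, the float4 particle layout, and the velocity texture velTex are our assumptions, not the authors' code:

```cuda
#include <cuda_runtime.h>

// One thread per particle; positions stay resident in GPU memory.
__global__ void advectParticles(float4* pos, int n, float dt,
                                cudaTextureObject_t velTex)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 p = pos[i];
    // Hardware-trilinear lookup of the flow velocity at the particle
    // (assumes velTex was created with cudaFilterModeLinear).
    float4 v = tex3D<float4>(velTex, p.x, p.y, p.z);
    // Simple explicit Euler step; a production tracer would use RK3/RK4.
    p.x += dt * v.x;
    p.y += dt * v.y;
    p.z += dt * v.z;
    pos[i] = p;
}
```

A launch such as `advectParticles<<<(n + 255) / 256, 256>>>(d_pos, n, dt, velTex);` keeps positions resident in graphics memory, matching the stated goal of avoiding transfers before rendering.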

3.
唐勇, 邵绪强, 吕梦雅. 《计算机工程》, 2008, 34(11): 243-245
This paper analyzes various techniques for simulating dynamic fur based on Lengyel's method. By neglecting the shear and folding forces in the mass-spring system, and assuming that the springs providing displacement forces do not stretch, a "mass-point/rigid-rod" system is obtained and introduced into Lengyel's method; combined with a nonlinear-interpolation method for computing the offset of each shell layer, this is used to control each s…

4.
Graphics processing units (GPUs) have recently attracted attention for scientific applications such as particle simulations. This is partially driven by low commodity pricing of GPUs, but also by recent toolkit and library developments that make them more accessible to scientific programmers. We discuss the application of GPU programming to two significantly different paradigms: regular mesh field equations with unusual boundary conditions, and graph analysis algorithms. The differing optimization techniques required for these two paradigms cover many of the challenges faced when developing GPU applications. We discuss the relevance of these application paradigms to simulation engines and games. GPUs were aimed primarily at the accelerated graphics market, but since this market is often closely coupled to advanced game products, it is interesting to speculate about the future of fully integrated accelerator hardware for both visualization and simulation combined. As well as reporting the speed-up performance on selected simulation paradigms, we discuss suitable data-parallel algorithms and present code examples for exploiting GPU features like large numbers of threads and localized texture memory. We find a surprising variation in the performance that can be achieved on GPUs for our applications and discuss how these findings relate to known effects in parallel computing such as memory-speed-related super-linear speedup. Copyright © 2009 John Wiley & Sons, Ltd.
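For the first paradigm (regular mesh field equations), a representative data-parallel pattern is one thread per grid cell with a shared-memory tile playing the role of the "localized texture memory" the abstract mentions. The sketch below is ours, not the paper's code example; jacobiStep, TILE, and the clamped boundary handling are assumptions, and nx and ny are assumed to be multiples of TILE:

```cuda
#include <cuda_runtime.h>

#define TILE 16  // block edge; nx and ny assumed multiples of TILE

__global__ void jacobiStep(const float* __restrict__ in, float* out,
                           int nx, int ny)
{
    __shared__ float tile[TILE + 2][TILE + 2];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

    tile[ty][tx] = in[y * nx + x];
    // Halo loads with clamped boundaries (a stand-in for the paper's
    // "unusual boundary conditions").
    if (threadIdx.x == 0)        tile[ty][0]        = in[y * nx + max(x - 1, 0)];
    if (threadIdx.x == TILE - 1) tile[ty][TILE + 1] = in[y * nx + min(x + 1, nx - 1)];
    if (threadIdx.y == 0)        tile[0][tx]        = in[max(y - 1, 0) * nx + x];
    if (threadIdx.y == TILE - 1) tile[TILE + 1][tx] = in[min(y + 1, ny - 1) * nx + x];
    __syncthreads();

    // Four-point average: one Jacobi relaxation sweep of Laplace's equation.
    out[y * nx + x] = 0.25f * (tile[ty][tx - 1] + tile[ty][tx + 1] +
                               tile[ty - 1][tx] + tile[ty + 1][tx]);
}
```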

5.
This work proposes an electro-mechanical simulator of cardiac tissue whose main feature is its low computational cost, which is necessary for real-time simulations and on-the-fly applications. To achieve this, we use cellular automata and mass-spring systems to model cardiac behavior, and we parallelize the code to run on a graphics processing unit (GPU) with the Compute Unified Device Architecture (CUDA). Even running sequentially, our simulator is considerably faster than traditional partial differential equation simulators. In addition, we performed different load tests to evaluate the code's behavior on GPUs, and we identified its potential and its bottlenecks.
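The cellular-automaton half of such a simulator maps naturally to one GPU thread per tissue cell. Below is a minimal CUDA sketch under our own assumptions: a three-state rest/excited/refractory automaton on a 2D grid, with the state names, firing rule, and kernel name all illustrative rather than the authors' model:

```cuda
#include <cuda_runtime.h>

enum State { REST = 0, EXCITED = 1, REFRACTORY = 2 };

__global__ void caStep(const int* cur, int* next, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;
    int s = cur[y * nx + x];

    if (s == EXCITED)         next[y * nx + x] = REFRACTORY;
    else if (s == REFRACTORY) next[y * nx + x] = REST;
    else {
        // Count excited 4-neighbours (clamped at the borders).
        int n = 0;
        if (x > 0      && cur[y * nx + x - 1] == EXCITED) n++;
        if (x < nx - 1 && cur[y * nx + x + 1] == EXCITED) n++;
        if (y > 0      && cur[(y - 1) * nx + x] == EXCITED) n++;
        if (y < ny - 1 && cur[(y + 1) * nx + x] == EXCITED) n++;
        next[y * nx + x] = (n >= 1) ? EXCITED : REST;
    }
}
```

The mechanical mass-spring half would run as a second kernel over the same grid, with the double-buffered cur/next arrays swapped by the host each step.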

6.
Graphics processing units (GPUs), originally designed for graphics rendering, have emerged as massively parallel "co-processors" to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can substantially accelerate compute-intensive simulation science applications. In this study, we describe the implementation of an incompressible-flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA's Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm that solves the incompressible fluid flow equations. Kernels are implemented in different memory spaces on the GPU depending on their arithmetic intensity; this memory-hierarchy-specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single- and double-precision computational performance on the GPU. Using a quad-GPU platform for single-precision computations, we observe a two-orders-of-magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective, small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.
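The projection algorithm mentioned above splits each time step across separate kernels (divergence, pressure solve, velocity correction). As one hedged example, the final correction step might look like the CUDA sketch below; the kernel name, central-difference layout, and collocated float2 velocity grid are our simplifications, not the authors' implementation:

```cuda
#include <cuda_runtime.h>

// Velocity correction u <- u - (dt/rho) * grad(p) on a collocated 2D grid.
__global__ void projectVelocity(float2* vel, const float* p,
                                int nx, int ny, float dt, float rho, float h)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return; // interior only

    int id = j * nx + i;
    float scale = dt / (rho * 2.0f * h);          // central differences
    vel[id].x -= scale * (p[id + 1]  - p[id - 1]);
    vel[id].y -= scale * (p[id + nx] - p[id - nx]);
}
```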

7.
Parallel Computing, 2007, 33(10-11): 663-684
We present cache-efficient algorithms for scientific computations using graphics processing units (GPUs). Our approach is based on mapping the nested loops in the numerical algorithms to the texture-mapping hardware and efficiently utilizing GPU caches. This mapping exploits the inherent parallelism, pipelining, and high memory bandwidth of GPUs. We further improve the performance of numerical algorithms by accounting for the relative memory addresses accessed at data elements in nested loops. Based on the similarity of the memory accesses performed at the data elements of the input array, we decompose the input arrays into sub-arrays with similar memory access patterns and process each sub-array separately for faster execution. Our approach achieves high memory performance on GPUs by tiling the computation and thereby improving cache efficiency. Overall, our formulation of GPU-based algorithms extends the current graphics runtime APIs without exposing the underlying hardware complexity to the programmer. This makes it possible to achieve portability and higher performance across different GPUs. We use this approach to improve the performance of GPU-based sorting, fast Fourier transform, and dense matrix multiplication algorithms. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we observe a 2-10× improvement in performance.
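The paper predates CUDA and works through the texture-mapping hardware, but the tiling idea it describes is the same one behind the classic shared-memory-tiled matrix multiply. As a modern analogy only (not the authors' implementation; assumes square matrices with n a multiple of TILE):

```cuda
#include <cuda_runtime.h>

#define TILE 16

__global__ void matMulTiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // March tile-by-tile along the shared dimension: each tile is loaded
    // once into fast on-chip memory and reused TILE times, which is the
    // cache-efficiency argument of the paper in CUDA clothing.
    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```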

8.
In this paper, we articulate a novel plastic phase-field (PPF) method that tightly couples the phase-field with plastic treatment to efficiently simulate ductile fracture with GPU optimization. At the theoretical level of physically based modeling and simulation, our PPF approach assumes that the fracture sensitivity of the material increases with plastic strain accumulation. Accordingly, we first develop a hardening-related fracture toughness function for phase-field evolution. Second, we follow the associative flow rule and adopt a novel degraded von Mises yield criterion. In this way, we establish the tight coupling of the phase-field and plastic treatment, with which our PPF method exhibits distinct elastoplasticity, necking, and fracture characteristics during ductile fracture simulation. At the numerical level, towards GPU optimization, we further devise an advanced parallel framework that takes full advantage of the hierarchical GPU architecture. Our strategy dramatically enhances the computational efficiency of preprocessing and phase-field evolution for our PPF method within the material point method (MPM). In our extensive experiments on a variety of benchmarks, the new method reaches a 1.56× speedup over the baseline GPU MPM. Finally, our comprehensive simulation results confirm that the PPF method can efficiently and realistically simulate complex ductile fracture phenomena in 3D interactive graphics and animation.
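The central coupling here is a fracture toughness that depends on accumulated plastic strain. The abstract does not give the function, so the following CUDA device function is purely an illustration of the idea; the exponential-softening form and its constants are our assumptions, not the paper's actual model:

```cuda
// Illustrative only: toughness softens exponentially with plastic strain,
// so heavily yielded material becomes more fracture-prone.
__device__ float fractureToughness(float Gc0,    // pristine toughness
                                   float alphaP, // accumulated plastic strain
                                   float k)      // sensitivity parameter
{
    return Gc0 * expf(-k * alphaP);
}
```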

9.
Applications that exploit the architectural details of high-performance computing (HPC) systems have become increasingly valuable in academia and industry over the past two decades. The most important hardware development of the last decade in HPC has been the general-purpose graphics processing unit (GPGPU), a class of massively parallel devices that now contributes the majority of computational power in the top 500 supercomputers. As these systems grow, small costs such as latency, due to the fixed cost of memory accesses and communication, accumulate in a large simulation and become a significant barrier to performance. The swept time-space decomposition rule is a communication-avoiding technique for time-stepping stencil update formulas that attempts to reduce latency costs. This work extends the swept rule by targeting heterogeneous CPU/GPU architectures representative of current and future HPC systems. We compare our approach to a naive decomposition scheme with two test equations, using an MPI+CUDA pattern on 40 processes over two nodes containing one GPU. The swept rule produces speedup factors of 1.9 to 23 for the heat equation and 1.1 to 2.0 for the Euler equations, using the same processors and work distribution, and with the best possible configurations. These results show the potential effectiveness of the swept rule for different equations and numerical schemes on massively parallel compute systems that incur substantial latency costs.


10.
Computation on programmable graphics hardware
GPUs have evolved into powerful and flexible streaming processors with fully programmable floating-point pipelines and tremendous aggregate computational power and memory bandwidth. With these advances, modern GPUs can now perform more functions than the specific graphics computations for which they were designed. This article describes approaches to using GPU processing power to accelerate traditionally CPU-based tasks. We discuss some important characteristics of algorithms that make them good candidates for GPU acceleration. We discuss a specific GPU image-processing application that is a common postprocess for many physically based rendering systems.

11.
We port a high-order finite-element application that performs numerical simulation of seismic wave propagation resulting from earthquakes in the Earth to NVIDIA GeForce 8800 GTX and GTX 280 graphics cards using CUDA. This application runs in single precision and is therefore a good candidate for implementation on current GPU hardware, which either does not support double precision or supports it at the cost of reduced performance. We discuss and compare two implementations of the code: one that has maximum efficiency but is limited by the memory size of the card, and one that can handle larger problems but is less efficient. We use a coloring scheme to efficiently handle summation operations over nodes on a topology with variable valence. We perform several numerical tests and performance measurements and show that in the best case we obtain a speedup of 25.
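The coloring scheme avoids write conflicts when many elements add contributions into shared nodes: elements are grouped into colors so that no two elements of the same color touch the same node, and the host launches one kernel per color. A minimal CUDA sketch under our own assumptions (4-node elements, illustrative array layout and names, not the authors' code):

```cuda
#include <cuda_runtime.h>

// One thread per element of the current colour. Because no two elements
// of one colour share a node, the plain += below needs no atomics; the
// host loops over colours, launching this kernel once per colour.
__global__ void assembleColor(const int*   elemNodes,  // 4 node ids per element
                              const float* elemForce,  // 4 contributions per element
                              float*       nodeForce,  // global accumulator
                              const int*   colorElems, // element ids in this colour
                              int nColorElems)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nColorElems) return;
    int e = colorElems[t];
    for (int v = 0; v < 4; ++v)
        nodeForce[elemNodes[4 * e + v]] += elemForce[4 * e + v];
}
```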

12.
The numerical solution of two-layer shallow-water systems is required to accurately simulate stratified fluids, which are ubiquitous in nature: they appear in atmospheric flows, ocean currents, oil spills, etc. Moreover, implementing numerical schemes to solve these models in realistic scenarios imposes huge demands on computing power. In this paper, we tackle the acceleration of these simulations on triangular meshes by exploiting the combined power of several CUDA-enabled GPUs in a GPU cluster. For that purpose, we present an improvement of a path-conservative Roe-type finite volume scheme that is especially suitable for GPU implementation, and we develop a distributed implementation of this scheme that uses CUDA and MPI to exploit the potential of a GPU cluster. This implementation overlaps MPI communication with CPU-GPU memory transfers and GPU computation to increase efficiency. Several numerical experiments, performed on a cluster of modern CUDA-enabled GPUs, show the efficiency of the distributed solver.
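The overlap the authors describe follows a common MPI+CUDA pattern: update the interior of the local subdomain on one stream while boundary data is copied to the host, exchanged with neighbors, and copied back on another. A hedged host-side sketch, where updateInterior/updateHalo are hypothetical kernels and the two-neighbor 1D decomposition is a simplification of the paper's triangular meshes:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

void step(float* d_state, float* d_halo, float* h_halo, int haloBytes,
          int left, int right, cudaStream_t compute, cudaStream_t copy)
{
    // 1. Launch the bulk update asynchronously on the compute stream.
    // updateInterior<<<grid, block, 0, compute>>>(d_state);

    // 2. Meanwhile, move boundary data to the host and exchange it via MPI.
    cudaMemcpyAsync(h_halo, d_halo, haloBytes, cudaMemcpyDeviceToHost, copy);
    cudaStreamSynchronize(copy);
    MPI_Sendrecv_replace(h_halo, haloBytes, MPI_BYTE, right, 0, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpyAsync(d_halo, h_halo, haloBytes, cudaMemcpyHostToDevice, copy);

    // 3. Only then run the boundary update that needs the fresh halo.
    cudaStreamSynchronize(copy);
    // updateHalo<<<gridB, blockB, 0, compute>>>(d_state, d_halo);
    cudaStreamSynchronize(compute);
}
```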

13.
In embedded graphics systems, the graphics processing unit (GPU) also consumes a significant amount of frame buffer memory bandwidth. Frame buffer compression is widely adopted to alleviate both memory bandwidth and power consumption for the display controller, but it has rarely been applied to the GPU's own consumption. This paper proposes a real-time, fixed-ratio frame buffer compression technique for RISC-V processor-based embedded graphics systems. The advantage of the proposed method is that the fixed-ratio compressed frame buffer can be adopted directly as an input texture by the GPU. The proposed architecture (called an FBC coprocessor) is a hardware extension to the RISC-V microprocessor that reduces frame buffer memory bandwidth. The results show that the coprocessor consumes only 1% additional silicon area in the whole system while reducing bandwidth consumption by 72.64%. A prototype system-on-a-chip indicates that the proposed FBC coprocessor can reduce GPU power consumption by up to 12.7% for an example automotive application.

14.
Task parallelism is an attractive approach to automatically load-balancing the computation in a parallel system and adapting to the dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared- and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work-stealing algorithm for CPU-GPU systems, taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain. Copyright © 2016 John Wiley & Sons, Ltd.

15.
Verification has grown to dominate the cost of electronic system design, consuming about 60% of design effort. Among several verification techniques, logic simulation remains the major one, and speeding it up yields great savings and shorter time-to-market. We parallelize logic simulation using graphics processing units (GPUs). In the past, GPUs were special-purpose application accelerators, suitable only for conventional graphics applications. The new generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. We develop a parallel cycle-based logic simulation algorithm that uses And-Inverter Graphs (AIGs) as the design representation. AIGs have proven to be an effective representation for various design automation applications, and we obtain similar benefits when speeding up logic simulation. We develop two clustering algorithms that partition the gates in a design into independent blocks. Our algorithms exploit the massively parallel GPU architecture, featuring thousands of concurrent threads, fast memory, and memory coalescing, for optimization. We demonstrate up to 5× and 21× speedups on several benchmarks using our simulation system with the first and second clustering algorithms, respectively. Our work ultimately results in a significant reduction in the overall design cycle.
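In cycle-based AIG simulation, every gate is a two-input AND whose fanins may be inverted, so once gates are clustered into dependency-free groups an entire group can be evaluated in one kernel launch. The CUDA sketch below is our illustration, not the paper's code; the literal encoding (gateId << 1 | inverted) and the 32-patterns-per-word bit packing are assumptions:

```cuda
#include <cuda_runtime.h>

__global__ void evalAigLevel(const int* lit0, const int* lit1,
                             const unsigned* values, unsigned* out,
                             const int* levelGates, int nLevelGates)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nLevelGates) return;
    int g = levelGates[t];

    // 32 simulation patterns per word: bit-parallel evaluation.
    unsigned a = values[lit0[g] >> 1];
    if (lit0[g] & 1) a = ~a;           // inverted fanin 0
    unsigned b = values[lit1[g] >> 1];
    if (lit1[g] & 1) b = ~b;           // inverted fanin 1
    out[g] = a & b;                    // the AND node itself
}
```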

16.
Parallel Computing, 2014, 40(5-6): 86-99
Simulation of in vivo cellular processes with the reaction–diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems.
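The peer-to-peer transfers mentioned above let neighboring subvolumes exchange lattice boundary data without staging through host memory. A hedged sketch of that mechanic in isolation (all names are ours and error checking is omitted):

```cuda
#include <cuda_runtime.h>

void exchangeBoundary(int devA, void* dstOnA, int devB, const void* srcOnB,
                      size_t bytes, cudaStream_t stream)
{
    int access = 0;
    cudaDeviceCanAccessPeer(&access, devA, devB);
    if (access) {
        cudaSetDevice(devA);
        cudaDeviceEnablePeerAccess(devB, 0);  // once per pair in practice
    }
    // Direct GPU-to-GPU copy when peer access is on; CUDA otherwise
    // stages the transfer through host memory transparently.
    cudaMemcpyPeerAsync(dstOnA, devA, srcOnB, devB, bytes, stream);
}
```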

17.
Membrane computing is an emerging research area that studies the behavior of living cells to define bio-inspired computing devices, also called P systems. Such devices provide polynomial-time solutions to NP-complete problems by trading space for time. The efficient simulation of P systems poses three major challenges: the intrinsic massive parallelism of P systems, an exponential computational workspace, and a workload that is not floating-point intensive. This paper analyzes the simulation of a family of recognizer P systems with active membranes that solves the satisfiability problem in linear time on three different architectures: a shared-memory multiprocessor, a distributed-memory system, and a manycore graphics processing unit (GPU). For efficient handling of the exponential workspace created by the P-system computation, we enable different data policies on these architectures to increase memory bandwidth and exploit data locality through tiling. The parallelism inherent to the target P system is also managed on each architecture to demonstrate that GPUs offer a valid alternative for high-performance computing at a considerably lower cost. Our results lead to execution time improvements exceeding 310× and 78×, respectively, at a much lower cost.

18.
Real-time fluid simulation based on the multigrid method
We implement the multigrid method on the GPU and use it to improve 2D real-time fluid simulation, making fuller use of the GPU's parallel computing power. Using four grid levels, together with render-to-texture computation and a framebuffer-extension texture-management scheme, the method raises the utilization of the graphics hardware. Experimental comparisons show that at the same frame rate the method improves the accuracy of GPU real-time fluid simulation; in particular, on larger problems it can be several times faster than GPU real-time fluid simulation of equal accuracy based on ordinary iterative methods.
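The grid hierarchy in such a solver is driven by two transfer operations, restriction of the residual to a coarser level and prolongation back. The paper implements them via render-to-texture on four levels; the CUDA kernel below is only an analogy of the restriction step under our assumptions (2×2 averaging, power-of-two grids, illustrative names):

```cuda
#include <cuda_runtime.h>

__global__ void restrictResidual(const float* fine, float* coarse,
                                 int cnx, int cny)   // coarse grid size
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= cnx || j >= cny) return;

    int fi = 2 * i, fj = 2 * j, fnx = 2 * cnx;
    // Simple 2x2 averaging; full weighting would use a 3x3 stencil instead.
    coarse[j * cnx + i] = 0.25f * (fine[fj * fnx + fi]
                                 + fine[fj * fnx + fi + 1]
                                 + fine[(fj + 1) * fnx + fi]
                                 + fine[(fj + 1) * fnx + fi + 1]);
}
```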

19.
Reactive force field (ReaxFF), a recent and novel bond-order potential, allows reactive molecular dynamics (ReaxFF MD) simulations to model larger and more complex molecular systems involving chemical reactions, compared with computation-intensive quantum mechanical methods. However, ReaxFF MD can be approximately 10-50 times slower than classical MD due to its explicit modeling of bond forming and breaking, its dynamic charge equilibration at each time-step, and its time-step being one order of magnitude smaller than that of classical MD, all of which pose significant computational challenges to reaching spatio-temporal scales of nanometers and nanoseconds. Recent advances in graphics processing units (GPUs) not only give GPU-enabled MD programs highly favorable performance compared with CPU implementations but also offer an opportunity to cope with the computing-power and memory demands that ReaxFF MD imposes on computer hardware. In this paper, we present the algorithms of GMD-Reax, the first GPU-enabled ReaxFF MD program, with performance significantly surpassing CPU implementations on desktop workstations. The performance of GMD-Reax has been benchmarked on a PC equipped with an NVIDIA C2050 GPU for coal pyrolysis simulation systems ranging from 1378 to 27,283 atoms. GMD-Reax achieved speedups as high as 12 times over van Duin et al.'s FORTRAN codes in LAMMPS on 8 CPU cores and 6 times over LAMMPS' C codes based on PuReMD, in terms of simulation time per time-step averaged over 100 steps. GMD-Reax could serve as a new and efficient computational tool for exploring very complex molecular reactions via ReaxFF MD simulation on desktop workstations.

20.
This paper addresses the speedup of the numerical solution of shallow-water systems in 2D domains using modern graphics processing units (GPUs). A first-order, well-balanced finite volume numerical scheme for 2D shallow-water systems is considered. The potential data parallelism of this method is identified, and the scheme is efficiently implemented on GPUs for one-layer shallow-water systems. Numerical experiments performed on several GPUs show the high efficiency of the GPU solver in comparison with a highly optimized implementation of a CPU solver.
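Finite-volume schemes of this kind parallelize with one thread per cell, each gathering its neighbors' states and applying a numerical flux. The sketch below is a drastic simplification, not the paper's well-balanced scheme: a 1D one-layer system with a local Lax-Friedrichs flux, a flat bottom, and illustrative names (h = depth, q = discharge, packed as float2):

```cuda
#include <cuda_runtime.h>

__global__ void swStep(const float2* U, float2* Unew, int n,
                       float dx, float dt)   // dt assumed CFL-stable
{
    const float g = 9.81f;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= n - 1) return;        // interior cells only

    // Physical flux F(U) = (q, q^2/h + g h^2 / 2).
    auto flux = [g](float2 u) {
        return make_float2(u.y, u.y * u.y / u.x + 0.5f * g * u.x * u.x);
    };
    float a = dx / dt;                        // Lax-Friedrichs dissipation speed
    float2 fl = flux(U[i - 1]), fr = flux(U[i]);
    float2 Fm = make_float2(0.5f * (fl.x + fr.x) - 0.5f * a * (U[i].x - U[i-1].x),
                            0.5f * (fl.y + fr.y) - 0.5f * a * (U[i].y - U[i-1].y));
    fl = flux(U[i]); fr = flux(U[i + 1]);
    float2 Fp = make_float2(0.5f * (fl.x + fr.x) - 0.5f * a * (U[i+1].x - U[i].x),
                            0.5f * (fl.y + fr.y) - 0.5f * a * (U[i+1].y - U[i].y));

    // Conservative update: U_i^{n+1} = U_i^n - dt/dx (F_{i+1/2} - F_{i-1/2}).
    Unew[i] = make_float2(U[i].x - dt / dx * (Fp.x - Fm.x),
                          U[i].y - dt / dx * (Fp.y - Fm.y));
}
```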
