Similar Documents
1.
As the complexity of rendering tasks and the scale of rendering data keep growing, distributed parallel rendering on a PC cluster is a common solution. Sort-last distributed parallel rendering offers good scalability and load balancing, but because of the image-composition bottleneck its rendering speed cannot meet real-time requirements. This paper proposes a method that uses a network processing unit (NPU) for fast hardware image composition and develops a sort-last parallel rendering system, NPUPR. Experiments show that, with four rendering nodes, the NPU-based hardware image composition method renders four times faster than the direct-send compositing algorithm. The paper also gives a scheme for scaling the system to more rendering nodes by adding network processing units; analysis shows that the image-composition performance of the system does not degrade noticeably as the number of nodes increases.
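The direct-send compositing baseline mentioned above can be illustrated with a small, single-process simulation (not the NPUPR implementation; node count, image size, and pixel representation are invented for illustration): each of the N rendering nodes owns 1/N of the final image, receives that region from every other node, and depth-composites the fragments.

```python
# Simplified simulation of direct-send sort-last compositing (illustrative only).
# Each "node" renders a full-resolution RGBZ image; node i owns strip i of the
# final image and composites the fragments of that strip from every node.
import random

WIDTH, HEIGHT, NODES = 8, 8, 4           # toy image and node count
STRIP = HEIGHT // NODES                  # each node owns STRIP rows

def render(node_id):
    """Pretend-render: every pixel is (colour tag, depth) with a random depth."""
    return [[(node_id, random.random()) for _ in range(WIDTH)] for _ in range(HEIGHT)]

def composite_strip(fragments):
    """Depth-composite fragments of one strip: keep the pixel with the smallest depth."""
    rows = len(fragments[0])
    return [[min((frag[y][x] for frag in fragments), key=lambda p: p[1])
             for x in range(WIDTH)] for y in range(rows)]

# 1. Every node renders its local view of the whole image.
images = [render(i) for i in range(NODES)]

# 2. Direct send: node i receives rows [i*STRIP, (i+1)*STRIP) from every node ...
final = []
for i in range(NODES):
    fragments = [img[i * STRIP:(i + 1) * STRIP] for img in images]
    # 3. ... and composites them locally; concatenating the strips gives the final image.
    final.extend(composite_strip(fragments))

assert len(final) == HEIGHT
```

In this scheme every node both sends and receives roughly one full image's worth of pixels per frame, so the composition cost does not shrink as nodes are added, which is the step the NPU-based compositor offloads to hardware.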

2.
Excessive computational cost is the bottleneck in analyzing the radiation characteristics of airborne antennas. To address this, a parallel UTD (uniform theory of diffraction) computation and rendering algorithm based on equal triangulation is proposed. The algorithm simplifies the model with a semi-automatic model-frame extraction method based on spatial octree partitioning. An omnidirectional equal-triangulation load-balancing scheme is given, and parallel rendering is performed with a sort-last parallel graphics rendering framework and the binary-swap image compositing algorithm; the algorithm is implemented on a computer cluster. Experimental results show that the algorithm effectively saves computation time, improves rendering efficiency, and satisfies the requirements of radiation-pattern analysis for airborne antennas on large, complex aircraft.
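The binary-swap compositing step named above can be sketched with a toy, single-process simulation (power-of-two node count assumed; the 1-D "image", node count, and pixel layout are invented for illustration): in each of log2(N) rounds, paired nodes split their current region in half, exchange halves, and depth-composite the half they keep, so every node ends up owning a fully composited 1/N of the image.

```python
# Toy simulation of binary-swap image compositing for a power-of-two node count.
# Pixels are (colour, depth) tuples; compositing keeps the nearer (smaller) depth.
import random

ROWS, NODES = 16, 4                      # NODES must be a power of two

def render(node_id):
    return [(node_id, random.random()) for _ in range(ROWS)]   # 1-D "image" of RGBZ pixels

def composite(a, b):
    return [pa if pa[1] <= pb[1] else pb for pa, pb in zip(a, b)]

images = [render(i) for i in range(NODES)]
regions = [(0, ROWS)] * NODES            # each node starts owning the whole image

step = 1
while step < NODES:                      # log2(NODES) exchange rounds
    for i in range(NODES):
        partner = i ^ step               # pair nodes whose ids differ in one bit
        lo, hi = regions[i]
        mid = (lo + hi) // 2
        # The lower-id node keeps the first half, the partner keeps the second half;
        # each composites its kept half with the half received from the partner.
        keep = (lo, mid) if i < partner else (mid, hi)
        a, b = keep
        images[i][a:b] = composite(images[i][a:b], images[partner][a:b])
        regions[i] = keep
    step *= 2

# After the rounds, node i holds the final pixels for regions[i]; gathering these
# disjoint regions assembles the complete composited image.
```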

3.
A Survey of Parallel Polygon Rendering Techniques   (total citations: 12; self-citations: 4; citations by others: 12)
Polygon rendering is the most widely used method of computer graphics rendering, and parallel techniques play an important role in improving the performance of polygon rendering systems. Parallel polygon rendering rests on the inherent parallelism of the algorithm; according to how the parallel pipeline is organized, it can be divided into three classes: full-image depth composition, front-distributed tile composition, and middle-distributed tile composition. Load balancing and image composition are the key issues affecting the performance of polygon rendering systems. Parallel polygon rendering systems can be implemented with dedicated graphics hardware, on parallel machines, or on clusters. Drawing on the authors' own work, this paper discusses parallel polygon rendering techniques.

4.
This paper describes a system for modeling, animating, previewing and rendering articulated objects. The system has a modeler of objects that consist of joints and segments. The animator interactively positions the articulated object in its stick, control-vertex, or rectangular-prism representation and previews the motion in real time. The data representing the motion and the models is then sent to a multicomputer [iPSC/2 Hypercube (Intel)]. The frames are rendered in parallel, exploiting the coherence between successive frames and thus cutting down the rendering time significantly. Our main aim is to make a detailed study of the rendering of a sequence of 3D scenes. The results show that, owing to the inherent correlation between the 3D scenes, efficient rendering can be achieved.

5.
To study parallel graphics rendering techniques, this paper introduces the graphics rendering pipeline, analyzes its inherent parallelism, and examines the ways parallel rendering can be implemented, including pipeline parallelism, data parallelism, and job parallelism, as well as front-distributed, middle-distributed, and rear-distributed tile composition; it then discusses the main problems facing parallel rendering and its development trends.

6.
NPU-based image compositing in a distributed visualization system   (total citations: 1; self-citations: 0; citations by others: 1)
This paper describes the first use of a network processing unit (NPU) to perform hardware-based image composition in a distributed rendering system. The image composition step is a notorious bottleneck in a clustered rendering system. Furthermore, image compositing algorithms do not necessarily scale as data size and number of nodes increase. Previous researchers have addressed the composition problem via software and/or custom-built hardware. We used the heterogeneous multicore computation architecture of the Intel IXP28XX NPU, a fully programmable commercial off-the-shelf (COTS) technology, to perform the image composition step. With this design, we have attained a nearly four-times performance increase over traditional software-based compositing methods, achieving sustained compositing rates of 22-28 fps on a 1,024 × 1,024 image. This system is fully scalable with a negligible penalty in frame rate, is entirely COTS, and is flexible with regard to operating system, rendering software, graphics cards, and node architecture. The NPU-based compositor has the additional advantage of being a modular compositing component that is eminently suitable for integration into existing distributed software visualization packages.
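A back-of-the-envelope estimate (assuming 4-byte colour plus 4-byte depth per pixel, which is an assumption rather than a figure from the paper) shows the data rate such a compositor must sustain at the reported frame sizes and rates.

```python
# Rough estimate of the data rate an image compositor must absorb
# (assumes 4-byte RGBA + 4-byte depth per pixel; purely illustrative).
width, height = 1024, 1024
bytes_per_pixel = 4 + 4                  # RGBA8888 colour + 32-bit depth
nodes = 4                                # render nodes feeding the compositor
fps = 25                                 # mid-point of the reported 22-28 fps

frame_bytes = width * height * bytes_per_pixel
aggregate_rate = frame_bytes * nodes * fps           # bytes/second into the compositor
print(f"per-frame input per node  : {frame_bytes / 2**20:.1f} MiB")
print(f"aggregate compositing rate: {aggregate_rate / 2**20:.0f} MiB/s")
```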

7.
Efficient checkpointing and resumption of multicomputer applications is essential if multicomputers are to support time-sharing and the automatic resumption of jobs after a system failure. We present a checkpointing scheme that is transparent, imposes overhead only during checkpoints, requires minimal message logging, and allows for quick resumption of execution from a checkpointed image. Furthermore, the checkpointing algorithm allows each processor p to continue running the application being checkpointed except during the time that p is actively taking a local snapshot, and requires no global stop or freeze of the multicomputer. Since checkpointing multicomputer applications poses requirements different from those posed by checkpointing general distributed systems, existing distributed checkpointing schemes are inadequate for multicomputer checkpointing. Our checkpointing scheme makes use of special properties of wormhole routing networks to satisfy this new set of requirements.

8.
One issue which is central in developing a general purpose FFT subroutine on a distributed memory parallel machine is the data distribution. It is possible that different users would like to use the FFT routine with different data distributions. Thus there is a need to design FFT schemes on distributed memory parallel machines which can support a variety of data distributions. In this paper we present an FFT implementation on a distributed memory parallel machine which works for a number of data distributions commonly encountered in scientific applications. We have also addressed the problem of rearranging the data after computing the FFT. We have evaluated the performance of our implementation on a distributed memory parallel machine, the Intel iPSC/860.
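The data-distribution problem can be made concrete with a small sketch of two layouts an FFT routine might be asked to support; the helper names and sizes are hypothetical and not taken from the paper.

```python
# Two common data distributions for a length-N array over P processors.
# Each index map takes a global index to (owning processor, local index). Illustrative only.
N, P = 16, 4

def block_owner(i, n=N, p=P):
    """Contiguous blocks of n/p elements per processor."""
    block = n // p
    return i // block, i % block

def cyclic_owner(i, p=P):
    """Elements dealt out round-robin, one at a time."""
    return i % p, i // p

for name, owner in (("block", block_owner), ("cyclic", cyclic_owner)):
    layout = {}
    for i in range(N):
        proc, local = owner(i)
        layout.setdefault(proc, []).append(i)
    print(name, layout)
# An FFT supporting several layouts must either redistribute the data into its
# preferred layout first, or choose a butterfly ordering that matches the layout.
```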

9.
Parallel Computing, 1997, 23(12): 1839-1850
A 3D polygon rendering system conforming to the OpenGL specification is implemented on a PVM (parallel virtual machine). The system is targeted at a low-speed network of low-cost serial computers. In our parallel rendering algorithm, image space is decomposed into exactly as many regions as there are processors, which reduces the volume of communication and the number of polygons that need to be rendered more than once on separate renderers. We also propose a transmission scheme that spreads the passing of data throughout the whole process of rendering. Taking advantage of frame-to-frame coherence, load balancing is achieved by decomposing the image space unequally so that the workload in each region is proportional to the processing power of the corresponding renderer. The distribution of workload and of processing power, which vary during rendering, is obtained by linear prediction from statistics of previous frames. The system is practical and scalable for a moderate number of processors.
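The unequal image-space decomposition can be sketched roughly as follows (per-scanline costs stand in for statistics gathered from previous frames, and a simple greedy split replaces the paper's linear predictor): scanlines are grouped into contiguous regions whose predicted work is proportional to each renderer's relative speed.

```python
# Sketch of unequal image-space decomposition: contiguous scanline regions whose
# predicted workload is proportional to each renderer's relative speed.
# (Per-scanline costs would come from statistics of previous frames.)
def decompose(scanline_cost, speeds):
    total = sum(scanline_cost)
    targets = [total * s / sum(speeds) for s in speeds]   # work each renderer should get
    regions, start, acc, r = [], 0, 0.0, 0
    for y, c in enumerate(scanline_cost):
        acc += c
        # close the current region once its target is met (keep the last region open)
        if acc >= targets[r] and r < len(speeds) - 1:
            regions.append((start, y + 1))
            start, acc, r = y + 1, 0.0, r + 1
    regions.append((start, len(scanline_cost)))
    return regions

# Example: 12 scanlines, the middle ones are expensive; renderer 1 is twice as fast.
costs  = [1, 1, 2, 4, 6, 6, 6, 4, 2, 1, 1, 1]
speeds = [1.0, 2.0, 1.0]
print(decompose(costs, speeds))   # -> [(0, 5), (5, 9), (9, 12)]
```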

10.
Sort-last parallel rendering is widely used. Recent GPU developments mean that a PC equipped with multiple GPUs is a viable alternative to a high-cost supercomputer: the Fermi architecture of a single GPU supports uniform virtual addressing, providing a foundation for non-uniform memory access (NUMA) on multi-GPU platforms. Such hardware changes require the user to reconsider parallel rendering algorithms. In this paper, we thoroughly investigate the NUMA-aware image compositing problem, which is the key final stage in sort-last parallel rendering. Based on a proven radix-k strategy, we find an optimal compositing algorithm that takes advantage of the NUMA architecture on the multi-GPU platform. We quantitatively analyze different image compositing modes for practical image compositing, taking into account peer-to-peer communication costs between GPUs. Our experiments on various datasets show that our image compositing method is very fast: an image of a few megapixels can be composited in about 10 ms by eight GPUs.
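A toy simulation of the radix-k compositing strategy is sketched below (single-process, invented data; not the paper's GPU implementation). The node count is factored into per-round radices; direct send (one round of radix N) and binary swap (log2 N rounds of radix 2) fall out as special cases.

```python
# Toy simulation of radix-k image compositing: the node count N is factored as
# k1*k2*...*km; in round j, nodes holding the same region form groups of size kj
# and perform a direct-send exchange within the group. k = [N] is direct send,
# k = [2, 2, ...] is binary swap. Pixels are (colour, depth); nearest depth wins.
import random
from collections import defaultdict

ROWS, FACTORS = 24, [2, 3]               # 6 nodes, rounds of radix 2 then radix 3

def composite(a, b):
    return [pa if pa[1] <= pb[1] else pb for pa, pb in zip(a, b)]

nodes = 1
for k in FACTORS:
    nodes *= k
images = [[(i, random.random()) for _ in range(ROWS)] for i in range(nodes)]
regions = [(0, ROWS)] * nodes

for k in FACTORS:
    # Group together the nodes that currently own the same region.
    by_region = defaultdict(list)
    for i, r in enumerate(regions):
        by_region[r].append(i)
    new_images = [img[:] for img in images]
    for (lo, hi), members in by_region.items():
        size = (hi - lo) // k
        for g in range(0, len(members), k):          # groups of k members
            group = members[g:g + k]
            for slot, owner in enumerate(group):     # owner keeps the slot-th piece
                a = lo + slot * size
                b = hi if slot == k - 1 else a + size
                piece = images[owner][a:b]
                for other in group:                  # direct send within the group
                    if other != owner:
                        piece = composite(piece, images[other][a:b])
                new_images[owner][a:b] = piece
                regions[owner] = (a, b)
    images = new_images

# Each node now owns a fully composited, disjoint 1/N of the image.
```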

11.
Sandia National Laboratories uses PC clusters and commodity graphics cards to achieve higher rendering performance on extreme data sets. The main obstacle in using cluster-based graphics systems is the difficulty of realizing the full aggregate performance of all the individual graphics accelerators, particularly for very large data sets that exceed the capacity and performance characteristics of any single node. Based on our efforts to achieve higher performance, we present results from a parallel sort-last implementation generated by the scalable rendering project at Sandia National Laboratories. Our sort-last library (libpglc) can be linked to an existing parallel application to achieve high rendering rates. We ran performance tests on a 64-node PC cluster populated with commodity graphics cards. Applications using libpglc have demonstrated rendering performance of 300 million polygons per second, approximately two orders of magnitude greater than the performance on an SGI Infinite Reality system for similar applications.

12.
Current rendering processors aim to process triangles as fast as possible, and they tend to be equipped with multiple rasterizers so that a number of triangles can be handled in parallel to increase polygon rendering performance. However, such parallel architectures may have a consistency problem when more than one rasterizer tries to access data at the same address. This paper proposes a consistency-free memory architecture for sort-last parallel rendering processors, in which a consistency-free pixel cache architecture is devised and effectively associated with three different memory systems consisting of a single frame buffer, a memory interface unit, and consistency-test units. Furthermore, the proposed architecture can reduce the latency caused by pixel cache misses, because the rasterizer does not wait until cache miss handling is completed when a pixel cache miss occurs. The experimental results show that the proposed architecture achieves almost linear speedup up to four rasterizers with a single frame buffer.

13.
Sort-last rendering is a method of parallelizing compute-intensive computer graphics; specifically, the primitives that describe a scene are first allocated to a set of renderers, and the rendered images are then composited to give the final image. This paper surveys six such techniques and compares their performance in terms of active pixels, i.e., pixels that are covered by at least one primitive. Active pixels offer a uniform way of accounting for the time, space, and bandwidth costs in sort-last rendering. The comparison highlights the strengths of each technique. For example, tree composition has minimum work, binary-swap (hypercube) has minimum composition latency, direct pixel forwarding (mesh) has minimum bandwidth latency, and snooping (bus) has minimum bandwidth volume; binary-swap's bandwidth and composition latencies decrease, whereas the bandwidth volumes for direct pixel forwarding and snooping remain constant, as the number of renderers increases; and so on.

14.
Distributed shared memory has increasingly become a desirable programming model on which to program multicomputer systems. Such systems strike a balance between the performance attainable in distributed-memory multiprocessors and the ease of programming on shared-memory systems. In shared-memory systems, concurrent tasks communicate through shared variables, and synchronization of access to shared data is an important issue. Semaphores have been traditionally used to provide this synchronization. In this paper we propose a decentralized scheme to support semaphores in a virtual shared-memory system. Our method of grouping semaphores into semaphore pages and caching a semaphore at a processor on demand eliminates the reliability problems and bottlenecks associated with centralized schemes. We compare the performance of our scheme with a centralized implementation of semaphores and conclude that our system performs better under high semaphore access rates as well as larger numbers of processors.

15.
Ray tracing is a well-known technique for generating life-like images. Unfortunately, ray tracing complex scenes can require large amounts of CPU time and memory. Distributed-memory parallel computers with large memory capacities and high processing speeds are ideal candidates for ray tracing. However, the computational cost of rendering pixels and the patterns of data access cannot be predicted until runtime. To parallelize such an application efficiently on distributed-memory parallel computers, the issues of database distribution, dynamic data management and dynamic load balancing must be addressed. In this paper, we present a parallel implementation of a ray tracing algorithm on the Intel Delta parallel computer. In our database distribution, a small fraction of the database is duplicated on each processor, while the remaining part is evenly distributed among groups of processors. In the system, there are multiple copies of the entire database in the memory of groups of processors. Dynamic data management is achieved by an ALRU cache scheme which can exploit image coherence to reduce data movement when ray tracing consecutive pixels. We balance load among processors by distributing subimages to processors in a global fashion based on previous workload requests. The success of our implementation depends crucially on a number of parameters which are experimentally evaluated. © 1997 John Wiley & Sons, Ltd.
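The database distribution described here can be sketched as follows (the object list, group size, and replication fraction are invented for illustration): a small fraction of the database is replicated everywhere, and the rest is dealt out within each processor group so that every group collectively holds one complete copy.

```python
# Sketch of the described database distribution for parallel ray tracing:
# a small fraction of the scene is replicated on every processor, and the rest
# is partitioned evenly within each group of processors, so each group as a
# whole holds one complete copy of the database. Illustrative values only.
def distribute(objects, n_procs, group_size, replicate_fraction=0.05):
    n_rep = max(1, int(len(objects) * replicate_fraction))
    replicated, remaining = objects[:n_rep], objects[n_rep:]
    local = {p: list(replicated) for p in range(n_procs)}
    for group_start in range(0, n_procs, group_size):
        group = range(group_start, group_start + group_size)
        # deal the remaining objects round-robin within this group
        for idx, obj in enumerate(remaining):
            local[group[idx % group_size]].append(obj)
    return local

scene = [f"obj{i}" for i in range(20)]
layout = distribute(scene, n_procs=4, group_size=2)
for p, objs in layout.items():
    print(p, objs)
# Objects not resident locally would be fetched on demand from a peer in the same
# group and kept in an approximate-LRU (ALRU) cache to exploit image coherence.
```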

16.
Distributed shared memory for roaming large volumes   (total citations: 1; self-citations: 0; citations by others: 1)
We present a cluster-based volume rendering system for roaming very large volumes. The system allows a gigabyte-sized probe to be moved inside a total volume of several tens or hundreds of gigabytes in real time. While the size of the probe is limited by the total amount of texture memory on the cluster, the size of the total data set has no theoretical limit. The cluster is used as a distributed graphics processing unit that aggregates both graphics power and graphics memory. A hardware-accelerated volume renderer runs in parallel on the cluster nodes, and the final image compositing is implemented using a pipelined sort-last rendering algorithm. Meanwhile, volume bricking and volume paging allow efficient data caching. On each rendering node, a distributed hierarchical cache system implements a global, software-based distributed shared memory on the cluster. In case of a cache miss, this system first checks page residency on the other cluster nodes instead of directly accessing local disks. Using two Gigabit Ethernet network interfaces per node, we accelerate data fetching by a factor of 4 compared to directly accessing local disks. The system also implements asynchronous disk access and texture loading, which makes it possible to overlap data loading, volume slicing and rendering for optimal volume roaming.
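The miss-handling order of the distributed hierarchical cache can be sketched roughly as follows (class and method names are hypothetical): on a local miss, peers are queried for page residency before the node falls back to its own disk.

```python
# Rough sketch of a distributed hierarchical cache for volume bricks:
# local cache -> other cluster nodes -> local disk. Names are hypothetical.
class BrickCache:
    def __init__(self, node_id, peers, capacity=4):
        self.node_id = node_id
        self.peers = peers              # other BrickCache instances on the cluster
        self.capacity = capacity
        self.cache = {}                 # brick_id -> brick data

    def lookup_local(self, brick_id):
        return self.cache.get(brick_id)

    def fetch(self, brick_id):
        brick = self.lookup_local(brick_id)
        if brick is not None:
            return brick                                # hit in the local cache
        for peer in self.peers:                         # check page residency on peers
            brick = peer.lookup_local(brick_id)
            if brick is not None:
                break
        if brick is None:                               # last resort: local disk
            brick = self.read_from_disk(brick_id)
        if len(self.cache) >= self.capacity:            # naive eviction for the sketch
            self.cache.pop(next(iter(self.cache)))
        self.cache[brick_id] = brick
        return brick

    def read_from_disk(self, brick_id):
        return f"data-for-{brick_id}"                   # stand-in for an (asynchronous) read

nodes = [BrickCache(i, peers=[]) for i in range(3)]
for n in nodes:
    n.peers = [p for p in nodes if p is not n]

nodes[0].fetch("brick-7")         # miss everywhere -> local disk
print(nodes[1].fetch("brick-7"))  # miss locally, found on node 0 -> no disk access
```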

17.
Parallel ray tracing of complex scenes on multicomputers requires the distribution of both computation and scene data to the processors. This is carried out during preprocessing and usually consumes too much time and memory. The paper presents an efficient parallel subdivision algorithm that decomposes a given scene into rectangular regions adaptively and maps the resultant regions to the node processors of a multicomputer. The proposed algorithm uses efficient data structures to identify the splitting planes quickly. Furthermore, the mapping of the regions and the objects to the node processors is performed while the parallel spatial subdivision proceeds. The proposed algorithm is implemented on an Intel iPSC/2 hypercube multicomputer, and promising results have been obtained.
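A much-simplified version of such an adaptive rectangular subdivision is sketched below (median splits of object centroids along the longer axis; the paper's actual data structures for locating splitting planes are not reproduced): regions are split until there is one per node processor, and each region's objects are mapped with it.

```python
# Simplified adaptive subdivision: repeatedly split the most heavily loaded
# rectangular region at the median of its object centroids along its longer
# axis, until there is one region per processor. Illustrative only.
import random

def split(region):
    (xmin, ymin, xmax, ymax), points = region
    axis = 0 if (xmax - xmin) >= (ymax - ymin) else 1      # split the longer axis
    pts = sorted(points, key=lambda p: p[axis])
    cut = pts[len(pts) // 2][axis]                          # median splitting plane
    left  = [p for p in points if p[axis] <  cut]
    right = [p for p in points if p[axis] >= cut]
    if axis == 0:
        return ((xmin, ymin, cut, ymax), left), ((cut, ymin, xmax, ymax), right)
    return ((xmin, ymin, xmax, cut), left), ((xmin, cut, xmax, ymax), right)

def subdivide(points, bounds, n_procs):
    regions = [(bounds, points)]
    while len(regions) < n_procs:
        regions.sort(key=lambda r: len(r[1]), reverse=True)  # heaviest region first
        regions[:1] = split(regions[0])                      # replace it with its two halves
    return regions

objs = [(random.random(), random.random()) for _ in range(100)]
for box, pts in subdivide(objs, (0.0, 0.0, 1.0, 1.0), 4):
    print(box, len(pts))   # region bounds and the number of objects mapped to it
```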

18.
We describe a system that allows programmers to take advantage of both control and data parallelism through multiple intercommunicating data-parallel modules. This programming environment extends C-type stream I/O to include intermodule communication channels. The programmer writes each module as a separate data-parallel program, then develops a channel-linker specification describing how to connect the modules together. A channel linker we have developed loads the separate modules onto the parallel machine and binds the communication channels together as specified. We present performance data demonstrating that a mixed control- and data-parallel solution can yield better performance than a strictly data-parallel solution. The system described currently runs on the Intel iWarp multicomputer.

19.
The Scalable I/O (SIO) Initiative's Low-Level Application Programming Interface (SIO LLAPI) provides file system implementers with a simple low-level interface to support high-level parallel I/O interfaces efficiently and effectively. This paper describes a reference implementation and evaluation of the SIO LLAPI on the Intel Paragon multicomputer. The implementation provides a file system structure and striping algorithm compatible with the Parallel File System (PFS) of the Intel Paragon, and runs either inside the kernel or as a user-level library. Scatter-gather read/write, asynchronous I/O, a client caching and prefetching mechanism, a file access hint mechanism, collective I/O, and highly efficient file copy have been implemented. Preliminary experience shows that the SIO LLAPI offers opportunities for significant performance improvement and is easy to implement. Some high-level file system interfaces and applications, such as PFS, ADIO, and a Hartree-Fock application, have also been implemented on top of SIO. The performance of this PFS is at least that of Intel's native PFS, and in many cases, such as small sequential file access, huge I/O requests, and collective I/O, it is stable and much better. The SIO features help to support high-level interfaces easily, quickly, and more efficiently, and the caching, prefetching, and hints are useful for obtaining better performance under different access models. The scalability and performance of SIO are limited by network latency, scalable network bandwidth, memory copy bandwidth, memory size, and the pattern of I/O requests. The tradeoff between generality and efficiency should be considered in implementation.

20.
The purpose of this work is to compare the speed of isosurface rendering in software with that of dedicated hardware. Input data consist of 10 different objects from various parts of the body and various modalities (CT, MR, and MRA), with a variety of surface sizes (up to 1 million voxels / 2 million triangles) and shapes. The software rendering technique is a particular method of voxel-based surface rendering called shell rendering. The hardware method is OpenGL-based and uses the surfaces constructed with our implementation of the Marching Cubes algorithm. The hardware environment consists of a variety of platforms, including a Sun Ultra I with a Creator3D graphics card and a Silicon Graphics Reality Engine II, both with polygon rendering hardware, and a 300 MHz Pentium PC. The results indicate that the software method (shell rendering) was 18 to 31 times faster than any of the hardware rendering methods. This work demonstrates that a software implementation of a particular rendering algorithm (shell rendering) can outperform dedicated hardware. We conclude that, for medical surface visualization, expensive dedicated hardware engines are not required. More importantly, an available software algorithm (shell rendering) on a 300 MHz Pentium PC outperforms rendering via hardware engines by a factor of 18 to 31.
