Similar literature
 18 similar documents found (search time: 67 ms)
1.
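Wang Yu, Li Tao, Xing Lidong, Feng Zhenfu. Computer Engineering (《计算机工程》), 2021, 47(12): 236-248
To address the problem that dedicated hardware for graphics and image processing cannot deliver flexibility, scalability, and timeliness at the same time, this paper designs a dedicated processor that supports the OpenVX 1.3 standard. The kernel functions of OpenVX 1.3 are mapped onto the data path, the number of arithmetic units needed to process them efficiently is analyzed, and the structure of a data-path arithmetic unit suited to the standard is determined. The data path is reconfigured by writing instructions, so it can follow the evolution and extension of the OpenVX standard. The complete circuit was synthesized and verified with a 65 nm CMOS process library; the resulting reconfigurable OpenVX data-path arithmetic unit has an area of 21 076.21 μm², a power consumption of 778.63 mW, a system clock frequency of 500 MHz, and a throughput of 1.86 GB/s. Experimental results show that the unit is highly programmable and scalable and can effectively meet the requirements of real-time, high-speed general-purpose image processing.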

2.
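To address the large data volume, heavy computation, and high power consumption of conserved-sequence identification in DNA sequences (motif finding) and of longest common subsequence computation in traditional biological computing, this paper proposes two parallel algorithms based on the PAAG polymorphic parallel processor, which supports data-level, thread-level, and instruction-level parallelism. Corresponding serial and parallel programs were developed on the processing elements (PEs) of the PAAG processor, and the different stages of the computation were dispatched to different PEs, achieving parallelism at different granularities. Experimental results show that the proposed parallel algorithms markedly improve the efficiency of motif finding and longest common subsequence computation.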
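As a point of reference for the longest common subsequence computation mentioned in this abstract, here is a minimal serial sketch of the standard dynamic-programming recurrence. It is not the PAAG parallel algorithm from the paper, whose details the abstract does not give; the function name `lcs_length` is a hypothetical choice for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (serial DP).

    Illustrative baseline only; the paper's PAAG mapping is not shown here.
    """
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence ending before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop one character from either sequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(b)]
```

Each table cell depends only on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another. That general property is what parallel formulations of this recurrence usually exploit.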

3.
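This paper proposes a new polymorphic, efficient parallel array machine architecture: the Firefly 2 (萤火虫2号) array machine. Its processing elements can run in both SIMD and MIMD modes, have an asynchronous execution mechanism, and can also perform distributed instruction-level parallel processing. A hardware multithread manager and efficient communication mechanisms allow the array machine to achieve highly efficient thread-level, data-level, and distributed instruction-level parallel computation. Notably, its stream-processing performance rivals that of application-specific integrated circuits. The architecture also supports static and dynamic dataflow computation effectively and handles graphics, image, and digital signal processing tasks efficiently.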

4.
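To address the low rendering efficiency of fixed-pipeline techniques and the limited use made of new-generation GPUs, a practical and convenient virtual reality platform was developed using parallel rendering. The platform combines programming technologies such as OSG (Open Scene Graph), OpenGL, and Vtree with well-known open-source software and engines. By hiding and encapsulating low-level modules, integrating modules, and wrapping a range of general-purpose, complex simulation algorithms, it provides a more convenient high-level API library that makes programming concise, while designers can still use the low-level functions for secondary development when necessary. The paper describes the functions and features of each module, the architecture, the parallel rendering technique, and application examples.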

5.
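Design of a Thread Manager in a Polymorphic Parallel Processor   Total citations: 2, self-citations: 2, citations by others: 2
A hardware thread manager is proposed for a polymorphic parallel processor. It supports two working modes, thread-management operations for 8 threads in MIMD mode and unified management by the SC controller in SIMD mode, and thereby realizes thread-level parallel computing. It can monitor the working status of each thread as well as the state of the neighbor-communication registers and the router. It can stop, switch, and start threads during communication and records the working state of each thread, which avoids waiting caused by blocked data and maximizes the execution efficiency of a single processor.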

6.
7.
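Performance Analysis of Parallel 3D Graphics Rendering on PC Clusters   Total citations: 1, self-citations: 0, citations by others: 1
This paper studies the performance of parallel 3D graphics rendering on PC clusters and analyzes the key factors that affect it, including network performance, algorithm complexity, and the parallel work-distribution mechanism. Parallel 3D rendering simulation tests based on general-purpose MPI and OpenGL were run on a Gigabit Ethernet PC cluster. The paper presents the data and analysis, together with recommendations for building a sound parallel 3D rendering system in which graphics algorithm complexity is balanced against network performance to reach the best parallel performance.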

8.
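Based on a study of parallel simulation requirements in the petroleum field, this paper presents four parallel simulation patterns. These patterns have been applied successfully in a cluster-based petroleum exploration simulation system. They are equally applicable to other application domains as a way to improve the performance of parallel simulation systems.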

9.
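Clusters built from microcomputers are attracting attention across research fields because of their high performance-to-price ratio. We built a microcomputer cluster with 16 PentiumPro 2000 MHz CPUs and implemented two-dimensional and three-dimensional parallel particle simulation programs on it using PGHPF. The results show that the performance of PGHPF is comparable to that of MPI, and that fairly high parallel efficiency is still achieved with up to 16 CPUs.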

10.
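The hardware of the parallel knowledge base machine PKBM95 proposed in this paper is a multi-machine system consisting of one microcomputer and four TRANSPUTERs. The paper focuses on the system architecture, operating specification, and operation language of PKBM, and proposes a distributed parallel inference model together with a two-pass conflict resolution strategy carried out on the back-end and front-end machines.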

11.
In this paper, we present a graphics processing unit (GPU)‐based parallel implementation of visibility calculation from multiple viewpoints on raster terrain grids. Two levels of parallelism are introduced in the GPU kernels: parallel traversal of visibility rays from a single viewpoint and parallel processing of viewpoints. The obtained visibility maps are combined in parallel using the selected logical operator. A comparison with a multi‐threaded CPU implementation is performed to establish the expected speed‐ups of viewshed construction when the source and destination types are sets of scattered locations, paths, or regions. The results demonstrate that, using the GPU, an order-of-magnitude acceleration can be achieved on average with both point sampling and bilinear filtering of the elevation map. Copyright © 2011 John Wiley & Sons, Ltd.
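The step in which per-viewpoint visibility maps are combined with a selected logical operator can be sketched as follows. This is a serial, CPU-side illustration under assumptions not stated in the abstract: maps are equally sized nested lists of booleans, at least one map is supplied, and the name `combine_visibility` is hypothetical.

```python
import operator
from functools import reduce

def combine_visibility(maps, op=operator.or_):
    """Combine per-viewpoint boolean visibility grids cell by cell.

    maps: non-empty list of equally sized 2D lists of bool (assumed layout).
    op:   binary logical operator, e.g. operator.or_ ("visible from any
          viewpoint") or operator.and_ ("visible from every viewpoint").
    """
    def combine_two(m1, m2):
        # Every cell is independent, which is what makes a GPU version natural.
        return [[op(c1, c2) for c1, c2 in zip(r1, r2)]
                for r1, r2 in zip(m1, m2)]
    return reduce(combine_two, maps)
```

With `operator.or_` the result marks cells seen from at least one viewpoint; with `operator.and_` it marks cells seen from all of them.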

12.
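Multi-level Parallel Processing for an IXP2400-Based Network Test System   Total citations: 1, self-citations: 0, citations by others: 1
Multi-level parallel processing has long been an important issue in the design and application of computers and networks. This paper is an applied study of multiprocessor parallelization on the IXP2400, a multi-core programmable chip, and proposes a multi-level parallel technique that balances processing capability against development flexibility. Taking a "network test system based on a network processor" as the application example, it analyzes the microengine parallel scheme and a thread-level static scheduling algorithm in detail, and validates both through WorkBench simulation and measurements of the average maximum transmission rate for seven types of Ethernet frames. Finally, it summarizes the proposed technique and discusses its prospects.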

13.
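Design and Implementation of the Late Pixel-Processing Stage of a 3D Graphics Pipeline   Total citations: 1, self-citations: 0, citations by others: 1
To meet the real-time bulk data processing and memory read/write demands of the late pixel-processing stage of a 3D graphics pipeline, under the particular resource and power constraints of embedded systems, a hardware design for this stage is presented. The design first implements all test functions to guarantee the various rendering effects. Second, it adopts rendering based on screen partitioning to reduce memory requirements. Third, it incorporates the Early Z algorithm to discard invisible triangle data as early as possible and reduce the amount of data to be rendered. Finally, it implements the Flip Quad antialiasing algorithm to improve image quality. The module has been modeled at RTL and verified on an FPGA.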
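The idea behind Early Z, rejecting hidden geometry before any expensive shading is done, can be illustrated with a small software sketch. It works at fragment granularity and assumes that a smaller depth value means nearer to the viewer; the function name, the fragment tuple layout, and the `shade` callback are all hypothetical and do not describe the paper's hardware design.

```python
def render_with_early_z(fragments, width, height, shade):
    """Shade only fragments that pass the depth test (software sketch).

    fragments: iterable of (x, y, depth, payload); smaller depth = nearer
               (an assumed convention).
    shade:     callable that turns a payload into a colour; stands in for
               the expensive per-pixel work that Early Z tries to avoid.
    Returns the colour buffer and the number of shade() calls made.
    """
    depth_buffer = [[float("inf")] * width for _ in range(height)]
    color_buffer = [[None] * width for _ in range(height)]
    shaded = 0
    for x, y, depth, payload in fragments:
        if depth >= depth_buffer[y][x]:
            continue  # hidden behind an earlier fragment: skip shading
        depth_buffer[y][x] = depth
        color_buffer[y][x] = shade(payload)
        shaded += 1
    return color_buffer, shaded
```

The benefit depends on submission order: when nearer fragments arrive first, every later hidden fragment is discarded without a `shade` call.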

14.
The watershed transform is a method for non-supervised image segmentation. In this paper we show that a watershed algorithm based on a cellular automaton is a good choice for recent GPU architectures, especially when the synchronization rules are relaxed. In particular, we propose a block-asynchronous computation strategy that maps the cellular automaton onto the thread blocks of the GPU. This method reduces the number of global synchronization points, allowing efficient exploitation of the GPU memory hierarchy. We also avoid the artifacts that the block-asynchronous updating scheme produces in the watershed lines by correcting the data propagation speed among the blocks. The proposals are compared to an OpenMP multithreaded code. The high speedups indicate the potential of this kind of algorithm for new architectures based on hundreds of cores. The method is also tuned for 3D volumes, with similar results.

15.
Graphics processing units (GPUs) offer parallel computing power that usually requires a cluster of networked computers or a supercomputer to accomplish. While writing kernel code is fairly straightforward, achieving efficiency and performance requires very careful optimisation decisions and changes to the original serial algorithm. We introduce a parallel canonical ensemble Monte Carlo (MC) simulation that runs entirely on the GPU. In this paper, we describe two MC simulation codes for Lennard-Jones particles in the canonical ensemble: a single-CPU-core implementation and a parallel GPU implementation. Using Compute Unified Device Architecture, the parallel implementation enables the simulation of systems containing over 200,000 particles in a reasonable amount of time, which allows researchers to obtain more accurate simulation results. A remapping algorithm is introduced to balance the load of the device resources, and we demonstrate by experimental results that the efficiency of this algorithm is bounded by the available GPU resources. Our parallel implementation achieves an improvement of up to 15 times on a commodity GPU over our efficient single-core implementation for a system consisting of 256k particles, with the speedup increasing with the problem size. Furthermore, we describe our methods and strategies for optimising our implementation in detail.
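For readers unfamiliar with canonical-ensemble Monte Carlo, the two building blocks of such a simulation are the Lennard-Jones pair energy and the Metropolis acceptance rule. The sketch below shows both in serial form, in reduced units with the Boltzmann constant set to 1; the function names and default parameters are assumptions for illustration and are not taken from the paper's code.

```python
import math
import random

def lj_pair_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones energy 4*eps*((sigma/r)**12 - (sigma/r)**6) at distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

def metropolis_accept(delta_e, temperature, rng=random.random):
    """Metropolis rule for a trial move with energy change delta_e.

    Reduced units (k_B = 1) are assumed. rng is injectable so the rule can
    be tested deterministically.
    """
    if delta_e <= 0.0:
        return True  # downhill moves are always accepted
    # Uphill moves are accepted with probability exp(-delta_e / T).
    return rng() < math.exp(-delta_e / temperature)
```

In a full simulation, `delta_e` is the change in the summed pair energies of the displaced particle; that sum is the data-parallel part a GPU implementation would typically distribute across threads.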

16.
The density peak (DP) algorithm has been widely used in scientific research due to its novel and effective peak density-based clustering approach. However, the DP algorithm uses each pair of data points several times when determining cluster centers, yielding high computational complexity. In this paper, we focus on accelerating the time-consuming density peaks algorithm with a graphics processing unit (GPU). We analyze the principle of the algorithm to locate its computational bottlenecks and evaluate its potential for parallelism. In light of our analysis, we propose an efficient parallel DP algorithm targeting a GPU architecture and implement this parallel method with compute unified device architecture (CUDA), called the ‘CUDA-DP platform’. Specifically, we use shared memory to improve data locality, which reduces the number of global memory accesses. To exploit the coalesced access mechanism of the GPU, we convert the data structure of the CUDA-DP program from an array of structures to a structure of arrays. In addition, we introduce a binary search-and-sampling method to avoid sorting a large array. The experimental results show that CUDA-DP can achieve a 45-fold acceleration compared to the central processing unit-based density peaks implementation.
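The array-of-structures to structure-of-arrays conversion mentioned in this abstract can be illustrated on the host side with a short sketch. It assumes each data point is a dict with the same keys; the name `aos_to_soa` is hypothetical and the code only shows the layout change, not the CUDA-DP implementation.

```python
def aos_to_soa(points):
    """Convert a list of records (AoS) into a dict of per-field lists (SoA).

    points: list of dicts that all share the same keys (assumed layout).
    """
    if not points:
        return {}
    # One contiguous list per field, in the original record order.
    return {key: [p[key] for p in points] for key in points[0]}
```

In the SoA layout, the values of one field for consecutive points sit next to each other, so consecutive GPU threads reading that field touch consecutive addresses, which is the access pattern coalescing rewards.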

17.
Two parallel implementations of a 3D convex hull algorithm are reported. The paper considers a MIMD distributed memory architecture, and the implementations are carried out on the Meiko Computing Surface using T800 transputers and the programming languages Occam and C. The first method uses a simple parallel geometric decomposition strategy and produces encouraging results. With the second approach, a parallel generic Divide-and-Conquer kernel is incorporated. This is an example of the algorithmic skeleton approach to parallel programming and involves run-time, dynamic allocation of work to processors. The performance of both methods is measured and compared.
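The algorithmic skeleton idea can be sketched as a generic divide-and-conquer function that is parameterised by four problem-specific callbacks. The version below is serial and uses merge sort as a stand-in problem, not the 3D convex hull; all names are hypothetical, and where the paper allocates subproblems to processors dynamically at run time, this sketch simply recurses.

```python
import heapq

def divide_and_conquer(problem, is_trivial, solve, divide, combine):
    """Generic divide-and-conquer skeleton (serial sketch).

    A parallel skeleton would hand the subproblems to different processors;
    here they are solved one after another.
    """
    if is_trivial(problem):
        return solve(problem)
    subresults = [
        divide_and_conquer(sub, is_trivial, solve, divide, combine)
        for sub in divide(problem)
    ]
    return combine(subresults)

def merge_sort(values):
    """Merge sort expressed by plugging four callbacks into the skeleton."""
    return divide_and_conquer(
        list(values),
        is_trivial=lambda v: len(v) <= 1,
        solve=lambda v: v,
        divide=lambda v: [v[: len(v) // 2], v[len(v) // 2:]],
        combine=lambda parts: list(heapq.merge(*parts)),
    )
```

The user of a skeleton supplies only the callbacks; how the recursion is scheduled stays inside `divide_and_conquer`, which is what lets the same application code run serially or in parallel.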

18.
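With the introduction of multiple video coding standards and video algorithms, the efficiency and flexibility of video processors have become more important. To address the mismatch between the data loading rate and the processing rate of the array processing elements in video array processors, this paper analyzes the algorithms of the video coding standards, examines in depth the redundancy of data memory accesses and the characteristics of data transfers, and designs, within a programmable and reconfigurable architecture, a data loading circuit that supports two working modes, injection and Cache. Functional simulation and FPGA verification were carried out. The results show that the circuit meets the data loading requirements of 1080P video processing; synthesized with Design Compiler using the SMIC 0.13 μm CMOS standard cell library, it reaches a frequency of 197 MHz.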
