Similar literature
1.
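GPU-accelerated algorithms for extracting the Normalized Difference Vegetation Index (NDVI) usually adopt the GPU multithreaded parallel model, which suffers from time-consuming data transfers between weakly dependent computations and between the CPU and the GPU, limiting further gains in speedup. To address these problems, and based on the characteristics of NDVI extraction, this paper proposes an NDVI extraction algorithm built on a GPU multi-stream concurrent parallel model. Using CUDA streams and the Hyper-Q feature, the multi-stream model overlaps data transfers with weakly dependent computations, and weakly dependent computations with one another, which further raises the algorithm's parallelism and GPU resource utilization. The paper first optimizes the NDVI extraction algorithm with the GPU multithreaded parallel model and decomposes the optimized computation to identify the parts that involve data transfer and weakly dependent computation. It then restructures those parts and optimizes them with the multi-stream concurrent parallel model so that weakly dependent computations overlap with one another and with data transfers. Finally, both GPU-based NDVI extraction algorithms are evaluated on remote sensing images captured by the Gaofen-1 satellite. The results show that, compared with the conventional NDVI extraction algorithm based on the GPU multithreaded parallel model, the proposed algorithm achieves an average speedup of about 1.5x for images larger than 12000*12000 pixels, and a speedup of about 260x over the serial extraction algorithm, giving better acceleration and parallelism.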
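
The abstract above does not spell out the NDVI formula or how the work is divided. As a minimal sketch, the block below uses the standard per-pixel definition NDVI = (NIR - Red) / (NIR + Red) and splits the pixels into independent chunks, which is the kind of decomposition a multi-stream scheme relies on. The function names, the flat-list image layout, and the chunking policy are assumptions made for illustration; this is plain CPU Python, not the paper's CUDA implementation.

```python
def ndvi(nir, red):
    """Per-pixel NDVI = (NIR - Red) / (NIR + Red); 0.0 where the sum is zero."""
    return [(n - r) / (n + r) if (n + r) != 0 else 0.0
            for n, r in zip(nir, red)]


def ndvi_chunked(nir, red, num_chunks):
    """Compute NDVI chunk by chunk.

    Each chunk depends only on its own pixels, so in a multi-stream GPU
    version (assumed here, not taken from the paper) the transfer of one
    chunk could overlap with the computation of another.
    """
    size = max(1, (len(nir) + num_chunks - 1) // num_chunks)
    out = []
    for start in range(0, len(nir), size):
        out.extend(ndvi(nir[start:start + size], red[start:start + size]))
    return out


nir_band = [0.8, 0.5, 0.0, 0.3, 0.9]
red_band = [0.2, 0.1, 0.0, 0.3, 0.1]
result = ndvi_chunked(nir_band, red_band, num_chunks=2)
```

Because the per-pixel results are independent, the chunked result matches the unchunked one regardless of the number of chunks.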

2.
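A GPU-based echo simulator is designed for synthetic aperture radar (SAR), using parallel computing to improve the efficiency of SAR echo simulation. The simulated digital echo signal is converted by a digital-to-analog converter into a radio-frequency signal that is output to the SAR system. The BP imaging algorithm is used to image the point-target and distributed-target echo data produced by the simulator, verifying the effectiveness of the proposed SAR echo simulator. Experimental analysis shows that GPU parallel processing improves the efficiency of SAR echo simulation.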

3.
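To address blurring accompanied by noise in SAR imaging, a fast SAR image enhancement algorithm is proposed that incorporates the noise visibility function. Building on image layer decomposition and the characteristics of human vision, the algorithm introduces the noise visibility function to control the gain of the detail-layer image. Following the characteristics of the GPU architecture and memory hierarchy, the processing of each pixel in the base layer and the detail layer is computed in parallel, completing the parallel optimization design and implementation of the algorithm. Experimental results show that the algorithm effectively improves image quality and enhances image detail, and that it makes full use of the GPU's parallel computing capability to improve the real-time performance of SAR image enhancement.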

4.
张杰  柴志雷  喻津 《计算机科学》2015,42(10):297-300, 324
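Feature extraction and description underlie many computer vision applications. Local feature extraction and description involve high-dimensional computation from pixel-level processing, which makes them computationally complex and poorly suited to real time and limits their use in practical systems. This paper studies the key common computational modules in local feature extraction and description: the image pyramid mechanism and image gradient computation. Parallel versions of these common modules are designed and implemented on the NVIDIA GPU/CUDA architecture, and their efficiency is further improved by optimizing access patterns for global memory, texture memory, and shared memory. Experimental results show that the GPU-based image pyramid and image gradient computation achieve a speedup of about 30x over the CPU; applying them to the HOG feature extraction and description algorithm yields a speedup of about 40x over the CPU. The work is of practical value for high-speed GPU-based extraction and description of local features.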
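
The abstract above names image gradients and image pyramids as the shared modules but does not give the operators used. The sketch below assumes a central-difference gradient with the [-1, 0, 1] mask (a common choice for HOG) and a pyramid level built by 2x2 averaging. Both choices, and the function names, are assumptions for illustration; the code is sequential Python that shows only the per-pixel arithmetic a GPU thread would perform.

```python
def gradient(img):
    """Central differences with the [-1, 0, 1] mask; borders are clamped."""
    h, w = len(img), len(img[0])
    gx = [[img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
           for x in range(w)] for y in range(h)]
    gy = [[img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
           for x in range(w)] for y in range(h)]
    return gx, gy


def downsample(img):
    """One pyramid level: average each 2x2 block (odd trailing row/column dropped)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


# A horizontal ramp: intensity grows left to right and is constant top to bottom.
ramp = [[0, 1, 2, 3] for _ in range(4)]
gx, gy = gradient(ramp)
half = downsample(ramp)
```

Every output pixel depends only on a small fixed neighborhood of the input, which is why both modules map naturally onto one GPU thread per pixel.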

5.
In this paper, the feasibility of adopting Graphics Processing Units for real-time emotion-aware computing is investigated as a way to speed up the time-consuming computations employed in such applications. The proposed methodology was applied to the analysis of electroencephalographic and electrodermal data gathered while participants passively viewed emotionally evocative stimuli. The effectiveness of the GPU in processing electroencephalographic and electrodermal recordings is demonstrated by comparing the execution time of chaos/complexity analysis through nonlinear dynamics (multi-channel correlation dimension/D2) and of signal processing algorithms (computation of skin conductance level/SCL) in various popular programming environments. Apart from the beneficial role of parallel programming, the adoption of special design techniques for memory management may further reduce execution time, by a factor approaching 30 in comparison with ANSI C (single-core sequential execution). Therefore, the use of GPU parallel capabilities offers a reliable and robust solution for sensing the user's affective state in real time.

6.
Today, there is a growing demand for computer vision and image processing in different areas and applications such as military surveillance and biological and medical imaging. Edge detection is a vital image processing technique used as a pre-processing step in many computer vision algorithms. However, the presence of noise makes edge detection more challenging; therefore, an image restoration technique is needed to tackle this obstacle by presenting an adaptive solution. As the complexity of processing rises with recent high-definition technologies, the amount of data carried by an image is increasing dramatically. Thus, increased processing power is needed to speed up the completion of certain tasks. In this paper, we present a parallel implementation of a hybrid algorithm comprising edge detection and image restoration, along with other processes, using the Compute Unified Device Architecture (CUDA) platform and exploiting a Single Instruction Multiple Thread (SIMT) execution model on a Graphics Processing Unit (GPU). The performance of the proposed method is tested and evaluated using well-known images from various applications. We evaluated the computation time of both the parallel implementation on the GPU and sequential execution on the Central Processing Unit (CPU), natively and using Hyper-Threading (HT) implementations. The speedup gained by the naïve approach of the proposed edge detection on the GPU with direct global memory access is up to 37 times, while the speedup over the native CPU implementation when using the shared memory approach is up to 25 times, and 1.5 times over the HT implementation.

7.
8.
刘金硕  黄朔  邓娟 《计算机工程》2022,48(12):16-23
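When high-resolution images are used as input to an image processing algorithm, the algorithm runs more slowly. Parallelizing the algorithm improves execution efficiency, but manually converting a serial program into a parallel one is tedious; existing automatic parallelizing translation tools perform inconsistently, and the programs they produce use only a single mode of parallelism. Targeting the patch-based multi-view stereo (PMVS) algorithm, this paper proposes an automatic two-level parallel translation method from C to CUDA. ANTLR is used to parse the source C code automatically; parallelizable loop structures are identified by analyzing data dependences and privatizing loop arrays; and the algorithm is translated into code with a two-level parallel structure of CPU multithreading and GPU execution. During execution, the input images are processed on the CPU and the GPU separately, which reduces the total execution time. Experimental results show that the speedup of the method grows with the resolution of the input images, reaching a maximum of about 32, a clear improvement over the PPCG and OpenACC automatic parallel translation methods.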

9.
Summary. We present a mathematically rigorous and, at the same time, convenient method for systolic design and derive systolic designs for three matrix computation problems. Each design is synthesized from a simple program and a proposed layout of processors. The synthesis derives a systolic parallel execution, channel connections for the proposed processor layout, and an arrangement of data streams such that the systolic execution can begin. Our choices of designs are governed by formal theorems. The synthesis method is implementable and is particularly effective if implemented with graphics capability. Our implementation on the Symbolics 3600 displays the resulting designs and simulated executions graphically on the screen. The method's centerpiece, a transformation of sequential program computations into systolic parallel ones, has been mechanically proved correct. Parts of this work have been presented at the Conference on Parallel Architectures and Languages Europe (PARLE) [10]. This research has been supported in part by Grant No. 26-7603-35 from the Lockheed Missiles & Space Corporation and by Grant No. DCR-8610427 from the National Science Foundation.

10.
BMC via on-the-fly determinization
This paper develops novel bounded model checking (BMC) techniques for asynchronous parallel systems. The aim is to increase the efficiency of BMC by exploiting the inherent concurrency in such systems. This added efficiency is gained by covering more reachable states within a given bound using two techniques. Firstly, a nonstandard execution model, step executions, in which multiple actions can take place simultaneously, is applied. Secondly, the number of executions the system can have is reduced by modeling the execution of the system components as if they were determinized. This determinization technique also enables the removal of the internal transitions of the components. Step executions can be further restricted to a subclass called process executions without losing any reachable states. The paper presents a translation scheme for BMC of reachability properties. The translation is from an asynchronous system where the components are modeled as labeled transition systems (LTSs) to a propositional formula. The models of the formula correspond to the step executions of the original system where each component is replaced with its determinized counterpart. The formula for step executions can be easily extended in such a way that its models correspond to the process executions of the system. The translation scheme has been implemented and some experimental comparisons performed. The results show that the bound needed to detect a violation of a reachability property is, for step and process executions, in most cases lower than in interleaving executions, and that the running time of the model checker using process executions is smaller than that of the one using steps. Moreover, the performance compares favorably to a state-of-the-art interleaving BMC implementation in the NuSMV system.

11.
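To address the slow execution and poor real-time performance of the genetic direct position determination algorithm on large-scale data, a GPU-accelerated parallel genetic direct position determination algorithm is proposed. Based on the characteristics of the direct position determination cost function, a high-speed GPU parallel genetic evolution architecture is designed. By parallelizing the fitness function computation and the genetic operations of selection, crossover, and mutation, the algorithm's execution time is shortened and its efficiency improved. Simulation experiments show that a well-designed GPU parallel thread structure significantly increases the execution speed of the genetic direct position determination algorithm, so that position estimates are obtained faster.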

12.
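Instruction-level parallel program execution model
A formal instruction-level parallel program execution model (ILPPEM) is proposed. ILPPEM can describe not only the behavior of a program's actual execution but also the behavior of feasible executions caused by timing variations that are uncertain at compile time and run time. The concept of isomorphism of program executions is also introduced, and it is proved that every feasible program execution is isomorphic to an actual program execution, which provides a theoretical basis for compiling and verifying parallel programs.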

13.
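Remote sensing image fusion is an important processing step in remote sensing image applications. As the data volume of remote sensing images and the computational complexity of fusion algorithms grow, fusion faces a processing-speed challenge. In recent years, GPU computing power has increased greatly and general-purpose computing applications on GPUs have developed rapidly. Based on the GPU programming model and hardware characteristics, this paper studies parallel acceleration of remote sensing image fusion in depth and proposes a parallel mapping model suited to the fusion execution flow. The IHS-enhanced wavelet fusion algorithm, which is computationally intensive and highly accurate, is selected for GPU parallel design and optimized for mainstream GPU platforms in terms of data transfer, loop optimization, and thread design; experiments are then run on an NVIDIA GTX 460 GPU. The results show that the parallel mapping model and optimization strategies are well suited to remote sensing image fusion, with a maximum speedup of 114x. The study indicates that general-purpose GPU computing has broad application prospects in remote sensing image processing.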

14.
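SAR image classification is a very important step in image interpretation, but the speckle noise present in SAR images prevents conventional methods from classifying them well. Moreover, SAR image classification is computationally intensive and time-consuming, and the volume of data that SAR can acquire keeps growing, so classifying SAR images quickly and accurately to obtain useful information in time is increasingly urgent. This paper proposes a fast SAR image classification method that combines spatial-domain and frequency-domain features of the image. In a parallel computing environment, it computes for every pixel the corresponding wavelet energy features, gray-level co-occurrence matrix features, and filtered gray-level features, and assembles them into a feature vector used to classify the SAR image. Experimental results show that the method achieves good classification results and runs quickly.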

15.
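In object-oriented programming, the key to software testing is designing the program's run-time states and their usage. However, paths in a state diagram often partially overlap. An improved execution method for state-space search is therefore studied, one that shares common paths. A path-coverage program tester model is used to evaluate the efficiency gain of the execution method. Experimental results show that the method effectively reduces the execution time of state-space search and improves search efficiency.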

16.
This work proposes several approaches to accelerate solid–fluid interaction through the use of the Immersed Boundary method on multicore and GPU architectures. Different optimizations on both architectures have been proposed, focusing on memory management and workload mapping. We have chosen two different test scenarios, which consist of single-solid and multiple-solid simulations. The performance analysis has been carried out on an extensive set of test cases to analyze the proposed optimizations using multiple CPUs (2) and GPUs (4). Effective performance is obtained for single-solid executions using one CPU (Intel Xeon E5520), achieving a peak speedup of 5.5. A higher benefit is reached for multiple solids, with a top speedup of approximately 5.9 and 9 using one CPU (8 cores) and two CPUs (16 cores), respectively. On the GPU (Kepler K20c) architecture, two different approaches are presented as the best alternatives: one for single-solid executions and one for multiple-solid executions. The best approach for single-solid executions achieves a speedup of approximately 17 with respect to the sequential counterpart. In contrast, for multiple-solid executions the benefit is much higher, as this type of problem is much more suitable for the GPU, reaching a peak speedup of 68, 115 and 162 using 1, 2 and 4 GPUs, respectively.

17.
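Histogram generation is a sequential loop computation with irregular data dependences and is widely used in many fields. However, because of its irregular memory accesses, multithreaded accesses to shared memory cause many bank conflicts, which hinder parallel efficiency. How to implement histogram generation efficiently on parallel processor platforms, especially on state-of-the-art graphics processing units (GPUs), is well worth studying. To reduce bank conflicts during histogram generation, memory padding is used to spread multithreaded shared-memory accesses evenly across the banks, which greatly reduces the memory access latency of histogram generation on the GPU. In addition, an effective and reliable search model for near-optimal configurations is proposed to guide users in configuring GPU execution parameters for higher performance. Experiments confirm that, in practical applications, the improved algorithm outperforms the original by 42% to 88%.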
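
The abstract above does not give the padding layout it uses. The sketch below is a small addressing model, not GPU code, that shows why padding can help under one assumed layout: each of 32 threads owns a private sub-histogram of 256 one-word bins in shared memory, shared memory has 32 banks, and a word at offset `i` lives in bank `i % 32`. The bank count, histogram size, per-thread layout, and function name are all assumptions for illustration.

```python
from collections import Counter

NUM_BANKS = 32   # assumed number of shared-memory banks
NUM_BINS = 256   # assumed number of bins per sub-histogram


def worst_conflict(stride, bins, num_banks=NUM_BANKS):
    """Largest number of threads in one warp that hit the same bank.

    Thread t updates bin bins[t] of its own sub-histogram, which starts at
    word offset t * stride, so it touches bank (t * stride + bins[t]) % num_banks.
    """
    banks = Counter((t * stride + b) % num_banks for t, b in enumerate(bins))
    return max(banks.values())


# Worst case for the unpadded layout: every thread updates the same bin.
same_bin = [7] * 32
unpadded = worst_conflict(NUM_BINS, same_bin)      # stride 256 -> 32-way conflict
padded = worst_conflict(NUM_BINS + 1, same_bin)    # stride 257 -> conflict-free
```

With a stride of 256, which is a multiple of 32, every thread lands in the same bank; one padding word per sub-histogram shifts each thread by one bank. Padding moves the conflicts rather than removing them for every possible input, so the benefit depends on the data distribution.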

18.
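Based on ray tracing, each pixel of the screen image is decomposed into a composite of the radiance of the surface patch at the intersection of the cast ray with a scene object and a texture map. The radiance of each patch is computed as a linear combination of bidirectional reflectance distribution function (BRDF) bases and accelerated by parallel rendering on graphics processing unit (GPU) cores; the result is then composited with the texture mapping result, which is also computed in parallel. A real-time global illumination rendering algorithm based on BRDF and GPU parallel computing is proposed, which uses GPU parallel acceleration to render global illumination for dynamically interactive materials in real time while improving rendering efficiency. The work focuses on two points: representing multiple reflections of light at object surfaces as a linear combination of BRDF bases, which turns a nonlinear problem into a linear one and so improves rendering efficiency; and using GPU parallel acceleration to compute the surface radiant energy, the texture mapping, and their linear combination separately, further improving computational efficiency to meet real-time rendering requirements.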

19.
In this paper, a graphics processor unit (GPU) accelerated particle filtering algorithm is presented together with a novel resampling technique. The aim is to mitigate particle impoverishment as well as computational burden, problems which are commonly associated with classical (systematic) resampled particle filtering. The proposed algorithm employs a priori-space dependent distribution in addition to the likelihood, and hence is christened the dual distribution dependent (D3) resampling method. Simulation results exhibit lower values of root mean square error (RMSE) than systematic resampling. D3 resampling is shown to improve particle diversity after each iteration, thereby affecting the overall quality of estimation. However, the computational burden is significantly increased owing to a few additional computations within the newly formulated resampling framework. With a view to obtaining parallel speedup, we introduce a CUDA version of the proposed method for acceleration on the GPU. The GPU programming model is detailed in the context of this paper. Implementation issues are discussed, along with an illustration of empirical computational efficiency obtained by executing the CUDA code on a Quadro 2000 GPU. The GPU-enabled code has a speedup of 3 and 4 over the sequential executions of the systematic and D3 resampling methods, respectively. Performance in terms of both RMSE and running time is elaborated for different selections of threads per block, with a view to effective implementations. It is in this context that we further introduce a cost to performance metric (CPM) for assessing the algorithmic efficiency of the estimator, involving both quality of estimation and running time as comparative factors, transformed into a unified parameter for assessment. CPM values for the estimators obtained from all such choices of threads per block have been determined, and a final value for the chosen parameter is resolved to produce a holistically effective estimator.
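
The abstract above takes systematic resampling as its baseline but does not describe D3 resampling in enough detail to reproduce. As a minimal sketch of the baseline only, the block below implements textbook systematic resampling in sequential Python; the function name and argument conventions are assumptions for illustration, and nothing here represents the paper's D3 method or its CUDA code.

```python
def systematic_resample(weights, u0):
    """Return particle indices chosen by systematic resampling.

    weights: non-negative particle weights (need not be normalized).
    u0: a single uniform draw in [0, 1) shared by all sampling positions.
    """
    n = len(weights)
    total = sum(weights)
    cumulative, acc = [], 0.0
    for w in weights:
        acc += w / total
        cumulative.append(acc)
    cumulative[-1] = 1.0  # guard against floating-point round-off

    indices, j = [], 0
    for i in range(n):
        position = (u0 + i) / n          # evenly spaced positions, one offset
        while position >= cumulative[j]:
            j += 1                       # advance to the particle covering it
        indices.append(j)
    return indices


picked = systematic_resample([0.1, 0.2, 0.3, 0.4], u0=0.5)  # [1, 2, 3, 3]
```

In practice `u0` would come from a random number generator such as `random.random()`. In the example, the lowest-weight particle is dropped and the highest-weight particle is duplicated, which is the loss of diversity that the abstract calls particle impoverishment.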

20.
王桂彬 《计算机学报》2012,35(5):979-989
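As a typical representative of many-core architectures, GPU (Graphics Processing Units) chips integrate a large number of parallel processing cores, and their power consumption has grown accordingly, making them one of the largest power consumers in a computer system. Software low-power optimization is an effective way to reduce chip power. This paper proposes a model-guided multidimensional low-power optimization technique that combines dynamic voltage/frequency scaling with dynamic core shutdown to reduce GPU power consumption without affecting performance. First, based on the characteristics of the GPU multithreaded execution model, a power optimization model for memory-bound programs is established. Then, using this model, the effects of dynamic voltage/frequency scaling and dynamic core shutdown on program execution time and energy consumption are analyzed, and the power optimization problem is formulated as a general integer programming problem. Finally, an evaluation on nine typical GPU programs and a comparison with existing methods confirm that the proposed technique effectively reduces chip power without affecting performance.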
