期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

王丽娜史晓华《计算机应用》2014,34(11):3121-3125

针对人脸轮廓提取中Chan-Vese模型计算量大、分割速度缓慢等问题,采用开放计算语言(OpenCL)并行编程模型,提出了一种基于图形处理器(GPU)和多核CPU加速的并行算法。该算法首先将模型的框架进行重构,消除模型中的数据依赖关系;然后,利用开放计算语言对算法进行并行化以及相应的优化。实验结果表明,与单线程算法相比,在NVIDIA GTX660和AMD FX-8530下达到了较高的加速比。相似文献

2.

多核CPU和GPU加速分子动力学模拟

林江宏林锦贤吕暾《计算机应用》2011,31(3):843-847

在多核中央处理器(CPU)—图形处理器(GPU)异构并行体系结构上,采用OpenMP和计算统一设备架构(CUDA)编程实现了基于AMBER力场的蛋白质分子动力学模拟程序。通过合理地将程序划分为CPU单线程、CPU多线程和GPU多线程执行部分,高效地利用了计算机的处理能力。性能测试结果表明,相对于优化后的CPU串行计算,多核CPU-GPU异构并行计算模型有强大的性能优势,特别是将占整个程序执行时间90%的作用力的计算移植到GPU上执行,获得了最高可达12倍的计算加速比。相似文献

3.

Fast parallel genetic programming: multi-core CPU versus many-core GPU

Darren M. Chitty 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2012,16(10):1795-1814

Genetic Programming (GP) is a computationally intensive technique which is also highly parallel in nature. In recent years, significant performance improvements have been achieved over a standard GP CPU-based approach by harnessing the parallel computational power of many-core graphics cards which have hundreds of processing cores. This enables both fitness cases and candidate solutions to be evaluated in parallel. However, this paper will demonstrate that by fully exploiting a multi-core CPU, similar performance gains can also be achieved. This paper will present a new GP model which demonstrates greater efficiency whilst also exploiting the cache memory. Furthermore, the model presented in this paper will utilise Streaming SIMD Extensions to gain further performance improvements. A parallel version of the GP model is also presented which optimises multiple thread execution and cache memory. The results presented will demonstrate that a multi-core CPU implementation of GP can yield performance levels that match and exceed those of the latest graphics card implementations of GP. Indeed, a performance gain of up to 420-fold over standard GP is demonstrated and a threefold gain over a graphics card implementation. 相似文献

4.

基于多核CPU+GPU运算的电磁场高效体绘制算法研究

下载免费PDF全文

陈宇峰张铂李林《计算机工程与应用》2018,54(18):218-222

雷达探测范围作为电磁场的一个典型代表,由于其在军事决策时扮演着重要的作用,所以对探测范围可视化的准确性和实时性的要求很严格。传统的面绘制三维数据场信息会造成大量的空间信息丢失。因此,采用体绘制技术来获取电磁场中的三维数据场信息。针对传统体绘制技术算法执行效率较低的问题,提出使用多核CPU+GPU的架构来加速体绘制,从而实现实时处理。实验表明,采用提出的方法可以大幅减少体绘制中光线绘制的时间,充分利用CPU的空闲存储资源和计算资源。相似文献

5.

多核CPU环境下小生境混合遗传算法的研究*

唐天兵谢祥宏韦凌云申文杰《计算机应用研究》2009,26(11):4073-4075

为克服遗传算法的缺点,利用小生境的启发作用,引入改进的模拟退火操作,构造了一个兼顾全局搜索与局部探测的混合遗传算法。针对该算法内在的良好并行性及串行计算难以发挥多核CPU计算优势的问题,将遗传操作和模拟退火操作设计成并行计算形式,利用OpenMP将其线程化。对TSP的求解验证了该算法的有效性,并行算法的加速比和计算效率随着TSP规模的增加而显著提高。相似文献

6.

一种基于OpenCL的高能效并行KNN算法及其GPU验证

《电子技术应用》2016,(2):14-16

近年来数据分类技术已经被广泛应用于各类问题中,作为最重要的分类算法之一,K最近邻法(KNN)也被广泛使用。在过去的近50年,人们就如何提高KNN的并行性能做出巨大努力。基于CUDA的KNN并行实现算法——CUKNN算法证明KNN在GPU上的并行实现比在CPU上串行实现的速度提升数十倍,然而,CUDA在实现过程中包含了大量的冗余计算。提出了一种并行冒泡的新型KNN并行算法,并通过OpenCL,在以GPU作为计算核心的异构系统上进行验证,结果显示提出的方法比CUDA快16倍。相似文献

7.

多核CPU下基因库的多种群遗传算法

孙如祥黄柏雄谢祥宏夏曼《传感器与微系统》2011,30(8):135-137

近年来,多种群遗传算法被广泛地应用到各领域中,鉴于该算法的有效性,在其基础上提出了一种用于在多核CPU环境下求解TSP问题的多种群遗传算法,利用接收到的最优个体构建基因库,种群之间通过共享基因库来实现种群间信息的交流.通过对TSP问题求解的数值实验表明:提出的算法提高了求解质量,而且在效率上也具有很大的优势. 相似文献

8.

Automatic CPU/GPU Generation of Multi-versioned OpenCL Kernels for C++ Scientific Applications

Rafael Sotomayor Luis Miguel Sanchez Javier Garcia Blas Javier Fernandez J. Daniel Garcia 《International journal of parallel programming》2017,45(2):262-282

Parallelism has become one of the most extended paradigms used to improve performance. However, it forces software developers to adapt applications and coding mechanisms to exploit the available computing devices. Legacy source code needs to be re-written to take advantage of multi- core and many-core computing devices. Writing parallel applications in a traditional way is hard, expensive, and time consuming. Furthermore, there is often more than one possible transformation or optimization that can be applied to a single piece of legacy code. Therefore many parallel versions of the same original sequential code need to be considered. In this paper, we describe an automatic parallel source code generation workflow (REWORK) for parallel heterogeneous platforms. REWORK automatically identifies promising kernels on legacy C++ source code and generates multiple specific versions of kernels for improving C++ applications, selecting the most adequate version based on both static source code and target platform characteristics. 相似文献

9.

An application-centric evaluation of OpenCL on multi-core CPUs

Jie Shen Jianbin Fang Henk Sips Ana Lucia Varbanescu 《Parallel Computing》2013

Although designed as a cross-platform parallel programming model, OpenCL remains mainly used for GPU programming. Nevertheless, a large amount of applications are parallelized, implemented, and eventually optimized in OpenCL. Thus, in this paper, we focus on the potential that these parallel applications have to exploit the performance of multi-core CPUs. Specifically, we analyze the method to systematically reuse and adapt the OpenCL code from GPUs to CPUs. We claim that this work is a necessary step for enabling inter-platform performance portability in OpenCL. 相似文献

10.

多核CPU下的K-means遥感影像分类并行方法

吴洁璇陈振杰张云倩骈宇哲周琛《计算机应用》2015,35(5):1296-1301

针对海量遥感影像快速分类的应用需求,提出一种基于K-means算法的遥感影像并行分类方法.该方法结合CPU下进程级与线程级模式的并行特征,设计融合进程级与线程级并行的两阶段数据粒度划分方法和任务调度方法,在保证精度的基础上实现并行加速.利用大数据量的多尺度遥感影像进行实验,结果表明:所提并行方法可大大减少遥感影像的分类时间,取得了良好的加速比(13.83),并可达到负载均衡,从而解决了大区域遥感影像快速分类的问题. 相似文献

11.

Accelerated event-by-event Monte Carlo microdosimetric calculations of electrons and protons tracks on a multi-core CPU and a CUDA-enabled GPU

Georgios kalantzis Hidenobu Tachibana 《Computer methods and programs in biomedicine》2014

For microdosimetric calculations event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of those methods is the extensive requirement for computational time. In this work we present an event-by-event MC code of low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphic processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both GPU and multi-core CPU were utilized simultaneously. The two implementation schemes have been tested and compared with the sequential single threaded MC code on the CPU. Performance comparison was established on the speed-up for a set of benchmarking cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further improvement of the speedup up to 20% was achieved for the hybrid approach. The results indicate the capability of our CPU–GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy. 相似文献

12.

基于OpenCL的双GPU基数排序算法

赵成龙施慧彬俞忻峰《计算机与现代化》2015,(1):27

为提高基数排序算法在异构并行平台下的资源利用率和算法加速比,提出基于OpenCL的双GPU基数排序算法。通过研究并行基数排序思想,以Y485P作为实验平台,使用OpenCL技术首先实现单GPU的基数排序算法,之后实现负载平衡的双GPU基数排序。测试结果表明,在使用单GPU时加速比为1.3x,使用双GPU时加速比为2.32x。相似文献

13.

异构集群中CPU与GPU协同调度算法的设计与实现

高原顾文杰丁雨恒彭晖陈泊宇顾雯轩《计算机工程与设计》2020,41(2):592-601

为有效提高异构的CPU/GPU集群计算性能,提出一种支持异构集群的CPU与GPU协同计算的两级动态调度算法。根据各节点计算能力评测结果和任务请求动态分发数据,在节点内CPU和GPU之间动态调度任务,使用数据缓存和数据处理双队列机制,提高异构集群的传输和处理效率。该算法实现了集群各节点"能者多劳",避免了单节点性能瓶颈造成的任务长尾现象。实验结果表明,该算法较传统MPI/GPU并行计算性能提高了11倍。相似文献

14.

基于GPU与CPU协作的实时波束形成实现方法*

李晓敏侯朝焕鄢社锋杨力《计算机应用研究》2011,28(4):1333-1335

波束形成的实时性一直是声纳和雷达等领域信号处理过程中的重点和难点。本文采用基于CUDA(Compute Unified Device Architecture,统一计算设备架构)的GPU(Graphic Processing Unit,图形处理器)与CPU协作处理方法,实现了宽带波束形成的实时处理。本方法的处理速度相较于matlab和CPU平台可以提高一至两个数量级,相较于同等处理速度的多DSP平台则体现了开发周期短、费用低、工作量小和可靠性高等众多优势。相似文献

15.

在集群多核CPU环境下的等高线并行提取方法 总被引：1，自引：0，他引：1

下载免费PDF全文

王宗跃马洪超徐宏根邬建伟彭检贵《计算机工程与应用》2010,46(17):5-7

分析集群环境下分布式存储编程模型和多核CPU环境下共享存储编程模型各自的优缺点,采用结合集群和多核CPU的并行环境来取长补短;并研究其在等高线提取中的相关并行算法,其中以建立三角网和跟踪等高线作为共享存储并行的研究实例;最后通过实验验证了该并行方案的可行性。相似文献

16.

Optimization schemes and performance evaluation of Smith–Waterman algorithm on CPU,GPU and FPGA

Dan Zou Yong Dou Fei Xia 《Concurrency and Computation》2012,24(14):1625-1644

With fierce competition between CPU and graphics processing unit (GPU) platforms, performance evaluation has become the focus of various sectors. In this paper, we take a well‐known algorithm in the field of biosequence matching and database searching, the Smith–Waterman (S‐W) algorithm as an example, and demonstrate approaches that fully exploit its performance potentials on CPU, GPU, and field‐programmable gate array (FPGA) computing platforms. For CPU platforms, we perform two optimizations, single instruction, multiple data and multithread, with compiler options, to gain over 70 × speedups over naive CPU versions on quad‐core CPU platforms. For GPU platforms, we propose the combination of coalesced global memory accesses, shared memory tiles, and loop unfolding, achieving 50 × speedups over initial GPU versions on an NVIDIA GeForce GTX 470 card. Experimental results show that the GPU GTX 470 gains 12 × speedups, instead of 100 × reported by some studies, over Intel quadcore CPU Q9400, under the same manufacturing technology and both with fully optimized schemes. In addition, for FPGA platforms, we customize a linear systolic array for the S‐W algorithm in a 45‐nm FPGA chip from Xilinx (XC6VLX760), with up to 1024 processing elements. Under only 133 MHz clock rate, the FPGA platform reaches the highest performance and becomes the most power‐efficient platform, using only 25 W compared with 190 W of the GPU GTX 470. Copyright © 2011 John Wiley & Sons, Ltd. 相似文献

17.

CPU与GPU上几种矩阵乘法的比较与分析 总被引：1，自引：0，他引：1

下载免费PDF全文

刘进锋郭雷《计算机工程与应用》2011,47(19):9-11

描述了矩阵乘法在CPU上的三种实现方法和在GPU上基于CUDA架构的四种实现方法,分析了高性能方法的原由,发现它们的共同特点都是合理地组织数据并加以利用,这样能有效地减少存取开销,极大地提高算法的速度。其中CPU上的最优实现方法比普通算法快了200多倍,GPU上的最优实现方法又比CPU上的最优实现方法快了约6倍。相似文献

18.

通用处理器和图像处理器新型融合架构分析

邹治海沈祥黄田祝永新《计算机应用》2011,31(Z1):168-171

CPU与图形处理器(GPU)作为两种主要的通用处理器,在协同工作时存在功耗过大、体积不易压缩、传输速度慢等问题,因而融合成为一种趋势。在分析两者技术特点及通过高性能基准程序实测其性能基础上,提出一种新型融合架构。该融合架构采用低功耗处理器进行任务分配,根据任务类型及计算量,平衡串行处理核心和并行处理核心之间的任务调度及使用效率;而两种处理核心专注于进行数据处理,根据不同任务采用不同组合方式。通过性能评估,该新融合架构在计算能力和功耗方面均有较大改善。相似文献

19.

Transparent partial page migration between CPU and GPU

Shiqing ZHANG Zheng QIN Yaohua YANG Li SHEN Zhiying WANG 《Frontiers of Computer Science》2020,14(3):143101-13

Despite the increasing investment in integrated GPU and next-generation interconnect research,discrete GPU connected by PCIe still account for the dominant position of the market,the management of data communication between CPU and GPU continues to evolve.Initially,the programmer explicitly controls the data transfer between CPU and GPU.To simplify programming and enable systemwide atomic memory operations,GPU vendors have developed a programming model that provides a single,virtual address space for accessing all CPU and GPU memories in the system.The page migration engine in this model automatically migrates pages between CPU and GPU on demand.To meet the needs of high-performance workloads,the page size tends to be larger.Limited by low bandwidth and high latency interconnects compared to GDDR,larger page migration has longer delay,which may reduce the overlap of computation and transmission,waste time to migrate unrequested data,block subsequent requests,and cause serious performance decline.In this paper,we propose partial page migration that only migrates the requested part of a page to reduce the migration unit,shorten the migration latency,and avoid the performance degradation of the full page migration when the page becomes larger.We show that partial page migration is possible to largely hide the performance overheads of full page migration.Compared with programmer controlled data transmission,when the page size is 2MB and the PCIe bandwidth is 16GB/sec,full page migration is 72.72×slower,while our partial page migration achieves 1.29×speedup.When the PCIe bandwidth is changed to 96GB/sec,full page migration is 18.85×slower,while our partial page migration provides 1.37×speedup.Additionally,we examine the performance impact that PCIe bandwidth and migration unit size have on execution time,enabling designers to make informed decisions. 相似文献

20.

多核CPU/GPU平台下的集合求交算法

王怀超赵雷《计算机工程》2013,39(4)

提出一个多核CPU/GPU混合平台下的集合求交算法.针对CPU端求交问题,利用对数据空间局部性和中序求交的思想,给出内向求交算法和Baeza-Yates改进算法,算法速度分别提升0.79倍和1.25倍.在GPU端,提出有效搜索区间思想,通过计算GPU中每个Block在其余列表上的有效搜索区间来缩小搜索范围,进而提升求交速度,速度平均提升40％.在混合平台采用时间隐藏技术将数据预处理和输入输出操作隐藏在GPU计算过程中,结果显示系统平均速度可提升85％. 相似文献