1.

余莹李肯立《计算机应用研究》2014,31(10)

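To address depth-first search (DFS) of graphs on multicore CPUs and GPUs, a new parallel DFS algorithm for multicore CPUs is proposed. It improves performance by using memory bandwidth effectively, and its advantage grows as the graph gets larger. Building on this, a hybrid method is proposed that dynamically selects the best implementation for each DFS branch: sequential execution, multicore execution with one of two different algorithms, or GPU execution. The hybrid algorithm delivers relatively better performance for graphs of every size and avoids the worst case on high-diameter graphs. By comparing multi-CPU and GPU systems, the influence of the underlying architecture on DFS performance is analyzed. Experimental results show that the DFS performance of a high-end single-socket GPU system is comparable to that of a high-end 4-socket CPU system.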

2.

GPU-accelerated charge distribution calculation in molecular dynamics simulation

张德好刘青昆宫利东《计算机应用与软件》2012,(10):79-81,93

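On a heterogeneous parallel computing platform composed of a cluster and GPUs, the MPI+CUDA hybrid programming model is used to compute the charge distribution in molecular dynamics simulations based on the ABEEMσπ model. The computational part of the charge distribution solver is ported to the GPU, and the high communication overhead and underused resources in the algorithm are addressed with asynchronous concurrency on the heterogeneous platform, which improves solution efficiency. Performance tests show that, compared with a pure MPI parallel algorithm, the optimized GPU-accelerated heterogeneous parallel algorithm has a clear performance advantage for charge distribution calculations on large chemical molecule models.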

3.

Efficient implementation of a standardization algorithm for the generalized dense symmetric eigenproblem on GPU clusters

刘世芳赵永华于天禹黄荣锋《计算机科学》2020,47(4):6-12

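Solving the generalized dense symmetric eigenproblem is a major task in many areas of applied science and engineering, and an important part of computations in computational electromagnetics, electronic structure, finite element modeling, and quantum chemistry. Transforming the generalized symmetric eigenproblem into a standard symmetric eigenproblem is the key computational step in solving it. For GPU clusters, this paper presents an MPI+CUDA implementation of a block standardization algorithm for the generalized dense symmetric eigenproblem. To suit the architecture of GPU clusters, the standardization algorithm combines the Cholesky factorization of the positive definite matrix with the traditional block standardization algorithm for the generalized eigenproblem, which reduces unnecessary communication overhead in the standardization algorithm and increases its parallelism. In the MPI+CUDA standardization algorithm, data transfers between the GPU and the CPU are used to mask data copy operations within the GPU, which eliminates the time spent on the copies and thereby improves program performance. The paper also presents a fully parallel point-to-point transposition algorithm between the row and column communicators of a two-dimensional communication grid, and an MPI+CUDA parallel block algorithm for solving the triangular matrix equation BX=A with multiple right-hand sides. The MPI+CUDA standardization algorithm was tested on the supercomputer system “元” at the Computer Network Information Center of the Chinese Academy of Sciences, where each compute node has 2 Nvidia Tesla K20 GPGPU cards and 2 Intel E5-2680 V2 processors, using up to 32 GPUs and matrices of different sizes. It achieved good speedup and performance, with good scalability. With 32 GPUs and a matrix of order 50000×50000, the peak performance reached about 9.21 Tflops.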
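
The standardization step in this abstract can be illustrated on a single node. The following is a minimal NumPy sketch, not the authors' MPI+CUDA block algorithm: it assumes the textbook reduction in which A x = λ B x with positive definite B = L Lᵀ becomes C y = λ y with C = L⁻¹ A L⁻ᵀ. The helper name `standardize` and the random test matrices are hypothetical, and the general-purpose `np.linalg.solve` stands in for a dedicated triangular solver.

```python
import numpy as np

def standardize(A, B):
    """Reduce A x = lambda B x to C y = lambda y, where B = L L^T and C = L^-1 A L^-T."""
    L = np.linalg.cholesky(B)          # lower-triangular Cholesky factor of B
    X = np.linalg.solve(L, A)          # X = L^-1 A (general solver used for brevity)
    C = np.linalg.solve(L, X.T).T      # (L^-1 X^T)^T = X L^-T = L^-1 A L^-T
    return (C + C.T) / 2, L            # symmetrize to remove rounding asymmetry

# Hypothetical test data: symmetric A and symmetric positive definite B.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M + M.T
N = rng.standard_normal((5, 5))
B = N @ N.T + 5 * np.eye(5)

C, L = standardize(A, B)
std_eigs = np.linalg.eigvalsh(C)                                   # standard problem
gen_eigs = np.sort(np.linalg.eigvals(np.linalg.solve(B, A)).real)  # eigenvalues of B^-1 A
```

Both eigenvalue sets should agree up to rounding, because C is similar to B⁻¹A. Both solves in `standardize` are linear systems with the triangular factor L and many right-hand sides, the same kind of triangular solve as the BX=A equation mentioned in the abstract.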

4.

Heuristic graph partitioning based on weighted graph generation by breadth-first traversal

蹇冬宇程永利《计算机系统应用》2023,32(12):218-223

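The quality of graph partitioning strongly affects communication overhead and load balance among machines, which is critical to the performance of large-scale parallel graph computing. However, as graph data keep growing, the execution time of graph partitioning algorithms has become an unavoidable problem, so it is necessary to study how to make these algorithms run more efficiently. This paper proposes a heuristic graph partitioning method based on weighted graph generation by breadth-first traversal. The method achieves low communication cost and good load balance while adding only a small preprocessing time overhead. Experimental results show that the proposed partitioning method reduces the replication factor and the communication overhead, and that the time overhead it introduces is small.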

5.

A merged storage structure suitable for GPU image processing algorithms

左宪禹张哲黄祥志葛强张理涛臧文乾《计算机工程与科学》2020,42(2):197-202

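Most image processing algorithms can be accelerated on GPUs to achieve better execution performance, but the scheduling of data transfers relative to kernel execution remains the main bottleneck that limits further speedup. To address this, GPU task streams are commonly used to overlap kernel execution with data transfers and thus hide part of the transfer and kernel execution time. However, because of the characteristics of the CUDA programming model and the limits of GPU hardware resources, in some cases each stream still contains serially executed tasks even when many streams are created for overlapping, so the speedup cannot be improved further. The paper therefore uses CSS to merge the images to be processed, so that the operator kernels and data transfers within a single stream are merged as well, which reduces the fixed cost of, and the gaps between, data transfers and kernel launches. The experimental results show that the proposed CSS structure improves the performance of GPU image processing algorithms with a single stream and further improves the speedup with multiple streams. It is practical and scalable, and it suits workloads with many operator steps or batch processing of small images. The proposed method also offers a new research direction for GPU acceleration of image processing algorithms.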

6.

A scheduling algorithm that reduces inter-core communication overhead

韩乐陈香兰李曦《计算机系统应用》2014,23(9):65-71

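In recent years, multicore processors have been used more and more widely in embedded systems, but the unavoidable communication overhead between cores prevents large gains in system performance, so reducing inter-core communication overhead has become especially important. For periodic dependent tasks on homogeneous multicore platforms, a task scheduling algorithm that reduces inter-core communication overhead is proposed and then optimized: by scheduling some tasks one period in advance, data dependencies between tasks within a period are converted into data dependencies across periods, which shortens the schedule length and improves system performance. The algorithms were simulated, and several groups of experiments were run on dual-core and quad-core platforms. The results show that the proposed optimized scheduling algorithm significantly reduces the inter-core communication overhead of periodic dependent tasks and improves execution efficiency.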

7.

An efficient storage system for highly concurrent graph analysis tasks

赵进姜新宇张宇廖小飞金海刘海坤杨赟张吉王彪余婷《中国科学:信息科学》2022,(1):111-128

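With the rapid growth of real-world demand for graph computing, a single platform often runs a large number of iterative graph analysis tasks concurrently. However, existing graph computing systems are mainly designed to execute a single graph analysis task efficiently. As a result, when multiple concurrent graph analysis tasks run in parallel on the same underlying graph, existing systems incur very large data access overhead. To improve the throughput of concurrent graph analysis tasks, existing out-of-core concurrent graph processing schemes reduce concurrent ... by sharing graph data...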

8.

Large-message broadcast design for distributed machine learning

辛逸杰谢彬李振兴《计算机系统应用》2020,29(1):1-13

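MPI (Message Passing Interface) was designed for node-dense, large-scale computing clusters. However, with the emergence of MPI+CUDA (Compute Unified Device Architecture) applications and of clusters whose compute nodes contain GPUs, traditional communication libraries such as MPI can no longer meet the demand. Machine learning faces the same challenge. In deep learning frameworks such as Caffe and CNTK (Microsoft Cognitive Toolkit), GPUs buffer very large amounts of data during training, and most optimization algorithms used in training are iterative, so the volume and frequency of communication between GPUs are high. This has become one of the main factors limiting deep learning training performance. Collective communication libraries aimed at deep learning, such as NCCL (Nvidia Collective multi-GPU Communication Library), have been released, but they have problems of their own, such as incompatibility with MPI. It is therefore especially important to design a more efficient communication acceleration mechanism that fits these new trends. To meet these challenges, this paper proposes two new broadcast mechanisms: (1) a pipelined chain (PC) mechanism based on MPI_Bcast, which provides efficient intra-node and inter-node communication for GPU buffers; (2) a topology-aware pipelined chain (TA-PC) mechanism for multi-GPU cluster systems, which makes full use of the available PCIe links between multi-GPU nodes. To validate the new broadcast designs, experiments were run on three GPU clusters with different configurations: the GPU-dense cluster RX1, the node-dense cluster RX2, and the balanced cluster RX3. Compared with MPI+NCCL1 MPI_Bcast, the new design achieved speedups of about 14× for intra-node communication and about 16.6× for inter-node communication. Compared with NCCL2, it achieved a speedup of about 10× for small and medium messages and comparable performance for large messages. In addition, the TA-PC design achieved about 50% better performance than the PC design on a 64-GPU cluster. The results show that the proposed solution has considerable advantages in portability and performance.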
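
The pipelined chain idea in this abstract can be sketched with a small step-counting simulation. This is a hedged illustration only: it assumes a linear chain in which every link carries one chunk per step, and it models neither MPI_Bcast, GPU buffers, nor the PCIe topology used by TA-PC. The function name `pipelined_chain_broadcast` and its parameters are hypothetical.

```python
def pipelined_chain_broadcast(message, num_nodes, chunk_size):
    """Simulate a chain broadcast in which the root streams fixed-size chunks down the chain."""
    chunks = [message[i:i + chunk_size] for i in range(0, len(message), chunk_size)]
    received = [[] for _ in range(num_nodes)]
    received[0] = list(chunks)  # node 0 is the root and already owns the full message
    steps = 0
    while any(len(r) < len(chunks) for r in received):
        # Each link (i -> i+1) may carry one chunk per step; all links work in parallel.
        sends = []
        for i in range(num_nodes - 1):
            nxt = len(received[i + 1])       # index of the next chunk node i+1 needs
            if nxt < len(received[i]):       # node i already holds that chunk
                sends.append((i + 1, chunks[nxt]))
        for dst, chunk in sends:             # apply all sends of this step together
            received[dst].append(chunk)
        steps += 1
    return [b"".join(r) for r in received], steps

data, steps = pipelined_chain_broadcast(b"abcdefgh", num_nodes=4, chunk_size=2)
```

With 4 chunks and 4 nodes the simulation finishes in 6 steps (chunks + nodes - 2), whereas forwarding the whole message hop by hop would take 4 × 3 = 12 chunk-times under the same assumptions. Overlapping the hops in this way is what makes chunked chains attractive for large messages.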

9.

A method for accelerating seismic prestack time migration using GPUs

张清谢海波赵开勇吴庆陈维王狮虎迟旭光褚晓文《微型机与应用》2011,30(10):87-90

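General-purpose high-performance GPU programming is used to implement a new method for accelerating seismic prestack time migration (PSTM). PSTM is a routine step in seismic exploration processing, and its core algorithm is compute-intensive, has highly independent data, and is highly parallel. Profiling is used to find the computational hotspots, which are then parallelized with CUDA, and CUDA streams are used for asynchronous transfers from the CPU to the GPU. Performance tests in a cluster environment show that the GPU-parallelized PSTM program significantly shortens the running time.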

10.

Multi-DAG scheduling with communication overhead that balances cost and fairness

王宇新曹仕杰郭禾陈征陈鑫《计算机应用》2015,35(11):3017-3020

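Scheduling algorithms for multiple directed acyclic graph (DAG) workflows in the cloud should consider several metrics, including execution time, cost, communication overhead, and fairness. Based on the model of DAGs with communication overhead (CA-DAG) and combined with a fairness algorithm, this paper proposes a backward difference (BD) principle that optimizes completion time, and CAFS, a multi-DAG scheduling strategy that balances cost and fairness. CAFS has two phases. The pre-scheduling phase uses the workflow cost optimization algorithm with communication overhead (CACO) to find the optimal service for every task and optimize cost while accounting for communication overhead, and uses the fairness algorithm to obtain a reasonably fair scheduling order. The scheduling phase applies the BD principle to the order obtained in the pre-scheduling phase to further optimize the overall completion time, and then performs the scheduling. Experimental results show that CAFS has good fairness and reduces time by 19.82% without increasing cost.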

11.

Efficient breadth first search on multi-GPU systems

Enrico Mastrostefano Massimo Bernaschi 《Journal of Parallel and Distributed Computing》2013

Simple algorithms for Breadth First Search on large graphs, when run on clusters of GPUs, lead to load imbalance among threads and uncoalesced memory accesses, which results in low performance. To obtain a significant improvement on a single GPU and to scale across multiple GPUs, we resort to a suitable combination of operations that rearrange data before processing them. We propose a novel technique for mapping threads to data that achieves a perfect load balance by leveraging prefix-sum and binary search operations. To reduce the communication overhead, we perform a pruning operation on the set of edges that needs to be exchanged at each BFS level. The result is an algorithm that makes the best use of the parallelism available on a single GPU and minimizes communication among GPUs. We show that a cluster of GPUs can efficiently perform a distributed BFS on graphs with billions of nodes.
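
The thread-to-data mapping mentioned in this abstract can be sketched sequentially. The code below is an illustrative reading, not the authors' GPU kernel: it assumes a CSR graph (`row_ptr`, `col_idx`), assigns one logical thread per frontier edge, and uses an exclusive prefix sum over frontier degrees plus a binary search to find each thread's edge. The function name `map_threads_to_edges` and the example graph are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

def map_threads_to_edges(row_ptr, col_idx, frontier):
    """For each logical thread id, find the (source vertex, neighbor) pair it should process."""
    degrees = [row_ptr[v + 1] - row_ptr[v] for v in frontier]
    # Exclusive prefix sum: offsets[k] is the first thread id assigned to frontier[k].
    offsets = [0] + list(accumulate(degrees))
    total_edges = offsets[-1]
    work = []
    for tid in range(total_edges):            # on a GPU, each tid would be one thread
        k = bisect_right(offsets, tid) - 1    # binary search for the owning frontier slot
        src = frontier[k]
        edge = row_ptr[src] + (tid - offsets[k])
        work.append((src, col_idx[edge]))
    return work

# Hypothetical 4-vertex graph in CSR form; vertex 1 has no outgoing edges.
row_ptr = [0, 3, 3, 5, 6]
col_idx = [1, 2, 3, 0, 3, 0]
work = map_threads_to_edges(row_ptr, col_idx, frontier=[0, 1, 2])
```

Every thread handles exactly one edge regardless of how skewed the vertex degrees are, which is the sense in which such a mapping balances load. Zero-degree vertices are skipped automatically because they contribute no thread ids.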

12.

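Design and implementation of a parallel cuckoo search algorithm based on CUDA (total citations: 1; self-citations: 0; citations by others: 1)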

韦向远 ;杨辉华 ;谢谱模《计算机科学与探索》2014,(6):665-673

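Cuckoo search (CS) is an intelligent metaheuristic algorithm developed in recent years that has been applied successfully to many optimization problems. Because CS takes too long on big-data and large-scale complex problems, a parallel cuckoo search algorithm based on the compute unified device architecture (CUDA) is proposed. The parallel implementation combines task parallelism with data parallelism: graphics processing unit (GPU) thread blocks are mapped to cuckoo individuals and threads are mapped to each dimension of an individual, and the nest position update, individual fitness evaluation, nest rebuilding, and best-individual search steps of CS all run in parallel. The entire optimization loop of CS runs on the GPU, which reduces the communication overhead between the CPU and the GPU during computation. Simulation experiments on 4 classic benchmark functions show that, with consistent convergence, the CUDA-based parallel CS algorithm achieves a speedup of up to 110× over the standard CS algorithm.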

13.

Active replication of multithreaded applications

Basile C. Kalbarczyk Z. Iyer R.K. 《Parallel and Distributed Systems, IEEE Transactions on》2006,17(5):448-465

Software-based active replication is expensive in terms of performance overhead. Multithreading can help improve performance; however, thread scheduling is a source of nondeterminism in replica behavior. To achieve strong replica consistency in multithreaded environments, this paper proposes intercepting the mutex lock/unlock operations that threads perform when accessing shared data, and contributes two algorithmic solutions: 1) a loose synchronization algorithm (LSA), which captures the natural concurrency in a leader replica and projects it on follower replicas through interreplica communication, and 2) a preemptive deterministic scheduler (PDS) algorithm, which removes the need for interreplica communication through the notion of a round and by suspending threads when it is not (yet) able to schedule them deterministically. Failure behavior and performance of LSA and PDS implementations are evaluated in a triplicated system and compared with existing solutions. A performance evaluation indicates that LSA and PDS outperform existing solutions, with PDS offering lower throughput than LSA. A fault-injection campaign shows that PDS is more robust to errors due to the absence of interreplica communication. Hence, LSA and PDS represent a trade-off between performance and dependability. Finally, LSA and PDS are demonstrated in replicating the Apache Web server, a substantial real-world application.

14.

Accelerating frequent itemset mining on graphics processing units

Fan Zhang Yan Zhang Jason D. Bakos 《The Journal of supercomputing》2013,66(1):94-117

In this paper we describe a new parallel Frequent Itemset Mining algorithm called “Frontier Expansion.” This implementation is optimized to achieve high performance on a heterogeneous platform consisting of a shared memory multiprocessor and multiple Graphics Processing Unit (GPU) coprocessors. Frontier Expansion is an improved data-parallel algorithm derived from the Equivalent Class Clustering (Eclat) method, in which a partial breadth-first search is used to exploit maximum parallelism within the constraint of the available memory capacity. In our approach, the vertical transaction lists are stored in a “bitset” representation and processed with wide bitwise operations across multiple threads on a GPU. We evaluate our approach using four NVIDIA Tesla GPUs and observe a 6–30× speedup relative to state-of-the-art sequential Eclat and FPGrowth implementations executed on a multicore CPU.
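
The bitset representation of vertical transaction lists can be illustrated with a small CPU-only sketch. It is not the Frontier Expansion implementation: it assumes each item's transaction-id list is packed into a Python integer, so that the support of an itemset is the population count of the bitwise AND of its items' bitsets. The helper names `to_bitset` and `support` and the toy transactions are hypothetical.

```python
def to_bitset(tids):
    """Encode a transaction-id list as an integer whose bit t is set when t is in the list."""
    bits = 0
    for t in tids:
        bits |= 1 << t
    return bits

def support(bitsets, itemset):
    """Support of an itemset: popcount of the AND of its items' bitsets."""
    items = iter(itemset)
    acc = bitsets[next(items)]
    for item in items:
        acc &= bitsets[item]           # intersect transaction sets with one bitwise AND
    return bin(acc).count("1")

# Hypothetical toy database in horizontal form.
transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}, {"a"}]

# Convert to the vertical layout used by Eclat: item -> list of transaction ids.
vertical = {}
for tid, items in enumerate(transactions):
    for item in items:
        vertical.setdefault(item, []).append(tid)
bitsets = {item: to_bitset(tids) for item, tids in vertical.items()}
```

Here `support(bitsets, ("a", "c"))` counts the transactions containing both items. On a GPU the same intersection would be spread across threads, each handling a slice of wide machine words, instead of using one arbitrary-precision integer.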

15.

Performance evaluation and optimization of containers on a CPU-GPU heterogeneous supercomputing platform over an IB network

胡鹤赵毅王宪贺《计算机工程与应用》2021,57(18):82-85

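To isolate resources and system environments, a variety of virtualization tools have emerged in recent years, and containers are one of them. Problems with running on supercomputing resources are usually caused by software configuration. One role of containers is to package dependencies into a lightweight, portable environment, which makes deploying supercomputing applications more efficient. To understand the performance characteristics of container virtualization on a CPU-GPU heterogeneous supercomputing platform with an IB network, standard benchmark tools were used to evaluate Docker containers comprehensively. The method measures the performance overhead that containers incur when virtualizing the host, covering file system access, parallel communication, and GPU computing. The results show that containers deliver near-native host performance: file system I/O overhead and GPU computing overhead differ little from the native host, while the parallel communication overhead of containers grows as the network load increases. Based on the evaluation, a method for getting the most out of containers on supercomputing platforms is proposed, which gives users a basis for targeted system configuration and sound application design.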

16.

An efficient parallelization technique for x264 encoder on heterogeneous platforms consisting of CPUs and GPUs

Youngsub Ko Youngmin Yi Soonhoi Ha 《Journal of Real-Time Image Processing》2014,9(1):5-18

H.264/AVC video encoders are widely used for their high coding efficiency. Because the computational demand, which is proportional to the frame resolution, keeps increasing, there is great interest in accelerating H.264/AVC through parallel processing. Recently, graphics processing units (GPUs) have emerged as a viable target for accelerating general-purpose applications by exploiting fine-grained data parallelism. Despite extensive research on using GPUs to accelerate the H.264/AVC algorithm, no speed-up has been achieved over x264, which is known as the fastest CPU implementation, mainly because of significant communication overhead between the host CPU and the GPU and intra-frame dependency in the algorithm. In this paper, we propose a novel motion-estimation (ME) algorithm tailored for NVIDIA GPU implementation. It is accompanied by a novel pipelining technique, called sub-frame ME processing, that effectively hides the communication overhead between the host CPU and the GPU. Further, we incorporate a frame-level parallelization technique to improve the overall throughput. Experimental results show that our proposed H.264 encoder has higher performance than the x264 encoder.

17.

Efficient data management for incoherent ray tracing

Xin Yang Duan-qing Xu Lei Zhao 《Applied Soft Computing》2013,13(1):1-8

To obtain good performance on GPU hardware, algorithms must be designed to manage data, access memory in line with the GPU memory hierarchy, and schedule threads more efficiently. In this paper, we propose efficient data management and task management designed for GPU-based ray tracing. Because ray tracing is dynamic and uncertain, we design a data-management layer and a task-management layer combined with fuzzy spatial analysis, and we use two-level ray sorting and a ray bucket structure to reorganize ray data. The threads of a warp can then be scheduled to access coherent geometry and node data, which reduces memory bandwidth and lets the data be dispatched locally. We schedule tasks in a data-driven manner according to coherent data, propose an adaptive ray compaction to eliminate inactive threads and maintain the task efficiency of the threads in a warp, and design two heuristics to decrease the compaction cost. On this basis, we also introduce a memory-optimized dynamic traversal management to reduce incoherent memory access and avoid frequent sorting and compaction operations. Our experiments demonstrate that these techniques combined achieve good performance.

18.

CUIRRE: An open-source library for load balancing and characterizing irregular applications on GPUs

Tao Zhang Wei Shu Min-You Wu 《Journal of Parallel and Distributed Computing》2014

While Graphics Processing Units (GPUs) show high performance for problems with regular structures, they do not perform well for irregular tasks because of the mismatch between irregular problem structures and SIMD-like GPU architectures. In this paper, we introduce a new library, CUIRRE, for improving the performance of irregular applications on GPUs. CUIRRE reduces the load imbalance of GPU threads that results from irregular loop structures. In addition, CUIRRE can characterize irregular applications by their irregularity, thread granularity, and GPU utilization. We employ this library to characterize and optimize both synthetic and real-world applications. The experimental results show that the centralized task pool approach in the library achieves an average 1.63× and up to 2.76× performance improvement, at a 4.57% average overhead with static loading ratios. To avoid the cost of exhaustive searches for loading ratios, an adaptive loading ratio method is proposed that derives appropriate loading ratios for different inputs automatically at runtime. Our task pool approach outperforms other load balancing schemes such as the task stealing method and the persistent threads method. The CUIRRE library can easily be applied to many other irregular problems.

19.

A GPU-based multi-resolution algorithm for simulation of seed dispersal

Jing FAN Hai-feng JI Xin-xin GUAN Ying TANG 《浙江大学学报:C卷英文版》2012,(11):816-827

In forest dynamics models, the computation and load involved in simulating seed dispersal can become prohibitively large for large-scale forest analysis. To solve this problem, we propose a multi-resolution algorithm to compute seed dispersal on the GPU. By exploiting the computational parallelism of seed dispersal, the computation for the whole forest plot is divided into multiple small plot cells, which are computed independently by parallel threads on the GPU. To further improve calculation efficiency given the limited number of threads available for GPU computation, we propose a hierarchical method that clusters the plot cells into a multi-resolution form according to the biological curves of tree seed dispersal. Experimental results show that our algorithm not only greatly reduces computation time but also produces results of comparable accuracy to the naive GPU algorithm, which makes it especially suitable for large-scale forest modeling.