首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
近年来异构并行计算在高性能科学计算和通用应用领域受到广泛研究。本文结合多种代表性并行计算模型,给出异构环境中的HBSP模型和程序开销计算方法。采用基于消息长度的线性模型使通信开销的计算更精确,解除原有BSP模型对h-rela-tion的限制,使程序和算法在异构环境中的设计更加灵活。当构成BSP计算机的各处理机速度相同且原有BSP算法达到最优(即各处理机上所分配的计算量与通信量完全均衡)时,HBSP模型等同于原有模型。  相似文献   

2.
In this work, we show that the submachine locality exposed by hierarchical bulk-synchronous computations can be efficiently turned into locality of reference on arbitrarily deep hierarchies. Specifically, we develop efficient schemes to simulate parallel programs written for the Decomposable BSP (a BSP variant which features a hierarchical decomposition into submachines) on the sequential Hierarchical Memory Model (HMM), which rewards the exploitation of temporal locality, and on its extension with block transfer, the BT model, which also rewards the exploitation of spatial locality. The simulations yield good hierarchy-conscious sequential algorithms from parallel ones, and provide evidence of the strict relation between submachine locality in parallel computation and locality of reference in the hierarchical memory setting. We also devise a generalization of the HMM result to the self-simulation of D-BSP augmented with hierarchical memory modules, which yields an interesting analog of Brent's lemma, thus proving that the enhanced model features a seamless integration of memory and network hierarchies.  相似文献   

3.
Molecular dynamics (MD) simulation has broad applications, and an increasing amount of computing power is needed to satisfy the large scale of the real world simulation. The advent of the many-core paradigm brings unprecedented computing power, but it remains a great challenge to harvest the computing power due to MD’s irregular memory-access pattern. To address this challenge, this paper presents a joint application/architecture study to enhance the scalability of MD on Godson-T-like many-core architecture. First, a preprocessing approach leveraging an adaptive divide-and-conquer framework is designed to exploit locality through memory hierarchy with software controlled memory. Then three incremental optimization strategies–a novel data-layout to improve data locality, an on-chip locality-aware parallel algorithm to enhance data reuse, and a pipelining algorithm to hide latency to shared memory–are proposed to enhance on-chip parallelism for Godson-T many-core processor. Experiments on Godson-T simulator exhibit strong-scaling parallel efficiency of 0.99 on 64 cores, which is confirmed by a field-programmable gate array emulator. Also the performance per watt of MD on Godson-T is much higher than MD on a 16-cores Intel core i7 symmetric multiprocessor (SMP) and 26 times higher than MD on an 8-core 64-thread Sun T2 processor. Detailed analysis shows that optimizations utilizing architectural features to maximize data locality and to enhance data reuse benefit scalability most. Furthermore, a hierarchical parallelization scheme is designed to map the MD algorithm to Godson-T many-core cluster and a simple performance model is derived, which suggests that the optimization scheme is likely to scale well toward exascale. Certain architectural features are found essential for these optimizations, which could guide future hardware developments.  相似文献   

4.
从应用角度出发,分析、归纳各种应用中的核心计算过程,利用符合多核处理器芯片架构的并行计算模型对这些核心计算过程进行优化,得出可以被重复利用的高性能可扩展的软件库,它既可以支持新应用的高效开发,也可以保证程序性能的可扩展性。以分层并行计算模型思想为指导,从应用驱动的并行程序性能优化的角度出发,首先提出了面向多核处理器芯片体系结构的并行算法设计模型,在此基础上对并行扫描算法进行分析优化,得出新的具有良好扩展性、高性能的g-scan算法。之后深入研究13种核心计算实体之一的稀疏线性代数计算实体,应用g-scan算法设计实现了新的稀疏矩阵-向量运算算法,并将其应用于结构工程领域中广泛使用的有限元分析,大大提升了其执行效率。  相似文献   

5.
6.
对采用多核处理器作为SMP集群系统的计算节点的系统上的一种混合编程模型-MPI+OpenMP混合编程模型进行了深入的研究.建立了两个矩阵乘的混合并行算法,在多核集群平台上与纯MPI算法分别进行了实验,并进行了性能方面的比较.试验表明,混合编程具有更好的性能.  相似文献   

7.
一种异步BSP模型及其程序优化技术   总被引:2,自引:0,他引:2  
基于BSP模型,该文提出了异步计算模型(CSA-BSP)。该模型更准确地描述了并行机的性能参数,引导用户编写高效率的并行程序,在CSA-BSP模型下,两个进程异步执行的位置至多相差p-1个超步;基于程序的执行时间,作者分析了BSP、A-BSP和CSA-BSP程序的效率,得出CSA-BSP程序的效率是最高的,在曙光并行机上,用“红黑格法”和“矩阵乘法”进行了验证,和BSP模型相比,这两个CSA-BSP程序的效率分别提高20%和37%;同时,其进程执行时间和最大可以降低8%,因此,按照CSA-BSP模型编程对于提高程序效率和改善系统的吞吐率,都有良好的效果。  相似文献   

8.
Scheduling is a fundamental issue in achieving high performance on metacomputers and computational grids. For the first time, the job scheduling problem for grid computing on metacomputers is studied as a combinatorial optimization problem. A cost model is proposed for modeling communication heterogeneity on computational grids. A processor allocation algorithm is developed which always finds an optimal processor allocation that minimizes the effective execution time of a job when the job is being scheduled. It is proven that the list scheduling (LS) algorithm can achieve reasonable worst-case performance bound in grid environments supporting distributed supercomputing with large applications. We compare the performance of various job scheduling and processor allocation algorithms for grid computing on metacomputers. We evaluate the performance of 128 combinations of two job scheduling algorithms, four initial job ordering strategies, four processor allocation algorithms, and four metacomputers by extensive simulation. It is found that the combination of largest job first (LJF) initial job ordering and minimum effective execution time (MEET) or largest machine first (LMF) processor allocation algorithm yields the best average-case performance, and the choice of FCFS and LS depends on the range of job sizes. It is also observed that communication heterogeneity does have significant impact on schedule lengths.  相似文献   

9.
Processor specialization has become the development trend of modern processor industry. It is quite possible that this will still be the main-stream in the next decades of semiconductor era. As the diversity of heterogeneous systems grows, organizing computation efficiently on systems with multiple kinds of heterogeneous processors is a challenging problem and will be a normality. In this paper, we analyze some state-of-the-art task scheduling algorithms of heterogeneous computing systems and propose a Degree of Node First (DONF) algorithm for task scheduling of fine-grained parallel programs on heterogeneous systems. The major innovations of DONF include:1) simplifying task priority calculation for directed acyclic graph (DAG) based fine-grained parallel programs which not only reduces the complexity of task selection but also enables the algorithm to solve the scheduling problem for dynamic DAGs; 2) building a novel communication model in the processor selection phase that makes the task scheduling much more efficient. They are achieved by exploring finegrained parallelism via a dataflow program execution model, and validated through experimental results with a selected set of benchmarks. The results on synthesized and real-world application DAGs show a very good performance. The proposed DONF algorithm significantly outperforms all the evaluated state-of-the-art heuristic algorithms in terms of scheduling length ratio (SLR) and efficiency.  相似文献   

10.
更实际的异构并行计算模型   总被引:4,自引:1,他引:3  
通过结合多种代表性并行计算模型,给出异构环境中的HBSP模型和程序开销计算方法。采用基于消息长度的线性模型具有通信开销的计算更精确、程序和算法在异构环境中的设计灵活、且可解除原有BSP模型对h-relation的限制等优点。当构成BSP计算机的各处理机速度相同且原有BSP算法达到最优(即各处理机上所分配的计算量与通信量完全均衡)时,HBSP模型等同于原有模型。  相似文献   

11.
陈涛  张玮 《微机发展》2007,17(1):139-141
在研究关联规则挖掘算法的基础上,对并行关联规则算法进行了比较全面的分析,并给出了并行数据挖掘的计算框架。提出了一个以计算服务器为中心节点的并行挖掘算法,可以发挥各局部节点的优势,无需各局部节点进行通信,减少了各局部节点的通信负荷。通过理论分析和实验数据验证,该算法具有较好的可扩展性和海量处理能力,特别是在节点数目较多的情况下更显示出优势。  相似文献   

12.
This paper presents a study and comparison of shape design sensitivity analysis algorithms that are based on the continuum adjoint variable method, the continuum direct differentiation method, and the finite difference method, implemented on a supermini computer with an attached array processor. The basic algorithms and their differences in evaluating shape design sensitivity coefficients are outlined. A solution method for solving a system of equations, using a general sparse storage technique, is used for numerical implementation of shape design sensitivity analysis. It is found that computing shape design sensitivity coefficients using the direct differentiation method is significantly more efficient than using the adjoint variable method or the finite difference method. A detailed performance evaluation of the methods, using an attached array processor, is presented. The performance of the attached array processor, compared to a supermini computer is shown to depend strongly on the type of computations to be carried out. When only parts of a program are running on an attached array processor, the CPU time distribution among the different subroutines of the program can change significantly, compared to using the host processor only.  相似文献   

13.
针对现有防危调度算法在软硬件失效情况下防危能力不足的问题,具体进行了以下工作:构建了一种分层防危实时调度模型,该模型从功能组件和安全分区两方面描述了安全关键实时应用的防危性需求,并给出一种基于分层调度思想的三级防危调度器框架。以该模型和框架为基础,提出了一种新的分层防危调度算法(HSS),该算法对安全关键实时应用中不同关键度的功能组件采用空间隔离机制,对同一功能组件内的不同分区采用时间隔离机制,兼顾实现了时空隔离的防危效果。仿真实验结果表明,HSS算法与其他同类算法相比,在防危效果和应用负载承受能力方面具有较好的表现。  相似文献   

14.
Makoto Kobayashi 《Software》1977,7(5):585-594
This paper proposes a set of new program restructuring algorithms which can be used to reorganize programs so as to increase their performance under two typical memory management strategies. The new algorithms are based on a recently proposed program behaviour model called the bounded locality intervals model, which allows us to give a precise definition of the localities of a program. The paging activities of a program restructured with the new algorithms under working-set and global LRU-like memory management strategies are simulated to evaluate the new algorithms. Some of them are shown to have quite satisfactory performance.  相似文献   

15.
Algorithms for scheduling independent tasks on to the processors of a multiprocessor system must trade-off processor load balance, memory locality, and scheduling overhead. Most existing algorithms, however, do not adequately balance these conflicting factors. This paper introduces the self-adjusting dynamic scheduling (SADS) class of algorithms that use a unified cost model to explicitly account for these factors at runtime. A dedicated processor performs scheduling in phases by maintaining a tree of partial schedules and incrementally assigning tasks to the least-cost schedule. A scheduling phase terminates whenever any processor becomes idle, at which time partial schedules are distributed to the processors. An extension of the basic SADS algorithm, called DBSADS, controls the scheduling overhead by giving higher priority to partial schedules with more task-to-processor assignments. These algorithms are compared to two distributed scheduling algorithms within a database application on an Intel Paragon distributed memory multiprocessor system.  相似文献   

16.
并行计算模型在集群环境下的适应性   总被引:1,自引:0,他引:1  
分析了并行计算机模型和集群系统的特点,研究了BSP并行计算模型在集群环境下的适应性,指出通过合理地设计并行算法,某些算法在集群环境下可以获得近似线性的加速比,并用常用的线性规划标准形改进单纯型求最优解,在集群系统上的并行算法验证了该结论。  相似文献   

17.
基于阶段并行模型的算法设计研究   总被引:1,自引:0,他引:1  
NOWs正成为并行计算领域的一个新的发展热点,以太网构成的微机集群系统是NOWs的一种重要实现形式。阶段并行模型是BSP模型的改进,它更接近于表述实际的机器行为,同时具有编程简单、独立于体系结构和执行性能可预测等特点。文章研究了群集系统中阶段并行模型上的并行算法设计,以FFT算法为例,进行了设计和分析,并给出了测试结果。  相似文献   

18.
基于层次化调度策略和动态数据复制的网格调度方法   总被引:2,自引:0,他引:2  
针对在网格中如何有效地进行任务调度和数据复制, 以便减少任务执行时间等问题, 提出了任务调度算法(ISS)和优化动态数据复制算法(ODHRA), 并构建一个方案将两种算法进行了有效结合。该方案采用ISS算法综合考虑任务等待队列的数量、任务需求数据的位置和站点的计算容量, 采用网络结构分级调度的方式, 配以适当的权重系数计算综合任务成本, 搜索出最佳计算节点区域; 采用ODHRA算法分析数据传输时间、存储访问延迟、等待在存储队列中的副本请求和节点间的距离, 在众多的副本中选取出最佳副本位置, 再结合副本放置和副本管理, 从而降低了文件访问时间。仿真结果表明, 提出的方案在平均任务执行时间方面, 与其他算法相比表现出了更好的性能。  相似文献   

19.
在嵌入式并行计算系统中,任务调度是决定系统性能的关键。多任务调度中,启发式调度法是一种设计简单且性能良好的调度方法。目前的调度算法大多是基于任务复制的,没有充分考虑前驱任务与其后继任务间的相关性。该文提出了一种基于相关任务优化(DTO)的调度算法,通过分析已用处理机的负载和空闲时间,尽量减少系统的调度长度和处理机数目。算法分析结果表明,DTO算法在性能上优于其他算法,对嵌入式并行计算系统中的多任务调度是一个较好的选择。  相似文献   

20.
Hierarchical ring-based multiprocessor systems are attractive and enjoy several advantages over other type of systems. They ensure unique paths between nodes, simple node interfaces and simple cross-ring connections. Furthermore, employing point-to-point links allows the system to run at high clock rate which increases bandwidth and decreases latency. This paper investigates the performance of hierarchical ring-based shared-memory multiprocessors. Rings in the hierarchy are composed of point-to-point, unidirectional links and apply the Scalable Coherent Interface (SCI) protocol. We pay special emphasis on the impact of locality on processor and interconnection design issues such as number of outstanding requests, and ring topology. We find that in order to exploit the power of hierarchical multiprocessors an accurate and appropriate model of locality must be used. Hierarchical multiprocessors that are well balanced (uniform) tend to provide lower latency and higher system throughput. For non-uniform systems, high degree of locality is required for the hierarchies to perform well. However, restricting the number of outstanding transactions per processor is important in decreasing packets latency and avoiding network contention.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号