首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
This paper describes the design and implementation of an Efficient Architecture for Running THreads (EARTH) runtime system for a multi‐processor/multi‐node cluster. The (EARTH) model was designed to support the efficient execution of parallel (multi‐threaded) programs with irregular fine‐grain parallelism using off‐the‐shelf computers. Implementing an EARTH runtime system requires an explicitly threaded runtime system. For portability, we built this runtime system on top of Pthreads under Linux and used sockets for inter‐node communication. Moreover, in order to make the best use of the resources available on a cluster of symmetric multi‐processors (SMP), this implementation enables the overlapping of communication and computation. We used Threaded‐C, a language designed to implement the programming model supported by the EARTH architecture. This language allows the expression of various levels of parallelism and provides the primitives needed to manage the required communication and synchronization. The Threaded‐C programming language supports irregular fine‐grain parallelism through a two‐level hierarchy of threads and fibers. It also provides various synchronization and communication constructs that reflect the nature of EARTH's fibers—non‐preemptive execution with data‐driven scheduling—as well as the extensive use of split‐phase transactions on EARTH to execute long‐latency operations. Copyright © 2003 John Wiley & Sons, Ltd.  相似文献   

2.
It is widely believed that superscalar and superpipelined extensions of RISC style architecture will dominate future processor design, and that needs of parallel computing will have little effect on processor architecture. This belief ignores the issues of memory latency and synchronization, and fails to recognize the opportunity to support a general semantic model for parallel computing. Efforts to extend the shared-memory model using standard microprocessors have led to systems that implement no satisfactory model of computing, and present the programmer with a difficult interface on which to build parallel computing applications. A more satisfactory model for parallel computing may be obtained on the basis of functional programming concepts and the principles of modular software construction. We recommend that designs for computers be built on such a general semantic model of parallel computation. Multithreading concepts and dataflow principles can frame the architecture of these new machines.  相似文献   

3.
多核并行技术在分子动力学模拟中的应用   总被引:1,自引:0,他引:1  
为了充分利用多核处理器资源,研究了一种用于分子动力学模拟中的多核并行技术。在多核处理器上利用OpenMP技术实现多线程创建与同步、动态设置子线程的调度运行方式以及负载均衡以减少子线程执行等待时间。通过对不同分子体系结构下的动力学模型测试,得出在不同子线程下并行计算的时间,并且得到了良好的性能加速比。实验结果表明,采用OpenMP并行技术可有效地提高电荷求解过程在分子动力学模拟运算中的时间效率,以及多核计算机资源的利用率。  相似文献   

4.

Dynamic island models are population-based algorithms for solving optimization problems, where the individuals of the population are distributed on islands. These subpopulations of individuals are processed by search algorithms on each island. In order to share information within this distributed search process, the individuals migrate from their initial island to another destination island at regular steps. In dynamic island models, the migration process evolves during the search according to the observed performance on the different islands. The purpose of this dynamic/adaptive management of the migrations is to send the individuals to the most promising islands, with regards to their current states. Therefore, our approach is related to the adaptive management of search operators in evolutionary algorithms. In this work, our main purpose is thus to precisely analyze dynamic migration policies. We propose a testing process that assigns gains to the algorithms applied on the islands in order to assess the adaptive ability of the migration policies, with regards to various scenarios. Instead of having one dynamic migration policy that is applied to the whole search process, as it has already been studied, we propose to associate a migration policy to each individual, which allows us to combine simultaneously different migration policies.

  相似文献   

5.
Hierarchically organized ensembles of shared memory multiprocessors possess a richer and more complex model of locality than previous generation multicomputers with single processor nodes. These dual-tier computers introduce many new factors into the programmer's performance model. We present a methodology for implementing block-structured numerical applications on dual-tier computers and a run-time infrastructure, called KeLP2, that implements the methodology. KeLP2 supports two levels of locality and parallelism via hierarchical SPMD control flow, run-time geometric meta-data, and asynchronous collective communication. KeLP applications can effectively overlap communication with computation under conditions where nonblocking point-to-point message passing fails to do so. KeLP's abstractions hide considerable detail without sacrificing performance and dual-tier applications written in KeLP consistently outperform equivalent single-tier implementations written in MPI. We describe the KeLP2 model and show how it facilitates the implementation of five block-structured applications specially formulated to hide communication latency on dual-tiered architectures. We support our arguments with empirical data from applications running on various single- and dual-tier multicomputers. KeLP2 supports a migration path from single-tier to dual-tier platforms and we illustrate this capability with a detailed programming example  相似文献   

6.
Much research has focused on reducing and/or tolerating remote memory access latencies on distributed-memory parallel computers. Caching remote data is intended to reduce average access latency by handling as many remote memory accesses as possible using local copies of the data in the cache. Data-flow and multithreaded approaches help programs tolerate the latency of remote memory accesses by allowing processors to do other work while remote operations take place. The thread migration technique described here is a multithreaded architecture where threads migrate to remote processors that contain data they need. By exploiting access locality, the threads often use several data items from that processor before migrating to other processors for more data. Because the threads migrate in search of data, the approach is called Nomadic Threads. A prototype runtime system has been implemented on the CM5 and is portable to other distributed memory parallel computers.  相似文献   

7.
Multiprocessors in which a shared bus is used by the processor to communicate with common memory are an emerging class of machines where there is a need to support parallel programming languages. A language construct that is found in a number of parallel programming languages to support synchronization and communication in the interprocess rendezvous. Shared-bus multiprocessor require a protocol to keep the date in their caches coherent. There are two major categories of these protocols: invalidation and write-boadcast. This paper examines the requirements for cache coherence protocols to support efficient interprocessor rendezvous. The approach taken is to examine the memory referencing patterns to the run-time data structures during rendezvous execution. The appropriate coherence protocol is shown to be a function of the processor scheduling strategy used by the run-time system at synchronzation points during the rendezvous. When processes migrate freely as a result of the scheduling strategy, invalidation protocols are found to be more efficient. When migration is restricted by the scheduler, write-broadcast protocols are more efficient.  相似文献   

8.
The Cell processor is a heterogeneous multi-core processor with one power processing engine (PPE) core and eight synergistic processing engine (SPE) cores. There is a significant amount of ongoing research in programming models and tools that attempts to make it easy to exploit the computation power of the Cell architecture. In our work, we explore supporting OpenMP on the Cell processor. It is attractive to support OpenMP because programmers can continue using their familiar programming model, and existing code can be re-used. We base our work on IBM’s XL compiler, and developed new components in the XL compiler and a new runtime library. Three major issues are addressed: (1) synchronization support on heterogeneous cores; (2) code generation targeting the different instruction sets; (3) data transfers and implement the OpenMP memory model. We present experimental results for some SPEC OMP 2001 and NAS benchmarks to demonstrate the effectiveness of this approach. A visualization tool based on Paraver is also used to provide some insights into actual thread and synchronization behaviors.  相似文献   

9.
Hua Zhang  Joohan Lee  Ratan Guha 《Software》2008,38(10):1049-1071
Clusters, composed of symmetric multiprocessor (SMP) machines and heterogeneous machines, have become increasingly popular for high‐performance computing. Message‐passing libraries, such as message‐passing interface (MPI) and parallel virtual machine (PVM), are de facto parallel programming libraries for clusters that usually consist of homogeneous and uni‐processor machines. For SMP machines, MPI is combined with multithreading libraries like POSIX Thread and OpenMP to take advantage of the architecture. In addition to existing parallel programming libraries that are in C/C++ and FORTRAN programming languages, the Java programming language presents itself as another alternative with its object‐oriented framework, platform neutral byte code, and ever‐increasing performance. This paper presents a new parallel programming model and a library, VCluster, which implements this model. VCluster is based on migrating virtual threads instead of processes to support clusters of SMP machines more efficiently. The implementation uses thread migration, which can be used in dynamic load balancing. VCluster was developed in pure Java, utilizing the portability of Java to support clusters of heterogeneous machines. Several applications are developed to illustrate the use of this library and compare the usability and performance of VCluster with other approaches. Copyright © 2007 John Wiley & Sons, Ltd.  相似文献   

10.
Multiple-context processors provide register resources that allow rapid context switching between several threads as a means of tolerating long communication and synchronization latencies. When scheduling threads on such a processor, we must first decide which threads should have their state loaded into the multiple contexts, and second, which loaded thread is to execute instructions at any given time. In this paper we show that both decisions are important, and that incorrect choices can lead to serious performance degradation. We propose thread prioritization as a means of guiding both levels of scheduling. Each thread has a priority that can change dynamically, and that the scheduler uses to allocate as many computation resources as possible to critical threads. We briefly describe its implementation, and we show simulation performance results for a number of simple benchmarks in which synchronization performance is critical.  相似文献   

11.
面向层次化NoC的混合并行编程模型   总被引:1,自引:0,他引:1       下载免费PDF全文
曹祥  易伟  潘红兵  高明伦  李丽 《计算机工程》2010,36(13):278-280
为更好发挥多核处理器的硬件性能,针对层次化的片上网络架构,提出MPI/OpenMP混合并行编程模型。运用基于MPI的任务级并行模型实现片内簇间的高效通信,采用OpenMP模型实现簇内四核的通信、同步和数据交换。实验结果表明,与单一并行编程模型相比,混合并行编程模型加速比提高了20%~50%。  相似文献   

12.
In this paper we benchmark the performance of the Cray T3D, IBM 9076 SP/1 and Intel Paragon XP/S parallel computers, using implementations of parallel algorithms for the computation of the vector outer-product A = uvT operation. The vector outer-product operation, although very simple in nature, requires the computation of a large number of floating-point operations and its parallelization induces a great level of communication between the processors. It is thus suited to measure the relative speed of the processor, memory subsystem and network capabilities of a parallel computer. It should not be considered a ‘toy problem’, since it arises in numerical methods in the context of the solution of systems of non-linear equations – still a difficult problem to solve. We present algorithms for both the explicit shared-memory and message-passing programming models together with theoretical computation models for those algorithms. Actual experiments were run on those computers, using Fortran 77 implementations of the algorithms. The results obtained with these experiments show that due to the high degree of communication between the processors one needs a parallel computer with fast communications and carefully implemented data exchange routines. The theoretical computation model allows prediction of the speed-up to be obtained for some problem size on a given number of processors. © 1997 John Wiley & Sons, Ltd.  相似文献   

13.
Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters   总被引:2,自引:0,他引:2  
Nowadays, NVIDIA's CUDA is a general purpose scalable parallel programming model for writing highly parallel applications. It provides several key abstractions – a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a parallel programming approach using hybrid CUDA OpenMP, and MPI programming, which partition loop iterations according to the number of C1060 GPU nodes in a GPU cluster which consists of one C1060 and one S1070. Loop iterations assigned to one MPI process are processed in parallel by CUDA run by the processor cores in the same computational node.  相似文献   

14.
基于平衡负载、减小通信开销的考虑,对于非均衡负载节点并行机提出了两种并行遗传算法一动态负载平衡的孤岛模型和主从模型,并与基本的孤岛模型做了比较。两种算法在实际使用中均取得了较好的效果。  相似文献   

15.
That the influence of the PRAM model is ubiquitous in parallel algorithm design is as clear as the fact that it is technologically infeasible for the forseeable future. The current generation of parallel hardware prominently features distributed memory and high‐performance interconnection networks—very much the antithesis of the shared memory required for the PRAM model. It has been shown that, in spite of communication costs, for some problems very fast parallel algorithms are available for distributed‐memory machines—from embarassingly parallel problems to sorting and numerical analysis. In contrast it is known that for other classes of problem PRAM‐style shared‐memory simulation on a distributed‐memory machine can, in theory, produce solutions of comparable performance to the best possible for such architectures. The Bulk Synchronous Parallel (BSP) model accurately represents most parallel machines—theoretical and actual—in an execution and cost model. We introduce a scalable portable PRAM realization appropriate for BSP computers and a methodology for usage. Our system is fast and built upon the familiar sequential C++ coupled with the new standard BSP library of parallel computation and communication primitives. It is portable to and predictable on a vast number of parallel computers including workstation clusters, a 256‐processor Cray T3D, an 8‐node IBM SP/2 and a 4‐node shared‐memory SGI Power Challenge machine. Our approach achieves simplicity of programming over direct‐mode BSP programming for reasonable overhead cost. We objectively compare optimized BSP and PRAM algorithms implemented with our C++ PRAM library and provide encouraging experimental results for our new style of programming. Copyright © 2000 John Wiley & Sons, Ltd.  相似文献   

16.
更实际的并行计算模型   总被引:7,自引:0,他引:7  
过去所报导的大量并行算法在小规模的并行机上均运行得很好,然而将其移植到大规模并行机上运行时性能却很差。原因之一就是并行计算模型(如PRAM)过于抽象,略去了一些诸如通信、同步等算法运行时不可忽略的因素。本文介绍目前所提出的几个较能反映近代并行机性能的更为实际的并行计算模型,包括异步PRAM,BSP,logP和C3模型等。当然这些模型在与真实并行机吻合的程度、可使用性和分析较复杂算法时的可操作性等方面尚存异议,但是它们的确打开了研究并行计其模型的新途径,成为当今并行算法研究的热点之一。  相似文献   

17.
Recently, a series of parallel loop self-scheduling schemes have been proposed, especially for heterogeneous cluster systems. However, they employed the MPI programming model to construct the applications without considering whether the computing node is multicore architecture or not. As a result, every processor core has to communicate directly with the master node for requesting new tasks no matter the fact that the processor cores on the same node can communicate with each other through the underlying shared memory. To address the problem of higher communication overhead, in this paper we propose to adopt hybrid MPI and OpenMP programming model to design two-level parallel loop self-scheduling schemes. In the first level, each computing node runs an MPI process for inter-node communications. In the second level, each processor core runs an OpenMP thread to execute the iterations assigned for its resident node. Experimental results show that our method outperforms the previous works.  相似文献   

18.
§1.引 言 谱方法是上世纪七十年代发展起来的应用于大规模数值天气预报模式系统的数值方法.近年来谱方法已成为求解偏微分方程的重要数值方法之一[1,2],与经典的网格点方法相比,它具有计算精度高、稳定性好、程序简单而有效的突出特点.谱方法虽然有计算量和存储量均大的缺点,但超级并行计算技术的发展推动了谱方法的进一步发展和应用[3,4].近十年来,谱方法在数值天气预报领域的应用越来越广泛,不仅应用于全球数值天气预报模式而且应用于有限区域数值天气预报模式.  相似文献   

19.
郭绚  郭平  郑守淇 《计算机学报》1999,22(6):591-595
介绍了一基于PVM并行环境的并行遗传算法的C++类库ParaGA的设计和实现,ParaGA以使用方便和灵活为主要目标,提供了透明的并行机制,使不具有并行程序经验的用户可以方便地编写并行遗传算法的程序,高级用户也可通过类库提供的若干方法来获得的优化的可行性能,类库采用粗粒度模型,支持并行遗传算法的3种迁移模式及SPMD和Master/Slave两种编程模式,ParaGA也提供了实现负载平衡分与及利用  相似文献   

20.
Current trend of research on multithreading processors is toward the chip multithreading (CMT), which exploits thread level parallelism (TLP) and improves performance of softwares built on traditional threading components, e.g., Pthread. There exist commercially available processors that support simultaneous multithreading (SMT) on multicore processors. But they are basically based on the conventional sequential execution model, and execute multiple threads in parallel under the control of OS that handles interruptions. Moreover, there exist few languages or programming techniques to utilize the multicore processors effectively. We are taking another approach to develop a multithreading processor, which is dedicated to TLP. Our processor, named Fuce, is based on the continuation-based multithreading. A thread is defined as a block of sequentially ordered instructions which are executed without interruption. Every thread execution is triggered only by the event called continuation. This paper first introduces the continuation-based multithread execution model and its processor architecture then gives multithreaded programming techniques and the continuation-based multithreading language system CML. Last, the performance of the Fuce processor is evaluated by means of the clock-level software simulation.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号