首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 218 毫秒
1.
In distributed shared memory multiprocessor systems, parallel tasks communicate through sharing memory data. As the system size increases, such communication cost becomes the main factor that limits the overall parallelism and performance. In this paper, we propose a new solution to the problem through judiciously managing the relevant resource, namely, the shared data and the interconnection network (IN) through which the sharing is carried out. In this approach, communication cost is minimized by means of data migration/allocation which is based on analyzing general layered task graphs, sharing behavior of parallel tasks, and network topology. Our method is not applicable for read only variables. Further, for the time being, the usefulness of the method is limited to multiprocessors where no cache coherence mechanism is implemented. Four typical interconnection topologies for multiprocessors are considered, namely, shared-bus, hierarchical-bus, 2-D mesh, and fat-tree structures. Efficient data allocation algorithms for each of the four network topologies are developed that make decision on data allocation/migration at the compile time. The complexity of one algorithm isO(np) for shared-bus andO(n2p) for the remaining three in a system withnprocessors executing ap-layer task graph for one shared variable. We have also given an algorithm to determine optimal allocation/migration scheme for multiple shared variables. However, the cost of the algorithm become prohibitive when the number of shared variables is high. Therefore, a heuristic of low complexity is suggested. The heuristic is optimal for some topologies.  相似文献   

2.
We give processor-allocation algorithms for grid architectures, where the objective is to select processors from a set of available processors to minimize the average number of communication hops. The associated clustering problem is as follows: Given n points in d , find a size-k subset with minimum average pairwise L 1 distance. We present a natural approximation algorithm and show that it is a -approximation for two-dimensional grids; in d dimensions, the approximation guarantee is , which is tight. We also give a polynomial-time approximation scheme (PTAS) for constant dimension d, and we report on experimental results.  相似文献   

3.
现代多核处理器结构的设计使得集成在同一块芯片上的多个执行核共享各种硬件资源,如片上最后一级Cache、内存控制器、前端总线以及硬件预取单元等,而多线程的并行执行导致核与核之间其享资源的争用,造成系统整体性能的下降,如何有效地解决多核共享资源冲突来提升系统的整体性能以及应用程序的服务质量成为当今研究的热点.文章首先概要介...  相似文献   

4.
Enhancing fault-tolerance in rate-monotonic scheduling   总被引:1,自引:0,他引:1  
In this paper, we address the problem of supporting timeliness and dependability at the level of task scheduling. We consider the problem of scheduling a set of tasks, each of which, for fault-tolerance purposes, has multiple versions, onto the minimum number of processors. On each individual processor, the tasks are guaranteed their deadlines by the Rate-Monotonic algorithm. A simple online allocation heuristic is proposed. It is proven thatN2.33N 0+, whereN is the number of processors required to feasibly schedule a set of tasks by the heuristic,N 0 is the minimum number of processors required to feasibly schedule the same set of tasks, and is the maximum redundancy degree a task can have. The bound is also shown to be a tight upper bound. The average-case performance of the heuristic is studied through simulation. It is shown that the heuristic performs surprisingly well on the average.  相似文献   

5.
An increasing interest is being shown in the use of multimicroprocessor systems in a variety of applications. It is often desirable that the constituent processors have a fast communications link for the sharing of data and/or results. Both of these aims can be achieved by the sharing of blocks of memory between the two or more processors; this method has the added advantage of little, or in some cases no, software overhead. The design of a shared memory system for Motorola MC6809E microprocessors is presented.  相似文献   

6.
We consider a simple restriction of the PRAM model (called PPRAM), where the input is arbitrarily partitioned between a fixed set of p processors and the shared memory is restricted to m cells. This model allows for investigation of the tradeoffs/ bottlenecks with respect to the communication bandwidth (modeled by the shared memory size m ) and the number of processors p . The model is quite simple and allows the design of optimal algorithms without losing the effect of communication bottlenecks. We have focused on the PPRAM complexity of problems that have $\tilde{O}$ (n) sequential solutions (where n is the input size), and where m ≤ p ≤ n . We show essentially tight time bounds (up to logarithmic factors) for several problems in this model such as summing, Boolean threshold, routing, integer sorting, list reversal and k -selection. We get typically two sorts of complexity behaviors for these problems: One type is $\tilde{O}$ (n/p + p/m) , which means that the time scales with the number of processors and with memory size (in appropriate ranges) but not with both. The other is $\tilde{O}$ (n/m) , which means that the running time does not scale with p and reflects a communication bottleneck (as long as m < p ). We are not aware of any problem whose complexity scales with both p and m (e.g. $O(n/\sqrt{m \cdot p})$ ). This might explain why in actual implementations one often fails to get p -scalability for p close to n .  相似文献   

7.
The general problem considered in the paper is partitioning of a matrix operation between processors of a parallel system in an optimum load-balanced way without potential memory contention. The considered parallel system is defined by several features the main of which is availability of a virtual shared memory divided into segments. If partitioning of a matrix operation causes parallel access to the same memory segment with writing data to the segment by at least one processor, then contention between processors arises which implies performance degradation. To eliminate such situation, a restriction is imposed on a class of possible partitionings, so that no two processors would write data to the same segment. On the resulting class of contention-free partitionings, a load-balanced optimum partitioning is defined as satisfying independent minimax criteria. The main result of the paper is an algorithm for finding the optimum partitioning by means of analytical solution of respective minimax problems. The paper also discusses implementation and performance issues related to the algorithm, on the basis of experience at Kendall Square Research Corporation, where the partitioning algorithm was used for creating high-performance parallel matrix libraries  相似文献   

8.
Conventional memory blocks have a single address input and a single, usually bidirectional, data output. Dual-port memories have two address inputs and two data ports. These memories have been designed to facilitate the exchange of data between CPUs within a multiprocessor system. Each microprocessor can access the multiport memory and therefore read the data of another processor or leave data for another processor. There are two problems in the design of multiport memory systems. The first, and more trivial, concerns the way in which each processor supplies an address to the memory and how it accesses the memory data bus. This is not a particularly complex problem and the designer biggest worry is how to design the interface with the least number of multiplexers and buffers. Whenever a processor wishes to access the multiport memory, it takes control of the address and data bus and then accesses the memory. A more fundamental design problem is posed when two or more processors try to access the memory nearly simultaneously. Memory contention is solved by the use of an arbitration circuit that arbitrates between the contending processors, grants access to only one processor and forces the others to wait. Fortunately, it is no longer necessary for all designers to construct their own dual-port memories from discrete components, since several manufacturers now put the memory, address and data multiplexers plus arbitration circuits on chip. IDT's application note shows how its dual-port memory operates and how it is used in multiprocessor systems.  相似文献   

9.
已知一个无向图G(V,E),|V|=n,|E|=m,本文基于SIMD共享存贮模型,运用数据在图中快速传播原理,建议了一个新的求图的连通分支算法,具体来讲,在SIMD—CREW共享存贮模型上,求图的连通分支需O(log2n)时间、O(n2/logn)处理器;而在SIMD—CRCW共享存贮模型上需O(logn)时间、O(n2)处理器,建议的算法同著名的Hirschberg算法相比,其主要差别表现在:1)采用的求解方法不同;2)建议的算法简单易懂  相似文献   

10.
We introduce a formalism which allows to treat computer architecture as a formal optimization problem. We apply this to the design of shared memory parallel machines. While present parallel computers of this type only support the programming model of a shared memory but often process simultaneous access by several processors to the shared memory sequentially, theoretical computer science offers solutions for this problem that are provably fast and asymptotically optimal. But the constants in these constructions seemed to be too large to let them be competitive. We modify these constructions under engineering aspects and improve the price/performance ratio by roughly a factor of 6. The resulting machine has surprisingly good price/performance ratio even if compared with distributed memory machines. For almost all access patterns of all processors into the shared memory, access is as fast as the access of only a single processor. Received: 29 June 1993 / 22 June 1999  相似文献   

11.
In this paper, we study an online scheduling problem with moldable parallel tasks on m processors. Each moldable task can be processed simultaneously on any number of processors of a parallel computer, and the processing time of a moldable task depends on the number of processors allotted to it. Tasks arrive one by one. Upon arrival of each task, the scheduler has to determine both the number of processors and the starting time for the task. Moreover, these decisions cannot be changed in the future. The objective is to attain a schedule such that the longest completion time over all tasks, i.e., the makespan, is minimized. First, we provide a general framework to show that any \(\rho \)-bounded algorithm for scheduling of rigid parallel tasks (the number of processors for a task is fixed a prior) can be extended to yield an algorithm for scheduling of moldable tasks with a competitive ratio of \(4\rho \) if the ratio \(\rho \) is known beforehand. As a consequence, we achieve the first constant competitive ratio, 26.65, for the moldable parallel tasks scheduling problem. Next, we provide an improved algorithm with a competitive ratio of at most 16.74.  相似文献   

12.
用于多种计算机系统和指令系统仿真的Virtutech Simics只提供一个简单的顺序扁平侦听式高速缓存一致性(Snoo-ping Cache Coherence Protocol)模型支持MESI协议,从而制约了可仿真的并行处理器个数。以下将基于目录的分布式高速缓存一致性协议(Distributed Directory-based Cache Coherence Protocol)模型应用于Simics中并给出基于Simics的分布式一致性协议的仿真结果。这一结果证实分布式协议能降低事件总数,减少网络中的事件。本文提出一个简单的基于目录的分布式高速缓存一致性协议,从而解决制约Simics的可扩放性问题。  相似文献   

13.
In this paper, we present a software tool, RTS (real time simulator), that analyses the time cost behaviour of parallel computations through simulation. It is assumed in RTS that the computer system which supports the executions of parallel computations has a limited number of processors all processors have the same speed and they communicate with each other through a shared memory. In RTS, the time cost of a parallel computation is defined as a function of the input, the algorithm, the data structure, the processor speed, the number of processors, the processor power allocation, the communication and the execution environment. How RTS models the time cost is first discussed in the paper. In the model, a locking technique is used to manipulate the access to the shared memory, processing power is equally allocated among all the operations that are currently being performed in parallel in the computer system, and the number of operations in the execution environment of a parallel computation changes from time to time. How RTS works and how the simulation is used to do time cost analysis are also discussed.  相似文献   

14.
In this paper, we propose a method about task scheduling and data assignment on heterogeneous hybrid memory multiprocessor systems for real‐time applications. In a heterogeneous hybrid memory multiprocessor system, an important problem is how to schedule real‐time application tasks to processors and assign data to hybrid memories. The hybrid memory consists of dynamic random access memory and solid state drives when considering the performance of solid state drives into the scheduling policy. To solve this problem, we propose two heuristic algorithms called improvement greedy algorithm and the data assignment according to the task scheduling algorithm, which generate a near‐optimal solution for real‐time applications in polynomial time. We evaluate the performance of our algorithms by comparing them with a greedy algorithm, which is commonly used to solve heterogeneous task scheduling problem. Based on our extensive simulation study, we observe that our algorithms exhibit excellent performance and demonstrate that considering data allocation in task scheduling is significant for saving energy. We conduct experiments on two heterogeneous multiprocessor systems. Copyright © 2016 John Wiley & Sons, Ltd.  相似文献   

15.
The latest trends in high performance computing systems show an increasing demand on the use of a large scale multicore system in an efficient way so that high compute‐intensive applications can be executed reasonably well. However, the exploitation of the degree of parallelism available at each multicore component can be limited by the poor utilization of the memory hierarchy. Actually, the multicore architecture introduces some distinct features that are already observed in shared memory and distributed environments. One example is that subsets of cores can share different subsets of memory. In order to achieve high performance, it is imperative that a careful allocation scheme of an application is carried out on the available cores, based on a scheduling specification that considers not only processors characteristics but also memory contention. This paper proposes a multicore cluster representation that captures relevant performance characteristics in multicores systems such as the influence of memory hierarchy and contention on application performance. Improved performance was achieved by a branch‐and‐bound application applied to the partitioning sets problem that incorporated a memory aware load balancing strategy based on the proposed multicore cluster representation. An in‐depth analysis on this application execution showed its applicability to modern systems. Copyright © 2014 John Wiley & Sons, Ltd.  相似文献   

16.
We revisit the problem of finding \(k\) paths with a minimum number of shared edges between two vertices of a graph. An edge is called shared if it is used in more than one of the \(k\) paths. We provide a \({\lfloor {k/2}\rfloor }\) -approximation algorithm for this problem, improving the best previous approximation factor of \(k-1\) . We also provide the first approximation algorithm for the problem with a sublinear approximation factor of \(O(n^{3/4})\) , where \(n\) is the number of vertices in the input graph. For sparse graphs, such as bounded-degree and planar graphs, we show that the approximation factor of our algorithm can be improved to \(O(\sqrt{n})\) . While the problem is NP-hard, and even hard to approximate to within an \(O(\log n)\) factor, we show that the problem is polynomially solvable when \(k\) is a constant. This settles an open problem posed by Omran et al. regarding the complexity of the problem for small values of \(k\) . We present most of our results in a more general form where each edge of the graph has a sharing cost and a sharing capacity, and there is a vulnerability parameter \(r\) that determines the number of times an edge can be used among different paths before it is counted as a shared/vulnerable edge.  相似文献   

17.
Sharing the memory in a parallel computer suggests that there is a possibility of many processors requesting a shared variable at virtually the same time. This idea has been formalized in the "hot spot" traffic model, where a fixed fraction of memory requests is for a single shared variable. Under the model, it has been shown that such concurrent requests to shared variable can create contention serious enough to stall large machines. As an effective method of alleviating this type of contention, "combining," in which several requests for the same variable can be combined into a single request, has been suggested. The NYU Ultracomputer and the IBM RP3 machine use "pairwise" combining, in which only two requests for the same variable can be combined at a switch of the processor-memory intereconnection. We study the effectiveness of combining. In particular, it turns out that pairwise combining is slightly too restrictive to handle hot spots if the machine size becomes very large. We suggest ways to overcome this weakness.  相似文献   

18.
The butterfly barrier   总被引:3,自引:0,他引:3  
We describe and algorithm for barrier synchronization that requires only read and write to shared store. The algorithm is faster than the traditionallocked counter approach for two processors and has an attractive log2 N time scaling for largerN. The algorithm is free of hot spots and critical regions and requires a shared memory bandwidth which grows linearly withN, the number of participating processors. We verify the technique using both a real shared memory multiprocessor, for numbers of processors up to 30, and a shared memory multiprocessor simulator, for number of processors up to 256.Work performed under the auspices of the U.S. Department of Energy by the Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48.  相似文献   

19.
We present a technique for analyzing the number of cache misses incurred by multithreaded cache oblivious algorithms on an idealized parallel machine in which each processor has a private cache. We specialize this technique to computations executed by the Cilk work-stealing scheduler on a machine with dag-consistent shared memory. We show that a multithreaded cache oblivious matrix multiplication incurs cache misses when executed by the Cilk scheduler on a machine with P processors, each with a cache of size Z, with high probability. This bound is tighter than previously published bounds. We also present a new multithreaded cache oblivious algorithm for 1D stencil computations incurring cache misses with high probability, one for Gaussian elimination and back substitution, and one for the length computation part of the longest common subsequence problem incurring cache misses with high probability. This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No. NBCH30390004.  相似文献   

20.
The Method of Layers   总被引:1,自引:0,他引:1  
Distributed applications are being developed that contain one or more layers of software servers. Software processes within such systems suffer contention delays both for shared hardware and at the software servers. The responsiveness of these systems is affected by the software design, the threading level and number of instances of software processes, and the allocation of processes to processors. The Method of Layers (MOL) is proposed to provide performance estimates for such systems. The MOL uses the mean value analysis (MVA) linearizer algorithm as a subprogram to assist in predicting model performance measures  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号