Similar Documents
20 similar documents found (search time: 406 ms)
1.
The main memory access latency can significantly slow down the overall performance of a computer system, because the average cycle time of the main memory is typically 5–10 times longer than that of a processor. To cope with this problem, in addition to the use of caches, the main memory of a multiprocessor architecture is usually organized into multiple modules or banks. Although such organization enhances memory bandwidth, the amount of data that the multiprocessor can retrieve in the same memory cycle, conflicts due to simultaneous attempts to access the same memory module may reduce the effective bandwidth. Therefore, efficient mapping schemes are required to distribute data in such a way that regular patterns, called templates, of various structures can be retrieved in parallel without memory conflicts. Prior work on data mappings mostly dealt with conflict-free access to templates such as rows, columns, or diagonals of (multidimensional) arrays, and only limited attention has been paid to access templates of nonnumeric structures such as trees. In this paper, we study optimal and balanced mappings for accessing path and subtree templates of trees, where a mapping is called optimal if it allows conflict-free access to templates with as few memory banks as possible. An optimal mapping is also called balanced if it distributes the nodes of the entire tree as evenly as possible among the available memory banks. In particular, based on Latin squares, we propose an optimal and balanced mapping for leaf-to-root paths of q-ary trees. Another (recursive) mapping for leaf-to-root paths of binary trees raises interesting combinatorial problems. We also derive an optimal and balanced mapping to access complete t-ary subtrees of complete q-ary trees, where 2 ⩽ t ⩽ q, and an optimal mapping for subtrees of binomial trees.
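The conflict-free path-access problem above can be made concrete with a small sketch. All names here are illustrative, and the mapping shown is a deliberately naive one (bank = tree level), not the paper's Latin-square construction; the naive mapping is conflict-free for leaf-to-root paths but badly unbalanced, which is exactly what the Latin-square scheme improves on.

```python
# Sketch: check that a candidate bank mapping gives conflict-free access
# to every leaf-to-root path of a complete q-ary tree (level-order array,
# node 0 is the root). Illustrative only, not the paper's construction.

def build_tree(q, height):
    """Return node count and parent links for a complete q-ary tree."""
    n = sum(q ** l for l in range(height + 1))
    parent = [None] + [(i - 1) // q for i in range(1, n)]
    return n, parent

def leaf_to_root_paths(q, height):
    n, parent = build_tree(q, height)
    first_leaf = sum(q ** l for l in range(height))  # index of first leaf
    for leaf in range(first_leaf, n):
        path, v = [], leaf
        while v is not None:
            path.append(v)
            v = parent[v]
        yield path

def level_mapping(q, height):
    """Naive conflict-free (but unbalanced) mapping: bank of v = level of v."""
    n, parent = build_tree(q, height)
    level = [0] * n
    for v in range(1, n):
        level[v] = level[parent[v]] + 1
    return level

def conflict_free(mapping, paths):
    # A path is conflict-free iff all of its nodes hit distinct banks.
    return all(len({mapping[v] for v in p}) == len(p) for p in paths)

q, h = 3, 3
mapping = level_mapping(q, h)
print(conflict_free(mapping, leaf_to_root_paths(q, h)))
```

Each leaf-to-root path visits one node per level, so the level mapping is trivially conflict-free with h+1 banks; it is unbalanced because the leaf level alone holds most of the tree's nodes in one bank.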

2.
Diffracting trees are an effective and highly scalable distributed-parallel technique for shared counting and load balancing. This paper presents the first steady-state combinatorial model and analysis for diffracting trees, and uses it to answer several critical algorithmic design questions. Our model is simple and sufficiently high level to overcome many implementation-specific details, and yet, as we show, it is rich enough to predict empirically observed behaviors accurately. As a result of our analysis we were able to identify starvation problems in the original diffracting tree algorithm and modify it to create a more stable version. We were also able to identify the range in which the diffracting tree performs most efficiently, and the ranges in which its performance degrades. We believe our model and modeling approach open the way to steady-state analysis of other distributed-parallel structures such as counting networks and elimination trees. Received October 1996, and in final form July 1997.
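The balancer tree underlying a diffracting tree can be sketched sequentially. This toy shows only the balancer mechanics and the resulting "step property" (each token receives a distinct counter value); the diffraction itself, where pairs of concurrent tokens cancel at a prism without touching the balancer's toggle bit, is omitted, as is all concurrency.

```python
# Sequential sketch of a balancer tree: each balancer forwards alternate
# tokens to its two children, spreading tokens evenly over leaf counters.
# (A real diffracting tree adds prisms where concurrent token pairs
# "diffract" past the toggle bit to reduce contention.)

class Balancer:
    def __init__(self):
        self.toggle = 0
    def traverse(self):
        out = self.toggle
        self.toggle ^= 1  # alternate left (0) / right (1)
        return out

def make_tree(depth):
    return [Balancer() for _ in range(2 ** depth - 1)]  # heap-ordered array

def fetch_and_increment(tree, counters, depth):
    i = 0
    for _ in range(depth):
        i = 2 * i + 1 + tree[i].traverse()  # descend to left/right child
    leaf = i - (2 ** depth - 1)
    value = counters[leaf]
    counters[leaf] += len(counters)  # leaf k hands out k, k + 2^d, k + 2*2^d, ...
    return value

depth = 2
tree, counters = make_tree(depth), list(range(2 ** depth))
values = [fetch_and_increment(tree, counters, depth) for _ in range(8)]
print(values)  # 8 tokens receive 8 distinct values covering 0..7
```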

3.
A distributed counter is a concurrent object which provides a fetch-and-increment operation on a shared value. On the basis of a distributed counter, one can implement various fundamental data structures, such as queues or stacks. We present the counting pyramid, an efficient implementation of a distributed counter in a message passing system, which is based on software combining. The counting pyramid adapts gracefully to changing access patterns, guarantees linearizability, and offers more general fetch-and-Φ operations. We analyze the expected performance of the counting pyramid, using queueing theory and simulation. We show that the latency of the counting pyramid is asymptotically optimal.
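The software-combining idea behind such counters can be shown in a few lines. This is an illustrative sketch only: one node batches the increments of several processors into a single fetch-and-add on the shared value, then hands each requester a distinct value from the reserved range. The counting pyramid organizes such combining nodes into a tree whose width adapts to the load; none of that structure (or the message passing) appears here.

```python
# Hedged sketch of software combining for a distributed counter:
# many logical fetch-and-increments become one access to the shared value.

class CombiningNode:
    def __init__(self):
        self.value = 0  # stands in for the central shared counter

    def combined_fetch_and_increment(self, requesters):
        base = self.value              # one shared access reserves a range...
        self.value += len(requesters)
        return {p: base + k for k, p in enumerate(requesters)}  # ...split among requesters

node = CombiningNode()
r1 = node.combined_fetch_and_increment(["p0", "p1", "p2"])
r2 = node.combined_fetch_and_increment(["p3"])
print(r1)  # {'p0': 0, 'p1': 1, 'p2': 2}
print(r2)  # {'p3': 3}
```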

4.
In large-scale parallel distributed shared-memory multiprocessor systems, minimizing remote access latency is key to improving overall system performance. This paper proposes a router cache structure and describes in detail a coherence protocol based on the router cache; the protocol is effective at reducing remote access latency and improving the system's effective bandwidth.

5.
Reactive diffracting trees are efficient distributed objects that support synchronization by distributing sets of memory accesses to different memory banks in a coordinated manner. They adjust their size in order to retain their efficiency under different contention levels. Their adjustment is sensitive to parameters that have to be determined manually, after experimentation. Since these parameters depend on the application as well as on the system configuration and load, determining their optimal values is difficult in practice. Moreover, adjustments are made one level at a time, so the cost of multi-level adjustments can be high.

6.
We study conflict-free data distribution schemes in parallel memories in multiprocessor system architectures. Given a host graph G, the problem is to map the nodes of G into memory modules such that any instance of a template type T in G can be accessed without memory conflicts. A conflict occurs if two or more nodes of T are mapped to the same memory module. The mapping algorithm should: (i) be fast in terms of data access (possibly mapping each node in constant time); (ii) minimize the required number of memory modules for accessing any instance in G of the given template type; and (iii) guarantee load balancing on the modules. In this paper, we consider conflict-free access to star templates, i.e., to any node of G along with all of its neighbors. Such a template type arises in many classical algorithms like breadth-first search in a graph, message broadcasting in networks, and nearest neighbor based approximation in numerical computation. We consider the star-template access problem on two specific host graphs—tori and hypercubes—that are also popular interconnection network topologies. The proposed conflict-free mappings on these graphs are fast, use an optimal or provably good number of memory modules, and guarantee load balancing.
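A concrete instance of star-template access on a hypercube can be sketched as follows. The mapping below XORs one distinct nonzero 2-bit vector per flipped dimension (a Hamming-code style construction); it is an illustrative example for d = 3, not necessarily the mapping proposed in the paper. A star on a d-cube has d + 1 nodes, so d + 1 = 4 modules is the minimum possible here.

```python
# Sketch: verify a candidate memory-module mapping is conflict-free for
# star templates on a d-dimensional hypercube, i.e. every node and its
# d neighbors land in pairwise distinct modules. Illustrative construction.

def star(v, d):
    return [v] + [v ^ (1 << i) for i in range(d)]  # node plus its d neighbors

def conflict_free_stars(mapping, d):
    return all(len({mapping(u) for u in star(v, d)}) == d + 1
               for v in range(2 ** d))

d = 3
cols = [1, 2, 3]  # distinct nonzero vectors of GF(2)^2, one per dimension

def mapping(v):
    # XOR the column of every set bit; flipping bit i changes the module
    # by cols[i] != 0, and distinct columns keep all neighbors distinct.
    m = 0
    for i in range(d):
        if v >> i & 1:
            m ^= cols[i]
    return m

ok = conflict_free_stars(mapping, d)
print(ok)  # True: 4 modules suffice for 8 nodes, matching the star size d+1
```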

7.
In multiprocessor systems, load balancing is an important means of improving parallel processing efficiency. Based on the distributed-memory TRANSCUBE multiprocessor environment, this paper proposes a distributed dynamic load-balancing algorithm. The algorithm adopts a receiver-initiated asynchronous scheduling policy, establishes contact between idle and heavily loaded processors through a "handshake" protocol, and automatically migrates tasks (or processes) from heavily loaded processors to idle ones. The algorithm is suited to parallel solution of applications with dynamic behavior, and performs well when the problem size is large and processor loads change slowly.

8.
9.
In a distributed shared memory (DSM) multiprocessor, the processors cooperate in solving a parallel application by accessing the shared memory. The latency of a memory access depends on several factors, including the distance to the nearest valid data copy, data sharing conditions, and traffic of other processors. To provide a better understanding of DSM performance and to support application tuning and compiler development for DSM systems, this paper extends microbenchmarking techniques to characterize the important aspects of a DSM system. We present an experiment-based methodology for characterizing the memory, communication, scheduling, and synchronization performance, and apply it to the Convex SPP1000. We present carefully designed microbenchmarks to characterize the performance of the local and remote memory, producer-consumer communication involving two or more processors, and the effects on performance when multiple processors contend for utilization of the distributed memory and the interconnection network.
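The classic building block of such memory microbenchmarks is pointer chasing: each load depends on the previous one, so the hardware cannot overlap accesses and the per-step time approximates the access latency. The sketch below is illustrative only; in Python the timing is dominated by interpreter overhead, and real latency microbenchmarks are written in C and vary the footprint to expose each level of the memory hierarchy (local versus remote in a DSM).

```python
# Hedged sketch of a pointer-chasing latency microbenchmark:
# a random cyclic permutation defeats prefetching, and the data
# dependence serializes the loads.

import random
import time

def make_chain(n):
    """Random cyclic permutation: nxt[i] tells where to jump from i."""
    order = list(range(n))
    random.shuffle(order)
    nxt = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        nxt[a] = b
    return nxt

def chase(nxt, steps):
    i, t0 = 0, time.perf_counter()
    for _ in range(steps):
        i = nxt[i]  # each load depends on the previous one
    return (time.perf_counter() - t0) / steps, i

per_access, _ = chase(make_chain(1 << 14), 100_000)
print(f"~{per_access * 1e9:.0f} ns per dependent access")
```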

10.
Interconnection structures that can provide access to multiple levels of a shared memory hierarchy in a multiprocessor are investigated. The results are also applicable to distributed memory architectures in which localities of communication can be statically defined. All the structures presented conform in some fashion to the binary cube topology, with per-processor logic cost ranging from O(log N) to O(log^2 N). The results illustrate that, without resorting to separate networks for access at each level, several architectures can provide fast access at lower levels in the hierarchy and progressively slower access at higher levels. Even at the highest communication level (corresponding to systemwide communication), messages encounter less delay than in a nonhierarchical access situation.

11.
Scalable, parallel computers: Alternatives, issues, and challenges
The 1990s will be the era of scalable computers. By giving up uniform memory access, computers can be built that scale over a range of several thousand. These provide high peak announced performance (PAP) by using powerful, distributed CMOS microprocessor–primary-memory pairs interconnected by a high-performance switch (network). The parameters that determine these structures and their utility include: whether hardware (a multiprocessor) or software (a multicomputer) is used to maintain a distributed shared virtual memory (DSM) environment; the power of the computing nodes (these improve at 60% per year); the size and scalability of the switch; distributability (the ability to connect to geographically dispersed computers, including workstations); and all forms of software to exploit their inherent parallelism. To a great extent, viability is determined by a computer's generality: the ability to efficiently handle a range of work that requires varying processing (from serial to fully parallel), memory, and I/O resources. A taxonomy and evolutionary time line outlines the next decade of computer evolution, including distributed workstations, based on scalability and parallelism. Workstations may prove to be the most scalable systems.

12.
A threshold counter is a shared data structure that assumes integer values. It provides two operations: an increment, which changes the current counter value from v to v+1, and a read, which returns ⌊v/w⌋, where v is the current counter value and w is a fixed constant. Thus, a read returns the "approximate" value of the counter to within the constant w. Threshold counters have many potential uses, including software barrier synchronization. Threshold networks are a class of distributed data structures that can be used to construct highly concurrent, low-contention implementations of shared threshold counters. In this paper, we give the first proof that any threshold network construction of a threshold counter can be extended to support a decrement operation that changes the counter value from v to v−1.
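The interface just described is easy to pin down sequentially. This toy only fixes the semantics (increment, decrement, and a read accurate to within w); the whole point of threshold networks is to provide these semantics with low contention among many processors, which a sequential sketch cannot show.

```python
# Sequential sketch of the threshold-counter interface: increment moves
# the value from v to v+1, decrement from v to v-1, and read returns the
# approximate value floor(v / w). Illustrative only.

class ThresholdCounter:
    def __init__(self, w):
        self.v, self.w = 0, w

    def increment(self):
        self.v += 1

    def decrement(self):
        self.v -= 1

    def read(self):
        return self.v // self.w  # value is exact only to within w

c = ThresholdCounter(w=4)
for _ in range(9):
    c.increment()
print(c.read())  # 9 // 4 = 2
c.decrement()
print(c.read())  # 8 // 4 = 2: a decrement may be invisible to readers
```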

13.
Algorithms for scheduling independent tasks on to the processors of a multiprocessor system must trade-off processor load balance, memory locality, and scheduling overhead. Most existing algorithms, however, do not adequately balance these conflicting factors. This paper introduces the self-adjusting dynamic scheduling (SADS) class of algorithms that use a unified cost model to explicitly account for these factors at runtime. A dedicated processor performs scheduling in phases by maintaining a tree of partial schedules and incrementally assigning tasks to the least-cost schedule. A scheduling phase terminates whenever any processor becomes idle, at which time partial schedules are distributed to the processors. An extension of the basic SADS algorithm, called DBSADS, controls the scheduling overhead by giving higher priority to partial schedules with more task-to-processor assignments. These algorithms are compared to two distributed scheduling algorithms within a database application on an Intel Paragon distributed memory multiprocessor system.
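The unified-cost idea can be sketched with a greedy stand-in: assign each task to the processor for which a combined cost grows least. The cost terms (runtime plus a flat locality penalty), the weights, and all names below are illustrative assumptions; SADS itself searches a tree of partial schedules rather than committing greedily.

```python
# Hedged sketch of cost-driven task assignment: each task goes to the
# processor minimizing load-after-assignment plus a locality penalty.
# Illustrative simplification of the SADS cost-model idea.

def schedule(tasks, n_procs, affinity, locality_penalty=2.0):
    """tasks: list of (task_id, runtime); affinity[task_id] = preferred proc."""
    load = [0.0] * n_procs
    assignment = {}
    for tid, runtime in tasks:
        # unified cost: resulting load, plus a penalty if we break locality
        p = min(range(n_procs),
                key=lambda p: load[p] + runtime
                              + (0 if affinity.get(tid) == p else locality_penalty))
        load[p] += runtime
        assignment[tid] = p
    return assignment

tasks = [("t0", 3), ("t1", 1), ("t2", 2), ("t3", 1)]
plan = schedule(tasks, n_procs=2, affinity={"t0": 0, "t1": 1})
print(plan)  # t0 and t1 stay on their preferred processors
```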

14.
We present design details and some initial performance results of a novel scalable shared memory multiprocessor architecture. This architecture features the automatic data migration and replication capabilities of cache-only memory architecture (COMA) machines, without the accompanying hardware complexity. A software layer manages cache space allocation at a page granularity, similarly to distributed virtual shared memory (DVSM) systems, leaving simpler hardware to maintain shared memory coherence at a cache-line granularity.

By reducing the hardware complexity, the machine cost and development time are reduced. We call the resulting hybrid hardware and software multiprocessor architecture Simple COMA. Preliminary results indicate that the performance of Simple COMA is comparable to that of more complex contemporary all-hardware designs.


15.
Any parallel program has abstractions that are shared by the program's multiple processes. Such shared abstractions can considerably affect the performance of parallel programs, on both distributed and shared memory multiprocessors. As a result, their implementation must be efficient, and such efficiency should be achieved without unduly compromising program portability and maintainability. The primary contribution of the DSA library is its representation of shared abstractions as objects that may be internally distributed across different nodes of a parallel machine. Such distributed shared abstractions (DSA) are encapsulated so that their implementations are easily changed while maintaining program portability across parallel architectures. The principal results presented are: a demonstration that the fragmentation of object state across different nodes of a multiprocessor machine can significantly improve program performance; and that such object fragmentation can be achieved without compromising portability by changing object interfaces. These results are demonstrated using implementations of the DSA library on several medium scale multiprocessors, including the BBN Butterfly, Kendall Square Research, and SGI shared memory multiprocessors. The DSA library's evaluation uses synthetic workloads and a parallel implementation of a branch and bound algorithm for solving the traveling salesperson problem (TSP).

16.
We present a new algorithm for implementing a concurrent B-tree on a multiprocessor. The algorithm replicates B-tree nodes on multiple processors (particularly nodes near the root of the tree) to eliminate bottlenecks caused by contention for a single copy of each node. In contrast to other replication or caching strategies that provide some form of coherence, the algorithm uses a novel replication strategy, called multi-version memory. Multi-version memory weakens the semantics of coherent caches by allowing readers to access "old versions" of data. As a result, readers can run in parallel with a writer. Using multi-version memory for the internal nodes of a B-tree reduces the synchronization requirements and delays associated with updating the internal nodes. Experiments comparing the B-tree algorithm based on multi-version memory to other algorithms based on coherent replication show that using multi-version memory enhances throughput significantly, even for small numbers of processors, and also allows throughput to scale with increasing numbers of processors long past the point where other algorithms saturate and start to thrash.
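The multi-version idea can be illustrated with a toy versioned cell: a writer installs a new version while a reader keeps using the version it started with, so readers never block behind the writer. This is a sketch of the concept only; the paper's mechanism for B-tree nodes, its synchronization, and version reclamation are all omitted, and the names here are assumptions.

```python
# Toy sketch of multi-version memory: writers append versions instead of
# overwriting, so an in-flight reader can keep using a stale version.

class MultiVersionCell:
    def __init__(self, value):
        self.versions = [value]

    def read(self):
        return len(self.versions) - 1   # a reader pins the latest version number

    def get(self, version):
        return self.versions[version]   # ...and may keep reading it, even if stale

    def write(self, value):
        self.versions.append(value)     # writers never overwrite in place

node = MultiVersionCell({"keys": [10, 20]})
v = node.read()                      # reader starts here
node.write({"keys": [10, 15, 20]})   # "concurrent" writer installs a new version
print(node.get(v))                   # the old reader still sees {'keys': [10, 20]}
print(node.get(node.read()))         # a new reader sees the updated node
```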

17.
18.
To support a global virtual memory space, an architecture must translate virtual addresses dynamically. In current processors, the translation is done in a TLB (translation lookaside buffer), before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably, the latency and bandwidth demands on the TLB are difficult to meet, especially in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We describe and compare five options for virtual address translation in the context of distributed shared memory (DSM) multiprocessors, including CC-NUMAs (cache-coherent non-uniform memory access architectures) and COMAs (cache only memory access architectures). In CC-NUMAs, moving the TLB to shared memory is a bad idea because page placement, migration, and replication are all constrained by the virtual page address, which greatly affects processor node access locality. In the context of COMAs, the allocation of pages to processor nodes is not as critical because memory blocks can dynamically migrate and replicate freely among nodes. As the address translation is done deeper in the memory hierarchy, the frequency of translations drops because of the filtering effect. We also observe that the TLB is very effective when it is merged with the shared-memory, because of the sharing and prefetching effects and because there is no need to maintain TLB consistency. Even if the effectiveness of the TLB merged with the shared memory is very high, we also show that the TLB can be removed in a system with address translation done in memory because the frequency of translations is very low.

19.
A synchronizer with a phase counter (sometimes called an asynchronous phase clock) is an asynchronous distributed algorithm, where each node maintains a local "pulse counter" that simulates the global clock in a synchronous network. In this paper, we present a time-optimal self-stabilizing scheme for such a synchronizer, assuming unbounded counters. We give a simple rule by which each node can compute its pulse number as a function of its neighbors' pulse numbers. We also show that some of the popular correction functions for phase clock synchronization are not self-stabilizing in asynchronous networks. Using our rule, the counters stabilize in time bounded by the diameter of the network, without invoking global operations. We argue that the use of unbounded counters can be justified by the availability of memory for counters that are large enough to be practically unbounded and by the existence of reset protocols that can be used to restart the counters in some rare cases where faults will make this necessary.
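The style of local correction rule can be sketched in a synchronous round simulation. The concrete rule below (set your pulse to the larger of its current value and 1 + the minimum neighbor pulse) is only a common toy candidate, not the paper's rule; indeed, the paper's point is that some popular correction functions fail to self-stabilize in truly asynchronous networks, which this synchronous toy cannot exhibit.

```python
# Toy sketch of a local phase-clock correction rule: starting from
# arbitrary (corrupted) pulse values, neighboring pulses converge to
# differ by at most 1, as a synchronizer requires. Illustrative rule only.

def stabilize(adj, pulse, rounds):
    for _ in range(rounds):
        pulse = [max(pulse[v], 1 + min(pulse[u] for u in adj[v]))
                 for v in range(len(adj))]
    return pulse

# A 4-cycle with arbitrary initial pulses (simulating transient faults).
adj = [[1, 3], [0, 2], [1, 3], [0, 2]]
result = stabilize(adj, [7, 0, 3, 5], rounds=4)
print(result)  # neighboring pulses now differ by at most 1
```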

20.
The efficiency of the basic operations of a NUMA (nonuniform memory access) multiprocessor determines its parallel processing performance. The authors present several analytical models for predicting and evaluating the overhead of interprocessor communication, process scheduling, process synchronization, and remote memory access, taking network contention and memory contention into account. Performance measurements supporting the models and analyses through several numerical examples were carried out on the BBN GP1000, a NUMA shared-memory multiprocessor. The analytical and experimental results give a comprehensive understanding of the various effects, which is important for the effective use of NUMA shared-memory multiprocessors. The results presented can be used to determine optimal strategies in developing an efficient programming environment for a NUMA system.
