1.
《Journal of Parallel and Distributed Computing》2000,60(8):998-1027
The main memory access latency can significantly slow down the overall performance of a computer system, because the average cycle time of the main memory is typically 5–10 times longer than that of a processor. To cope with this problem, in addition to the use of caches, the main memory of a multiprocessor architecture is usually organized into multiple modules or banks. Although such organization enhances memory bandwidth (the amount of data that the multiprocessor can retrieve in the same memory cycle), conflicts due to simultaneous attempts to access the same memory module may reduce the effective bandwidth. Therefore, efficient mapping schemes are required to distribute data in such a way that regular patterns, called templates, of various structures can be retrieved in parallel without memory conflicts. Prior work on data mappings mostly dealt with conflict-free access to templates such as rows, columns, or diagonals of (multidimensional) arrays, and only limited attention has been paid to access templates of nonnumeric structures such as trees. In this paper, we study optimal and balanced mappings for accessing path and subtree templates of trees, where a mapping is called optimal if it allows conflict-free access to templates with as few memory banks as possible. An optimal mapping is also called balanced if it distributes the nodes of the entire tree as evenly as possible among the available memory banks. In particular, based on Latin squares, we propose an optimal and balanced mapping for leaf-to-root paths of q-ary trees. Another (recursive) mapping for leaf-to-root paths of binary trees raises interesting combinatorial problems. We also derive an optimal and balanced mapping to access complete t-ary subtrees of complete q-ary trees, where 2⩽t⩽q, and an optimal mapping for subtrees of binomial trees.
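The paper's tree mappings are more involved, but the underlying Latin-square idea can be illustrated with the classic skewed layout for 2-D arrays (a sketch, not the paper's construction): map element (i, j) to bank (i + j) mod n. Each row and each column of the resulting Latin square contains every bank exactly once, so any full row or column template is conflict-free.

```python
# Illustrative Latin-square "skewed" bank mapping for an n x n array:
# element (i, j) lives in bank (i + j) mod n. Every row and every column
# template touches each of the n banks exactly once, so either template
# can be fetched in a single conflict-free memory cycle.

def bank(i: int, j: int, n: int) -> int:
    """Memory bank holding array element (i, j) under the skewed mapping."""
    return (i + j) % n

def banks_of_row(i: int, n: int) -> list[int]:
    return [bank(i, j, n) for j in range(n)]

def banks_of_column(j: int, n: int) -> list[int]:
    return [bank(i, j, n) for i in range(n)]

if __name__ == "__main__":
    n = 8
    for k in range(n):
        # each template hits all n banks exactly once: no conflicts
        assert sorted(banks_of_row(k, n)) == list(range(n))
        assert sorted(banks_of_column(k, n)) == list(range(n))
    print("rows and columns are conflict-free")
```

The same mapping fails for diagonals (the main diagonal lands entirely in one bank), which is why template-specific constructions such as the paper's tree mappings are needed.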
2.
Diffracting trees are an effective and highly scalable distributed-parallel technique for shared counting and load balancing. This paper presents the first steady-state combinatorial model and analysis for diffracting trees, and uses it to answer several critical algorithmic design questions. Our model is simple and sufficiently high level to overcome many implementation-specific details, yet, as we show, it is rich enough to predict empirically observed behaviors accurately. As a result of our analysis, we were able to identify starvation problems in the original diffracting tree algorithm and modify it to create a more stable version. We were also able to identify the range in which the diffracting tree performs most efficiently, and the ranges in which its performance degrades. We believe our model and modeling approach open the way to steady-state analysis of other distributed-parallel structures such as counting networks and elimination trees.
Received October 1996, and in final form July 1997.
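The counting-tree skeleton underlying a diffracting tree can be sketched as follows (the "prism" mechanism that pairs up, i.e. diffracts, concurrent tokens is omitted; this sequential toy only shows why a tree of toggle bits counts correctly). Each internal node routes successive tokens alternately left and right, and leaf i of a tree with k leaves hands out the arithmetic sequence i, i + k, i + 2k, ….

```python
# Sketch of a counting tree (the core of a diffracting tree, without the
# diffraction prisms). Each internal node holds a toggle bit that routes
# successive tokens alternately to its children; each leaf hands out a
# disjoint arithmetic sequence of counter values.

class Node:
    def __init__(self, depth: int, leaf_id: int, bit: int, k: int):
        self.toggle = 0
        if depth == 0:
            self.value = leaf_id          # next counter value at this leaf
            self.k = k                    # step between values of one leaf
            self.left = self.right = None
        else:
            self.left = Node(depth - 1, leaf_id, bit * 2, k)
            self.right = Node(depth - 1, leaf_id + bit, bit * 2, k)

    def fetch_and_increment(self) -> int:
        if self.left is None:
            v = self.value
            self.value += self.k          # leaves never hand out the same value
            return v
        child = self.left if self.toggle == 0 else self.right
        self.toggle ^= 1                  # route successive tokens alternately
        return child.fetch_and_increment()

if __name__ == "__main__":
    root = Node(depth=3, leaf_id=0, bit=1, k=8)   # 8 leaves
    # sequential tokens receive exactly the values 0, 1, 2, ...
    assert [root.fetch_and_increment() for _ in range(20)] == list(range(20))
```

In the real structure, prisms let pairs of concurrent tokens "diffract" off each other without touching the toggles, which is what removes the root bottleneck under high contention.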
3.
《Journal of Parallel and Distributed Computing》2004,64(4):449-460
A distributed counter is a concurrent object which provides a fetch-and-increment operation on a shared value. On the basis of a distributed counter, one can implement various fundamental data structures, such as queues or stacks. We present the counting pyramid, an efficient implementation of a distributed counter in a message passing system, which is based on software combining. The counting pyramid adapts gracefully to changing access patterns, guarantees linearizability, and offers more general fetch-and-Φ operations. We analyze the expected performance of the counting pyramid, using queueing theory and simulation. We show that the latency of the counting pyramid is asymptotically optimal.
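The interface this abstract describes can be sketched in a few lines (a lock stands in for the shared object; the counting pyramid's message-passing combining internals are not modeled here). fetch-and-Φ atomically returns the old value and replaces it with Φ applied to it; fetch-and-increment is the special case Φ(v) = v + 1, and a concurrent queue index is one classic client.

```python
import threading

# Sketch of the fetch-and-Phi interface a distributed counter exports.
# This toy uses a single lock; the point of the counting pyramid is to
# provide the same semantics scalably via software combining.

class SharedCounter:
    def __init__(self, value: int = 0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_phi(self, phi):
        """Atomically return the old value and apply phi to it."""
        with self._lock:
            old = self._value
            self._value = phi(old)
            return old

    def fetch_and_increment(self) -> int:
        return self.fetch_and_phi(lambda v: v + 1)

def worker(counter, claimed):
    # each enqueuer claims a distinct slot index, as in a shared queue
    for _ in range(1000):
        claimed.append(counter.fetch_and_increment())

if __name__ == "__main__":
    c = SharedCounter()
    claimed: list[int] = []
    threads = [threading.Thread(target=worker, args=(c, claimed)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # every value 0..3999 was handed out exactly once, despite concurrency
    assert sorted(claimed) == list(range(4000))
```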
5.
Phuong Hoai Ha, Marina Papatriantafilou, Philippas Tsigas 《Journal of Parallel and Distributed Computing》2007
Reactive diffracting trees are efficient distributed objects that support synchronization, by distributing sets of memory accesses to different memory banks in a coordinated manner. They adjust their size in order to retain their efficiency in the presence of different contention levels. Their adjustment is sensitive to parameters that have to be manually determined after experimentation. Since these parameters depend on the application as well as on the system configuration and load, determining their optimal values is difficult in practice. Moreover, the adjustments are done one level at a time, hence the cost of multi-level adjustments can be high.
6.
《Journal of Parallel and Distributed Computing》2006,66(11):1431-1441
We study conflict-free data distribution schemes in parallel memories in multiprocessor system architectures. Given a host graph G, the problem is to map the nodes of G into memory modules such that any instance of a template type T in G can be accessed without memory conflicts. A conflict occurs if two or more nodes of T are mapped to the same memory module. The mapping algorithm should: (i) be fast in terms of data access (possibly mapping each node in constant time); (ii) minimize the required number of memory modules for accessing any instance in G of the given template type; and (iii) guarantee load balancing on the modules. In this paper, we consider conflict-free access to star templates, i.e., to any node of G along with all of its neighbors. Such a template type arises in many classical algorithms like breadth-first search in a graph, message broadcasting in networks, and nearest neighbor based approximation in numerical computation. We consider the star-template access problem on two specific host graphs—tori and hypercubes—that are also popular interconnection network topologies. The proposed conflict-free mappings on these graphs are fast, use an optimal or provably good number of memory modules, and guarantee load balancing.
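For the torus, one classic conflict-free star mapping can be written down directly (an illustration in the spirit of the paper, not necessarily its exact construction; it assumes the side length n is a multiple of 5): assign node (i, j) to module (i + 2j) mod 5. A star is a node plus its four torus neighbors, whose module values are c, c±1, and c±2 mod 5, hence all five are distinct, and five modules is optimal since the template has five nodes.

```python
# Illustrative conflict-free star mapping on an n x n torus, n a multiple
# of 5: node (i, j) -> module (i + 2j) mod 5. The center gets value c and
# its four neighbors get c-1, c+1, c-2, c+2 (mod 5), all distinct.

def module(i: int, j: int) -> int:
    return (i + 2 * j) % 5

def star(i: int, j: int, n: int):
    """A node of the torus together with its four wraparound neighbors."""
    return [(i, j), ((i - 1) % n, j), ((i + 1) % n, j),
            (i, (j - 1) % n), (i, (j + 1) % n)]

if __name__ == "__main__":
    n = 10
    for i in range(n):
        for j in range(n):
            mods = {module(x, y) for (x, y) in star(i, j, n)}
            assert len(mods) == 5   # conflict-free, with the optimal 5 modules
    print("all stars conflict-free with 5 modules")
```

The divisibility assumption matters: wraparound keeps the arithmetic consistent only because n ≡ 0 (mod 5).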
9.
Abandah G.A., Davidson E.S. 《IEEE Transactions on Parallel and Distributed Systems》1998,9(2):206-216
In a distributed shared memory (DSM) multiprocessor, the processors cooperate in solving a parallel application by accessing the shared memory. The latency of a memory access depends on several factors, including the distance to the nearest valid data copy, data sharing conditions, and traffic of other processors. To provide a better understanding of DSM performance and to support application tuning and compiler development for DSM systems, this paper extends microbenchmarking techniques to characterize the important aspects of a DSM system. We present an experiment-based methodology for characterizing the memory, communication, scheduling, and synchronization performance, and apply it to the Convex SPP1000. We present carefully designed microbenchmarks to characterize the performance of the local and remote memory, producer-consumer communication involving two or more processors, and the effects on performance when multiple processors contend for utilization of the distributed memory and the interconnection network.
10.
Interconnection structures that can provide access to multiple levels of a shared memory hierarchy in a multiprocessor are investigated. The results are also applicable to distributed memory architectures in which localities of communication can be statically defined. All the structures presented conform in some fashion to the binary cube topology, with per-processor logic cost ranging from O(log N) to O(log²N). The results illustrate that without resorting to separate networks for access at each level, several architectures can provide fast access at lower levels in the hierarchy and progressively slower access at higher levels. Even at the highest communication level (corresponding to systemwide communication), messages encounter less delay than in a nonhierarchical access situation.
11.
Scalable, parallel computers: Alternatives, issues, and challenges
Gordon Bell 《International journal of parallel programming》1994,22(1):3-46
The 1990s will be the era of scalable computers. By giving up uniform memory access, computers can be built that scale over a range of several thousand. These provide high peak announced performance (PAP) by using powerful, distributed CMOS microprocessor-primary memory pairs interconnected by a high-performance switch (network). The parameters that determine these structures and their utility include: whether hardware (a multiprocessor) or software (a multicomputer) is used to maintain a distributed or shared virtual memory (DSM) environment; the power of computing nodes (these improve at 60% per year); the size and scalability of the switch; distributability (the ability to connect to geographically dispersed computers, including workstations); and all forms of software to exploit their inherent parallelism. To a great extent, viability is determined by a computer's generality: the ability to efficiently handle a range of work that requires varying processing (from serial to fully parallel), memory, and I/O resources. A taxonomy and evolutionary time line outlines the next decade of computer evolution, including distributed workstations, based on scalability and parallelism. Workstations can be the best scalables.
12.
Costas Busch Neophytos Demetriou Maurice Herlihy Marios Mavronicolas 《Theoretical computer science》2002,270(1-2):811-826
A threshold counter is a shared data structure that assumes integer values. It provides two operations: an increment operation changes the current counter value from v to v+1, while a read operation returns the value v/w, where v is the current counter value and w is a fixed constant. Thus, the read operation returns the "approximate" value of the counter to within the constant w. Threshold counters have many potential uses, including software barrier synchronization. Threshold networks are a class of distributed data structures that can be used to construct highly concurrent, low-contention implementations of shared threshold counters. In this paper, we give the first proof that any threshold network construction of a threshold counter can be extended to support a decrement operation that changes the counter value from v to v−1.
13.
Hamidzadeh B., Lau Ying Kit, Lilja D.J. 《IEEE Transactions on Parallel and Distributed Systems》2000,11(11):1151-1163
Algorithms for scheduling independent tasks onto the processors of a multiprocessor system must trade off processor load balance, memory locality, and scheduling overhead. Most existing algorithms, however, do not adequately balance these conflicting factors. This paper introduces the self-adjusting dynamic scheduling (SADS) class of algorithms that use a unified cost model to explicitly account for these factors at runtime. A dedicated processor performs scheduling in phases by maintaining a tree of partial schedules and incrementally assigning tasks to the least-cost schedule. A scheduling phase terminates whenever any processor becomes idle, at which time partial schedules are distributed to the processors. An extension of the basic SADS algorithm, called DBSADS, controls the scheduling overhead by giving higher priority to partial schedules with more task-to-processor assignments. These algorithms are compared to two distributed scheduling algorithms within a database application on an Intel Paragon distributed memory multiprocessor system.
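The "tree of partial schedules, grow the least-cost one" idea can be sketched as a best-first search (a simplification: here a schedule's cost is just its makespan, whereas the actual SADS cost model also weighs memory locality and overhead; the function name and cost model are illustrative, not the paper's).

```python
import heapq

# Best-first growth of partial schedules, in the spirit of a SADS
# scheduling phase. A partial schedule is (cost, tasks-assigned-so-far,
# per-processor loads); the cheapest one is repeatedly expanded by
# placing the next task on each processor.

def best_first_schedule(task_times, n_procs):
    """Return per-processor loads of a minimum-makespan schedule."""
    heap = [(0, 0, tuple([0] * n_procs))]
    while heap:
        cost, i, loads = heapq.heappop(heap)
        if i == len(task_times):
            return loads                       # cheapest complete schedule
        for p in range(n_procs):               # branch: task i on processor p
            new = list(loads)
            new[p] += task_times[i]
            heapq.heappush(heap, (max(new), i + 1, tuple(new)))

if __name__ == "__main__":
    loads = best_first_schedule([4, 3, 3, 2], n_procs=2)
    assert max(loads) == 6                     # balanced split: {4,2} vs {3,3}
```

Because the cost (makespan) never decreases as a schedule is extended, the first complete schedule popped is optimal under this cost model; DBSADS-style prioritization of longer partial schedules trades that optimality for lower scheduling overhead.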
14.
Ashley Saulsbury Tim Wilkinson John Carter Anders Landin 《Future Generation Computer Systems》1995,11(6):553-566
We present design details and some initial performance results of a novel scalable shared memory multiprocessor architecture. This architecture features the automatic data migration and replication capabilities of cache-only memory architecture (COMA) machines, without the accompanying hardware complexity. A software layer manages cache space allocation at page granularity, similarly to distributed virtual shared memory (DVSM) systems, leaving simpler hardware to maintain shared memory coherence at cache line granularity.
By reducing the hardware complexity, the machine cost and development time are reduced. We call the resulting hybrid hardware and software multiprocessor architecture Simple COMA. Preliminary results indicate that the performance of Simple COMA is comparable to that of more complex contemporary all-hardware designs.
15.
Clemencon C., Mukherjee B., Schwan K. 《IEEE Transactions on Software Engineering》1996,22(2):132-152
Any parallel program has abstractions that are shared by the program's multiple processes. Such shared abstractions can considerably affect the performance of parallel programs, on both distributed and shared memory multiprocessors. As a result, their implementation must be efficient, and such efficiency should be achieved without unduly compromising program portability and maintainability. The primary contribution of the DSA library is its representation of shared abstractions as objects that may be internally distributed across different nodes of a parallel machine. Such distributed shared abstractions (DSA) are encapsulated so that their implementations are easily changed while maintaining program portability across parallel architectures. The principal results presented are: a demonstration that the fragmentation of object state across different nodes of a multiprocessor machine can significantly improve program performance; and that such object fragmentation can be achieved without compromising portability by changing object interfaces. These results are demonstrated using implementations of the DSA library on several medium-scale multiprocessors, including the BBN Butterfly, Kendall Square Research, and SGI shared memory multiprocessors. The DSA library's evaluation uses synthetic workloads and a parallel implementation of a branch and bound algorithm for solving the traveling salesperson problem (TSP).
16.
《Journal of Parallel and Distributed Computing》1996,32(1):28-48
We present a new algorithm for implementing a concurrent B-tree on a multiprocessor. The algorithm replicates B-tree nodes on multiple processors (particularly nodes near the root of the tree) to eliminate bottlenecks caused by contention for a single copy of each node. In contrast to other replication or caching strategies that provide some form of coherence, the algorithm uses a novel replication strategy, called multi-version memory. Multi-version memory weakens the semantics of coherent caches by allowing readers to access "old versions" of data. As a result, readers can run in parallel with a writer. Using multi-version memory for the internal nodes of a B-tree reduces the synchronization requirements and delays associated with updating the internal nodes. Experiments comparing the B-tree algorithm based on multi-version memory to other algorithms based on coherent replication show that using multi-version memory enhances throughput significantly, even for small numbers of processors, and also allows throughput to scale with increasing numbers of processors long past the point where other algorithms saturate and start to thrash.
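The reader/writer decoupling that multi-version memory provides can be sketched with a single versioned cell (an illustration of the semantics only; the class and method names are hypothetical, and the real system applies this to replicated B-tree nodes across processors).

```python
import threading

# Sketch of the multi-version memory idea: a writer installs a new
# immutable version while readers may keep using the version they first
# observed, so readers never wait for, or block, the writer.

class MultiVersionCell:
    def __init__(self, value):
        self._current = value          # latest immutable version
        self._lock = threading.Lock()  # serializes writers only

    def read(self):
        return self._current           # readers take no lock

    def write(self, new_value):
        with self._lock:               # one writer at a time
            self._current = new_value  # old-version readers are unaffected

if __name__ == "__main__":
    cell = MultiVersionCell(("keys", [1, 5, 9]))
    snapshot = cell.read()                  # a reader pins an "old version"
    cell.write(("keys", [1, 5, 7, 9]))      # a writer installs a new version
    assert snapshot == ("keys", [1, 5, 9])          # reader still sees its copy
    assert cell.read() == ("keys", [1, 5, 7, 9])    # new readers see the update
```

The trade-off is exactly the one the abstract names: readers may briefly see stale internal nodes, which is acceptable for B-tree search because stale routing information can be corrected at the leaves.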
18.
To support a global virtual memory space, an architecture must translate virtual addresses dynamically. In current processors, the translation is done in a TLB (translation lookaside buffer), before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably, the latency and bandwidth demands on the TLB are difficult to meet, especially in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We describe and compare five options for virtual address translation in the context of distributed shared memory (DSM) multiprocessors, including CC-NUMAs (cache-coherent non-uniform memory access architectures) and COMAs (cache only memory access architectures). In CC-NUMAs, moving the TLB to shared memory is a bad idea because page placement, migration, and replication are all constrained by the virtual page address, which greatly affects processor node access locality. In the context of COMAs, the allocation of pages to processor nodes is not as critical because memory blocks can dynamically migrate and replicate freely among nodes. As the address translation is done deeper in the memory hierarchy, the frequency of translations drops because of the filtering effect. We also observe that the TLB is very effective when it is merged with the shared-memory, because of the sharing and prefetching effects and because there is no need to maintain TLB consistency. Even if the effectiveness of the TLB merged with the shared memory is very high, we also show that the TLB can be removed in a system with address translation done in memory because the frequency of translations is very low.
19.
Awerbuch B., Kutten S., Mansour Y., Patt-Shamir B., Varghese G. 《IEEE Transactions on Dependable and Secure Computing》2007,4(3):180-190
A synchronizer with a phase counter (sometimes called an asynchronous phase clock) is an asynchronous distributed algorithm, where each node maintains a local "pulse counter" that simulates the global clock in a synchronous network. In this paper, we present a time-optimal self-stabilizing scheme for such a synchronizer, assuming unbounded counters. We give a simple rule by which each node can compute its pulse number as a function of its neighbors' pulse numbers. We also show that some of the popular correction functions for phase clock synchronization are not self-stabilizing in asynchronous networks. Using our rule, the counters stabilize in time bounded by the diameter of the network, without invoking global operations. We argue that the use of unbounded counters can be justified by the availability of memory for counters that are large enough to be practically unbounded and by the existence of reset protocols that can be used to restart the counters in some rare cases where faults will make this necessary.
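To see how a neighbor-based pulse rule can stabilize in diameter time, here is a hypothetical illustration, not the paper's actual correction function (the paper proves that several popular rules fail in asynchronous networks; this one is simulated synchronously): each node sets its pulse to one plus the minimum over itself and its neighbors. After t rounds a node's pulse equals t plus the minimum initial pulse within distance t, so all counters agree after diameter-many rounds and then tick in unison.

```python
# Hypothetical min-based pulse rule on unbounded counters, simulated
# synchronously: pulse_u := 1 + min over u's closed neighborhood.
# After diameter-many rounds every pulse equals (rounds + global minimum).

def step(pulses, adj):
    """One synchronous round of the correction rule on all nodes."""
    return [1 + min([pulses[u]] + [pulses[v] for v in adj[u]])
            for u in range(len(pulses))]

if __name__ == "__main__":
    # a path of 4 nodes (diameter 3) with arbitrary corrupted pulses
    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    pulses = [9, 2, 7, 0]
    for _ in range(3):                         # diameter rounds
        pulses = step(pulses, adj)
    assert pulses == [3, 3, 3, 3]              # stabilized: 3 + min (= 0)
    assert step(pulses, adj) == [4, 4, 4, 4]   # then ticks synchronously
```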
20.
Zhang X., Qin X. 《IEEE Transactions on Software Engineering》1991,17(10):1059-1068
The efficiency of the basic operations of a NUMA (nonuniform memory access) multiprocessor determines the parallel processing performance on a NUMA multiprocessor. The authors present several analytical models for predicting and evaluating the overhead of interprocessor communication, process scheduling, process synchronization, and remote memory access, where network contention and memory contention are considered. Performance measurements to support the models and analyses through several numerical examples have been done on the BBN GP1000, a NUMA shared-memory multiprocessor. Analytical and experimental results give a comprehensive understanding of the various effects, which are important for the effective use of NUMA shared-memory multiprocessors. The results presented can be used to determine optimal strategies in developing an efficient programming environment for a NUMA system.
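A minimal example of the kind of contention analysis such models perform (a generic M/M/1 queueing sketch, not the paper's specific GP1000 models): with request arrival rate λ and service time s at a memory module, utilization is ρ = λs and the expected access latency grows as s / (1 − ρ), blowing up as the module saturates.

```python
# Generic M/M/1 illustration of memory-module contention: latency rises
# from the raw service time s toward infinity as utilization approaches 1.
# (Rates and service times here are arbitrary units, for illustration.)

def mm1_latency(lam: float, s: float) -> float:
    """Expected access latency at a module with arrival rate lam, service time s."""
    rho = lam * s                       # module utilization
    if rho >= 1.0:
        raise ValueError("memory module saturated (rho >= 1)")
    return s / (1.0 - rho)

if __name__ == "__main__":
    s = 1.0                                        # one memory cycle of service
    assert mm1_latency(0.0, s) == 1.0              # no contention: raw latency
    assert abs(mm1_latency(0.5, s) - 2.0) < 1e-12  # 50% utilization doubles it
    assert mm1_latency(0.9, s) > 9.9               # near saturation it blows up
```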