首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
There are many parallel computations that are tree structured. The structure of a tree is usually unpredictable at compiler-time; the tree grows gradually during the course of a computation. The dynamic tree embedding problem is to distribute the processes of a parallel computation over processors in a parallel computer such that processors perform roughly the same amount of computation, and that communicating processes are assigned to processors that are close to each other. In this paper, we establish lower bounds for the performance ratio of dynamic tree embedding in bipartite static networks, including numerous important networks such asn-dimensional meshes,n-dimensional tori,k-aryn-cubes, cube-connected cycles, and butterflies. Our lower bounds are obtained from both tree and network properties and are applicable to a general class of dynamic tree embedding algorithms.  相似文献   

2.
构造了一类基于Euler-Richardson局部外插的并行算法,设计了使各处理机计算量分配更加平衡的方案,分析了方法的精度,稳定性,计算复杂性以及加速比和效率。数值试验结果表明方法是有效的,文中所构造的算法可用于大系统的数字仿真和科学计算。  相似文献   

3.
In this paper a parallel algorithm is given that, given a graph G=(V,E) , decides whether G is a series parallel graph, and, if so, builds a decomposition tree for G of series and parallel composition rules. The algorithm uses O(log \kern -1pt |E|log ^\ast \kern -1pt |E|) time and O(|E|) operations on an EREW PRAM, and O(log \kern -1pt |E|) time and O(|E|) operations on a CRCW PRAM. The results hold for undirected as well as for directed graphs. Algorithms with the same resource bounds are described for the recognition of graphs of treewidth two, and for constructing tree decompositions of treewidth two. Hence efficient parallel algorithms can be found for a large number of graph problems on series parallel graphs and graphs with treewidth two. These include many well-known problems like all problems that can be stated in monadic second-order logic. Received July 15, 1997; revised January 29, 1999, and June 23, 1999.  相似文献   

4.
We apply the methodology of competitive analysis of algorithms to the implementation of programs on parallel machines. We consider the problem of finding the best on-line distributed scheduling strategy that executes in parallel an unknown directed acyclic graph (dag) which represents the data dependency relation graph of a parallel program and which is revealed as execution proceeds. We study the competitive ratio of some important classes of dags assuming a fixed communication delay ratio τ that captures the average interprocessor communication measured in instruction cycles. We provide competitive algorithms for divide-and-conquer dags, trees, and general dags, when the number of processors depends on the size of the input dag and when the number of processors is fixed. Our major result is a lower bound Ω (τ / log τ ) of the competitive ratio for trees; it shows that it is impossible to design compilers that produce almost optimal execution code for all parallel programs. This fundamental result holds for almost any reasonable distributed memory parallel computation model, including the LogP and BSP model. Received March 5, 1996; revised March 11, 1997.  相似文献   

5.
A number of recent studies have revealed that the Optical Transpose Interconnection Systems (or OTIS) are promising candidates for future high-performance parallel computers. In this paper, we present and evaluate two general methods for algorithm development on the OTIS. The proposed methods are general in the sense that no specific factor network or problem domain is assumed. The proposed methods allow efficient mapping of a wide class of algorithms into the OTIS. These methods are based on grids and pipelines as popular structures that support a vast body of parallel applications including linear algebra, divide-and-conquer type of algorithms, sorting, and FFT computation. Timing models for measuring the performance of the proposed methods are also provided. Through these models, the performance of various algorithms on the OTIS are evaluated and compared with their counterparts on conventional electronic interconnection systems. This study confirms the viability of the OTIS as an attractive alternative for large-scale parallel architectures. Finally, we show how the proposed methods can be used to design parallel algorithms for linear algebra on the OTIS.  相似文献   

6.
Dehne  Dittrich  Hutchinson 《Algorithmica》2008,36(2):97-122
Abstract. External memory (EM) algorithms are designed for large-scale computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Typical EM algorithms are specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work on parallel algorithms to EM, but with limited success. The combination of EM computing, on multiple disks, with multiprocessor parallelism has been posted as a challenge by the ACM Working Group on Storage I/ O for Large-Scale Computing. In this paper we provide a simulation technique which produces efficient parallel EM algorithms from efficient BSP-like parallel algorithms. The techniques obtained can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. When applied to existing BSP-like algorithms, our simulation technique produces improved parallel EM algorithms for a large number of problems.  相似文献   

7.
This paper introduces a model for parallel computation, called thedistributed randomaccess machine (DRAM), in which the communication requirements of parallel algorithms can be evaluated. A DRAM is an abstraction of a parallel computer in which memory accesses are implemented by routing messages through a communication network. A DRAM explicitly models the congestion of messages across cuts of the network.We introduce the notion of aconservative algorithm as one whose communication requirements at each step can be bounded by the congestion of pointers of the input data structure across cuts of a DRAM. We give a simple lemma that shows how to shortcut pointers in a data structure so that remote processors can communicate without causing undue congestion. We giveO(lgn)-step, linear-processor, linear-space, conservative algorithms for a variety of problems onn-node trees, such as computing treewalk numberings, finding the separator of a tree, and evaluating all subexpressions in an expression tree. We giveO(lg2 n)-step, linear-processor, linear-space, conservative algorithms for problems on graphs of sizen, including finding a minimum-cost spanning forest, computing biconnected components, and constructing an Eulerian cycle. Most of these algorithms use as a subroutine a generalization of the prefix computation to trees. We show that any suchtreefix computation can be performed inO(lgn) steps using a conservative variant of Miller and Reif's tree-contraction technique.This research was supported in part by the Defense Advanced Research Projects Agency under Contract N00014-80-C-0622 and by the Office of Naval Research under Contract N00014-86-K-0593. Charles Leiserson is supported in part by an NSF Presidential Young Investigator Award with matching funds provided by AT&T Bell Laboratories and Xerox Corporation. Bruce Maggs is supported in part by an NSF Fellowship.  相似文献   

8.
Simulating Boolean Circuits on a DNA Computer   总被引:6,自引:0,他引:6  
M. Ogihara  A. Ray 《Algorithmica》1999,25(2-3):239-250
We demonstrate that DNA computers can simulate Boolean circuits with a small overhead. Boolean circuits embody the notion of massively parallel signal processing and are frequently encountered in many parallel algorithms. Many important problems such as sorting, integer arithmetic, and matrix multiplication are known to be computable by small size Boolean circuits much faster than by ordinary sequential digital computers. This paper shows that DNA chemistry allows one to simulate large semi-unbounded fan-in Boolean circuits with a logarithmic slowdown in computation time. Also, for the class NC 1 , the slowdown can be reduced to a constant. In this algorithm we have encoded the inputs, the Boolean AND gates, and the OR gates to DNA oligonucleotide sequences. We operate on the gates and the inputs by standard molecular techniques of sequence-specific annealing, ligation, separation by size, amplification, sequence-specific cleavage, and detection by size. Additional steps of amplification are not necessary for NC 1 circuits. The feasibility of the DNA algorithm has been successfully tested on a small circuit by actual biochemical experiments. Received May 29, 1997; revised February 15, 1998.  相似文献   

9.
In this paper we present deterministic parallel algorithms for the coarse-grained multicomputer (CGM) and bulk synchronous parallel (BSP) models for solving the following well-known graph problems: (1) list ranking, (2) Euler tour construction in a tree, (3) computing the connected components and spanning forest, (4) lowest common ancestor preprocessing, (5) tree contraction and expression tree evaluation, (6) computing an ear decomposition or open ear decomposition, and (7) 2-edge connectivity and biconnectivity (testing and component computation). The algorithms require O(log p) communication rounds with linear sequential work per round (p = no. processors, N = total input size). Each processor creates, during the entire algorithm, messages of total size O(log (p) (N/p)) . The algorithms assume that the local memory per processor (i.e., N/p ) is larger than p ε , for some fixed ε > 0 . Our results imply BSP algorithms with O(log p) supersteps, O(g log (p) (N/p)) communication time, and O(log (p) (N/p)) local computation time. It is important to observe that the number of communication rounds/ supersteps obtained in this paper is independent of the problem size, and grows only logarithmically with respect to p . With growing problem size, only the sizes of the messages grow but the total number of messages remains unchanged. Due to the considerable protocol overhead associated with each message transmission, this is an important property. The result for Problem (1) is a considerable improvement over those previously reported. The algorithms for Problems (2)—(7) are the first practically relevant parallel algorithms for these standard graph problems. Received July 5, 2000; revised April 16, 2001.  相似文献   

10.
Design and Implementation of a Practical Parallel Delaunay Algorithm   总被引:1,自引:0,他引:1  
This paper describes the design and implementation of a practical parallel algorithm for Delaunay triangulation that works well on general distributions. Although there have been many theoretical parallel algorithms for the problem, and some implementations based on bucketing that work well for uniform distributions, there has been little work on implementations for general distributions. We use the well known reduction of 2D Delaunay triangulation to find the 3D convex hull of points on a paraboloid. Based on this reduction we developed a variant of the Edelsbrunner and Shi 3D convex hull algorithm, specialized for the case when the point set lies on a paraboloid. This simplification reduces the work required by the algorithm (number of operations) from O(n log 2 n) to O(n log n) . The depth (parallel time) is O( log 3 n) on a CREW PRAM. The algorithm is simpler than previous O(n log n) work parallel algorithms leading to smaller constants. Initial experiments using a variety of distributions showed that our parallel algorithm was within a factor of 2 in work from the best sequential algorithm. Based on these promising results, the algorithm was implemented using C and an MPI-based toolkit. Compared with previous work, the resulting implementation achieves significantly better speedups over good sequential code, does not assume a uniform distribution of points, and is widely portable due to its use of MPI as a communication mechanism. Results are presented for the IBM SP2, Cray T3D, SGI Power Challenge, and DEC AlphaCluster. Received June 1, 1997; revised March 10, 1998.  相似文献   

11.
The range tree is a fundamental data structure for multidimensional point sets, and, as such, is central in a wide range of geometric and database applications. In this paper we describe the first nontrivial adaptation of range trees to the parallel distributed memory setting (BSP-like models). Given a set of n points in d -dimensional Cartesian space, we show how to construct on a coarse-grained multicomputer a distributed range tree T in time O( s / p + T c (s,p)) , where s = n log d-1 n is the size of the sequential data structure and T c (s,p) is the time to perform an h -relation with h=Θ (s/p) . We then show how T can be used to answer a given set Q of m=O(n) range queries in time O((s log m)/p + T c (s,p)) and O((s log m)/p + T c (s,p) + k/p) , where k is the number of results to be reported. These parallel construction and search algorithms are both highly efficient, in that their running times are the sequential time divided by the number of processors, plus a constant number of parallel communication rounds. Received June 1, 1997; revised March 10, 1998.  相似文献   

12.
The synchronization barrier is a point in the program where the processing elements (PEs) wait until all the PEs have arrived at this point. In a reduction computation, given a commutative and associative binary operationop, one needs to reduce valuesa 0,...,a N-1, stored in PEs 0,...,N-1 to a single valuea *=a 0 op a, op...op a N -1 and then to broadcast the resulta * to all PEs. This computation is often followed by a synchronization barrier. Routines to perform these functions are frequently required in parallel programs. Simple and efficient, workingC-language routines for the parallel barrier synchronization and reduction computations are presented. The codes are appropriate for a CREW (concurrent-read-exclusive-write) or EREW parallel random access shared memory MIMD computer. They require only shared memory read and write; no locks, semaphores etc. are needed. The running time of each of these routines isO(logN). The amount of shared memory required and the number of shared memory accesses generated are botO(N). These are the asymptotically minimum values for the three parameters. The algorithms employ the obvious computational scheme involving a binary tree. Examples of applications for these routines and results of performance testing on the Sequent Balance 21000 computer are presented.An abstract of this article appeared inProc. 1989 Int. Conf. Parallel Processing, p. II-175.  相似文献   

13.
《Parallel Computing》1997,23(3):271-289
Parallel mappings of Kohonen's self organizing map (SOM) and learning vector quantization (LVQ) algorithms are presented for a tree shape parallel computer system called TUTNC (Tampere University of Technology Neural Computer). The lattice of neurons in SOM is partitioned columnwise to parallel processors in a neuron parallel manner. In addition, an efficient method is presented for the neighborhood computation to make the computation time independent of SOM size and processor count. The tree shape architecture is shown to match well the requirements of mapped algorithms and their relations in such a prototype system TUTNC are studied. Performance has been measured for sample configurations and estimated for a larger system. Comparisons to other implementations on various platforms show, that good performance per processor has been achieved.  相似文献   

14.
This article presents a parallel self-verified solver for dense linear systems of equations. This kind of solver is commonly used in many different kinds of real applications which deal with large matrices. Nevertheless, two key problems appear to limit the use of linear system solvers to a more extensive range of real applications: solution correctness and high computational cost. In order to solve the first one, verified computing would be an interesting choice. An algorithm that uses this concept is able to find a highly accurate and automatically verified result providing more reliability. However, the performance of these algorithms quickly becomes a drawback. Aiming at a better performance, parallel computing techniques were employed. Two main parts of this method were parallelized: the computation of the approximate inverse of matrix A and the preconditioning step. The results obtained show that these optimizations increase significantly the overall performance.  相似文献   

15.
In this paper fast parallel Preconditioned Conjugate Gradient (PCG) algorithms for robot manipulator forward dynamics, or dynamic simulation, problem are presented. By exploiting the inherent structure of the forward dynamics problem, suitable preconditioners are devised to accelerate the iterations. Also, based on the choice of preconditioners, a modified dynamic formulation is used to speedup both serial and parallel computation of each iteration. The implementation of the parallel algorithms on two interconnected processor arrays is discussed and their computation and communication complexities are analyzed. The simulation results for a Puma Arm are presented to illustrate the effectiveness of the proposed preconditioners. With a faster convergence due to preconditioning and a faster computation of iterations due to parallelization, the developed parallel PCG algorithms represent the fastest alternative for parallel computation of the problem withO(n) processors.  相似文献   

16.
We present parallel algorithms to construct binary trees with almost optimal weighted path length. Specifically, assuming that weights are normalized (to sum up to one) and error refers to the (absolute) difference between the weighted path length of a given tree and the optimal tree with the same weights, we present anO (logn)-time andn(log lognl logn)-EREW-processor algorithm which constructs a tree with error less than 0.18, andO (k logn log* n)-time andn-CREW-processor algorithm which produces a tree with error at most l/n k , and anO (k 2 logn)-time andn 2-CREW-processor algorithm which produces a tree with error at most l/n k . As well, we describe two sequential algorithms, anO(kn)-time algorithm which produces a tree with error at most l/n k , and anO(kn)-time algorithm which produces a tree with error at most . The last two algorithms use different computation models.The first author's research was supported in part by NSERC Research Grant 3053. A part of this work was done while the second author was at the University of British Columbia.  相似文献   

17.
更实际的并行计算模型   总被引:7,自引:0,他引:7  
过去所报导的大量并行算法在小规模的并行机上均运行得很好,然而将其移植到大规模并行机上运行时性能却很差。原因之一就是并行计算模型(如PRAM)过于抽象,略去了一些诸如通信、同步等算法运行时不可忽略的因素。本文介绍目前所提出的几个较能反映近代并行机性能的更为实际的并行计算模型,包括异步PRAM,BSP,logP和C3模型等。当然这些模型在与真实并行机吻合的程度、可使用性和分析较复杂算法时的可操作性等方面尚存异议,但是它们的确打开了研究并行计其模型的新途径,成为当今并行算法研究的热点之一。  相似文献   

18.
The suffix tree is a key data structure for biological sequence analysis, since it permits efficient solutions to many string-based problems. Constructing large suffix trees is challenging because of high memory overheads and poor memory locality. Even though efficient suffix tree construction algorithms exist, their run-time is still very high for long DNA sequences such as whole human chromosomes. In this paper, we are using a hierarchical grid system as a computational platform in order to reduce this run-time significantly. To achieve an efficient mapping onto this type of architecture we introduce a parallel suffix tree construction algorithm that makes use of a new data structure called the common prefix suffix tree. Using this algorithm together with a dynamic load balancing strategy we show that our distributed grid implementation leads to significant run-time savings.  相似文献   

19.
Parallel Formulations of Decision-Tree Classification Algorithms   总被引:5,自引:0,他引:5  
Classification decision tree algorithms are used extensively for data mining in many domains such as retail target marketing, fraud detection, etc. Highly parallel algorithms for constructing classification decision trees are desirable for dealing with large data sets in reasonable amount of time. Algorithms for building classification decision trees have a natural concurrency, but are difficult to parallelize due to the inherent dynamic nature of the computation. In this paper, we present parallel formulations of classification decision tree learning algorithm based on induction. We describe two basic parallel formulations. One is based on Synchronous Tree Construction Approach and the other is based on Partitioned Tree Construction Approach. We discuss the advantages and disadvantages of using these methods and propose a hybrid method that employs the good features of these methods. We also provide the analysis of the cost of computation and communication of the proposed hybrid method. Moreover, experimental results on an IBM SP-2 demonstrate excellent speedups and scalability.  相似文献   

20.
Traditionally, the block-based medial axis transform (BB-MAT) and the chessboard distance transform (CDT) were usually viewed as two completely different image computation problems, especially for three dimensional (3D) space. In fact, there exist some equivalent properties between them. The relationship between both of them is first derived and proved in this paper. One of the significant properties is that CDT for 3D binary image V is equal to BB-MAT for image V' where it denotes the inverse image of V. In a parallel algorithm, a cost is defined as the product of the time complexity and the number of processors used. The main contribution of this work is to reduce the costs of 3D BB-MAT and 3D CDT problems proposed by Wang [65]. Based on the reverse-dominance technique which is redefined from dominance concept, we achieve the computation of the 3D CDT problem by implementing the 3D BB-MAT algorithm first. For a 3D binary image of size N3, our parallel algorithm can be run in O(logN) time using N3 processors on the concurrent read exclusive write (CREW) parallel random access machine (PRAM) model to solve both 3D BB-MAT and 3D CDT problems, respectively. The presented results for the cost are reduced in comparison with those of Wang's. To the best of our knowledge, this work is the lowest costs for the 3D BB-MAT and 3D CDT algorithms known. In parallel algorithms, the running time can be divided into computation time and communication time. The experimental results of the running, communication and computation times for the different problem sizes are implemented in an HP Superdome with SMP/CC-NUMA (symmetric multiprocessor/cache coherent non-uniform memory access) architecture. We conclude that the parallel computer (i.e., SMP/CC-NUMA architecture or cluster system) is more suitable for solving problems with a large amount of input size.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号