Similar Literature
20 similar documents found; search time: 46 ms
1.
This paper describes an efficient implementation and evaluation of a parallel eigensolver for computing all eigenvalues of dense symmetric matrices. Our eigensolver uses a Householder tridiagonalization method, which has higher parallelism and performance than conventional methods when the problem size is relatively small, e.g., of order 10,000. This is important for practical applications in which many such matrices must be diagonalized repeatedly. The routine was evaluated on 1024 processors of the HITACHI SR2201, giving speedup ratios of about 2-5 times over the ScaLAPACK library on the same machine.
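For orientation, here is a serial NumPy sketch of Householder tridiagonalization (an illustration of the kernel, not the authors' SR2201 routine); each reflector zeroes one column below the subdiagonal, and it is these reflector updates that a parallel solver distributes across processors.

```python
import numpy as np

def householder_tridiagonal(A):
    """Reduce a dense symmetric matrix to tridiagonal form with Householder reflectors.

    Serial reference version: reflector k zeroes column k below the first
    subdiagonal via the congruence A <- H A H with H = I - 2 v v^T.
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n - 2):
        x = A[k + 1:, k]
        nx = np.linalg.norm(x)
        if np.isclose(nx, abs(x[0])):      # column already in tridiagonal shape
            continue
        v = x.copy()
        v[0] += np.copysign(nx, x[0])      # sign choice avoids cancellation
        v /= np.linalg.norm(v)
        A[k + 1:, :] -= 2.0 * np.outer(v, v @ A[k + 1:, :])
        A[:, k + 1:] -= 2.0 * np.outer(A[:, k + 1:] @ v, v)
    return A

# Eigenvalues are preserved by the orthogonal congruence:
M = np.random.default_rng(1).normal(size=(6, 6))
S = M + M.T
T = householder_tridiagonal(S)
assert np.allclose(np.linalg.eigvalsh(T), np.linalg.eigvalsh(S))
```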

2.
《Parallel Computing》1990,15(1-3):133-145
This paper describes a parallel algorithm for the LU decomposition of band matrices using Gaussian elimination. The matrix dimension is n × n with 2r−1 diagonals. In the case 1 ≤ r ≤ 2p, an optimal number of processors is determined. In the case 2p ≤ r ≤ n, the number of processors p stated by Veldhorst is adopted (see [7]). For a band matrix with 2r−1 diagonals (1 ≤ r ≤ 2p), a task scheduling procedure is defined with the aim of obtaining maximal parallelism in system operation, i.e., good load balancing. The architecture of the system is of MIMD type. The connection between the processors is realised via a common bus; communication and synchronization are performed by message passing.
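For reference, a serial NumPy sketch of the band elimination kernel that such a schedule distributes (no pivoting; an illustrative toy, not the paper's scheduled MIMD version). With 2r−1 diagonals, step k touches only the r−1 rows and columns inside the band:

```python
import numpy as np

def band_lu(A, r):
    """In-place LU factorization (no pivoting) of a band matrix with 2r-1 diagonals.

    Returns L (unit lower triangular, stored below the diagonal) and U packed
    into one array. Step k updates only rows k+1 .. k+r-1, the window that a
    parallel schedule can distribute over processors.
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n - 1):
        end = min(k + r, n)          # last row/column inside the band
        A[k + 1:end, k] /= A[k, k]   # multipliers (column of L); assumes A[k,k] != 0
        A[k + 1:end, k + 1:end] -= np.outer(A[k + 1:end, k], A[k, k + 1:end])
    return A
```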

3.
The SB-PRAM is a shared-memory parallel computer that has been designed according to the PRAM model from theoretical computer science. The SB-PRAM realizes a concurrent-read, concurrent-write PRAM where each processor can access the global memory in unit time. This article describes the programming environment of the SB-PRAM that enables a programmer to develop efficient and portable programs without dealing with architectural details of the machine. In particular, we discuss compiler and operating system issues and show that the runtime functions of the P4 environment and several parallel data structures can be implemented very efficiently by using special features of the SB-PRAM. In contrast to other parallel machines, the synchronization of processors and the management of concurrent accesses to the global memory only require a few machine instructions independent of the number of processors participating in the operation. This efficient implementation of the runtime system is the basis for good performance of many challenging applications.

4.
The problem of finding an optimal product sequence for sequential multiplication of a chain of matrices (the matrix chain ordering problem, MCOP) is well known. We consider the problem of finding an optimal product schedule for evaluating a chain of matrix products on a parallel computer (the matrix chain scheduling problem, MCSP). The difference between MCSP and MCOP is that MCOP pertains to a product sequence for single-processor systems and MCSP pertains to a sequence of concurrent matrix products for parallel systems. The approach of parallelizing each matrix product after finding an optimal product sequence for single-processor systems does not always guarantee minimum evaluation time on parallel systems, since each parallelized matrix product may use processors inefficiently. We introduce a new processor scheduling algorithm for MCSP which reduces the evaluation time of a chain of matrix products on a parallel computer, even at the expense of a slight increase in the total number of operations. Given a chain of n matrices and a matrix product utilizing at most P/k processors in a P-processor system, the proposed algorithm approaches k(n−1)/(n + k log k − k) times the performance of parallel evaluation using the optimal sequence found for MCOP. Experiments performed on a Fujitsu AP1000 multicomputer also show that the proposed algorithm significantly decreases the time required to evaluate a chain of matrix products on parallel systems.
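For background, MCOP itself is solved by the classical O(n³) dynamic program over split points (a textbook sketch, not the MCSP scheduler proposed here):

```python
def matrix_chain_order(dims):
    """Minimum scalar multiplications for A1..An, where Ai is dims[i-1] x dims[i].

    cost[i][j] holds the optimal cost of computing Ai..Aj; split[i][j]
    records the best split point k, from which the sequence is recovered.
    """
    n = len(dims) - 1
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    split = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):              # subchain length
        for i in range(1, n - length + 2):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):               # try every split (Ai..Ak)(Ak+1..Aj)
                q = cost[i][k] + cost[k + 1][j] + dims[i - 1] * dims[k] * dims[j]
                if q < cost[i][j]:
                    cost[i][j], split[i][j] = q, k
    return cost[1][n], split

best, _ = matrix_chain_order([30, 35, 15, 5, 10, 20, 25])   # classic textbook chain
assert best == 15125
```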

5.
The usual argument against using a prime (or odd) number N of memory banks in a vector processor is that address computation for such a memory system would require systematic Euclidean division by N. We first show that the Chinese Remainder Theorem allows one to define a very simple mapping of data onto the memory banks for which address computation does not require any Euclidean division. Massively parallel SIMD computers may have thousands of processors. When the memory on such a machine is globally shared, routing vectors from memory to the processors is a major difficulty; the control for the interconnection network cannot generally be computed at execution time. When the number of memory banks and processors is a product of prime numbers, the family of permutations needed for routing vectors from memory to the processors through the interconnection network has very specific properties. The Chinese Remainder Network presented in the paper is able to execute all these permutations in a single pass and may be easily controlled.
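The core trick can be demonstrated in a few lines (a toy model of the idea; the paper's hardware address generation and network control are more involved). When the total size factors into coprime parts, the residue pair itself is a valid (bank, offset) address, so the quotient i // N is never formed:

```python
from math import gcd

def crt_address(i, n_banks, bank_size):
    """Map element i to (bank, local offset) using only residues.

    By the Chinese Remainder Theorem this map is a bijection on
    range(n_banks * bank_size) whenever gcd(n_banks, bank_size) == 1.
    """
    return i % n_banks, i % bank_size

n_banks, bank_size = 17, 1024                  # coprime by construction
assert gcd(n_banks, bank_size) == 1
addresses = {crt_address(i, n_banks, bank_size) for i in range(n_banks * bank_size)}
assert len(addresses) == n_banks * bank_size   # one-to-one: no two elements collide
```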

6.
This paper examines measures for evaluating the performance of algorithms for single instruction stream–multiple data stream (SIMD) machines. The SIMD mode of parallelism involves using a large number of processors synchronized together. All processors execute the same instruction at the same time; however, each processor operates on a different data item. The complexity of parallel algorithms is, in general, a function of the machine size (number of processors), problem size, and type of interconnection network used to provide communications among the processors. Measures which quantify the effect of changing the machine-size/problem-size/network-type relationships are therefore needed. A number of such measures are presented and are applied to an example SIMD algorithm from the image processing problem domain. The measures discussed and compared include execution time, speed, parallel efficiency, overhead ratio, processor utilization, redundancy, cost effectiveness, speed-up of the parallel algorithm over the corresponding serial algorithm, and an additive measure called "sprice" which assigns a weighted value to computations and processors.
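Several of these measures reduce to one-line formulas; the sketch below uses common textbook definitions (the paper's exact formulations, and the weighted "sprice" measure, may differ and are not reproduced):

```python
def measures(t_serial, t_parallel, n_procs, ops_parallel, ops_serial):
    """Standard SIMD performance measures from serial/parallel run data."""
    speedup = t_serial / t_parallel
    return {
        "speedup": speedup,
        "efficiency": speedup / n_procs,          # fraction of ideal linear speedup
        "overhead_ratio": (n_procs * t_parallel - t_serial) / t_serial,  # one common definition
        "redundancy": ops_parallel / ops_serial,  # extra operations introduced
        "cost": n_procs * t_parallel,             # processor-time product
    }

print(measures(t_serial=100.0, t_parallel=2.0, n_procs=64,
               ops_parallel=120_000, ops_serial=100_000))
```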

7.
Numerical experiments were conducted to find out the extent to which a Genetic Algorithm (GA) may benefit from a multiprocessor implementation, considering, on one hand, that analyses of individual designs in a population are independent of each other and may therefore be executed concurrently on separate processors, and, on the other hand, that some operations in a GA cannot be distributed this way. The algorithm experimented with was based on a Gaussian distribution rather than bit exchange in the GA reproductive mechanism, and the test case was a hub frame structure with up to 1080 design variables. The experiments, engaging up to 128 processors, confirmed expectations of radical elapsed-time reductions compared to a conventional single-processor implementation. They also demonstrated that the time spent in the non-distributable parts of the algorithm and the attendant cross-processor communication can have a very detrimental effect on the efficient utilization of the multiprocessor machine and on the number of processors that can be used effectively in a concurrent manner. Three techniques were devised and tested to mitigate that effect, raising efficiency to over 99 percent; of particular interest to the user, the corresponding elapsed-time compression factors approach 128 on 128 processors.
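A toy sketch of the scheme described: Gaussian perturbation in place of bit exchange, with the per-design fitness analyses being the concurrently executable part. Population size, selection rule, and the quadratic test function are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def gaussian_ga(fitness, dim, pop_size=64, sigma=0.1, n_gen=200, seed=0):
    """Minimize `fitness` with truncation selection and Gaussian 'reproduction'.

    The per-individual fitness calls are independent, so on a multiprocessor
    they can run concurrently; selection and recombination cannot.
    """
    rng = np.random.default_rng(seed)
    pop = rng.normal(size=(pop_size, dim))
    for _ in range(n_gen):
        scores = np.array([fitness(x) for x in pop])   # the distributable part
        parents = pop[np.argsort(scores)][: pop_size // 2]
        children = parents + rng.normal(0.0, sigma, parents.shape)
        pop = np.vstack([parents, children])           # serial bookkeeping
    return min(pop, key=fitness)

best = gaussian_ga(lambda x: float(np.sum(x ** 2)), dim=10)   # toy quadratic test
```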

8.
In this paper a modified parallel Jacobi-conditioned conjugate gradient (CG) method is proposed for solving linear elastic finite element systems of equations. The conventional element-by-element and diagonally conditioned approaches are discussed with respect to parallel implementation on distributed-memory MIMD architectures. The effects of communication overheads on the efficiency of the parallel CG solver are considered, and it is shown that for efficient performance of a parallel CG solver the interprocessor communication has to be carried out concurrently. A concurrent communication scheme is proposed by relating the semi-bandwidth of the stiffness matrix to the number of independent degrees of freedom and the number of processors, and by inducing directionalization of communication within the processor pipeline. With the aid of two examples the effectiveness of the proposed method is demonstrated, showing that the cost of communication remains low and relatively insensitive to an increase in the number of processors.
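The Jacobi-preconditioned CG iteration itself is compact; here is a serial sketch for a dense SPD system (the element-by-element storage and the paper's concurrent communication scheme are not modeled):

```python
import numpy as np

def jacobi_pcg(A, b, tol=1e-8, max_iter=1000):
    """Conjugate gradients with diagonal (Jacobi) preconditioning."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    m_inv = 1.0 / np.diag(A)          # the Jacobi preconditioner M = diag(A)
    z = m_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p                    # matrix-vector product: the step whose
        alpha = rz / (p @ Ap)         # communication dominates parallel cost
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = m_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

n = 50
A = np.diag(np.arange(2.0, n + 2.0)) + 0.1 * np.ones((n, n))   # SPD test matrix
x = jacobi_pcg(A, np.ones(n))
assert np.allclose(A @ x, np.ones(n), atol=1e-6)
```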

9.
Multiprocessor synchronization for concurrent loops
Wolfe, M. Software, IEEE, 1988, 5(1): 34-42
Execution of concurrent loops on multiprocessor computers often requires synchronizing the processors. Synchronization schemes suitable for automatic problem decomposition are surveyed. The model of a shared-memory multiprocessor is used, as is the concurrent-loop paradigm, in which a loop is compiled so that each processor is assigned a different loop iteration. The discussion covers data dependence, removing synchronization points, random synchronization, pipelining, barrier synchronization, and critical sections.

10.
We compare five implementations of the Jacobi method for diagonalizing a symmetric matrix. Two of these, the classical Jacobi and sequential sweep Jacobi, have been used on sequential processors. The third method, the parallel sweep Jacobi, has been proposed as the method of choice for parallel processors. The fourth and fifth methods are believed to be new. They are similar to the parallel sweep method but use different schemes for selecting the rotations.

The classical Jacobi method is known to take O(n⁴) time to diagonalize a matrix of order n. We find that the parallel sweep Jacobi run on one processor is about as fast as the sequential sweep Jacobi. Both of these methods take O(n³ log₂ n) time. One of our new methods also takes O(n³ log₂ n) time, but the other one takes only O(n³) time. The choice among the methods for parallel processors depends on the degree of parallelism possible in the hardware. The time required to diagonalize a matrix on a variety of architectures is modeled.

Unfortunately for proponents of the Jacobi method, we find that the sequential QR method is always faster than the Jacobi method. The QR method is faster even for matrices that are nearly diagonal. If we perform the reduction to tridiagonal form in parallel, the QR method will be faster even on highly parallel systems.
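One sequential Jacobi sweep, for reference (a didactic sketch using explicit rotation matrices; real codes update only the two affected rows and columns, and the parallel-sweep variants apply rotations on disjoint index pairs concurrently):

```python
import numpy as np

def jacobi_sweep(A):
    """One sequential sweep: annihilate each off-diagonal pair (p, q) in turn."""
    A = A.copy()
    n = A.shape[0]
    for p in range(n - 1):
        for q in range(p + 1, n):
            if np.isclose(A[p, q], 0.0):
                continue
            theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
            c, s = np.cos(theta), np.sin(theta)
            J = np.eye(n)
            J[p, p] = J[q, q] = c
            J[p, q], J[q, p] = s, -s
            A = J.T @ A @ J              # zeroes A[p, q] and A[q, p]
    return A

A0 = np.array([[4.0, 1.0, 0.5],
               [1.0, 3.0, 0.2],
               [0.5, 0.2, 2.0]])
A = A0
for _ in range(10):                      # repeated sweeps drive A to diagonal form
    A = jacobi_sweep(A)
assert np.allclose(np.sort(np.diag(A)), np.linalg.eigvalsh(A0))
```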


11.
Communication in a broadcast protocol multiprocessor (BPM) is inherently different from that in distributed systems formed by explicit links between processors. A message broadcast by a processor in a BPM is received directly by all other processors in the network instead of being restricted to only one processor. Broadcasting is an inexpensive way of communicating with a large number of processors on a BPM. In this paper I will describe a new approach to user-level distributed programming called broadcast programming, i.e., distributed programs written as cooperating broadcasting sequential processes (BSP). Existing concurrent programming languages do not provide facilities to exploit the broadcast capability of a BPM. The idea of distributed programs written as BSP is tailored to exploiting a BPM architecture but is not restricted to such an architecture; however, implementation of the broadcast capability may not be as efficient on other architectures. I will illustrate the utility and convenience of broadcast programming with many examples. These examples will also be used to explore the suitability and advantages of BSP and to determine appropriate facilities for BSP.

12.
Nested dissection is a very popular direct method for solving sparse linear systems that arise from finite difference and finite element methods. Worley and Schreiber [16] give a fine-grain algorithm for a square array of processors. Their algorithm uses O(N²) processors, each with O(N) memory, to factor an N² by N² sparse matrix whose graph is an N × N mesh. The efficiency of their method is between 1/46 and 1/12. George et al. [6] [8] give a medium-grain algorithm for hypercube architectures, while George et al. [7] give an algorithm for shared memory machines. These papers present a column-oriented approach which can exploit O(N) parallelism and yield efficiencies up to 50%. Lucas [11] also gives a column-oriented scheme which achieves up to 75% efficiency and O(N) parallelism. In this paper, we present a medium- to fine-grain algorithm for a P × P array of processors with local memory. This algorithm can exploit up to O(N²) parallelism. The efficiency of the fine-grain version is comparable to [16], while as a medium-grain algorithm it achieves about 49% efficiency. The strength of the method is due to three factors: its ability to pipeline much of the computation, overlapping computation and communication, and the use of level 3 BLAS-like primitives. In addition to its high efficiency, its memory requirement is optimal: only O(N² log N / P²) words of memory are needed per processor.
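The ordering itself is easy to sketch (recursive separator bisection of the mesh; the paper's contribution is the pipelined parallel factorization, not this ordering):

```python
def nd_order(rows, cols):
    """Nested-dissection elimination order for a grid: number both halves
    first, then the separator, recursing on the longer side."""
    if len(rows) * len(cols) <= 4:                 # small block: stop recursing
        return [(r, c) for r in rows for c in cols]
    if len(cols) >= len(rows):
        m = len(cols) // 2
        halves = nd_order(rows, cols[:m]) + nd_order(rows, cols[m + 1:])
        separator = [(r, cols[m]) for r in rows]   # middle column separates halves
    else:
        m = len(rows) // 2
        halves = nd_order(rows[:m], cols) + nd_order(rows[m + 1:], cols)
        separator = [(rows[m], c) for c in cols]
    return halves + separator

order = nd_order(list(range(7)), list(range(7)))   # elimination order, 7 x 7 mesh
assert len(order) == 49                            # every mesh point numbered once
```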

13.
Parallelizing the Data Cube
This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce interprocessor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks. This allows for sharing of prefixes and sort orders between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting.

The bottom-up partitioning strategy balances the number of single-attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm-specific cost measures like estimated group-by sizes. Both partitioning approaches can be implemented on any shared-disk type parallel machine composed of p processors connected via an interconnection fabric and with access to a shared parallel disk array.

We have implemented our parallel top-down data cube construction method in C++ with the MPI message passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight-processor cluster, using a variety of different data sets with a range of sizes, dimensions, density, and skew. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close to optimal load balance between processors. The actual run times observed show an optimal speedup of p.
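For concreteness, the unit of work being partitioned is a single group-by, one node of the 2^d cube lattice. A tiny in-memory sketch (illustrative only; real top-down and bottom-up cube algorithms share sorts and prefixes instead of rescanning per group-by):

```python
from itertools import combinations
from collections import defaultdict

def full_cube(rows, n_dims):
    """All 2^d group-bys of `rows`, where each row is (dim_0, ..., dim_{d-1}, measure).

    Each group-by is an independent task; the partitioning strategies assign
    whole subcubes of this lattice to processors to balance load.
    """
    cube = {}
    for k in range(n_dims + 1):
        for dims in combinations(range(n_dims), k):
            agg = defaultdict(float)
            for row in rows:
                agg[tuple(row[d] for d in dims)] += row[-1]
            cube[dims] = dict(agg)
    return cube

sales = [("east", "tv", 10.0), ("east", "radio", 5.0), ("west", "tv", 7.0)]
cube = full_cube(sales, n_dims=2)
assert cube[()][()] == 22.0                      # grand total group-by
assert cube[(0,)][("east",)] == 15.0             # group-by on the first dimension
```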

14.
Parallel clustering algorithms
Clustering techniques play an important role in exploratory pattern analysis, unsupervised learning and image segmentation applications. Many clustering algorithms, both partitional and hierarchical, require intensive computation, even for a modest number of patterns. This paper presents two parallel clustering algorithms. For a clustering problem with N = 2ⁿ patterns and M = 2ᵐ features, the time complexity of the traditional partitional clustering algorithm on a single-processor computer is O(MNK), where K is the number of clusters. The proposed algorithm on an SIMD computer with MN processors has a time complexity of O(K(n + m)). The time complexity of the proposed single-link hierarchical clustering algorithm is reduced from O(MN²) for the uniprocessor algorithm to O(nN) with MN processors.
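The O(MNK) term is the distance kernel of one partitional pass; a vectorized sketch of that step (the SIMD mapping onto MN processors is the paper's contribution and is not modeled):

```python
import numpy as np

def assign_to_clusters(X, centers):
    """Nearest-center assignment: N x K squared distances over M features,
    i.e. the O(M*N*K) kernel of one partitional-clustering pass."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))        # N patterns with M features
centers = rng.normal(size=(4, 8))     # K cluster centers
labels = assign_to_clusters(X, centers)
assert labels.shape == (1024,)
```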

15.
In many scientific applications, array redistribution is required to enhance data locality and reduce remote memory access in parallel programs on distributed-memory multicomputers. Since the redistribution is performed at runtime, there is a performance trade-off between the efficiency of the new data decomposition for a subsequent phase of an algorithm and the cost of redistributing data among processors. In this paper, we present a generalized processor mapping technique to minimize the amount of data exchange for BLOCK-CYCLIC(kr) to BLOCK-CYCLIC(r) array redistribution and vice versa. The main idea of the generalized processor mapping technique is first to develop mapping functions for computing a new rank for each destination processor. Based on the mapping functions, a new logical sequence of destination processors can be derived; this new logical processor sequence is then used to minimize the amount of data exchange in a redistribution. The technique can handle array redistribution with arbitrary source and destination processor sets and can be applied to multidimensional array redistribution. We present a theoretical model to analyze the performance improvement of the generalized processor mapping technique. To evaluate the proposed technique, we have implemented it on an IBM SP2 parallel machine. The experimental results show that the generalized processor mapping technique provides performance improvement over a wide range of redistribution problems.
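The owner of a global index under BLOCK-CYCLIC(b) is a one-line formula, which makes it easy to see how much data a redistribution moves. A minimal sketch (the paper's mapping functions, which permute destination ranks so that more data stays in place, are not reproduced):

```python
def owner(i, block, n_procs):
    """Rank owning global index i under a BLOCK-CYCLIC(block) distribution."""
    return (i // block) % n_procs

# BLOCK-CYCLIC(k*r) -> BLOCK-CYCLIC(r) with k = 4, r = 3, on 4 processors:
n, k, r, P = 240, 4, 3, 4
stays = sum(owner(i, k * r, P) == owner(i, r, P) for i in range(n))
print(f"{stays}/{n} elements are already on their destination processor")
```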

16.
This paper describes an efficient algorithm for the parallel solution of systems of linear equations with a block tridiagonal coefficient matrix. The algorithm comprises a multilevel LU-factorization based on block cyclic reduction and a corresponding solution algorithm.

The paper includes a general presentation of the parallel multilevel LU-factorization and solution algorithms, but the main emphasis is on implementation principles for a message-passing computer with hypercube topology. Problem partitioning, processor allocation and communication requirements are discussed for the general block tridiagonal algorithm.

Band matrices can be cast into block tridiagonal form, and this special but important problem is dealt with in detail. It is demonstrated how the efficiency of the general block tridiagonal multilevel algorithm can be improved by introducing the equivalent of two-way Gaussian elimination for the first and the last partition and by carefully balancing the load of the processors. The presentation of the multilevel band solver is accompanied by detailed complexity analyses.

The properties of the parallel band solver were evaluated by implementing the algorithm on an Intel iPSC hypercube parallel computer and solving a large number of banded systems of linear equations using 2 to 32 processors. The results of the evaluation include speed-up over a sequential processor, and the measured values are in good agreement with the theoretical values from the complexity analysis. It is found that the maximum asymptotic speed-up of the multilevel LU-factorization using p processors and load balancing is approximated well by the expression (p + 6)/4.

Finally, the multilevel parallel solver is compared with solvers based on row- and column-interleaved organization.
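The block cyclic reduction underlying the factorization can be illustrated at scalar level: one reduction level eliminates the odd unknowns, and every even row updates independently. A minimal sketch (scalar rather than block; boundary rows and the back-substitution are omitted):

```python
import numpy as np

def cr_level(a, b, c, d):
    """One cyclic-reduction level for a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i].

    Eliminating the odd unknowns from the interior even rows yields a
    half-size tridiagonal system; all even rows update independently,
    which is the parallelism the multilevel factorization exploits.
    """
    ev = np.arange(2, len(b) - 1, 2)            # interior even rows
    alpha = a[ev] / b[ev - 1]                   # couple row i to row i-1
    beta = c[ev] / b[ev + 1]                    # couple row i to row i+1
    return (-alpha * a[ev - 1],                              # new sub-diagonal
            b[ev] - alpha * c[ev - 1] - beta * a[ev + 1],    # new diagonal
            -beta * c[ev + 1],                               # new super-diagonal
            d[ev] - alpha * d[ev - 1] - beta * d[ev + 1])    # new right-hand side
```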


17.
The paper presents parallel algorithms for solving the Poisson equation at N² mesh points. The methods, based on marching techniques, are structured for efficient parallel realization. Using orthogonal decomposition properties of the arising matrices, the algorithms can be formulated in terms of transformed vectors. On a MIMD computer with at most N processors, the computations can be performed in horizontal slices with minimal synchronization requirements. On an SIMD machine with N² processors, the complexity bound O(log N) is achieved, whereby a single marching requires only 10 log N steps.

18.
Consider a message-passing system of n processors, in which each processor initially holds one piece of data. The goal is to compute an associative and commutative reduction function on the n pieces of data and to make the result known to all n processors. This operation is frequently used in many message-passing systems and is typically referred to as global combine, census computation, or gossiping. This paper explores the problem of global combine in the multiport postal model. This model is characterized by three parameters: the number of processors n, the number of ports per processor k, and the communication latency λ. In this model, in every round r, each processor can send k distinct messages to k other processors, and it can receive k messages that were sent from k other processors λ−1 rounds earlier. This paper provides an optimal algorithm for the global combine problem that requires the least number of communication rounds and minimizes the time spent by any processor in sending and receiving messages.
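For intuition about the round count: with unit latency (λ = 1), the set of processors a given input has reached can grow by at most a factor of k + 1 per round. A back-of-envelope sketch of that elementary lower bound (an illustration only, not the paper's optimal algorithm for general λ):

```python
from math import ceil, log

def gossip_round_lower_bound(n, k):
    """Rounds needed for global combine with k ports and unit latency:
    each round, a value already known to m processors can become known
    to at most m * (k + 1), so ceil(log_{k+1} n) rounds are required."""
    return ceil(log(n) / log(k + 1))

assert gossip_round_lower_bound(1024, 1) == 10   # single-port case: log2(n) rounds
```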

19.
Evaluation of pfaffians arises in a number of physics applications, and for some of them a direct method is preferable to using the determinantal formula. We discuss two methods for the numerical evaluation of pfaffians. The first is tridiagonalization based on Householder transformations. The main advantage of this method is its numerical stability, which makes the implementation of a pivoting strategy unnecessary. The second method considered is based on Aitken's block diagonalization formula. It yields a kind of LU decomposition (similar to Cholesky factorization) under congruence of arbitrary skew-symmetric matrices that is well suited both for numeric and symbolic evaluation of the pfaffian. Fortran subroutines (FORTRAN 77 and 90) implementing both methods are given. We also provide simple implementations in Python and Mathematica for testing purposes, or for exploratory studies of methods that make use of pfaffians.

Program summary

Program title: Pfaffian
Catalogue identifier: AEJD_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEJD_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 2281
No. of bytes in distributed program, including test data, etc.: 13 226
Distribution format: tar.gz
Programming language: Fortran 77 and 90
Computer: Any supporting a FORTRAN compiler
Operating system: Any supporting a FORTRAN compiler
RAM: a few MB
Classification: 4.8
Nature of problem: Evaluation of the pfaffian of a skew-symmetric matrix. Evaluation of pfaffians arises in a number of physics applications involving fermionic mean field wave functions and their overlaps.
Solution method: Householder tridiagonalization. Aitken's block diagonalization formula.
Additional comments: Python and Mathematica implementations are provided in the main body of the paper.
Running time: Depends on the size of the matrices. For matrices with 100 rows and columns a few milliseconds are required.
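The Householder route can be sketched compactly in NumPy (an illustration, not the distributed Fortran routines): reduce the skew-symmetric matrix to tridiagonal form by an orthogonal congruence, then the pfaffian is the product of every other superdiagonal entry times det(Q) = (−1)^(number of reflectors).

```python
import numpy as np

def pfaffian(A):
    """Pfaffian of a real skew-symmetric matrix via Householder tridiagonalization.

    The congruence A -> Q^T A Q leaves a skew-symmetric tridiagonal matrix
    whose pfaffian is the product of entries (0,1), (2,3), ...; the sign is
    fixed by det(Q), since Pf(Q^T A Q) = det(Q) * Pf(A).
    """
    A = np.array(A, dtype=float)
    n = A.shape[0]
    if n % 2:
        return 0.0                         # odd dimension: pfaffian vanishes
    det_q = 1.0
    for k in range(n - 2):
        x = A[k + 1:, k]
        nx = np.linalg.norm(x)
        if np.isclose(nx, abs(x[0])):      # column already tridiagonal here
            continue
        v = x.copy()
        v[0] += np.copysign(nx, x[0])
        v /= np.linalg.norm(v)
        A[k + 1:, :] -= 2.0 * np.outer(v, v @ A[k + 1:, :])
        A[:, k + 1:] -= 2.0 * np.outer(A[:, k + 1:] @ v, v)
        det_q = -det_q                     # each reflector has determinant -1
    pf = np.prod(A[np.arange(0, n - 1, 2), np.arange(1, n, 2)])
    return det_q * pf

a, b, c, d, e, f = 1.0, 2.0, 3.0, 4.0, 5.0, 6.0
A4 = np.array([[0, a, b, c], [-a, 0, d, e], [-b, -d, 0, f], [-c, -e, -f, 0]])
assert np.isclose(pfaffian(A4), a * f - b * e + c * d)   # known 4x4 formula
```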

20.
In the sort-last-sparse parallel volume rendering system on distributed memory multicomputers, one can achieve a very good performance improvement in the rendering phase by increasing the number of processors. This is because each processor can render images locally without communicating with other processors. However, in the compositing phase, a processor has to exchange local images with other processors. When the number of processors exceeds a threshold, the image compositing time becomes a bottleneck. In this paper, we propose three compositing methods to efficiently reduce the compositing time in parallel volume rendering. They are the binary-swap with bounding rectangle (BSBR) method, the binary-swap with run-length encoding and static load-balancing (BSLC) method, and the binary-swap with bounding rectangle and run-length encoding (BSBRC) method. The proposed methods were implemented on an SP2 parallel machine along with the binary-swap compositing method. The experimental results show that the BSBRC method has the best performance among these four methods.
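For reference, the plain binary-swap exchange pattern that all three proposed methods refine, as a single-process simulation (P must be a power of two and divide the row count; the bounding-rectangle and run-length-encoding optimizations of BSBR, BSLC, and BSBRC are not modeled):

```python
import numpy as np

def binary_swap(images, over):
    """Simulate binary-swap compositing for P = 2^k processors.

    `images[p]` is rank p's full-resolution local rendering, ranks assumed
    sorted front-to-back. Each round, partners split their current span,
    exchange halves, and composite; after log2(P) rounds each rank owns a
    fully composited 1/P strip of the final image.
    """
    P = len(images)
    spans = [(0, images[0].shape[0])] * P
    step = 1
    while step < P:
        next_spans = []
        for p in range(P):
            q = p ^ step                  # partner differs in one rank bit
            lo, hi = spans[p]
            mid = (lo + hi) // 2
            if p < q:                     # keep lower strip; p stays in front
                images[p][lo:mid] = over(images[p][lo:mid], images[q][lo:mid])
                next_spans.append((lo, mid))
            else:                         # keep upper strip; q stays in front
                images[p][mid:hi] = over(images[q][mid:hi], images[p][mid:hi])
                next_spans.append((mid, hi))
        spans = next_spans
        step *= 2
    return spans                          # strip owned by each rank

# 'Over' blend for premultiplied RGBA: front + (1 - front_alpha) * back.
def over(front, back):
    return front + (1.0 - front[..., 3:4]) * back

rng = np.random.default_rng(0)
imgs = [rng.random((8, 8, 4)) for _ in range(4)]   # 4 ranks, 8-row RGBA images
spans = binary_swap(imgs, over)                    # each rank ends up with 2 rows
```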
