首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Parallel algorithms for finding polynomial Roots on OTIS-torus   总被引:1,自引:0,他引:1  
We present two parallel algorithms for finding all the roots of an N-degree polynomial equation on an efficient model of Optoelectronic Transpose Interconnection System (OTIS), called OTIS-2D torus. The parallel algorithms are based on the iterative schemes of Durand–Kerner and Ehrlich methods. We show that the algorithm for the Durand–Kerner method requires (N 0.75+0.5N 0.25−1) electronic moves + 2(N 0.5−1) OTIS moves using N processors. The parallel algorithm for Ehrlich method is shown to run in (N 0.75+0.5N 0.25−1) electronic moves + 2(N 0.5−1) OTIS moves with the same number of processors. The algorithms have lower AT cost than the algorithms presented in Jana (Parallel Comput 32:301–312, 2006). The scalability of the algorithms is also discussed.  相似文献   

2.
We present block algorithms and their implementation for the parallelization of sub-cubic Gaussian elimination on shared memory architectures. Contrarily to the classical cubic algorithms in parallel numerical linear algebra, we focus here on recursive algorithms and coarse grain parallelization. Indeed, sub-cubic matrix arithmetic can only be achieved through recursive algorithms making coarse grain block algorithms perform more efficiently than fine grain ones. This work is motivated by the design and implementation of dense linear algebra over a finite field, where fast matrix multiplication is used extensively and where costly modular reductions also advocate for coarse grain block decomposition. We incrementally build efficient kernels, for matrix multiplication first, then triangular system solving, on top of which a recursive PLUQ decomposition algorithm is built. We study the parallelization of these kernels using several algorithmic variants: either iterative or recursive and using different splitting strategies. Experiments show that recursive adaptive methods for matrix multiplication, hybrid recursive–iterative methods for triangular system solve and tile recursive versions of the PLUQ decomposition, together with various data mapping policies, provide the best performance on a 32 cores NUMA architecture. Overall, we show that the overhead of modular reductions is more than compensated by the fast linear algebra algorithms and that exact dense linear algebra matches the performance of full rank reference numerical software even in the presence of rank deficiencies.  相似文献   

3.
Interval Newton/Generalized Bisection methods reliably find all numerical solutions within a given domain. Both computational complexity analysis and numerical experiments have shown that solving the corresponding interval linear system generated by interval Newton's methods can be computationally expensive (especially when the nonlinear system is large). In applications, many large-scale nonlinear systems of equations result in sparse interval jacobian matrices. In this paper, we first propose a general indexed storage scheme to store sparse interval matrices We then present an iterative interval linear solver that utilizes the proposed index storage scheme It is expected that the newly proposed general interval iterative sparse linear solver will improve the overall performance for interval Newton/Generalized bisection methods when the jacobian matrices are sparse. In section 1, we briefly review interval Newton's methods. In Section 2, we review some currently used storage schemes for sparse systems. In Section 3, we introduce a new index scheme to store general sparse matrices. In Section 4, we present both sequential and parallel algorithms to evaluate a general sparse Jacobian matrix. In Section 5, we present both sequential and parallel algorithms to solve the corresponding interval linear system by the all-row preconditioned scheme. Conclusions and future work are discussed in Section 6.  相似文献   

4.
In this work we show a portable sequential and a portable parallel algorithm for solving the inverse eigenproblem for real symmetric Toeplitz matrices. Both algorithms are based on Broyden's method for solving nonlinear systems. We reduced the computational cost for some problem sizes, and furthermore we managed to reduce spatial cost considerably, compared in both cases with parallel algorithms proposed by other authors and by us, although sometimes quasi‐Newton methods (as Broyden) do not reach convergence in all the test cases. We have implemented the parallel algorithm using the parallel numerical linear algebra library SCALAPACK based on the MPI environment. Experimental results have been obtained using two different architectures: a shared memory multiprocessor, the SGI PowerChallenge, and a cluster of Pentium II PCs connected through a myrinet network. The algorithms obtained are scalable in all the cases. Copyright © 2004 John Wiley & Sons, Ltd.  相似文献   

5.
Extreme scale supercomputers available before the end of this decade are expected to have 100 million to 1 billion computing cores. The power and energy efficiency issue has become one of the primary concerns of extreme scale high performance scientific computing. This paper surveys the research on saving power and energy for numerical linear algebra algorithms in high performance scientific computing on supercomputers around the world. We first stress the significance of numerical linear algebra algorithms in high performance scientific computing nowadays, followed by a background introduction on widely used numerical linear algebra algorithms and software libraries and benchmarks. We summarize commonly deployed power management techniques for reducing power and energy consumption in high performance computing systems by presenting power and energy models and two fundamental types of power management techniques: static and dynamic. Further, we review the research on saving power and energy for high performance numerical linear algebra algorithms from four aspects: profiling, trading off performance, static saving, and dynamic saving, and summarize state-of-the-art techniques for achieving power and energy efficiency in each category individually. Finally, we discuss potential directions of future work and summarize the paper.  相似文献   

6.
Computing reduced-order models of controlled dynamical systems is of fundamental importance in many analysis and synthesis problems in systems and control theory. Algorithmic aspects of model reduction methods based on state-space truncation for linear discrete-time systems are addressed here. In contrast to the often-used approach of applying methods for continuous-time systems to discrete-time models employing a bilinear transformation, we devise special algorithms for discrete-time systems. Usually, this is more reliable and efficient. All methods discussed require in an initial stage the computation of the Gramians of the system. Using an accelerated fixed-point iteration for computing the full-rank factors of the Gramians yields some favorable computational aspects, particularly for non-minimal systems. The computations only require efficient implementations of basic linear algebra operations readily available on modern computer architectures. We discuss aspects of the parallel implementation of these methods and show the performance and scalability on distributed memory computers. Our approach enables users to deal with very complex systems using relatively cheap infrastructure, as, for example, a local PC or workstation network.  相似文献   

7.
Mesh of trees (MOT) is well known for its small diameter, high bisection width, simple decomposability and area universality. On the other hand, OTIS (Optical Transpose Interconnection System) provides an efficient optoelectronic model for massively parallel processing system. In this paper, we present OTIS-MOT as a competent candidate for a two-tier architecture that can take the advantages of both the OTIS and the MOT. We show that an n4-n^{4}_{-} processor OTIS-MOT has diameter 8log n +1 (The base of the logarithm is assumed to be 2 throughout this paper.) and fault diameter 8log n+2 under single node failure. We establish other topological properties such as bisection width, multiple paths and the modularity. We show that many communication as well as application algorithms can run on this network in comparable time or even faster than other similar tree-based two-tier architectures. The communication algorithms including row/column-group broadcast and one-to-all broadcast are shown to require O(log n) time, multicast in O(n 2log n) time and the bit-reverse permutation in O(n) time. Many parallel algorithms for various problems such as finding polynomial zeros, sales forecasting, matrix-vector multiplication and the DFT computation are proposed to map in O(log n) time. Sorting and prefix computation are also shown to run in O(log n) time.  相似文献   

8.
Parallel algorithms for direct methods of analysis and solution of linear algebra problems with sparse symmetric irregularly structured matrices are considered. The performance of such algorithms is investigated. Upper estimates of the speedup and efficiency factors are obtained for a parallel algorithm for triangular decomposition of sparse matrices. Some results of numerical experiments carried out on a MIMD computer are given.  相似文献   

9.
This paper will describe some recent attempts to construct transportable numerical software for high-performance computers. Restructuring algorithms in terms of simple linear algebra modules is reviewed. This technique has proved very succesful in obtaining a high level of transportability without severe loss of performance on a wide variety of both vector and parallel computers. The use of modules to encapsulate parallelism and reduce the ratio of data movement to floating-point operations has been demonstrably effective for regular problems such as those found in dense linear algebra. In other situations it may be necessary to express explicitly parallel algorithms. We also present a programming methodology that is useful for constructing new parallel algorithms which require sophisticated synchronization at a large grain level. We describe the SCHEDULE package which provides an environment for developing and analyzing explicitly parallel programs in FORTRAN which are portable. This package now includes a preprocessor to achieve complete portability of user level code and also a graphics post processor for performance analysis and debugging. We discuss details of porting both the SCHEDULE package and user code. Examples from linear algebra, and partial differential equations are used to illustrate the utility of this approach.  相似文献   

10.
High Performance Inverse Preconditioning   总被引:1,自引:0,他引:1  
The derivation of parallel numerical algorithms for solving sparse linear systems on modern computer systems and software platforms has attracted the attention of many researchers over the years. In this paper we present an overview on the design issues of parallel approximate inverse matrix algorithms, based on an anti-diagonal “wave pattern” approach and a “fish-bone” computational procedure, for computing explicitly various families of exact and approximate inverses for solving sparse linear systems. Parallel preconditioned conjugate gradient-type schemes in conjunction with parallel approximate inverses are presented for the efficient solution of sparse linear systems. Applications of the proposed parallel methods by solving characteristic sparse linear systems on symmetric multiprocessor systems and distributed systems are discussed and the parallel performance of the proposed schemes is given, using MPI, OpenMP and Java multithreading.  相似文献   

11.
In optoelectronic OTIS multicomputer networks (also known as Swapped networks), electrical and optical interconnects are used for local and global communication, respectively. An interesting instance of the OTIS multicomputers is the OTIS-mesh. Pancyclicity is of great importance in the implementation of a variety of parallel algorithms in multicomputers. This paper addresses the Pancyclicity problem of OTIS-mesh. More precisely, we prove that if the factor graph G of an OTIS network is a 2-D or 3-D mesh with at least one even radix, OTIS-G can embed any cycle of length l, l∈{4,6,7,8,9,…,2|G|}.  相似文献   

12.
Cluster algorithms have application in diverse areas, including statistical mechanics of polymer solutions, spin models in physics, and the study of ecological systems. Most parallel cluster labeling algorithms are designed for SIMD and MIMD multiprocessors and based on relaxation methods. We present a parallel 3-D cluster labeling algorithm based on mapping tables, for distributed memory environments. The proposed algorithm focuses on minimizing interprocess communication to enhance execution performance on workstation networks. We implemented the algorithm with the aid of theEcliPSeparallel replication toolkit, exploiting special tree-combining and data reduction features of the system. We report on performance results for experiments conducted on workstation clusters.  相似文献   

13.
The block-cyclic data distribution is commonly used to organize array elements over the processors of a coarse-grained distributed memory parallel computer. In many scientific applications, the data layout must be reorganized at run-time in order to enhance locality and reduce remote memory access overheads. In this paper we present a general framework for developing array redistribution algorithms. Using this framework, we have developed efficient algorithms that redistribute an array from one block-cyclic layout to another. Block-cyclic redistribution consists of index set computation , wherein the destination locations for individual data blocks are calculated, and data communication , wherein these blocks are exchanged between processors. The framework treats both these operations in a uniform and integrated way. We have developed efficient and distributed algorithms for index set computation that do not require any interprocessor communication. To perform data communication in a conflict-free manner, we have developed direct indirect and hybrid algorithms. In the direct algorithm, a data block is transferred directly to its destination processor. In an indirect algorithm, data blocks are moved from source to destination processors through intermediate relay processors. The hybrid algorithm is a combination of the direct and indirect algorithms. Our framework is based on a generalized circulant matrix formalism of the redistribution problem and a general purpose distributed memory model of the parallel machine. Our algorithms sustain excellent performance over a wide range of problem and machine parameters. We have implemented our algorithms using MPI, to allow for easy portability across different HPC platforms. Experimental results on the IBM SP-2 and the Cray T3D show superior performance over previous approaches. When the block size of the cyclic data layout changes by a factor of K , the redistribution can be performed in O( log K) communication steps. This is true even when K is a prime number. In contrast, previous approaches take O(K) communication steps for redistribution. Our framework can be used for developing scalable redistribution libraries, for efficiently implementing parallelizing compiler directives, and for developing parallel algorithms for various applications. Redistribution algorithms are especially useful in signal processing applications, where the data access patterns change significantly between computational phases. They are also necessary in linear algebra programs, to perform matrix transpose operations. Received June 1, 1997; revised March 10, 1998.  相似文献   

14.
The objective of this paper is to analyze the dynamic scheduling of dense linear algebra algorithms on shared‐memory, multicore architectures. Current numerical libraries (e.g., linear algebra package) show clear limitations on such emerging systems mainly because of their coarse granularity tasks. Thus, many numerical algorithms need to be redesigned to better fit the architectural design of the multicore platform. The parallel linear algebra for scalable multicore architectures library developed at the University of Tennessee tackles this challenge by using tile algorithms to achieve a finer task granularity. These tile algorithms can then be represented by directed acyclic graphs, where nodes are the tasks and edges are the dependencies between the tasks. The paramount key to achieve high performance is to implement a runtime environment to efficiently schedule the execution of the directed acyclic graph across the multicore platform. This paper studies the impact on the overall performance of some parameters, both at the level of the scheduler (e.g., window size and locality) and the algorithms (e.g., left‐looking and right‐looking variants). We conclude that some commonly accepted rules for dense linear algebra algorithms may need to be revisited. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

15.
We address direct multiple shooting based algorithms for nonlinear model predictive control, with a focus on problems with long prediction horizons. We describe different efficient multiple shooting variants with a computational effort that is only linear in the horizon length. Proposed techniques comprise structure exploiting linear algebra on the one hand, and approximation of derivative information in an adjoint Sequential Quadratic Programming method on the other hand. For explicit one-step methods for ordinary differential equations we address the issue of consistent and fast generation of both forward and adjoint derivatives of dynamic process models according to the principle of Internal Numerical Differentiation. We discuss the applicability of the proposed methods at the example of three benchmark problems. These have recently been addressed in literature and serve to evaluate the relative performance of each of the proposed methods for both off-line optimal control and on-line nonlinear model predictive control. Throughout, we compare against results published for a recently proposed collocation approach based on finite elements.  相似文献   

16.
We present a data structure for parallel computing which is directly linked to geometric quantities of an underlying mesh and which is well adapted to the requirements of a general finite element realization. In addition, we define an abstract linear algebra model which supports multigrid methods (extending our previous work in Comp. Vis. Sci. 1 (1997), 27–40). Finally, we apply the parallel multigrid preconditioner to several configurations in linear elasticity and we compute the condition number numerically for different smoothers, resulting in a quantitative evaluation of parallel multigrid performance.  相似文献   

17.
This paper derives a number of results related to the topological properties of OTIS k-ary n-cube interconnection networks. The basic topological metrics of size, degree, shortest distance, and diameter are obtained. Then results related to embedding in OTIS k-ary n-cubes of OTIS k-ary (n−1)-cubes, cycles, meshes, cubes, and spanning trees are derived. The OTIS k-ary n-cube is shown to be Hamiltonian. Minimal one-to-one routing and optimal broadcasting algorithms are proposed. The OTIS k-ary n-cube is shown to be maximally fault-tolerant. These results are derived based on known properties of k-ary n-cube networks and general properties of OTIS networks.  相似文献   

18.
Topological properties of OTIS-networks   总被引:1,自引:0,他引:1  
We conduct a general study of the topological properties of optical transpose interconnection systems (OTIS). We first obtain their basic topological metrics of size, degree, shortest distance and diameter, and then we obtain results related to the recursive structure and efficient embedding of meshes, cubes, spanning trees and cycles. We also present minimal one-to-one routing and optimal broadcasting algorithms, and we show how to construct node-disjoint paths between any two nodes of an OTIS network. Recent studies have addressed only particular members of the general class of OTIS networks. In this paper, we present unified tools for obtaining the topological properties of an arbitrary OTIS network based on the properties of the corresponding factor network  相似文献   

19.
In this paper we consider the problem of minimizing the completion time variance of n jobs on a single machine with deterministic processing times. We propose a new heuristic and compare the results with existing popular heuristics for the problem. We also propose a method based on genetic algorithms to solve the problem. We present the worst case performance analysis of the proposed heuristic. We also consider the problem of minimizing the completion time variance of n jobs on m identical parallel machines in both restricted and unrestricted versions. A heuristic method and a method based on genetic algorithms are presented for both the cases and results of computational testing are provided. It is concluded that the proposed methods provide better results compared to existing methods for the single machine case as well as for the multi-machine case.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号