1.
A number of highly-threaded, many-core architectures hide memory-access latency by low-overhead context switching among a large number of threads. The speedup of a program on these machines depends on how well the latency is hidden. If the number of threads were infinite, these machines could, in theory, provide the performance predicted by the PRAM analysis of these programs. However, the number of threads per processor is not infinite and is constrained by both hardware and algorithmic limits. In this paper, we introduce the Threaded Many-core Memory (TMM) model, which is meant to capture the important characteristics of these highly-threaded, many-core machines. Since we model some important machine parameters of these machines, we expect analysis under this model to provide a more fine-grained and accurate performance prediction than the PRAM analysis. We analyze four algorithms for the classic all-pairs shortest paths problem under this model. We find that even when two algorithms have the same PRAM performance, our model predicts different performance for some settings of machine parameters. For example, for dense graphs, the dynamic programming algorithm and Johnson’s algorithm have the same performance in the PRAM model. However, our model predicts different performance for large enough memory-access latency and validates the intuition that the dynamic programming algorithm performs better on these machines. We validate several predictions made by our model using empirical measurements on an instantiation of a highly-threaded, many-core machine, namely the NVIDIA GTX 480.
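For readers who want a concrete reference point, the following is a minimal sequential Python sketch of the dynamic-programming all-pairs shortest paths algorithm (Floyd-Warshall) that the abstract contrasts with Johnson's algorithm. It only illustrates the algorithm's structure; it is not the paper's TMM analysis or GPU implementation, and all names in it are illustrative.

```python
import math

def floyd_warshall(weights):
    """Naive sequential Floyd-Warshall all-pairs shortest paths.

    `weights` is an n x n matrix of edge weights, with math.inf for
    missing edges and 0 on the diagonal. Returns the distance matrix.
    This is only a reference sketch of the dynamic-programming APSP
    algorithm discussed in the abstract, not the paper's TMM/GPU code.
    """
    n = len(weights)
    dist = [row[:] for row in weights]
    for k in range(n):            # allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist

# Tiny usage example on a 3-vertex graph.
INF = math.inf
g = [[0, 4, INF],
     [INF, 0, 1],
     [2, INF, 0]]
print(floyd_warshall(g))   # [[0, 4, 5], [3, 0, 1], [2, 6, 0]]
```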
2.
Parallel Givens sequences for solving the General Linear Model (GLM) are developed and analyzed. The block updating GLM estimation problem is also considered. The solution of the GLM employs the Generalized QR Decomposition as its main computational device, where one of the two matrices is initially upper triangular. The proposed Givens sequences efficiently exploit the initial triangular structure of the matrix and special properties of the solution method. The complexity analysis of the sequences is based on an Exclusive Read-Exclusive Write (EREW) Parallel Random Access Machine (PRAM) model with limited parallelism. Furthermore, the number of operations performed by a Givens rotation is determined by the size of the vectors used in the rotation. Under these assumptions, one conclusion drawn is that the sequence which applies the smallest number of compound disjoint Givens rotations to solve the GLM estimation problem does not necessarily have the lowest computational complexity. The various Givens sequences and their computational complexity analyses will be useful when addressing the solution of other, similar factorization problems.
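As context for the rotations the abstract builds on, here is a minimal sketch of a single Givens rotation step (computing (c, s) and rotating two matrix rows). It uses textbook conventions and illustrates only the basic operation, not the paper's compound or parallel Givens sequences; the function names are illustrative.

```python
import numpy as np

def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] @ [a, b] == [r, 0].

    Plain textbook Givens rotation; the paper's compound disjoint
    parallel sequences for the GLM are not reproduced here.
    """
    r = np.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r

def apply_givens_rows(A, i, j, c, s):
    """Rotate rows i and j of A in place (the basic step a Givens
    sequence repeats to annihilate entries of a matrix)."""
    Ai, Aj = A[i].copy(), A[j].copy()
    A[i] = c * Ai + s * Aj
    A[j] = -s * Ai + c * Aj

# Example: zero out A[2, 0] against the pivot row 0.
A = np.array([[3.0, 1.0],
              [0.0, 2.0],
              [4.0, 5.0]])
c, s = givens(A[0, 0], A[2, 0])
apply_givens_rows(A, 0, 2, c, s)
print(np.round(A, 6))   # A[2, 0] is now (numerically) zero
```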
3.
Spatial regularity amidst a seemingly chaotic image is often meaningful. Many papers in computational geometry are concerned with detecting some type of regularity via exact solutions to problems in geometric pattern recognition. However, real-world applications often have data that is approximate, and may rely on calculations that are approximate. Thus, it is useful to develop solutions that have an error tolerance.

A solution has recently been presented by Robins et al. [Inform. Process. Lett. 69 (1999) 189–195] to the problem of finding all maximal subsets of an input set in the Euclidean plane that are approximately equally spaced and approximately collinear. This is a problem that arises in computer vision, military applications, and other areas. The algorithm of Robins et al. differs in several important respects from the optimal algorithm given by Kahng and Robins [Pattern Recognition Lett. 12 (1991) 757–764] for the exact version of the problem. The algorithm of Robins et al. seems inherently sequential and runs in O(n^(5/2)) time, where n is the size of the input set. In this paper, we give parallel solutions to this problem.
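To make the notion of error tolerance concrete, the following naive verifier checks whether an ordered point sequence is approximately collinear and approximately equally spaced. The tolerance semantics here are an assumption chosen for illustration; they do not reproduce the definitions of Robins et al. or the parallel algorithms of the paper.

```python
import math

def approx_collinear_equally_spaced(points, eps):
    """Check whether an ordered list of 2D points is approximately
    collinear and approximately equally spaced.

    A naive O(k) verifier used only to illustrate the notion of error
    tolerance from the abstract; the tolerance semantics are an
    assumption, not the paper's exact definition.
    """
    if len(points) < 3:
        return True
    (x0, y0), (x1, y1) = points[0], points[1]
    base = math.hypot(x1 - x0, y1 - y0)   # reference spacing
    for i in range(1, len(points) - 1):
        (ax, ay), (bx, by), (cx, cy) = points[i - 1], points[i], points[i + 1]
        # Deviation from collinearity: distance of c from the line a-b,
        # computed via the cross product (twice the triangle area).
        cross = (bx - ax) * (cy - ay) - (by - ay) * (cx - ax)
        seg = math.hypot(bx - ax, by - ay)
        if seg == 0 or abs(cross) / seg > eps:
            return False
        # Deviation from equal spacing against the first segment length.
        if abs(math.hypot(cx - bx, cy - by) - base) > eps:
            return False
    return True

pts = [(0, 0), (1.02, 0.01), (2.0, -0.02), (3.01, 0.0)]
print(approx_collinear_equally_spaced(pts, eps=0.1))   # True
```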

4.
A graph is distance-hereditary if the distance between any two vertices stays the same in every connected induced subgraph containing both. Two well-known classes of graphs, trees and cographs, are both distance-hereditary. In this paper, we first show that the perfect domination problem can be solved in sequential linear time on distance-hereditary graphs. By exhibiting a regular property of the problem, we also show that it can be easily parallelized on distance-hereditary graphs.
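As a minimal illustration of the problem being solved, here is a brute-force verifier, assuming the standard definition of perfect domination (every vertex outside the set has exactly one neighbor inside it). The paper's linear-time and parallel algorithms for distance-hereditary graphs are not reproduced.

```python
def is_perfect_dominating_set(adj, d_set):
    """Check whether d_set is a perfect dominating set of the graph
    given as an adjacency dict {v: set_of_neighbors}: every vertex
    outside d_set must have exactly one neighbor inside d_set.

    A brute-force verifier for illustration only; the algorithms of
    the paper are specific to distance-hereditary graphs and are not
    reproduced here.
    """
    d = set(d_set)
    for v, nbrs in adj.items():
        if v in d:
            continue
        if sum(1 for u in nbrs if u in d) != 1:
            return False
    return True

# A path on four vertices 0-1-2-3: {0, 3} dominates 1 and 2 exactly
# once each, so it is a perfect dominating set of this graph.
path4 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(is_perfect_dominating_set(path4, {0, 3}))   # True
print(is_perfect_dominating_set(path4, {1}))      # False (3 has no neighbor in the set)
```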
5.
The parallel language FORK [1], based on a scalable shared memory model, is a PASCAL-like language with some additional parallel constructs. A PRAM (Parallel Random Access Machine) algorithm can be expressed at a high level of abstraction as a FORK program, which is translated into efficient PRAM code guaranteeing the theoretically predicted runtimes.

In this paper, we concentrate on those features of the language FORK related to parallelism, such as the group concept, shared memory access, and synchronous or asynchronous execution. We present a trace-based denotational interleaving semantics where processes describe synchronous computations. Processes are created and deleted dynamically and run asynchronously. The interleaving rules reflect the underlying CRCW (concurrent-read, concurrent-write) PRAM model.
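The following toy sketch is not FORK code; it only makes the synchronous CRCW step semantics concrete by simulating one PRAM step in which all reads see the pre-step memory and concurrent writes are resolved by a simple priority rule. The priority rule and all names are modelling assumptions chosen for illustration.

```python
def crcw_pram_step(memory, processors):
    """Simulate one synchronous CRCW PRAM step.

    Each processor is a function mapping the *old* shared memory to a
    list of (address, value) write requests, so every processor reads
    the same pre-step state. Concurrent writes to the same address are
    resolved by a PRIORITY rule: the lowest-numbered processor wins.
    This is a toy illustration of the machine model, not FORK semantics.
    """
    old = dict(memory)                        # snapshot: all reads see this
    written = set()
    for pid, proc in enumerate(processors):   # lower pid = higher priority
        for addr, val in proc(old):
            if addr not in written:
                memory[addr] = val
                written.add(addr)
    return memory

# Three processors all try to write cell 0; processor 0 wins.
mem = {0: None, 1: 10, 2: 20}
procs = [lambda m, p=p: [(0, m[1] + m[2] + p)] for p in range(3)]
print(crcw_pram_step(mem, procs))   # {0: 30, 1: 10, 2: 20}
```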

6.
A planar monotone circuit (PMC) is a Boolean circuit that can be embedded in the plane and that contains only AND and OR gates. A layered PMC is a PMC in which all input nodes are in the external face, and the gates can be assigned to layers in such a way that every wire goes between gates in successive layers. Goldschlager, Cook and Dymond, and others have developed NC^2 algorithms to evaluate a layered PMC when the output node is in the same face as the input nodes. These algorithms require a large number of processors (Ω(n^6), where n is the size of the input circuit). In this paper we give an efficient parallel algorithm that evaluates a layered PMC of size n using only a linear number of processors on an EREW PRAM. Our parallel algorithm is the best possible to within a polylog factor, and is a substantial improvement over the earlier algorithms for the problem. Received April 18, 1994; revised April 7, 1995.
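A straightforward sequential evaluator for a layered monotone circuit may help fix the model. The sketch below evaluates the layers in order; it is only a reference for the circuit class, not the EREW PRAM algorithm of the paper, and the gate encoding is an assumption made for illustration.

```python
def evaluate_layered_pmc(inputs, layers):
    """Evaluate a layered monotone Boolean circuit level by level.

    `inputs` is a list of booleans (layer 0). `layers` is a list of
    layers; each layer is a list of gates ('AND' | 'OR', [indices into
    the previous layer]). A simple sequential evaluator used only to
    make the circuit model concrete; it is not the EREW PRAM algorithm
    of the paper.
    """
    values = list(inputs)
    for layer in layers:
        nxt = []
        for op, preds in layer:
            bits = [values[i] for i in preds]
            nxt.append(all(bits) if op == 'AND' else any(bits))
        values = nxt
    return values

# (x0 AND x1) OR x2 over inputs [x0, x1, x2]; since every wire must go
# between successive layers, x2 is carried through layer 1 by an OR gate.
layers = [
    [('AND', [0, 1]), ('OR', [2, 2])],   # layer 1: two gates
    [('OR', [0, 1])],                    # layer 2: the output gate
]
print(evaluate_layered_pmc([True, False, True], layers))   # [True]
```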
7.
A. Chin, Algorithmica, 1994, 12(2-3): 170-181
Consider the problem of efficiently simulating the shared-memory parallel random access machine (PRAM) model on massively parallel architectures with physically distributed memory. To prevent network congestion and memory bank contention, it may be advantageous to hash the shared memory address space. The decision on whether or not to use hashing depends on (1) the communication latency in the network and (2) the locality of memory accesses in the algorithm. We relate this decision directly to algorithmic issues by studying the complexity of hashing in the Block PRAM model of Aggarwal, Chandra, and Snir, a shared-memory model of parallel computation which accounts for communication locality. For this model, we exhibit a universal family of hash functions having optimal locality. The complexity of applying these hash functions to the shared address space of the Block PRAM (i.e., by permuting data elements) is asymptotically equivalent to the complexity of performing a square matrix transpose, and this result is best possible for all pairwise independent universal hash families. These complexity bounds provide theoretical evidence that hashing and randomized routing need not destroy communication locality, addressing an open question of Valiant. This work was started when the author was a student at Oxford University, supported by a National Science Foundation Graduate Fellowship and a Rhodes Scholarship. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the National Science Foundation or the Rhodes Trust.
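For concreteness, the sketch below draws a member of the classic Carter-Wegman universal family h_{a,b}(x) = ((a*x + b) mod p) mod m and uses it to map shared-memory addresses to banks. This illustrates the general kind of hashing the abstract discusses; it is not the locality-optimal family constructed in the paper, and the parameter choices are assumptions.

```python
import random

def make_universal_hash(m, p=2**61 - 1, rng=random):
    """Draw one member of the classic Carter-Wegman universal family
    h_{a,b}(x) = ((a*x + b) mod p) mod m, with p a prime larger than
    the key universe. Illustrates universal hashing of an address
    space; it is not the locality-optimal family of the paper.
    """
    a = rng.randrange(1, p)
    b = rng.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m

# Hash a few shared-memory addresses into m = 8 memory banks.
h = make_universal_hash(8)
addresses = [0, 1, 2, 1024, 4096]
print({x: h(x) for x in addresses})
```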
8.
Traditionally, the block-based medial axis transform (BB-MAT) and the chessboard distance transform (CDT) were usually viewed as two completely different image computation problems, especially in three-dimensional (3D) space. In fact, some equivalent properties exist between them. The relationship between the two is first derived and proved in this paper. One of the significant properties is that the CDT of a 3D binary image V is equal to the BB-MAT of the image V', where V' denotes the inverse image of V. In a parallel algorithm, the cost is defined as the product of the time complexity and the number of processors used. The main contribution of this work is to reduce the costs of the 3D BB-MAT and 3D CDT problems proposed by Wang [65]. Based on the reverse-dominance technique, which is redefined from the dominance concept, we obtain the 3D CDT by first implementing the 3D BB-MAT algorithm. For a 3D binary image of size N^3, our parallel algorithm runs in O(log N) time using N^3 processors on the concurrent read exclusive write (CREW) parallel random access machine (PRAM) model to solve the 3D BB-MAT and 3D CDT problems, respectively. The resulting costs are lower than those of Wang's algorithms. To the best of our knowledge, these are the lowest-cost 3D BB-MAT and 3D CDT algorithms known. In parallel algorithms, the running time can be divided into computation time and communication time. The running, communication, and computation times for different problem sizes are measured on an HP Superdome with an SMP/CC-NUMA (symmetric multiprocessor/cache coherent non-uniform memory access) architecture. We conclude that a parallel computer (i.e., an SMP/CC-NUMA architecture or a cluster system) is more suitable for solving problems with large input sizes.
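A brute-force reference for the 3D chessboard distance transform, included only to pin down the definition used above: every voxel receives its Chebyshev distance to the nearest foreground voxel, following the DT wording in item 10 below. It is not the O(log N)-time parallel algorithm of the paper, and the 'nearest foreground' convention is taken from that wording rather than this paper.

```python
import itertools, math

def chessboard_dt_3d(vol):
    """Brute-force 3D chessboard (Chebyshev) distance transform.

    `vol` is a nested list vol[z][y][x] of 0/1 voxels; every voxel gets
    the chessboard distance to the nearest foreground (value 1) voxel.
    Runs in O(N^6) for an N^3 volume; purely a reference for the
    definition, not the parallel CREW PRAM algorithm of the paper.
    """
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    fg = [(z, y, x)
          for z, y, x in itertools.product(range(nz), range(ny), range(nx))
          if vol[z][y][x] == 1]
    out = [[[math.inf] * nx for _ in range(ny)] for _ in range(nz)]
    for z, y, x in itertools.product(range(nz), range(ny), range(nx)):
        for fz, fy, fx in fg:
            d = max(abs(z - fz), abs(y - fy), abs(x - fx))
            if d < out[z][y][x]:
                out[z][y][x] = d
    return out

# 2x2x2 volume with a single foreground voxel at (0, 0, 0).
vol = [[[1, 0], [0, 0]], [[0, 0], [0, 0]]]
print(chessboard_dt_3d(vol))   # every voxel's Chebyshev distance to (0,0,0)
```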
9.
This paper focuses on BSR (Broadcasting with Selective Reduction) implementations of algorithms solving basic convex polygon problems. More precisely, we describe constant-time solutions using a linear number, max(N, M), of processors (where N and M are the numbers of edges of the two polygons considered) for computing the maximum distance between two convex polygons, finding the critical support lines of two convex polygons, and computing the diameter, the width of a convex polygon, and the vector sum of two convex polygons. These solutions are based on the merging-slopes technique and use one-criterion BSR operations.
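As a plain sequential reference for one of the listed problems, the sketch below computes the diameter of a convex polygon by brute force over vertex pairs (valid because the diameter of a convex polygon is attained at two vertices). It is neither the constant-time BSR solution nor the merging-slopes technique; the names are illustrative.

```python
import math
from itertools import combinations

def polygon_diameter(vertices):
    """Diameter (largest inter-vertex distance) of a convex polygon.

    Brute force over all O(N^2) vertex pairs; a plain sequential
    reference for one of the problems listed in the abstract, not the
    constant-time BSR solution.
    """
    return max(math.dist(p, q) for p, q in combinations(vertices, 2))

# Unit square: the diameter is the diagonal, sqrt(2).
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(polygon_diameter(square))   # 1.4142135623730951
```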
10.
The distance transform (DT) is an image computation tool which can be used to extract information about the shape and the position of the foreground pixels relative to each other. It converts a binary image into a grey-level image, where each pixel has a value corresponding to the distance to the nearest foreground pixel. The time complexity of computing the distance transform depends entirely on the distance metric used; in particular, the more exact the distance transform, the worse the execution time. Nowadays, thousands of images often have to be processed in a limited time, and it is hardly possible for a sequential computer to compute the distance transform at such rates in real time. To provide efficient distance transform computation, it is therefore desirable to develop a parallel algorithm for this operation. In this paper, based on the diagonal propagation approach, we first provide an O(N^2) time sequential algorithm to compute the chessboard distance transform (CDT) of an N×N image, which is a DT using the chessboard distance metric. Based on the proposed sequential algorithm, the CDT of a 2D binary image array of size N×N can be computed in O(log N) time on the EREW PRAM model using O(N^2/log N) processors, in O(log log N) time on the CRCW PRAM model using O(N^2/log log N) processors, and in O(log N) time on the hypercube computer using O(N^2/log N) processors. Following the mapping proposed by Lee and Horng, an algorithm for the medial axis transform is also efficiently derived. The medial axis transform of a 2D binary image array of size N×N can be computed in O(log N) time on the EREW PRAM model using O(N^2/log N) processors, in O(log log N) time on the CRCW PRAM model using O(N^2/log log N) processors, and in O(log N) time on the hypercube computer using O(N^2/log N) processors. The proposed parallel algorithms are composed of a set of prefix operations. In each prefix operation phase, only the increase (add-one) operation and the minimum operation are employed, so the algorithms are especially efficient in practical applications.
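For comparison with the parallel bounds above, here is the classical two-pass (Rosenfeld-Pfaltz style) sequential chessboard distance transform. It is a well-known O(N^2) reference algorithm, not the diagonal-propagation sequential algorithm or the PRAM/hypercube algorithms of the paper; the 'nearest foreground pixel' convention follows the abstract.

```python
import math

def chessboard_dt_2d(img):
    """Two-pass (Rosenfeld-Pfaltz style) chessboard distance transform.

    `img` is a list of rows of 0/1 pixels; every pixel receives the
    chessboard (d8) distance to the nearest foreground (value 1) pixel.
    A classical sequential algorithm given as a reference; it is not
    the diagonal-propagation or PRAM algorithm of the paper.
    """
    n, m = len(img), len(img[0])
    d = [[0 if img[i][j] else math.inf for j in range(m)] for i in range(n)]
    fwd = [(-1, -1), (-1, 0), (-1, 1), (0, -1)]      # neighbors already visited
    bwd = [(1, 1), (1, 0), (1, -1), (0, 1)]
    for i in range(n):                               # forward raster scan
        for j in range(m):
            for di, dj in fwd:
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < m:
                    d[i][j] = min(d[i][j], d[a][b] + 1)
    for i in range(n - 1, -1, -1):                   # backward raster scan
        for j in range(m - 1, -1, -1):
            for di, dj in bwd:
                a, b = i + di, j + dj
                if 0 <= a < n and 0 <= b < m:
                    d[i][j] = min(d[i][j], d[a][b] + 1)
    return d

img = [[0, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 1]]
for row in chessboard_dt_2d(img):
    print(row)
# [1, 1, 1, 2]
# [1, 0, 1, 2]
# [1, 1, 1, 1]
# [2, 2, 1, 0]
```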