Similar Literature
1.
The distance transform (DT) is an image computation tool which can be used to extract information about the shape and the position of the foreground pixels relative to each other. It converts a binary image into a grey-level image, where each pixel has a value corresponding to the distance to the nearest foreground pixel. The time complexity for computing the distance transform depends strongly on the distance metric used; in particular, the more exact the distance transform is, the longer the execution time becomes. Nowadays, thousands of images often have to be processed within a limited time, and it is practically impossible for a sequential computer to compute their distance transforms in real time. In order to provide efficient distance transform computation, it is highly desirable to develop a parallel algorithm for this operation. In this paper, based on the diagonal propagation approach, we first provide an O(N^2) time sequential algorithm to compute the chessboard distance transform (CDT) of an N×N image, which is a DT using the chessboard distance metric. Based on the proposed sequential algorithm, the CDT of a 2-D binary image array of size N×N can be computed in O(log N) time on the EREW PRAM model using O(N^2/log N) processors, in O(log log N) time on the CRCW PRAM model using O(N^2/log log N) processors, and in O(log N) time on the hypercube computer using O(N^2/log N) processors. Following the mapping proposed by Lee and Horng, an algorithm for the medial axis transform is also efficiently derived. The medial axis transform of a 2-D binary image array of size N×N can be computed in O(log N) time on the EREW PRAM model using O(N^2/log N) processors, in O(log log N) time on the CRCW PRAM model using O(N^2/log log N) processors, and in O(log N) time on the hypercube computer using O(N^2/log N) processors. The proposed parallel algorithms are composed of a set of prefix operations. In each prefix operation phase, only the increase (add-one) operation and the minimum operation are employed, so the algorithms are especially efficient in practical applications.
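As a rough illustration of the operation being parallelized (not the paper's parallel algorithm), here is a minimal sequential sketch of the chessboard distance transform using the standard two-pass propagation over a binary image; the function and variable names are illustrative only.

```python
# Minimal sequential sketch of the chessboard distance transform (CDT).
# Illustrative only; the paper's contribution is the O(log N) parallel algorithms.
INF = float("inf")

def chessboard_dt(image):
    """image: 2-D list of 0/1, where 1 marks a foreground pixel.
    Returns a grid of chessboard distances to the nearest foreground pixel."""
    n = len(image)
    d = [[0 if image[i][j] else INF for j in range(n)] for i in range(n)]
    # Forward pass: propagate from the top-left (N, W, NW, NE neighbours).
    for i in range(n):
        for j in range(n):
            for di, dj in ((-1, 0), (0, -1), (-1, -1), (-1, 1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    d[i][j] = min(d[i][j], d[ii][jj] + 1)
    # Backward pass: propagate from the bottom-right (S, E, SE, SW neighbours).
    for i in range(n - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            for di, dj in ((1, 0), (0, 1), (1, 1), (1, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    d[i][j] = min(d[i][j], d[ii][jj] + 1)
    return d
```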

2.
Given an n-vertex simple polygon P, we address the following problems: (i) find the shortest path between two points s and d inside P, and (ii) compute the shortest-path tree between a single point s and each vertex of P (which implicitly represents all the shortest paths). We show how to solve the first problem in O(log n) time using O(n) processors, and the more general second problem in O(log^2 n) time using O(n) processors, for any simple polygon P. We assume the CREW PRAM shared-memory model of computation, in which concurrent reads are allowed but no two processors may attempt to write simultaneously to the same memory location. The algorithms are based on the divide-and-conquer paradigm and are quite different from the known sequential algorithms. Research supported by the Faculty of Graduate Studies and Research (McGill University) grant 276-07.

3.
Y.-J. Chen, S.-J. Horng. Computing, 1997, 59(2): 95–114
Representing a region of a digital image as the union of the maximal upright squares contained in the region is called the medial axis transform. In this paper, we present an O(log n) time parallel algorithm for the medial axis transform of an n×n binary image on an SIMD mesh-connected computer with hyperbus broadcasting using n^3 processors.
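For intuition, a closely related sequential dynamic program computes, for every foreground pixel, the side of the largest upright all-foreground square having that pixel as its bottom-right corner; the medial axis transform then retains only the maximal squares. This sketch is illustrative and is not the paper's parallel algorithm; all names are assumptions.

```python
# Illustrative dynamic program related to the medial axis transform (MAT):
# s[i][j] = side of the largest upright all-foreground square whose
# bottom-right corner is (i, j).  Sequential sketch only.
def largest_squares(image):
    n = len(image)
    s = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if image[i][j]:  # foreground pixel
                if i == 0 or j == 0:
                    s[i][j] = 1
                else:
                    s[i][j] = 1 + min(s[i - 1][j], s[i][j - 1], s[i - 1][j - 1])
    return s
```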

4.
We present three explicit schemes for distributing M variables among N memory modules, where M = Θ(N^1.5), M = Θ(N^2), and M = Θ(N^3), respectively. Each variable is replicated into a constant number of copies stored in distinct modules. We show that N processors, directly accessing the memories through a complete interconnection, can read/write any set of N variables in worst-case time O(N^(1/3)), O(N^(1/2)), and O(N^(2/3)), respectively, for the three schemes. The access times for the last two schemes are optimal with respect to the particular redundancy values used by such schemes. The address computation can be carried out efficiently by each processor without recourse to a complete memory map, requiring only O(1) internal storage. This paper was partially supported by NSF Grants CCR-91-96152 and CCR-94-00232, by ONR Contract N00014-91-J-4052, ARPA Order 8225, and by the ESPRIT III Basic Research Programme of the EC under Contract No. 9072 (Project GEPPCOM). Results reported here were presented in preliminary form at the 10th Symposium on Theoretical Aspects of Computer Science (Würzburg, Germany, 1993), and at the 5th ACM Symposium on Parallel Algorithms and Architectures (Velen, Germany, 1993).

5.
The solution of the uniform bicubic B-spline curve/surface fitting problem is considered. Based on the matrix perturbation method, this paper first presents a novel approximate O(n/p)-time parallel B-spline curve fitting algorithm for finding the n control points that interpolate n given data points on a linear array processor with p processors, where p ≤ n. Given m×n data points, we then present an O(mn/(p1 p2))-time parallel algorithm for solving the uniform bicubic B-spline surface fitting problem on a p1×p2 mesh-connected computer, where p1 ≤ m and p2 ≤ n. Relative error analyses of our two stable and cost-optimal parallel solvers are also given. When setting p1 = m and p2 = n, a constant-time parallel solver for B-spline surface fitting can be derived; this time- and cost-optimal result is a direct method, in contrast to the parallel iterative method of Cheng et al. (Parallel B-spline surface interpolation on a mesh-connected processor array, J. Parallel Distrib. Comput. 24, 2 (1995), 224–229).
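For background on the sequential problem being parallelized: interpolating n data points with a uniform cubic B-spline curve amounts to solving a tridiagonal linear system in the control points, since at interior points each data value must satisfy (P[i-1] + 4*P[i] + P[i+1])/6 = D[i]. Below is a minimal sequential sketch; the end condition (pinning the end control points to the end data points) and all names are illustrative assumptions, and the paper's matrix-perturbation-based parallel solver is not shown.

```python
def solve_tridiagonal(a, b, c, d):
    """Thomas algorithm for a tridiagonal system:
    a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]  (a[0] and c[-1] unused)."""
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = (c[i] / m) if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def bspline_curve_fit(D):
    """Find control points P so that the uniform cubic B-spline interpolates
    the data D (one coordinate):  (P[i-1] + 4*P[i] + P[i+1]) / 6 = D[i].
    End condition used here (an illustrative assumption): P[0] = D[0], P[-1] = D[-1]."""
    n = len(D)
    a = [1.0] * n
    b = [4.0] * n
    c = [1.0] * n
    d = [6.0 * x for x in D]
    a[0], b[0], c[0], d[0] = 0.0, 1.0, 0.0, D[0]
    a[-1], b[-1], c[-1], d[-1] = 0.0, 1.0, 0.0, D[-1]
    return solve_tridiagonal(a, b, c, d)
```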

6.
An active area of research in supercomputing is concerned with mapping certain finite sums, such as discrete Fourier transforms, onto arrays of processors. This paper presents systolic mapping techniques that exploit the parallelism inherent in discrete Fourier transforms. It is established that, for an M-dimensional signal, parallel executions of such transforms are closely related to mappings of an (M + 1)-dimensional finite vector space into itself. Three examples of such parallel schemes are then described for the discrete Fourier transform of a two-dimensional finite-extent sequence of size N1 × N2. The first is a linear array of N1 + N2 − 1 processors and takes O(N1 N2) steps. The second is an N1 × N2 rectangular array of processors and takes O(N1 + N2) steps, and the third is a hexagonal array which uses N1 N2 + (N2 − 1)(N1 + N2 − 1) processors and O(N1 + N2) steps. All three mappings are optimal in that they achieve asymptotically the highest speedup possible over the sequential execution of the same transform, and can easily be generalized to higher dimensions.
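For reference, the finite sum being mapped is the ordinary 2-D discrete Fourier transform of an N1 × N2 sequence; a direct (non-systolic) sketch of that definition, with illustrative names, is:

```python
import cmath

def dft2(x):
    """Direct 2-D DFT of an N1 x N2 complex sequence x (list of lists).
    Purely a reference definition; the paper maps this sum onto systolic arrays."""
    N1, N2 = len(x), len(x[0])
    X = [[0j] * N2 for _ in range(N1)]
    for k1 in range(N1):
        for k2 in range(N2):
            s = 0j
            for n1 in range(N1):
                for n2 in range(N2):
                    s += x[n1][n2] * cmath.exp(-2j * cmath.pi * (k1 * n1 / N1 + k2 * n2 / N2))
            X[k1][k2] = s
    return X
```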

7.
We present three explicit schemes for distributing M variables among N memory modules, where M = Θ(N^1.5), M = Θ(N^2), and M = Θ(N^3), respectively. Each variable is replicated into a constant number of copies stored in distinct modules. We show that N processors, directly accessing the memories through a complete interconnection, can read/write any set of N variables in worst-case time O(N^(1/3)), O(N^(1/2)), and O(N^(2/3)), respectively, for the three schemes. The access times for the last two schemes are optimal with respect to the particular redundancy values used by such schemes. The address computation can be carried out efficiently by each processor without recourse to a complete memory map, requiring only O(1) internal storage.

8.
Li, Jie; Pan, Yi; Shen, Hong. The Journal of Supercomputing, 2003, 24(3): 251–258
Topological sort of an acyclic graph has many applications such as job scheduling and network analysis. Due to its importance, it has been tackled on many models. Dekel et al. [3] proposed an algorithm for solving the problem in O(log^2 N) time on the hypercube or shuffle-exchange networks with O(N^3) processors. Chaudhuri [2] gave an O(log N) algorithm using O(N^3) processors on a CRCW PRAM model. On the LARPBS (Linear Arrays with a Reconfigurable Pipelined Bus System) model, Li et al. [5] showed that the problem for a weighted directed graph with N vertices can be solved in O(log N) time by using N^3 processors. In this paper, a more efficient topological sort algorithm is proposed on the same LARPBS model. We show that the problem can be solved in O(log N) time by using N^3/log N processors. The algorithm has better time and processor complexities than the best algorithm on the hypercube, and the same time complexity but better processor complexity than the best algorithm on the CRCW PRAM model.
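For context, the underlying sequential problem is ordinary topological sorting of a directed acyclic graph; a minimal sequential sketch (Kahn's algorithm), not the LARPBS algorithm itself, with illustrative names:

```python
from collections import deque

def topological_sort(n, edges):
    """n vertices 0..n-1, edges = list of (u, v) meaning u must precede v.
    Returns a topological order, or None if the graph contains a cycle."""
    adj = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in range(n) if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return order if len(order) == n else None
```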

9.
Previous research on developing parallel triangulation algorithms concentrated on triangulating planar point sets. Algorithms with O(log^3 n) running time using O(n) processors have been developed in Refs. 1 and 2. Atallah and Goodrich(3) presented a data structure that can be viewed as a parallel analogue of the sequential plane-sweeping paradigm, which can be used to triangulate a planar point set in O(log n log log n) time using O(n) processors. Recently, Merks(4) described an algorithm for triangulating point sets which runs in O(log n) time using O(n) processors, and is thus optimal. In this paper we develop a parallel algorithm for triangulating simplicial point sets in arbitrary dimensions, based on the idea of the sequential algorithm presented in Ref. 5. The algorithm runs in O(log^2 n) time using O(n/log n) processors. The product of its running time and number of processors is O(n log n); i.e., it achieves an optimal speed-up.

10.
We present several fast algorithms for multiple addition and prefix sums on the Linear Array with a Reconfigurable Pipelined Bus System (LARPBS), a recently proposed architecture based on optical buses. Our algorithm for adding N integers runs on an N log M-processor LARPBS in O(log* N) time, where log* N is the number of times the logarithm has to be taken to reduce N below 1 and M is the largest integer in the input. Our addition algorithm improves the time complexity of several matrix multiplication algorithms proposed by Li, Pan and Zheng (IEEE Trans. Parallel and Distributed Systems, 9(8):705–720, 1998). We also present several fast algorithms for computing prefix sums of N integers on the LARPBS. For integers with bounded magnitude, our first algorithm for prefix sum computation runs in O(log log N) time using N processors and in O(1) time using N^(1+ε) processors, for ε < 1. For integers with unbounded magnitude, the first algorithm for multiple addition runs in O(log log N log* N) time using N log M processors, where M is the largest integer in the input. Our second algorithm for multiple addition runs in O(log* N) time using N^(1+ε) log M processors, for ε < 1. We also show suitable extensions of our algorithms to real numbers.
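As background, the prefix-sum (scan) operation the paper accelerates is shown below, both as the sequential O(N) loop and as a recursive-doubling formulation whose log2(N) rounds mirror the structure that parallel models exploit (the doubling version is only simulated sequentially here; names are illustrative):

```python
def prefix_sums(a):
    """Sequential inclusive prefix sums: out[i] = a[0] + ... + a[i]."""
    out, running = [], 0
    for x in a:
        running += x
        out.append(running)
    return out

def prefix_sums_doubling(a):
    """Recursive-doubling scan: log2(N) rounds; in each round all positions
    could be updated in parallel, which is simulated sequentially here."""
    out = list(a)
    step = 1
    while step < len(out):
        out = [out[i] + (out[i - step] if i >= step else 0) for i in range(len(out))]
        step *= 2
    return out
```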

11.
This paper considers the histogramming problem on the hypercube. An N-PE hypercube is used to process an N^(1/2) × N^(1/2) digitized image in which each pixel has a gray-level value between 0 and M − 1. In general, M, the range of gray-level values, is much smaller than N, the number of pixels being processed. Our algorithm generates the histogram of the image in O(log M · log N) time using radix sort and efficient data-movement operations. The technique can also be implemented on butterfly, shuffle-exchange, and fat-pyramid organizations.
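The operation being parallelized is simply gray-level counting; a minimal sequential reference, with illustrative names, is:

```python
def histogram(image, M):
    """image: 2-D list of gray levels in 0..M-1.
    Returns h with h[g] = number of pixels having gray level g.
    Sequential reference only, not the hypercube algorithm."""
    h = [0] * M
    for row in image:
        for g in row:
            h[g] += 1
    return h
```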

12.
《Real》1996,2(6):373-382
This paper presents the architecture and the implementation of template matching on a 3-D piecewise-regular processor space that forms a two-dimensional array of linear systolic arrays. Template matching can be considered as a 2-D convolution of an image of size N × N with a kernel of size r × r. Conventional high-speed implementations use 2-D systolic arrays of size O(r^2) which compute in O(N^2) time. The drawback of this solution is that the size of the processor array is tied to the size of the convolution kernel, which does not permit allocating more processors in order to meet real-time requirements. With the approach used in this paper, the size of the processor array may be extended up to O(s·r^2), 1 ≤ s ≤ N, thereby accomplishing the calculations in O(N^2/s) time. In the case when s = r, an r × r mesh of 1-D systolic arrays of size O(r) is obtained. The piecewise regularity of the 3-D processor array also allows easy physical realization.
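A direct sequential sketch of the computation being systolized: sliding an r × r template over an N × N image and accumulating a correlation score at each position (the choice of a sum-of-products score and all names are illustrative assumptions):

```python
def template_match(image, template):
    """Sequential reference for template matching viewed as 2-D correlation:
    out[i][j] = sum over the r x r window anchored at (i, j) of image * template.
    Runs in O(N^2 r^2) time; the paper maps this computation onto systolic arrays."""
    N, r = len(image), len(template)
    out = [[0] * (N - r + 1) for _ in range(N - r + 1)]
    for i in range(N - r + 1):
        for j in range(N - r + 1):
            s = 0
            for a in range(r):
                for b in range(r):
                    s += image[i + a][j + b] * template[a][b]
            out[i][j] = s
    return out
```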

13.
The main results of this paper are efficient parallel algorithms, MSP and LOCATE, for computing minimal spanning trees and locating minimal paths in directed graphs, respectively. Algorithm MSP has time complexity O(log^3 n) using O(n^3/log n) processors, while LOCATE has time complexity O(log n) using O(n^2) processors. Algorithm MSP is derived from sequential algorithms, when the unbounded parallelism model is used.

14.
The recursive shortest spanning tree (RSST) algorithm has been used in various image and video coding systems, where speed requirements are very demanding. However, the RSST algorithm is too complex to perform in a reasonable time, which motivates the present work. This paper presents a distributed algorithm that constructs a recursive shortest spanning tree for image segmentation with a fixed number of processors. Using a tailored data-partition strategy to assign jobs to processors in our proposed parallel recursive shortest spanning tree (PRSST) algorithm, we derive a new lower bound of O(n) for one processor with an n-pixel image. The complexity of the proposed algorithm is cost-optimal, and the total number of messages required is O(c) for images of any size, where c is a small constant. The objective quality of segmented images produced by our PRSST differs by at most 1.5 dB from those generated by Morris's RSST algorithm for images of size 128 × 128 and 150 × 150 pixels.

15.
We give the first efficient parallel algorithms for solving the arrangement problem. We give a deterministic algorithm for the CREW PRAM which runs in nearly optimal bounds of O(log n log* n) time and n^2/log n processors. We generalize this to obtain an O(log n log* n)-time algorithm using n^d/log n processors for solving the problem in d dimensions. We also give a randomized algorithm for the EREW PRAM that constructs an arrangement of n lines on-line, in which each insertion is done in optimal O(log n) time using n/log n processors. Our algorithms develop new parallel data structures and new methods for traversing an arrangement. This work was supported by the National Science Foundation, under Grants CCR-8657562 and CCR-8858799, NSF/DARPA under Grant CCR-8907960, and Digital Equipment Corporation. A preliminary version of this paper appeared at the Second Annual ACM Symposium on Parallel Algorithms and Architectures [3].

16.
Shellsort with a constant number of increments
M. A. Weiss. Algorithmica, 1996, 16(6): 649–654
We consider the worst-case running time of Shellsort when only a constant number, c, of increments are allowed. For c = 3, we show that Shellsort can be implemented to run in O(N^(5/3)) time, which is optimal. For c = 4, we further improve the running time to O(N^(11/7)), and for c = 5 we obtain a bound of O(N^(23/15)). We also show an O(N^(1+1/k)) bound for general c, where k = ⌊(1 + √(8c+1))/4⌋. For c = 6, this is O(N^(3/2)).
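For reference, Shellsort with a constant number of increments simply runs one gapped insertion-sort pass per increment; a sketch for c = 3 (the concrete gap values below are illustrative, not the increment sequences analysed in the paper):

```python
def shellsort_const_increments(a, gaps=(9, 3, 1)):
    """Shellsort with a fixed, constant-length increment sequence.
    The final gap must be 1 so the last pass is a full insertion sort."""
    a = list(a)
    for gap in gaps:
        for i in range(gap, len(a)):
            x = a[i]
            j = i
            while j >= gap and a[j - gap] > x:
                a[j] = a[j - gap]
                j -= gap
            a[j] = x
    return a
```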

17.
Consider any known sequential algorithm for matrix multiplication over an arbitrary ring with time complexity O(N^α), where 2 < α ≤ 3. We show that such an algorithm can be parallelized on a distributed memory parallel computer (DMPC) in O(log N) time by using N^α/log N processors. Such a parallel computation is cost-optimal and matches the performance of the PRAM. Furthermore, our parallelization on a DMPC can be made fully scalable; that is, for all 1 ≤ p ≤ N^α/log N, multiplying two N×N matrices can be performed by a DMPC with p processors in O(N^α/p) time, i.e., linear speedup and cost optimality can be achieved in the range [1..N^α/log N]. This unifies all known algorithms for matrix multiplication on DMPC, standard or non-standard, sequential or parallel. Extensions of our methods and results to other parallel systems are also presented. For instance, for all 1 ≤ p ≤ N^α/log N, multiplying two N×N matrices can be performed by p processors connected by a hypercubic network in O(N^α/p + (N^2/p^(2/α))(log p)^(2(α−1)/α)) time, which implies that if p = O(N^α/(log N)^(2(α−1)/(α−2))), linear speedup can be achieved. Such a parallelization is highly scalable. These results represent significant progress in scalable parallel matrix multiplication (as well as in solving many other important problems) on distributed memory systems, both theoretically and practically.
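As a minimal illustration of the scalability claim for the baseline α = 3 algorithm, the O(N^3) work splits naturally into row blocks, one per processor, each costing O(N^3/p); this is only an illustrative sketch of that partition, not the DMPC algorithm, and all names are assumptions:

```python
def matmul_rows(A, B, row_lo, row_hi):
    """Multiply the row block A[row_lo:row_hi] by B with the standard triple loop.
    In a p-processor setting, each processor would own one such block of rows,
    doing O(N^3 / p) of the total O(N^3) work."""
    n = len(B)        # inner dimension (columns of A, rows of B)
    m = len(B[0])     # columns of B
    block = []
    for i in range(row_lo, row_hi):
        row = [0] * m
        for k in range(n):
            aik = A[i][k]
            for j in range(m):
                row[j] += aik * B[k][j]
        block.append(row)
    return block
```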

18.
Efficient algorithms to compute the Hough transform on MIMD and SIMD hypercube multicomputers are developed. Our algorithms can compute p angles of the Hough transform of an N × N image, p ≤ N, in O(p + log N) time on both MIMD and SIMD hypercubes. These algorithms require O(N^2) processors. We also consider the computation of the Hough transform on MIMD hypercubes with a fixed number of processors. Experimental results on an NCUBE/7 hypercube are presented. This research was supported by the National Science Foundation under grants DCR84-20935 and 86-17374. All correspondence should be mailed to Sanjay Ranka.
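A sequential sketch of the computation being parallelized: accumulating the (ρ, θ) Hough transform of an N × N binary edge image over p angles. The bin size and rounding convention below are illustrative choices, not those of the paper.

```python
import math

def hough_transform(image, p):
    """image: N x N 0/1 edge map; p: number of angles.
    acc[t][r] counts edge pixels lying on the line rho = x*cos(theta_t) + y*sin(theta_t)."""
    N = len(image)
    rho_max = int(math.ceil(math.sqrt(2) * N))
    acc = [[0] * (2 * rho_max + 1) for _ in range(p)]
    thetas = [t * math.pi / p for t in range(p)]
    for y in range(N):
        for x in range(N):
            if image[y][x]:
                for t, theta in enumerate(thetas):
                    rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
                    acc[t][rho + rho_max] += 1   # shift so negative rho indexes the array
    return acc
```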

19.
A cycle C passing through two specific vertices s and t of a biconnected graph is said to be an st-ambitus if its bridges do not interlace in a certain special way. We present algorithms for finding an st-ambitus in planar biconnected graphs, which are much simpler than the one known for general graphs [MT]. Our algorithm runs in O(n) time on a sequential machine and in O(log n) parallel time using O(n/log n) processors on an EREW PRAM.

20.
Trees are a useful data type, but they are not routinely included in parallel programming systems, in part because their irregular structure makes partitioning and scheduling difficult. We present a method for algebraically constructing implementations of tree skeletons, high-level homomorphic operations that execute in parallel. Many computations on binary trees can be performed in O(log n) parallel time using n processors, even taking account of communication costs. We extend these results to trees with arbitrary and variable degree. Then we show that it is possible to implement a distributed version of homomorphisms on binary trees, taking O(n/p + log^2 p) parallel time on p < n processors, for trees of any skew and taking full account of communication costs. Under slightly stronger restrictions on the underlying functions, this can be improved to O(n/p + log p). Furthermore, the technique for deriving distributed versions is algebraic, allowing the automatic generation of code for SPMD and data-parallel architectures.
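For concreteness, one simple form of binary-tree homomorphism combines a leaf function and an internal-node function bottom-up; on a balanced tree the combining steps of each level are independent and can run in parallel, level by level. The sketch below is sequential and purely illustrative; the class and function names are assumptions.

```python
class Node:
    """Binary tree node: leaves carry a value, internal nodes have two children."""
    def __init__(self, value=None, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def tree_hom(t, leaf_f, node_f):
    """Simple binary-tree homomorphism: apply leaf_f at leaves and node_f to
    combine the two recursive results at internal nodes."""
    if t.left is None and t.right is None:
        return leaf_f(t.value)
    return node_f(tree_hom(t.left, leaf_f, node_f),
                  tree_hom(t.right, leaf_f, node_f))

# Example: sum of leaf values.
# total = tree_hom(root, lambda v: v, lambda a, b: a + b)
```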
