Similar Documents
20 similar documents found; search took 140 ms.
1.
Two types of broadcast in algorithms are identified: (1) a data broadcast, where one data value is used in more than one computation, and (2) a computational broadcast, where one variable is computed in more than one computation. Both types of broadcast should be eliminated when a processor array implementation using VLSI technology is desired.

When the algorithm computes only one variable value for each index vector, the computational broadcast can be eliminated in a straightforward manner by introducing counter values, resulting in single assignment code. However, the case in which the algorithm computes two or more variable values, each specified by a different computational broadcast, has not been considered; as far as is known, it has so far only been handled heuristically, by deriving localized algorithms in single assignment code.

In this paper we define this problem in terms of a system of affine recurrence equations and analyze the data dependencies introduced. We then present a synthesis procedure that eliminates the computational broadcast, together with a few implementation examples. The QR decomposition algorithm is also expressed in localized single assignment code using the proposed method, and several different parallel implementations are discussed.
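To make the notion of localization concrete, the sketch below shows the simpler data-broadcast case on a matrix-vector product: the broadcast value is replaced by a pipelined copy, so every value consumed at an index point comes from a neighbouring index point. The function names and the example are illustrative assumptions, not taken from the paper, whose own procedure targets the computational broadcast.

```python
import numpy as np

def matvec_broadcast(A, x):
    """Reference computation: x[j] is broadcast to every row i (a data broadcast)."""
    n, m = A.shape
    y = np.zeros(n)
    for i in range(n):
        for j in range(m):
            y[i] += A[i, j] * x[j]      # the same x[j] is read at every index i
    return y

def matvec_localized(A, x):
    """Single-assignment form: the broadcast value is pipelined from row to row,
    so every value used at index (i, j) is produced at a neighbouring index point."""
    n, m = A.shape
    X = np.zeros((n, m))                # pipelined copy of x
    Y = np.zeros((n, m + 1))            # running sums, Y[i, 0] = 0
    for i in range(n):
        for j in range(m):
            X[i, j] = x[j] if i == 0 else X[i - 1, j]   # propagate instead of broadcast
            Y[i, j + 1] = Y[i, j] + A[i, j] * X[i, j]
    return Y[:, m]

A = np.arange(12.0).reshape(3, 4)
x = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(matvec_broadcast(A, x), matvec_localized(A, x))
```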

2.
In this paper we analyze algorithms expressed as a system of recurrence equations. The algorithms are called 2→1 output algorithms if two output values of one function (variable identification) are specified by the system of recurrence equations for each index point in the algorithm. The algorithm is in free form if the indexes of these two values are not dependent. Two standard classes are determined by this criterion: the nearest neighbour form and the all pair form. For example, the sorting algorithm can be expressed in the all pair form (e.g., the linear insertion algorithm) or in the nearest neighbour form (e.g., the bubble sort algorithm); these algorithms are, however, different in their nature. A procedure to eliminate the computational broadcast for the all pair 2→1 output algorithm was proposed by the authors in [1]. Implementing this procedure yields a localized form of the algorithm and, by eliminating the computational and data broadcast, a system of uniform recurrence equations, so the data dependence method can be used efficiently for parallel implementations. The proposed procedure cannot be applied directly to nearest neighbour form algorithms. Here we show how the algorithm can be restructured into a form where the computational and data broadcast can be eliminated. These transformations result in localized algorithms. A few examples show how these algorithms can be implemented on processor arrays. For example, the Gentleman-Kung triangular array [2] can be used for solving the QR decomposition algorithm for both forms of the algorithm; the implementations differ in the order of the data flow and the processor operation. We show that the implementation of the nearest neighbour algorithm is even better than the standard one.
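As an illustration of the nearest neighbour form mentioned above, the following sketch (not taken from the paper) writes bubble sort as a regular iteration in which each index point performs a compare-exchange between adjacent elements and defines two output values.

```python
def bubble_sort_recurrence(a):
    """Nearest-neighbour form: at each step only adjacent elements interact,
    and each compare-exchange defines two output values (the min and the max)."""
    x = list(a)
    n = len(x)
    for k in range(n - 1):              # time steps of the schedule
        for i in range(n - 1):
            lo, hi = min(x[i], x[i + 1]), max(x[i], x[i + 1])
            x[i], x[i + 1] = lo, hi     # two outputs of one compare-exchange
    return x

assert bubble_sort_recurrence([5, 1, 4, 2, 3]) == [1, 2, 3, 4, 5]
```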

3.
Given a regular application described by a system of uniform recurrence equations, systolic arrays are commonly derived by means of an affine transformation; an affine schedule determines when the computations are performed, and an affine processor allocation determines where they are performed. Circuit transformations are then applied to the resulting circuit when the application needs to be mapped onto a smaller-size array. This method proceeds in two steps and thus can hardly be optimized globally.

We hereafter present a different method for designing small-size arrays. We derive them in one step by means of an affine schedule and a near-affine processor allocation. By doing so, we can generalize the optimization techniques for affine mappings so that they also apply here. The method is illustrated on the band-matrix multiplication and convolution algorithms.


4.
We propose a method based on geometrical tools to map problems onto regular and synchronous processor arrays. The problems we consider are defined by systems of affine recurrence equations (SARE). From such a problem specification we extract the data dependencies in terms of two classes of vectors: the utilization vectors and the dependence vectors. We use these vectors to express constraints on the timing or the allocation functions. We differentiate two classes of constraints. The causal ones are intrinsic timing constraints induced by the system of equations defining the problem. A given choice of target architecture may impose new constraints on the timing or the allocation; we call these the architecture-related constraints. We use these constraints to determine first an affine timing function and then an allocation by projection. We finally illustrate the method with three examples: the matrix multiplication, the recursive convolution and the LL^T Cholesky factorization.
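A minimal sketch of the timing constraints described above, using the matrix multiplication example; the dependence vectors, the candidate schedule and the projection direction are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Uniform dependence vectors of a localized matrix-multiplication recurrence
# C[i,j,k] = C[i,j,k-1] + A[i,j,k] * B[i,j,k], with A pipelined along j and B along i.
deps = np.array([[0, 0, 1],    # accumulation of C along k
                 [0, 1, 0],    # propagation of A along j
                 [1, 0, 0]])   # propagation of B along i

lam = np.array([1, 1, 1])      # candidate affine timing vector: t(i,j,k) = i + j + k
proj = np.array([0, 0, 1])     # allocation by projection along k: processor (i, j)

# Causality: every dependence must take at least one time step.
assert all(lam @ d >= 1 for d in deps)

# The projection must not map two simultaneous computations to the same processor,
# which holds as long as the schedule is not constant along the projection direction.
assert lam @ proj != 0
print("t(i,j,k) = i + j + k is a valid schedule; allocation (i, j) by projection along k")
```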

5.
This paper presents a systematic technique for expressing a string search algorithm as a regular iterative expression to explore all possible processor arrays for deep packet classification. The computation domain of the algorithm is obtained and three affine scheduling functions are presented. The technique allows some of the algorithm variables to be pipelined while others are broadcast over system-wide buses. Nine possible processor array structures are obtained and analyzed in terms of speed, area, power, and I/O timing requirements. Time complexities are derived analytically and through extensive numerical simulations. The proposed designs exhibit optimum speed and area complexities. The processor arrays are compared with previously derived processor arrays for the string matching problem.
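For orientation, the sketch below (a plain sequential version, not the paper's processor array) writes brute-force string matching as a regular iteration over the index points (i, j); it is this iteration structure that scheduling functions like those in the abstract map onto processor arrays.

```python
def string_search_recurrence(text, pattern):
    """Brute-force string matching written as a regular iteration over (i, j):
    acc[i][j] = acc[i][j-1] AND (text[i+j] == pattern[j]).  Variables such as the
    pattern character can either be pipelined along i or broadcast over a bus,
    which is the kind of design choice the abstract refers to."""
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        acc = True
        for j in range(m):
            acc = acc and (text[i + j] == pattern[j])
        if acc:
            hits.append(i)
    return hits

assert string_search_recurrence("abracadabra", "abra") == [0, 7]
```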

6.
An integrated approach is proposed to the design of economically efficient and high-performance processor arrays with a systolic organization of computations. The approach includes the construction of VLSI-oriented versions of locally recursive algorithms and the synthesis of new processor-array architectures for the transformed algorithms that take maximal account of the fundamental restrictions of VLSI technology. Within the framework of this approach, strategies are developed for obtaining the above-mentioned algorithms and architectures.

7.

The class of systems of uniform recurrence equations (UREs) is closed under unimodular transformations. As a result, every systolic array described by a unimodular mapping can be specified by a system of space-time UREs, in which the time and space coordinates are made explicit. As non-unimodular mappings are frequently used in systolic designs, this paper presents a method that derives space-time equations for systolic arrays described by non-unimodular mappings. The space-time equations for non-unimodular mappings are known elsewhere as sparse UREs (SUREs) because the domains of their variables are sparse and their data dependences are uniform. Our method is compositional in that space-time SUREs can be further transformed by unimodular and non-unimodular mappings, allowing a straightforward implementation in systems like ALPHA. Specifying a systolic design by space-time equations has two advantages. First, the space-time equations exhibit all useful properties about the design, allowing the design to be formally verified. Second, depending on the application area and performance requirement, the space-time equations can be realised as custom VLSI systems, FPGAs, or programs to be run on a parallel computer.

8.
In this paper, one- and two-dimensional processor arrays for the orthogonal solution of systems of linear equations are presented. Each processor cell of these arrays is able to carry out a Givens rotation. By using a modified Givens rotation algorithm, only multiplications and additions need to be executed in these cells. Compared to previous approaches for an orthogonal solution of systems of linear equations, the presented processor arrays have the following advantages:
• the hardware in an individual processor cell consists only of multipliers and adders;
• almost full utilization (asymptotically 100%) of these hardware components in an active processor cell can be achieved;
• the latency of one Givens rotation, which determines the data pulse frequency of the processor arrays, can be reduced from O(b) to O(log₂ b), where b is the word length of the data.

9.
B. Lisper, Algorithmica, 1996, 15(2): 193-203
Many regular algorithms, suitable for VLSI implementation, are naturally described by sets of integer index vectors together with a rule that assigns a computation to each vector. Regular VLSI structures for such algorithms can be found by mapping the index vectors to a discrete space-time with integer coordinates. If the scope is restricted to linear or affine mappings, then the minimization of the execution time for the VLSI implementation with respect to the space-time mapping is essentially an integer linear programming (ILP) problem. If the entries in the vector describing the time function must be integers, ILP techniques can be applied directly. There are, however, index sets that allow space-time mappings with rational, nonintegral entries. In such cases, ILP will not consider all possible affine time functions and an optimal solution may go unnoticed. In this paper we give sufficient conditions on the index set for when only integer time functions are allowed. We also give a general algorithm to find a preconditioning affine transformation of the index set, such that the transformed index set allows only integer time functions. ILP methods can then be used to find time-optimal architectures for the transformed algorithm. This considerably extends the class of algorithms for which time-optimal VLSI structures can be found. Communicated by M. Snir.
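The following toy sketch is only a stand-in for the ILP formulation discussed in the abstract: it enumerates small integer timing vectors that satisfy the causality constraints of a 2-D uniform recurrence and keeps the one with the smallest latency over a small index set. The dependence vectors and the index set are assumptions made for illustration.

```python
import itertools
import numpy as np

# Search over integer timing vectors lambda that respect the dependences and
# minimise the total latency over a small rectangular index set.  A real tool
# would pose this as an integer linear program instead of enumerating.
deps = np.array([[1, 0], [0, 1]])              # uniform dependences of a 2-D recurrence
points = np.array(list(itertools.product(range(8), range(8))))

best = None
for lam in itertools.product(range(1, 4), repeat=2):
    lam = np.array(lam)
    if all(lam @ d >= 1 for d in deps):        # causality constraints
        latency = (points @ lam).max() - (points @ lam).min()
        if best is None or latency < best[1]:
            best = (lam, latency)

print("best integer schedule:", best[0], "latency:", best[1])
```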

10.
11.
The development and implementation of systems for the more complex real-time image processing and scene understanding tasks, such as robot vision and remote surveillance, calls for faster computation than is possible using the traditional serial computer. The advent of VLSI has made feasible the consideration of more specialized processing architectures, designed to support these data rates, while keeping systems compact and relatively cheap. Two approaches are discussed: the use of a programmable processor array, and the customizing of image processing algorithms in silicon. This paper examines designs based upon each approach in the light of the techniques and constraints of VLSI. In particular we describe in some detail an example of a VLSI parallel array processor, the Grid (GEC rectangular image and data processor), and a number of special-purpose CMOS/SOS chips based on systolic design techniques.

12.
Mu Dejun, Tong Ming'an, Robot (机器人), 1997, 19(2): 81-85
Based on a VLSI systolic structure, this paper proposes a parallel computation method for parameter identification of the linearized robot model and for the disturbance torque in adaptive control. Timing analysis shows that the method greatly improves the real-time performance of the disturbance torque computation in robot adaptive control.

13.
Initially, parallel algorithms were designed by parallelising the existing sequential algorithms for frequently occurring problems on available parallel architectures.

More recently, parallel strategies have been identified and utilised, resulting in many new parallel algorithms. However, the analysis of such techniques reveals that further strategies can be applied to increase the parallelism. One of these strategies, namely increasing the computational work in each processing node, can reduce memory accesses and hence congestion in a shared-memory multiprocessor system. Similarly, when network message passing is minimised in a distributed-memory processor system, dramatic improvements in the performance of the algorithm ensue.

A frequently occurring computational problem in digital signal processing (DSP) is the solution of symmetric positive definite Toeplitz linear systems. The Levinson algorithm for solving such linear equations exploits the Toeplitz matrix property to advantage in the elimination process of each element. However, it can be shown that in the Parallel Implicit Elimination (PIE) method, where more than one element is eliminated simultaneously, the Toeplitz structure can again be exploited to advantage. This relatively simple strategy yields a reduction in accesses to shared memory or in network message passing, resulting in a significant improvement in the performance of the algorithm [2].
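For reference, here is a compact sketch of the sequential Levinson recursion the abstract starts from; it solves a symmetric positive-definite Toeplitz system in O(n²) operations. The parallel PIE variant described in the abstract is not reproduced here.

```python
import numpy as np

def levinson_solve(t, y):
    """Solve T x = y for a symmetric positive-definite Toeplitz matrix T whose
    first column is t, using the O(n^2) Levinson recursion."""
    n = len(t)
    f = np.array([1.0 / t[0]])          # forward vector:  T_k f = e_1
    b = f.copy()                        # backward vector: T_k b = e_k
    x = np.array([y[0] / t[0]])         # solution of the leading k-by-k system
    for k in range(1, n):
        # Errors produced by extending the previous vectors with a zero entry.
        eps_f = np.dot(t[k:0:-1], f)    # last row of T_(k+1) times [f; 0]
        eps_b = np.dot(t[1:k + 1], b)   # first row of T_(k+1) times [0; b]
        denom = 1.0 - eps_f * eps_b
        f_ext = np.append(f, 0.0)
        b_ext = np.append(0.0, b)
        f = (f_ext - eps_f * b_ext) / denom
        b = (b_ext - eps_b * f_ext) / denom
        eps_x = np.dot(t[k:0:-1], x)    # last row of T_(k+1) times [x; 0]
        x = np.append(x, 0.0) + (y[k] - eps_x) * b
    return x

t = np.array([4.0, 1.0, 0.5, 0.25])
y = np.array([1.0, 2.0, 3.0, 4.0])
T = np.array([[t[abs(i - j)] for j in range(4)] for i in range(4)])
assert np.allclose(T @ levinson_solve(t, y), y)
```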

14.
In this paper, we propose a new I/O-overhead-free parallel algorithm based on Givens rotations for solving a system of linear equations. The algorithm uses a new technique called two-sided elimination and requires an N×(N+1) mesh-connected processor array to solve N linear equations in (5N-log N-4) time steps. The array is well suited for VLSI implementation, as identical processors with a simple and regular interconnection pattern are required. We also describe a fault-tolerant scheme based on an algorithm-based fault tolerance (ABFT) approach. This scheme has small hardware and time overhead and can tolerate up to N processor failures.
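The sketch below shows the basic Givens-rotation elimination that a processor cell would perform, applied sequentially to solve a small system; it illustrates the rotation operation only and is not the paper's two-sided elimination scheme or its N×(N+1) array.

```python
import numpy as np

def givens_solve(A, b):
    """Solve A x = b by Givens rotations: each rotation zeroes one subdiagonal
    entry of A (the operation a single cell performs), after which the resulting
    triangular system is solved."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = A.shape[0]
    for j in range(n - 1):
        for i in range(j + 1, n):
            r = np.hypot(A[j, j], A[i, j])
            if r == 0.0:
                continue
            c, s = A[j, j] / r, A[i, j] / r
            G = np.array([[c, s], [-s, c]])      # 2x2 plane rotation
            A[[j, i], :] = G @ A[[j, i], :]      # apply to rows j and i of A
            b[[j, i]] = G @ b[[j, i]]            # and to the right-hand side
    return np.linalg.solve(np.triu(A), b)        # solve the triangular factor

A = np.array([[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]])
b = np.array([1.0, 2.0, 3.0])
assert np.allclose(A @ givens_solve(A, b), b)
```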

15.
A dictionary machine is an important VLSI system performing high-speed data archival operations. In this paper, we present a design which can efficiently implement dictionary machines in VLSI processor arrays. In order to process the operations of a dictionary machine effectively, a hexagonal mesh is selected as the host topology, in which two different networks for update and query operations are embedded. The proposed design is simple to implement and allows high throughput.

16.
This paper deals with communication optimization, which is a crucial issue in automatic parallelization. Starting from a system of parameterized affine recurrence equations, we propose a heuristic that determines a set of efficient space-time transformations. It focuses on removing distant communications by means of broadcast (including anticipated broadcast) and on enforcing locality.

17.
Programs and systems of recurrence equations may be represented as sets of actions which are to be executed subject to precedence constraints. In many cases, actions may be labelled by integral vectors in some iteration domains, and precedence constraints may be described by affine relations. A schedule for such a program is a function which assigns an execution date to each action. Knowledge of such a schedule allows one to estimate the intrinsic degree of parallelism of the program and to compile a parallel version for multiprocessor architectures or systolic arrays. This paper deals with the problem of finding closed-form schedules as affine or piecewise affine functions of the iteration vector. An algorithm is presented which reduces the scheduling problem to a parametric linear program of small size, which can be readily solved by an efficient algorithm.

18.

Systolic algorithms for preconditioned iterative procedures are described and discussed in relation to existing systolic arrays for iterative methods applied to the systems of linear equations arising from the solution of partial differential equations. For boundary value problems it is shown that, when the hardware cost of the preconditioned systolic arrays is compared with that of the (standard) iterative methods, savings in the number of array cells can be made when the system involved is large and sparse (narrow bandwidth), with a significant improvement in convergence rate.

19.
This work is based on the design of a VLSI processor array comprising single-bit processing elements combined with Content Addressable Memory (CAM) [1,2]. The processors are connected in a linear array, with 64 currently being combined on a chip. Each processor is linked to 64 bits of data CAM and 4 bits of subset CAM (used for marking subsets of the array for subsequent processing). The architecture is targeted at image applications, including pixel-based processing as well as higher-level symbolic manipulation, and incorporates a data shift register linking all of the processing elements which allows data loading and processing to occur concurrently.

The current situation is that an extensive functional simulation package has been written [3] which allows algorithms to be coded and executed on a system which comprises an arbitrary number of array chips together with its controlling hardware. This allows algorithms to be investigated, and tuned to the architecture. A reduced design has been fabricated and the chips are undergoing parametric testing. A full version of the processor array chip will then be produced allowing a complete image system to be tested.

The VLSI design work undertaken so far [2] shows that the blocks which constitute the design can easily be replicated an arbitrary number of times (subject to chip size constraints) to create an application specific CAM array. The need for this type of flexibility has been borne out by the algorithmic work that has been carried out by a number of workers. In order to make the design of application specific arrays possible it is vital that the simulation tools are fast enough to allow adequate testing to be performed on the new design. It is for this reason that the original simulation package, written in C, has been transferred onto a transputer array.

This paper looks at the way in which the simulation is mapped onto the transputers so that an arbitrary number of them can be used. In addition, the problems of verification and validation of the simulator and the VLSI design are addressed. Results are given for a number of different applications which show very encouraging speed-ups. In many ways it has been found that the efficiency with which the simulation can be carried out with a large number of transputers mirrors the efficiency of the processor array in terms of communications overhead.


20.
We describe new architectures for the efficient computation of redundant manipulator kinematics (direct and inverse). By calculating the core of the problem in hardware, we can make full use of the redundancy by implementing more complex self-motion algorithms. A key component of our architecture is the calculation in VLSI hardware of the Singular Value Decomposition of the manipulator Jacobian. Recent advances in VLSI have allowed the mapping of complex algorithms to hardware using systolic arrays with advanced computer arithmetic algorithms, such as the coordinate rotation digital computer (CORDIC) algorithms. We use CORDIC arithmetic in the novel design of our special-purpose VLSI array, which is used in the computation of the Direct Kinematics Solution (DKS), the manipulator Jacobian, as well as the Jacobian pseudoinverse. Application-specific (subtask-dependent) portions of the inverse kinematics are handled in parallel by a DSP processor which interfaces with the custom hardware and the host machine. The architecture and algorithm development is valid for general redundant manipulators and a wide range of processors currently available and under development commercially.
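A short sketch of rotation-mode CORDIC, the shift-and-add arithmetic mentioned in the abstract: it computes a plane rotation (here, the cosine and sine of an angle) using only shifts, adds and a final constant scaling. It is a generic illustration, not the paper's SVD array.

```python
import math

def cordic_rotate(angle, iterations=32):
    """Rotation-mode CORDIC: rotate the vector (1, 0) by `angle` (radians,
    |angle| <= pi/2) using only shift-and-add style micro-rotations.
    Returns (cos(angle), sin(angle))."""
    # Precomputed micro-rotation angles and the constant scale factor K.
    alphas = [math.atan(2.0 ** -i) for i in range(iterations)]
    K = 1.0
    for i in range(iterations):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate towards the remaining angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * alphas[i]
    return K * x, K * y

c, s = cordic_rotate(math.pi / 5)
assert abs(c - math.cos(math.pi / 5)) < 1e-6 and abs(s - math.sin(math.pi / 5)) < 1e-6
```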
