Similar Documents
20 similar documents found (search time: 765 ms)
1.
Decomposition abstraction is the process of organizing and specifying decomposition strategies for the exploitation of parallelism available in an application. In this paper we develop and evaluate declarative primitives for rule-based programs that expand opportunities for parallel execution. These primitives make explicit the implicit relations among the data and, similarly, among the rules. The semantics of the primitives are presented in a general object-based framework so that they may be applied to most rule-based programming languages. We show how the additional information provided by the decomposition primitives can be incorporated into a semantic-based dependency analysis technique. The resulting analysis reveals parallelism at compile time that is very difficult, if not impossible, to discover by traditional syntactic analysis techniques. Simulation results demonstrate scalable and broadly available parallelism.

2.
This paper presents a new method that can be applied by a parallelizing compiler to find, without user intervention, the iteration and data decompositions that minimize communication and load-imbalance overheads in parallel programs targeted at NUMA architectures. One of the key ingredients in our approach is the representation of locality as a locality-communication graph (LCG) and the formulation of the compiler technique as a mixed integer nonlinear programming (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. This paper summarizes the process by which the compiler extracts the locality information from a nonannotated code, and focuses on how the compiler can derive the optimization problem, solve it, and generate the parallel code with the automatically selected iteration and data distributions. In addition, we include a discussion of our model and the solutions (the decompositions) that it provides. The approach presented in the paper is evaluated using several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines.
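The heart of the approach is treating decomposition selection as an optimization over modeled costs. The toy Python sketch below conveys only that idea: it brute-forces distribution choices for two loop nests against made-up communication and imbalance costs, whereas the paper solves a genuine MINLP over the locality-communication graph. All names and cost numbers here are invented for illustration.

```python
from itertools import product

# Candidate distributions for each loop nest; the imbalance and
# communication costs below are made-up numbers, not the paper's model.
dists = ["block_rows", "block_cols", "cyclic"]
imbalance = {"block_rows": 1.0, "block_cols": 1.0, "cyclic": 0.2}

def comm(d1, d2):
    # charge a communication cost when producer/consumer loops disagree
    return 0.0 if d1 == d2 else 3.0

best = min(product(dists, repeat=2),
           key=lambda p: imbalance[p[0]] + imbalance[p[1]] + comm(p[0], p[1]))
print(best)  # distribution pair with minimal modeled overhead
```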

3.
Algorithms for sparse nonnegative Tucker decompositions
There is an increasing interest in the analysis of large-scale multiway data. The concept of multiway data refers to arrays of data with more than two dimensions, that is, taking the form of tensors. To analyze such data, decomposition techniques are widely used. The two most common decompositions for tensors are the Tucker model and the more restricted PARAFAC model. Both models can be viewed as generalizations of regular factor analysis to data of more than two modalities. Nonnegative matrix factorization (NMF), in conjunction with sparse coding, has recently been given much attention due to its part-based and easily interpretable representation. While NMF has been extended to the PARAFAC model, no such attempt has been made to extend NMF to the Tucker model. However, if the tensor data analyzed are nonnegative, it may well be relevant to consider purely additive (i.e., nonnegative) Tucker decompositions. To reduce the ambiguities of this type of decomposition, we develop updates that can impose sparseness in any combination of modalities, yielding algorithms for sparse nonnegative Tucker decompositions (SN-TUCKER). We demonstrate how the proposed algorithms are superior to existing algorithms for Tucker decompositions when the data and interactions can be considered nonnegative. We further illustrate how sparse coding can help identify which model (PARAFAC or Tucker) is more appropriate for the data, as well as select the number of components by turning off excess components. The algorithms for SN-TUCKER can be downloaded from Mørup (2007).
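For intuition, here is a minimal two-way analogue of the multiplicative updates the abstract refers to: sparse NMF with an L1 penalty on one factor. SN-TUCKER applies updates of this flavor to the core and factor matrices of a full Tucker model; this sketch is the building block, not the authors' algorithm.

```python
import numpy as np

def sparse_nmf(V, r, lam=0.1, iters=200, eps=1e-9):
    """Multiplicative-update NMF minimizing ||V - W H||_F^2 + lam * sum(H).

    The lam term drives entries of H toward zero (sparse coding); the
    same update style generalizes to nonnegative Tucker fitting.
    """
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + lam + eps)   # sparsified factor
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.default_rng(1).random((30, 20)))
W, H = sparse_nmf(V, r=5)
print(np.linalg.norm(V - W @ H))  # reconstruction error
```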

4.
Twelve adaptive image-space decomposition algorithms are presented for sort-first parallel direct volume rendering (DVR) of unstructured grids on distributed-memory architectures. The algorithms are presented under a novel taxonomy based on the dimension of the screen decomposition, the dimension of the workload arrays used in the decomposition, and the scheme used for workload-array creation and querying the workload of a region. For the 2D decomposition schemes using 2D workload arrays, a novel scheme is proposed to query the exact number of screen-space bounding boxes of the primitives in a screen region in constant time. A probe-based chains-on-chains partitioning algorithm is exploited for load balancing in optimal 1D decomposition and iterative 2D rectilinear decomposition (RD). A new probe-based optimal 2D jagged decomposition (OJD) is proposed which is much faster than the dynamic-programming based OJD scheme proposed in the literature. The summed-area table is successfully exploited to query the workload of a rectangular region in constant time in both the OJD and RD schemes for the subdivision of general 2D workload arrays. Two orthogonal recursive bisection (ORB) variants are adapted to relax the straight-line division restriction in conventional ORB, using a medians-of-medians approach on a regular mesh or quadtree superimposed on the screen. Two approaches based on the Hilbert space-filling curve and graph partitioning are also proposed. An efficient primitive classification scheme is proposed for redistribution in 1D, and 2D rectilinear and jagged, decompositions. The performance comparison of the decomposition algorithms is modeled by establishing appropriate quality measures for load balancing, amount of primitive replication, and parallel execution time. The experimental results on a Parsytec CC system using a set of benchmark volumetric datasets verify the validity of the proposed performance models. The performance evaluation of the decomposition algorithms is also carried out through the sort-first parallelization of an efficient DVR algorithm.
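The summed-area table trick mentioned above is easy to make concrete: after one pass of 2D prefix sums, the workload of any axis-aligned rectangle follows from four array lookups. A small NumPy sketch (variable names are mine):

```python
import numpy as np

def summed_area_table(w):
    # S[i, j] holds the sum of w[:i, :j]; pad one zero row/column
    S = np.zeros((w.shape[0] + 1, w.shape[1] + 1))
    S[1:, 1:] = w.cumsum(axis=0).cumsum(axis=1)
    return S

def region_load(S, r0, c0, r1, c1):
    """Total workload of rows r0..r1-1, cols c0..c1-1 in O(1)."""
    return S[r1, c1] - S[r0, c1] - S[r1, c0] + S[r0, c0]

w = np.random.default_rng(0).integers(0, 10, size=(8, 8))
S = summed_area_table(w)
assert region_load(S, 2, 3, 6, 7) == w[2:6, 3:7].sum()
```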

5.
We study a handwriting-motion analysis method that combines the neural control mechanisms of biological movement with data synthesis analysis. In particular, the concept of motor coordination primitives is applied to the analysis of handwriting motion data: we develop a synthesis-analysis method for handwriting motion based on coordination primitives and build a model for understanding handwriting motion data that conforms to the neural control laws of biological movement. The proposed synthesis-analysis process consists of two alternately iterated optimization algorithms: the first estimates the coordination primitives and their modulation amplitudes via nonnegative matrix factorization; the second estimates the activation times of the coordination primitives using a similarity-maximization criterion. Experiments on stroke segmentation show that the segmentation results obtained with the coordination-primitive synthesis-analysis method reveal the overlapping connection patterns between adjacent strokes well, confirming the effectiveness of the proposed method.

6.
7.
Reeves, A.P. Software, IEEE, 1991, 8(6): 51-59
Two Unix environments developed for programming parallel computers to handle image-processing and vision applications are described. Visx is a portable environment for the development of vision applications that has been used for many years on serial computers in research. Visx was adapted to run on a multiprocessor with modest parallelism by using functional decomposition and standard operating-system capabilities to exploit the parallel hardware. Paragon is a high-level environment for multiprocessor systems that has facilities for both functional decomposition and data partitioning. It provides primitives that work efficiently on several parallel-processing systems. Paragon's primitives can be used to build special image-processing operations, allowing one's own programming environment to be grown naturally.

8.
Matrix decompositions are used for many data mining purposes. One of these purposes is to find a concise but interpretable representation of a given data matrix. Different decomposition formulations have been proposed for this task, many of which assume a certain property of the input data (e.g., nonnegativity) and aim at preserving that property in the decomposition. In this paper we propose new decomposition formulations for binary matrices, namely the Boolean CX and CUR decompositions. They are natural combinations of two previously presented decomposition formulations. We also consider two subproblems of these decompositions and present a rigorous theoretical study of the subproblems. We give algorithms for the decompositions and for the subproblems, and study their performance via extensive experimental evaluation. We show that even simple algorithms can give accurate and intuitive decompositions of real data, thus demonstrating the power and usefulness of the proposed decompositions.
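To make the Boolean CX idea concrete, here is a hypothetical greedy sketch: choose c actual columns of the binary matrix A as C, then fill a binary X so that the Boolean product C ∘ X approximates A. This simple heuristic is in the spirit of the "even simple algorithms" remark, not the paper's own method.

```python
import numpy as np

def best_X(A, C):
    """For fixed C, set x[k, j] = 1 iff column k of C covers more
    ones than zeros of A[:, j] (a simple per-entry heuristic)."""
    ones = C.T.astype(int) @ A          # 1-entries covered correctly
    zeros = C.T.astype(int) @ (1 - A)   # 0-entries covered wrongly
    return (ones > zeros).astype(int)

def boolean_product(C, X):
    # Boolean matrix product: OR of ANDs
    return (C.astype(int) @ X > 0).astype(int)

def boolean_cx(A, c):
    """Greedy sketch of a Boolean CX decomposition A ~ C o X,
    where C is c actual columns of A. Illustration only."""
    cols = []
    for _ in range(c):
        best_j, best_err = None, None
        for j in range(A.shape[1]):
            if j in cols:
                continue
            C = A[:, cols + [j]]
            err = np.sum(A != boolean_product(C, best_X(A, C)))
            if best_err is None or err < best_err:
                best_j, best_err = j, err
        cols.append(best_j)
    C = A[:, cols]
    return C, best_X(A, C)

A = (np.random.default_rng(0).random((20, 12)) < 0.3).astype(int)
C, X = boolean_cx(A, 4)
print(np.sum(A != boolean_product(C, X)), "entries wrong")
```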

9.
The system of equations that governs kinematically redundant robotic manipulators is commonly solved by finding the singular value decomposition (SVD) of the corresponding Jacobian matrix. This can require a considerable amount of time to compute, so a parallel SVD algorithm that reduces execution time is sought. The approach employed here lends itself to parallelization by using Givens rotations and information from previous decompositions. The key contribution of this research is the presentation and implementation of parallel SVD algorithms that compute the SVD for a set of Jacobians representing various joint-failure scenarios. Results from implementations of the algorithm on a MasPar MP-1, an IBM SP2, and the PASM prototype parallel computer are compared. Specific issues considered for each implementation include: how data is mapped to the processing elements, the effect that increasing the number of processing elements has on execution time, the type of parallel architecture used, and trade-offs between modes of parallelism.
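A Givens rotation, the primitive this parallel SVD builds on, annihilates one matrix entry at a time by rotating two rows; sequences of such independent rotations are what get parallelized. A minimal NumPy sketch:

```python
import numpy as np

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    if b == 0.0:
        return 1.0, 0.0
    r = np.hypot(a, b)
    return a / r, b / r

def apply_givens(A, i, k, c, s):
    """Rotate rows i and k of A in place."""
    A[[i, k], :] = np.array([[c, s], [-s, c]]) @ A[[i, k], :]

# zero out A[2, 0] by rotating rows 1 and 2
A = np.array([[4., 2., 1.], [3., 5., 7.], [2., 6., 8.]])
c, s = givens(A[1, 0], A[2, 0])
apply_givens(A, 1, 2, c, s)
print(A[2, 0])  # ~0.0
```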

10.
Efficient algorithms for estimating the coefficient parameters of the ordinary linear model on a massively parallel SIMD computer are presented. The numerical stability of the algorithms is ensured by using orthogonal transformations in the form of Householder reflections and Givens plane rotations to compute the complete orthogonal decomposition of the coefficient matrix. Algorithms for reconstructing the orthogonal matrices involved in the decompositions are also designed, implemented and analyzed. The implementation of all algorithms on the targeted SIMD array processor is considered in detail. Timing models for predicting the execution time of the implementations are derived using regression modelling methods. The timing models also provide an insight into how the algorithms interact with the parallel computer. The predetermined factors used in the regression fits are derived from the number of memory layers involved in the factorization process of the matrices. Experimental results show the high accuracy and predictive power of the timing models. Copyright © 1999 John Wiley & Sons, Ltd.
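As a reference point for the orthogonal transformations mentioned here, the compact Householder QR below computes the factorization sequentially; the paper's contribution is laying out the same reflections (and Givens rotations) for a SIMD array processor, which this sketch does not attempt.

```python
import numpy as np

def householder_qr(A):
    """QR via Householder reflections. Returns Q (m x m) and R (m x n)."""
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        v = x.copy()
        v[0] += np.copysign(np.linalg.norm(x), x[0])  # reflector direction
        nv = v @ v
        if nv == 0.0:
            continue
        H = np.eye(m - k) - 2.0 * np.outer(v, v) / nv  # reflection matrix
        R[k:, :] = H @ R[k:, :]        # zero column k below the diagonal
        Q[:, k:] = Q[:, k:] @ H        # accumulate the orthogonal factor
    return Q, R

A = np.random.default_rng(0).random((6, 4))
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(6)))
```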

11.
The authors introduce the notion of compatible star decompositions of simple polygons. In general, given two polygons with a correspondence between their vertices, two polygonal decompositions of the two polygons are said to be compatible if there exists a one-to-one mapping between them such that the corresponding pieces are defined by corresponding vertices. For compatible star decompositions, they also require correspondence between the star points of the star pieces. Compatible star decompositions have applications in computer animation and shape representation and analysis. The authors present two algorithms for constructing compatible star decompositions of two simple polygons. The first algorithm is optimal in the number of pieces in the decomposition, provided that such a decomposition exists without adding Steiner vertices. The second algorithm constructs compatible star decompositions with Steiner vertices; these are not minimal in the number of pieces but are asymptotically worst-case optimal in this number and in the number of added Steiner vertices. The authors prove that some pairs of polygons require Ω(n²) pieces, and that the decompositions computed by the second algorithm possess no more than O(n²) pieces. In addition to the contributions regarding compatible star decompositions, the paper also corrects an error in the only previously published polynomial algorithm for constructing a minimal star decomposition of a simple polygon, an error which might lead to a nonminimal decomposition.

12.
This paper studies programming for cluster systems, aiming to establish a parallel programming model that supports a virtual shared memory space and multiple ways of expressing parallelism. We first propose the concept of an abstract-structure shared-memory model, and on this basis build a hierarchical parallelism model that simultaneously supports data parallelism, task parallelism, and object parallelism; together, these two models constitute the parallel programming model of the parallel language TipC++. The paper also gives a preliminary discussion of performance-optimization primitives, compiler optimization, and task scheduling based on this programming model.

13.
We study the problem of decomposition of object-attribute matrices whose entries contain degrees to which objects have attributes. The degrees are taken from a bounded partially ordered scale. Examples of such matrices are binary matrices, matrices with entries from a finite chain, or matrices with entries from the unit interval [0, 1]. We study the problem of decomposing a given object-attribute matrix I with degrees into an object-factor matrix A with degrees and a binary factor-attribute matrix B, with the number of factors as small as possible. We present a theorem showing that decompositions that use particular formal concepts of I as factors are optimal, in that the number of factors involved is the smallest possible. We show that the problem of computing an optimal decomposition is NP-hard, and we present two heuristic algorithms for its solution along with their experimental evaluation. For the first algorithm, we provide its approximation ratio. Experiments indicate that the second algorithm, which is considerably faster than the first one, delivers decompositions whose quality is comparable to the decompositions delivered by the first algorithm. We also present an illustrative example demonstrating a factor-analysis interpretation of the decomposition studied in this paper.

14.
A common statistical problem is that of finding the median element in a set of data. This paper presents an efficient randomized high-level parallel algorithm for finding the median given a set of elements distributed across a parallel machine. In fact, our algorithm solves the general selection problem that requires the determination of the element of rank k, for an arbitrarily given integer k. Our general framework is an SPMD distributed-memory programming model that is enhanced by a set of communication primitives. We use efficient techniques for distributing and coalescing data as well as efficient combinations of task and data parallelism. The algorithms have been coded in the message-passing standard MPI, and our experimental results from the IBM SP-2 illustrate the scalability and efficiency of our algorithm and improve upon all the related experimental results known to the author. The main contributions of this paper are: (1) new techniques for speeding up the performance of certain randomized algorithms, such as selection, which are efficient with high probability; and (2) a new, practical randomized selection algorithm (UltraFast) with significantly improved convergence.
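The core pruning idea behind randomized selection fits in a few lines of sequential Python; the paper's contribution is carrying out the same rank-based pruning across processors with MPI primitives, plus the convergence improvements of UltraFast. A sketch under that caveat:

```python
import random

def select(data, k):
    """Randomized selection: return the element of rank k (0-based).
    Each round a random pivot prunes the part of the data that
    cannot contain the answer."""
    while True:
        pivot = random.choice(data)
        lo = [x for x in data if x < pivot]
        eq = [x for x in data if x == pivot]
        if k < len(lo):
            data = lo                      # answer is below the pivot
        elif k < len(lo) + len(eq):
            return pivot                   # pivot has the target rank
        else:
            data = [x for x in data if x > pivot]
            k -= len(lo) + len(eq)         # answer is above the pivot

vals = [5, 1, 9, 3, 14, 7, 2]
assert select(vals, len(vals) // 2) == sorted(vals)[len(vals) // 2]
```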

15.
The efficient mining of large, commercially credible, databases requires a solution to at least two problems: (a) better integration between existing Knowledge Discovery algorithms and popular DBMS; (b) ability to exploit opportunities for computational speedup such as data parallelism. Both problems need to be addressed in a generic manner, since the stated requirements of end-users cover a range of data mining paradigms, DBMS, and (parallel) platforms. In this paper we present a family of generic, set-based, primitive operations for Knowledge Discovery in Databases (KDD). We show how a number of well-known KDD classification metrics, drawn from paradigms such as Bayesian classifiers, Rule-Induction/Decision Tree algorithms, Instance-Based Learning methods, and Genetic Programming, can all be computed via our generic primitives. We then show how these primitives may be mapped into SQL and, where appropriate, optimised for good performance in respect of practical factors such as client–server communication overheads. We demonstrate how our primitives can support C4.5, a widely-used rule induction system. Performance evaluation figures are presented for commercially available parallel platforms, such as the IBM SP/2.
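As an illustration of a set-based primitive mapped to SQL, the query below computes per-(attribute value, class) counts, the contingency statistics a C4.5-style split evaluation consumes. The table and column names are invented for the example, not taken from the paper.

```python
import sqlite3

# Hypothetical training relation: one attribute ("outlook") and a
# class label ("play"). The GROUP BY is the set-based counting
# primitive a rule-induction metric such as information gain needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE examples (outlook TEXT, play TEXT)")
conn.executemany("INSERT INTO examples VALUES (?, ?)",
                 [("sunny", "no"), ("sunny", "yes"), ("rain", "yes"),
                  ("overcast", "yes"), ("rain", "no")])

counts = conn.execute(
    "SELECT outlook, play, COUNT(*) FROM examples "
    "GROUP BY outlook, play").fetchall()
print(counts)  # [(attribute value, class, count), ...]
```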

16.
Network utility maximization (NUM) problem formulations provide an important approach to conducting network resource allocation and to viewing layering as optimization decomposition. In the existing literature, distributed implementations are typically achieved by means of the so-called dual decomposition technique. However, the span of decomposition possibilities includes many other elements that, thus far, have not been fully exploited, such as the use of the primal decomposition technique, the versatile introduction of auxiliary variables, and the potential of multilevel decompositions. This paper presents a systematic framework to exploit alternative decomposition structures as a way to obtain different distributed algorithms, each with a different tradeoff among convergence speed, amount and asymmetry of message passing, and distributed computation architecture. Several specific applications are considered to illustrate the proposed framework, including resource-constrained and direct-control rate allocation, and rate allocation among QoS classes with multipath routing. For each of these applications, the associated generalized NUM formulation is first presented, followed by the development of novel alternative decompositions and numerical experiments on the resulting new distributed algorithms. A systematic enumeration and comparison of alternative vertical decompositions will, in the future, help complete a mathematical theory of network architectures.
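The baseline the paper generalizes, dual decomposition of a basic NUM, can be sketched in a few lines: each source solves a local subproblem given link prices, and each link adjusts its price by projected subgradient. The sketch below assumes logarithmic utilities, a toy routing matrix, and a fixed step size of my choosing.

```python
import numpy as np

# maximize sum(log x_s) subject to R x <= c,
# where R[l, s] = 1 if source s uses link l (toy data).
R = np.array([[1., 1., 0.],
              [0., 1., 1.]])
c = np.array([1.0, 2.0])      # link capacities
lam = np.ones(2)              # link prices (dual variables)
for _ in range(500):
    price = R.T @ lam                      # path price seen by each source
    x = 1.0 / price                        # argmax of log(x) - price * x
    # projected subgradient step; small floor keeps prices positive
    lam = np.maximum(1e-6, lam + 0.01 * (R @ x - c))
print(x, R @ x)  # rates and resulting link loads
```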

17.
This paper presents efficient and portable implementations of two useful primitives in image processing algorithms, histogramming and connected components. Our general framework is a single-address-space, distributed-memory programming model. We use efficient techniques for distributing and coalescing data as well as efficient combinations of task and data parallelism. Our connected components algorithm uses a novel approach for parallel merging which performs drastically limited updating during iterative steps, and concludes with a total consistency update at the final step. The algorithms have been coded in Split-C and run on a variety of platforms. Our experimental results are consistent with the theoretical analysis and provide the best known execution times for these two primitives, even when compared with machine-specific implementations.
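The histogramming pattern, local histograms per data block followed by a coalescing reduction, can be sketched with Python's process pool as a stand-in for the distributed-memory setting (image size and chunk count are arbitrary):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def local_hist(chunk, bins=256):
    # each worker histograms its own block of rows
    return np.bincount(chunk.ravel(), minlength=bins)

if __name__ == "__main__":
    img = np.random.default_rng(0).integers(0, 256, (2048, 2048),
                                            dtype=np.uint8)
    chunks = np.array_split(img, 8)      # block-row distribution
    with ProcessPoolExecutor(max_workers=8) as pool:
        parts = pool.map(local_hist, chunks)
    hist = np.sum(list(parts), axis=0)   # coalesce partial histograms
    assert hist.sum() == img.size
```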

18.

Parallel implementations of swarm intelligence algorithms such as ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When aiming for a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of algorithmic skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g. map, fold and zip) that are later converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain-specific language Musket can cope with the development of a parallel implementation of ACO, and how that compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high-performance code with execution times similar to those of low-level implementations.
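The skeleton patterns named above (map, fold, zip) have simple sequential semantics, which is what makes them a convenient abstraction target: a framework like Musket generates the parallel code from these patterns. A plain-Python rendering with an ACO-flavored use (the pheromone and heuristic numbers are invented):

```python
from functools import reduce

# Sequential semantics of the three skeletons; a skeleton framework
# would map these patterns to parallel (e.g. CUDA) code.
def sk_map(f, xs):        return [f(x) for x in xs]
def sk_zip(f, xs, ys):    return [f(x, y) for x, y in zip(xs, ys)]
def sk_fold(f, init, xs): return reduce(f, xs, init)

# ACO-style use: combine pheromone and heuristic values per edge,
# then fold to the normalizer for the edge-choice probabilities.
pheromone = [0.5, 1.2, 0.8]
heuristic = [2.0, 0.5, 1.0]
weights = sk_zip(lambda t, e: t * e, pheromone, heuristic)
total = sk_fold(lambda a, b: a + b, 0.0, weights)
probs = sk_map(lambda w: w / total, weights)
print(probs)
```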

19.
Recently, it has been proven that evolutionary algorithms produce good results for a wide range of combinatorial optimization problems. Some of the considered problems are tackled by evolutionary algorithms that use a representation which enables them to construct solutions in a dynamic programming fashion. We take a general approach and relate the construction of such algorithms to the development of algorithms using dynamic programming techniques. Thereby, we give general guidelines on how to develop evolutionary algorithms that have the additional ability of carrying out dynamic programming steps. Finally, we show that for a wide class of the so-called DP-benevolent problems (which are known to admit an FPTAS) there exists a fully polynomial-time randomized approximation scheme based on an evolutionary algorithm.
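For readers unfamiliar with the "dynamic programming fashion" being emulated, the classic knapsack DP below shows the stepwise solution construction in question; it is background for the abstract, not the paper's evolutionary algorithm.

```python
def knapsack(items, capacity):
    """items: list of (weight, value); returns the best achievable value.
    best[c] = best value using capacity c with the items seen so far."""
    best = [0] * (capacity + 1)
    for w, v in items:
        for c in range(capacity, w - 1, -1):  # reverse: use each item once
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

print(knapsack([(3, 4), (2, 3), (4, 5), (1, 1)], capacity=6))  # 8
```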

20.
QR and LU decompositions are among the most important matrix decomposition algorithms. Many studies have worked on accelerating these algorithms with FPGAs or ASICs in a case-by-case style. In this paper, we propose a unified framework for matrix decomposition algorithms, combining three QR decomposition algorithms and an LU algorithm with pivoting into a unified linear array structure. The QR and LU decomposition algorithms exhibit the same two-level loop structure and the same data dependency. Utilizing these similarities in loop structure and data dependency, we unify a fine-grained algorithm for all four matrix decomposition algorithms. Furthermore, we present a unified co-processor structure with a scalable linear array of processing elements (PEs), in which the four types of PEs share the same structure of memory channels and PE connections and differ only in the internal structure of the data path. Our unified co-processor, which uses IEEE 32-bit floating-point precision, is implemented and mapped onto a Xilinx Virtex5 FPGA chip. Experimental results show that our co-processors achieve speedups of 2.3 to 14.9 over a Pentium Dual CPU with double SSE threads.
