期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Fast Four‐Way Parallel Radix Sorting on GPUs

Linh Ha Jens Krüger Cláudio T. Silva 《Computer Graphics Forum》2009,28(8):2368-2378

Efficient sorting is a key requirement for many computer science algorithms. Acceleration of existing techniques as well as developing new sorting approaches is crucial for many real‐time graphics scenarios, database systems, and numerical simulations to name just a few. It is one of the most fundamental operations to organize and filter the ever growing massive amounts of data gathered on a daily basis. While optimal sorting models for serial execution on a single processor exist, efficient parallel sorting remains a challenge. In this paper, we present a hardware‐optimized parallel implementation of the radix sort algorithm that results in a significant speed up over existing sorting implementations. We outperform all known General Processing Unit (GPU) based sorting systems by about a factor of two and eliminate restrictions on the sorting key space. This makes our algorithm not only the fastest, but also the first general GPU sorting solution. 相似文献

2.

Performance characteristics of the multi-zone NAS parallel benchmarks

《Journal of Parallel and Distributed Computing》2006,66(5):674-685

We describe a new suite of computational benchmarks that models applications featuring multiple levels of parallelism. Such parallelism is often available in realistic flow computations on systems of meshes, but had not previously been captured in benchmarks. The new suite, named NPB (NAS parallel benchmarks) multi-zone, is derived from the NPB suite, and involves solving the application benchmarks LU, BT and SP on collections of loosely coupled discretization meshes. The solutions on the meshes are updated independently, but after each time step they exchange boundary value information. This strategy provides relatively easily exploitable coarse-grain parallelism between meshes. Three reference implementations are available: one serial, one hybrid using the message passing interface (MPI) and OpenMP, and another hybrid using a shared memory multi-level programming model (SMP+OpenMP). We examine the effectiveness of hybrid parallelization paradigms in these implementations on four different parallel computers. We also use an empirical formula to investigate the performance characteristics of the hybrid parallel codes. 相似文献

3.

Implementing regularly structured neural networks on the DREAMmachine

Shams S. Gaudiot J.-L. 《Neural Networks, IEEE Transactions on》1995,6(2):407-421

High-throughput implementations of neural network models are required to transfer the technology from small prototype research problems into large-scale "real-world" applications. The flexibility of these implementations in accommodating for modifications to the neural network computation and structure is of paramount importance. The performance of many implementation methods today is greatly dependent on the density and the interconnection structure of the neural network model being implemented. A principal contribution of this paper is to demonstrate an implementation method which exploits maximum amount of parallelism from neural computation, without enforcing stringent conditions on the neural network interconnection structure, to achieve this high implementation efficiency. We propose a new reconfigurable parallel processing architecture, the Dynamically Reconfigurable Extended Array Multiprocessor (DREAM) machine, and an associated mapping method for implementing neural networks with regular interconnection structures. Details of the system execution rate calculation as a function of the neural network structure are presented. Several example neural network structures are used to demonstrate the efficiency of our mapping method and the DREAM machine architecture on implementing diverse interconnection structures. We show that due to the reconfigurable nature of the DREAM machine, most of the available parallelism of neural networks can be efficiently exploited. 相似文献

4.

DaSH: A benchmark suite for hybrid dataflow and shared memory programming models

《Parallel Computing》2015

The current trend in development of parallel programming models is to combine different well established models into a single programming model in order to support efficient implementation of a wide range of real world applications. The dataflow model has particularly managed to recapture the interest of the research community due to its ability to express parallelism efficiently. Thus, a number of recently proposed hybrid parallel programming models combine dataflow and traditional shared memory models. Their findings have influenced the introduction of task dependency in the OpenMP 4.0 standard.This article presents DaSH – the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs that capture patterns of communication and computation common to a wide range of emerging applications. DaSH also includes sequential and shared-memory implementations based on OpenMP and Intel TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared memory implementations based on work-sharing and/or tasks. Finally, we use DaSH to evaluate three different hybrid dataflow models, identify their advantages and shortcomings, and motivate further research on their characteristics. 相似文献

5.

Efficient Parallel Nonlinear Multigrid Relaxation Algorithms for Low-Level Vision Applications

《Journal of Parallel and Distributed Computing》1995,29(1):96-103

Multigrid techniques have been shown to significantly improve the convergence rate of the nonlinear relaxation algorithms used in computer vision for the extraction of low-level image features. It is also well known that the computations involved with relaxation algorithms are regular and local, and lead naturally to massive data parallelism. However, standard data parallelism does not exploit the large computing resources of the now available massively parallel 2D processor arrays when coarse image resolutions (i.e., small image grids) have to be processed, like in multigrid methods. In this research note, we present an algorithmic framework which enables us making a full use of the large potential of data parallelism for the implementation of nonlinear multigrid relaxation methods. The approach combines two different levels of parallelism: parallel updating of the image sites and concurrent explorations of the configuration space of the problem. The efficiency of the method is demonstrated on two different low-level vision applications: restoration of noisy images and optical flow computation. 相似文献

6.

The parallel computation of eigenvalues and eigenvectors of large Hermitian matrices using the AMT DAP 510

J. S. Weston M. Clint C. W. Bleakney 《Concurrency and Computation》1991,3(3):179-185

The solution of the algebraic eigenvalue problem is an important component of many applications in science and engineering. With the advent of novel architecture machines, much research effort is now being expended in the search for parallel algorithms for the computation of eigensystems which can gainfully exploit the processing power which these machines provide. Among important recent work References 1-4 address the real symmetric eigenproblem in both its dense and sparse forms, Reference 5 treats the unsymmetric eigenproblem, and Reference 6 investigates the solution of the generalized eigenproblem. In this paper two algorithms for the parallel computation of the eigensolution of Hermitian matrices on an array processor are presented. These algorithms are based on the Parallel Orthogonal Transformation algorithm (POT) for the solution of real symmetric matrices[7,8]. POT was developed to exploit the SIMD parallelism supported by array processors such as the AMT DAP 510. The new algorithms use the highly efficient implementation strategies devised for use in POT. The implementations of the algorithms permit the computation of the eigensolution of matrices whose order exceeds the mesh size of the array processor used. A comparison of the efficiency of the two algorithms for the solution of a variety of matrices is given. 相似文献

7.

Parallel implementation of OPS5 on the encore multiprocessor: Results and analysis 总被引：1，自引：0，他引：1

Anoop Gupta Milind Tambe Dirk Kalp Charles Forgy Allen Newell 《International journal of parallel programming》1988,17(2):95-124

Until now, most results reported for parallelism in production systems (rulebased systems) have been simulation results-very few real parallel implementations exist. In this paper, we present results from our parallel implementation of OPS5 on the Encore multiprocessor. The implementation exploits very finegrained parallelism to achieve significant speed-ups. For one of the applications, we achieve 12.4 fold speed-up using 13 processes. Our implementation is also distinct from other parallel implementations in that we parallelize a highly optimized C-based implementation of OPS5. Running on a uniprocessor, our C-based implementation is 10–20 times faster than the standard lisp implementation distributed by Carnegie Mellon University. In addition to presenting the performance numbers, the paper discusses the details of the parallel implementation-the data structures used, the amount of contention observed for shared data structures, and the techniques used to reduce such contention. 相似文献

8.

Attempting guards in parallel: A data flow approach to execute generalized guarded commands

R. Govindarajan S. Yu V. S. Lakshmanan 《International journal of parallel programming》1992,21(4):225-268

Earlier approaches to execute generalized alternative/repetitive commands of Communicating Sequential Processes (CSP) attempt the selection of guards in a sequential order. Also, these implementations are based on either shared memory or message passing multiprocessor systems. In contrast, we propose an implementation of generalized guarded commands using the data-driven model of computation. A significant feature of our implementation is that it attempts the selection of the guards of a process in parallel. We prove that our implementation is faithful to the semantics of the generalized guarded commands. Further, we have simulated the implementation using discrete-event simulation and measured various performance parameters. The measured parameters are helpful in establishing the fairness of our implementation and its superiority, in terms of efficiency and the parallelism exploited, over other implementations. The simulation study is also helpful in identifying various issues that affect the performance of our implementation. Based on this study, we have proposed an adaptive algorithm which dynamically tunes the extent of parallelism in the implementation to achieve an optimum level of performance.The first author's work was supported by a MICRONET, Network Centers of Excellence, research grant. Support for the second author is from the NSERC (Canada) Grant. The last author's work was supported by grants from NSERC (Canada) and FCAR (Quebec). 相似文献

9.

Parallelizing the Cellular Potts Model on graphics processing units

José Juan Tapia Roshan M. D'Souza 《Computer Physics Communications》2011,182(4):857-865

The Cellular Potts Model (CPM) is a lattice based modeling technique used for simulating cellular structures in computational biology. The computational complexity of the model means that current serial implementations restrict the size of simulation to a level well below biological relevance. Parallelization on computing clusters enables scaling the size of the simulation but marginally addresses computational speed due to the limited memory bandwidth between nodes. In this paper we present new data-parallel algorithms and data structures for simulating the Cellular Potts Model on graphics processing units. Our implementations handle most terms in the Hamiltonian, including cell–cell adhesion constraint, cell volume constraint, cell surface area constraint, and cell haptotaxis. We use fine level checkerboards with lock mechanisms using atomic operations to enable consistent updates while maintaining a high level of parallelism. A new data-parallel memory allocation algorithm has been developed to handle cell division. Tests show that our implementation enables simulations of >¹⁰⁶ cells with lattice sizes of up to 256³ on a single graphics card. Benchmarks show that our implementation runs ∼80× faster than serial implementations, and ∼5× faster than previous parallel implementations on computing clusters consisting of 25 nodes. The wide availability and economy of graphics cards mean that our techniques will enable simulation of realistically sized models at a fraction of the time and cost of previous implementations and are expected to greatly broaden the scope of CPM applications. 相似文献

10.

An Efficient Parallel Algorithm for FFT

下载免费PDF全文

Qiao Xiangzhen 《计算机科学技术学报》1987,2(3):174-190

相似文献

11.

Parthenon: A parallel theorem prover for non-horn clauses

Soumitra Bose Edmund M. Clarke David E. Long Spiro Michaylov 《Journal of Automated Reasoning》1992,8(2):153-181

We describe a parallel resolution theorem prover, called Parthenon, that handles full first order logic. Although there has been much work on parallel implementations of logic programming languages, Parthenon is the first general purpose theorem prover to be developed for a multiprocessor. The system is based on a modification of Warren's SRI model for or-parallelism and implements a variant of Loveland's model elimination procedure. It has been evaluated on various shared memory multiprocessors including a 16-processor Encore Multimax and IBM's 64-processor RP3. We have found that many theorem proving problems exhibit a great deal of potential parallelism. Parthenon has been able to exploit much of this parallelism, producing both good absolute run times and near-linear speedup curves in many cases.This research was partially supported by NSF grant CCR-87-226-33. An earlier version of this paper appeared in the Fourth IEEE Symposium on Logic in Computer Science, Asilomar, CA, June 1989. D.E.L. was partially supported by an NSF graduate fellowship. S.M. was partially supported by an IBM graduate fellowship. 相似文献

12.

Loop Staggering,Loop Compacting:Restructuring Techniques for Thrashing Problem 总被引：1，自引：0，他引：1

下载免费PDF全文

Jin Guohua Yang Xuejun Chen Fujie 《计算机科学技术学报》1993,8(1):49-57

Parallel loops account for the greatest amount of parallelism in numerical programs.Executing nested loops in parallel wit low run-time overhead is thus very important for achieving high performance in paralleo processing systems.However,in parallel processing systems with caches of local memories in memory hierarchies,“thrashing problemmay” may arise when data move back and forth frequently between the caches or local memories in different processors.The techniques associated with parallel compiler to solve the problem are not completely developed.In this paper,we present two restructuring techniques called loopg staggering,loop staggering and compacting,with which we can not only eliminate the cache or local memory thrashing phemomena significantly,but also exploit the potential parallelism existing in outer serial loop.Loop staggering benefits the dynamic loop scheduling strategies,whereas loop staggering and compacting is good for static loop scheduling strategies,Our method especially benefits parallel programs,in which a parallel loop is enclosed by a serial loop and array elements are repeatedly used in the different iterations of the parallel loop. 相似文献

13.

Analysis of multidimensional images on the Connection Machine system

Giampiero Marcenaro Massimo Tistarelli 《Concurrency and Computation》1991,3(6):699-713

The Connection Machine (CM) has been demonstrated to be an efficient and fast computational engine for the solution of many problems related to image processing. The high-level parallelism of the CM naturally fits to many large-scale data intensive applications. In this paper the implementation of parallel algorithms for the analysis of multidimensional images on the CM is presented. Different aspects in the analysis of multidimensional images are considered. In the field of artificial vision, the implementation of algorithms for the filtering of image sequences (both in space and time) and the estimation of the optical flow is described and some results in terms of accuracy and computation time are presented. The processing of three-dimensional images is investigated in the field of biomedical engineering. In this case the goal is the development of algorithms for the 3-D reconstruction of human body segments and their visualization. The parallel implementations exploit the fine grain parallelism allowed by the CM, processing each point of the data on a different processor. This mechanism is allowed by the possibility of dynamically reconfiguring the connectivity of the CM nodes and of defining a huge number of virtual processors. Moreover, as the CM processors operate on one-bit data, it is possible to tune the number of bits for each data point to match the accuracy required by the application. 相似文献

14.

Adaptive parallelism and Piranha

Carriero N. Freeman E. Gelernter D. Kaminsky D. 《Computer》1995,28(1):40-49

相似文献

15.

基于分布对象的并行程序设计方法研究

龚向坚邹腊梅马淑萍《现代计算机》2011,(21):9-11,26

研究分布式对象的并行实现及优化,提出一种基于分布式对象的并行程序设计方法,构建一个基于分布式对象的并行程序设计模型,并以此方法完成虚拟计算机网络实验系统的设计和实现实验结果表明,该虚拟计算机网络实验系统并行性较好、响应速度适中,证明基于分布式对象的并行程序设计方法在改善微机系统并行性上具有一定的作用相似文献

16.

Stabilizing large‐scale generalized systems on parallel computers using multithreading and message‐passing

Peter Benner Maribel Castillo Rafael Mayo Enrique S. Quintana‐Ortí Gregorio Quintana‐Ortí 《Concurrency and Computation》2007,19(4):531-542

We discuss the parallelization of an efficient algorithm for the partial stabilization of large‐scale linear control systems in generalized state‐space form. The algorithm is composed of highly parallel iterative schemes that appear in the computation of certain matrix functions. Here we evaluate different approaches to exploit parallelism at two levels, based on threads and processes. Our experimental results on a cluster of symmetric multiprocessors and a CC‐NUMA platform show that the efficiency of the matrix operations underlying the iterative schemes carry over to the parallel implementation of the stabilization algorithm. Copyright © 2006 John Wiley & Sons, Ltd. 相似文献

17.

Algorithms for asynchronous parallel processing of object-orienteddatabases

Thakore A.K. Su S.Y.W. Lam H.X. 《Knowledge and Data Engineering, IEEE Transactions on》1995,7(3):487-504

Management of large quantities of complex data is essential in many advanced application areas. Object-oriented (OO) database management system have been developed to effectively model and process the complex domain knowledge. They have been shown to outperform some existing relational systems. The existing implementations of OO database management systems attempt to improve the efficiency of OO queries by explicitly capturing the relationships among objects. However, the execution of complex queries involving the retrieval of objects from many classes and relationships among them causes the existing system to operate inefficiently. In this paper, we present parallel algorithms for the processing of queries against a large OO database. The algorithms are based on a closed model of query processing pattern-based access instead of the conventional value-based access. During processing, the algorithms avoid the execution of time-consuming join operations by making use of the explicitly stored object associations. Generation of large quantities of temporary data is avoided by marking objects using their identifiers and by employing a two-phase query processing strategy. A query is processed by concurrent multiple waves, thereby improving parallelism avoiding the complexities introduced in their sequential implementation. The correctness and the performance of the parallel algorithms have been tested and analyzed by running parallel programs on a 32-node transputer based parallel machine designed and developed at the IBM Research Center at Yorktown Heights, New York. Benchmark queries of different semantic complexities are generated, and their performance is analyzed for various data and query parameters 相似文献

18.

Constructive protocol specification using Cicero

Huang Y.-M. Ravishankar C.V. 《IEEE transactions on pattern analysis and machine intelligence》1998,24(4):252-267

This paper describes Cicero, a set of language constructs to allow constructive protocol specifications. Unlike other protocol specification languages, Cicero gives programmers explicit control over protocol execution, and facilitates both sequential and parallel implementations, especially for protocols above the transport-layer. It is intended to be used in conjunction with domain-specific libraries, and is quite different in philosophy and mode of use from existing protocol specification languages. A feature of Cicero is the use of event patterns to control synchrony, asynchrony, and concurrency in protocol execution, which helps programmers build robust protocol implementations. Event-pattern driven execution also enables implementers to exploit parallelism of varying grains in protocol execution. Event patterns can also be translated into other formal models, so that existing verification techniques may be used 相似文献

19.

GPU-accelerated level-set segmentation

Julián Lamas-Rodríguez Dora B. Heras Francisco Argüello Dagmar Kainmueller Stefan Zachow Montserrat Bóo 《Journal of Real-Time Image Processing》2016,12(1):15-29

The level-set method, a technique for the computation of evolving interfaces, is a solution commonly used to segment images and volumes in medical applications. GPUs have become a commodity hardware with hundreds of cores that can execute thousands of threads in parallel, and they are nowadays ideal platforms to execute computational intensive tasks, such as the 3D level-set-based segmentation, in real time. In this paper, we propose two GPU implementations of the level-set-based segmentation method called Fast Two-Cycle. Our proposals perform computations in independent domains called tiles and modify the structure of the original algorithm to better exploit the features of the GPU. The implementations were tested with real images of brain vessels and a synthetic MRI image of the brain. Results show that they execute faster than a CPU-sequential implementation of the same method, without any significant loss of the segmentation quality and without requiring distributed parallel computer infrastructures. 相似文献

20.

Fifth generation and VLSI architectures

Philip C. Treleaven Apostolos N. Refenes 《Future Generation Computer Systems》1985,1(6):387-396

Most Western Governments (USA, Japan, EEC, etc.) have now launched national programmes to develop computer systems for use in the 1990s. These so-called Fifth Generation computers are viewed as “knowledge” processing systems which support the symbolic computation underlying Artificial Intelligence applications. The major driving force in Fifth Generation computer design is to efficiently support very high level programming languages (i.e. VHLL architecture).

Historycally, however, commercial VHLL architectures have been largely unsuccesful. The driving force in computer designs has principally been advances in hardware which at the present time means architectures to exploit very large scale integration (i.e. VLSI architecture).

This paper examines VHLL architectures and VLSI architectures and their probable influences on Fifth Generation computers. Interestingly the major problem for both architecture classes is parallelism; how to orchestrate a single parallel computation so that it can be distributed across an ensemble of processors. 相似文献