Similar Literature
20 similar documents found.
1.
Space-time adaptive processing (STAP) refers to a class of methods for detecting targets using an array of sensors. Various STAP methods use similar operations on different data or in different orders. We have developed a portable, parallel library of subroutines for prototyping STAP methods. The subroutines work on the IBM SP2 and the Intel Paragon under three different operating systems and three different communication libraries, and can also be configured for other systems. We provide execution-time models for predicting the performance of each subroutine. Using the library routines, we created a parallel version of element-space pre-Doppler processing, three parallel versions of higher-order post-Doppler processing, and two versions of PRI-staggered post-Doppler processing. We implemented a fourth version of higher-order post-Doppler processing, the hybrid method, which uses a combination of fine-grain and coarse-grain parallelism to reduce execution time. The hybrid method can be used to improve performance when a large number of processors is available. Our execution time models generally predict the best method and predict execution times to within 10 percent or better for large test cases.
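The abstract mentions execution-time models for each subroutine but does not reproduce them; the sketch below only illustrates the general shape such a model can take (a compute term divided by the processor count plus a communication term). The function name and the coefficients a, b, c are placeholder assumptions, not the paper's actual model.

```c
#include <math.h>

/* Illustrative execution-time model, not the paper's actual model:
 *   T(n, p) = a*n/p + b*log2(p) + c
 * a: per-element compute cost, b: per-step communication cost,
 * c: fixed overhead.  In practice the coefficients would be fitted
 * to measured timings of each library subroutine. */
double predicted_time(double n, int p, double a, double b, double c)
{
    return a * n / (double)p + b * log2((double)p) + c;
}
```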

2.
The Earth Simulator (ES) is an SMP cluster system. There are two types of parallel programming models available on the ES. One is a flat programming model, in which a parallel program is implemented by MPI interfaces only, both within an SMP node and among nodes. The other is a hybrid programming model, in which a parallel program is written by using thread programming within an SMP node and MPI programming among nodes simultaneously. It is generally known that it is difficult to obtain the same high level of performance using the hybrid programming model as can be achieved with the flat programming model.

In this paper, we have evaluated scalability of the code for direct numerical simulation of the Navier–Stokes equations on the ES. The hybrid programming model achieves a sustained performance of 346.9 Gflop/s, while the flat programming model achieves 296.4 Gflop/s with 16 PNs of the ES for a DNS problem size of 256^3. For small-scale problems, however, the hybrid programming model is not as efficient because of microtasking overhead. It is shown that the hybrid programming model has an advantage on the ES for larger problem sizes.
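As a minimal sketch of the two models compared above, the hybrid style can be pictured as MPI between nodes plus OpenMP threads within a node; the flat style would instead run one MPI process per core and drop the OpenMP directive. This is illustrative only and not code from the ES study; the loop body is a placeholder.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    /* Request thread support because OpenMP threads run inside each MPI process. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local_sum = 0.0, global_sum = 0.0;

    /* Intra-node parallelism: OpenMP threads share this loop. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + i + rank);    /* placeholder computation */

    /* Inter-node parallelism: MPI combines the per-process results. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f (from %d processes)\n", global_sum, nprocs);

    MPI_Finalize();
    return 0;
}
```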


3.
In this paper we benchmark the performance of the Cray T3D, IBM 9076 SP/1 and Intel Paragon XP/S parallel computers, using implementations of parallel algorithms for the computation of the vector outer-product operation A = uv^T. The vector outer-product operation, although very simple in nature, requires the computation of a large number of floating-point operations and its parallelization induces a great level of communication between the processors. It is thus suited to measure the relative speed of the processor, memory subsystem and network capabilities of a parallel computer. It should not be considered a ‘toy problem’, since it arises in numerical methods in the context of the solution of systems of non-linear equations – still a difficult problem to solve. We present algorithms for both the explicit shared-memory and message-passing programming models together with theoretical computation models for those algorithms. Actual experiments were run on those computers, using Fortran 77 implementations of the algorithms. The results obtained with these experiments show that, due to the high degree of communication between the processors, one needs a parallel computer with fast communications and carefully implemented data exchange routines. The theoretical computation model allows prediction of the speed-up to be obtained for some problem size on a given number of processors. © 1997 John Wiley & Sons, Ltd.
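One possible message-passing formulation of A = uv^T (not the paper's Fortran 77 algorithms) distributes A and u by block rows and replicates v with a broadcast; the problem size and distribution below are assumptions for illustration.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Block-row distributed outer product A = u * v^T: each process owns a
 * contiguous block of rows of A and of u, and needs all of v, which is
 * replicated with a single broadcast. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1024;          /* assumed problem size, divisible by nprocs */
    int rows = n / nprocs;       /* rows of A owned by this process */

    double *u_local = malloc(rows * sizeof(double));
    double *v       = malloc(n * sizeof(double));
    double *A_local = malloc((size_t)rows * n * sizeof(double));

    for (int i = 0; i < rows; i++) u_local[i] = rank * rows + i;
    if (rank == 0)
        for (int j = 0; j < n; j++) v[j] = 1.0 / (j + 1);

    /* The only communication step: replicate v on every process. */
    MPI_Bcast(v, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++)
            A_local[i * n + j] = u_local[i] * v[j];

    if (rank == 0) printf("A[0][0] = %f\n", A_local[0]);
    free(u_local); free(v); free(A_local);
    MPI_Finalize();
    return 0;
}
```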

4.
Speedup is the traditional performance evaluation model for parallel computing. This paper discusses the shortcomings and limitations of speedup and, on that basis, proposes a new performance evaluation model for optimized parallel computation (which we call optimized speedup). Optimized speedup is then used to analyze the performance of the NAS benchmark programs MG and FT on the IBM SP2 (66 MHz/WN).
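For reference, the traditional metric the abstract critiques is the classical speedup, with parallel efficiency as its normalized form; the paper's optimized-speedup model is not defined in the abstract and is not reproduced here:

```latex
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}
```

where T_1 is the execution time of the sequential program and T_p the execution time on p processors.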

5.
A collective communication library for parallel computers includes frequently used operations such as broadcast, reduce, scatter, gather, concatenate, synchronize, and shift. Such a library provides users with a convenient programming interface, efficient communication operations, and the advantage of portability. A library of this nature, the Collective Communication Library (CCL), intended for IBM's line of scalable parallel computer products, has been designed. CCL is part of the parallel application programming interface of the recently announced IBM 9076 Scalable POWERparallel System 1 (SP1). In this paper, we examine several issues related to the functionality, correctness, and performance of a portable collective communication library while focusing on three novel aspects in the design and implementation of CCL: 1) the introduction of process groups, 2) the definition of semantics that ensures correctness, and 3) the design of new and tunable algorithms based on a realistic point-to-point communication model.
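CCL's own function names are not given in the abstract; the sketch below uses the familiar MPI collectives only as a rough analogue of the operations and process groups it describes, with MPI_Comm_split standing in for group formation.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Process groups: split the processes into two halves. */
    MPI_Comm half;
    MPI_Comm_split(MPI_COMM_WORLD, rank < size / 2 ? 0 : 1, rank, &half);

    int value = rank;
    MPI_Bcast(&value, 1, MPI_INT, 0, half);                   /* broadcast within the group */

    int sum = 0;
    MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, half);   /* reduce within the group */

    MPI_Barrier(half);                                        /* group-wide synchronize */

    printf("rank %d: group value %d, group sum %d\n", rank, value, sum);

    MPI_Comm_free(&half);
    MPI_Finalize();
    return 0;
}
```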

6.
When using a shared-memory multiprocessor, the programmer faces the issue of selecting the portable programming model which will provide the best performance. Even if they restrict their choice to the standard programming environments (MPI and OpenMP), they have to select a programming approach among MPI and the variety of OpenMP programming styles. To help the programmer in this decision, we compare MPI with three OpenMP programming styles (loop level, loop level with large parallel sections, SPMD) using a subset of the NAS benchmark (CG, MG, FT, LU), two dataset sizes (A and B), and two shared-memory multiprocessors (IBM SP3 NightHawk II, SGI Origin 3800). We have developed the first SPMD OpenMP version of the NAS benchmark and gathered other OpenMP versions from independent sources (PBN, SDSC and RWCP). Experimental results demonstrate that OpenMP provides competitive performance compared with MPI for a large set of experimental conditions. Not surprisingly, the two best OpenMP versions are those requiring the strongest programming effort. MPI still provides the best performance under some conditions. We present breakdowns of the execution times and measurements of hardware performance counters to explain the performance differences. Copyright © 2005 John Wiley & Sons, Ltd.
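To make the programming styles concrete, here is a toy contrast between loop-level OpenMP and SPMD-style OpenMP on the same vector update; it is an illustration of the styles, not code from the NAS benchmark versions used in the study.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000
static double a[N], b[N];

/* Loop-level style: each parallel loop is annotated individually. */
void loop_level(void)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];
}

/* SPMD style: one large parallel region with explicit per-thread work
 * partitioning, mimicking the structure of an MPI program. */
void spmd(void)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        int chunk = (N + nth - 1) / nth;
        int lo = tid * chunk;
        int hi = lo + chunk < N ? lo + chunk : N;
        for (int i = lo; i < hi; i++)
            a[i] = 2.0 * b[i];
    }
}

int main(void)
{
    for (int i = 0; i < N; i++) b[i] = i;
    loop_level();
    spmd();
    printf("a[10] = %f\n", a[10]);
    return 0;
}
```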

7.
To balance load and reduce communication overhead, two parallel genetic algorithms are proposed for parallel machines whose nodes carry unequal loads: an island model with dynamic load balancing and a master-slave model. Both are compared with the basic island model, and both achieve good results in practical use.

8.
In this study, parallel computing technology is applied to the simulation of a wind turbine flow problem. A third-order Roe-type flux-limited splitting based on a pre-conditioning matrix with an explicit time-marching method is used to solve the Navier–Stokes equations. The original FORTRAN code was parallelized with the Message Passing Interface (MPI) and tested on a 64-CPU IBM SP2 parallel computer. The test results show a significant reduction of computing time in running the model, with a super-linear speed-up achieved for up to 32 CPUs on the IBM SP2. The speed-up reaches 49 when using 64 IBM SP2 processors. The tests show the promising potential of parallel processing to provide prompt simulation of current wind turbine problems.

9.
Space-time adaptive processing (STAP) is one of the key technologies of next-generation airborne signal processors and places strict real-time requirements on the system. Based on an analysis of the performance requirements of real-time STAP, a parallel STAP processing system is implemented on a high-performance digital signal processing board built around the TS201. Combining the characteristics of STAP processing with the system architecture, a locally-serial, globally-parallel algorithm model is proposed, and efficient communication and synchronization methods are used in its implementation. The model effectively improves the load-balancing behaviour of the system. Measured results show that the system meets the real-time requirements of STAP processing and scales well.

10.
This paper describes a general approach to implementing parallel image processing with MPI and, by comparing program results under different communication mechanisms, communication modes, network performance, and data sizes, derives the general characteristics of parallel programs.
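A common scatter/process/gather pattern for MPI image processing is sketched below; the image size and the thresholding filter are placeholder assumptions, not the operations studied in the paper.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Scatter rows of the image, apply a per-pixel operation locally,
 * then gather the processed rows back on the root process. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int height = 512, width = 512;   /* assumed size, height divisible by nprocs */
    int local_rows = height / nprocs;

    unsigned char *image = NULL;
    if (rank == 0) {
        image = malloc((size_t)height * width);
        for (int i = 0; i < height * width; i++) image[i] = (unsigned char)(i % 256);
    }
    unsigned char *strip = malloc((size_t)local_rows * width);

    MPI_Scatter(image, local_rows * width, MPI_UNSIGNED_CHAR,
                strip, local_rows * width, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    /* Placeholder per-pixel operation: simple thresholding. */
    for (int i = 0; i < local_rows * width; i++)
        strip[i] = strip[i] > 128 ? 255 : 0;

    MPI_Gather(strip, local_rows * width, MPI_UNSIGNED_CHAR,
               image, local_rows * width, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0) { printf("pixel[0] = %d\n", image[0]); free(image); }
    free(strip);
    MPI_Finalize();
    return 0;
}
```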

11.
Two parallel computing paradigms available today are multi-core accelerators such as the Sony/Toshiba/IBM Cell or graphics processing units (GPUs), and massively parallel message-passing machines such as the IBM Blue Gene (BG). The solution of systems of linear equations is one of the most CPU-intensive steps in engineering and simulation applications and can greatly benefit from the multitude of processing cores and vectorisation on today's parallel computers. We parallelise the conjugate gradient (CG) linear equation solver on the Cell Broadband Engine and the IBM Blue Gene/L machine. We perform a scalability analysis of CG on both machines across 1, 8 and 16 synergistic processing elements on the Cell and 1–32 cores on BG with heptadiagonal matrices. The results indicate that the multi-core Cell system outperforms the massively parallel BG system by three to four times due to the Cell's higher communication bandwidth and accelerated vector processing capability.
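The kernel being parallelized is the textbook conjugate gradient iteration; a plain serial version is sketched below (dense matrix-vector product for brevity, whereas the paper uses heptadiagonal matrices). On the Cell or Blue Gene, the matrix-vector product and the dot products are the parts that get distributed and vectorized; this is not the authors' code.

```c
#include <math.h>
#include <stdio.h>

/* Conjugate gradient for a symmetric positive-definite system Ax = b. */
void cg(int n, const double *A, const double *b, double *x, int max_iter, double tol)
{
    double r[n], p[n], Ap[n];

    for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; p[i] = r[i]; }

    double rsold = 0.0;
    for (int i = 0; i < n; i++) rsold += r[i] * r[i];

    for (int it = 0; it < max_iter; it++) {
        /* Matrix-vector product Ap = A * p: the dominant, parallelizable step. */
        for (int i = 0; i < n; i++) {
            Ap[i] = 0.0;
            for (int j = 0; j < n; j++) Ap[i] += A[i * n + j] * p[j];
        }

        double pAp = 0.0;
        for (int i = 0; i < n; i++) pAp += p[i] * Ap[i];
        double alpha = rsold / pAp;

        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }

        double rsnew = 0.0;
        for (int i = 0; i < n; i++) rsnew += r[i] * r[i];
        if (sqrt(rsnew) < tol) break;

        double beta = rsnew / rsold;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rsold = rsnew;
    }
}

int main(void)
{
    double A[4] = {4, 1, 1, 3};
    double b[2] = {1, 2};
    double x[2];
    cg(2, A, b, x, 100, 1e-10);
    printf("x = (%f, %f)\n", x[0], x[1]);   /* roughly (0.0909, 0.6364) */
    return 0;
}
```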

12.
Parallel Mining of Outliers in Large Database
Data mining is a new, important and fast-growing database application. Outlier (exception) detection is one kind of data mining, which can be applied in a variety of areas such as monitoring of credit card fraud and criminal activities in electronic commerce. With the ever-increasing size and attributes (dimensions) of databases, previously proposed detection methods for two dimensions are no longer applicable. The time complexity of the Nested-Loop (NL) algorithm (Knorr and Ng, in Proc. 24th VLDB, 1998) is linear in the dimensionality but quadratic in the dataset size, inducing an unacceptable cost for large datasets. A more efficient version (ENL) and its parallel version (PENL) are introduced. In theory, the improvement of performance in PENL is linear in the number of processors, as shown in a performance comparison between ENL and PENL using the Bulk Synchronous Parallel (BSP) model. The great improvement is further verified by experiments on an IBM 9076 SP2 parallel computer system. The results show that it is a very good choice to mine outliers on a low-cost cluster of workstations interconnected by a commodity communication network.
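A serial sketch of the nested-loop idea is given below, following the distance-based outlier notion of Knorr and Ng (a point is flagged if fewer than M other points lie within distance D of it); the parameter names and the tiny example are assumptions, and the ENL/PENL refinements (blocked I/O, distribution of blocks across processors) are not shown.

```c
#include <stdio.h>

/* Nested-loop distance-based outlier detection: quadratic in the number
 * of points, linear in the dimensionality. */
void nl_outliers(int n, int dim, const double *data, double D, int M, int *is_outlier)
{
    double D2 = D * D;
    for (int i = 0; i < n; i++) {
        int neighbours = 0;
        for (int j = 0; j < n && neighbours < M; j++) {
            if (j == i) continue;
            double dist2 = 0.0;
            for (int k = 0; k < dim; k++) {
                double diff = data[i * dim + k] - data[j * dim + k];
                dist2 += diff * diff;
            }
            if (dist2 <= D2) neighbours++;
        }
        is_outlier[i] = (neighbours < M);
    }
}

int main(void)
{
    /* Tiny 2-D example: three clustered points and one far-away point. */
    double data[] = {0.0, 0.0,  0.1, 0.0,  0.0, 0.1,  5.0, 5.0};
    int flags[4];
    nl_outliers(4, 2, data, 1.0, 2, flags);
    for (int i = 0; i < 4; i++)
        printf("point %d outlier? %d\n", i, flags[i]);
    return 0;
}
```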

13.
Parallel programming is elusive. The relative performance of different parallel implementations varies with machine architecture, system and problem size. How to compare different implementations over a wide range of machine architectures and problem sizes has not been well addressed due to its difficulty. Scalability has been proposed in recent years to reveal scaling properties of parallel algorithms and machines. In this paper, the relation between scalability and execution time is carefully studied. The concepts of crossing-point analysis and range comparison are introduced. Crossing-point analysis finds slow/fast performance crossing points of parallel algorithms and machines. Range comparison compares performance over a wide range of ensemble and problem sizes via scalability and crossing-point analysis. Three algorithms from scientific computing are implemented on an Intel Paragon and an IBM SP2 parallel computer. Experimental and theoretical results show how the combination of scalability, crossing-point analysis, and range comparison provides a practical solution for scalable performance evaluation and prediction. While our tests are conducted on homogeneous parallel computers, the proposed methodology applies to heterogeneous and network computing as well.

14.
That the influence of the PRAM model is ubiquitous in parallel algorithm design is as clear as the fact that it is technologically infeasible for the foreseeable future. The current generation of parallel hardware prominently features distributed memory and high-performance interconnection networks – very much the antithesis of the shared memory required for the PRAM model. It has been shown that, in spite of communication costs, for some problems very fast parallel algorithms are available for distributed-memory machines – from embarrassingly parallel problems to sorting and numerical analysis. In contrast it is known that for other classes of problem PRAM-style shared-memory simulation on a distributed-memory machine can, in theory, produce solutions of comparable performance to the best possible for such architectures. The Bulk Synchronous Parallel (BSP) model accurately represents most parallel machines – theoretical and actual – in an execution and cost model. We introduce a scalable portable PRAM realization appropriate for BSP computers and a methodology for usage. Our system is fast and built upon the familiar sequential C++ coupled with the new standard BSP library of parallel computation and communication primitives. It is portable to and predictable on a vast number of parallel computers including workstation clusters, a 256-processor Cray T3D, an 8-node IBM SP/2 and a 4-node shared-memory SGI Power Challenge machine. Our approach achieves simplicity of programming over direct-mode BSP programming for reasonable overhead cost. We objectively compare optimized BSP and PRAM algorithms implemented with our C++ PRAM library and provide encouraging experimental results for our new style of programming. Copyright © 2000 John Wiley & Sons, Ltd.
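For readers unfamiliar with the model, the standard BSP cost of a superstep (as usually stated for Valiant's model; the paper's own notation may differ) is

```latex
T_{\mathrm{superstep}} = w + g\,h + l, \qquad
T_{\mathrm{total}} = \sum_{s=1}^{S} \bigl( w^{(s)} + g\,h^{(s)} + l \bigr)
```

where w is the maximum local computation performed by any processor in the superstep, h the maximum number of words sent or received by any processor, g the per-word communication cost, and l the barrier synchronization cost.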

15.
Optimization and Performance Evaluation of Parallel Programs
This paper discusses the optimization of parallel programs and points out that it should proceed along three lines: data partitioning, communication optimization, and serial optimization. Addressing the shortcomings and limitations of the traditional speedup metric, we propose an optimized-speedup model to evaluate the performance of optimized parallel programs. The NAS benchmark programs MG and FT are optimized, and the optimized-speedup model is used to analyze the performance of these two programs on the IBM SP2.

16.
We examine combinatorial properties of a class of hash functions and its application to the simulation of classical models of parallel computation on other models, such as the BSP and the S*PRAM, optimally in communication to within additive lower-order terms. The BSP model can serve as a programming paradigm as well; we also examine the implications of architecture-independent parallel algorithm design in the context of the BSP model and show how it can lead to portable and scalable implementations of algorithms that can work on a multiplicity of hardware platforms with only recompilation of the source program code. Toward this end, dense Cholesky factorization algorithms are presented and their performance on three parallel hardware platforms, an SGI Power Challenge, IBM SP2, and Cray T3D, is examined and analyzed.
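The serial kernel behind the benchmark is the dense Cholesky factorization; a minimal in-place version is sketched below (the BSP implementations in the paper distribute the columns or blocks across processors, which is not shown here).

```c
#include <math.h>
#include <stdio.h>

/* In-place dense Cholesky factorization A = L * L^T: the lower triangle
 * of A is overwritten with L.  Returns -1 if A is not positive definite. */
int cholesky(int n, double *A)
{
    for (int j = 0; j < n; j++) {
        double diag = A[j * n + j];
        for (int k = 0; k < j; k++) diag -= A[j * n + k] * A[j * n + k];
        if (diag <= 0.0) return -1;
        A[j * n + j] = sqrt(diag);
        for (int i = j + 1; i < n; i++) {
            double s = A[i * n + j];
            for (int k = 0; k < j; k++) s -= A[i * n + k] * A[j * n + k];
            A[i * n + j] = s / A[j * n + j];
        }
    }
    return 0;
}

int main(void)
{
    double A[9] = {4, 12, -16,  12, 37, -43,  -16, -43, 98};
    if (cholesky(3, A) == 0)
        printf("L = [%g 0 0; %g %g 0; %g %g %g]\n",
               A[0], A[3], A[4], A[6], A[7], A[8]);   /* expect 2; 6 1; -8 5 3 */
    return 0;
}
```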

17.
We present PVMe, IBM's AIX implementation of the widely used PVM message-passing programming model. The focus is on the version for the IBM 9076 SP1. The PVMe design is thoroughly described along with the results obtained running some significant applications on this platform. A summary of the experiences of using PVM as the base for the implementation of other communication libraries (namely, PARMACS) concludes the work.

18.
We present a portable, parallel implementation of an urban air quality model. The parallel model runs on the Intel Delta, Intel Paragon, IBM SP2, and Cray T3D, using a variety of standard communication libraries. We analyze the performance of the air quality model on these platforms based on a model derived from the parallel communication behavior and sequential execution time of the air quality model. We predict the performance of next-generation air quality models based on this analysis.

19.
We consider the parallelization of two standard 2D reconstruction algorithms, filtered backprojection and direct Fourier reconstruction, using the data-parallel programming style. The algorithms are implemented on a Connection Machine CM-5 with 16 processors and a peak performance of 2 Gflop/s.

20.
In this paper, we propose high-performance radix-2, 3 and 5 parallel 1-D complex FFT algorithms for distributed-memory parallel computers. We use the four-step or six-step FFT algorithms to implement the radix-2, 3 and 5 parallel 1-D complex FFT algorithms. In our parallel FFT algorithms, since we use cyclic distribution, all-to-all communication takes place only once. Moreover, the input data and output data are both in natural order. We also show that the suitability of a parallel FFT algorithm is machine-dependent because of the differences in the architecture of the processor elements in distributed-memory parallel computers. Experimental results of 2^p 3^q 5^r-point FFTs on the distributed-memory parallel computers HITACHI SR2201 and IBM SP2 are reported. We achieved performance of about 130 GFLOPS on a 1024-PE HITACHI SR2201 and about 1.25 GFLOPS on a 32-PE IBM SP2.
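The node-local building block of such algorithms is an ordinary complex FFT; a minimal serial radix-2 Cooley-Tukey version is sketched below. The radix-3/5 kernels and the four-step/six-step parallel structure (cyclic distribution, twiddle-factor multiplications, and the single all-to-all that realizes the global transpose) are not shown.

```c
#include <complex.h>
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Serial recursive radix-2 Cooley-Tukey FFT; n must be a power of two. */
static void fft_radix2(double complex *x, int n)
{
    if (n <= 1) return;
    double complex even[n / 2], odd[n / 2];
    for (int i = 0; i < n / 2; i++) { even[i] = x[2 * i]; odd[i] = x[2 * i + 1]; }
    fft_radix2(even, n / 2);
    fft_radix2(odd, n / 2);
    for (int k = 0; k < n / 2; k++) {
        double complex t = cexp(-2.0 * I * M_PI * k / n) * odd[k];
        x[k]         = even[k] + t;
        x[k + n / 2] = even[k] - t;
    }
}

int main(void)
{
    double complex x[8] = {1, 1, 1, 1, 0, 0, 0, 0};
    fft_radix2(x, 8);
    for (int k = 0; k < 8; k++)
        printf("X[%d] = %6.3f %+6.3fi\n", k, creal(x[k]), cimag(x[k]));
    return 0;
}
```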
