期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Global‐view coefficients: a data management solution for parallel quantum Monte Carlo applications

Qingpeng Niu James Dinan Sravya Tirukkovalur Anouar Benali Jeongnim Kim Lubos Mitas Lucas Wagner P. Sadayappan 《Concurrency and Computation》2016,28(13):3655-3671

Quantum Monte Carlo (QMC) applications perform simulation with respect to an initial state of the quantum mechanical system, which is often captured by using a cubic B‐spline basis. This representation is stored as a read‐only table of coefficients and accesses to the table are generated at random as part of the Monte Carlo simulation. Current QMC applications, such as QWalk and QMCPACK, replicate this table at every process or node, which limits scalability because increasing the number of processors does not enable larger systems to be run. We present a partitioned global address space approach to transparently managing this data using Global Arrays in a manner that allows the memory of multiple nodes to be aggregated. We develop an automated data management system that significantly reduces communication overheads, enabling new capabilities for QMC codes. Experimental results with QWalk and QMCPACK demonstrate the effectiveness of the data management system. Copyright © 2016 John Wiley & Sons, Ltd. 相似文献

2.

Low‐level PGAS computing on many‐core processors with TSHMEM

Bryant C. Lam Alan D. George Herman Lam Vikas Aggarwal 《Concurrency and Computation》2015,27(17):5288-5310

Diminishing returns from increased clock frequencies and instruction‐level parallelism have forced computer architects to adopt architectures that exploit wider parallelism through multiple processor cores. While emerging many‐core architectures have progressed at a remarkable rate, concerns arise regarding the performance and productivity of numerous parallel‐programming tools for application development. Development of parallel applications on many‐core processors often requires developers to familiarize themselves with unique characteristics of a target platform while attempting to maximize performance and maintain correctness of their applications. The family of partitioned global address space (PGAS) programming models comprises the current state of the art in balancing performance and programmability. One such PGAS approach is SHMEM, a lightweight, shared‐memory programming library that has demonstrated high performance and productivity potential for parallel‐computing systems with distributed‐memory architectures. In the paper, we present research, design, and analysis of a new SHMEM infrastructure specifically crafted for low‐level PGAS on modern and emerging many‐core processors featuring dozens of cores and more. Our approach (with a new library known as TSHMEM) is investigated and evaluated atop two generations of Tilera architectures, which are among the most sophisticated and scalable many‐core processors to date, and is intended to enable similar libraries atop other architectures now emerging. In developing TSHMEM, we explore design decisions and their impact on parallel performance for the Tilera TILE‐Gx and TILEPro many‐core architectures, and then evaluate the designs and algorithms within TSHMEM through microbenchmarking and applications studies with other communication libraries. Our results with barrier primitives provided by the Tilera libraries show dissimilar performance between the TILE‐Gx and TILEPro; therefore, TSHMEM's barrier design takes an alternative approach and leverages the on‐chip mesh network to provide consistent low‐latency performance. In addition, our experiments with TSHMEM show that naive collective algorithms consistently outperformed linear distributed collective algorithms when executed in an SMP‐centric environment. In leveraging these insights for the design of TSHMEM, our approach outperforms the OpenSHMEM reference implementation, achieves similar to positive performance over OpenMP and OSHMPI atop MPICH, and supports similar libraries in delivering high‐performance parallel computing to emerging many‐core systems. Copyright © 2015 John Wiley & Sons, Ltd. 相似文献

3.

PGAS‐FMM: Implementing a distributed fast multipole method using the X10 programming language

Josh Milthorpe Alistair P. Rendell Thomas Huber 《Concurrency and Computation》2014,26(3):712-727

The fast multipole method (FMM) is a complex, multi‐stage algorithm over a distributed tree data structure, with multiple levels of parallelism and inherent data locality. X10 is a modern partitioned global address space language with support for asynchronous activities. The parallel tasks comprising FMM may be expressed in X10 by using a scalable pattern of activities. This paper demonstrates the use of X10 to implement FMM for simulation of electrostatic interactions between ions in a cyclotron resonance mass spectrometer. X10's task‐parallel model is used to express parallelism by using a pattern of activities mapping directly onto the tree. X10's work stealing runtime handles load balancing fine‐grained parallel activities, avoiding the need for explicit work sharing. The use of global references and active messages to create and synchronize parallel activities over a distributed tree structure is also demonstrated. In contrast to previous simulations of ion trajectories in cyclotron resonance mass spectrometers, our code enables both simulation of realistic particle numbers and guaranteed error bounds. Single‐node performance is comparable with the fastest published FMM implementations, and critical expansion operators are faster for high accuracy calculations. A comparison of parallel and sequential codes shows the overhead of activity management and work stealing in this application is low. Scalability is evaluated for 8k cores on a Blue Gene/Q system and 512 cores on a Nehalem/InfiniBand cluster. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

4.

UPCBLAS: a library for parallel matrix computations in Unified Parallel C

Jorge Gonzlez‐Domínguez María J. Martín Guillermo L. Taboada Juan Tourio Ramn Doallo Damin A. Malln Brian Wibecan 《Concurrency and Computation》2012,24(14):1645-1667

The popularity of Partitioned Global Address Space (PGAS) languages has increased during the last years thanks to their high programmability and performance through an efficient exploitation of data locality, especially on hierarchical architectures such as multicore clusters. This paper describes UPCBLAS, a parallel numerical library for dense matrix computations using the PGAS Unified Parallel C language. The routines developed in UPCBLAS are built on top of sequential basic linear algebra subprograms functions and exploit the particularities of the PGAS paradigm, taking into account data locality in order to achieve a good performance. Furthermore, the routines implement other optimization techniques, several of them by automatically taking into account the hardware characteristics of the underlying systems on which they are executed. The library has been experimentally evaluated on a multicore supercomputer and compared with a message‐passing‐based parallel numerical library, demonstrating good scalability and efficiency. Copyright © 2012 John Wiley & Sons, Ltd. 相似文献

5.

A Computational Science IDE for HPC Systems: Design and Applications

David E. Hudak Neil Ludban Ashok Krishnamurthy Vijay Gadepally Siddharth Samsi John Nehrbass 《International journal of parallel programming》2009,37(1):91-105

相似文献

6.

Audit: A new synchronization API for the GET/PUT protocol

Atsushi Hori Jinpil Lee Mitsuhisa Sato 《Journal of Parallel and Distributed Computing》2012

The GET/PUT protocol is considered an effective communication API for parallel computing. However, the one-sided nature of the GET/PUT protocol lacks synchronization functionality for the target process. To date, several techniques have been proposed to tackle this problem. The APIs suggested thus far have failed to hide implementation details of the synchronization functionality. In this paper, a new synchronization API for the GET/PUT protocol is proposed. The central idea here is to associate synchronization flags with the GET/PUT memory regions. Using this technique, synchronization flags are hidden from users, and they are freed from managing the associations between the memory regions and the synchronization flags. The proposed API, named Audit, does not incur additional programming and thus enables natural parallel programming. The evaluations show that Audit exhibits better performance compared to the Notify API proposed in ARMCI. 相似文献