期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Protocol Customization for Improving MPI Performance on RDMA-Enabled Clusters

Zheng Gu Matthew Small Xin Yuan Aniruddha Marathe David K. Lowenthal 《International journal of parallel programming》2013,41(5):682-703

Optimizing Message Passing Interface (MPI) point-to-point communication for large messages is of paramount importance since most communications in MPI applications are performed by such operations. Remote Direct Memory Access (RDMA) allows one-sided data transfer and provides great flexibility in the design of efficient communication protocols for large messages. However, achieving high point-to-point communication performance on RDMA-enabled clusters is challenging due to both the complexity in communication protocols and the impact of the protocol invocation scenario on the performance of a given protocol. In this work, we analyze existing protocols and show that they are not ideal in many situations, and propose to use protocol customization, that is, different protocols for different situations to improve MPI performance. More specifically, by leveraging the RDMA capability, we develop a set of protocols that can provide high performance for all protocol invocation scenarios. Armed with this set of protocols that can collectively achieve high performance in all situations, we demonstrate the potential of protocol customization by developing a trace-driven toolkit that allows the appropriate protocol to be selected for each communication in an MPI application to maximize performance. We evaluate the performance of the proposed techniques using micro-benchmarks and application benchmarks. The results indicate that protocol customization can out-perform traditional communication schemes by a large degree in many situations. 相似文献

2.

MPICH2-CMEX:可扩展消息传递接口实现技术

下载免费PDF全文

谢旻卢宇彤周恩强《计算机工程与应用》2008,44(2):123-125

在大规模并行计算系统中,为了更有效地利用系统的并行性,实现一个高性能、可扩展的MPI系统是非常重要的。CMEX是无连接模式的用户级通讯软件接口,提供了高性能的报文传输和RDMA通讯操作,MPICH2-CMEX是基于CMEX的MPI实现,结合RDMA读和RDMA写通讯操作的特性,MPICH2-CMEX实现了多种数据传输通道,并利用并行应用的近邻通讯模式,实现了混合通道数据传输方法,实际的应用测试表明,MPICH2-CMEX系统具有良好的性能和可扩展性。相似文献

3.

High Performance RDMA-Based MPI Implementation over InfiniBand 总被引：5，自引：0，他引：5

Jiuxing Liu Jiesheng Wu Dhabaleswar K. Panda 《International journal of parallel programming》2004,32(3):167-198

Although InfiniBand Architecture is relatively new in the high performance computing area, it offers many features which help us to improve the performance of communication subsystems. One of these features is Remote Direct Memory Access (RDMA) operations. In this paper, we propose a new design of MPI over InfiniBand which brings the benefit of RDMA to not only large messages, but also small and control messages. We also achieve better scalability by exploiting application communication pattern and combining send/receive operations with RDMA operations. Our RDMA-based MPI implementation achieves a latency of 6.8 sec for small messages and a peak bandwidth of 871 million bytes/sec. Performance evaluation shows that for small messages, our RDMA-based design can reduce the latency by 24%, increase the bandwidth by over 104%, and reduce the host overhead by up to 22% compared with the original design. For large data transfers, we improve performance by reducing the time for transferring control messages. We have also shown that our new design is beneficial to MPI collective communication and NAS Parallel Benchmarks. 相似文献

4.

Low‐latency Java communication devices on RDMA‐enabled networks

Roberto R. Expsito Guillermo L. Taboada Sabela Ramos Juan Tourio Ramn Doallo 《Concurrency and Computation》2015,27(17):4852-4879

Providing high‐performance inter‐node communication is a key capability for running high performance computing applications efficiently on parallel architectures. In fact, current systems deployments are aggregating a significant number of cores interconnected via advanced networking hardware with Remote Direct Memory Access (RDMA) mechanisms, that enable zero‐copy and kernel‐bypass features. The use of Java for parallel programming is becoming more promising thanks to some useful characteristics of this language, particularly its built‐in multithreading support, portability, easy‐to‐learn properties, and high productivity, along with the continuous increase in the performance of the Java virtual machine. However, current parallel Java applications generally suffer from inefficient communication middleware, mainly based on protocols with high communication overhead that do not take full advantage of RDMA‐enabled networks. This paper presents efficient low‐level Java communication devices that overcome these constraints by fully exploiting the underlying RDMA hardware, providing low‐latency and high‐bandwidth communications for parallel Java applications. The performance evaluation conducted on representative RDMA networks and parallel systems has shown significant point‐to‐point performance increases compared with previous Java communication middleware, allowing to obtain up to 40% improvement in application‐level performance on 4096 cores of a Cray XE6 supercomputer. Copyright © 2015 John Wiley & Sons, Ltd. 相似文献

5.

RGraph:基于RDMA的高效分布式图数据处理系统

崔鹏杰袁野李岑浩张灿王国仁《软件学报》2022,33(3):1018-1042

图是描述实体间关系的重要数据结构,被广泛地应用于信息科学、物理学、生物学、环境生态学等重要的科学领域.现如今,随着图数据规模的不断增大,利用分布式系统来处理大图数据已经成为主流,出现了形如Pregel、GraphX、PowerGraph和Gemini等经典的分布式大图数据处理系统.然而,与当前先进的基于单机的图处理系统... 相似文献

6.

基于RDMA操作的MPI-2单边通信的设计与实现

江海昇范辉《计算机应用》2006,26(3):550-0552

MPI-2单边通信存在很高的通信开销以及对通信进程中远程进程的依赖。为此提出了在InfiniBand体系结构上的高性能MPI-2单边通信设计方法。其中,MPI-2单边通信操作,比如MPI_Put, MPI_Get以及MPI_Accumulate将对应于InfiniBand远程直接内存访问(Remote Direct Memory Access, RDMA)操作。设计是基于MPICH2的在InfiniBand上的应用,可以很好地实现通信和计算的重叠处理。相似文献

7.

一种基于NIC的RDMA可靠传输协议的设计与实现

夏军庞征斌刘路张峻常俊胜《计算机工程与科学》2014,36(2):216-221

高性能计算机不断增长的规模和复杂性使得可靠性成为影响高性能计算机系统可用性的关键因素,系统互连网络是高性能计算机的重要组成部分,其可靠性是高性能计算机系统设计必须考虑的重要问题。针对高性能计算机系统互连网络可能出现的故障,提出一种基于NIC实现的RDMA可靠传输协议,给出了一种通用的设计实现方案,并对该方案的几种具体优化设计实现方法进行了讨论。提出的可靠传输协议及实现方案能容忍系统互连网络可能出现的多种网络故障,并能尽量减少实现可靠传输所带来的额外开销。实验结果表明,所提出的RDMA可靠传输的实际测试性能与无连接RDMA传输相当。相似文献

8.

Evaluating InfiniBand performance with PCI Express

Jiuxing Liu Mamidala A. Vishnu V. Panda D.K. 《Micro, IEEE》2005,25(1):20-29

The InfiniBand architecture is an industry standard that offers low latency and high bandwidth as well as advanced features such as remote direct memory access (RDMA), atomic operations, multicast, and quality of service. InfiniBand products can achieve a latency of several microseconds for small messages and a bandwidth of 700 to 900 Mbytes/s. As a result, it is becoming increasingly popular as a high-speed interconnect technology for building high-performance clusters. The Peripheral Component Interconnect (PCI) has been the standard local-I/O-bus technology for the last 10 years. However, more applications require lower latency and higher bandwidth than what a PCI bus can provide. As an extension, PCI-X offers higher peak performance and efficiency. InfiniBand host channel adapters (HCAs) with PCI Express achieve 20 to 30 percent lower latency for small messages compared with HCAs using 64-bit, 133-MHz PCI-X interfaces. PCI Express also improves performance at the MPI level, achieving a latency of 4.1/spl mu/s for small messages. It can also improve MPI collective communication and bandwidth-bound MPI application performance. 相似文献

9.

天河-1A互连系统的接口设计

刘路张磊谢旻王永庆《计算机工程与科学》2013,35(2):18-25

NIC是高性能互连网络THNet的网络接口芯片,基于自主研发的通信协议,它高效地实现了无连接、零拷贝、用户级通信的RDMA传输机制,基于该机制的MPI实现具有极高的系统可扩展性。实现了基于控制报文触发的描述符队列处理机制,以支持卸载的聚合通信,包括广播和栅栏同步。使用NIC芯片的网络接口卡在测试中获得了1.57μs的最小单边延迟和6.34GB/s的带宽。NIC已成功应用于2010年TOP500排名世界第一的天河-1A超级计算机。相似文献

10.

基于RDMA的Rapidl0用户态通信接口实现

冀映辉张建东蔡炜蔡惠智《计算机科学》2010,37(6):293-296

作为一款高性能的嵌入式互联协议,RapidIO支持RDMA操作以获得高性能.目前,针对RapidIO通信接口只有以太网模拟器,这种实现机制限制了RapidIO通信性能的发挥.参考国内外基于RDMA的通信协议实现方法,并结合RapidIO互联协议的特点,提出了一套基于RDMA技术的RapidIO用户态通信接口实现方法.在此基础上,验证了通信接口的性能并对实现方案进行了多种优化.经比较,实现的RapidIO通信接口数据吞吐量是目前所有已知的RapidIO通信接口中最高的. 相似文献

11.

An adaptive extension library for improving collective communication operations

O. Hartmann M. Kühnemann T. Rauber G. Rünger 《Concurrency and Computation》2008,20(10):1173-1194

In this paper, we present an adaptive extension library that combines the advantage of using a portable MPI library with the ability to optimize the performance of specific collective communication operations. The extension library is built on top of MPI and can be used with any MPI library. Using the extension library, performance improvements can be achieved by an orthogonal organization of the processors in 2D or 3D meshes and by decomposing the collective communication operations into several consecutive phases of MPI communication. Additional point‐to‐point‐based algorithms are also provided. The extension library works in two steps, an a priori configuration phase detecting possible improvements for implementing collective communication for the MPI library used and an execution phase selecting a better implementation during execution time. This allows an adaptation of the performance of MPI programs to a specific execution platform and communication situation. The experimental evaluation shows that significant performance improvements can be obtained for different MPI libraries by using the library extension for collective MPI communication operations in isolation as well as in the context of application programs. Copyright © 2007 John Wiley & Sons, Ltd. 相似文献

12.

T-NBC:透明的MPI非阻塞集合操作

李强孙凝晖霍志刚马捷《计算机学报》2011,34(11):2052-2063

在不修改应用程序的前提下,在MPI通信库中将阻塞的集合操作转化为非阻塞的实现可以将集合通信与紧跟在集合操作之后的计算重叠起来,从而提高应用的性能.在应用中,集合操作之后的计算包括集合通信无关的计算和集合通信相关的计算两类.集合通信可以与前者很好地重叠;由于后者需要访问通信数据,与后者的重叠和集合通信中多个集合子消息的通... 相似文献

13.

An efficient design for fast memory registration in RDMA

Li Ou Xubin He Jizhong Han 《Journal of Network and Computer Applications》2009,32(3):642-651

Remote Direct Memory Access (RDMA) improves network bandwidth and reduces latency by eliminating unnecessary copies from network interface card to application buffers, but the communication buffer management to reduce memory registration and deregistration cost is a significant challenge to be addressed. Previous studies use pin-down cache and batched deregistration, but only simple LRU is used as a replacement algorithm to manage cache space. In this paper, we evaluate the cost of memory registration in both user and kernel spaces. Based on our analysis, we reduce the overhead of communication buffer management in two aspects simultaneously: utilize a Memory Registration Region Cache (MRRC), and optimize the RDMA communication process of clients and servers with Fast RDMA Read and Write Process (FRRWP). MRRC manages memory in terms of memory region, and replaces old memory regions according to both their sizes and recency. FRRWP overlaps memory registrations between a client and a server, and allows applications to submit RDMA write operations without being blocked by message synchronization. We compare the performance of MRRC and FRRWP with traditional RDMA operations. The results show that our new design improves the total cost of memory registrations and overall communication latency by up to 70%. 相似文献

14.

VI architecture communication features and performance on the Giganet cluster LAN

《Future Generation Computer Systems》2002,18(3):421-433

The virtual interface (VI) architecture standard was developed to satisfy the need for a high throughput, low latency communication system required for cluster computing. VI architecture aims to close the performance gap between the bandwidths and latencies provided by the communication hardware and visible to the application, respectively, by minimizing the software overhead on the critical path of the communication. This paper presents the results of a performance study of one VI architecture hardware implementation, the Giganet cLAN (cluster LAN). The focus of the study is to assess and compare the performance of different VI architecture data transfer modes and specific features that are available to higher-level communication software like MPI in order to aid the implementor to decide which VI architecture options to employ for various communication scenarios. Examples of such options include the use of send/receive vs. RDMA data transfers, polling vs. blocking to check completion of communication operations, multiple VIs, completion queues and scatter capabilities of VI architecture. 相似文献

15.

Application-oriented ping-pong benchmarking: how to assess the real communication overheads

Timo Schneider Robert Gerstenberger Torsten Hoefler 《Computing》2014,96(4):279-292

Moving data between processes has often been discussed as one of the major bottlenecks in parallel computing—there is a large body of research, striving to improve communication latency and bandwidth on different networks, measured with ping-pong benchmarks of different message sizes. In practice, the data to be communicated generally originates from application data structures and needs to be serialized before communicating it over serial network channels. This serialization is often done by explicitly copying the data to communication buffers. The message passing interface (MPI) standard defines derived datatypes to allow zero-copy formulations of non-contiguous data access patterns. However, many applications still choose to implement manual pack/unpack loops, partly because they are more efficient than some MPI implementations. MPI implementers on the other hand do not have good benchmarks that represent important application access patterns. We demonstrate that the data serialization can consume up to 80 % of the total communication overhead for important applications. This indicates that most of the current research on optimizing serial network transfer times may be targeted at the smaller fraction of the communication overhead. To support the scientific community, we extracted the send/recv-buffer access patterns of a representative set of scientific applications to build a benchmark that includes serialization and communication of application data and thus reflects all communication overheads. This can be used like traditional ping-pong benchmarks to determine the holistic communication latency and bandwidth as observed by an application. It supports serialization loops in C and Fortran as well as MPI datatypes for representative application access patterns. Our benchmark, consisting of seven micro-applications, unveils significant performance discrepancies between the MPI datatype implementations of state of the art MPI implementations. Our micro-applications aim to provide a standard benchmark for MPI datatype implementations to guide optimizations similarly to the established benchmarks SPEC CPU and Livermore Loops. 相似文献

16.

Design of scalable Java message-passing communications over InfiniBand

Roberto R. Expósito Guillermo L. Taboada Juan Touri?o Ramón Doallo 《The Journal of supercomputing》2012,61(1):141-165

This paper presents ibvdev a scalable and efficient low-level Java message-passing communication device over InfiniBand. The continuous increase in the number of cores per processor underscores the need for efficient communication support for parallel solutions. Moreover, current system deployments are aggregating a significant number of cores through advanced network technologies, such as InfiniBand, increasing the complexity of communication protocols, especially when dealing with hybrid shared/distributed memory architectures such as clusters. Here, Java represents an attractive choice for the development of communication middleware for these systems, as it provides built-in networking and multithreading support. As the gap between Java and compiled languages performance has been narrowing for the last years, Java is an emerging option for High Performance Computing (HPC). The developed communication middleware ibvdev increases Java applications performance on clusters of multicore processors interconnected via InfiniBand through: (1) providing Java with direct access to InfiniBand using InfiniBand Verbs API, somewhat restricted so far to MPI libraries; (2) implementing an efficient and scalable communication protocol which obtains start-up latencies and bandwidths similar to MPI performance results; and (3) allowing its integration in any Java parallel and distributed application. In fact, it has been successfully integrated in the Java messaging library MPJ Express. The experimental evaluation of this middleware on an InfiniBand cluster of multicore processors has shown significant point-to-point performance benefits, up to 85% start-up latency reduction and twice the bandwidth compared to previous Java middleware on InfiniBand. Additionally, the impact of ibvdev on message-passing collective operations is significant, achieving up to one order of magnitude performance increases compared to previous Java solutions, especially when combined with multithreading. Finally, the efficiency of this middleware, which is even competitive with MPI in terms of performance, increments the scalability of communications intensive Java HPC applications. 相似文献

17.

Towards Scalable Java HPC with Hybrid and Native Communication Devices in MPJ Express

Ansar Javed Bibrak Qamar Mohsan Jameel Aamir Shafi Bryan Carpenter 《International journal of parallel programming》2016,44(6):1142-1172

MPJ Express is a messaging system that allows application developers to parallelize their compute-intensive sequential Java codes on High Performance Computing clusters and multicore processors. In this paper, we extend MPJ Express software to provide two new communication devices. The first device—called hybrid—enables MPJ Express to exploit hybrid parallelism on cluster of multicore processors by sitting on top of existing shared memory and network communication devices. The second device—called native—uses JNI wrappers in interfacing MPJ Express to native MPI implementations like MPICH and Open MPI. We evaluate performance of these devices on a range of interconnects including 1G/10G Ethernet, 10G Myrinet and 40G InfiniBand. In addition, we analyze and evaluate the cost of MPJ Express buffering layer and compare it with the performance numbers of other Java MPI libraries. Our performance evaluation reveals that the native device allows MPJ Express to achieve comparable performance to native MPI libraries—for latency and bandwidth of point-to-point and collective communications—which is a significant gain in performance compared to existing communication devices. The hybrid communication device—without any modifications at application level—also helps parallel applications achieve better speedups and scalability by exploiting multicore architecture. Our performance evaluation quantifies the cost incurred by buffering and its impact on overall performance of software. We witnessed comparative performance as both new devices improve application performance and achieve upto 90 % of the theoretical bandwidth available without application rewriting effort—including NAS Parallel Benchmarks, point-to-point and collective communication. 相似文献

18.

Planning for performance: Enhancing achievable performance for MPI through persistent collective operations

《Parallel Computing》2019

Advantages of nonblocking collective communication in MPI have been established over the past quarter century, even predating MPI-1. For regular computations with fixed communication patterns, significant additional optimizations can be revealed through the use of persistence (planned transfers) not currently available in the MPI-3 API except for a limited form of point-to-point persistence (aka half-channels) standardized since MPI-1. This paper covers the design, prototype implementation of LibPNBC (based on LibNBC), and MPI-4 standardization status of persistent nonblocking collective operations. We provide early performance results, using a modified version of NBCBench and an example application (based on 3D conjugate gradient) illustrating the potential performance enhancements for such operations. Persistent operations enable MPI implementations to make intelligent choices about algorithm and resource utilization once and amortize this decision cost across many uses in a long-running program. Evidence that this approach is of value is provided. As with non-persistent, nonblocking collective operations, the requirement for strong progress and blocking completion notification are jointly needed to maximize the benefit of such operations (e.g., to support overlap of communication with computation and/or other communication). Further enhancement of the current reference implementation, as well as additional opportunities to enhance performance through the application of these new APIs, comprise future work. 相似文献

19.

MVICH—一种基于TH—net VIA的MPICH下的设备层实现

谢超陈渝《计算机科学》2003,30(3):21-23

相似文献

20.

LogGPO: An accurate communication model for performance prediction of MPI programs

WenGuang Chen JiDong Zhai Jin Zhang WeiMin Zheng 《中国科学F辑(英文版)》2009,52(10):1785-1791

Message passing interface (MPI) is the de facto standard in writing parallel scientific applications on distributed memory systems. Performance prediction of MPI programs on current or future parallel systems can help to find system bottleneck or optimize programs. To effectively analyze and predict performance of a large and complex MPI program, an efficient and accurate communication model is highly needed. A series of communication models have been proposed, such as the LogP model family, which assume th... 相似文献