期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A framework for adaptive collective communications for heterogeneous hierarchical computing systems

Luiz Angelo Steffenel Grégory Mounié 《Journal of Computer and System Sciences》2008,74(6):1082-1093

Collective communication operations are widely used in MPI applications and play an important role in their performance. However, the network heterogeneity inherent to grid environments represent a great challenge to develop efficient high performance computing applications. In this work we propose a generic framework based on communication models and adaptive techniques for dealing with collective communication patterns on grid platforms. Toward this goal, we address the hierarchical organization of the grid, selecting the most efficient communication algorithms at each network level. Our framework is also adaptive to grid load dynamics since it considers transient network characteristics for dividing the nodes into clusters. Our experiments with the broadcast operation on a real-grid setup indicate that an adaptive framework allows significant performance improvements on MPI collective communications. 相似文献

2.

Transaction-Aware Network-on-Chip Resource Reservation

Li Zheng Zhu Changyun Shang Li Dick Robert Sun Yihe 《Computer Architecture Letters》2008,7(2):53-56

Performance and scalability are critically-important for on-chip interconnect in many-core chip-multiprocessor systems. Packet-switched interconnect fabric, widely viewed as the de facto on-chip data communication backplane in the many-core era, offers high throughput and excellent scalability. However, these benefits come at the price of router latency due to run-time multi-hop data buffering and resource arbitration. The network accounts for a majority of on-chip data transaction latency. In this work, we propose dynamic in-network resource reservation techniques to optimize run-time on-chip data transactions. This idea is motivated by the need to preserve existing abstraction and general-purpose network performance while optimizing for frequently-occurring network events such as data transactions. Experimental studies using multithreaded benchmarks demonstrate that the proposed techniques can reduce on-chip data access latency by 28.4% on average in a 16-node system and 29.2% on average in a 36-node system. 相似文献

3.

Connection resource management for compiler-generated communication

S. HINRICHS 《Concurrency and Computation》1997,9(2):85-112

Traditionally parallel compilers have targeted a standard message passing communication library when generating communication code (e.g. PVM, MPI). The standard message passing model dynamically reserves communication resources for each message. For regular, repeating communication patterns, a static communication resource reservation model can be more efficient. By reserving resources once for many communication exchanges, the communication startup time is better amortized. Plus, with a global view of communication, the static model has a wider choice of routes. While the static resource reservation model can be a more efficient communication target for the compiler, this model reveals the problems of scheduling use of limited communication resources. This paper uses the abstraction of a communication resource to define two resource management problems and presents three algorithms that can be used by the compiler to address these problems. Initial measures of the effectiveness of these algorithms are presented from two programs for an $8 \times 8$ iWarp system. © 1997 by John Wiley & Sons, Ltd. 相似文献

4.

An attribute grammar approach to compiler optimization of intra-module interprocess communication

Carole M. McNamee Ronald A. Olsson 《International journal of parallel programming》1991,20(3):181-202

Earlier work has shown the effectiveness of hand-applied program transformations optimizing high-level interprocess communication mechanisms. This paper describes the static analysis techniques necessary to ensure correct compiler application of the optimizing transformations. These techniques include both dataflow analysis and interprocess analysis. This paper focuses on the analysis of communication mechanisms within program modules; however, the analysis techniques can be generalized to handle inter-module optimization analysis as well. The major contributions of this paper include the application of dataflow analysis and the extension of interprocedural analysis—interprocess analysis—to real concurrent programming languages and, more specifically, to the optimization of interprocess communication and synchronization mechanisms that use both static and dynamic channels. In addition, the use of attribute grammars to perform interprocess analysis is significant. This paper also describes an implementation of both intra-process dataflow and interprocess analysis techniques using attribute grammars.This work was supported by NSF under Grant Number CCR88-10617. 相似文献

5.

StreamTMC: Stream compilation for tiled multi-core architectures

Haitao Wei Mingkang Qin Weiwei Zhang Junqing Yu Dongrui Fan Guang R. Gao 《Journal of Parallel and Distributed Computing》2013

Tiled multi-core architectures have become an important kind of multi-core design for its good scalability and low power consumption. Stream programming has been productively applied to a number of important application domains. It provides an attractive way to exploit the parallelism. However, the architecture characteristics of large amounts of cores, memory hierarchy and exposed communication between tiles have presented a performance challenge for stream programs running on tiled multi-cores. In this paper, we present StreamTMC, an efficient stream compilation framework that optimizes the execution of stream applications for the tiled multi-core. This framework is composed of three optimization phases. First, a software pipelining schedule is constructed to exploit the parallelism. Second, an efficient hybrid of SPM and cache buffer allocation algorithm and data copy elimination mechanism is proposed to improve the efficiency of the data access. Last, a communication aware mapping is proposed to reduce the network communication and synchronization overhead. We implement the StreamTMC compiler on Godson-T, a 64-core tiled architecture and conduct an experimental study to verify the effectiveness. The experimental results indicate that StreamTMC can achieve an average of 58% improvement over the performance before optimization. 相似文献

6.

Toward integrating IP multicasting in internet network management protocols

《Computer Communications》2001,24(5-6):473-485

Supporting multi-point group communications in network management platforms is essential for improving scalability and responsiveness of management applications. With the deployment of IP multicasting as the standard infrastructure for multi-point group communications in the Internet, the integration of IP multicasting in SNMP becomes significantly important to achieve these goals. This paper presents a highly flexible, efficient and easy-to-integrate framework for integrating IP Multicast in standard SNMP agents. The proposed framework enables managers to re-configure the agents’ group membership and the communication model (e.g. one-to-many, many-to-one and many-to-many) dynamically based on the application requirements. This framework exploits the advantages of IP multicasting without requiring any significant changes or performance overhead in the protocol or the agent architecture. The resulting framework can be easily adopted by exiting SNMP agents of various network management platforms. Although the other approaches provide group communications through broker agents in the management platform, integrating IP multicasting in SNMP agents is more efficient and a simpler approach. Our ultimate goal is to promote the integration of IP multicasting as a standard service in SNMP agents. 相似文献

7.

Prediction-Based Power-Performance Adaptation of Multithreaded Scientific Codes

《Parallel and Distributed Systems, IEEE Transactions on》2008,19(10):1396-1410

Computing has recently reached an inflection point with the introduction of multi-core processors. On-chip thread-level parallelism is doubling approximately every other year. Concurrency lends itself naturally to allowing a program to trade performance for power savings by regulating the number of active cores, however in several domains users are unwilling to sacrifice performance to save power. We present a prediction model for identifying energy-efficient operating points of concurrency in well-tuned multithreaded scientific applications, and a runtime system which uses live program analysis to optimize applications dynamically. We describe a dynamic, phase-aware performance prediction model that combines multivariate regression techniques with runtime analysis of data collected from hardware event counters to locate optimal operating points of concurrency. Using our model, we develop a prediction-driven, phase-aware runtime optimization scheme that throttles concurrency so that power consumption can be reduced and performance can be set at the knee of the scalability curve of each program phase. The use of prediction reduces the overhead of searching the optimization space while achieving near-optimal performance and power savings. A thorough evaluation of our approach shows a reduction in power consumption of 10.8% simultaneous with an improvement in performance of 17.9%, resulting in energy savings of 26.7%. 相似文献

8.

Performance modeling of communication and computation in hybrid MPI and OpenMP applications

《Simulation Modelling Practice and Theory》2007,15(4):481-491

Performance evaluation and modeling are crucial steps to enabling the optimization of parallel programs. Programs written using two programming models, such as MPI and OpenMP, require analysis to determine both performance efficiency and the most suitable numbers of processes and threads for their execution on a given platform. To study both of these problems, we propose the construction of a model that is based upon a small number of parameters, but is able to capture the complexity of the runtime system. We incorporate measurements of overheads introduced by each of the programming models, and thus need to model both the network and computational aspects of the system.We have combined two different techniques that includes static analysis, driven by the OpenUH compiler, to retrieve application signatures and a parallelization overhead measurement benchmark, realized by Sphinx and Perfsuite, to collect system profiles. Finally, we propose a performance evaluation measurement to identify communication and computation efficiency. In this paper, we describe our underlying framework, the performance model, and show how our tool can be applied to a sample code. 相似文献

9.

Algorithms for supporting compiled communication

Xin Yuan Melhem R. Gupta R. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(2):107-118

We investigate the compiler algorithms to support compiled communication in multiprocessor environments and study the benefits of compiled communication, assuming that the underlying network is an all-optical time-division-multiplexing (TDM) network. We present an experimental compiler, E-SUIF, that supports compiled communication for High Performance Fortran (HPF) like programs on all-optical TDM networks, and describe and evaluate the compiler algorithms used in E-SUIF. We further demonstrate the effectiveness of compiled communication on all-optical TDM networks by comparing the performance of compiled communication with that of a traditional communication method using a number of application programs. 相似文献

10.

XQuery语言Hotspot编译系统的支撑框架

下载免费PDF全文

张开练廖湖声苏航《计算机工程》2011,37(24):28-31

设计并实现XQuery语言Hotspot编译系统的支撑框架,通过对XQuery程序进行Hotspot分析,将执行频率高的程序模块编译为Java字节码,以提高程序执行效率。实验结果证明,Hotspot编译系统在执行效率上相比解释系统有一定提高,与静态编译系统相比,能更有效地处理网络上动态生成的XQuery查询。相似文献

11.

基于Prim算法的通信网络架设仿真研究与应用

杨成慧殷红孟建军姜虎强《计算机仿真》2007,24(10):144-147,208

为了更好地提高通信网络架设实际问题的工作效率,进行了通信网络架设过程的仿真研究.通过算法的比较选择,对通信网络构架进行了动态规划.以最小代价生成树普里母算法为研究基础,采用数据结构的分析方法进行假设论证.文中结合通信网络构架的实际具体问题,讨论了网络规划中线路权重的选取方法,并在C语言环境下设计了适用于各个城市网络的节点-支路邻接表的数据存储结构.经实例验证,该方法具有计算速度快的优点并有效减少资源浪费,不仅可以保证通信网络架设工作效率,而且可以有效提高通信网络架设经济效益. 相似文献

12.

POEMS: end-to-end performance design of large parallel adaptive computational systems

Adve V.S. Bagrodia R. Browne J.C. Deelman E. Dube A. Houstis E.N. Rice J.R. Sakellariou R. Sundaram-Stukel D.J. Teller P.J. Vernon M.K. 《IEEE transactions on pattern analysis and machine intelligence》2000,26(11):1027-1048

The POEMS project is creating an environment for end-to-end performance modeling of complex parallel and distributed systems, spanning the domains of application software, runtime and operating system software, and hardware architecture. Toward this end, the POEMS framework supports composition of component models from these different domains into an end-to-end system model. This composition can be specified using a generalized graph model of a parallel system, together with interface specifications that carry information about component behaviors and evaluation methods. The POEMS Specification Language compiler will generate an end-to-end system model automatically from such a specification. The components of the target system may be modeled using different modeling paradigms and at various levels of detail. Therefore, evaluation of a POEMS end-to-end system model may require a variety of evaluation tools including specialized equation solvers, queuing network solvers, and discrete event simulators. A single application representation based on static and dynamic task graphs serves as a common workload representation for all these modeling approaches. Sophisticated parallelizing compiler techniques allow this representation to be generated automatically for a given parallel program. POEMS includes a library of predefined analytical and simulation component models of the different domains and a knowledge base that describes performance properties of widely used algorithms. The paper provides an overview of the POEMS methodology and illustrates several of its key components. The modeling capabilities are demonstrated by predicting the performance of alternative configurations of Sweep3D, a benchmark for evaluating wavefront application technologies and high-performance, parallel architectures. 相似文献

13.

Protocol Customization for Improving MPI Performance on RDMA-Enabled Clusters

Zheng Gu Matthew Small Xin Yuan Aniruddha Marathe David K. Lowenthal 《International journal of parallel programming》2013,41(5):682-703

Optimizing Message Passing Interface (MPI) point-to-point communication for large messages is of paramount importance since most communications in MPI applications are performed by such operations. Remote Direct Memory Access (RDMA) allows one-sided data transfer and provides great flexibility in the design of efficient communication protocols for large messages. However, achieving high point-to-point communication performance on RDMA-enabled clusters is challenging due to both the complexity in communication protocols and the impact of the protocol invocation scenario on the performance of a given protocol. In this work, we analyze existing protocols and show that they are not ideal in many situations, and propose to use protocol customization, that is, different protocols for different situations to improve MPI performance. More specifically, by leveraging the RDMA capability, we develop a set of protocols that can provide high performance for all protocol invocation scenarios. Armed with this set of protocols that can collectively achieve high performance in all situations, we demonstrate the potential of protocol customization by developing a trace-driven toolkit that allows the appropriate protocol to be selected for each communication in an MPI application to maximize performance. We evaluate the performance of the proposed techniques using micro-benchmarks and application benchmarks. The results indicate that protocol customization can out-perform traditional communication schemes by a large degree in many situations. 相似文献

14.

The Implementation of a High Performance GPGPU Compiler

Yi Yang Huiyang Zhou 《International journal of parallel programming》2013,41(6):768-781

In this paper we present our experience in developing an optimizing compiler for general purpose computation on graphics processing units (GPGPU) based on the Cetus compiler framework. The input to our compiler is a naïve GPU kernel procedure, which is functionally correct but without any consideration for performance optimization. Our compiler applies a set of optimization techniques to the naive kernel and generates the optimized GPU kernel. Our compiler supports optimizations for GPU kernels using either global memory or texture memory. The implementation of our compiler is facilitated with a source-to-source compiler infrastructure, Cetus. The code transformation in the Cetus compiler framework is called a pass. We classify all the passes used in our work into two categories: functional passes and optimization passes. The functional passes translate input kernels into desired intermediate representation, which clearly represents memory access patterns and thread configurations. A series of optimization passes improve the performance of the kernels by adapting them to the target GPGPU architecture. Our experiments show that the optimized code achieves very high performance, either superior or very close to highly fine-tuned libraries. 相似文献

15.

多机并发系统中通信模型分层抽象的方法

齐微陈平李青山《计算机科学》2006,33(8):243-246

针对多机并发系统的复杂性，为了辅助用户能从多个角度和层次全面地理解并发系统，就需要逆向产生出能够反映软件系统框架结构的高层架构。基于此本文以进程为边界，提出了一种分层抽取多机并发系统通信模型的方法。此方法基于反射和开放编译的植入机制来获取所需要的动态信息，在此基础上运用分层抽象的策略，分别从系统、节点、进程三个层次对多机并发系统的通信结构和设计结构进行逆向恢复，最后对该方法进行系统的实验研究。结果表明，分层抽象所得到并发系统的通信模型能够正确、有效地反映系统设计时的高层架构关系。相似文献

16.

The Design and Implementation of a Domain-Specific Language for Network Performance Testing

Pakin S. 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(10):1436-1449

CONCEPTUAL is a toolset designed specifically to help measure the performance of high-speed interconnection networks such as those used in workstation clusters and parallel computers. It centers around a high-level domain-specific language, which makes it easy for a programmer to express, measure, and report the performance of complex communication patterns. The primary challenge in implementing a compiler for such a language is that the generated code must be extremely efficient so as not to misattribute overhead costs to the messaging library. At the same time, the language itself must not sacrifice expressiveness for compiler efficiency, or there would be little point in using a high-level language for performance testing. This paper describes the CONCEPTUAL language and the CONCEPTUAL compiler's novel code-generation framework. The language provides primitives for a wide variety of idioms needed for performance testing and emphasizes a readable syntax. The core code-generation technique, based on unrolling CONCEPTUAL programs into sequences of communication events, is simple yet enables the efficient implementation of a variety of high-level constructs. The paper further explains how CONCEPTUAL implements time-bounded loops - even those that comprise blocking communication - in the absence of a time-out mechanism as this is a somewhat unique language/implementation feature. 相似文献

17.

Design Alternatives of Multithreaded Architecture

Avi Mendelson Michael Bekerman 《International journal of parallel programming》1999,27(3):161-193

This paper compares two possible implementations of multithreaded architecture and proposes a new architecture combining the flexibility of the first with the low hardware complexity of the second. We present performance and step-by-step complexity analysis of two design alternatives of multithreaded architecture: dynamic inter-thread resource scheduling and static resource allocation. We then introduce a new multithreaded architecture based on a new scheduling mechanism called the semi-static. We show that with two concurrent threads the dynamic scheduling processor achieves from 5 to 45 % higher performance at the cost of much more complicated design. This paper indicates that for a relatively high number of execution resources the complexity of the dynamic scheduling logic will inevitably require design compromises. Moreover, high chip-wide communication time and an incomplete bypassing network will limit the dynamic scheduling and reduce its performance advantage. On the other hand, static scheduling architecture achieves low resource utilization. The semi-static architecture utilizes compiler techniques to exploit patterns of program parallelism and introduces a new hardware mechanism, in order to achieve performance close to dynamic scheduling without significantly increasing the static hardware complexity. The semi-static architecture statically assigns part of the functional units but dynamically schedules the most performance-critical functional units on a medium-grain basis. 相似文献

18.

魂芯DSP上复数类型的支持和优化

王玉林郑启龙赵高义《计算机系统应用》2017,26(9):40-45

魂芯DSP是一款采用VLIW和SIMD架构的针对高性能计算领域而设计的32bit静态标量数字信号处理器.为了满足数字高性能计算的性能要求,魂芯DSP提供了丰富的复数指令,而编译器不能直接利用这些复数指令来提升编译性能.因此针对魂芯DSP芯片提供了大量的复数类操作指令的特点,在传统开源编译器Open64的编译框架基础上进行研究,实现了复数作为编译器基础类型和复数运算操作的支持.同时,通过识别特定的复数类操作的模式利用魂芯DSP上的复数类指令对程序编译优化.实验结果表明,该实现方案在魂芯DSP编译器上对复数程序优化后能够取得平均5.28的加速比. 相似文献

19.

Routing performance enhancement in hierarchical torus network by link-selection algorithm

《Journal of Parallel and Distributed Computing》2005,65(11):1453-1461

A hierarchical torus network (HTN) is a 2D-torus network of multiple basic modules, in which the basic modules are 3D-torus networks that are hierarchically interconnected for higher-level networks. The static network performance of the HTN and its dynamic communication performance using the popular dimension-order routing algorithm have already been evaluated and shown to be superior to the performance of other conventional and hierarchical interconnection networks. In this paper, we propose a link-selection algorithm for efficient use of physical links of the HTN, while keeping the link-selection algorithm as simple as the dimension-order routing algorithm. We also prove that the proposed algorithm for the HTN is deadlock-free using three virtual channels. We evaluate the dynamic communication performance of an HTN using dimension-order routing and link-selection algorithms under various traffic patterns. We find that the dynamic communication performance of an HTN using the link-selection algorithm is better than when the dimension-order routing algorithm is used. 相似文献

20.

Evaluation of Compiler-Controlled Updating to Reduce Coherence-Miss Penalties in Shared-Memory Multiprocessors

《Journal of Parallel and Distributed Computing》1999,56(2):122-143

We consider in this paper the effectiveness of a new approach calledcompiler-controlledupdating to reduce coherence-miss penalties in shared-memory multiprocessors. A key part of the method is a compiler algorithm that identifies the last store instruction to a memory block in a flow graph using classic dataflow analysis techniques. Such stores are marked and replaced by update instructions that at run time make the memory copy clean. Whereas this static method shortens the read-miss latency for actively shared blocks, it can cause useless traffic for shared blocks that are effectively private. We therefore complement the static analysis with a dynamic simple heuristic in the cache coherence protocol aiming at classifying blocks as private or shared at run time. We evaluate the performance effects of compiler-controlled updating using six scientific parallel applications compiled by an optimizing compiler that incorporates our static analysis and then running them on a detailed CC-NUMA architectural simulation model. We have found that the compiler algorithm can convert between 83 and 100% of the dirty misses into clean misses. By adding the private/shared heuristic, the update traffic of private memory blocks can be practically eliminated. Overall, the static analysis in combination with the dynamic heuristic is shown to reduce the execution time by as much as 32%. 相似文献