Similar literature
 Found 20 similar articles; search took 390 ms
1.
The mapping of tasks of a parallel program onto nodes of a parallel computing system has a remarkable impact on application performance. In this paper we propose an optimization framework to solve the mapping problem, which takes into account the communication matrix of the application and a cost matrix that depends on the topology of the parallel system. This cost matrix is usually a distance matrix (the classic approach), but we propose a novel definition of the cost criterion, applicable to torus networks, that tries to distribute traffic evenly over the different axes; we call this the Traffic Distribution criterion. As the mapping problem can be seen as a particular instance of the Quadratic Assignment Problem (QAP), we can apply any QAP solver to this problem. In particular, we use a greedy randomized algorithm. Using simulation, we test the performance levels of the optimization-based mappings, and compare them with those of trivial mappings (consecutive, random), in two different environments: single application (one application uses all system resources all the time) and space sharing (several applications run simultaneously, on different system partitions), using systems with 2D and 3D topologies and real application traffic. Experimental results show that some applications do not benefit from optimization-based mappings: those in which there is a match between virtual and physical topologies, and those that carry out massive all-to-all communications. In other cases, optimization-based mappings with the TD criterion provide excellent performance levels.
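The QAP view of the mapping problem described in this abstract can be illustrated with a small sketch. It is not the paper's solver: it uses the classic distance matrix rather than the Traffic Distribution criterion, and the function names, the `candidates` parameter, and the 2x2 torus with a toy communication matrix are all assumptions made for illustration.

```python
import random

def torus_distance(a, b, dims):
    """Hop count between coordinates a and b on a torus with the given dimensions."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def mapping_cost(comm, cost, mapping):
    """QAP objective: sum over task pairs of traffic times cost between their nodes."""
    n = len(comm)
    return sum(comm[i][j] * cost[mapping[i]][mapping[j]]
               for i in range(n) for j in range(n))

def greedy_randomized_mapping(comm, cost, rng, candidates=2):
    """Place tasks one by one, picking at random among the cheapest free nodes."""
    n = len(comm)
    # Tasks with the largest total traffic are placed first.
    volume = lambda t: sum(comm[t]) + sum(row[t] for row in comm)
    order = sorted(range(n), key=volume, reverse=True)
    free = list(range(n))
    placed = {}
    for task in order:
        def added_cost(node):
            # Cost this placement adds with respect to tasks already placed.
            return sum(comm[task][other] * cost[node][where]
                       + comm[other][task] * cost[where][node]
                       for other, where in placed.items())
        ranked = sorted(free, key=added_cost)
        node = rng.choice(ranked[:candidates])
        placed[task] = node
        free.remove(node)
    return [placed[t] for t in range(n)]

# Hypothetical example: four tasks on a 2x2 torus, two heavily communicating pairs.
dims = (2, 2)
coords = [(x, y) for x in range(dims[0]) for y in range(dims[1])]
cost = [[torus_distance(a, b, dims) for b in coords] for a in coords]
comm = [[0, 10, 0, 0], [10, 0, 0, 0], [0, 0, 0, 10], [0, 0, 10, 0]]
mapping = greedy_randomized_mapping(comm, cost, random.Random(0))
print(mapping, mapping_cost(comm, cost, mapping))
```

Setting `candidates=1` makes the construction purely greedy; larger values trade some solution quality per run for diversity across repeated runs.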

2.
Group communication is widely used by emerging network applications such as telecommunication, video conferencing, simulation, and distributed and other interactive systems. Secure group communication plays a vital role in providing the integrity, authenticity, confidentiality, and availability of messages delivered among group members, whether they communicate between groups or within a group. In secure group communication, the time cost of updating keys when members join and depart is an important aspect of quality of service, particularly in large groups with highly active membership. Hence, the paper aims to achieve better cost and time efficiency through an improved DC multicast routing protocol, which is used to expose the path between the nodes participating in the group communication. During this process, each node constructs an adaptive Ptolemy decision tree for the purpose of generating the contributory key. Each node holds three keys, which are exchanged between the nodes to derive the group key for secure and cost-efficient group communication. The rekeying process is performed when a member leaves or joins the group. The performance of the novel approach is measured on factors such as computational and communication cost, the rekeying process, and group formation. The study concludes that the technique reduces the computational and communication cost of secure group communication compared with other existing methods.

3.
Heterogeneous networks of workstations and/or personal computers (NOW) are increasingly used as a powerful platform for the execution of parallel applications. When applications previously developed for traditional parallel machines (homogeneous and dedicated) are ported to NOWs, performance worsens owing in part to less efficient communications but more often to load imbalance.

In this paper, we address the problem of efficiently porting to heterogeneous NOWs data-parallel applications originally developed using the SPMD paradigm for homogeneous parallel systems with a regular topology such as a ring.

To achieve good performance, the computation time on the various machines composing the NOW must be as balanced as possible. This can be obtained in two ways: by using a heterogeneous data partition strategy with a single process per node, or by splitting data homogeneously among processes and assigning to each node a number of processes proportional to its computing power. The first method is however more difficult, since some modifications in the code are always needed, whereas the second approach requires very few changes.

We carry out a simplified but reliable analysis, and propose a simple model able to simulate performance in the various situations. Two test cases, matrix multiplication and computation of long-range interactions, are considered, obtaining a good agreement between simulated and experimental results.

Our analysis shows that an efficient porting of regular homogeneous data-parallel applications on heterogeneous NOWs is possible. In particular, the approach based on multiple processes per node turns out to be a straightforward and effective way of achieving very satisfactory performance in almost all situations, even when dealing with highly heterogeneous systems.
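The second balancing strategy in this abstract, assigning each node a number of equally sized processes proportional to its computing power, can be sketched as follows. The largest-remainder rounding and the function name `processes_per_node` are assumptions for illustration; the abstract does not say how the authors round the proportional shares.

```python
def processes_per_node(powers, total_processes):
    """Split total_processes among nodes in proportion to their relative power."""
    total_power = sum(powers)
    shares = [p * total_processes / total_power for p in powers]
    counts = [int(s) for s in shares]  # round every share down first
    leftover = total_processes - sum(counts)
    # Hand the remaining processes to the nodes with the largest fractional parts.
    by_remainder = sorted(range(len(powers)),
                          key=lambda i: shares[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts

# Hypothetical NOW with three machines of relative power 1, 2 and 4.
print(processes_per_node([1.0, 2.0, 4.0], 14))
```

With equal work per process, the estimated compute time of node i is proportional to `counts[i] / powers[i]`, so the closer these ratios are to each other, the better the balance.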

4.
Advanced Switching (AS) is an open-standard fabric-interconnect technology that is built over the same physical and link layers as PCI Express technology. Moreover, it includes an optimized transaction layer to enable essential communication capabilities, including protocol encapsulation, peer-to-peer communications, mechanisms to provide quality of service (QoS), enhanced fail-over, high availability, multicast communications, and congestion and system management. In this paper, we propose a strategy to use the AS resources that provides good performance and QoS support at a low cost. When the system is considered as a whole rather than each element being taken separately, it is possible to use only two virtual channels (VCs) at the switches to provide a service like that with many more VCs. As a result, we obtain a noticeable reduction of silicon area and arbitration time. Our proposal is fully compatible with the AS specification and permits us to provide adequate performance both for typical multimedia applications and for best-effort traffic.

5.
This paper presents a systematic design methodology for fuzzy observer-based secure communications of chaotic systems with guaranteed robust performance. Takagi-Sugeno fuzzy models are given to exactly represent chaotic systems. Then, the general fuzzy model of many well-known chaotic systems is constructed with only one premise variable in the fuzzy rules and the same premise variable in the system output. Based on this general model, the fuzzy observer of the chaotic system is given, which reduces the stability condition to a linear matrix inequality problem. When the fuzzy observer-based design is applied to secure communications, the robust performance is presented by simultaneously considering the effects of parameter mismatch and external disturbances. Then, the error of the recovered message is stated in an H∞ criterion. In addition, if the communication system is free of external disturbances, asymptotic recovery of the message is obtained in the same framework. The main results also hold for applications to chaotic synchronization. Numerical simulations illustrate that the proposed scheme yields robust performance.

6.
Scheduling large-scale applications in heterogeneous distributed computing systems is a fundamental NP-complete problem that is critical to obtaining good performance and execution cost. In this paper, we address the scheduling problem of an important class of large-scale Grid applications inspired by the real world, characterized by a huge number of homogeneous, concurrent, and computationally intensive tasks that are the main sources of performance, cost, and storage bottlenecks. We propose a new formulation of this problem based on a cooperative distributed game-theory-based method applied using three algorithms with low time complexity for optimizing three important metrics in scientific computing: execution time, economic cost, and storage requirements. We present comprehensive experiments using simulation and real-world applications that demonstrate the effectiveness of our approach in terms of time and fairness compared to other related algorithms.

7.
Automatic performance debugging of parallel applications includes two main steps: locating performance bottlenecks and uncovering their root causes for performance optimization. Previous work fails to resolve this challenging issue in two ways: first, several previous efforts automate locating bottlenecks, but present results in a confined way that only identifies performance problems with a priori knowledge; second, several tools take exploratory or confirmatory data analysis to automatically discover relevant performance data relationships, but these efforts do not focus on locating performance bottlenecks or uncovering their root causes. The single program, multiple data (SPMD) programming model is widely used for both high performance computing and Cloud computing. In this paper, we design and implement an innovative system, AutoAnalyzer, that automates the process of debugging performance problems of SPMD-style parallel programs, including data collection, performance behavior analysis, locating bottlenecks, and uncovering their root causes. AutoAnalyzer is unique in terms of two features: first, without any prior knowledge, it automatically locates bottlenecks and uncovers their root causes for performance optimization; second, it is lightweight in terms of the size of performance data to be collected and analyzed.
Our contributions are three-fold: first, we propose two effective clustering algorithms to investigate the existence of performance bottlenecks that cause process behavior dissimilarity or code region behavior disparity, respectively; meanwhile, we present two searching algorithms to locate bottlenecks; second, on the basis of rough set theory, we propose an innovative approach to automatically uncover root causes of bottlenecks; third, on cluster systems with two different configurations, we use two production applications, written in Fortran 77, and one open source code, MPIBZIP2 (http://compression.ca/mpibzip2/), written in C++, to verify the effectiveness and correctness of our methods. For the three applications, we also propose an experimental approach to investigating the effects of different metrics on locating bottlenecks.

8.
Fault-tolerant scheduling is an imperative step for large-scale computational Grid systems, as geographically distributed nodes often co-operate to execute a task. By and large, the primary-backup approach is a common methodology for fault tolerance, wherein each task has a primary and a backup on two different processors. In this paper, we address the problem of how to schedule DAGs in Grids with communication delays so that service failures can be avoided in the presence of processor faults. The challenge is that, as tasks in a DAG depend on each other, a task must be scheduled so that it will succeed when any of its predecessors fails due to a processor failure. We first propose a communication model and determine when communications between a backup and the backups of its successors are necessary. Then we determine when a backup can start and its eligible processors so as to guarantee that every DAG can complete upon any processor failure. We develop two algorithms to schedule backups, which minimize response time and replication cost, respectively. We also develop a suboptimal algorithm which targets minimizing replication cost while not affecting response time. We conduct extensive simulation experiments to quantify the performance of the proposed algorithms.

9.
Modern GPUs (Graphics Processing Units) offer very high computing power at relatively low cost. To take advantage of their computing resources and develop efficient implementations, it is essential to have some knowledge of the architecture and memory hierarchy. In this paper, we use the FFT (Fast Fourier Transform) as a benchmark tool to analyze different aspects of GPU architectures, such as the influence of the memory access pattern or the impact of register pressure. The FFT is a good tool for performance analysis because it is used in many digital signal processing applications and has a good balance between computational cost and memory bandwidth requirements.

10.
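Zheng Qilong (郑启龙)  Wang Rui (汪睿)  Zhou Huan (周寰) 《计算机应用》 (Journal of Computer Applications) 2011,31(6):1453-1457
Large-scale clusters have entered the multicore era, and multicore architectures place new demands on parallel computing. The Message Passing Interface (MPI) is the most commonly used parallel programming model, and collective communication is an important part of MPI. Research on efficient collective communication algorithms therefore plays an important role in improving the efficiency of parallel computing. The KD60 platform is a domestically built teraflops multicore cluster based on the first Chinese-made multicore chip, the Loongson 3. The paper first analyzes the architectural characteristics of the KD60 multicore cluster and the hierarchical nature of communication under a multicore architecture; it then analyzes the implementation principles of the existing collective communication algorithms and their shortcomings; finally, taking broadcast as an example, it builds on the existing algorithm with an improved algorithm based on the chip multiprocessor (CMP) architecture that changes the original communication pattern, and applies architecture-specific optimizations that take the characteristics of the KD60 experimental platform into account. Experimental results show that the improved algorithm makes good use of the multicore structure and improves the performance of the collective broadcast algorithm.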
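The hierarchical nature of communication on a multicore cluster can be illustrated by comparing a flat binomial-tree broadcast with a two-level one that first broadcasts among one leader rank per node and then inside each node. This is only a sketch of the general idea, not the paper's improved algorithm; the block placement of ranks (rank // cores_per_node gives the node) and all function names are assumptions.

```python
def binomial_broadcast(ranks):
    """Return the (sender, receiver) pairs of a binomial-tree broadcast from ranks[0]."""
    sends = []
    step = 1
    while step < len(ranks):
        # Every rank that already holds the data forwards it one step further.
        for i in range(step):
            if i + step < len(ranks):
                sends.append((ranks[i], ranks[i + step]))
        step *= 2
    return sends

def hierarchical_broadcast(num_nodes, cores_per_node):
    """Broadcast among node leaders first, then inside every node."""
    leaders = [node * cores_per_node for node in range(num_nodes)]
    sends = binomial_broadcast(leaders)
    for leader in leaders:
        local_ranks = list(range(leader, leader + cores_per_node))
        sends += binomial_broadcast(local_ranks)
    return sends

def inter_node_messages(sends, cores_per_node):
    """Count messages whose sender and receiver sit on different nodes."""
    return sum(1 for s, r in sends if s // cores_per_node != r // cores_per_node)

# Hypothetical cluster: 4 nodes with 4 cores each.
flat = binomial_broadcast(list(range(16)))
two_level = hierarchical_broadcast(4, 4)
print(inter_node_messages(flat, 4), inter_node_messages(two_level, 4))
```

Both schedules send the same total number of messages; the two-level variant only changes how many of them cross the network instead of staying inside a node.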

11.
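Wireless cluster-tree networks can support real-time, deterministic communication, but in practical deployments the relative positions of devices can cause inter-cluster communication collisions, which degrade the real-time performance of the system. Based on the IEEE 802.15.4 standard, this paper systematically studies collisions in real-time communication in cluster-tree networks and proposes an algorithm that avoids inter-cluster superframe collisions. Analysis of a concrete example shows that the algorithm resolves collisions in real-time communication in wireless cluster-tree networks and can improve the communication performance of the whole network.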

12.
Exascale computing is one of the major challenges of this decade, and several studies have shown that communications are becoming one of the bottlenecks for scaling parallel applications. Analyzing the characteristics of communications can help improve the performance of scientific applications. In this paper, we focus on the statistical regularity in time-dimension communication characteristics for representative scientific applications on supercomputer systems, and then prove that the distribution of communication-event intervals has a power-law decay, which is common in scientific interests and human activities. We verify that the distribution of communication-event intervals does have a power-law decay on the Tianhe-2 supercomputer, and also on six other parallel systems with three different network topologies and two routing policies. In order to do a quantitative study on the power-law distribution, we exploit two groups of statistics: bursty vs. memory and periodicity vs. dispersion. Our results indicate that the communication events show a “strong-bursty and weak-memory” characteristic and that the communication-event intervals show periodicity and dispersion. Finally, our research provides an insight into the relationship between communication optimizations and time-dimension communication characteristics.

13.
Mobile robotics development provides an excellent opportunity to experiment with various architectural solutions for distributed real-time systems. This is because of the increasing complexity of sensor and actuator hardware, and the interaction between intelligent features and real-time constraints. Currently, hybrid control structures seem to be the most widespread method of control. This paper describes a communications scenario resulting from hybrid structures. The YAIR robot and its communication infrastructure are described by addressing the control problems found and the solutions adopted. This paper presents a case study of implementing a hybrid communications system using the CAN bus. The worst-case message delay analysis for this bus is also presented, as well as the structure of identifiers defining its semantic possibilities. The deliberative part of the communication system is an object bus developed on TCP/IP protocol networks. The programming interface at this level takes the form of a distributed blackboard with extended properties such as a bind-notification mechanism and a temporal register recording the temporal firewall of information supplied. The overlap between both communication systems is a gateway service performing bi-directional mirroring over a set of CAN identifiers. Finally, a system test is presented. The test emphasises the intra-level gateway for validating performance and time expressiveness.

14.
MPJ Express is a messaging system that allows application developers to parallelize their compute-intensive sequential Java codes on High Performance Computing clusters and multicore processors. In this paper, we extend MPJ Express software to provide two new communication devices. The first device, called hybrid, enables MPJ Express to exploit hybrid parallelism on clusters of multicore processors by sitting on top of existing shared memory and network communication devices. The second device, called native, uses JNI wrappers to interface MPJ Express with native MPI implementations like MPICH and Open MPI. We evaluate the performance of these devices on a range of interconnects including 1G/10G Ethernet, 10G Myrinet and 40G InfiniBand. In addition, we analyze and evaluate the cost of the MPJ Express buffering layer and compare it with the performance numbers of other Java MPI libraries. Our performance evaluation reveals that the native device allows MPJ Express to achieve performance comparable to native MPI libraries, for latency and bandwidth of point-to-point and collective communications, which is a significant gain in performance compared to existing communication devices. The hybrid communication device, without any modifications at application level, also helps parallel applications achieve better speedups and scalability by exploiting multicore architecture. Our performance evaluation quantifies the cost incurred by buffering and its impact on overall performance of software. We observed comparable performance, as both new devices improve application performance and achieve up to 90% of the theoretical bandwidth available without application rewriting effort, including NAS Parallel Benchmarks, point-to-point and collective communication.

15.
Distributed classification in large-scale P2P networks has gained relevance in recent years and supports applications such as distributed intrusion detection in P2P monitoring environments, online match-making, personalized information retrieval, distributed document classification in a P2P media repository and P2P recommender systems, to mention a few. However, classification in a P2P network is a challenging task due to constraints such as the infeasibility of centralizing data, scarce communication bandwidth, scalability, synchronization and peer dynamism. Moreover, because they do not consider the data distributions and topological scenarios of real-world P2P systems, most existing distributed classification approaches fall short in predictive and network-cost performance. In this paper, we investigate a collaborative classification method (TRedSVM) based on Support Vector Machines (SVM) in scale-free P2P networks. In particular, we demonstrate how to construct an SVM classifier in real-world P2P networks, which exhibit an inherently skewed distribution of node links and eventually data. The proposed method propagates the most influential instances of SVM models to the vast majority of scarcely connected peers in a controlled way that improves their local classification accuracy and, at the same time, keeps the communication cost low throughout the network. Besides using benchmark Machine Learning data sets for extensive experimental evaluations, we have evaluated the proposed method particularly for music genre classification to exhibit its performance in a real application scenario. Additionally, performance analysis is carried out with respect to centralized approaches, data replication in P2P networks and the cost-accuracy trade-off. TRedSVM outperforms baseline approaches of model propagation by improving the overall classification performance substantially at the cost of a tolerable increase in communication.

16.
MMX technology extension to the Intel architecture   Total citations: 2 (self-citations: 0, citations by others: 2)
Peleg  A. Weiser  U. 《Micro, IEEE》1996,16(4):42-50
Designed to accelerate multimedia and communications software, MMX technology improves performance by introducing data types and instructions to the IA that exploit the parallelism in these applications. MMX technology extends the Intel architecture (IA) to improve the performance of multimedia, communications, and other numeric-intensive applications. It uses a SIMD (single-instruction, multiple-data) technique to exploit the parallelism inherent in many algorithms, producing full application performance of 1.5 to 2 times faster than the same applications run on the same processor without MMX. The extension also maintains full compatibility with existing IA microprocessors, operating systems, and applications while providing new instructions and data types that applications can use to achieve a higher level of performance on the host CPU.

17.
In this paper, we investigate the use of hierarchical reinforcement learning (HRL) to speed up the acquisition of cooperative multi-agent tasks. We introduce a hierarchical multi-agent reinforcement learning (RL) framework, and propose a hierarchical multi-agent RL algorithm called Cooperative HRL. In this framework, agents are cooperative and homogeneous (use the same task decomposition). Learning is decentralized, with each agent learning three interrelated skills: how to perform each individual subtask, the order in which to carry them out, and how to coordinate with other agents. We define cooperative subtasks to be those subtasks in which coordination among agents significantly improves the performance of the overall task. Those levels of the hierarchy which include cooperative subtasks are called cooperation levels. A fundamental property of the proposed approach is that it allows agents to learn coordination faster by sharing information at the level of cooperative subtasks, rather than attempting to learn coordination at the level of primitive actions. We study the empirical performance of the Cooperative HRL algorithm using two testbeds: a simulated two-robot trash collection task, and a larger four-agent automated guided vehicle (AGV) scheduling problem. We compare the performance and speed of Cooperative HRL with other learning algorithms, as well as several well-known industrial AGV heuristics. We also address the issue of rational communication behavior among autonomous agents in this paper. The goal is for agents to learn both action and communication policies that together optimize the task given a communication cost. We extend the multi-agent HRL framework to include communication decisions and propose a cooperative multi-agent HRL algorithm called COM-Cooperative HRL. In this algorithm, we add a communication level to the hierarchical decomposition of the problem below each cooperation level. 
Before an agent makes a decision at a cooperative subtask, it decides if it is worthwhile to perform a communication action. A communication action has a certain cost and provides the agent with the actions selected by the other agents at a cooperation level. We demonstrate the efficiency of the COM-Cooperative HRL algorithm as well as the relation between the communication cost and the learned communication policy using a multi-agent taxi problem.

18.
Presents the MAP1000A, an alternative to using custom ASICs for each multimedia-processing task. It is a single-chip, programmable mediaprocessor that also makes use of general-purpose RISC processing and a view framework. It provides a new programmable infrastructure with the cost, performance, and power characteristics suitable for replacing RISCs and ASICs in consumer electronics, communications, and imaging applications while retaining a completely high-level-language programming approach. This single-chip mediaprocessor handles all digital functions in high-level-language software with significantly improved performance and without increased system cost or development complexity.

19.
Advances in IC fabrication technology, coupled with aggressive circuit design, have led to exponential growth of IC speed and integration levels. For these improvements to benefit overall system performance, the communication bandwidth between systems and ICs must scale accordingly. Currently, communication links in various applications approach Gbps data rates. These applications include computer-to-peripheral connections, local area networks, memory buses, and multiprocessor interconnection networks. Designers are concerned that these links will soon reach the fundamental limits of electrical signaling. In this article, we examine the limitations of CMOS implementations of high-speed links and show that the links' performance should continue to scale with technology. Handling the interconnects' finite bandwidth, however, requires more sophisticated signaling methods. CMOS circuits, typically slower than circuits implemented in nonmainstream technologies, are particularly attractive for common applications because of their lower cost. The overall system cost is further reduced when signaling components are implemented as macro cells, integrated on the same die with a microprocessor or signal processing block.

20.
In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and programming effort required to construct reliable application software. In our model fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface including message-passing and file I/O primitives, which hides from the application level the complexity of both fault-tolerant network communications and the checkpointing and recovery of user files. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.
