1.

Adaptive checkpointing strategy to tolerate faults in economy based grid (total citations: 3; self-citations: 2; citations by others: 1)

Babar Nazir Kalim Qureshi Paul Manuel 《The Journal of supercomputing》2009,50(1):1-18

In this paper, we develop a fault tolerant job scheduling strategy in order to tolerate faults gracefully in an economy based grid environment. We propose a novel adaptive task checkpointing based fault tolerant job scheduling strategy for an economy based grid. The proposed strategy maintains a fault index of grid resources. It dynamically updates the fault index based on successful or unsuccessful completion of an assigned task. Whenever a grid resource broker has tasks to schedule on grid resources, it makes use of the fault index from the fault tolerant schedule manager in addition to using a time optimization heuristic. While scheduling a grid job on a grid resource, the resource broker uses the fault index to apply different intensities of task checkpointing (inserting checkpoints in a task at different intervals). To simulate and evaluate the performance of the proposed strategy, this paper enhances the GridSim Toolkit-4.0 to exhibit fault tolerance related behavior. We also compare the “checkpointing fault tolerant job scheduling strategy” with the well-known time optimization heuristic in an economy based grid environment. From the measured results, we conclude that even in the presence of faults, the proposed strategy effectively schedules grid jobs, tolerating faults gracefully, and executes more jobs successfully within the specified deadline and allotted budget. It also improves the overall execution time and minimizes the execution cost of grid jobs.
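
The fault-index idea in this abstract can be illustrated with a minimal sketch. The update rule (increment on failure, decrement on success) and the linear mapping from fault index to number of checkpoints are assumptions made here for illustration; the abstract does not give the actual formulas, and the names `ResourceRecord` and `checkpoint_interval` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ResourceRecord:
    """Tracks the fault index of one grid resource (hypothetical structure)."""
    name: str
    fault_index: int = 0

    def report(self, succeeded: bool) -> None:
        # Assumed rule: a failure raises the index by one,
        # a success lowers it by one but never below zero.
        if succeeded:
            self.fault_index = max(0, self.fault_index - 1)
        else:
            self.fault_index += 1


def checkpoint_interval(task_length: float, fault_index: int,
                        base_checkpoints: int = 1) -> float:
    """Return the spacing between checkpoints for a task.

    Assumption: the number of checkpoints grows linearly with the fault
    index, so a more failure-prone resource checkpoints more often.
    """
    checkpoints = base_checkpoints + fault_index
    return task_length / (checkpoints + 1)
```

With these assumptions, a resource that has failed twice and then succeeded once has a fault index of 1, and a task of length 90 on a resource with fault index 2 is checkpointed every 22.5 time units.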

2.

Fault-tolerant grid architecture and practice (total citations: 10; self-citations: 0; citations by others: 10)

金海 邹德清 陈汉华 孙建华 吴松 《计算机科学技术学报》2003,18(4):0-0

Grid computing has emerged as an effective technology to couple geographically distributed resources and solve large-scale computational problems in wide area networks. Fault tolerance is a significant and complex issue in grid computing systems. Various techniques have been investigated to detect and correct faults in distributed computing systems. Unreliable fault detection is one of the most effective techniques. Globus, as a grid middleware, manages resources in a wide area network. The Globus fault detection service uses the well-known techniques based on unreliable fault detectors to detect and report component failures. However, more powerful techniques are required to detect and correct both system-level and application-level faults in a grid system, and a convenient toolkit is also needed to maintain consistency in the grid. A fault-tolerant grid platform (FTGP) based on an unreliable fault detector and the Globus fault detection service is presented in this paper. The platform offers effective strategies in three aspects: grid key components, user tasks, and high-level applications.
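
An unreliable fault detector is commonly built from heartbeats and a timeout: a node that stays silent for too long is suspected, although it may only be slow. The sketch below shows that general idea only. It is not the Globus fault detection service or the FTGP implementation, and the class name `HeartbeatDetector` and its methods are hypothetical.

```python
from typing import Dict, Set


class HeartbeatDetector:
    """Timeout-based detector; it can wrongly suspect a slow but live node."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        # Record the time of the most recent heartbeat from this node.
        self.last_seen[node] = now

    def suspected(self, now: float) -> Set[str]:
        # A node is suspected when its last heartbeat is older than the timeout.
        return {node for node, seen in self.last_seen.items()
                if now - seen > self.timeout}
```

A suspicion is withdrawn as soon as a new heartbeat arrives, which is what makes the detector "unreliable" rather than wrong: it trades occasional false suspicions for the ability to make progress without perfect knowledge.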

3.

Replication based fault tolerant job scheduling strategy for economy driven grid

Babar Nazir Kalim Qureshi Paul Manuel 《The Journal of supercomputing》2012,62(2):855-873

In this paper, the problem of fault tolerance in grid computing is addressed and a novel adaptive task replication based fault tolerant job scheduling strategy for economy driven grid is proposed. The proposed strategy maintains a fault history of the resources, termed the resource fault index. The fault index entry for a resource is updated based on successful completion or failure of an assigned task by the grid resource. The Grid Resource Broker then replicates the task (submitting the same task to different backup resources) with an intensity that depends on the resource's vulnerability to faults, as indicated by the resource fault index. Consequently, if a fault occurs at a resource, the results of the replicated task(s) on other backup resource(s) can be used. Hence, user job(s) can be completed within the specified deadline and assigned budget, even in the event of faults at the grid resource(s). Through extensive simulations, the performance of the proposed strategy is evaluated and compared with the Time Optimization and Checkpointing based Strategy in an economy driven grid environment. The experimental results demonstrate that in the presence of faults, the proposed fault tolerant strategy improves the number of tasks completed with varied deadline and fixed budget as well as the number of tasks completed with varied budget and fixed deadline. Additionally, the proposed strategy uses a smaller percentage of the deadline time compared with both the Time Optimization and Checkpointing based Strategy. Although the proposed strategy spends a greater percentage of the budget than the Time Optimization Strategy and the Checkpointing based Strategy, this is acceptable under time optimization, where the main objective is to maximize tasks completed within a given deadline. It can be concluded from the experiments that the proposed strategy shows improvement in satisfying the user QoS requirements. It can effectively schedule tasks and tolerate faults gracefully even in the presence of failures, but the costs are slightly higher in terms of budget consumption. Hence, the proposed fault tolerant strategy helps in sustaining the user's faith in the grid, by enabling the grid to deliver reliable and consistent performance in the presence of faults.
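
Replication intensity driven by a fault index can be sketched as follows. The mapping (one extra copy per fault-index step, capped at a maximum) and the choice of the least fault-prone resources as backups are assumptions for illustration; the abstract does not specify them, and `replica_count` and `pick_resources` are hypothetical names.

```python
from typing import Dict, List


def replica_count(fault_index: int, max_replicas: int = 3) -> int:
    """Assumed mapping: one copy, plus one per fault-index step, capped."""
    return min(1 + max(0, fault_index), max_replicas)


def pick_resources(fault_indexes: Dict[str, int], primary: str,
                   max_replicas: int = 3) -> List[str]:
    """Return the primary resource followed by the chosen backup resources."""
    copies = replica_count(fault_indexes[primary], max_replicas)
    # Prefer backups with the lowest fault index; break ties by name.
    backups = sorted((r for r in fault_indexes if r != primary),
                     key=lambda r: (fault_indexes[r], r))
    return [primary] + backups[:copies - 1]
```

Under these assumptions a task on a resource with fault index 0 runs without replicas, which keeps budget consumption low for reliable resources and spends extra budget only where faults are more likely.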

4.

Survey of fault tolerant techniques for grid

S. Siva Sathya K. Syam Babu 《Computer Science Review》2010,4(2):101-120

Besides the dynamic nature of grids, which means that resources may enter and leave the grid at any time, in many cases outside of the applications’ control, grid resources are also heterogeneous in nature. Many grid applications will be running in environments where interaction faults are more likely to occur between disparate grid nodes. As resources may also be used outside of organizational boundaries, it becomes increasingly difficult to guarantee that a resource being used is not malicious. Due to the diverse faults and failure conditions, developing, deploying, and executing long running applications over the grid remains a challenge. So fault tolerance is an essential factor for grid computing. This paper presents an extensive survey of different fault tolerant techniques such as replication strategies, check-pointing mechanisms, scheduling policies, failure detection mechanisms and finally malleability and migration support for divide-and-conquer applications. These techniques are used according to the needs of the computational grid and the type of environment, resources, virtual organizations and job profile it is supposed to work with. Each has its own merits and demerits, which form the subject matter of this survey.

5.

Simulation design of rollback-recovery fault tolerance for distributed computing systems

董奇 黄斌 颜耀 李韦韦 曾玮妮 张恒 《计算机与现代化》2017,(4):48

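To address the problem of verifying and evaluating rollback-recovery fault tolerance in distributed computing systems, a simulation mechanism for rollback-recovery fault tolerance algorithms is designed. Based on the overall architecture of rollback-recovery fault tolerance in distributed computing systems, the task processes of the nodes are modeled with discrete-event simulation; a module supporting multi-task rollback-recovery fault tolerance simulation is added to the application layer of a network system simulation tool; and the structure, functional modules, and system parameter settings used for the simulation are designed. The results show that the proposed simulation mechanism can simulate and verify rollback-recovery fault tolerance algorithms for distributed computing systems, providing a reference for comparing, improving, and optimizing different fault tolerance algorithms.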

6.

A hybrid fault tolerance technique in grid computing system (total citations: 1; self-citations: 0; citations by others: 1)

Kalim Qureshi Fiaz Gul Khan Paul Manuel Babar Nazir 《The Journal of supercomputing》2011,56(1):106-128

In order to achieve a high level of reliability and availability, the grid infrastructure should be foolproof and fault tolerant. Fault tolerance plays a key role in ensuring the availability and reliability of a grid system. Since the failure of resources affects job execution fatally, a fault tolerance service is essential to satisfy QoS requirements in grid computing.

7.

Managing and Optimizing Bioinformatics Workflows for Data Analysis in Clouds

Vincent C. Emeakaroha Michael Maurer Patrick Stern Paweł P. Łabaj Ivona Brandic David P. Kreil 《Journal of Grid Computing》2013,11(3):407-428

8.

Performance evaluation of cloud service considering fault recovery

Bo Yang Feng Tan Yuan-Shun Dai 《The Journal of supercomputing》2013,65(1):426-444

Cloud computing is a recent trend in IT, which has attracted lots of attention. In cloud computing, service reliability and service performance are two important issues. To improve cloud service reliability, fault tolerance techniques such as fault recovery may be used, which in turn has an impact on cloud service performance. Such impact deserves detailed research. Although there exists some research on cloud/grid service reliability and performance, very little of it addresses the issues of fault recovery and its impact on service performance. In this paper, we conduct detailed research on performance evaluation of cloud service considering fault recovery. We consider recovery on both processing nodes and communication links. The commonly adopted assumption of Poisson arrivals of users’ service requests is relaxed, and the interarrival times of service requests can take an arbitrary probability distribution. The precedence constraints of subtasks are also considered. The probability distribution of service response time is derived, and a numerical example is presented. The proposed cloud performance evaluation models and methods could yield results which are realistic, and thus are of practical value for related decision making in cloud computing.

9.

Overhead Analysis of Scientific Workflows in Grid Environments

Prodan R. Fahringer T. 《Parallel and Distributed Systems, IEEE Transactions on》2008,19(3):378-393

Scientific workflows are a topic of great interest in the grid community that sees in the workflow model an attractive paradigm for programming distributed wide-area grid infrastructures. Traditionally, the grid workflow execution is approached as a pure best effort scheduling problem that maps the activities onto the grid processors based on appropriate optimization or local matchmaking heuristics such that the overall execution time is minimized. Even though such heuristics often deliver effective results, the execution in dynamic and unpredictable grid environments is prone to severe performance losses that must be understood for minimizing the completion time or for the efficient use of high-performance resources. In this paper, we propose a new systematic approach to help the scientists and middleware developers understand the most severe sources of performance losses that occur when executing scientific workflows in dynamic grid environments. We introduce an ideal model for the lowest execution time that can be achieved by a workflow and explain the difference to the real measured grid execution time based on a hierarchy of performance overheads for grid computing. We describe how to systematically measure and compute the overheads from individual activities to larger workflow regions and adjust well-known parallel processing metrics to the scope of grid computing, including speedup and efficiency. We present a distributed online tool for computing and analyzing the performance overheads in real time based on event correlation techniques and introduce several performance contracts as quality-of-service parameters to be enforced during the workflow execution beyond traditional best effort practices. We illustrate our method through postmortem and online performance analysis of two real-world workflow applications executed in the Austrian grid environment.
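
The metrics named in this abstract can be illustrated with their classical parallel-processing definitions. The sketch below uses the textbook forms of speedup and efficiency and treats total overhead as measured time minus ideal time; the grid-specific adjustments and the overhead hierarchy from the paper are not reproduced here, and the function names are chosen for illustration.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Classical speedup: sequential time divided by parallel time."""
    return t_sequential / t_parallel


def efficiency(t_sequential: float, t_parallel: float, processors: int) -> float:
    """Classical efficiency: speedup per processor used."""
    return speedup(t_sequential, t_parallel) / processors


def total_overhead(t_measured: float, t_ideal: float) -> float:
    """Time lost relative to an ideal (lowest achievable) execution time."""
    return t_measured - t_ideal
```

For example, a workflow that takes 100 s sequentially and 25 s on 8 processors has a speedup of 4 and an efficiency of 0.5; if the ideal model predicts 25 s and the measured run takes 30 s, the total overhead to be explained is 5 s.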

10.

Software-based fault tolerance techniques for transient faults

徐建军 谭庆平 熊荫乔 谭兰芳 李建立 《计算机工程与科学》2011,33(11):132-139

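Transient faults caused by cosmic-ray radiation have long been one of the main challenges facing aerospace computing. With the continuing progress of integrated-circuit manufacturing processes, the performance of modern processors has improved greatly, but their dependability is increasingly threatened by transient faults. Current fault tolerance techniques for transient faults fall roughly into two categories: hardware-based and software-based. Compared with the former, the latter has attracted much attention because of its advantages in implementation cost and flexibility. This paper first outlines...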

11.

Design and implementation of a fault-tolerant system for distributed computing clusters

万玮 杨志义 《计算机工程与设计》2005,26(10):2811-2813,2816

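To improve the reliability of distributed computing cluster systems and strengthen their fault tolerance, so that a system continues to run stably and normally when local errors occur, a fault-tolerant system model is established. The model adopts a two-level fault tolerance mechanism: node-level fault tolerance and task-level fault tolerance. This model lays a foundation for further research on fault tolerance in distributed computing cluster systems.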

12.

A resource management and fault tolerance services in grid computing

《Journal of Parallel and Distributed Computing》2005,65(11):1305-1317

In grid computing, resource management and fault tolerance services are important issues. The availability of the selected resources for job execution is a primary factor that determines the computing performance. In this paper, we propose a resource manager for optimal resource selection. Our resource manager automatically selects the set of optimal resources among candidate resources that achieves optimal performance using a genetic algorithm. Typically, the probability of a failure is higher in grid computing than in traditional parallel computing, and the failure of resources affects job execution fatally. Therefore, a fault tolerance service is essential in computational grids. Grid services are also often expected to meet some minimum levels of Quality of Service (QoS) for a desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the definition of failures from the conventional notion of failures in distributed systems in order to provide a fault tolerance service that deals with various types of resource failures, which include process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures and (3) the fault manager guarantees that the submitted jobs complete and the performance of job execution is improved due to job migration even if some failures occur.

13.

Early fault detection in IaaS cloud computing based on fuzzy logic and prediction technique

Dinh-Mao Bui Thien Huynh-The Sungyoung Lee 《The Journal of supercomputing》2018,74(11):5730-5745

Availability is one of the most important requirements in a production system. Keeping a persistent level of high availability in Infrastructure-as-a-Service (IaaS) cloud computing is a challenge due to the complexity of service provision. By definition, availability can be maintained by coupling it with fault tolerance approaches. Recently, many fault tolerance methods have been developed, but few of them adequately consider the fault detection aspect, which is critical for issuing the appropriate recovery actions just in time. In this paper, based on a rigorous analysis of the nature of failures, we introduce a method for early identification of the faults occurring in an IaaS system. By engaging a fuzzy logic algorithm and a prediction technique, the proposed approach can provide better performance in terms of accuracy and reaction rate, which subsequently enhances the system reliability.

14.

Research on an adaptive checkpointing mechanism for workflow systems

桑莉莉《计算机应用与软件》2010,27(3):139-141

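Fault tolerance has become a hot research topic in workflow systems, and setting checkpoints is a commonly used fault tolerance method. An adaptive checkpointing mechanism is proposed for workflow systems. By optimizing the number of checkpoints and dynamically setting the checkpoint interval, the mechanism greatly increases the proportion of tasks that complete on time when errors occur, and experiments verify that it outperforms the traditional checkpointing mechanism.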

15.

Graceful degradation in algorithm-based fault tolerant multiprocessor systems

Yajnik S. Jha N.K. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(2):137-153

Algorithm-based fault tolerance (ABFT) is a technique which improves the reliability of a multiprocessor system by providing concurrent error detection and fault location capability to it. It encodes data at the system level and modifies the algorithm to operate on the encoded data in order to expose both transient and permanent faults in any processor. Work done till now in this area takes care of only the fault detection and location part of the problem. However, if spare processors are not available, then after a faulty processor has been located, the work initially assigned to it has to be mapped to some nonfaulty processors in the system in such a way that the fault tolerance capability of the system is still maintained with as small a degradation in performance as possible. In this paper, we propose an integrated deterministic solution to the above problem which combines concurrent error detection and fault location with graceful degradation. There exists no previous deterministic ABFT method for the design of general t-fault locating systems, even for the case of t=1. We propose a general method for designing one-fault locating/s-fault detecting systems. We use an extended model for representing ABFT systems. This model considers the processors computing the checks to be a part of the ABFT system, so that faults in the check computing processors can also be detected and located using a simple diagnosis algorithm, and the checks can be mapped to other nonfaulty processors in the system.
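
The phrase "encodes data ... and modifies the algorithm to operate on the encoded data" can be made concrete with the classic checksum-matrix example for matrix multiplication. This is a generic ABFT illustration, not the design method proposed in this paper; the function names are hypothetical, and integer matrices are used so that the checks are exact.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def with_column_checksum(a):
    """Append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row the sum of that row."""
    return [row + [sum(row)] for row in b]


def locate_fault(c):
    """Return (row, col) of a single corrupted data element, else None."""
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None
```

Multiplying a column-checksum matrix by a row-checksum matrix yields a product that carries both kinds of checksum, so a single wrong data element shows up as exactly one inconsistent row and one inconsistent column, which together give its location.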

16.

Design and implementation of a SCSI-based storage system evaluation platform

温东新 高清娥 张展 钱军 陈宇龙 张中兆 《计算机工程》2012,38(5):47-49,55

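To address the fault tolerance of mass storage systems, a SCSI disk I/O fault model is defined, and a storage system evaluation platform based on the SCSI protocol is designed and implemented. Using the interface functions provided by the SCSI middle layer, the platform intercepts upper-layer SCSI commands and modifies them to inject a variety of simulated faults. Experiments compare the application-level performance of the system before and during faults. The results show that the evaluation platform exhibits different fault tolerance capabilities under different faults and can be used to measure the technical indicators of different storage systems.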

17.

Design and implementation of language-level fault tolerance mechanisms in Parallel C

何王全 方燕飞 魏迪 董恩铭 漆锋滨 《计算机工程与应用》2018,54(17):41-49

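Large-scale heterogeneous many-core computer systems offer strong computing power and a high performance-to-power ratio, and have become the direction of development for supercomputers. However, their complex heterogeneous architecture and enormous scale also pose great challenges to system availability, so research on lightweight fault tolerance techniques for large-scale heterogeneous many-core systems is of great significance. To address the excessive overhead of traditional checkpoint-based system-level fault tolerance, language-supported fault tolerance mechanisms are designed and implemented in the Parallel C language: lightweight degradation with local fault awareness, and checkpointing with compiler directives and automatic analysis. These mechanisms balance usability and efficiency. The lightweight degradation with local fault awareness is implemented in combination with a dynamic task scheduling framework, supports many-core systems, and scales to a parallelism of more than one million. The checkpointing with compiler directives and automatic analysis relies on simple compiler directives inserted by the programmer; the compiler performs the analysis and identifies data that need not be saved, which effectively reduces the amount of data saved and restored. Test data from the Sunway TaihuLight supercomputer show that both fault tolerance measures perform well relative to traditional fault tolerance methods: the fault tolerance overhead of lightweight degradation is less than 1%; compared with the traditional rollback fault tolerance method, the execution time for a single fault is reduced by more than 3.5%; and in typical applications the checkpointing with compiler directives and automatic analysis reduces the amount of saved data to as little as 1/10, which makes it highly practical.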

18.

Design of a hierarchical grid task scheduler based on the OGSA grid (total citations: 1; self-citations: 0; citations by others: 1)

邓宾《电脑与信息技术》2012,20(1):52-55

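Based on the requirements and characteristics of grid task scheduling, and after a thorough analysis of the general grid task scheduling process, this article also takes into account characteristics of the grid computing environment, such as its essential features of virtualization, hierarchy, and autonomy, as well as the resource dependence, coarse granularity, and repeated execution of grid tasks that arise when workflow tasks must collaborate. On this basis, an improved master-slave hierarchical scheduling model for grid workflow tasks is designed, and the scheduling strategy and an implementation of the scheduling algorithm are given. The scheduler model has achieved good results in a real grid workflow task collaboration system.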

19.

Optimizing and Tuning MapReduce Jobs to Improve the Large‐Scale Data Analysis Process

Wichian Premchaiswadi Walisa Romsaiyud 《International Journal of Intelligent Systems》2013,28(2):185-200

Data‐intensive applications process large volumes of data using a parallel processing method. MapReduce is a programming model designed for data‐intensive applications for massive data sets and an execution framework for large‐scale data processing on clusters of commodity servers. Although fault tolerance, an easy programming structure, and high scalability are considered strong points of MapReduce, its configuration parameters must be fine‐tuned to the specific deployment, which makes configuration and performance tuning more complex. This paper explains tuning of the Hadoop configuration parameters, which directly affect MapReduce's job workflow performance under various conditions, to achieve maximum performance. On the basis of the empirical data we collected, it became apparent that three main methodologies can affect the execution time of MapReduce running on cluster systems. Therefore, in this paper, we present a model that consists of three main modules: (1) extending a data redistribution technique in order to find the high‐performance nodes, (2) utilizing the number of map/reduce slots in order to make it more efficient in terms of execution time, and (3) developing a new hybrid routing schedule shuffle phase in order to define the scheduler task while memory management level is reduced.

20.

Grid computing middleware for high-performance computing (total citations: 1; self-citations: 0; citations by others: 1)

何冰 张武 邵伟民 《计算机应用》2004,24(3):19-21

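The architecture and working principles of the grid computing middleware Netsolve are studied systematically, and its load balancing and fault tolerance strategies are examined in depth. Improvements to the load balancing and fault tolerance strategies are then proposed for the application environment of the 自强-2000 (Ziqiang-2000) cluster supercomputer.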