首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 187 毫秒
1.
Adaptive checkpointing strategy to tolerate faults in economy based grid   总被引:3,自引:2,他引:1  
In this paper, we develop a fault tolerant job scheduling strategy in order to tolerate faults gracefully in an economy based grid environment. We propose a novel adaptive task checkpointing based fault tolerant job scheduling strategy for an economy based grid. The proposed strategy maintains a fault index of grid resources. It dynamically updates the fault index based on successful or unsuccessful completion of an assigned task. Whenever a grid resource broker has tasks to schedule on grid resources, it makes use of the fault index from the fault tolerant schedule manager in addition to using a time optimization heuristic. While scheduling a grid job on a grid resource, the resource broker uses fault index to apply different intensity of task checkpointing (inserting checkpoints in a task at different intervals). To simulate and evaluate the performance of the proposed strategy, this paper enhances the GridSim Toolkit-4.0 to exhibit fault tolerance related behavior. We also compare “checkpointing fault tolerant job scheduling strategy” with the well-known time optimization heuristic in an economy based grid environment. From the measured results, we conclude that even in the presence of faults, the proposed strategy effectively schedules grid jobs tolerating faults gracefully and executes more jobs successfully within the specified deadline and allotted budget. It also improves the overall execution time and minimizes the execution cost of grid jobs.  相似文献   

2.
Fault-tolerant grid architecture and practice   总被引:10,自引:0,他引:10       下载免费PDF全文
Grid computing emerges as effective technologies to couple geographically dis-tributed resources and solve large-scale computational problems in wide area networks. The fault tolerance is a significant and complex issue in grid computing systems. Various techniques have been investigated to detect and correct faults in distributed computing systems. Unreliable fault detection is one of the most effective techniques. Globus as a grid middleware manages resources in a wide area network. The Globns fault detection service uses the well-known techniques basedon unreliable fault detectors to detect and report component failures. However, more powerful techniques are required to detect and correct both system-level and application-level faults in agrid system, and a convenient toolkit is also needed to maintain the consistency in the grid. Afault-tolerant grid platform (FTGP) based on an unreliable fault detector and the Globus faultdetection service is presented in this paper. The platform offers effective strategies in such threeaspects as grid key components, user tasks, and high-level applications.  相似文献   

3.
In this paper, the problem of fault tolerance in grid computing is addressed and a novel adaptive task replication based fault tolerant job scheduling strategy for economy driven grid is proposed. The proposed strategy maintains fault history of the resources termed as resource fault index. Fault index entry for the resource is updated based on successful completion or failure of an assigned task by the grid resource. Grid Resource Broker then replicates the task (submitting the same task to different backup resources) with different intensity, based on vulnerability of resource towards faults suggested by resource fault index. Consequently, in case of possible fault at a resource the results of replicated task(s) on other backup resource(s) can be used. Hence, user job(s) can be completed within specified deadline and assigned budget, even on the event of faults at the grid resource(s). Through extensive simulations, performance of the proposed strategy is evaluated and compared with the Time Optimization and Checkpointing based Strategy in an economy driven grid environment. The experimental results demonstrate that in the presence of faults, proposed fault tolerant strategy improves the number of tasks completed with varied deadline and fixed budget as well as number of tasks completed with varied budget and fixed deadline. Additionally, the proposed strategy used a smaller percentage of deadline time as compare to both Time Optimization and Checkpointing based Strategy. Although the proposed strategy has a percentage of budget spent greater than that of Time Optimization Strategy and Checkpointing based Strategy, it is accepted as a proposed strategy in time optimization where the main objective is to maximize tasks completed within a given deadline. It can be concluded from the experiments that the proposed strategy shows improvement in satisfying the user QoS requirements. It can effectively schedule tasks and tolerate faults gracefully even in the presence of failures, but the costs are slightly higher in terms of budget consumption. Hence, the proposed fault tolerant strategy helps in sustaining user??s faith in the grid, by enabling the grid to deliver reliable and consistent performance in the presence of faults.  相似文献   

4.
Besides the dynamic nature of grids, which means that resources may enter and leave the grid at any time, in many cases outside of the applications’ control, grid resources are also heterogeneous in nature. Many grid applications will be running in environments where interaction faults are more likely to occur between disparate grid nodes. As resources may also be used outside of organizational boundaries, it becomes increasingly difficult to guarantee that a resource being used is not malicious. Due to the diverse faults and failure conditions, developing, deploying, and executing long running applications over the grid remains a challenge. So fault tolerance is an essential factor for grid computing. This paper presents an extensive survey of different fault tolerant techniques such as replication strategies, check-pointing mechanisms, scheduling policies, failure detection mechanisms and finally malleability and migration support for divide-and-conquer applications. These techniques are used according to the needs of the computational grid and the type of environment, resources, virtual organizations and job profile it is supposed to work with. Each has its own merits and demerits which forms the subject matter of this survey.  相似文献   

5.
为了解决分布式计算系统回卷恢复容错的验证评估问题,设计一种分布式计算系统的回卷恢复容错算法的仿真机制,依据分布式计算系统回卷恢复容错的总体架构,将分布式计算系统中的节点任务过程使用离散事件模拟,在网络系统仿真工具的应用层增加支持多任务回卷恢复容错仿真的模块,并设计用于回卷恢复容错仿真的结构、功能模块和系统参数设定。结果表明本文提出的仿真机制能够实现分布式计算系统的回卷恢复容错算法的模拟验证,为不同容错算法间对比、改进与优化提供参照。   相似文献   

6.
A hybrid fault tolerance technique in grid computing system   总被引:1,自引:0,他引:1  
In order to achieve high level of reliability and availability, the grid infrastructure should be a foolproof fault tolerant. Fault tolerance plays a key role in order to assert availability and reliability of a grid system. Since the failure of resources affects job execution fatally, fault tolerance service is essential to satisfy QoS requirement in grid computing.  相似文献   

7.
8.
Cloud computing is a recent trend in IT, which has attracted lots of attention. In cloud computing, service reliability and service performance are two important issues. To improve cloud service reliability, fault tolerance techniques such as fault recovery may be used, which in turn has impact on cloud service performance. Such impact deserves detailed research. Although there exist some researches on cloud/grid service reliability and performance, very few of them addressed the issues of fault recovery and its impact on service performance. In this paper, we conduct detailed research on performance evaluation of cloud service considering fault recovery. We consider recovery on both processing nodes and communication links. The commonly adopted assumption of Poisson arrivals of users’ service requests is relaxed, and the interarrival times of service requests can take arbitrary probability distribution. The precedence constraints of subtasks are also considered. The probability distribution of service response time is derived, and a numerical example is presented. The proposed cloud performance evaluation models and methods could yield results which are realistic, and thus are of practical value for related decision-makings in cloud computing.  相似文献   

9.
Scientific workflows are a topic of great interest in the grid community that sees in the workflow model an attractive paradigm for programming distributed wide-area grid infrastructures. Traditionally, the grid workflow execution is approached as a pure best effort scheduling problem that maps the activities onto the grid processors based on appropriate optimization or local matchmaking heuristics such that the overall execution time is minimized. Even though such heuristics often deliver effective results, the execution in dynamic and unpredictable grid environments is prone to severe performance losses that must be understood for minimizing the completion time or for the efficient use of high-performance resources. In this paper, we propose a new systematic approach to help the scientists and middleware developers understand the most severe sources of performance losses that occur when executing scientific workflows in dynamic grid environments. We introduce an ideal model for the lowest execution time that can be achieved by a workflow and explain the difference to the real measured grid execution time based on a hierarchy of performance overheads for grid computing. We describe how to systematically measure and compute the overheads from individual activities to larger workflow regions and adjust well-known parallel processing metrics to the scope of grid computing, including speedup and efficiency. We present a distributed online tool for computing and analyzing the performance overheads in real time based on event correlation techniques and introduce several performance contracts as quality-of-service parameters to be enforced during the workflow execution beyond traditional best effort practices. We illustrate our method through postmortem and online performance analysis of two real-world workflow applications executed in the Austrian grid environment.  相似文献   

10.
宇宙射线辐射所导致的瞬态故障一直是航天计算面临的最主要挑战之一.而随着集成电路制造工艺的持续进步,现代处理器的性能在大幅度提高的同时,其可信性也正日益面临着瞬态故障的严重威胁.当前针对瞬态故障的容错技术可大致分为两类:基于硬件实现和基于软件实现.相比较前者,后者由于在实现成本和灵活性等方面的优势而备受关注.本文首先概述...  相似文献   

11.
万玮  杨志义 《计算机工程与设计》2005,26(10):2811-2813,2816
为了提高分布式计算集群系统的可靠性,增强系统的容错能力,使系统在局部出错的情况下仍能稳定正常运行,建立了一个容错系统模型,该模型采用两级容错机制即节点级容错和任务级容错。此模型为分布式计算集群系统下的容错的进一步研究建立了基础。  相似文献   

12.
In grid computing, resource management and fault tolerance services are important issues. The availability of the selected resources for job execution is a primary factor that determines the computing performance. In this paper, we propose a resource manager for optimal resource selection. Our resource manager automatically selects the set of optimal resources among candidate resources that achieves optimal performance using a genetic algorithm. Typically, the probability of a failure is higher in the grid computing than in a traditional parallel computing and the failure of resources affects job execution fatally. Therefore, a fault tolerance service is essential in computational grids. And grid services are often expected to meet some minimum levels of Quality of Service (QoS) for a desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the definition of failures from the conventional notion of failures in distribute systems in order to provide a fault tolerance service that deals with various types of resource failures, which include process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures and (3) the fault manager guarantees that the submitted jobs complete and the performance of job execution is improved due to job migration even if some failures occur.  相似文献   

13.
Availability is one of the most important requirements in production system. Keeping a persistent level of high availability in the Infrastructure-as-a-Service (IaaS) cloud computing is a challenge due to the complexity of service providing. By definition, the availability can be maintained by coupling with the fault tolerance approaches. Recently, many fault tolerance methods have been developed, but few of them adequately consider the fault detection aspect, which is critical to issue the appropriate recovery actions just in time. In this paper, based on a rigorous analysis on the nature of failures, we would like to introduce a method to early identify the faults occurring in the IaaS system. By engaging fuzzy logic algorithm and prediction technique, the proposed approach can provide better performance in terms of accuracy and reaction rate, which subsequently enhances the system reliability.  相似文献   

14.
容错技术已经成为工作流的研究热点,设置检查点是一种常用的容错方法。针对工作流系统提出一种适应性检查点机制,该机制通过最优化检查点数量和动态设置检查点间隔,大大提高了错误发生情况下任务按时完成的比率,并通过实验验证了该机制优于传统的检查点机制。  相似文献   

15.
Algorithm-based fault tolerance (ABFT) is a technique which improves the reliability of a multiprocessor system by providing concurrent error detection and fault location capability to it. It encodes data at the system level and modifies the algorithm to operate on the encoded data in order to expose both transient and permanent faults in any processor. Work done till now in this area takes care of only the fault detection and location part of the problem. However, if spare processors are not available, then after a faulty processor has been located, the work initially assigned to it has to be mapped to some nonfaulty processors in the system in such a way that the fault tolerance capability of the system is still maintained with as small a degradation in performance as possible. In this paper, we propose an integrated deterministic solution to the above problem which combines concurrent error detection and fault location with graceful degradation. There exists no previous deterministic ABFT method for the design of general t-fault locating systems, even for the case of t=1. We propose a general method for designing one-fault locating/s-fault detecting systems. We use an extended model for representing ABFT systems. This model considers the processors computing the checks to be a part of the ABFT system, so that faults in the check computing processors can also be detected and located using a simple diagnosis algorithm, and the checks can be mapped to other nonfaulty processors in the system  相似文献   

16.
为解决海量存储系统的容错性问题,定义SCSI磁盘I/O故障模型,设计并实现一种基于SCSI协议的存储系统评测平台。利用SCSI协议中间层提供的接口函数,截获SCSI上层命令,并将其修改为模拟多种故障注入。通过实验比较系统在故障前与故障中的应用级性能,结果表明,该评测平台对不同故障具有不同的容错能力,可以衡量不同存储系统的技术指标。  相似文献   

17.
大规模异构众核计算机系统具有计算能力强、性能功耗比高等突出优点,已成为超级计算机的发展方向,但其复杂的异构结构和庞大的系统规模,也使系统的可用性面临巨大挑战,因此研究面向大规模异构众核系统的轻量级容错技术具有重要意义。针对传统基于检查点的系统级容错开销过大的问题,在Parallel C语言中设计并实现了故障局部感知的轻量级降级、编译指导与自动分析的检查点等语言支持的容错机制,兼顾了好用性和高效性。局部故障感知的轻量级降级结合动态任务调度框架实现,支持众核系统,可扩展到百万以上并行规模;编译指导与自动分析的检查点通过程序员插入简单的编译指示,由编译器进行分析,提示不需要保留的数据,可有效降低保留恢复的数据量。神威太湖之光超级计算机上的测试数据表明,两种容错措施相对于传统容错方法效果良好,轻量级降级的容错开销小于1%,相对于传统回卷容错方法单次故障执行时间可减少3.5%以上,编译指导与自动分析的检查点在典型应用中最多可将保留量降低至1/10,具有很好的实用性。  相似文献   

18.
基于OGSA网格的分层式网格任务调度器设计   总被引:1,自引:0,他引:1  
文章根据网格任务调度的需求、网格任务调度的特点,在充分分析一般网格任务调度的过程等的基础上,另外考虑到了网格计算环境的一些特点,比如虚拟化、分层次及自治的本质特征,以及在工作流任务协同需求下网格任务的资源依赖、粗粒度、重复执行等特性的前提下,改进设计了一种网格工作流任务主从式分层调度模型,并给出了调度策略和调度算法实现。该调度器模型在实际的网格工作流任务协同系统中得到了较好的应用效果。  相似文献   

19.
Data‐intensive applications process large volumes of data using a parallel processing method. MapReduce is a programming model designed for data‐intensive applications for massive data sets and an execution framework for large‐scale data processing on clusters of commodity servers. While fault tolerance, easy programming structure, and high scalability are considered strong points of MapReduce; however its configuration parameters must be fine‐tuned to the specific deployment, which makes it more complex in configuration and performance. This paper explains tuning of the Hadoop configuration parameters, which directly affect MapReduce's job workflow performance under various conditions to achieve maximum performance. On the basis of the empirical data we collected, it became apparent that three main methodologies can affect the execution time of MapReduce running on cluster systems. Therefore, in this paper, we present a model that consists of three main modules: (1) Extending a data redistribution technique in order to find the high‐performance nodes, (2) Utilizing the number of map/reduce slots in order to make it more efficient in terms of execution time, and (3) Developing a new hybrid routing schedule shuffle phase in order to define the scheduler task while memory management level is reduced.  相似文献   

20.
面向高性能计算的网格计算中间件   总被引:1,自引:0,他引:1  
系统地研究了网格计算中间件Netsolve系统的结构和工作原理,深入地探讨了系统中的负载平衡与容错策略,然后针对自强-2000集群超级计算机应用环境提出了对负载平衡和容错策略的改进方法。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号