Similar Documents
20 similar documents found.
1.
A Checkpointing System Suitable for Large-Scale Cluster Parallel Computing   Cited by: 5 (self-citations: 1, other citations: 4)
Distributed checkpointing is an important fault-tolerance mechanism for large-scale parallel computing systems, but protocol overhead and checkpoint image storage are the two main bottlenecks limiting the scalability of parallel checkpointing. Exploiting the execution characteristics of parallel applications and the architecture of high-performance clusters, the C system uses dynamic virtual connections and distributed storage of checkpoint images to reduce the overhead of coordinated checkpointing and improve its scalability. Preliminary test results show that the design strategies of the C system are well suited to fault tolerance for large-scale parallel computing.

2.
Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the past 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overhead and higher efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems.
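As a hedged illustration of the dirty-detection idea (not libhashckpt itself, and on the CPU rather than a GPU), the following C sketch hashes every page of a region and writes only the pages whose hash changed since the previous checkpoint; the FNV-1a hash, page size, and file layout are illustrative assumptions.

```c
/* Sketch of hash-based incremental checkpointing (CPU-only, not
 * libhashckpt). Hashes each page and writes only pages whose hash
 * differs from the previous checkpoint. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE 4096

static uint64_t fnv1a(const unsigned char *p, size_t n) {
    uint64_t h = 1469598103934665603ULL;            /* FNV offset basis */
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

/* Append dirty pages of `data` to `f`; a page is dirty when its hash
 * differs from `old_hash`. The page index precedes each page so a
 * restart can reassemble the image. Returns the number written. */
static size_t incr_checkpoint(FILE *f, const unsigned char *data,
                              size_t npages, uint64_t *old_hash) {
    size_t written = 0;
    for (size_t i = 0; i < npages; i++) {
        uint64_t h = fnv1a(data + i * PAGE, PAGE);
        if (h != old_hash[i]) {
            fwrite(&i, sizeof i, 1, f);
            fwrite(data + i * PAGE, PAGE, 1, f);
            old_hash[i] = h;
            written++;
        }
    }
    return written;
}

int main(void) {
    enum { N = 16 };
    unsigned char *mem = calloc(N, PAGE);
    uint64_t hash[N] = {0};
    FILE *f = fopen("ckpt.bin", "wb");
    printf("first checkpoint:  %zu pages\n", incr_checkpoint(f, mem, N, hash));
    memset(mem + 3 * PAGE, 0xAB, 100);              /* dirty page 3 only */
    printf("second checkpoint: %zu pages\n", incr_checkpoint(f, mem, N, hash));
    fclose(f);
    free(mem);
    return 0;
}
```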

3.
In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. In our model fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface including message-passing and file I/O primitives, which hides from the application level the complexity of both fault-tolerant network communication and checkpointing and recovering user files. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.

4.
New non-volatile memory (e.g., phase-change memory) provides fast access, large capacity, byte-addressability, and non-volatility. These features, in particular fast byte-granular persistence, bring new opportunities for fault tolerance. We propose a fine-grained checkpointing scheme based on non-volatile memory. We extend the current virtual memory manager to manage non-volatile memory, and design a persistent heap with support for fast allocation and checkpointing of persistent objects. To achieve fine granularity, we scatter objects across virtual pages and rely on hardware page protection to monitor modifications. In our system, two objects in different virtual pages may reside on the same physical page, so modifying one object does not interfere with the other. This allows us to monitor and checkpoint objects smaller than 4096 bytes in a fine-grained way. Compared with previous page-grained checkpoint mechanisms, our method greatly reduces the data copied at checkpoint time and better leverages the limited bandwidth of non-volatile memory.
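A minimal POSIX sketch of the hardware page-protection monitoring this scheme relies on (not the persistent heap or the physical-page aliasing described above): write-protect a region with mprotect, catch the first store to each page in a SIGSEGV handler, and mark that page dirty. The region size and bookkeeping are illustrative.

```c
/* Sketch of hardware page-protection write monitoring (POSIX).
 * Write-protects a region; the first store to a page faults, the
 * handler marks the page dirty and re-enables writes. Illustrative
 * of the mechanism only, not the paper's NVM object heap. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 16

static char *region;
static long  pagesz;
static int   dirty[NPAGES];        /* one flag per page in the region */

static void on_segv(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    long idx = ((char *)si->si_addr - region) / pagesz;
    if (idx < 0 || idx >= NPAGES) _exit(1);  /* fault outside region */
    dirty[idx] = 1;
    mprotect(region + idx * pagesz, pagesz, PROT_READ | PROT_WRITE);
}

int main(void) {
    pagesz = sysconf(_SC_PAGESIZE);
    region = mmap(NULL, NPAGES * pagesz, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_segv;
    sigaction(SIGSEGV, &sa, NULL);

    mprotect(region, NPAGES * pagesz, PROT_READ); /* arm write tracking */
    region[2 * pagesz] = 'x';                     /* faults, marks page 2 */
    region[7 * pagesz] = 'y';                     /* faults, marks page 7 */

    for (int i = 0; i < NPAGES; i++)
        if (dirty[i]) printf("page %d dirty\n", i);
    return 0;
}
```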

5.
A variety of research problems exist that require considerable time and computational resources to solve. Attempting to solve these problems produces long-running applications that require a reliable and trustworthy system upon which they can be executed. Cluster systems provide an excellent environment for running these applications because of their low cost-to-performance ratio; however, being built from commodity components, they are prone to failures. This report surveyed and reviewed the issues currently relating to providing fault tolerance for long-running applications. Several fault tolerance approaches were investigated; it was found that rollback-recovery provides a favourable approach for user applications in cluster systems. Two facilities are required to provide fault tolerance using rollback-recovery: checkpointing and recovery. It was shown here that a multitude of work has been done on enhancing checkpointing; however, the intricacies of providing recovery have been neglected. The problems associated with providing recovery include: providing transparent and autonomic recovery, selecting appropriate recovery computers, and maintaining a consistent observable behaviour when an application fails. Copyright © 2009 John Wiley & Sons, Ltd.

6.
A common approach for supporting fault tolerance against node failures is the primary site approach. In this approach the service to be made fault-tolerant is replicated at many nodes, one of which is designated as primary and the others as backups. All requests for the service are sent to the primary site, which periodically checkpoints its state on the backups. If the primary fails, one of the backups takes over as primary, and to maintain consistency, it first re-executes all the requests performed by the previous primary since the last checkpoint. Two important issues that affect the performance of this approach are the frequency of checkpointing and the degree of replication of the service. If the checkpointing interval is decreased, the overhead of re-executing old requests decreases, but the overhead of checkpointing increases. If the degree of replication increases, then on the one hand the availability of the system for user services increases, since the reliability of the system increases; on the other hand, the checkpointing time increases, which reduces availability. In this paper, we present an analytic model to study the optimum checkpointing interval, and a queuing model to study the optimum degree of replication for a service in a primary site system. The reliability of a primary site system is also studied. (This work was supported by NSF grant DC1-861033. P. Jalote has a joint appointment with the Institute of Advanced Computer Studies.)
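The paper develops its own analytic and queuing models; as a hedged, first-order illustration of the checkpoint-interval trade-off it studies, the classical Young approximation balances checkpoint cost against expected re-execution after a failure:

```latex
% First-order model (Young's approximation) -- an illustration,
% not the paper's exact model.
% C : time to write one checkpoint
% T : checkpointing interval
% M : mean time between failures
\[
  \mathrm{overhead}(T) \approx \frac{C}{T} + \frac{T}{2M},
  \qquad
  \frac{d}{dT}\,\mathrm{overhead}(T) = -\frac{C}{T^{2}} + \frac{1}{2M} = 0
  \;\Longrightarrow\;
  T_{\mathrm{opt}} \approx \sqrt{2\,C\,M}.
\]
```

For example, with C = 60 s and M = 24 h, T_opt ≈ √(2 · 60 · 86400) ≈ 3220 s, i.e. roughly 54 minutes between checkpoints.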

7.
Computational grids are composed of heterogeneous, autonomously managed resources. In such an environment any resource can join or leave the grid at any time, which makes the grid infrastructure inherently unreliable and leads to delays and failures of executing jobs. Fault tolerance therefore becomes a vital aspect of the grid for realizing reliability, availability, and quality of service. The technique most commonly used for fault tolerance in High Performance Computing is rollback recovery, which relies on the availability of checkpoints and the stability of storage media; the checkpoints are therefore replicated on storage media. Replication increases job execution time if it is not done properly, and dedicating powerful resources solely to checkpoint storage wastes their computational power and may create bottlenecks when the load on the network is high. To address these problems, this paper proposes a checkpoint-replication-based fault tolerance strategy named the Reliable Checkpoint Storage Strategy (RCSS). In RCSS, the checkpoints are replicated on all checkpoint servers in the grid in a distributed manner, which decreases checkpoint replication time and in turn improves overall job execution time. Additionally, if a resource fails during the execution of a job, RCSS restarts the job from its last valid checkpoint, taken from any checkpoint server in the grid. To further increase grid performance, the CPU cycles of checkpoint servers are also utilized when network load is high. To evaluate the performance of RCSS, simulations are carried out using GridSim. The simulation results show that RCSS improves intra-cluster checkpoint wave completion time by 12.5% as the number of checkpoint servers varies, reduces checkpoint wave completion time by 50% as the number of clusters varies, and reduces intra-cluster replication time by 39.5%.

8.
Liu Yongyan, Liu Yongpeng, Feng Hua, Chi Wanqing. Computer Science, 2011, 38(5): 287-289, 305
Checkpointing is an important fault-tolerance mechanism in high-performance parallel computing systems. As system scale grows, the scalability of parallel checkpointing becomes constrained by file access. Targeting the multi-level file system structure of large-scale parallel computing systems, this paper proposes a cache-style parallel checkpointing technique. It turns globally synchronized parallel checkpointing into local file operations and exploits the multiprocessor architecture to schedule write-backs as an out-of-order pipeline, spreading write-back times appropriately. This effectively hides the write-back overhead of checkpoints and ensures high performance and high scalability of parallel checkpoint file access.
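A minimal sketch of the cache-then-write-back idea under stated assumptions: the checkpoint is first copied into a local buffer (fast and blocking), and a background thread then flushes it to the shared file system so the write-back cost is hidden from the computation. A single buffer and a single flush stand in for the out-of-order write-back pipeline the paper describes.

```c
/* Sketch of cache-style checkpointing: copy state into a local
 * buffer (fast), then flush it to the slow shared file system from
 * a background thread so computation is not blocked. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define STATE_SZ (1 << 20)

struct ckpt { unsigned char *buf; const char *path; };

static void *writeback(void *arg) {           /* runs off the critical path */
    struct ckpt *c = arg;
    FILE *f = fopen(c->path, "wb");           /* stands in for the shared FS */
    if (f) { fwrite(c->buf, 1, STATE_SZ, f); fclose(f); }
    return NULL;
}

int main(void) {
    unsigned char *state = malloc(STATE_SZ);
    memset(state, 0x5A, STATE_SZ);            /* pretend computation */

    struct ckpt c = { malloc(STATE_SZ), "ckpt.img" };
    memcpy(c.buf, state, STATE_SZ);           /* fast local "cache" copy */

    pthread_t t;
    pthread_create(&t, NULL, writeback, &c);  /* asynchronous write-back */

    memset(state, 0xA5, STATE_SZ);            /* computation continues */

    pthread_join(t, NULL);                    /* done before next checkpoint */
    puts("checkpoint flushed");
    free(c.buf);
    free(state);
    return 0;
}
```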

9.
The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. In order to provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support various checkpointing protocols and different checkpointer packages (e.g. BLCR, LinuxSSI, OpenVZ) in a transparent manner through a uniform checkpointer interface. In this paper, we present the integration of a backward error recovery protocol based on independent checkpointing into the XtreemGCP service. The solution we propose is not bound to a particular checkpointer and thus can be used transparently on top of any checkpointer package. To evaluate the prototype we ran it within a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate the capability of the XtreemGCP service to integrate different checkpointing protocols and to independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation shows that our solution outperforms the existing coordinated checkpointing protocol in terms of scalability.
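A hedged sketch of what a uniform checkpointer interface can look like, written vtable-style in C; the names and signatures are illustrative assumptions, not XtreemGCP's actual API.

```c
/* Sketch of a uniform checkpointer interface over different packages
 * (vtable style). Names are illustrative, not XtreemGCP's real API. */
#include <stdio.h>

struct checkpointer {
    const char *name;
    int (*checkpoint)(int pid, const char *path);
    int (*restart)(const char *path);
};

static int blcr_checkpoint(int pid, const char *path) {
    printf("[blcr] checkpoint pid %d -> %s\n", pid, path);
    return 0;   /* would invoke the BLCR package here */
}
static int blcr_restart(const char *path) {
    printf("[blcr] restart from %s\n", path);
    return 0;
}

/* The checkpointing service is written against the interface only,
 * so any package can be slotted in transparently. */
static int service_checkpoint_job(struct checkpointer *cp, int pid) {
    return cp->checkpoint(pid, "/ckpt/job42.img");
}

int main(void) {
    struct checkpointer blcr = { "BLCR", blcr_checkpoint, blcr_restart };
    service_checkpoint_job(&blcr, 1234);
    blcr.restart("/ckpt/job42.img");
    return 0;
}
```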

10.
The main issues when supporting fault tolerance based on checkpointing and rollback recovery for High-Performance applications are the scalability of the introduced support, the possibility of analyzing the induced overhead and, more generally, the optimization of the trade-off between failure-free and recovery performance. In this paper we describe our contribution to fault tolerance for high-level structured parallelism models. We take a different viewpoint w.r.t. existing contributions, by introducing a methodology to derive properties useful for supporting fault tolerance. We show how to apply this methodology to a general data parallel model, deriving properties that support a class of checkpointing protocols. Thanks to this methodology, this class of protocols is not affected by the issues described above. We exemplify two checkpointing protocols and the related rollback recovery techniques. For each protocol we also derive cost models statically describing the failure-free performance, which can be used for performance tuning or to target some Quality of Service parameter. To assess the innovation of the results we analytically and experimentally compare the introduced protocols with two protocols from the literature. Results show that while the protocols introduced in this paper permit the definition of cost models and scale well, the literature protocols do not always have these properties. Copyright © 2010 John Wiley & Sons, Ltd.

11.
A grid is a distributed computational and storage environment often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly utilized techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt the above-mentioned parameters based on information about grid status, to provide high job throughput in the presence of failure while reducing system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment, Dynamic Scheduling in Distributed Environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run employing workload and system parameters derived from logs collected from several large-scale parallel production systems. Experiments have shown that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.

12.
Pierre Sens, Bertil Folliot. Software, 1998, 28(10): 1079-1099
This paper presents the design, implementation and performance evaluation of a software fault manager for distributed applications. Dubbed Star, it uses the natural redundancy existing in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. To improve the response time of fault-tolerant applications, Star implements non-blocking and incremental checkpointing to perform an efficient backup of process state. Moreover, Star is application-independent and highly configurable. Star currently runs on top of SunOS and is easily portable to UNIX™-like operating systems. The current implementation is based on independent checkpointing and message logging. Measurements show the efficiency and the limits of this implementation. The challenge is to show that a software approach to fault tolerance can be implemented efficiently in a standard networked environment. © 1998 John Wiley & Sons, Ltd.

13.
The ability to analyze streaming data in real time is vital in systems that process data from health sensors. These systems need to build and maintain state, as well as preserve this state during system failures. In this work, we introduce a fault-tolerance scheme designed for the Granules stream processing system. We work with two distinct health stream datasets: thorax extension and electroencephalogram (EEG) signal analysis. We have developed a monitoring program to track trends in the thorax extension dataset and a classification system for the EEG dataset, which allows us to determine user intent from EEG signals. Using these two motivating applications, we have explored several approaches to fault tolerance through replication, developing a hybrid approach that is particularly suited to health streams. Copyright © 2014 John Wiley & Sons, Ltd.

14.
Adaptive checkpointing strategy to tolerate faults in economy based grid   Cited by: 3 (self-citations: 2, other citations: 1)
In this paper, we develop a fault-tolerant job scheduling strategy that tolerates faults gracefully in an economy-based grid environment. We propose a novel adaptive, task-checkpointing-based fault-tolerant job scheduling strategy for such grids. The proposed strategy maintains a fault index of grid resources and dynamically updates it based on the successful or unsuccessful completion of assigned tasks. Whenever a grid resource broker has tasks to schedule on grid resources, it uses the fault index from the fault-tolerant schedule manager in addition to a time optimization heuristic. While scheduling a grid job on a grid resource, the resource broker uses the fault index to apply a different intensity of task checkpointing (inserting checkpoints in a task at different intervals). To simulate and evaluate the performance of the proposed strategy, this paper enhances the GridSim Toolkit 4.0 to exhibit fault-tolerance-related behavior. We also compare the checkpointing fault-tolerant job scheduling strategy with the well-known time optimization heuristic in an economy-based grid environment. From the measured results, we conclude that even in the presence of faults, the proposed strategy effectively schedules grid jobs, tolerates faults gracefully, and executes more jobs successfully within the specified deadline and allotted budget. It also improves overall execution time and minimizes the execution cost of grid jobs.
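A small C sketch, under assumed update rules, of the fault-index bookkeeping: the index of a resource rises on task failure and decays on success, and a higher index maps to denser checkpoint insertion. The increment, decay factor, and interval mapping are illustrative, not the paper's calibration.

```c
/* Sketch of a fault index driving checkpoint intensity. Update rules
 * and the index-to-interval mapping are illustrative assumptions. */
#include <stdio.h>

struct resource { const char *name; double fault_index; };

static void record_outcome(struct resource *r, int failed) {
    if (failed) r->fault_index += 1.0;        /* penalize failures */
    else        r->fault_index *= 0.8;        /* decay on success  */
}

/* Denser checkpoints (shorter interval) on less reliable resources. */
static int checkpoint_interval_s(const struct resource *r) {
    int iv = (int)(600.0 / (1.0 + r->fault_index));
    return iv < 30 ? 30 : iv;                 /* clamp to 30 s floor */
}

int main(void) {
    struct resource r = { "node-17", 0.0 };
    int outcomes[] = { 0, 1, 1, 0, 1 };       /* 1 = task failed */
    for (int i = 0; i < 5; i++) {
        record_outcome(&r, outcomes[i]);
        printf("%s index=%.2f -> checkpoint every %d s\n",
               r.name, r.fault_index, checkpoint_interval_s(&r));
    }
    return 0;
}
```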

15.
Roll-forward recovery schemes were proposed to enhance the performance of fault-tolerant systems employing the checkpointing approach. In roll-forward schemes, multiple processors are used for simultaneous roll-forward and validation processing. This paper proposes a sample comparison approach along with checkpointing, which further improves performance by reducing the overhead that checkpointing imposes. We also develop general analytical models for estimating availability, applicable to any checkpointing scheme. Performance comparisons reveal that the availabilities of the checkpointing schemes with sample comparison are higher than those of the schemes without it, while the required checkpoint interval is larger.

16.
Software DSM (distributed shared memory) systems build a shared-memory programming environment on clusters, combining the ease of shared-memory programming with the scalability of clusters, and have therefore attracted extensive research. Because a software DSM system is a distributed system with a high risk of failure, fault-tolerance techniques are needed to make it practical. Using user-level checkpointing, we designed and implemented JIACKPT (JIAjia with ChecKPoinTing), a recoverable and highly portable software DSM system built on JIAJIA, a software DSM system supporting the scope consistency memory model. By adopting a strong globally consistent state suited to software DSM systems, together with several optimizations, JIACKPT is easy to implement and achieves good performance. Application tests on an 8-node PC cluster show that even when a checkpoint is taken every minute, the checkpoint overhead of most applications is below 10%. In addition, JIACKPT is highly portable. These results indicate that JIACKPT has become a fairly practical system.

17.
Failures are normal rather than exceptional in cloud computing environments, and high fault tolerance is one of the major obstacles to opening up a new era of highly serviceable cloud computing, since fault tolerance plays a key role in ensuring cloud serviceability. Fault-tolerant service is an essential part of Service Level Objectives (SLOs) in clouds. To achieve a high level of cloud serviceability and to meet demanding SLOs, a foolproof fault tolerance strategy is needed. In this paper, the definitions of fault, error, and failure in a cloud are given, and the principles for high fault tolerance objectives are systematically analyzed with reference to fault tolerance theories suitable for large-scale distributed computing environments. Based on these principles and the semantics of cloud fault tolerance, a dynamic adaptive fault tolerance strategy, DAFT, is put forward. It includes: (i) analyzing the mathematical relationship between different failure rates and two fault tolerance strategies, checkpointing and data replication; (ii) building a dynamic adaptive checkpointing fault tolerance model and a dynamic adaptive replication fault tolerance model, and combining the two to maximize serviceability and meet the SLOs; and (iii) evaluating the dynamic adaptive fault tolerance strategy under various conditions in large-scale cloud data centers, considering different system-centric parameters such as fault tolerance degree, fault tolerance overhead, and response time. Theoretical as well as experimental results conclusively demonstrate that DAFT has high potential: it provides efficient fault tolerance enhancements, significant cloud serviceability improvement, and strong SLO satisfaction. It efficiently and effectively achieves a trade-off among fault tolerance objectives in cloud computing environments.
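As a hedged illustration of step (i), the sketch below compares first-order steady-state overheads of the two strategies and picks the cheaper one for an observed failure rate; the formulas (Young-style checkpoint overhead, replica count as replication overhead) are illustrative assumptions, not DAFT's actual model.

```c
/* Illustrative decision rule (an assumption, not DAFT's model):
 * pick whichever strategy has the lower first-order steady-state
 * overhead for the observed failure rate. */
#include <math.h>
#include <stdio.h>

/* Checkpointing at the optimal interval T = sqrt(2C/lambda) costs an
 * overhead fraction of roughly sqrt(2*C*lambda) (first-order model). */
static double ckpt_overhead(double ckpt_cost_s, double lambda) {
    return sqrt(2.0 * ckpt_cost_s * lambda);
}

/* Running r replicas costs (r - 1) extra machines' worth of work. */
static double repl_overhead(int replicas) { return replicas - 1.0; }

int main(void) {
    double C = 60.0;                         /* 60 s per checkpoint */
    int    r = 2;                            /* primary + one replica */
    for (double mtbf = 60; mtbf <= 61440; mtbf *= 4) {
        double lambda = 1.0 / mtbf;
        const char *pick = ckpt_overhead(C, lambda) < repl_overhead(r)
                               ? "checkpointing" : "replication";
        printf("MTBF %7.0f s -> %s\n", mtbf, pick);
    }
    return 0;
}
```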

18.
Mobile computing raises many new issues, such as lack of stable storage, low bandwidth of the wireless channel, high mobility, and limited battery life. These new issues make traditional checkpointing algorithms unsuitable. Coordinated checkpointing is an attractive approach for transparently adding fault tolerance to distributed applications, since it avoids domino effects and minimizes the stable storage requirement. However, it suffers from high overhead associated with the checkpointing process in mobile computing systems. Two approaches have been used to reduce this overhead: the first is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. These two approaches were orthogonal until the Prakash-Singhal algorithm combined them. However, we found that this algorithm may result in an inconsistency in some situations, and we proved that there does not exist a nonblocking algorithm that forces only a minimum number of processes to take their checkpoints. In this paper, we introduce the concept of a "mutable checkpoint," which is neither a tentative checkpoint nor a permanent checkpoint, to design efficient checkpointing algorithms for mobile computing systems. Mutable checkpoints can be saved anywhere, e.g., in the main memory or on the local disk of mobile hosts (MHs). In this way, taking a mutable checkpoint avoids the overhead of transferring large amounts of data over the wireless network to stable storage at the mobile support stations (MSSs). We present techniques to minimize the number of mutable checkpoints. Simulation results show that the overhead of taking mutable checkpoints is negligible. Based on mutable checkpoints, our nonblocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on stable storage.
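A minimal sketch of the mutable-checkpoint idea under stated assumptions: state is first saved cheaply into local memory (the mutable checkpoint); only if coordination later decides the process must participate is it promoted to stable storage, otherwise the local copy is simply discarded. The structures and the participation flag are illustrative, not the paper's protocol.

```c
/* Sketch of mutable checkpoints: save state cheaply in local memory
 * first; promote to stable storage only if coordination requires it,
 * else discard. Illustrative structures, not the paper's protocol. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define STATE_SZ 4096

static unsigned char *take_mutable(const unsigned char *state) {
    unsigned char *m = malloc(STATE_SZ);     /* MH main memory / disk */
    memcpy(m, state, STATE_SZ);              /* no wireless transfer  */
    return m;
}

static void promote(const unsigned char *m) { /* -> stable storage at MSS */
    FILE *f = fopen("stable.ckpt", "wb");
    if (f) { fwrite(m, 1, STATE_SZ, f); fclose(f); }
}

int main(void) {
    unsigned char state[STATE_SZ];
    memset(state, 7, STATE_SZ);

    unsigned char *mutable_ckpt = take_mutable(state); /* cheap, local */

    int must_participate = 1;     /* decided later by the coordinator */
    if (must_participate) promote(mutable_ckpt);  /* becomes permanent */
    else                  puts("mutable checkpoint discarded");

    free(mutable_ckpt);
    return 0;
}
```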

19.
Process checkpointing is a procedure that periodically saves process state to stable storage. Most checkpointing facilities use hard disks for archiving. However, disk seek time is limited by the speed of the read-write heads, so checkpointing a process to a local disk demands extensive disk bandwidth. In this paper, we propose an approach that exploits the memory of idle workstations as faster storage for checkpointing. In our scheme, autonomous machines that submit jobs to the computation server offer their physical memory to the server for job checkpointing. Eight applications are used to measure remote memory performance under four checkpointing policies. Experimental results show that remote memory reduces the overhead by at least 34.5 per cent for sequential checkpointing and 32.1 per cent for incremental checkpointing. Additionally, checkpointing a running process into remote memory requires only 60 per cent of the local-disk checkpoint latency. Copyright © 1999 John Wiley & Sons, Ltd.
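A hedged sketch of the core idea: the checkpoint is streamed over TCP into a peer's memory rather than written to the local disk. To keep the example self-contained, the "idle workstation" is a forked child holding the checkpoint in RAM; the port and framing are illustrative assumptions.

```c
/* Sketch of checkpointing into remote memory: state is streamed over
 * TCP to a "memory server" instead of the local disk. The server is a
 * forked child holding the checkpoint in RAM, so the example is
 * self-contained; port and framing are illustrative. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#define PORT 15999
#define STATE_SZ 65536

static void memory_server(int lsock) {
    int c = accept(lsock, NULL, NULL);
    unsigned char *store = malloc(STATE_SZ); /* checkpoint lives in RAM */
    size_t got = 0;
    while (got < STATE_SZ) {
        ssize_t n = read(c, store + got, STATE_SZ - got);
        if (n <= 0) break;
        got += (size_t)n;
    }
    fprintf(stderr, "server: stored %zu-byte checkpoint in memory\n", got);
    free(store);
    close(c);
}

int main(void) {
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a = {0};
    a.sin_family = AF_INET;
    a.sin_port = htons(PORT);
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    int one = 1;
    setsockopt(lsock, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
    bind(lsock, (struct sockaddr *)&a, sizeof a);
    listen(lsock, 1);

    if (fork() == 0) { memory_server(lsock); _exit(0); } /* idle peer */

    close(lsock);
    int s = socket(AF_INET, SOCK_STREAM, 0);
    connect(s, (struct sockaddr *)&a, sizeof a);

    unsigned char *state = malloc(STATE_SZ); /* process state to save */
    memset(state, 0x42, STATE_SZ);
    size_t sent = 0;                         /* checkpoint to memory  */
    while (sent < STATE_SZ) {
        ssize_t n = write(s, state + sent, STATE_SZ - sent);
        if (n <= 0) break;
        sent += (size_t)n;
    }
    close(s);
    free(state);
    wait(NULL);
    return 0;
}
```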

20.
Message logging is an attractive solution for providing fault tolerance to message-passing applications because it is more scalable than coordinated checkpointing. Sender-based message logging is a well-known optimization that saves message payloads in the sender's memory, so only message reception events have to be logged reliably by an event logger. This paper proposes solutions to further improve message logging protocol scalability. In existing work on message logging, the event logger has always been considered a centralized process. We propose a distributed event logger that takes advantage of multi-core processors by running in parallel with application processes, leveraging the volatile memory of the nodes to save events reliably. We also propose combining our distributed event logger with O2P, an active optimistic message logging protocol that uses a gossip-based protocol to disseminate information on new stable events. Our distributed event logger and O2P are implemented in the Open MPI library. Our results show that: (i) distributed event logging improves message logging protocol scalability; and (ii) using O2P with a distributed event logger provides an efficient and scalable fault-tolerant solution for message-passing applications. Copyright © 2011 John Wiley & Sons, Ltd.
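A minimal single-process sketch of sender-based logging (the data structures are assumptions, not Open MPI's or O2P's): payload copies stay in the sender's volatile memory, and only the small reception events (which message, in what delivery order) go to the event logger.

```c
/* Sketch of sender-based message logging: payloads stay in the
 * sender's volatile memory; only the small reception event goes to
 * the event logger. Single-process illustration with assumed
 * structures, not Open MPI's or O2P's actual ones. */
#include <stdio.h>
#include <string.h>

#define MAX_MSGS 64
#define PAYLOAD  256

struct sender_log {                 /* kept in sender memory */
    int  used;
    char payload[MAX_MSGS][PAYLOAD];
};

struct recv_event { int send_seq; int deliver_order; };

static int log_send(struct sender_log *l, const char *msg) {
    int seq = l->used++;
    strncpy(l->payload[seq], msg, PAYLOAD - 1); /* save payload copy */
    return seq;                                 /* then transmit msg */
}

/* Event logger: only sequence numbers are made stable, not payloads. */
static void log_reception(FILE *stable, struct recv_event e) {
    fwrite(&e, sizeof e, 1, stable);
}

int main(void) {
    struct sender_log slog = {0};
    FILE *stable = fopen("events.log", "wb");

    int s0 = log_send(&slog, "update A");
    int s1 = log_send(&slog, "update B");
    log_reception(stable, (struct recv_event){ s0, 0 });
    log_reception(stable, (struct recv_event){ s1, 1 });
    fclose(stable);

    /* On receiver failure, replay: fetch payloads back from the
     * sender's log in the recorded delivery order. */
    printf("replay: %s, then %s\n", slog.payload[s0], slog.payload[s1]);
    return 0;
}
```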
