1.

A New Approach for High Performance Computing Systems with Various Checkpointing Schemes

Gyung-Leen Park Hee Youn Yong 《The Journal of supercomputing》2005,33(1):65-78

Roll-forward recovery schemes were proposed to enhance the performance of fault-tolerant systems employing a checkpointing approach. In the roll-forward schemes, multiple processors are used for simultaneous roll-forward and validation processing. This paper proposes the sample comparison approach along with checkpointing, which further improves performance by reducing the overhead imposed by checkpointing. We also develop general analytical models for estimating the availability, which are applicable to any checkpointing scheme. Performance comparisons reveal that the availabilities of the checkpointing schemes with sample comparison are higher than those of the schemes without it, while the required checkpoint interval is larger.

2.

An efficient checkpointing method for multicomputers with wormhole routing

Kai Li Jeffrey F. Naughton James S. Plank 《International journal of parallel programming》1991,20(3):159-180

Efficient checkpointing and resumption of multicomputer applications is essential if multicomputers are to support time-sharing and the automatic resumption of jobs after a system failure. We present a checkpointing scheme that is transparent, imposes overhead only during checkpoints, requires minimal message logging, and allows for quick resumption of execution from a checkpointed image. Furthermore, the checkpointing algorithm allows each processor p to continue running the application being checkpointed except during the time that p is actively taking a local snapshot, and requires no global stop or freeze of the multicomputer. Since checkpointing multicomputer applications poses requirements different from those posed by checkpointing general distributed systems, existing distributed checkpointing schemes are inadequate for multicomputer checkpointing. Our checkpointing scheme makes use of special properties of wormhole routing networks to satisfy this new set of requirements.

3.

Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations (total citations: 1, self-citations: 0, citations by others: 1)

Ouyang Jinsong Maheshwari Piyush 《The Journal of supercomputing》1999,14(3):207-232

In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and programming effort required to construct reliable application software. In our model fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface including message-passing and file I/O primitives, which hides the complexity of both fault-tolerant network communications and checkpointing and recovering user files from the application level. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.

4.

The performance of cache-based error recovery in multiprocessors

Janssens B. Fuchs W.K. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(10):1033-1043

Several variations of cache-based checkpointing for rollback error recovery from transient errors in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, different in the manner in which they avoid rollback propagation, are evaluated in this paper. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, we evaluate the performance effect of integrating the recovery schemes in the cache coherence protocol. Our results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead, but with uncontrollable high variability in the checkpoint interval.

5.

Reducing reverse-mode memory requirements by using profile-driven checkpointing

Mike Fagan Alan Carle 《Future Generation Computer Systems》2005,21(8):134-1390

Reverse-mode derivative calculations have favorable time cost for many problems. Unfortunately “real world” reverse-mode computations frequently experience prohibitive space costs. To mitigate this space cost, users resort to checkpointing techniques to recompute, rather than save, the necessary values. Injudicious checkpointing, however, can destroy the favorable time performance that made reverse mode attractive in the first place. Consequently, reverse-mode users must spend significant amounts of development time analyzing and developing checkpointing schemes that complement their reverse-mode computation code.

In this paper, we describe a particular instance of this checkpointing problem: we were using reverse-mode code generated by Adifor 3.0 to compute derivatives of a large computational fluid dynamics code. Our effort labored under the additional constraint that development time was minimal (as always, it was needed yesterday). Our solution was to use profiling to narrowly focus our checkpoint analysis. This profiling approach worked well for our problem. Furthermore, the profiling idea is sufficiently general that it should work well for other problems. This paper details both our results on our specific problem and guidelines for applying the profiling technique to other checkpoint-based reverse-mode development problems.
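
The store-versus-recompute trade-off described in this abstract can be illustrated with a small sketch. It is not the authors' Adifor-based method and does no profiling; it only assumes a toy chain x_{i+1} = sin(x_i) and hypothetical helper names (`grad_store_all`, `grad_checkpointed`) to show how keeping one checkpoint every k steps reduces stored values while the derivative stays the same.

```python
import math

def grad_store_all(x0, n):
    """d x_n / d x_0 for x_{i+1} = sin(x_i), storing every intermediate input."""
    xs = [x0]
    for _ in range(n):
        xs.append(math.sin(xs[-1]))
    grad = 1.0
    for x in reversed(xs[:-1]):          # inputs x_{n-1}, ..., x_0
        grad *= math.cos(x)
    return grad, n                       # n inputs were kept in memory

def grad_checkpointed(x0, n, k):
    """Same derivative, but only every k-th input is stored (hypothetical sketch)."""
    checkpoints = {}
    x = x0
    for i in range(n):                   # forward sweep: keep sparse checkpoints
        if i % k == 0:
            checkpoints[i] = x
        x = math.sin(x)
    grad = 1.0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + k, n)
        segment = [checkpoints[start]]   # recompute inputs x_start .. x_{end-1}
        for _ in range(start + 1, end):
            segment.append(math.sin(segment[-1]))
        for x in reversed(segment):      # reverse sweep over this segment
            grad *= math.cos(x)
    return grad, len(checkpoints)

full, stored_full = grad_store_all(0.5, 10)
sparse, stored_sparse = grad_checkpointed(0.5, 10, 4)
```

With n = 10 and k = 4 the checkpointed version keeps three values instead of ten, at the price of one extra forward recomputation per segment.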

6.

A Low-Cost Checkpointing Technique for Distributed Databases

Jun-Lin Lin Margaret H. Dunham 《Distributed and Parallel Databases》2001,10(3):241-268

For distributed databases, checkpointing is used to ensure an efficient way to perform global reconstruction. However, the need for global reconstruction is infrequent. Most current checkpointing approaches for distributed databases are too expensive during run time. Some of them allow the checkpointing process to run in parallel with normal transactions at the cost of more data and resource contention, which in turn causes longer response time for normal transactions. Thus, an efficient way to checkpoint distributed databases is needed to avoid degrading the system performance. This paper presents a low-cost solution, called Loosely Synchronized Local Fuzzy Checkpointing (LSLFC), to these problems. LSLFC supports global reconstruction, and our performance study shows that LSLFC has little overhead during run time.

7.

Reliable user‐level rollback recovery implementation for multithreaded processes on Windows

Jin‐Min Yang Da‐Fang Zhang Xue‐Dong Yang Wen‐Wei Li 《Software》2007,37(3):331-346

The existing user‐level checkpointing schemes support only a limited portion of multithreaded programs because they are derived from the schemes for single‐threaded applications. This paper addresses the impact of thread suspension point on rollback recovery, and presents a checkpointing scheme for multithreaded processes. Unlike the existing schemes in which the checkpointer suspends every working thread, our scheme employs a distinctive strategy that every working thread suspends itself. This technique manages to avoid the suspension point in the API code or kernel code, ensuring correct rollback recovery. Our scheme supports inter‐thread synchronization and thread lifetime. Copyright © 2006 John Wiley & Sons, Ltd.

8.

An implementation of using remote memory to checkpoint processes

Shang‐Te Hsu Ruei‐Chuan Chang 《Software》1999,29(11):985-1004

Process checkpointing is a procedure which periodically saves the process states into stable storage. Most checkpointing facilities select hard disks for archiving. However, the disk seek time is limited by the speed of the read‐write heads, and checkpointing a process to a local disk requires extensive disk bandwidth. In this paper, we propose an approach that exploits the memory on idle workstations as a faster storage for checkpointing. In our scheme, autonomous machines which submit jobs to the computation server offer their physical memory to the server for job checkpointing. Eight applications are used to measure the remote memory performance in four checkpointing policies. Experimental results show that remote memory reduces the overhead by at least 34.5 per cent for sequential checkpointing and 32.1 per cent for incremental checkpointing. Additionally, checkpointing a running process into remote memory requires only 60 per cent of the local disk checkpoint latency time. Copyright © 1999 John Wiley & Sons, Ltd.

9.

Modeling of hierarchical distributed systems with fault-tolerance

Shieh Y.-B. Ghosal D. Chintamaneni P.R. Tripathi S.K. 《IEEE transactions on pattern analysis and machine intelligence》1990,16(4):444-457

Since each of the levels in a hierarchical system could have various characteristics, different fault-tolerant schemes could be appropriate at different levels. A stochastic Petri net (SPN) is used to investigate various fault-tolerant schemes in this context. The basic SPN is augmented by parameterized subnet primitives to model the fault-tolerant schemes. Both centralized and distributed fault-tolerant schemes are considered. The two schemes are investigated by considering the individual levels in a hierarchical system independently. In the case of distributed fault tolerance, two different checkpointing strategies are considered. The first scheme is called the arbitrary checkpointing strategy. Each process in this scheme does its checkpointing independently; thus, the domino effect may occur. The second scheme is called the planned strategy. Here, process checkpointing is constrained to ensure no domino effect. The results show that, under certain conditions, an arbitrary checkpointing strategy can perform better than a planned strategy. The effect of integration on the fault-tolerant strategies of the various levels of a hierarchy is studied.

10.

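A Low-Cost Transparent Checkpoint Recovery Protocol for Mobile Computing (total citations: 2, self-citations: 0, citations by others: 2)

Li Qinghua (李庆华) Jiang Tingyao (蒋廷耀) Zhang Hongjun (张红君) 《Journal of Software (软件学报)》2005,16(1):135-144

Checkpoint recovery protocols in mobile computing systems face many problems that differ from those in traditional distributed systems. Among the existing checkpoint recovery mechanisms that support mobile computing, methods based on establishing globally consistent checkpoints cannot guarantee independent recovery from failures, while in message-logging methods based on m-MSS-m communication, the messages exchanged between mobile hosts must be forwarded by mobile support stations. This paper proposes a message-logging-based fault-tolerant protocol that supports direct communication between mobile hosts (m-m), and gives the corresponding algorithms and a correctness proof. Compared with m-MSS-m communication, m-m communication helps reduce channel contention and message delivery latency. Simulation results show that the proposed protocol incurs lower failure-free overhead and shorter failure recovery time than traditional protocols.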

11.

Memory exclusion: optimizing the performance of checkpointing systems

James S. Plank Yuqun Chen Kai Li Micah Beck Gerry Kingsley 《Software》1999,29(2):125-142

Checkpointing systems are a convenient way for users to make their programs fault‐tolerant by intermittently saving program state to disk and restoring that state following a failure. The main concern with checkpointing is the overhead that it adds to the running time of the program. This paper describes memory exclusion, an important class of optimizations that reduce the overhead of checkpointing. Some forms of memory exclusion are well‐known in the checkpointing community. Others are relatively new. In this paper, we describe all of them within the same framework. We have implemented these optimization techniques in two checkpointers: libckpt, which works on Unix‐based workstations, and CLIP, which works on the Intel Paragon. Both checkpointers are publicly available at no cost. We have checkpointed various long‐running applications with both checkpointers and have explored the performance improvements that may be gained through memory exclusion. Results from these experiments are presented and show the improvements in time and space overhead. Copyright © 1999 John Wiley & Sons, Ltd.
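
One familiar form of memory exclusion is to skip memory that has not changed since the previous checkpoint (incremental checkpointing). The sketch below is only an illustration under stated assumptions: it is not the libckpt or CLIP interface, the class `IncrementalCheckpointer` is hypothetical, and it detects modified pages by hashing, whereas real checkpointers typically rely on page protection.

```python
import hashlib

def page_digest(page: bytes) -> str:
    """Fingerprint of one page; a stand-in for hardware dirty-page tracking."""
    return hashlib.sha256(page).hexdigest()

class IncrementalCheckpointer:
    """Hypothetical sketch: writes only pages modified since the last checkpoint."""

    def __init__(self):
        self.saved_digests = {}   # page id -> digest at the last checkpoint
        self.store = {}           # page id -> latest saved copy ("stable storage")

    def checkpoint(self, pages: dict) -> list:
        """Save modified pages and return the ids that were actually written."""
        written = []
        for page_id, data in pages.items():
            digest = page_digest(data)
            if self.saved_digests.get(page_id) != digest:
                self.store[page_id] = bytes(data)
                self.saved_digests[page_id] = digest
                written.append(page_id)
        return written

    def restore(self) -> dict:
        """Rebuild the full state from the saved pages."""
        return dict(self.store)

pages = {0: b"aaaa", 1: b"bbbb", 2: b"cccc"}
ckpt = IncrementalCheckpointer()
first = ckpt.checkpoint(pages)    # every page is new, so all are written
pages[1] = b"BBBB"
second = ckpt.checkpoint(pages)   # only the modified page is written
```

The second checkpoint writes a single page, which is the source of the time and space savings the abstract refers to.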

12.

Checkpointing in Distributed Computing Systems

《Journal of Parallel and Distributed Computing》1996,35(1):67-75

This paper examines the performance of synchronous checkpointing in a distributed computing environment with and without load redistribution. Performance models are developed, and optimum checkpoint intervals are determined. The analysis extends earlier work by allowing for multiple nodes, state-dependent checkpoint intervals, and a performance metric which is coupled with failure-free performance and the speedup functions associated with implementation of parallel algorithms. The analytic results for synchronous checkpointing without load redistribution are compared to measurements of a synthetic parallel algorithm with user-level checkpointing. Expressions for the optimum checkpoint intervals for synchronous checkpointing with and without load redistribution are used to determine when load redistribution is advantageous.

13.

Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing

《Parallel Computing》2015

We analyse novel fault tolerance schemes for data loss in multigrid solvers, which essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To improve efficiency compared to conventional global checkpointing, we exploit the inherent data compression of the multigrid hierarchy, and relax the synchronicity requirement through a local failure local recovery approach. We experimentally identify the root cause of convergence degradation in the presence of data loss using smoothness considerations. Our resulting schemes form a family of techniques that can be tailored to the expected error probability of (future) large-scale machines. A performance model gives further insight into the benefits and applicability of our techniques.

14.

Comprehensive evaluation of aperiodic checkpointing and rejuvenation schemes in operational software system

Hiroyuki Okamura Tadashi Dohi 《Journal of Systems and Software》2010,83(9):1591-1604

This paper presents a comprehensive evaluation of aperiodic time-based checkpointing and rejuvenation schemes that maximize the steady-state system availability in an operational software system. We consider two kinds of maintenance policies: checkpointing prior to rejuvenating (CPTR) and rejuvenating prior to checkpointing (RPTC). These schemes are complementary to each other in scheduling checkpoints and rejuvenation points. In addition, under a periodic full maintenance operation, we show by applying dynamic programming that an aperiodic checkpointing or rejuvenation scheme is optimal for maximizing the steady-state system availability. In numerical examples, CPTR and RPTC are compared under the same overhead parameters, and the effects of CPTR and RPTC on maximizing the steady-state system availability are investigated.

15.

Analytic models for the primary site approach to fault-tolerance

Yennun Huang Pankaj Jalote 《Acta Informatica》1989,26(6):543-557

Summary. A common approach for supporting fault tolerance against node failures is the primary site approach. In this approach the service to be made fault-tolerant is replicated at many nodes, one of which is designated as primary and the others as backups. All the requests for the service are sent to the primary site. The primary site periodically checkpoints its state on the backups. If the primary fails, one of the backups takes over as primary, and to maintain consistency, it first re-executes all the requests performed by the previous primary since the last checkpoint. Two important issues that affect the performance of this approach are the frequency of checkpointing and the degree of replication of the service. If the checkpointing interval is decreased, the overhead of re-executing old requests decreases, but the overhead for checkpointing increases. If the degree of replication increases, on the one hand, the availability of the system for user services increases since the reliability of the system increases. On the other hand, the checkpointing time increases, which reduces the availability of the system. In this paper, we present an analytic model to study the optimum checkpointing interval, and a queuing model to study the optimum degree of replication for a service in a primary site system. The reliability of a primary site system is also studied. This work was supported by the NFS grant DC1-861033. P. Jalote has a joint appointment with Institute of Advanced Computer Studies.
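
The trade-off between checkpointing overhead and re-execution overhead noted in this abstract can be made concrete with a textbook first-order model. This is not the analytic model of the paper: the sketch assumes a fixed checkpoint cost C, a mean time between failures M, and an expected loss of half an interval per failure, so the overhead fraction is C/T + T/(2M), which is minimized at T = sqrt(2CM). The function names are hypothetical.

```python
import math

def overhead_fraction(interval, checkpoint_cost, mtbf):
    """Assumed first-order overhead: checkpoint cost per interval plus expected rework."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)

def optimal_interval(checkpoint_cost, mtbf):
    """Interval that minimizes overhead_fraction under the same assumptions."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# Illustrative numbers only: a 2-minute checkpoint and a 100-minute MTBF.
best = optimal_interval(2.0, 100.0)
shorter = overhead_fraction(best / 2, 2.0, 100.0)
longer = overhead_fraction(best * 2, 2.0, 100.0)
at_best = overhead_fraction(best, 2.0, 100.0)
```

Halving or doubling the interval around the optimum raises the modeled overhead, which mirrors the qualitative argument in the abstract.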

16.

An uncoordinated asynchronous checkpointing model for hierarchical scientific workflows

Rafael Tolosana-Calasanz José Ángel Bañares Pedro Álvarez Joaquín Ezpeleta Omer Rana 《Journal of Computer and System Sciences》2010,76(6):403-415

Scientific workflow systems often operate in unreliable environments, and have accordingly incorporated different fault tolerance techniques. One of them is the checkpointing technique combined with its corresponding rollback recovery process. Different checkpointing schemes have been developed at various levels: task- (or activity-) level and workflow-level. At workflow-level, the usually adopted approach is to establish a checkpointing frequency in the system which determines the moment at which a global workflow checkpoint, that is, a snapshot of the whole workflow enactment state at normal execution (without failures), has to be accomplished. We describe an alternative workflow-level checkpointing scheme and its corresponding rollback recovery process for hierarchical scientific workflows in which every workflow node in the hierarchy accomplishes its own local checkpoint autonomously and in an uncoordinated way after its enactment. In contrast to other proposals, we utilise the Reference net formalism for expressing the scheme. Reference nets are a particular type of Petri nets which can more effectively provide the abstractions to support and to express hierarchical workflows and their dynamic adaptability.

17.

A model for error recovery with global checkpointing

Krishna Kant 《Information Sciences》1983,30(3):225-239

The paper proposes a new technique for providing software fault tolerance in concurrent systems. It combines the traditional global checkpointing mechanism with the recovery block concept in order to come up with an easily implementable error recovery mechanism. This mechanism involves smaller overhead in case of moderate to high process interaction than the schemes considered in the past, which are based upon the idea of local checkpointing. A model for computing the optimum checkpointing interval is also presented. A particular distribution is hypothesized for the coverage of the recovery, and the behavior of the model is studied in detail for this case.

18.

Communication-based prevention of useless checkpoints in distributed computations

J.-M. Hélary A. Mostefaoui R.H.B. Netzer M. Raynal 《Distributed Computing》2000,13(1):29-43

Summary. A useless checkpoint is a local checkpoint that cannot be part of a consistent global checkpoint. This paper addresses the following problem. Given a set of processes that take (basic) local checkpoints in an independent and unknown way, the problem is to design communication-induced checkpointing protocols that direct processes to take additional local (forced) checkpoints to ensure no local checkpoint is useless. The paper first proves two properties related to integer timestamps which are associated with each local checkpoint. The first property is a necessary and sufficient condition that these timestamps must satisfy for no checkpoint to be useless. The second property provides an easy timestamp-based determination of consistent global checkpoints. Then, a general communication-induced checkpointing protocol is proposed. This protocol, derived from the two previous properties, actually defines a family of timestamp-based communication-induced checkpointing protocols. It is shown that several existing checkpointing protocols for the same problem are particular instances of the general protocol. The design of this general protocol is motivated by the use of communication-induced checkpointing protocols in “consistent global checkpoint”-based distributed applications such as the detection of stable or unstable properties and the determination of distributed breakpoints. Received: July 1997 / Accepted: August 1999
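
A simple member of the timestamp-based family can be sketched as follows. This is an illustrative rule, not the general protocol of the paper: each process keeps an integer clock, increments it at every basic checkpoint, piggybacks it on outgoing messages, and takes a forced checkpoint before delivering any message whose timestamp exceeds its own clock. The class `Process` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Process:
    """Hypothetical process with an integer checkpoint timestamp."""
    name: str
    clock: int = 0
    checkpoints: list = field(default_factory=lambda: [("initial", 0)])

    def basic_checkpoint(self):
        """Checkpoint taken independently by the application."""
        self.clock += 1
        self.checkpoints.append(("basic", self.clock))

    def send(self) -> int:
        """Return the timestamp piggybacked on an outgoing message."""
        return self.clock

    def receive(self, msg_timestamp: int) -> bool:
        """Take a forced checkpoint first if the sender is ahead; report whether one was taken."""
        forced = msg_timestamp > self.clock
        if forced:
            self.clock = msg_timestamp
            self.checkpoints.append(("forced", self.clock))
        # The message would be delivered to the application here.
        return forced

p, q = Process("p"), Process("q")
p.basic_checkpoint()                # p moves to timestamp 1
forced_at_q = q.receive(p.send())   # q is behind, so it checkpoints before delivery
forced_at_p = p.receive(q.send())   # timestamps are equal, so no forced checkpoint
```

Under this rule a message is never sent after a checkpoint with timestamp t and received before the receiver's checkpoint with timestamp t, which is the intuition behind grouping checkpoints by timestamp.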

19.

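Research on Integrated Fault-Tolerance Techniques for Embedded Real-Time Systems

Li Zhongwen (黎忠文) 《Computer Science (计算机科学)》2006,33(5):277-281

This paper proposes a fault-tolerance method for embedded real-time systems that integrates checkpointing with rollback, task duplication, and DVS. The method supports online adjustment of the processor speed and, depending on the characteristics of the system, inserts additional SCPs or CCPs, making effective use of the store and compare functions of checkpoints to reduce task execution time and improve system performance. The mean task execution time under this method is derived using probability theory. Simulation results show that, on a DMR system, the proposed method markedly reduces the mean task execution time compared with existing methods. On this basis, an algorithm that adapts the processor speed is further proposed, which reduces task execution time while also saving system energy. The results of this work can also be applied to other task duplication systems such as TMR-F, DMR-F-1, and RFCS.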

20.

Compiler-assisted full checkpointing

Chung-Chi Jim Li Elliot M. Stewart W. Kent Fuchs 《Software》1994,24(10):871-886

This paper describes a compiler-based approach to checkpointing for process recovery. The implementation is transparent to both the programmer and the hardware. The compiler-generated sparse potential checkpoint code maintains the desired checkpoint interval. Adaptive checkpointing reduces the size of the checkpoints. Training is used to select low-cost, high-coverage potential checkpoints. The problem of selecting potential checkpoints is shown to be NP-complete, and a heuristic algorithm is introduced that determines a quick suboptimal solution. These compiler-assisted checkpointing techniques have been implemented in a modified version of the GNU C (GCC) compiler. Experiments involving the modified version of the GCC compiler on a Sun SPARC workstation are summarized.