首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 390 毫秒
1.
协同式检查点设置及卷回恢复技术是一种简单有效的容错手段,被广泛地运用于并行/分布式系统中。为进一步降低协同式检查点算法的开销,该文给出了一个基于可重建检查点的非阻塞协同式检查点算法。并行程序出错导致卷回恢复发生的概率远小于检查点设置概率,该算法利用这一特性,将检查点设置的部分开销转至卷回恢复阶段,降低了容错的开销,提高了系统的可扩展性。  相似文献   

2.
一种改进的同步检查点设置算法   总被引:1,自引:0,他引:1  
检查点设置与卷回恢复是集群系统中容错计算的重要手段.同步检查点方法在集群系统中得到了广泛应用.为了提高集群计算系统的工作效率,降低系统的容错开销,根据基于消息驱赶的同步检查点设置算法的性质和在实际应用中并行应用程序的通信特征,通过减小协同过程中的阻塞时间,降低系统中控制消息的数量,对基于消息驱赶的Syncand-Stop算法进行优化.改进的算法有效降低检查点设置的时间和空间开销,减小在系统应用中检查点设置的代价,进一步提高系统可扩展性和应用可靠性.  相似文献   

3.
基于虚拟文件操作的文件检查点设置   总被引:1,自引:0,他引:1  
刘少锋  汪东升  朱晶 《软件学报》2002,13(8):1528-1533
实现分布/并行系统容错的基础是单进程检查点设置和卷回恢复技术,而对活动文件信息进行保存和恢复则是这种技术的重要方面.提出一种虚拟文件操作策略,实现了对用户文件的检查点设置,有效地解决了发生故障时用户文件内容与进程全局状态的不一致的问题.该方法通过文件块式管理、检查点分布操作等技术,使得在空间开销、正常运行时间、恢复时间等性能指标上优于其他方法,并且具有对用户透明、可最大限度地保留已完成工作的特点.  相似文献   

4.
检查点设置和卷回恢复是提高系统可靠性和实现容错计算的有效途径,其性能通常用开销率来评价,而检查点开销是影响开销率的主要因素。针对目前并行程序运行时存在较多通信阻塞时间的现状,该文在写时复制检查点缓存的基础上提出了一种进一步降低检查点开销的方法。通过控制状态保存线程的调度和选择合适的状态保存粒度,该方法能很好地利用通信阻塞时间隐藏状态保存线程运行时带来的开销,从而能进一步降低开销率。  相似文献   

5.
为降低设置检查点的开销,提出一种高效的异步存储非阻塞协调式检查点算法。该算法允许多个进程并发地在进程状态信息量较小时设置检查点,只在稳固存储器空闲时进行异步存储,并可同时进行检查点设置及进程执行。实验结果表明,该算法能降低设置检查点的开销,提高系统性能。  相似文献   

6.
任务调度是云计算及网格计算环境中的重要问题,已有的调度算法往往仅致力于最小化任务的总执行时间而不设置其他约束条件,以致难以实现多种性能指标的同时优化。所提出的面向网络边缘任务调度问题的多方向粒子群优化算法,用于解决并发任务在网络边缘服务节点中的分布式调度问题,调度的目标是在任务执行的资源开销不超过阈值的情况下,最小化任务完成的总时间。该方法与现有的离散粒子群优化算法相比同时降低了任务的总完成时间及资源开销,且在合理预设资源开销上限的情况下,其计算复杂度实现了较大程度优化。仿真表明,所提出的方法比现有的离散粒子群优化算法的任务总完成时间缩短约10.52%~13.23%,资源开销减少约10.32%~13.29%。同时,在合理降低资源开销阈值的情况下,该方法的程序运行时间比现有的粒子群调度方法明显缩短。  相似文献   

7.
一种基于移动计算环境的因果日志卷回恢复算法   总被引:2,自引:0,他引:2  
由于移动节点的不可靠和无线网络连接的脆弱性,研究移动计算系统容错机制具有重要意义.对可以跨区移动、随时可以与网络断开的自治性很强的移动节点来说,异步的卷回恢复是一种重要的容错手段.现有的移动计算环境下的卷回恢复算法都无法完全实现一致的异步卷回恢复.基于因果消息日志,提出一种新的移动计算环境的卷回恢复算法:通过先行图来记录节点间的消息依赖关系,将异步检查点、基于发送方的暂存消息日志和先行图全部在移动支持站上存储和处理,为移动节点提供一种透明的容错服务,完全消除依赖关系在移动节点之间造成的影响.用形式化的方法证明了系统的一致性.仿真结果表明,在卷回开销达到最低的同时,也显著降低了无错运行时的通信和存储开销.  相似文献   

8.
熊超  武小年  张昭 《计算机工程与设计》2012,33(12):4432-4435,4444
欺负算法产生大量通信信息,时间开销大,占用系统资源过高,严重影响了分布式OLAP系统的性能。针对该问题,提出一种基于欺负算法的改进算法。该算法采用一对一的方式直接向性能最优的节点发送选举消息,以降低选举过程中产生信息的通信量和选举时间开销;并通过循环选举保证选举出系统的最优节点担任系统协调者。实验结果表明,该改进算法有效地降低了消息通信量,减少了时间开销,能更好的应用于分布式OLAP系统。  相似文献   

9.
任务调度是影响动态可重构系统性能的重要因素.针对现有预约算法中由于维护预约资源逻辑单元信息而带来系统额外开销大、任务调度自私性等问题,提出一种基于离散时间距的非预约调度算法.算法的特点在于通过任务紧迫度和时间距信息能够动态更新任务优先级和设置任务的启动时间,从而有效避免了复杂的系统开销和任务调度的自私性.实验表明,该算法能提高任务的调度成功率,而运行时间开销没有明显增加.  相似文献   

10.
WSN数据采集时间同步算法研究   总被引:1,自引:0,他引:1  
针对现有的WSN时间同步算法不适用于数据采集时间同步的问题,以基于WSN的某洗煤厂设备点检系统电动机振动信号采集为例,提出了一种新的延时时间广播同步算法,详细分析了该算法的原理、在单跳WSN中的实现过程及其时间同步延时补偿和误差计算,并与其它WSN时间同步算法进行了比较,得出该算法开销小、能量消耗低、同步精度较高的结论。该算法的有效性在基于WSN的洗煤厂设备点检系统数据同步采集中已得到验证。  相似文献   

11.
A checkpoint of a process involved in a distributed computation is said to be useful if it is part of a consistent global checkpoint. In this paper, we present a quasi-synchronous checkpointing algorithm that makes every checkpoint useful. We also present an efficient asynchronous recovery algorithm based on the checkpointing algorithm. The checkpointing algorithm allows the processes to take checkpoints asynchronously and also forces the processes to take additional checkpoints in order to make every checkpoint useful. The recovery algorithm can handle concurrent failure of multiple processes. The recovery algorithm has no domino effect and a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. Messages are only selectively logged to cope with various types of message abnormalities that arise due to rollback and hence results in low message logging overhead. Unlike some existing algorithms, our algorithm does not use vector timestamps for tracking dependency between checkpoints and hence results in low message overhead during failure-free operation. Moreover, a process can asynchronously decide garbage checkpoints and delete them from the stable storage—garbage checkpoints are the checkpoints that are no longer required for the purpose of recovery.  相似文献   

12.
Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, some or all of the checkpoints taken may not be part of any consistent global checkpoint, and hence, are useless for recovery. Communication-induced checkpointing algorithms allow processes to take checkpoints independently and also ensure that each checkpoint taken is part of a consistent global checkpoint by forcing processes to take some additional checkpoints. It is well known that it is impossible to design an optimal communication-induced checkpointing algorithm (i.e. a checkpointing algorithm that takes minimum number of forced checkpoints). So, researchers have designed communication-induced checkpointing algorithms that reduce forced checkpoints using different heuristics. In this paper, we present a communication-induced checkpointing algorithm which takes less number of forced checkpoints when compared to some of the existing checkpointing algorithms in its class.  相似文献   

13.
This paper presents an index-based checkpointing algorithm for distributed systems with the aim of reducing the total number of checkpoints while ensuring that each checkpoint belongs to at least one consistent global checkpoint (or recovery line). The algorithm is based on an equivalence relation defined between pairs of successive checkpoints of a process which allows us, in some cases, to advance the recovery line of the computation without forcing checkpoints in other processes. The algorithm is well-suited for autonomous and heterogeneous environments, where each process does not know any private information about other processes and private information of the same type of distinct processes is not related (e.g., clock granularity, local checkpointing strategy, etc.). We also present a simulation study which compares the checkpointing-recovery overhead of this algorithm to the ones of previous solutions  相似文献   

14.
Uncoordinated checkpointing allows process autonomy and general nondeterministic execution, but suffers from potential domino effects and the associated space overhead. Previous to this research, checkpoint space reclamation had been based on the notion of obsolete checkpoints; as a result, a potentially unbounded number of nonobsolete checkpoints may have to be retained on stable storage. In this paper, we derive a necessary and sufficient condition for identifying all garbage checkpoints. By using the approach of recovery line transformation and decomposition, we develop an optimal checkpoint space reclamation algorithm and show that the space overhead for uncoordinated checkpointing is in fact bounded by N(N+1)/2 checkpoints where N is the number of processes  相似文献   

15.
Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes come to know about a consistent global checkpoint initiation through information piggy-backed with the application messages or limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing the message as in existing communication-induced checkpointing algorithms). After a process takes a tentative checkpoint, it starts logging the messages sent and received in memory. When a process comes to know that every other process has taken a tentative checkpoint corresponding to current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to the stable storage. The tentative checkpoints together with the message logs stored in the stable storage form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes will be part of the same consistent global checkpoint. The sequence numbers assigned to checkpoints by a process increase monotonically. Checkpoints with the same sequence number form a consistent global checkpoint. We also present the performance evaluation of our algorithm.  相似文献   

16.
Determining consistent global checkpoints is common to many distributed problems such as fault-tolerance, distributed debugging, properties detection, etc. Uncoordinated and coordinated checkpointing algorithms have been traditionally used for such determinations. This paper addresses a third technique, namely adaptive checkpointing, that has recently emerged. This technique assumes processes take local checkpoints independently and requires them to take additional local checkpoints in order that all local checkpoints be members of some consistent global checkpoint. We first study the characteristics of such adaptive algorithms. Then, a general adaptive checkpointing algorithm is designed from a condition, first stated by Netzer and Xu, that answers the following question: ‘does a given local checkpoint belong to a consistent global checkpoint’' (such a local checkpoint is not useless). The resulting algorithm has the nice property to reduce the number of additional local checkpoints taken to ensure the property ‘no local checkpoint is useless’. Futhermore, it provides each local checkpoint with a consistent global checkpoint to which it belongs. Compared to uncoordinated and coordinated checkpointing algorithms, this algorithm combines the advantages of both without inheriting their drawbacks.  相似文献   

17.
基于索引的分布式检查点算法利用了Lamport逻辑时钟的思想来保证形成全局一致性检查点(或者恢复线)。作为一种准同步方法,基于索引的检查点算法具有异步检查点算法的灵活性,且能像同步算法一样避免多米诺效应。本文在著名的BCS算法的基础上提出了一种减少基本检查点数目的优化策略--重新计时法。最后,通过模拟实验证明了这种改进策略的有效性。  相似文献   

18.
《Theoretical computer science》2003,290(2):1127-1148
There are two approaches to reduce the overhead associated with coordinated checkpointing: first is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process non-blocking. In our previous work (IEEE Parallel Distributed Systems 9 (12) (1998) 1213), we proved that there does not exist a non-blocking algorithm which forces only a minimum number of processes to take their checkpoints. In this paper, we present a min-process algorithm which relaxes the non-blocking condition while tries to minimize the blocking time, and a non-blocking algorithm which relaxes the min-process condition while minimizing the number of checkpoints saved on the stable storage. The proposed non-blocking algorithm is based on the concept of “mutable checkpoint”, which is neither a tentative checkpoint nor a permanent checkpoint. Based on mutable checkpoints, our non-blocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on the stable storage.  相似文献   

19.
Mobile computing raises many new issues such as lack of stable storage, low bandwidth of wireless channel, high mobility, and limited battery life. These new issues make traditional checkpointing algorithms unsuitable. Coordinated checkpointing is an attractive approach for transparently adding fault tolerance to distributed applications since it avoids domino effects and minimizes the stable storage requirement. However, it suffers from high overhead associated with the checkpointing process in mobile computing systems. Two approaches have been used to reduce the overhead: First is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. These two approaches were orthogonal previously until the Prakash-Singhal algorithm combined them. However, we found that this algorithm may result in an inconsistency in some situations and we proved that there does not exist a nonblocking algorithm which forces only a minimum number of processes to take their checkpoints. In this paper; we introduce the concept of “mutable checkpoint,” which is neither a tentative checkpoint nor a permanent checkpoint, to design efficient checkpointing algorithms for mobile computing systems. Mutable checkpoints can be saved anywhere, e.g., the main memory or local disk of MHs. In this way, taking a mutable checkpoint avoids the overhead of transferring large amounts of data to the stable storage at MSSs over the wireless network. We present techniques to minimize the number of mutable checkpoints. Simulation results show that the overhead of taking mutable checkpoints is negligible. Based on mutable checkpoints, our nonblocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on the stable storage  相似文献   

20.
Consistent global checkpoints have many uses in distributed computations. A central question in applications that use consistent global checkpoints is to determine whether a consistent global checkpoint that includes a given set of local checkpoints can exist. Netzer and Xu (1995) presented the necessary and sufficient conditions under which such a consistent global checkpoint can exist, but they did not explore what checkpoints could be constructed. In this paper, we prove exactly which local checkpoints can be used for constructing such consistent global checkpoints. We illustrate the use of our results with a simple and elegant algorithm to enumerate all such consistent global checkpoints  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号