
1.

Closed form solution for response time of fault tolerant network of processors

Sameer Bataineh 《Computers & Electrical Engineering》2004,30(4):291-308

Using queuing theory, closed-form solutions are obtained for the response time of a fault-tolerant network of processors based on the primary site approach. In this approach, fault tolerance is achieved by replicating the services provided by the primary at many nodes. All requests are sent to the primary, which periodically checkpoints its status on the backup nodes. If the primary fails, one of the backups takes over as primary. Two mechanisms for repairing faulty nodes are considered: delayed repair and immediate repair. Besides being in closed form, the analytical results presented in this paper have several other advantages over those of previous work. First, for the immediate repair case, there is no need to solve a set of recursive equations. Second, the results reveal many characteristics of the system. We studied the effect of the checkpointing rate on the system response time and found a closed-form solution for the optimum checkpointing rate, which minimizes the system response time.

2.

Resource allocation for primary-site fault-tolerant systems

Huang Y. Tripathi S.K. 《IEEE Transactions on Software Engineering》1993,19(2):108-119

Resource allocation for a distributed system employing the primary site approach for fault tolerance is discussed. Two kinds of systems are considered. The first consists of fault-tolerant nodes, where each node has many duplicated servers. One server is the primary, which serves user requests, and the rest are backups. The second does not have fault-tolerant nodes. To tolerate node failures, each node uses other nodes as backups. When a node fails, all requests initially allocated to the node are served by one of its backups. To study resource allocation for such systems, an approximate model of each system is developed. Using these models, efficient allocation algorithms that take into account the failure/repair rates of the system and the fault-tolerance overheads are presented. Experimental results show that the algorithms give optimal or suboptimal allocations. The algorithms, which incur little overhead, can improve system performance significantly over an intuitive allocation algorithm.

3.

A New Approach for High Performance Computing Systems with Various Checkpointing Schemes

Gyung-Leen Park Hee Yong Youn 《The Journal of supercomputing》2005,33(1):65-78

Roll-forward recovery schemes were proposed to enhance the performance of fault-tolerant systems that employ checkpointing. In roll-forward schemes, multiple processors are used for simultaneous roll-forward and validation processing. This paper proposes a sample comparison approach used together with checkpointing, which further improves performance by reducing the overhead imposed by checkpointing. We also develop general analytical models for estimating availability, which are applicable to any checkpointing scheme. Performance comparisons reveal that the availabilities of the checkpointing schemes with sample comparison are higher than those of the schemes without it, while the required checkpoint interval is larger.

4.

A new approach for high performance computing systems with various checkpointing schemes

Gyung-Leen Park Hee Yong Youn 《The Journal of supercomputing》2005,33(1-2):65-78

Roll-forward recovery schemes were proposed to enhance the performance of fault-tolerant systems that employ checkpointing. In roll-forward schemes, multiple processors are used for simultaneous roll-forward and validation processing. This paper proposes a sample comparison approach used together with checkpointing, which further improves performance by reducing the overhead imposed by checkpointing. We also develop general analytical models for estimating availability, which are applicable to any checkpointing scheme. Performance comparisons reveal that the availabilities of the checkpointing schemes with sample comparison are higher than those of the schemes without it, while the required checkpoint interval is larger. This research was supported in part by the MIC (Ministry of Information and Communication), Korea, under the ITRC support program supervised by the UTA and CUCN 21st Century Frontier R&D Program.

5.

Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids

Chtepen M. Claeys F.H.A. Dhoedt B. De Turck F. Demeester P. Vanrolleghem P.A. 《IEEE Transactions on Parallel and Distributed Systems》2009,20(2):180-190

A grid is a distributed computational and storage environment often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly used techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt these parameters based on information about grid status, to provide high job throughput in the presence of failures while reducing system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment, Dynamic Scheduling in Distributed Environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run using workload and system parameters derived from logs collected from several large-scale parallel production systems. Experiments show that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.

6.

Replication-Based Fault Tolerance for MPI Applications

Walters John Paul Chaudhary Vipin 《IEEE Transactions on Parallel and Distributed Systems》2009,20(7):997-1010

As computational clusters increase in size, their mean time to failure decreases drastically. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for storing checkpoints. This results in a bottleneck and severely limits the scalability of checkpointing, while dedicated checkpointing networks and storage systems prove too expensive. We propose a scalable replication-based MPI checkpointing facility. Our reference implementation is based on LAM/MPI; however, it is directly applicable to any MPI implementation. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, a Sun-X4500-based solution, an EMC storage area network (SAN), and the Ibrix commercial parallel file system and show that they are not scalable, particularly beyond 64 CPUs. We demonstrate the low overhead of our checkpointing and replication scheme with the NAS Parallel Benchmarks and the High-Performance LINPACK benchmark in tests of up to 256 nodes, showing that checkpointing and replication can be achieved with much lower overhead than that of current techniques. Finally, we show that the monetary cost of our solution is as low as 25 percent of that of a typical SAN/parallel-file-system-equipped storage system.

7.

CoRAL: A transparent fault-tolerant web service

Navid Aghdaie Yuval Tamir 《Journal of Systems and Software》2009,82(1):131-143

The Web is increasingly used for critical applications and services. We present a client-transparent mechanism, called CoRAL, that provides high reliability and availability for Web services. CoRAL provides fault tolerance even for requests being processed at the time of server failure. The scheme does not require deterministic servers and can thus handle dynamic content. CoRAL actively replicates the TCP connection state while maintaining logs of HTTP requests and replies. In the event of a primary server failure, active client connections fail over to a spare, where their processing continues seamlessly. We describe key aspects of the design and implementation as well as several performance optimizations. Measurements of system overhead, failover performance, and preliminary validation using fault injection are presented.

8.

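A Method for Reducing the Checkpoint Overhead of Parallel Programs

Zhou Xiaocheng, Sun Ninghui, Huo Zhigang, Ma Jie 《计算机工程》(Computer Engineering) 2007,33(12):84-86

Checkpointing and rollback recovery are effective means of improving system reliability and achieving fault-tolerant computing. Their performance is usually evaluated by the overhead ratio, and checkpoint overhead is the main factor affecting that ratio. Given that parallel programs currently spend considerable time blocked on communication, this paper proposes a method that further reduces checkpoint overhead, building on copy-on-write checkpoint buffering. By controlling the scheduling of the state-saving thread and choosing an appropriate state-saving granularity, the method uses communication blocking time to hide the overhead incurred while the state-saving thread runs, thereby further reducing the overhead ratio.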

9.

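An Efficient Checkpointing Technique for Main-Memory Databases Serving Update-Intensive Applications

Qin Xiongpai, Xiao Yanqin, Cao Wei, Wang Shan 《计算机学报》(Chinese Journal of Computers) 2009,32(11)

For main-memory database systems serving update-intensive applications, the checkpointing technique should meet several key requirements: checkpoint operations should interfere with normal transaction processing as little as possible, access skew should be handled, fast database recovery should be supported, and the system should remain available during recovery. This paper proposes a transaction-consistent partitioned checkpointing technique. It uses a tuple-based dynamic multi-version concurrency control mechanism, which avoids lock conflicts between read and write transactions and improves system throughput. The checkpoint operation is implemented as a read-only transaction, so under multi-version concurrency control it does not block normal transaction processing. Because the checkpoint files are transaction-consistent, only the Redo log information of transactions needs to be recorded, and recovery requires only a single scan of the log file, which speeds up recovery. Priority-based loading and recovery of data partitions allows data access requests from new transactions to be satisfied quickly during recovery, ensuring system availability during recovery. Thanks to a two-level version management mechanism and dynamic version sharing, the space overhead of multi-version management is reduced to an acceptable level. Experimental results show that the proposed checkpointing scheme achieves 27% higher system throughput than fuzzy checkpointing, while the space overhead of version management stays within an acceptable range, meeting the requirements of high-performance applications.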

10.

Energy minimization for reliability-guaranteed real-time applications using DVFS and checkpointing techniques

《Journal of Systems Architecture》2015,61(2):71-81

This paper addresses energy minimization when executing real-time applications that have stringent reliability and deadline requirements. To guarantee that the application's reliability and deadline requirements are satisfied, checkpointing, Dynamic Voltage Frequency Scaling (DVFS), and backward fault recovery techniques are used. We formally prove that, under backward fault recovery, executing an application at a uniform frequency (or at neighboring frequencies if the desired frequency is not available) not only consumes the minimal energy but also results in the highest system reliability. Based on this theoretical conclusion, we develop a strategy that uses DVFS and checkpointing to execute real-time applications so that the applications' reliability and deadline requirements are guaranteed and the energy consumed in executing them is minimized. The strategy needs at most one execution frequency change during the execution of an application. The execution overhead caused by frequency switching is therefore small, which makes the strategy particularly useful for processors with a large frequency switching overhead. We empirically compare the developed execution strategy with recently published work. The experimental results show that, without sacrificing reliability and deadline guarantees, the proposed approach can save up to 12% more energy than other approaches.

11.

A model for error recovery with global checkpointing

Krishna Kant 《Information Sciences》1983,30(3):225-239

The paper proposes a new technique for providing software fault tolerance in concurrent systems. It combines the traditional global checkpointing mechanism with the recovery block concept to obtain an easily implementable error recovery mechanism. For moderate to high process interaction, this mechanism involves smaller overhead than the schemes considered in the past, which are based on the idea of local checkpointing. A model for computing the optimum checkpointing interval is also presented. A particular distribution is hypothesized for the coverage of the recovery, and the behavior of the model is studied in detail for this case.

12.

Comprehensive evaluation of aperiodic checkpointing and rejuvenation schemes in operational software system

Hiroyuki Okamura Tadashi Dohi 《Journal of Systems and Software》2010,83(9):1591-1604

This paper presents a comprehensive evaluation of aperiodic time-based checkpointing and rejuvenation schemes that maximize the steady-state system availability in an operational software system. We consider two kinds of maintenance policies: checkpointing prior to rejuvenating (CPTR) and rejuvenating prior to checkpointing (RPTC). These schemes are complementary to each other in scheduling checkpoints and rejuvenation points. In addition, under a periodic full maintenance operation, we show by applying dynamic programming that an aperiodic checkpointing or rejuvenation scheme is optimal for maximizing the steady-state system availability. In numerical examples, CPTR and RPTC are compared using the same overhead parameters, and the effects of CPTR and RPTC on maximizing the steady-state system availability are investigated.

13.

Optimum checkpoints with age dependent failures

Erol Gelenbe Marisela Hernández 《Acta Informatica》1990,27(6):519-531

Summary: This paper presents a method for obtaining the optimum checkpoint interval of a transaction processing computer system subject to time-dependent failures. The system uses checkpointing to create a valid system state, and roll-back to recover from failures. By maximizing system availability, we derive the optimum checkpoint interval as a function of the system load and of the time-dependent failure rate. The results are illustrated numerically for the Weibull failure rate. On leave from Universidad de Los Andes, Venezuela.
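
For orientation, the idea of an optimum checkpoint interval can be sketched with Young's well-known first-order approximation. This is not the method of the paper above: it assumes a constant failure rate (exponentially distributed failures), whereas the paper treats age-dependent failures. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimum checkpoint interval: sqrt(2 * C * MTBF).

    Assumes a constant failure rate and a checkpoint cost C that is small
    compared with the mean time between failures (MTBF). Same time unit
    for both arguments.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of time lost for a given interval T.

    C / T is the cost of taking checkpoints; T / (2 * MTBF) is the expected
    rework, since a failure loses on average half an interval of work.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


# Hypothetical numbers: 2 s per checkpoint, one failure every 100 s on average.
best = young_interval(2.0, 100.0)
```

Under these assumptions the overhead is higher on either side of the computed interval, which is the trade-off the abstract describes: checkpointing too often wastes time saving state, and checkpointing too rarely wastes time redoing lost work.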

14.

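Elsa: A Coordination-Free Distributed Key-Value Store for Cross-Region Architectures

Cui Yulong, Fu Guo, Zhang Yanfeng, Yu Ge 《软件学报》(Journal of Software) 2023,34(5):2427-2445

As a distributed storage solution offering high performance and high scalability, key-value stores such as Redis, MongoDB, and Cassandra have been widely adopted in recent years. The multi-replica mechanism widely used in distributed storage systems improves throughput and reliability, but it also adds overhead for system coordination and replica consistency. For cross-region distributed systems, the cost of coordinating replicas over long distances may even become the performance bottleneck, reducing availability and throughput. This paper proposes Elsa, a coordination-free distributed key-value store for cross-region architectures. While ensuring high performance and high scalability, Elsa uses conflict-free replicated data types (CRDTs) to guarantee strong eventual consistency among replicas without coordination, reducing the coordination overhead between nodes. A cross-region distributed environment spanning 4 data centers and 8 nodes was built on Alibaba Cloud, and large-scale distributed performance comparison experiments were conducted. The results show that, in a cross-region distributed environment under highly concurrent, contended workloads, Elsa has a clear performance advantage, reaching up to 7.37 times the performance of a MongoDB cluster and 1.62 times that of a Cassandra cluster.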
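
To illustrate how a CRDT reaches agreement without coordination, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. It is not taken from Elsa; the class and replica names are hypothetical, and a real key-value store would use richer types.

```python
class GCounter:
    """Grow-only counter CRDT: one monotonically increasing slot per replica."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        """Local update: a replica only ever increases its own slot."""
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        """Element-wise maximum: commutative, associative, and idempotent."""
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())


# Two hypothetical replicas accept writes independently, then exchange state.
east, west = GCounter("east"), GCounter("west")
east.increment(3)
west.increment(2)
east.merge(west)
west.merge(east)
west.merge(east)  # receiving the same state twice changes nothing
```

Because the merge is an element-wise maximum, replicas can exchange state in any order and any number of times and still converge to the same value, which is the property usually called strong eventual consistency.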

15.

Surviving sensor node failures by MMU-less incremental checkpointing

《Journal of Systems and Software》2014

For some safety-critical applications, sensor nodes hold valuable information and should be able to operate unattended and without failure for several months or years. One promising solution is to adopt checkpointing that periodically saves the state of a sensor node, thereby maintaining node reliability and network availability. This study therefore first presents the design and implementation of full checkpointing for WSNs. However, checkpointing is expensive. Incremental checkpointing was previously proposed to reduce the checkpoint overhead by relying on page protection hardware to identify dirty pages. Because sensor nodes are resource-constrained and are not equipped with page protection hardware, previous incremental checkpointing schemes cannot be applied directly. To address this issue, this paper proposes three incremental checkpointing schemes for WSNs. The three methods differ in the granularity of the checkpointed memory data unit and in module execution overhead. In addition, we designed an incremental checkpoint file format that supports all three proposed schemes and adapts them to sensor network characteristics. We implemented the full and the three incremental checkpointing schemes on SOS on mica2 sensor motes. A performance evaluation of the three incremental schemes is presented. We also discuss and evaluate a method for selecting the appropriate incremental checkpointing scheme. To the best of our knowledge, this study is the first to design and implement incremental checkpointing in MMU-less WSNs.
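
One generic way to find changed memory without page protection hardware is to compare per-block digests between checkpoints. The sketch below shows that idea only; the abstract does not describe the paper's three schemes in enough detail to reproduce them, so the block size, function names, and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 16  # hypothetical granularity of the checkpointed memory unit, in bytes


def block_digests(memory: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Digest of every fixed-size block of the memory image."""
    return [
        hashlib.sha256(memory[i:i + block_size]).digest()
        for i in range(0, len(memory), block_size)
    ]


def incremental_checkpoint(memory: bytes, previous: list[bytes],
                           block_size: int = BLOCK_SIZE):
    """Return (delta, digests); delta maps block index to the bytes of each
    block whose digest differs from the previous checkpoint."""
    digests = block_digests(memory, block_size)
    delta = {}
    for idx, digest in enumerate(digests):
        if idx >= len(previous) or previous[idx] != digest:
            delta[idx] = memory[idx * block_size:(idx + 1) * block_size]
    return delta, digests


def restore(base: bytes, delta: dict[int, bytes],
            block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild an image by applying a delta to the base (full) checkpoint.
    Assumes the image has the same length as the base."""
    image = bytearray(base)
    for idx, data in delta.items():
        image[idx * block_size:idx * block_size + len(data)] = data
    return bytes(image)


# Full checkpoint of a 64-byte image, then a single byte changes in block 1.
base = bytes(64)
base_digests = block_digests(base)
current = bytearray(base)
current[20] = 0xFF
delta, _ = incremental_checkpoint(bytes(current), base_digests)
```

A smaller block size saves less data per checkpoint but needs more digests to compute and store, which is the kind of granularity trade-off the abstract mentions.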

16.

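File Checkpointing Based on Virtual File Operations

Liu Shaofeng, Wang Dongsheng, Zhu Jing 《软件学报》(Journal of Software) 2002,13(8):1528-1533

Single-process checkpointing and rollback recovery form the basis of fault tolerance in distributed and parallel systems, and saving and restoring information about active files is an important aspect of this technique. This paper proposes a virtual file operation strategy that implements checkpointing of user files and effectively solves the problem of inconsistency between user file contents and the global process state when a failure occurs. Using techniques such as block-based file management and distributed checkpoint operations, the method outperforms other methods on metrics such as space overhead, normal running time, and recovery time. It is also transparent to users and preserves completed work to the greatest possible extent.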

17.

Adaptive Remus: adaptive checkpointing for Xen-based virtual machine replication

Marcelo Pereira da Silva Rafael Rodrigues Obelheiro 《International Journal of Parallel, Emergent and Distributed Systems》2017,32(4):348-367

With the ever increasing dependence on computers and networks, many systems are required to be continuously available in order to fulfil their mission. Virtualization technology enables high availability to be offered in a convenient, cost-effective manner: with the encapsulation provided by virtual machines (VMs), entire systems can be replicated transparently in software, obviating the need for expensive fault-tolerant hardware. Remus is a VM replication mechanism for the Xen hypervisor that provides high availability despite crash failures. Replication is performed by checkpointing the VM at fixed intervals. However, there is an antagonism between processing and communication regarding the optimal checkpoint interval: while longer intervals benefit processor-intensive applications, shorter intervals favour network-intensive applications. Thus, any chosen interval may not always be suitable for the hosted applications, limiting Remus usage in many scenarios. This work introduces Adaptive Remus, a proposal for adaptive checkpointing in Remus that dynamically adjusts the replication frequency according to the characteristics of running applications. Experimental results indicate that our proposal improves performance for applications that require both processing and communication, without harming applications that use only one type of resource.

18.

Adaptive data replication strategy in cloud computing for performance improvement

Najme MANSOURI 《Frontiers of Computer Science》2016,10(5):925-935

Cloud computing is becoming a very popular term in industry and is receiving a large amount of attention from the research community. Replica management is one of the most important issues in the cloud, as it can offer fast data access, high data availability, and reliability. If all replicas are kept active, and if replicas and requests are reasonably distributed, replicas may increase the rate at which system tasks execute successfully. However, appropriate replica placement in large-scale, dynamically scalable, and fully virtualized data centers is much more complicated. To provide cost-effective availability, minimize application response time, and balance load across cloud storage, a new replica placement strategy is proposed. It is based on five parameters: mean service time, failure probability, load variance, latency, and storage usage. Replication should nevertheless be used wisely, because the storage size of each site is limited, so a site must keep only the important replicas. We also present a new replica replacement strategy based on the availability of the file, the last time the replica was requested, the number of accesses, and the size of the replica. We evaluate our algorithm using the CloudSim simulator and find that it offers better performance than other algorithms in terms of mean response time, effective network usage, load balancing, replication frequency, and storage usage.

19.

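Research on a Highly Available and Secure Directory Service for Virtual Organizations in Grid Computing

Zhao Xibin, Wu Lei, Pei Jun 《计算机科学》(Computer Science) 2007,34(6):124-127

Addressing the requirements that virtual organizations place on directory services, this paper combines a checkpoint archiving algorithm, the SOAP protocol, and a multi-machine deployment scheme to design and implement a lightweight directory service based on Web Services. It also adds security extensions for the directory service itself, based on a proxy server and a service migration mechanism. This directory service provides good availability to virtual organizations while maintaining the security and reliability of the service.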

20.

Stability, availability, and response in network file service

Gait J. 《IEEE Transactions on Software Engineering》1991,17(2):133-140

A network file system called Multifile is described. It meets response, availability, and stability requirements as primitive functions. Multifile has a high degree of responsiveness because its component parts compete among themselves to service file requests; it has high availability because it maintains multiple copies of files; and it exhibits stable behavior over a wide range of system parameters. The responsiveness of Multifile to read requests improves as the number of pages per request rises, implying that read-ahead pages can profitably be cached at client sites. The throughput of Multifile improves as the request size increases and as the number of clients increases. As server load increases, the responsiveness of Multifile to read requests is stable in most configurations. The throughput of writes is unstable as the number of pages in the write request rises, implying that write-back pages should not be cached at client sites. The scale of events in file service is dominated by disk activity, so lost message exceptions do not occur frequently enough to affect file service; however, duplicate message exceptions are a factor in performance.