1.
Sameer Bataineh 《Computers & Electrical Engineering》2004,30(4):291-308
Using queuing theory, closed-form solutions for the response time of a fault-tolerant network-of-processors system based on the primary site approach are obtained. In the primary site approach, fault tolerance is achieved by replicating the services of the primary at many nodes. All requests are sent to the primary, which periodically checkpoints its status on the backup nodes. If the primary fails, one of the backups takes over as primary. Two mechanisms for repairing faulty nodes are considered: delayed repair and immediate repair. Besides being in closed form, the analytical results presented in this paper have several other advantages over those of previous work. First, in the immediate repair case, there is no need to solve a set of recursive equations. Second, the results reveal many characteristics of the system. We studied the effect of the checkpointing rate on the system response time and found a closed-form solution for the optimum checkpointing rate, which minimizes the system response time.
2.
Huang Y. Tripathi S.K. 《IEEE transactions on pattern analysis and machine intelligence》1993,19(2):108-119
Resource allocation for a distributed system employing the primary site approach for fault tolerance is discussed. Two kinds of systems are considered. The first consists of fault-tolerant nodes, where each node has many duplicated servers. One server is the primary, which serves user requests, and the rest are backups. The second does not have fault-tolerant nodes. To tolerate node failures, each node uses other nodes as backups. When a node fails, all requests initially allocated to the node are served by one of its backups. To study resource allocation for such systems, an approximate model for each system is developed. Using these models, efficient allocation algorithms that take into account the failure/repair rates of the system and the fault-tolerance overheads are presented. Experimental results show that the algorithms give optimal or suboptimal allocations. The algorithms, which incur little overhead, can improve system performance significantly over an intuitive allocation algorithm.
3.
Roll-forward recovery schemes were proposed to enhance the performance of fault-tolerant systems that employ the checkpointing approach. In roll-forward schemes, multiple processors are used for simultaneous roll-forward and validation processing. This paper proposes the sample comparison approach along with checkpointing, which further improves performance by reducing the overhead imposed by checkpointing. We also develop general analytical models for estimating availability, which are applicable to any checkpointing scheme. Performance comparisons reveal that the availabilities of the checkpointing schemes with sample comparison are higher than those of the schemes without it, while the required checkpoint interval is larger.
4.
Roll-forward recovery schemes were proposed to enhance the performance of fault-tolerant systems that employ the checkpointing approach. In roll-forward schemes, multiple processors are used for simultaneous roll-forward and validation processing. This paper proposes the sample comparison approach along with checkpointing, which further improves performance by reducing the overhead imposed by checkpointing. We also develop general analytical models for estimating availability, which are applicable to any checkpointing scheme. Performance comparisons reveal that the availabilities of the checkpointing schemes with sample comparison are higher than those of the schemes without it, while the required checkpoint interval is larger. This research was supported in part by the MIC (Ministry of Information and Communication), Korea, under the ITRC support program supervised by the UTA and CUCN 21st Century Frontier R&D Program.
5.
Chtepen M. Claeys F.H.A. Dhoedt B. De Turck F. Demeester P. Vanrolleghem P.A. 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(2):180-190
A grid is a distributed computational and storage environment, often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly used techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt these parameters based on information about grid status, to provide high job throughput in the presence of failures while reducing system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment, Dynamic Scheduling in Distributed Environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run with workload and system parameters derived from logs collected from several large-scale parallel production systems. Experiments show that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.
6.
Walters John Paul Chaudhary Vipin 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(7):997-1010
As computational clusters increase in size, their mean time to failure decreases drastically. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for storing checkpoints. This results in a bottleneck and severely limits the scalability of checkpointing, while also proving too expensive for dedicated checkpointing networks and storage systems. We propose a scalable replication-based MPI checkpointing facility. Our reference implementation is based on LAM/MPI; however, it is directly applicable to any MPI implementation. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, a Sun-X4500-based solution, an EMC storage area network (SAN), and the Ibrix commercial parallel file system and show that they are not scalable, particularly beyond 64 CPUs. We demonstrate the low overhead of our checkpointing and replication scheme with the NAS Parallel Benchmarks and the High-Performance LINPACK benchmark in tests with up to 256 nodes, showing that checkpointing and replication can be achieved with much lower overhead than current techniques provide. Finally, we show that the monetary cost of our solution is as low as 25 percent of that of a typical SAN/parallel-file-system-equipped storage system.
7.
Navid Aghdaie Yuval Tamir 《Journal of Systems and Software》2009,82(1):131-143
The Web is increasingly used for critical applications and services. We present a client-transparent mechanism, called CoRAL, that provides high reliability and availability for Web services. CoRAL provides fault tolerance even for requests being processed at the time of server failure. The scheme does not require deterministic servers and can thus handle dynamic content. CoRAL actively replicates the TCP connection state while maintaining logs of HTTP requests and replies. In the event of a primary server failure, active client connections fail over to a spare, where their processing continues seamlessly. We describe key aspects of the design and implementation as well as several performance optimizations. Measurements of system overhead, failover performance, and preliminary validation using fault injection are presented.
8.
9.
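For main-memory database systems that serve update-intensive applications, the checkpointing technique should meet several key requirements: checkpointing should interfere with normal transaction processing as little as possible, cope with skewed access patterns, support fast recovery of the database system, and keep the system available during recovery. This paper proposes a transaction-consistent partitioned checkpointing technique. It uses a tuple-based dynamic multi-version concurrency control mechanism, which avoids lock conflicts between read and write transactions and improves system throughput. The checkpoint operation is implemented as a read-only transaction, so under multi-version concurrency control it does not block normal transaction processing. Because the checkpoint files are transaction-consistent, only the Redo log information of transactions needs to be recorded, and recovery requires only a single scan of the log file, which speeds up the recovery process. Priority-based loading and recovery of data partitions allows data access requests from new transactions to be satisfied quickly during recovery, which preserves system availability while recovery is in progress. A two-level version management mechanism and dynamic version sharing reduce the space overhead of multi-version management to an acceptable level. Experimental results show that the proposed checkpointing scheme achieves 27% higher system throughput than fuzzy checkpointing, while the space overhead of version management stays within an acceptable range, meeting the requirements of high-performance applications.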
10.
《Journal of Systems Architecture》2015,61(2):71-81
This paper addresses energy minimization when executing real-time applications that have stringent reliability and deadline requirements. To guarantee that the application’s reliability and deadline requirements are satisfied, checkpointing, Dynamic Voltage Frequency Scaling (DVFS), and backward fault recovery techniques are used. We formally prove that, with backward fault recovery, executing an application at a uniform frequency (or at neighboring frequencies if the desired frequency is not available) not only consumes the minimal energy but also results in the highest system reliability. Based on this theoretical conclusion, we develop a strategy that uses DVFS and checkpointing to execute real-time applications so that the application’s reliability and deadline requirements are guaranteed and the energy consumed in executing it is minimized. The strategy needs at most one execution frequency change during the execution of an application. The overhead caused by frequency switching is therefore small, which makes the strategy particularly useful for processors with a large frequency switching overhead. We empirically compare the developed strategy with recently published work. The experimental results show that, without sacrificing reliability and deadline guarantees, the proposed approach can save up to 12% more energy than other approaches.
11.
Krishna Kant 《Information Sciences》1983,30(3):225-239
The paper proposes a new technique for providing software fault tolerance in concurrent systems. It combines the traditional global checkpointing mechanism with the recovery block concept to obtain an easily implementable error recovery mechanism. Under moderate to high process interaction, this mechanism involves smaller overhead than the schemes considered in the past, which are based on local checkpointing. A model for computing the optimum checkpointing interval is also presented. A particular distribution is hypothesized for the coverage of the recovery, and the behavior of the model is studied in detail for this case.
12.
Hiroyuki Okamura Tadashi Dohi 《Journal of Systems and Software》2010,83(9):1591-1604
This paper presents a comprehensive evaluation of aperiodic time-based checkpointing and rejuvenation schemes that maximize the steady-state system availability of an operational software system. We consider two kinds of maintenance policies: checkpointing prior to rejuvenating (CPTR) and rejuvenating prior to checkpointing (RPTC). These schemes are complementary to each other in scheduling checkpoints and rejuvenation points. In addition, under a periodic full maintenance operation, we show by dynamic programming that an aperiodic checkpointing or rejuvenation scheme is optimal for maximizing the steady-state system availability. In numerical examples, CPTR and RPTC are compared under the same overhead parameters, and their effects on maximizing the steady-state system availability are investigated.
13.
This paper presents a method for obtaining the optimum checkpoint interval of a transaction processing computer system subject to time-dependent failures. The system uses checkpointing to create a valid system state, and roll-back to recover from failures. By maximizing system availability, we derive the optimum checkpoint interval as a function of the system load and of the time-dependent failure rate. The results are illustrated numerically for the Weibull failure rate. (Author note: on leave from Universidad de Los Andes, Venezuela.)
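As a rough illustration of what an optimum checkpoint interval looks like, the sketch below uses Young's classic first-order approximation, T = sqrt(2 C M), where C is the cost of one checkpoint and M is the mean time between failures. This is not the method of the paper summarized above: it assumes a constant failure rate, whereas the paper treats time-dependent (for example Weibull) failure rates and system load. The function names and the numbers are illustrative assumptions.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimum checkpoint interval, sqrt(2 * C * M).

    Assumption: failures arrive at a constant rate 1/mtbf. Both arguments
    must use the same time unit.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order fraction of time lost for a given interval.

    checkpoint_cost / interval is the time spent writing checkpoints;
    interval / (2 * mtbf) is the expected rework after a failure, which
    on average lands halfway through an interval.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost, mtbf = 2.0, 100.0  # hypothetical values, e.g. in minutes
    best = young_interval(cost, mtbf)
    for t in (best / 2, best, best * 2):
        print(f"interval={t:6.1f}  overhead={overhead_fraction(t, cost, mtbf):.3f}")
```

Checkpointing more often than the optimum wastes time writing checkpoints, and checkpointing less often wastes time redoing lost work, so the overhead curve has a single minimum between the two.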
14.
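As distributed storage solutions that offer high performance and high scalability, key-value stores such as Redis, MongoDB, and Cassandra have been widely adopted in recent years. The multi-replica mechanism widely used in distributed storage systems improves throughput and reliability, but it also adds overhead for system coordination and replica consistency. For cross-region distributed systems, the cost of coordinating replicas over long distances can even become the performance bottleneck, reducing system availability and throughput. This paper proposes Elsa, a coordination-free distributed key-value store designed for cross-region architectures. While preserving high performance and high scalability, Elsa uses conflict-free replicated data types (CRDTs) to guarantee strong eventual consistency among replicas without coordination, which reduces the coordination overhead between system nodes. A cross-region distributed environment with 8 nodes across 4 data centers was built on Alibaba Cloud, and large-scale distributed performance comparison experiments were carried out. The results show that, in a cross-region distributed environment under highly concurrent, contended workloads, Elsa has a clear performance advantage, reaching up to 7.37 times the performance of a MongoDB cluster and 1.62 times that of a Cassandra cluster.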
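To make the CRDT idea concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest conflict-free replicated data types. It is a generic textbook example, not Elsa's implementation; the class name and the replica identifiers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter CRDT: one monotonically increasing slot per replica."""

    replica_id: str
    counts: dict = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # Each replica only ever writes to its own slot.
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        # The counter value is the sum over all replica slots.
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise maximum is commutative, associative, and idempotent,
        # so replicas converge regardless of message order or duplication.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


# Two replicas accept writes independently, then exchange state.
a = GCounter("dc-a")
b = GCounter("dc-b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
b.merge(a)  # merging the same state again changes nothing
```

Because merging never needs a lock or a round of agreement, replicas can accept writes locally and reconcile later, which is the property a coordination-free store relies on.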
15.
For some safety-critical applications, sensor nodes hold valuable information and should be able to operate unattended and without failure for several months or years. One promising solution is checkpointing, which periodically saves the state of a sensor node and thereby maintains node reliability and network availability. This study first presents the design and implementation of full checkpointing for WSNs. However, checkpointing is expensive. Incremental checkpointing was previously proposed to eliminate the checkpoint overhead by relying on page protection hardware to identify dirty pages. Because sensor nodes are resource-constrained and are not equipped with page protection hardware, previous incremental checkpointing schemes cannot be applied directly. To address this issue, this paper proposes three incremental checkpointing schemes for WSNs. The three methods differ in the granularity of the checkpointed memory data unit and in module execution overhead. In addition, we designed an incremental checkpoint file format that supports all three proposed schemes and adapts them to sensor network characteristics. We implemented the full and the three incremental checkpointing schemes on SOS on mica2 sensor motes. A performance evaluation of the three incremental schemes is presented. We also discuss and evaluate a method for selecting the appropriate incremental checkpointing scheme. To the best of our knowledge, this study is the first to design and implement incremental checkpointing in MMU-less WSNs.
16.
17.
Marcelo Pereira da Silva Rafael Rodrigues Obelheiro 《International Journal of Parallel, Emergent and Distributed Systems》2017,32(4):348-367
With the ever-increasing dependence on computers and networks, many systems are required to be continuously available in order to fulfil their mission. Virtualization technology enables high availability to be offered in a convenient, cost-effective manner: with the encapsulation provided by virtual machines (VMs), entire systems can be replicated transparently in software, obviating the need for expensive fault-tolerant hardware. Remus is a VM replication mechanism for the Xen hypervisor that provides high availability despite crash failures. Replication is performed by checkpointing the VM at fixed intervals. However, processing and communication pull in opposite directions regarding the optimal checkpoint interval: longer intervals benefit processor-intensive applications, while shorter intervals favour network-intensive applications. Thus, any chosen interval may not always suit the hosted applications, which limits the use of Remus in many scenarios. This work introduces Adaptive Remus, a proposal for adaptive checkpointing in Remus that dynamically adjusts the replication frequency according to the characteristics of the running applications. Experimental results indicate that our proposal improves performance for applications that require both processing and communication, without harming applications that use only one type of resource.
18.
Najme MANSOURI 《Frontiers of Computer Science》2016,10(5):925-935
Cloud computing is becoming a very popular term in industry and is receiving a large amount of attention from the research community. Replica management is one of the most important issues in the cloud, since it can offer fast data access, high data availability, and reliability. When all replicas are kept active, they can raise the rate of successful task execution, provided the replicas and requests are reasonably distributed. However, appropriate replica placement in large-scale, dynamically scalable, and fully virtualized data centers is much more complicated. To provide cost-effective availability, minimize application response time, and balance load across cloud storage, a new replica placement strategy is proposed. The placement is based on five important parameters: mean service time, failure probability, load variance, latency, and storage usage. However, replication should be used wisely because the storage size of each site is limited. Thus, a site must keep only the important replicas. We also present a new replica replacement strategy based on the availability of the file, the last time the replica was requested, the number of accesses, and the size of the replica. We evaluate our algorithm using the CloudSim simulator and find that it offers better performance than other algorithms in terms of mean response time, effective network usage, load balancing, replication frequency, and storage usage.
19.
20.
A network file system called Multifile is described. It meets response, availability, and stability requirements as primitive functions. Multifile has a high degree of responsiveness because its component parts compete among themselves to service file requests; it has high availability because it maintains multiple copies of files; and it exhibits stable behavior over a wide range of system parameters. The responsiveness of Multifile to read requests improves as the number of pages per request rises, implying that read-ahead pages can profitably be cached at client sites. The throughput of Multifile improves as the request size increases and as the number of clients increases. As server load increases, the responsiveness of Multifile to read requests is stable in most configurations. The throughput of writes is unstable as the number of pages in the write request rises, implying that write-back pages should not be cached at client sites. The scale of events in file service is dominated by disk activity, so lost-message exceptions do not occur frequently enough to affect file service; however, duplicate-message exceptions are a factor in performance.