Similar Documents
20 similar documents found (search time: 78 ms)
1.
User-perceived dependability and performance metrics are very different from conventional ones in that the dependability and performance properties must be assessed from the perspective of users accessing the system. In this paper, we develop techniques based on stochastic Petri nets (SPN) to analyze user-perceived dependability and performance properties of quorum-based algorithms for managing replicated data. A feature of the techniques developed in the paper is that no assumption is made regarding the interconnection topology, the number of replicas, or the quorum definition used by the replicated system, thus making them applicable to a wide class of quorum-based algorithms. We illustrate this technique by comparing conventional and user-perceived metrics in majority voting algorithms. Our analysis shows that when the user's perspective is taken into consideration, the effect of increasing the network connectivity and the number of replicas on the availability and dependability properties perceived by users is very different from that under conventional metrics. Thus, unlike conventional metrics, user-perceived metrics allow a tradeoff to be exploited between the hardware invested, i.e., higher network connectivity and more replicas, and the performance and dependability properties perceived by users.
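As a concrete illustration of the conventional metric discussed above (not the paper's SPN analysis), the sketch below computes the availability of a majority-voting quorum under an independent-failure assumption, plus a crude user-perceived variant in which the user's own access link must also be up. The up-probabilities p and q and both function names are illustrative assumptions.

    from math import comb

    def majority_quorum_availability(n, p):
        """Probability that at least a majority of n replicas are up,
        assuming independent replica up-probability p (conventional metric)."""
        need = n // 2 + 1
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(need, n + 1))

    def user_perceived_availability(n, p, q):
        """Same quorum, but the requesting user must also reach the system over
        an access link that is up with probability q (a crude stand-in for a
        user-perceived metric; the paper's SPN models are far more detailed)."""
        return q * majority_quorum_availability(n, p)

    for n in (3, 5, 7):
        print(n, round(majority_quorum_availability(n, 0.95), 4),
              round(user_perceived_availability(n, 0.95, 0.99), 4))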

2.
A fault-tolerant architectural approach for dependable systems
A system's structure enables it to generate its intended behavior from its components' behavior. A well-structured system simplifies relationships among components, which can increase dependability. With software systems, the architecture is an abstraction of the structure. Architectural reasoning about dependability has become increasingly important because emerging applications are increasingly complex. We've developed an architectural approach for effectively representing and analyzing fault-tolerant software systems. The proposed solution relies on exception handling to tolerate faults associated with component and connector failures, architectural mismatches, and configuration faults. Our approach, a specialization of the peer-to-peer architectural style, hides inside the architectural elements the complexities of exception handling and propagation. Our goal is to improve a system's overall reliability and availability by making it tolerant of nonmalicious faults.

3.
Based on extensive field failure data for Tandem's GUARDIAN operating system, the paper discusses evaluation of the dependability of operational software. Software faults considered are major defects that result in processor failures and invoke backup processes to take over. The paper categorizes the underlying causes of software failures and evaluates the effectiveness of the process pair technique in tolerating software faults. A model to describe the impact of software faults on the reliability of an overall system is proposed. The model is used to evaluate the significance of key factors that determine software dependability and to identify areas for improvement. An analysis of the data shows that about 77% of processor failures that are initially considered due to software are confirmed as software problems. The analysis shows that the use of process pairs to provide checkpointing and restart (originally intended for tolerating hardware faults) allows the system to tolerate about 75% of reported software faults that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events) being different from the original execution, is a major reason for the measured software fault tolerance. Over two-thirds (72%) of measured software failures are recurrences of previously reported faults. Modeling, based on the data, shows that, in addition to reducing the number of software faults, software dependability can be enhanced by reducing the recurrence rate.

4.
Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The sources of these problems are node failures and the need for dynamic configuration over extensive run-times. This paper presents two fault-tolerance mechanisms called Theft Induced Checkpointing and Systematic Event Logging. These are transparent protocols capable of overcoming problems associated with both benign faults (i.e., crash faults) and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multi-threaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications that need adaptive or reactive configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocols is very small and that the maximum work lost by a crashed process is small and bounded.
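The bounded-loss property mentioned at the end of the abstract can be illustrated with a minimal checkpointing sketch (a generic illustration under simple assumptions, not the Theft Induced Checkpointing or Systematic Event Logging protocols themselves): state is saved every `interval` completed work units, so a crash can lose at most `interval` units. Class and file names are hypothetical.

    import pickle

    class CheckpointedWorker:
        """Toy worker that checkpoints its progress every `interval` completed
        work units, so a crash loses at most `interval` units (a generic
        bounded-loss illustration; names and file format are hypothetical)."""

        def __init__(self, interval, path="state.ckpt"):
            self.interval = interval
            self.path = path
            self.done = 0

        def checkpoint(self):
            with open(self.path, "wb") as f:
                pickle.dump(self.done, f)

        def recover(self):
            try:
                with open(self.path, "rb") as f:
                    self.done = pickle.load(f)
            except FileNotFoundError:
                self.done = 0

        def run(self, total_units):
            self.recover()                   # resume from the last checkpoint
            while self.done < total_units:
                self.done += 1               # one unit of work
                if self.done % self.interval == 0:
                    self.checkpoint()        # bounds the work lost to `interval` units

    CheckpointedWorker(interval=10).run(total_units=100)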

5.
If an off-the-shelf software product exhibits poor dependability due to design faults, then software fault tolerance is often the only way available to users and system integrators to alleviate the problem. Thanks to low acquisition costs, even using multiple versions of software in a parallel architecture, which is a scheme formerly reserved for few and highly critical applications, may become viable for many applications. We have studied the potential dependability gains from these solutions for off-the-shelf database servers. We based the study on the bug reports available for four off-the-shelf SQL servers plus later releases of two of them. We found that many of these faults cause systematic noncrash failures, which is a category ignored by most studies and standard implementations of fault tolerance for databases. Our observations suggest that diverse redundancy would be effective for tolerating design faults in this category of products. Only in very few cases would demands that triggered a bug in one server cause failures in another one, and there were no coincident failures in more than two of the servers. Use of different releases of the same product would also tolerate a significant fraction of the faults. We report our results and discuss their implications, the architectural options available for exploiting them, and the difficulties that they may present.

6.
The Internet of Things (IoT) is a promising networking paradigm which immerses objects (cell phones, goods, watches, sensing motes, TVs, etc.) in a worldwide connection. Despite its high degree of applicability, the IoT faces some challenges. One of the most challenging problems is its dependability (reliability and availability), since a device failure might put people in danger or result in financial loss. The lack of a design tool for assessing the dependability of IoT applications at the early planning and design phases prevents system designers from optimizing their decisions so as to minimize the effects of such faults on the network devices. In this paper, we propose a dependability evaluation tool for IoT applications, when hardware faults and permanent link faults are considered.
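The abstract does not state the underlying model. As a hedged sketch of what a design-time dependability estimate for an IoT deployment might start from, the code below uses the textbook series/parallel reliability combination under an independent-failure assumption; all component names and probabilities are illustrative, not the paper's tool.

    def series(reliabilities):
        """All components must work (e.g., device -> link -> gateway)."""
        r = 1.0
        for x in reliabilities:
            r *= x
        return r

    def parallel(reliabilities):
        """At least one of the redundant components/paths must work."""
        unreliability = 1.0
        for x in reliabilities:
            unreliability *= (1.0 - x)
        return 1.0 - unreliability

    # Illustrative scenario: a sensing mote reaches the gateway over two independent links.
    mote, link_a, link_b, gateway = 0.99, 0.95, 0.90, 0.999
    print(round(series([mote, parallel([link_a, link_b]), gateway]), 4))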

7.
Multiprocessor technology plays a major role in the design of modern computer architectures. While multiprocessor systems offer extra computing power, they also open a new range of opportunities to improve fault robustness. This paper focuses on the problem of achieving fault tolerance using replication in real-time multiprocessor systems. In this problem, multiple replicas, or copies, of a computing task are executed on distinct processors to resist potential processor failures and computing faults. Two greedy approximation heuristics, named Worst Fit Increasing K-Replication and First Fit Increasing K-Replication, are studied to maximise the number of real-time tasks assigned to a system with identical processors while respecting the tasks' replication and timing requirements. Worst-case performance is analysed using the approximation ratio between the algorithms and an optimal solution; we prove mathematically that the ratios of both algorithms approach 2. Simulations on a large set of test cases shed light on the average performance of the algorithms in practice. The results show that both heuristic algorithms provide simple but fast and effective solutions to the problem.
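A minimal sketch of the greedy idea described above, under the assumption of a per-processor utilization bound of 1.0 (the abstract does not give the exact schedulability test): tasks are considered in increasing order of utilization, and a task is accepted only if k replicas fit on k distinct processors. The `worst_fit` flag switches between First Fit and Worst Fit ordering of candidate processors; all names are illustrative.

    def k_replication(utils, k, m, worst_fit=False):
        """Greedy K-Replication sketch: consider tasks in increasing utilization
        order; accept a task only if k replicas fit on k distinct processors.
        A utilization bound of 1.0 per processor is assumed for illustration.
        worst_fit=False -> First Fit Increasing, True -> Worst Fit Increasing.
        Returns the number of accepted tasks."""
        load = [0.0] * m
        accepted = 0
        for u in sorted(utils):
            candidates = [i for i in range(m) if load[i] + u <= 1.0]
            if worst_fit:                        # most remaining capacity first
                candidates.sort(key=lambda i: load[i])
            chosen = candidates[:k]
            if len(chosen) == k:
                for i in chosen:
                    load[i] += u
                accepted += 1
        return accepted

    print(k_replication([0.2, 0.5, 0.3, 0.4], k=2, m=3))
    print(k_replication([0.2, 0.5, 0.3, 0.4], k=2, m=3, worst_fit=True))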

8.
9.
Summary. In a shared-memory distributed system, n independent asynchronous processes communicate by reading and writing to shared variables. An algorithm is adaptive (to total contention) if its step complexity depends only on the actual number, k, of active processes in the execution; this number is unknown in advance and may change in different executions of the algorithm. Adaptive algorithms are inherently wait-free, providing fault-tolerance in the presence of an arbitrary number of crash failures and varying process speeds. A wait-free adaptive collect algorithm with O(k) step complexity is presented, together with its applications in wait-free adaptive algorithms for atomic snapshots, immediate snapshots and renaming. Received: August 1999 / Accepted: August 2001

10.
With the rapid development of computer technology, the scale of distributed applications has grown quickly, and more and more software systems adopt a service-oriented architecture (SOA). One effective way to improve the reliability and scalability of SOA is to provide service replicas and to balance load among the different replicas through a middleware-based load-balancing service. By using middleware, we can meet the performance, scalability, and availability requirements of today's service-oriented applications. However, the load computation must be predictable to a certain degree in order to avoid the impact of load spikes. For complex service-oriented applications, a load spike means that the system may experience extremely high load for a short time while the load remains relatively stable most of the time; because load sampling is delayed, the system can become overloaded, response times increase, and overall throughput suffers. Therefore, to reduce response time and to use service replicas effectively even when the load fluctuates frequently, we propose and implement a machine-learning-based prediction mechanism in middleware to meet the need for an adaptive and flexible load-balancing mechanism.
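The abstract does not specify the learning model. As a hedged sketch, a simple least-squares trend over each replica's recent load samples could serve as the predictor, with requests routed to the replica whose predicted next-step load is lowest. The function and replica names below are illustrative, not the paper's middleware API.

    def predict_next(samples):
        """Least-squares linear trend over recent load samples; returns the
        predicted load one step ahead (a simple stand-in for the paper's
        machine-learning predictor, which the abstract does not detail)."""
        n = len(samples)
        if n < 2:
            return samples[-1] if samples else 0.0
        xs = range(n)
        mean_x, mean_y = (n - 1) / 2.0, sum(samples) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
        var = sum((x - mean_x) ** 2 for x in xs)
        slope = cov / var
        return mean_y + slope * (n - mean_x)

    def pick_replica(load_histories):
        """Route the next request to the replica with the lowest predicted load."""
        return min(load_histories, key=lambda r: predict_next(load_histories[r]))

    histories = {"replica-1": [0.3, 0.4, 0.6], "replica-2": [0.5, 0.45, 0.4]}
    print(pick_replica(histories))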

11.
P2P streaming-media caching is an effective technique for reducing bandwidth overhead and improving object utilization; cached content is usually replaced with algorithms such as FIFO and LRU. However, streaming media differ from Web objects, and P2P networks differ from the client/server model, so these algorithms may hurt system performance in distributed applications. This paper therefore analyzes the FIFO and LRU replacement algorithms and proposes the SD algorithm, which is based on supply-demand relations, and the REP algorithm, which is based on the number of segment replicas, and then evaluates and compares them. Comparing SD and REP with FIFO and LRU under different node arrival intervals shows that SD and REP outperform FIFO and LRU in almost all cases with respect to startup delay, number of media replicas, and dependence on the root node. Compared with the LSB (least sent bytes) algorithm, SD reduces startup delay by about 40% in some scenarios, and REP yields far more replicas than LSB, indicating that using the SD and REP cache replacement algorithms in P2P streaming-media services helps improve system performance.
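The exact SD formula is not given in the abstract. A minimal sketch, assuming SD evicts the cached segment whose supply (copies available among neighboring peers) most exceeds its demand (recent local requests); the ratio and all names are illustrative assumptions.

    def sd_evict(cache, supply, demand):
        """Supply-demand (SD) eviction sketch: evict the cached media segment
        with the highest supply-to-demand ratio, i.e. the one most easily
        re-fetched from peers and least requested locally. The exact SD formula
        is not stated in the abstract; this ratio is an illustrative assumption."""
        return max(cache, key=lambda seg: supply.get(seg, 0) / max(demand.get(seg, 0), 1))

    cache = {"seg-01", "seg-02", "seg-03"}
    supply = {"seg-01": 8, "seg-02": 2, "seg-03": 5}   # replicas among neighbors
    demand = {"seg-01": 1, "seg-02": 6, "seg-03": 5}   # recent requests
    print(sd_evict(cache, supply, demand))             # seg-01: plentiful, little demand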

12.
Large-scale distributed applications such as online information retrieval and collaboration over computational elements demand an approach to self-managed computing systems with a minimum of human interference. However, large scale and full distribution often lead to poor system dependability and security, and increase the difficulty of managing and controlling redundancy for fault tolerance. In particular, fault tolerance schemes that allow mobile agents to survive agent server crash failures in an autonomic environment are complex, since developers normally have no control over remote agent servers. Some solutions inject a replica into stable storage upon its arrival at an agent server, but in the event of an agent server crash the replica is unavailable until the agent server recovers. In this paper we present a failure model and an exception handling framework for mobile agent systems. An exception handling scheme is developed for mobile agents to survive agent server crash failures. A replica mobile agent operates at the agent server visited prior to its master's current location; if the master crashes, its replica is available as a replacement. The proposed scheme is examined in comparison with a simple time-out scheme. Experimental evaluation is performed, and performance results show that the scheme adds some overhead to the round trip time when fault tolerance measures are exercised. However, the scheme offers the advantage that fault tolerance is provided during the mobile agent's trip, i.e., in the event of an agent server crash, not all agent servers need to be revisited.

13.
A dependable middleware should be able to adaptively share the distributed resources it manages in order to meet diverse application requirements, even when the quality of service (QoS) is degraded due to uncertain variations in load and unanticipated failures. We have addressed this issue in the context of a dependable middleware that adaptively manages replicated servers to deliver a timely and consistent response to time-sensitive client applications. These applications have specific temporal and consistency requirements, and can tolerate a certain degree of relaxed consistency in exchange for better response time. We propose a flexible QoS model that allows clients to specify their timeliness and consistency constraints. We also propose an adaptive framework that dynamically selects replicas to service a client's request based on the prediction made by probabilistic models. These models use the feedback from online performance monitoring of the replicas to provide probabilistic guarantees for meeting a client's QoS specification. The experimental results we have obtained demonstrate the role of feedback and the efficacy of simple analytical models for adaptively sharing the available replicas among the users under different workload scenarios.
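As a hedged sketch of the selection idea (an empirical-distribution stand-in for the paper's probabilistic models, with hypothetical names): estimate, from online response-time samples, each replica's probability of answering within the client's deadline, and serve the request from a replica that reaches the requested confidence.

    def p_meets_deadline(samples, deadline):
        """Empirical probability that a replica answers within `deadline`,
        based on its recent response-time samples (an illustrative stand-in
        for the paper's probabilistic models)."""
        return sum(1 for s in samples if s <= deadline) / len(samples)

    def select_replica(monitored, deadline, confidence):
        """Return the replica most likely to meet the deadline, provided it
        reaches the requested confidence; otherwise None (QoS cannot be met)."""
        best = max(monitored, key=lambda r: p_meets_deadline(monitored[r], deadline))
        return best if p_meets_deadline(monitored[best], deadline) >= confidence else None

    monitored = {"r1": [40, 55, 70, 120], "r2": [80, 90, 95, 100]}   # recent latencies (ms)
    print(select_replica(monitored, deadline=100, confidence=0.9))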

14.
Work to date on algorithms for message-passing systems has explored a wide variety of types of faults, but corresponding work on shared memory systems has usually assumed that only crash faults are possible. In this work, we explore situations in which processes accessing shared objects can fail arbitrarily (Byzantine faults). Received: December 2000 / Accepted: July 2002. A preliminary version of the results presented in this paper appeared in Proceedings of the 14th International Symposium on Distributed Computing, Toledo, Spain, October 2000.

15.
This paper presents the design and implementation of Jgroup/ARM, a distributed object group platform with autonomous replication management along with a novel measurement‐based assessment technique that is used to validate the fault‐handling capability of Jgroup/ARM. Jgroup extends Java RMI through the group communication paradigm and has been designed specifically for application support in partitionable systems. ARM aims at improving the dependability characteristics of systems through a fault‐treatment mechanism. Hence, ARM focuses on deployment and operational aspects, where the gain in terms of improved dependability is likely to be the greatest. The main objective of ARM is to localize failures and to reconfigure the system according to application‐specific dependability requirements. Combining Jgroup and ARM can significantly reduce the effort necessary for developing, deploying and managing dependable, partition‐aware applications. Jgroup/ARM is evaluated experimentally to validate its fault‐handling capability; the recovery performance of a system deployed in a wide area network is evaluated. In this experiment multiple nearly coincident reachability changes are injected to emulate network partitions separating the service replicas. The results show that Jgroup/ARM is able to recover applications to their initial state in several realistic failure scenarios, including multiple, concurrent network partitionings. Copyright © 2007 John Wiley & Sons, Ltd.

16.
Adaptive compensation for infinite number of actuator failures or faults
It is both theoretically and practically important to investigate the problem of accommodating an infinite number of actuator failures or faults in controlling uncertain systems. However, no results are yet available on developing adaptive controllers to address this problem. In this paper, a new adaptive failure/fault compensation control scheme is proposed for parametric strict-feedback nonlinear systems. The techniques of nonlinear damping and parameter projection are employed in the design of the controllers and parameter estimators, respectively. It is proved that the boundedness of all closed-loop signals can still be ensured in the case of an infinite number of failures or faults, provided that the time interval between two successive changes of the failure/fault pattern is bounded below by an arbitrary positive number. The performance of the tracking error in the mean-square sense with respect to the frequency of failure/fault pattern changes is also established. Moreover, asymptotic tracking can be achieved when the total number of failures and faults is finite.

17.
Object Synchronization and Merging in CSCW
To meet response-time and reliability requirements, collaborative applications usually adopt a fully replicated architecture. A major challenge this brings is how to keep the objects replicated at the different collaborating sites consistent and how to reach a common state. This paper summarizes the different requirements of the online and offline collaboration modes and, on that basis, presents the corresponding control mechanisms: an object synchronization algorithm for the online mode and an object merging algorithm for the offline mode. These mechanisms have been implemented in the Cova collaboration support platform, and practical applications show that, together with other mechanisms, they satisfy the data consistency requirements of collaborative applications well.

18.
This paper addresses the two fundamental issues in replication, namely deciding on the number and placement of the replicas and the distribution of requests among replicas. We first introduce a centralized algorithm for replicating objects that can keep a balanced load on sites. In order to meet the requirement due to the dynamic nature of the Internet traffic and the rapid change in the access pattern of the World-Wide Web (Web), we also propose a distributed algorithm where each site relies on some collected information to decide on where to replicate and migrate objects to achieve good performance. The performance of the proposed algorithms is evaluated experimentally and a comparison of their measured performance is presented.
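The abstract does not detail the centralized algorithm. A minimal sketch of one greedy way to keep site loads balanced (illustrative only, not the paper's algorithm): place each object's replicas, heaviest objects first, on the currently least-loaded distinct sites, splitting the object's request load across its replicas.

    import heapq

    def place_replicas(object_loads, num_replicas, num_sites):
        """Greedy balanced-placement sketch: for each object (heaviest first),
        put its replicas on the currently least-loaded distinct sites.
        Returns {object: [site indices]} and the resulting per-site load."""
        site_load = [0.0] * num_sites
        placement = {}
        for obj, load in sorted(object_loads.items(), key=lambda kv: -kv[1]):
            sites = heapq.nsmallest(num_replicas, range(num_sites),
                                    key=lambda s: site_load[s])
            for s in sites:
                site_load[s] += load / num_replicas   # request load split across replicas
            placement[obj] = sites
        return placement, site_load

    placement, loads = place_replicas({"a": 9.0, "b": 6.0, "c": 3.0},
                                      num_replicas=2, num_sites=3)
    print(placement, [round(x, 2) for x in loads])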

19.
Patching technologies are commonly applied to improve the dependability of software after release. This paper reports the design of an automated hot patching (AHP) framework that fully automates reasoning about the causes of failures and patching the binary code of Web-based applications. AHP acknowledges the difficulty of rooting out all faults before product release and autonomously patches problems in application programs. By operating directly on binary code, AHP is applicable to virtually all applications. A promising application of AHP is to shortcut a function of the remote maintenance center (RMC) and hence reduce the turnaround time for patches.

20.
Adaptation is a desirable requirement in a distributed system, as it helps the system perform efficiently under different environments. For many problems, more than one protocol exists, such that one protocol performs better in one environment while another performs better in a different one. In such cases, adaptive distributed systems can be designed by dynamically switching between the protocols as the environment changes. Distributed protocol switching is also important for performance enhancement or fault tolerance of a distributed system. In this work, we illustrate distributed protocol switching by providing a distributed algorithm for adaptive broadcast that dynamically switches from a BFS tree to a DFS tree. The proposed switching algorithm can also handle arbitrary crash failures. It ensures that switching eventually terminates in spite of failures and that the desired tree (the DFS tree) results as the output. We also investigate the properties that can be guaranteed on the delivery of broadcast messages under specific failure conditions. We show that under no failure, each broadcast message is eventually correctly delivered to all the nodes in spite of switching. Under arbitrary crash faults, we ensure that switching eventually terminates with the desired tree as the broadcast topology. We also investigate the specific delivery guarantees that can be provided when a single crash fault occurs, both during switching and when no switching is in progress.
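As a hedged illustration of the two broadcast topologies the protocol switches between (centralized construction only; the distributed switching and fault handling described in the abstract are not shown), the sketch below builds a BFS tree and a DFS tree of the same graph as parent maps. The adjacency list and node names are illustrative.

    from collections import deque

    def bfs_tree(adj, root):
        """Parent map of a BFS spanning tree rooted at `root`."""
        parent, queue = {root: None}, deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    def dfs_tree(adj, root):
        """Parent map of a DFS spanning tree rooted at `root`."""
        parent = {root: None}
        def visit(u):
            for v in adj[u]:
                if v not in parent:
                    parent[v] = u
                    visit(v)
        visit(root)
        return parent

    adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
    print(bfs_tree(adj, 1))   # shallow tree: the initial broadcast topology
    print(dfs_tree(adj, 1))   # deeper tree: the topology after switching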
