首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 62 毫秒
1.
Based on extensive field failure data for Tandem's GUARDIAN operating system, the paper discusses evaluation of the dependability of operational software. Software faults considered are major defects that result in processor failures and invoke backup processes to take over. The paper categorizes the underlying causes of software failures and evaluates the effectiveness of the process pair technique in tolerating software faults. A model to describe the impact of software faults on the reliability of an overall system is proposed. The model is used to evaluate the significance of key factors that determine software dependability and to identify areas for improvement. An analysis of the data shows that about 77% of processor failures that are initially considered due to software are confirmed as software problems. The analysis shows that the use of process pairs to provide checkpointing and restart (originally intended for tolerating hardware faults) allows the system to tolerate about 75% of reported software faults that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events) being different from the original execution, is a major reason for the measured software fault tolerance. Over two-thirds (72%) of measured software failures are recurrences of previously reported faults. Modeling, based on the data, shows that, in addition to reducing the number of software faults, software dependability can be enhanced by reducing the recurrence rate  相似文献   

2.
The AQuA architecture provides adaptive fault tolerance to CORBA applications by replicating objects and providing a high-level method that an application can use to specify its desired level of dependability. This paper presents the algorithms that AQUA uses, when an application's dependability requirements can change at runtime, to tolerate both value faults in applications and crash failures simultaneously. In particular, we provide an active replication communication scheme that maintains data consistency among replicas, detects crash failures, collates the messages generated by replicated objects, and delivers the result of each vote. We also present an adaptive majority voting algorithm that enables the correct ongoing vote while both the number of replicas and the majority size dynamically change. Together, these two algorithms form the basis of the mechanism for tolerating and recovering from value faults and crash failures in AQuA  相似文献   

3.
Large-scale distributed applications such as online information retrieval and collaboration over computational elements demand an approach to self-managed computing systems with a minimum of human interference. However, large scales and full distribution often lead to poor system dependability and security, and increase the difficulty in managing and controlling redundancy for fault tolerance. In particular, fault tolerance schemes for mobile agents to survive agent server crash failures in an autonomie environment are complex since developers normally have no control over remote agent servers. Some solutions inject a replica into stable storage upon its arrival at an agent server. But in the event of an agent server crash the replica is unavailable until the agent server recovers. In this paper we present a failure model and an exception handling framework for mobile agent systems. An exception handling scheme is developed for mobile agents to survive agent server crash failures. A replica mobile agent operates at the agent server visited prior to its master's current location. If a master crashes its replica is available as a replacement. The proposed scheme is examined in comparison with a simple time-out scheme. Experimental evaluation is performed, and performance results show that the scheme leads to some overhead in the round trip time when fault tolerance measures are exercised. However the scheme offers the advantage that fault tolerance is provided during the mobile agent trip, i.e. in the event of an agent server crash all agent servers are not revisited.  相似文献   

4.
In general, faults cannot be prevented; instead, they need to be tolerated to guarantee certain degrees of software dependability. We develop a theory for fault tolerance for a distributed pi-calculus, whereby locations act as units of failure and redundancy is distributed across independently failing locations. We give formal definitions for fault tolerant programs in our calculus, based on the well studied notion of contextual equivalence. We then develop bisimulation proof techniques to verify fault tolerance properties of distributed programs and show they are sound with respect to our definitions for fault tolerance.  相似文献   

5.
Basic concepts and taxonomy of dependable and secure computing   总被引:33,自引:0,他引:33  
This paper gives the main definitions relating to dependability, a generic concept including a special case of such attributes as reliability, availability, safety, integrity, maintainability, etc. Security brings in concerns for confidentiality, in addition to availability and integrity. Basic definitions are given first. They are then commented upon, and supplemented by additional definitions, which address the threats to dependability and security (faults, errors, failures), their attributes, and the means for their achievement (fault prevention, fault tolerance, fault removal, fault forecasting). The aim is to explicate a set of general concepts, of relevance across a wide range of situations and, therefore, helping communication and cooperation among a number of scientific and technical communities, including ones that are concentrating on particular types of system, of system failures, or of causes of system failures.  相似文献   

6.
The authors describe a number of results from a quantitative study of faults and failures in two releases of a major commercial software system. They tested a range of basic software engineering hypotheses relating to: the Pareto principle of distribution of faults and failures; the use of early fault data to predict later fault and failure data; metrics for fault prediction; and benchmarking fault data. For example, we found strong evidence that a small number of modules contain most of the faults discovered in prerelease testing and that a very small number of modules contain most of the faults discovered in operation. We found no evidence to support previous claims relating module size to fault density nor did we find evidence that popular complexity metrics are good predictors of either fault-prone or failure-prone modules. We confirmed that the number of faults discovered in prerelease testing is an order of magnitude greater than the number discovered in 12 months of operational use. The most important result was strong evidence of a counter-intuitive relationship between pre- and postrelease faults; those modules which are the most fault-prone prerelease are among the least fault-prone postrelease, while conversely, the modules which are most fault-prone postrelease are among the least fault-prone prerelease. This observation has serious ramifications for the commonly used fault density measure. Our results provide data-points in building up an empirical picture of the software development process  相似文献   

7.
An important step in the development of dependable systems is the validation of their fault tolerance properties. Fault injection has been widely used for this purpose, however with the rapid increase in processor complexity, traditional techniques are also increasingly more difficult to apply. This paper presents a new software-implemented fault injection and monitoring environment, called Xception, which is targeted at modern and complex processors. Xception uses the advanced debugging and performance monitoring features existing in most modern processors to inject quite realistic faults by software, and to monitor the activation of the faults and their impact on the target system behavior in detail. Faults are injected with minimum interference with the target application. The target application is not modified, no software traps are inserted, and it is not necessary to execute the target application in special trace mode (the application is executed at full speed). Xception provides a comprehensive set of fault triggers, including spatial and temporal fault triggers, and triggers related to the manipulation of data in memory. Faults injected by Xception can affect any process running on the target system (including the kernel), and it is possible to inject faults in applications for which the source code is not available. Experimental, results are presented to demonstrate the accuracy and potential of Xception in the evaluation of the dependability properties of the complex computer systems available nowadays  相似文献   

8.
The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN-technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly-available video server, a highly-available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology  相似文献   

9.
This paper is focused on the validation by means of physical fault injection at pin-level of a time-triggered communication controller: the TTP/C versions C1 and C2. The controller is a commercial off-the-shelf product used in the design of by-wire systems. Drive-by-wire and fly-by-wire active safety controls aim to prevent accidents. They are considered to be of critical importance because a serious situation may directly affect user safety. Therefore, dependability assessment is vital in their design.This work was funded by the European project ‘Fault Injection for TTA’ and it is divided into two parts. In the first part, there is a verification of the dependability specifications of the TTP communication protocol, based on TTA, in the presence of faults directly induced in communication lines. The second part contains a validation and improvement proposal for the architecture in case of data errors. Such errors are due to faults that occurred during writing (or reading) actions on memory or during data storage.  相似文献   

10.
Demands on software reliability and availability have increased tremendously due to the nature of present day applications. We focus on the aspect of software for the high availability of application servers since the unavailability of servers more often originates from software faults rather than hardware faults. The software rejuvenation technique has been widely used to avoid the occurrence of unplanned failures, mainly due to the phenomena of software aging or caused by transient failures. In this paper, first we present a new way of using the virtual machine based software rejuvenation named VMSR to offer high availability for application server systems. Second we model a single physical server which is used to host multiple virtual machines (VMs) with the VMSR framework using stochastic modeling and evaluate it through both numerical analysis and SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator) tool simulation. This VMSR model is very general and can capture application server characteristics, failure behavior, and performability measures. Our results demonstrate that VMSR approach is a practical way to ensure uninterrupted availability and to optimize performance for aging applications. This research was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) under Grant No. KRF2007-210-D00006.  相似文献   

11.
Patching technologies are commonly applied to improve the dependability of software after release. This paper reports the design of an automated hot patching (AHP) framework that fully automates reasoning for the causes of failures and patching the binary code of Web-based applications. AHP admits the hardness for rooting out all faults before product release, and autonomously patches problems of application programs. By directly operating on binary code, AHP is universal to virtually all applications. A promising application of AHP is to shortcut a function of the remote maintenance center (RMC) and hence to reduce the turn around time for patches.  相似文献   

12.
In the hot-standby replication system, the system cannot process its tasks anymore when all replicated nodes have failed. Thus, the remaining living nodes should be well-protected against failure when parts of replicated nodes have failed. Design faults and system-specific weaknesses may cause chain reactions of common faults on identical replicated nodes in replication systems. These can be alleviated by replicating diverse hardware and software. Going one-step forward, failures on the remaining nodes can be suppressed by predicting and preventing the same fault when it has occurred on a replicated node. In this paper, we propose a fault avoidance scheme which increases system dependability by avoiding common faults on remaining nodes when parts of nodes fail, and analyze the system dependability.  相似文献   

13.
宇宙射线辐射所导致的瞬态故障一直是航天计算面临的最主要挑战之一.而随着集成电路制造工艺的持续进步,现代处理器的性能在大幅度提高的同时,其可信性也正日益面临着瞬态故障的严重威胁.当前针对瞬态故障的容错技术可大致分为两类:基于硬件实现和基于软件实现.相比较前者,后者由于在实现成本和灵活性等方面的优势而备受关注.本文首先概述...  相似文献   

14.
This paper proposes a novel methodology and an architectural framework for handling multiple classes of faults (namely, hardware-induced software errors in the application, process and/or host crashes or hangs, and errors in the persistent system stable storage) in a COTS and legacy-based application. The basic idea is to use an evidence-accruing fault tolerance manager to choose and carry out one of multiple fault recovery strategies, depending upon the perceived severity of the fault. The methodology and the framework have been applied to a case study system consisting of a legacy system, which makes use of a COTS DBMS for persistent storage facilities. A thorough performability analysis has also been conducted via combined use of direct measurements and analytical modeling. Experimental results demonstrate that effective fault treatment, consisting of careful diagnosis and damage assessment, plays a key role in leveraging the dependability of COTS and legacy-based applications  相似文献   

15.
Complex real-time system design needs to address dependability requirements, such as safety, reliability, and security. We introduce a modelling and simulation based approach which allows for the analysis and prediction of dependability constraints. Dependability can be improved by making use of fault tolerance techniques. The de-facto example, in the real-time system literature, of a pump control system in a mining environment is used to demonstrate our model-based approach. In particular, the system is modelled using the Discrete EVent system Specification (DEVS) formalism, and then extended to incorporate fault tolerance mechanisms. The modularity of the DEVS formalism facilitates this extension. The simulation demonstrates that the employed fault tolerance techniques are effective. That is, the system performs satisfactorily despite the presence of faults. This approach also makes it possible to make an informed choice between different fault tolerance techniques. Performance metrics are used to measure the reliability and safety of the system, and to evaluate the dependability achieved by the design. In our model-based development process, modelling, simulation and eventual deployment of the system are seamlessly integrated.  相似文献   

16.
We develop a methodology to measure the quality levels of a number of releases of a software product in its evolution process. The proposed quality measurement plan is based on the faults detected in field operation of the software. We describe how fault discovery data can be analyzed and reported in a framework very similar to that of the QMP (quality measurement plan) proposed by B. Hoadley (1986). The proposed procedure is especially useful in situations where one has only very little data from the latest release. We present details of implementation of solutions to a class of models on the distribution of fault detection times. The conditions under which the families: exponential, Weibull, or Pareto distributions might be appropriate for fault detection times are discussed. In a variety of typical data sets that we investigated one of these families was found to provide a good fit for the data. The proposed methodology is illustrated with an example involving three releases of a software product, where the fault detection times are exponentially distributed. Another example for a situation where the exponential fit is not good enough is also considered  相似文献   

17.
The goals of cross-product reuse in a software product line (SPL) are to mitigate production costs and improve the quality. In addition to reuse across products, due to the evolutionary development process, a SPL also exhibits reuse across releases. In this paper, we empirically explore how the two types of reuse—reuse across products and reuse across releases—affect the quality of a SPL and our ability to accurately predict fault proneness. We measure the quality in terms of post-release faults and consider different levels of reuse across products (i.e., common, high-reuse variation, low-reuse variation, and single-use packages), over multiple releases. Assessment results showed that quality improved for common, low-reuse variation, and single-use packages as they evolved across releases. Surprisingly, within each release, among preexisting (‘old’) packages, the cross-product reuse did not affect the change and fault proneness. Cross-product predictions based on pre-release data accurately ranked the packages according to their post-release faults and predicted the 20 % most faulty packages. The predictions benefited from data available for other products in the product line, with models producing better results (1) when making predictions on smaller products (consisting mostly of common packages) rather than on larger products and (2) when trained on larger products rather than on smaller products.  相似文献   

18.
A fault-tolerant architectural approach for dependable systems   总被引:2,自引:0,他引:2  
A system's structure enables it to generate its intended behavior from its components' behavior. A well-structured system simplifies relationships among components, which can increase dependability. With software systems, the architecture is an abstraction of the structure. Architectural reasoning about dependability has become increasingly important because emerging applications are increasingly complex. We've developed an architectural approach for effectively representing and analyzing fault-tolerant software systems. The proposed solution relies on exception handling to tolerate faults associated with component and connector failures, architectural mismatches, and configuration faults. Our approach, a specialization of the peer-to-peer architectural style, hides inside the architectural elements the complexities of exception handling and propagation. Our goal is to improve a system's overall reliability and availability by making it tolerant of nonmalicious faults.  相似文献   

19.
This paper deals with evaluation of the dependability (considered as a generic term, whose main measures are reliability, availability, and maintainability) of software systems during their operational life, in contrast to most of the work performed up to now, devoted mainly to development and validation phases. The failure process due to design faults, and the behavior of a software system up to the first failure and during its life cycle are successively examined. An approximate model is derived which enables one to account for the failures due to the design faults in a simple way when evaluating a system's dependability. This model is then used for evaluating the dependability of 1) a software system tolerating design faults, and 2) a computing system with respect to physical and design faults.  相似文献   

20.
On the value of static analysis for fault detection in software   总被引:1,自引:0,他引:1  
No single software fault-detection technique is capable of addressing all fault-detection concerns. Similarly to software reviews and testing, static analysis tools (or automated static analysis) can be used to remove defects prior to release of a software product. To determine to what extent automated static analysis can help in the economic production of a high-quality product, we have analyzed static analysis faults and test and customer-reported failures for three large-scale industrial software systems developed at Nortel Networks. The data indicate that automated static analysis is an affordable means of software fault detection. Using the orthogonal defect classification scheme, we found that automated static analysis is effective at identifying assignment and checking faults, allowing the later software production phases to focus on more complex, functional, and algorithmic faults. A majority of the defects found by automated static analysis appear to be produced by a few key types of programmer errors and some of these types have the potential to cause security vulnerabilities. Statistical analysis results indicate the number of automated static analysis faults can be effective for identifying problem modules. Our results indicate static analysis tools are complementary to other fault-detection techniques for the economic production of a high-quality software product.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号