
1.

Distributed fault tolerance: lessons from Delta-4

Powell D. 《Micro, IEEE》1994,14(1):36-47

Because they avoid extensive redesign of specialized hardware, software-implemented approaches to fault tolerance are very resilient to change. Europe's Delta-4 project argues persuasively for implementing fault tolerance in a distributed fashion. The Delta-4 approach achieves fault tolerance by replicating capsules (runtime representations of application objects) on distributed, LAN-interconnected nodes. It can configure capsule groups to tolerate either stopping or arbitrary failures. Its multipoint protocols serve to coordinate capsule groups and to support error processing and fault treatment.
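
The difference between tolerating stopping failures and arbitrary failures can be illustrated with a small voting sketch. This is not Delta-4's protocol; the function name `vote` and the three-replica group are hypothetical. It assumes every correct replica receives the same inputs, so that a strict majority of a group of 2f+1 replicas masks up to f wrong or missing replies.

```python
from collections import Counter

def vote(replies):
    """Return the value reported by a strict majority of the replica group.

    `replies` holds one entry per replica; None marks a replica that
    stopped and sent nothing. Raises ValueError when no value is
    reported by more than half of the group.
    """
    group_size = len(replies)
    counts = Counter(r for r in replies if r is not None)
    if counts:
        value, n = counts.most_common(1)[0]
        if n > group_size // 2:
            return value
    raise ValueError("no majority among replica replies")

# Three replicas mask one arbitrary (wrong-value) failure.
print(vote([42, 42, 17]))    # 42
# They also mask one stopped replica.
print(vote([42, None, 42]))  # 42
```

If only stopping failures need to be tolerated, voting is unnecessary, because any reply that arrives is correct; the vote is what pays for tolerating arbitrary failures.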

2.

Calibrating embedded protocols on asynchronous systems

Yukiko Yamauchi Doina Bein Toshimitsu Masuzawa Linda Morales I. Hal Sudborough 《Information Sciences》2010,180(10):1793-1801

Embedding is a method of projecting one topology into another. In one-to-one node embedding, paths in the target topology correspond to links in the original topology. A protocol running on the original topology can be modified to be executed on a target topology by means of embedding. However, if the protocol tolerates a number of faults (faults that affect the data but not the code of a distributed protocol executed by the nodes in a distributed system), then the adapted protocol will not preserve this fault tolerance property, because links in the original topology can be embedded into paths of length greater than one: faults at the intermediate nodes on such paths are not accounted for in the protocol. We propose a communication protocol in the target topology that preserves the fault tolerance characteristics of any protocol designed for the original topology; that is, a protocol adapted through our mechanism remains fault tolerant.

3.

Filtering Data Streams for Entity-Based Continuous Queries

Cheng Reynold Kao Ben Kwan Alan Prabhakar Sunil Tu Yicheng 《Knowledge and Data Engineering, IEEE Transactions on》2010,22(2):234-248

The idea of allowing query users to relax their correctness requirements in order to improve performance of a data stream management system (e.g., location-based services and sensor networks) has been recently studied. By exploiting the maximum error (or tolerance) allowed in query answers, algorithms for reducing the use of system resources have been developed. In most of these works, however, query tolerance is expressed as a numerical value, which may be difficult to specify. We observe that in many situations, users may not be concerned with the actual value of an answer, but rather which object satisfies a query (e.g., "who is my nearest neighbor?"). In particular, an entity-based query returns only the names of objects that satisfy the query. For these queries, it is possible to specify a tolerance that is "nonvalue-based." In this paper, we study fraction-based tolerance, a type of nonvalue-based tolerance, where a user specifies the maximum fractions of a query answer that can be false positives and false negatives. We develop fraction-based tolerance for two major classes of entity-based queries: 1) nonrank-based queries (e.g., range queries) and 2) rank-based queries (e.g., k-nearest-neighbor queries). These definitions provide users with an alternative way to specify the maximum tolerance allowed in their answers. We further investigate how these definitions can be exploited in a distributed stream environment. We design adaptive filter algorithms that allow updates to be dropped conditionally at the data stream sources without affecting the overall query correctness. Extensive experimental results show that our protocols reduce the use of network and energy resources significantly.
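
A fraction-based tolerance check might look like the sketch below. The exact definitions used in the paper are not given in the abstract, so the denominators here are assumptions: false positives are measured against the size of the returned answer and false negatives against the size of the correct answer. The function names are hypothetical.

```python
def error_fractions(returned, correct):
    """Return (false-positive fraction, false-negative fraction).

    Assumed definitions: false positives are divided by the size of the
    returned answer, false negatives by the size of the correct answer.
    An empty denominator yields 0.0.
    """
    returned, correct = set(returned), set(correct)
    false_pos = len(returned - correct)
    false_neg = len(correct - returned)
    fp_frac = false_pos / len(returned) if returned else 0.0
    fn_frac = false_neg / len(correct) if correct else 0.0
    return fp_frac, fn_frac

def within_tolerance(returned, correct, max_fp, max_fn):
    """True when both error fractions stay within the user's limits."""
    fp_frac, fn_frac = error_fractions(returned, correct)
    return fp_frac <= max_fp and fn_frac <= max_fn

# A range query returns four object names; one is wrong and one is missing.
answer = {"a", "b", "c", "d"}
truth = {"a", "b", "c", "e"}
print(error_fractions(answer, truth))  # (0.25, 0.25)
```

The user states the limits as fractions of the answer set rather than as a numeric error bound on values, which is the point of a nonvalue-based tolerance.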

4.

Contingencies-based reconfiguration of distributed factory automation (total citations: 1; self-citations: 0; citations by others: 1)

Scott Olsen James Wang Alejandro Ramirez-Serrano Robert W. Brennan 《Robotics and Computer》2005,21(4-5):379-390

In this paper, we describe our experience using a Java-based platform to implement an emerging real-time distributed control model (IEC 61499). We provide a simple example of a control application that is distributed across two devices (Dallas Semiconductor TINI boards) and also investigate how this distributed implementation can be exploited to enhance the system's fault tolerance using a contingencies-based approach to reconfiguration.

5.

A fault tolerance optimization method for simulation systems based on virtualization technology

陈志佳 朱元昌 邸彦强 冯少冲 《计算机应用》2015,35(8):2392-2396

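Failures of distributed simulation systems caused by node crashes or insufficient simulation resources reduce system reliability. To ensure effective fault tolerance while lowering its overhead, a fault tolerance method for simulation systems based on virtualization technology is proposed; it dynamically applies different types of fault tolerance strategy to different types of fault, according to where in the system the fault occurs. An optimization of the checkpoint strategy is analyzed and the optimal checkpoint interval is given. Exploiting the advantages of virtualization, the method resolves the node selection, replica count, and replica placement problems of the replication strategy. In addition, a fault tolerance strategy based on virtual machine migration is introduced as a complement to the checkpoint and replication strategies in order to reduce overhead. A comparison of simulation experiment data for the dynamic strategy and ordinary strategies shows that the dynamic strategy preserves fault tolerance performance while keeping overhead low.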
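
The abstract does not reproduce the paper's own derivation of the checkpoint interval. As a stand-in, the sketch below uses Young's well-known first-order approximation, in which the interval is the square root of twice the checkpoint cost times the mean time between failures. The function names and the simple overhead model are assumptions made for illustration.

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """Young's approximation: interval = sqrt(2 * checkpoint cost * MTBF).

    Both arguments use the same time unit and must be positive.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def overhead_fraction(interval, checkpoint_cost, mtbf):
    """First-order overhead model (an assumption of this sketch).

    checkpoint_cost / interval is the time spent writing checkpoints;
    interval / (2 * mtbf) is the expected rework after a failure, which
    on average loses half an interval.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)

# Checkpoints cost 2 time units and failures arrive every 100 units on average.
best = young_interval(2.0, 100.0)
print(best)  # 20.0
for interval in (10.0, best, 40.0):
    print(interval, overhead_fraction(interval, 2.0, 100.0))
```

Checkpointing more often than the computed interval wastes time writing state, and checkpointing less often wastes time redoing lost work; the approximation balances the two terms.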

6.

Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations (total citations: 1; self-citations: 0; citations by others: 1)

Ouyang Jinsong Maheshwari Piyush 《The Journal of supercomputing》1999,14(3):207-232

In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and programming effort required to construct reliable application software. In our model fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface including message-passing and file I/O primitives, which hides the complexity of both fault-tolerant network communications and checkpointing and recovering user files from the application level. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.

7.

Application-Level Fault Tolerance as a Complement to System-Level Fault Tolerance (total citations: 1; self-citations: 1; citations by others: 0)

Haines Joshua Lakamraju Vijay Koren Israel Krishna C. Mani 《The Journal of supercomputing》2000,16(1-2):53-68

As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique which is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level fault tolerance and application-level fault tolerance works best. In many systems, application-level fault tolerance can be used to bridge the gap when system-level fault tolerance alone does not provide the required reliability. We exemplify this with the RTHT target tracking benchmark and the ABF beamforming benchmark.

8.

Recovering distributed objects

《Information Processing Letters》2001,77(2-4):143-150

Distributed multithreaded applications operating in shared-nothing environments present challenges to classical fault tolerance mechanisms. The piecewise determinism assumption is lost (due to multithreading), and data must be replicated (because of the shared-nothing environment). In this paper, we explore a systematic approach to providing fault tolerance, by considering data-race-free programs that have the benefits of piecewise determinism and yet allow multithreading. We base our logging and recovery algorithm on a logical ring structure that allows the underlying distributed system to migrate threads, migrate and replicate objects, and perform multi-object transactions.

9.

A Framework for Adaptive Fault-Tolerant Execution of Workflows in the Grid: Empirical and Theoretical Analysis

Felipe Pontes Guimaraes Pedro Célestin Daniel Macedo Batista Genaína Nunes Rodrigues Alba Cristina Magalhaes Alves de Melo 《Journal of Grid Computing》2014,12(1):127-151

In this paper, we propose and evaluate a framework for fault-tolerant workflow execution in Grid environments. Unlike previous work in the literature, our system dynamically chooses an appropriate fault tolerance technique using a user-defined rule-based system. We also provide a generic interface that can be used to add fault tolerance techniques to the framework. The results obtained with real workflows in an experimental Grid environment show that the overhead introduced by our framework in a failure-free execution is, in the worst evaluated case, approximately 10%. Moreover, we show that, using our framework, workflows are able to execute successfully in the presence of failures and that the framework can dynamically choose an appropriate fault tolerance technique. The main contributions of our work are twofold: the developed framework and the model-based dependability analysis we performed on it. The purpose of the model-based dependability analysis is to evaluate the interaction between our framework and the distributed Grid environment beyond the physical limitations of an empirical evaluation. By doing this, we provide means to plan the assurance of QoS in Grid resource allocation, while applying the fault tolerance mechanisms we implement in our framework regardless of the underlying middleware.

10.

Distributed fault tolerance in optimal interpolative nets

Simon D. 《Neural Networks, IEEE Transactions on》2001,12(6):1348-1357

The recursive training algorithm for the optimal interpolative (OI) classification network is extended to include distributed fault tolerance. The conventional OI Net learning algorithm leads to network weights that are nonoptimally distributed (in the sense of fault tolerance). Fault tolerance is becoming an increasingly important factor in hardware implementations of neural networks. But fault tolerance is often taken for granted in neural networks rather than being explicitly accounted for in the architecture or learning algorithm. In addition, when fault tolerance is considered, it is often accounted for using an unrealistic fault model (e.g., neurons that are stuck on or off rather than small weight perturbations). Realistic fault tolerance can be achieved through a smooth distribution of weights, resulting in low weight salience and distributed computation. Results of trained OI Nets on the Iris classification problem show that fault tolerance can be increased with the algorithm presented in this paper.

11.

Improving reliability of cooperative concurrent systems with exception flow analysis (total citations: 1; self-citations: 0; citations by others: 1)

Fernando Castor Filho Alexander Romanovsky 《Journal of Systems and Software》2009,82(5):874-890

Developers of fault-tolerant distributed systems need to guarantee that fault tolerance mechanisms they build are in themselves reliable. Otherwise, these mechanisms might in the end negatively affect overall system dependability, thus defeating the purpose of introducing fault tolerance into the system. To achieve the desired levels of reliability, mechanisms for detecting and handling errors should be developed rigorously or formally. We present an approach to modeling and verifying fault-tolerant distributed systems that use exception handling as the main fault tolerance mechanism. In the proposed approach, a formal model is employed to specify the structure of a system in terms of cooperating participants that handle exceptions in a coordinated manner, and coordinated atomic actions serve as representatives of mechanisms for exception handling in concurrent systems. We validate the approach through two case studies: (i) a system responsible for managing a production cell, and (ii) a medical control system. In both systems, the proposed approach has helped us to uncover design faults in the form of implicit assumptions and omissions in the original specifications.

12.

A survey of fault tolerance techniques for distributed graph processing jobs

张程博 李影 贾统 《软件学报》2021,32(7):2078-2102

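As graph data grows ever larger and graph processing jobs grow ever more complex, distributing graph computation has become inevitable. During execution, however, graph processing jobs face serious reliability problems caused by nondeterminism from sources both inside and outside the distributed graph processing system. This paper first analyzes the sources of uncertainty in distributed graph processing frameworks and the robustness of different types of graph processing job, and proposes an evaluation framework for fault tolerance techniques for distributed graph processing jobs along three dimensions: cost, efficiency, and quality. It then analyzes, evaluates, and compares in depth, with reference to related work in China and abroad, four fault tolerance mechanisms for distributed graph processing: checkpoint-based, log-based, replication-based, and algorithmic-compensation-based fault tolerance. Finally, it outlines future research directions.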

13.

Fault tolerant aggregation in heterogeneous sensor networks

Laukik Chitnis Alin Dobra Sanjay Ranka 《Journal of Parallel and Distributed Computing》2009

Fault tolerance and scalability are important considerations in the design of sensor network applications. Data aggregation is an essential operation in sensor networks. Multiple techniques have been proposed recently to tackle the issues of scalability and fault tolerance of aggregation in sensor networks. In this article, we analyze the impact of using a few of the more reliable, though expensive, nodes called microservers (such as the Intel XScale), in addition to the standard motes, on the fault tolerance and scalability of the aggregation algorithms in sensor networks. In particular, we propose a simple model that captures the essence of tree aggregation in such heterogeneous sensor networks. We validate this theoretical model with simulation results. We also study the effective impact on the sustainable probability of failure, and perform a cost-benefit analysis. We also show how hybrid aggregation can be used instead of tree aggregation to improve the performance of aggregation in heterogeneous sensor networks. We show that our work can be applied to optimize the use of expensive hardware when designing fault-tolerant, distributed sensor networks.

14.

Algorithm-based fault tolerance applied to high performance computing

George Bosilca Rémi Delmas Jack Dongarra Julien Langou 《Journal of Parallel and Distributed Computing》2009

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance technique [K. Huang, J. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers (Spec. Issue Reliable & Fault-Tolerant Comp.) 33 (1984) 518–528] to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix–matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix–matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even when one process failure occurs. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.
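
The checksum idea behind the cited Huang and Abraham technique can be sketched on a single machine as follows. This is a minimal serial illustration, not the parallel subroutine described in the abstract; all function names are hypothetical. It uses integer matrices so that checksum comparisons are exact; floating-point data would need a tolerance in the comparisons.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append to each row of b one extra entry holding that row's sum."""
    return [row[:] + [sum(row)] for row in b]

def locate_and_correct(c):
    """Check a full-checksum product and repair one corrupted data entry.

    Returns None when every checksum matches, or (i, j) after repairing
    the single data entry at row i, column j in place. Any other error
    pattern (for example a corrupted checksum entry) raises ValueError.
    """
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        others = sum(c[i][:m]) - c[i][j]
        c[i][j] = c[i][m] - others  # rebuild the entry from the row checksum
        return (i, j)
    raise ValueError("error pattern not correctable by this sketch")

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
full = matmul(with_column_checksum(a), with_row_checksum(b))
full[1][0] = 999                 # simulate a corrupted result entry
print(locate_and_correct(full))  # (1, 0)
print(full[1][0])                # 43
```

The product of a column-checksum matrix and a row-checksum matrix carries both checksums with it, so the intersection of the one failing row and the one failing column pinpoints the damaged entry.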

15.

Parallel computing in networks of workstations with Paralex

Davoli R. Giachini L.-A. Babaoglu O. Amoroso A. Alvisi L. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(4):371-384

Modern distributed systems consisting of powerful workstations and high-speed interconnection networks are an economical alternative to special-purpose supercomputers. The technical issues that need to be addressed in exploiting the parallelism inherent in a distributed system include heterogeneity, high-latency communication, fault tolerance and dynamic load balancing. Current software systems for parallel programming provide little or no automatic support for these issues and require users to be experts in fault-tolerant distributed computing. The Paralex system is aimed at exploring the extent to which the parallel application programmer can be liberated from the complexities of distributed systems. Paralex is a complete programming environment and makes extensive use of graphics to define, edit, execute, and debug parallel scientific applications. All of the necessary code for distributing the computation across a network and replicating it to achieve fault tolerance and dynamic load balancing is automatically generated by the system. In this paper we give an overview of Paralex and present our experiences with a prototype implementation.

16.

Almost certain fault diagnosis through algorithm-based fault tolerance

Blough D.M. Pelc A. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(5):532-539

Algorithm-based fault tolerance has been proposed as a technique to detect incorrect computations in multiprocessor systems. In algorithm-based fault tolerance, processors produce data elements that are checked by concurrent error detection mechanisms. We investigate the efficacy of this approach for diagnosis of processor faults. Because checks are performed on data elements, the problem of locating data errors must first be solved. We propose a probabilistic model for the faults and errors in a multiprocessor system and use it to evaluate the probabilities of correct error location and fault diagnosis. We investigate the number of checks that are necessary to guarantee error location with high probability. We also give specific check assignments that accomplish this goal. We then consider the problem of fault diagnosis when the locations of erroneous data elements are known. Previous work on fault diagnosis required that the data sets produced by different processors be disjoint. We show, for the first time, that fault diagnosis is possible with high probability, even in systems where processors combine to produce individual data elements.

17.

A problem with MIDAS-based fault tolerance and its solution (total citations: 2; self-citations: 0; citations by others: 2)

王志刚 赵跃龙 《计算机工程与应用》2003,39(2):144-145,166

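MIDAS is a transparent middleware engine that Delphi provides for developing multi-tier distributed applications. A key issue in multi-tier distributed systems is fault tolerance; unless it is handled well, such systems cannot operate reliably. This paper discusses a problem encountered when using MIDAS to build a local fault-tolerant system, and its solution.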

18.

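Hybrid real-time fault-tolerant scheduling algorithms for heterogeneous distributed systems (total citations: 1; self-citations: 1; citations by others: 0)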

邓建波 张立臣 邓惠敏 《计算机科学》2011,38(3):87-92

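The primary/backup copy technique is an important means of achieving fault tolerance in real-time distributed systems. A hybrid fault tolerance model for heterogeneous distributed systems is proposed; unlike traditional heterogeneous distributed real-time scheduling models, it considers periodic and aperiodic tasks together. On this basis, three fault-tolerant scheduling algorithms are given: SSA, which targets schedulability; RSA, which targets reliability; and BSA, which targets load balance. The algorithms can schedule real-time tasks with periodic and aperiodic fault tolerance requirements together in a heterogeneous system, and guarantee that real-time tasks still complete before their deadlines when a node in the system fails. Finally, the algorithms are analyzed in terms of schedulability, reliability cost, load balance, the numbers of periodic and aperiodic tasks, and task period and granularity. Simulation results show that each algorithm has strengths and weaknesses, so a scheduling algorithm should be chosen according to the characteristics of the heterogeneous system.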

19.

Node-Capability-Aware Replica Management for Peer-to-Peer Grids (total citations: 1; self-citations: 0; citations by others: 1)

Agneeswaran V.S. Janakiram D. 《IEEE transactions on systems, man, and cybernetics. Part A, Systems and humans : a publication of the IEEE Systems, Man, and Cybernetics Society》2009,39(4):807-818

Data objects have to be replicated in large-scale distributed systems for reasons of fault tolerance, availability, and performance. Furthermore, computations may have to be scheduled on these objects when they are part of a grid computation. Although replication mechanisms for unstructured peer-to-peer (P2P) systems can place replicas on capable nodes, they may not be able to provide deterministic guarantees on searching. Replication mechanisms in structured P2P systems provide deterministic guarantees on searching but do not address node capability in replica placement. We propose Virat, a node-capability-aware P2P middleware for managing replicas in large-scale distributed systems. Virat uses a unique two-layered architecture that builds a structured overlay over an unstructured P2P layer, combining the advantages of both structured and unstructured P2P systems. A detailed performance comparison is made with a replication mechanism realized over OpenDHT, a state-of-the-art structured P2P system. We show that the 99th percentile response time for Virat does not exceed 600 ms, whereas for OpenDHT, it goes beyond 2000 ms in our test bed, created specifically for the aforementioned comparison.

20.

Simulation design for rollback-recovery fault tolerance in distributed computing systems

董奇 黄斌 颜耀 李韦韦 曾玮妮 张恒 《计算机与现代化》2017,(4):48

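To address the problem of verifying and evaluating rollback-recovery fault tolerance in distributed computing systems, a simulation mechanism for rollback-recovery fault tolerance algorithms is designed. Following the overall architecture of rollback-recovery fault tolerance in distributed computing systems, node task processes are modeled as discrete events; a module supporting multi-task rollback-recovery simulation is added at the application layer of a network system simulation tool; and the structure, functional modules, and system parameter settings for rollback-recovery simulation are designed. Results show that the proposed simulation mechanism can simulate and verify rollback-recovery fault tolerance algorithms for distributed computing systems and provides a reference for comparing, improving, and optimizing different fault tolerance algorithms.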