首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 63 毫秒
1.
Distributed Shared Virtual Memory (DSVM) systems provide a shared memory abstraction on distributed memory architectures. Such systems ease parallel application programming because the shared-memory programming model is often more natural than the message-passing paradigm. However, the probability of failure of a DSVM increases with the number of sites. Thus, fault tolerance mechanisms must be implemented in order to allow processes to continue their execution in the event of a failure. This paper gives an overview of recoverable DSVMs (RDSVMs) that provide a checkpointing mechanism to restart parallel computations in the event of a site failure  相似文献   

2.
A variety of research problems exist that require considerable time and computational resources to solve. Attempting to solve these problems produces long‐running applications that require a reliable and trustworthy system upon which they can be executed. Cluster systems provide an excellent environment upon which to run these applications because of their low cost to performance ratio; however, due to being created using commodity components they are prone to failures. This report surveyed and reviewed the issues currently relating to providing fault tolerance for long‐running applications. Several fault tolerance approaches were investigated; however, it was found that rollback‐recovery provides a favourable approach for user applications in cluster systems. Two facilities are required to provide fault tolerance using rollback‐recovery: checkpointing and recovery. It was shown here that a multitude of work has been done for enhancing checkpointing; however, the intricacies of providing recovery have been neglected. The problems associated with providing recovery include; providing transparent and autonomic recovery, selecting appropriate recovery computers, and maintaining a consistent observable behaviour when an application fails. Copyright © 2009 John Wiley & Sons, Ltd.  相似文献   

3.
基于虚拟文件操作的文件检查点设置   总被引:1,自引:0,他引:1  
刘少锋  汪东升  朱晶 《软件学报》2002,13(8):1528-1533
实现分布/并行系统容错的基础是单进程检查点设置和卷回恢复技术,而对活动文件信息进行保存和恢复则是这种技术的重要方面.提出一种虚拟文件操作策略,实现了对用户文件的检查点设置,有效地解决了发生故障时用户文件内容与进程全局状态的不一致的问题.该方法通过文件块式管理、检查点分布操作等技术,使得在空间开销、正常运行时间、恢复时间等性能指标上优于其他方法,并且具有对用户透明、可最大限度地保留已完成工作的特点.  相似文献   

4.
This paper presents Visper, a novel object-oriented framework that identifies and enhances common services and programming primitives, and implements a generic set of classes applicable to multiple programming models in a distributed environment. Groups of objects, which can be programmed in a uniform and transparent manner, and agent-based distributed system management, are also featured in Visper. A prototype system is designed and implemented in Java, with a number of visual utilities that facilitate program development and portability. As a use case, Visper integrates parallel programming in an MPI-like message-passing paradigm at a high level with services such as checkpointing and fault tolerance at a lower level. The paper reports a range of performance evaluation on the prototype and compares it to related works  相似文献   

5.
Unix进程检查点设置关键技术   总被引:4,自引:0,他引:4  
Unix进程的检查点设置是实现分布/并行系统容错、重播调试、进程迁移、系统模拟和作业切换等功能的基础。该论文主要论述UNIX进程检查点基本信息的保存与恢复、文件检查点、检查点信息的优化等关键技术,最后介绍Libckpt、Condor以及自行设计的Libcsm等检查点设置工具。  相似文献   

6.
Raid, a robust and adaptable distributed database system for transaction processing, is described. Raid is a message-passing system, with server processes on each site. The servers manage concurrent processing, consistent replicated copies during site failures and atomic distributed commitment. A high-level, layered communications package provides a clean, location-independent interface between servers. The latest design of the communications package delivers messages via shared memory in a high-performance configuration in which several servers are linked into a single process. Raid provides the infrastructure to experimentally investigate various methods for supporting reliable distributed transaction processing. Measurements on transaction processing time and server CPU time are presented. Data and conclusions of experiments in three categories are also presented: communications software, consistent replicated copy control during site failures, and concurrent distributed checkpointing. A software tool for the evaluation of transaction processing algorithms in an operating system kernel is proposed  相似文献   

7.
Perl-speaks-NONMEM (PsN)--a Perl module for NONMEM related programming   总被引:6,自引:0,他引:6  
The NONMEM program is the most widely used nonlinear regression software in population pharmacokinetic/pharmacodynamic (PK/PD) analyses. In this article we describe a programming library, Perl-speaks-NONMEM (PsN), intended for programmers that aim at using the computational capability of NONMEM in external applications. The library is object oriented and written in the programming language Perl. The classes of the library are built around NONMEM's data, model and output files. The specification of the NONMEM model is easily set or changed through the model and data file classes while the output from a model fit is accessed through the output file class. The classes have methods that help the programmer perform common repetitive tasks, e.g. summarising the output from a NONMEM run, setting the initial estimates of a model based on a previous run or truncating values over a certain threshold in the data file. PsN creates a basis for the development of high-level software using NONMEM as the regression tool.  相似文献   

8.

Due to the increase and complexity of computer systems, reducing the overhead of fault tolerance techniques has become important in recent years. One technique in fault tolerance is checkpointing, which saves a snapshot with the information that has been computed up to a specific moment, suspending the execution of the application, consuming I/O resources and network bandwidth. Characterizing the files that are generated when performing the checkpoint of a parallel application is useful to determine the resources consumed and their impact on the I/O system. It is also important to characterize the application that performs checkpoints, and one of these characteristics is whether the application does I/O. In this paper, we present a model of checkpoint behavior for parallel applications that performs I/O; this depends on the application and on other factors such as the number of processes, the mapping of processes and the type of I/O used. These characteristics will also influence scalability, the resources consumed and their impact on the IO system. Our model describes the behavior of the checkpoint size based on the characteristics of the system and the type (or model) of I/O used, such as the number I/O aggregator processes, the buffering size utilized by the two-phase I/O optimization technique and components of collective file I/O operations. The BT benchmark and FLASH I/O are analyzed under different configurations of aggregator processes and buffer size to explain our approach. The model can be useful when selecting what type of checkpoint configuration is more appropriate according to the applications’ characteristics and resources available. Thus, the user will be able to know how much storage space the checkpoint consumes and how much the application consumes, in order to establish policies that help improve the distribution of resources.

  相似文献   

9.
王一拙  陈旭  计卫星  苏岩  王小军  石峰 《软件学报》2016,27(7):1789-1804
任务并行程序设计模型已成为并行程序设计的主流,其通过发掘任务并行性来提高并行计算机的系统性能.提出一种支持容错的任务并行程序设计模型,将容错技术融入到任务并行程序设计模型中,在保证性能的同时提高系统可靠性.该模型以任务为调度、执行、错误检测与恢复的基本单位,在应用级实现容错支持.采用一种Buffer-Commit计算模型支持瞬时错误的检测与恢复;采用应用级无盘检查点实现节点故障类型永久错误的恢复;采用一种支持容错的工作窃取任务调度策略获得动态负载均衡.实验结果表明,该模型以较低的性能开销提供了对硬件错误的容错支持.  相似文献   

10.
刘勇燕  刘勇鹏  冯华  迟万庆 《计算机科学》2011,38(5):287-289,305
检查点机制是高性能并行计算系统中重要的容错手段,随着系统规模的增大,并行检查点的可扩展性受文件访问的制约。针对大规模并行计算系统的多级文件系统结构,提出了cache式并行检查点技术。它将全局同步并行检查点转化为局部文件操作,并利用多处理器结构进行乱序流水线式写回调度,将检查点的写回时机合理分布,从而有效地隐藏了检查点的写回开销,保证了并行检查点文件访问的高性能和高可扩展性。  相似文献   

11.
The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN-technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly-available video server, a highly-available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology  相似文献   

12.
一种面向图的分布式软件动态配置和容错方法   总被引:1,自引:0,他引:1  
宋毅  刘云超 《计算机应用》2003,23(12):37-41
提出一种新的方法,通过动态配置对基于组件的分布式软件的容错提供支持。此方法采用面向图的GOP编程模型,将整个分布式软件的体系结构用一张逻辑图来描述,系统的动态配置可以通过执行图上预定义的一组操作来完成。检测到故障或异常的时候实施这种动态配置能够支持系统的容错。文中描述了此方法的基本模型、系统结构和基于CORBA的原型实现。  相似文献   

13.
基于组件的分布式软件的动态配置和容错   总被引:1,自引:0,他引:1  
论文提出一种结构化新方法,它能通过动态配置支持基于组件的分布式软件的容错。采用面向图形的编程模型,基于组件的分布式软件的软件体系结构可用一个逻辑图来表示,该逻辑图可以精化为一个明确的对象并分布到网络中,软件的动态配置通过执行定义在图上的一系列操作来实现,发生错误时通过动态重配置软件来支持容错。论文描述了该方法的基本模型、系统结构及其在CORBA上的实现原型。  相似文献   

14.
Cloud computing offers new computing paradigms, capacity and flexible solutions to high performance computing (HPC) applications. For example, Hardware as a Service (HaaS) allows users to provide a large number of virtual machines (VMs) for computation-intensive applications using the HaaS model. Due to the large number of VMs and electronic components in HPC system in the cloud, any fault during the execution would result in re-running the applications, which will cost time, money and energy. In this paper we presented a proactive fault tolerance (FT) approach to HPC systems in the cloud to reduce the wall-clock execution time and dollar cost in the presence of faults. We also developed a generic FT algorithm for HPC systems in the cloud. Our algorithm does not rely on a spare node prior to prediction of a failure. We also developed a cost model for executing computation-intensive applications on HPC systems in the cloud. We analysed the dollar cost of provisioning spare nodes and checkpointing FT to assess the value of our approach. Our experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in cloud can be reduced by as much as 30%. The frequency of checkpointing of computation-intensive applications can be reduced up to 50% with our FT approach for HPC in the cloud compared with current FT approaches.  相似文献   

15.
16.
17.
一种中间件服务容错配置管理方法   总被引:1,自引:0,他引:1  
李军国  黄罡  邹键  梅宏 《计算机学报》2007,30(10):1696-1704
提出一种基于运行时刻软件体系结构的容错管理方法,支持开发者和管理员针对不同中间件服务失效定制合适的故障检测和修复机制.首先,运行时刻软件体系结构自动构造构件依赖视图和错误传播①视图,为理解和分析整个系统的可靠性提供全局视图;然后,操作运行时刻软件体系结构配置容错机制;最后利用AOP技术将容错机制插装到中间件中,使其具备指定的容错能力.上述过程在一个可视化工具的辅助下半自动实施,并在J2EE中间件上得到验证.  相似文献   

18.
19.
The need for flexible file sharing in distributed systems is increasing in applications such as calendar management, collaborative editing of documents, collaborative software developments etc. The file sharing policies required in these applications are often very different from the traditional read/write policies. Hence, a flexible way of specifying and implementing sharing policies on individual files in file systems is required. We propose a distributed object-based system model of constructing file systems. The object-based system model is based on a pattern called FlexiFrag. We show how a distributed object-based system and in particular distributed file system can be constructed using the pattern in a flexible way.  相似文献   

20.
The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. In order to provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support various checkpointing protocols and different checkpointer packages (e.g. BLCR, LinuxSSI, OpenVZ, etc.) in a transparent manner through a uniform checkpointer interface. In this paper, we present the integration of a backward error recovery protocol based on independent checkpointing into the XtreemGCP service. The solution we propose is not checkpointer bound and thus can be transparently used on top of any checkpointer package.To evaluate the prototype we run it within a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate the capability of the XtreemGCP service to integrate different checkpointing protocols and independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation also shows that our solution outperforms the existing coordinated checkpointing protocol in terms of scalability.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号