
1.

Mutable checkpoints: a new checkpointing approach for mobile computing systems

Guohong Cao Singhal M. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(2):157-172

Mobile computing raises many new issues, such as lack of stable storage, low bandwidth of wireless channels, high mobility, and limited battery life. These new issues make traditional checkpointing algorithms unsuitable. Coordinated checkpointing is an attractive approach for transparently adding fault tolerance to distributed applications since it avoids domino effects and minimizes the stable storage requirement. However, it suffers from high overhead associated with the checkpointing process in mobile computing systems. Two approaches have been used to reduce the overhead: the first is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. These two approaches were orthogonal until the Prakash-Singhal algorithm combined them. However, we found that this algorithm may result in an inconsistency in some situations, and we proved that there does not exist a nonblocking algorithm which forces only a minimum number of processes to take their checkpoints. In this paper, we introduce the concept of a “mutable checkpoint,” which is neither a tentative checkpoint nor a permanent checkpoint, to design efficient checkpointing algorithms for mobile computing systems. Mutable checkpoints can be saved anywhere, e.g., the main memory or local disk of mobile hosts (MHs). In this way, taking a mutable checkpoint avoids the overhead of transferring large amounts of data to the stable storage at mobile support stations (MSSs) over the wireless network. We present techniques to minimize the number of mutable checkpoints. Simulation results show that the overhead of taking mutable checkpoints is negligible. Based on mutable checkpoints, our nonblocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on the stable storage.

2.

Surviving sensor node failures by MMU-less incremental checkpointing

《Journal of Systems and Software》2014

For some safety-critical applications, sensor nodes hold valuable information and should be able to operate unattended and without failure for several months or years. One promising solution is checkpointing, which periodically saves the state of a sensor node, thereby maintaining node reliability and network availability. This study first shows the design and implementation of full checkpointing for wireless sensor networks (WSNs). However, checkpointing is expensive. Incremental checkpointing was previously proposed to reduce the checkpoint overhead by relying on page protection hardware to identify dirty pages. Because sensor nodes are resource-constrained and are not equipped with page protection hardware, previous incremental checkpointing schemes cannot be directly applied. To address this issue, this paper proposes three incremental checkpointing methods for WSNs. The three methods differ in the granularity of the checkpointed memory data unit and in module execution overhead. In addition, we designed an incremental checkpoint file format that supports all three proposed methods and adapts them to sensor network characteristics. We implemented the full and the three incremental checkpointing methods on SOS on mica2 sensor motes. A performance evaluation of the three incremental methods is presented. We also discuss and evaluate a method for selecting the appropriate incremental checkpointing method. To the best of our knowledge, this study is the first to design and implement incremental checkpointing in MMU-less WSNs.
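The abstract does not describe the three proposed methods, so the following is only a minimal sketch of one generic way to find dirty data without page protection hardware: compare per-block digests against those recorded at the previous checkpoint. The block size and all function names are assumptions made for illustration; they are not taken from the paper or from SOS.

```python
import hashlib

BLOCK_SIZE = 16  # bytes per tracked block; a hypothetical granularity


def block_digests(memory: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Return one digest per fixed-size block of the memory image."""
    return [
        hashlib.sha1(memory[i:i + block_size]).digest()
        for i in range(0, len(memory), block_size)
    ]


def incremental_checkpoint(memory: bytes, previous: list,
                           block_size: int = BLOCK_SIZE):
    """Return (dirty, digests), where dirty maps block index -> new contents.

    A block counts as dirty when its digest differs from the one recorded
    at the previous checkpoint (software comparison, no MMU involved).
    """
    current = block_digests(memory, block_size)
    dirty = {
        idx: memory[idx * block_size:(idx + 1) * block_size]
        for idx, digest in enumerate(current)
        if idx >= len(previous) or digest != previous[idx]
    }
    return dirty, current


def restore(base: bytes, dirty: dict, block_size: int = BLOCK_SIZE) -> bytes:
    """Apply an incremental checkpoint on top of a full (base) checkpoint."""
    image = bytearray(base)
    for idx, data in dirty.items():
        image[idx * block_size:idx * block_size + len(data)] = data
    return bytes(image)
```

A smaller block size shrinks each incremental checkpoint but increases the number of digests to store and compare, which mirrors the granularity-versus-overhead trade-off the abstract mentions.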

3.

A quasi-synchronous checkpointing algorithm that prevents contention for stable storage

D. Manivannan Q. Jiang Jianchang Yang M. Singhal 《Information Sciences》2008,178(15):3110-3117

Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes a checkpoint almost simultaneously. This causes contention for network stable storage and hence degrades performance, as processes may have to wait a long time for the checkpointing operation to complete. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm which reduces contention for network stable storage without any synchronization overhead.

4.

Design of a Scalable Multicast Scheme With an Application-Network Cross-Layer Approach

Xiaohua Tian Yu Cheng Bin Liu 《Multimedia, IEEE Transactions on》2009,11(6):1160-1169

This paper develops an efficient and scalable multicast scheme for high-quality multimedia distribution. The traditional IP multicast, a pure network-layer solution, is bandwidth efficient in data delivery but not scalable in managing the multicast tree. The more recent overlay multicast establishes the data-dissemination structure at the application layer; however, it induces redundant traffic at the network layer. We propose an application-oriented multicast (AOM) protocol, which exploits an application-network cross-layer design. With AOM, each packet carries explicit destination information, instead of an implicit group address, to facilitate multicast data delivery; each router leverages the unicast IP routing table to determine the necessary multicast copies and next-hop interfaces. In our design, all the multicast membership and addressing information traversing the network is encoded with Bloom filters for low storage and bandwidth overhead. We theoretically prove that the AOM service model is loop-free and incurs no redundant traffic. The false positive performance of the Bloom filter implementation is also analyzed. Moreover, we show that the AOM protocol is a generic design, applicable to both intra-domain and inter-domain scenarios with either symmetric or asymmetric routing.
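As a rough illustration of the idea of encoding destinations in a Bloom filter and consulting a unicast routing table, here is a minimal sketch. The filter size, the hash construction, the `next_hops` helper, and the example addresses are all assumptions for illustration and do not reproduce the AOM packet format or router logic.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter; membership tests may yield false positives."""

    def __init__(self, num_bits: int = 256, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def next_hops(destinations: BloomFilter, routing_table: dict) -> set:
    """Return the interfaces that need a copy of the packet.

    routing_table maps a destination address to its unicast next-hop
    interface (a hypothetical, simplified stand-in for an IP routing table).
    """
    return {iface for dest, iface in routing_table.items() if dest in destinations}
```

Because a Bloom filter never produces false negatives, every real destination is always matched; a false positive can only cause an extra copy to be forwarded, which is why the false positive rate matters for this kind of design.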

5.

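Application-level checkpointing techniques for large-scale MPI programs

Wang Panfeng, Du Yunfei, Zhou Haifang, Yang Xuejun 《Journal of Computer Research and Development》2009,46(Z2)

Application-level checkpointing is a fault-tolerance technique that has attracted much attention in large-scale scientific computing. However, it requires the user to decide which critical data must be saved, which increases the user's burden. This paper introduces ALEC, a source-to-source precompiler based on live-variable analysis of MPI parallel programs, which can be used to assist application-level checkpointing. Five Fortran/MPI applications compiled with ALEC were evaluated on a 512-processor cluster system. The results show that ALEC effectively reduces the checkpoint size and the overhead of saving and restoring application-level checkpoints.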

6.

Adapting grid applications to safety using fault-tolerant methods: Design, implementation and evaluations (Cited by: 1; self-citations: 0; citations by others: 1)

Xuanhua Jean-Louis Eric Hai Hongbo 《Future Generation Computer Systems》2010,26(2):236-244

In recent years, Grid applications have been prone to problems such as failures or malicious attacks during execution, due to their distributed and large-scale nature. The application itself, however, has limited power to address these problems. This paper presents the design, implementation, and evaluation of an adaptive framework, Dynasa, which strives to handle security problems using adaptive fault tolerance (i.e., checkpointing and replication) during the execution of applications according to the status of the Grid environment. We evaluate our adaptive framework experimentally using the Grid5000 testbed, and the experimental results demonstrate that Dynasa enables the application itself to handle the security problems efficiently. Starting the adaptive component takes less than 1 s, and an adaptive action takes less than 0.1 s with a checkpoint interval of 20 s. Compared with a non-adaptive method, experimental results demonstrate that Dynasa achieves better performance in terms of execution time, network bandwidth consumed, and CPU load, resulting in up to 50% lower overhead.

7.

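Research on iSCSI-based network storage architecture (Cited by: 1; self-citations: 0; citations by others: 1)

Yi Fei, Li Renfa, Chen Zuo, Zhang Guangjian 《Computer Engineering and Applications》2004,40(27):132-134,214

The explosive growth of data and the rapid development of network technology have given rise to network storage technology. As an end-to-end protocol, iSCSI defines the mapping of SCSI onto TCP, allowing computers to perform block-level storage over the network and seamlessly integrating storage with existing TCP/IP networks. An iSCSI-based SAN can be regarded as a “back-end” network responsible for storage traffic, while the “front-end” network carries normal TCP/IP traffic; it is a new type of distributed storage solution.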

8.

Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids (Cited by: 1; self-citations: 0; citations by others: 1)

Chtepen M. Claeys F.H.A. Dhoedt B. De Turck F. Demeester P. Vanrolleghem P.A. 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(2):180-190

A grid is a distributed computational and storage environment often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly utilized techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt these parameters based on information about grid status to provide high job throughput in the presence of failures while reducing the system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment, Dynamic Scheduling in Distributed Environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run employing workload and system parameters derived from logs that were collected from several large-scale parallel production systems. Experiments have shown that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.

9.

SCALAR: Scalable data lookup and replication protocol for mobile ad hoc networks

Emre Atsan Öznur Özkasap 《Computer Networks》2013,57(17):3654-3672

Data replication, as an essential service for MANETs, is used to increase data availability by creating local or nearby copies of frequently used items, to reduce communication overhead, and to achieve fault tolerance and load balancing. Data replication protocols proposed for MANETs are often prone to scalability problems due to their definitions or the underlying routing protocols they are based on. In particular, they exhibit poor performance when the network size is scaled up. However, scalability is an important criterion for several MANET applications. We propose a scalable and reactive data replication approach, named SCALAR, combined with a low-cost data lookup protocol. SCALAR is a virtual-backbone-based solution, in which the network nodes construct a connected dominating set based on the network topology graph. To the best of our knowledge, SCALAR is the first work applying a virtual backbone structure to operate a data lookup and replication process in MANETs. A theoretical message-complexity analysis of the proposed protocols is given. Extensive simulations are performed to analyze and compare the behavior of SCALAR, and it is shown to outperform the other solutions in terms of data accessibility, message overhead, and query deepness. It is also demonstrated to be an efficient solution for high-density, high-load, large-scale mobile ad hoc networks.

10.

Towards efficient and practical network coding in delay tolerant networks

Baokang Zhao Wei Peng Ziming Song Jinshu Su Chunqing Wu Wanrong Yu Qiaolin Hu 《Computers & Mathematics with Applications》2012,63(2):588-600

Network coding techniques offer an emerging solution to efficient data transmission in Delay Tolerant Networks (DTN). To date, many techniques have been developed to exploit network coding in DTN; however, most of them bring additional overhead due to the extra coded message redundancy. In this paper, we analyze the coded message redundancy issue and then propose NTC, an efficient network coding scheme for DTN. In NTC, a novel metric named “redundancy ratio” is introduced within the anti-entropy message exchange process. We also discuss the design and implementation of practical NTC in detail. To evaluate the performance of our proposed NTC scheme, we implement NTC in ONE, the current state-of-the-art simulator for DTN. Simulation results show that, compared with existing schemes, our proposed NTC scheme has significant advantages in enhancing the message delivery ratio and reducing the transmission overhead.

11.

Checkpointing Distributed Shared Memory

Silva Luis M. Silva João Gabriel 《The Journal of supercomputing》1997,11(2):137-158


12.

Active optimistic and distributed message logging for message‐passing applications

Thomas Ropars Christine Morin 《Concurrency and Computation》2011,23(17):2167-2178

Message logging is an attractive solution to provide fault tolerance for message‐passing applications because it is more scalable than coordinated checkpointing. Sender‐based message logging is a well‐known optimization that allows the saving of message payload in the sender memory. Thus, only message reception events have to be logged reliably by using an event logger. This paper proposes solutions to further improve message logging protocol scalability. In existing works on message logging, the event logger has always been considered as a centralized process. We propose a distributed event logger that takes advantage of multi‐core processors that are to be executed in parallel with application processes, leveraging the volatile memory of the nodes to save events reliably. We also propose the combination of our distributed event logger and O2P, an active optimistic message logging protocol using a gossip‐based protocol to disseminate information on new stable events. Our distributed event logger and O2P are implemented in the Open MPI library. Our results show the following: (i) distributed event logging improves message logging protocol scalability and (ii) using O2P with a distributed event logger provides an efficient and scalable fault‐tolerant solution for message‐passing applications. Copyright © 2011 John Wiley & Sons, Ltd.
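The division of labor in sender-based message logging (payloads stay in the sender's memory, only reception events go to the event logger) can be sketched as a toy in-process simulation. The class and function names below are hypothetical, the logger is a single centralized object rather than the distributed logger the paper proposes, and nothing here reflects the Open MPI implementation or O2P.

```python
from dataclasses import dataclass, field


@dataclass
class EventLogger:
    """Stand-in for reliable storage of reception events."""
    events: list = field(default_factory=list)

    def log(self, receiver, sender, send_seq, recv_seq):
        self.events.append((receiver, sender, send_seq, recv_seq))


@dataclass
class Process:
    name: str
    logger: EventLogger
    send_seq: int = 0
    recv_seq: int = 0
    sent_log: dict = field(default_factory=dict)  # payloads kept in sender memory
    delivered: list = field(default_factory=list)

    def send(self, dest: "Process", payload):
        self.send_seq += 1
        # Sender-based logging: the payload is saved locally, not on stable storage.
        self.sent_log[(dest.name, self.send_seq)] = payload
        dest.receive(self.name, self.send_seq, payload)

    def receive(self, sender, send_seq, payload):
        self.recv_seq += 1
        # Only the reception event (who, which message, in what order) is logged.
        self.logger.log(self.name, sender, send_seq, self.recv_seq)
        self.delivered.append(payload)


def replay(failed_name: str, senders: dict, logger: EventLogger) -> list:
    """Rebuild the delivery order of a failed process from the two logs."""
    events = sorted((e for e in logger.events if e[0] == failed_name),
                    key=lambda e: e[3])
    return [senders[s].sent_log[(failed_name, seq)] for _, s, seq, _ in events]
```

For example, if processes `a` and `b` both send to `c` and `c` then fails, `replay("c", {"a": a, "b": b}, logger)` returns the payloads in the order `c` originally received them, using only the senders' memory and the logged events.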

13.

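Implementation techniques and performance evaluation of an MPI checkpoint system based on the Lustre file system (Cited by: 1; self-citations: 0; citations by others: 1)

Xie Min, Lu Yutong, Zhou Enqiang, Cao Hongjia, Yang Xuejun 《Journal of Computer Research and Development》2007,44(10):1709-1716

Rollback recovery based on coordinated checkpointing is an important fault-tolerance technique adopted in large-scale parallel computer systems; its performance overhead is determined mainly by the coordination protocol and the storage of checkpoint images. This paper describes an application-transparent parallel checkpoint system implemented in MPICH2. Compared with existing techniques, the system has the following features: (1) the coordination protocol exploits the nearest-neighbor communication pattern of parallel applications and reduces protocol processing overhead through virtual connections; (2) the Lustre file system is used to simplify the management of checkpoint image files; (3) parallel I/O operations improve performance and optimize the storage of checkpoint images. Tests with real applications show that the checkpoint system has low runtime overhead and good scalability.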

14.

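Research and application of network storage technology (Cited by: 3; self-citations: 0; citations by others: 3)

Wang Yue, Jia Zhuosheng 《Microcomputer Development》2006,16(6):107-109

Network storage technology plays a very important role in enterprise data storage today, and choosing a suitable network storage technology for a specific application so that it delivers the greatest benefit has become a major problem for enterprises. This paper studies four mainstream network storage technologies, NAS, SAN, CAS, and IP SAN, and compares them in terms of architecture level and technical performance. Through this analysis and comparison, good points of integration among the technologies are identified and applied. Finally, in view of the overall business situation of small and medium-sized enterprises, a network storage solution based on the existing TCP/IP protocol is proposed, and on this basis the future development trends of network storage are pointed out.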

15.

Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations (Cited by: 1; self-citations: 0; citations by others: 1)

Ouyang Jinsong Maheshwari Piyush 《The Journal of supercomputing》1999,14(3):207-232

In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. In our model, fault tolerance is based on distributed consistent checkpointing and rollback recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface including message-passing and file I/O primitives, which hides from the application level the complexity of both fault-tolerant network communications and the checkpointing and recovery of user files. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.

16.

Application Level Fault Tolerance in Heterogeneous Networks of Workstations (Cited by: 2; self-citations: 0; citations by others: 2)

Adam Beguelin Erik Seligman Peter Stephan 《Journal of Parallel and Distributed Computing》1997,43(2):2078

We have explored methods for checkpointing and restarting processes within the distributed object migration environment (Dome), a C++ library of data parallel objects that are automatically distributed over heterogeneous networks of workstations (NOWs). System level checkpointing methods, although transparent to the user, were rejected because they lack support for heterogeneity. We have implemented application level checkpointing which places the checkpoint and restart mechanisms within Dome's C++ objects. Application level checkpointing has been implemented with a library-based technique for the programmer and a more transparent preprocessor-based technique. Dome's implementation of checkpointing successfully checkpoints and restarts processes on different numbers of machines and different architectures. Results from executing Dome programs across a NOW with realistic failure rates have been experimentally determined and are compared with theoretical results. The overhead of checkpointing is found to be low, while providing substantial decreases in expected runtime on realistic systems.

17.

Distributed Matrix-Free Solution of Large Sparse Linear Systems over Finite Fields

E. Kaltofen A. Lobo 《Algorithmica》1999,24(3-4):331-348

We describe a coarse-grain parallel approach for the homogeneous solution of linear systems. Our solutions are symbolic, i.e., exact rather than numerical approximations. We have performed an outer loop parallelization that works well in conjunction with a black box abstraction for the coefficient matrix. Our implementation can be run on a network cluster of UNIX workstations as well as on an SP-2 multiprocessor. Task distribution and management are effected through MPI and other packages. Fault tolerance, checkpointing, and recovery are incorporated. Detailed timings are presented for experiments with systems that arise in RSA challenge integer factoring efforts. For example, we can solve a 252,222 × 252,222 system with about 11.04 million nonzero entries over the Galois field with two elements using four processors of an SP-2 multiprocessor, in about 26.5 hours CPU time. Received June 1, 1997; revised March 10, 1998.

18.

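A quasi-synchronous checkpointing method based on PVM

Zhang Yu, Zhang Yufang 《Computer Engineering and Design》2006,27(3):494-496

Checkpointing is an important means of achieving fault tolerance in parallel systems, and synchronous checkpointing is widely used in workstation cluster systems. The message-passing mechanism provided by PVM supports efficient heterogeneous network computing but does not provide fault tolerance. To reduce the time overhead of synchronous checkpointing, a quasi-synchronous checkpointing method based on PVM is proposed. It retains the advantages of synchronous checkpointing while using message logging to let each node save its state independently, which greatly reduces the synchronization overhead of checkpointing and improves the efficiency of checkpoint operations. The method has been implemented in the PVM environment, and experimental results show that it achieves good fault-tolerance performance.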

19.

RSEDP: an effective hybrid data placement algorithm for large-scale storage systems

Nong Xiao Tao Chen Fang Liu 《The Journal of supercomputing》2011,55(1):103-122

The reliability and scalability of large-scale network storage systems face major challenges, which call for a reliable, scalable, and efficient data placement algorithm. Previous techniques can only partially satisfy these requirements. In this work, we develop an effective hybrid approach, RSEDP, which combines reliable replication data placement (RRDP) with scalable and efficient data placement (SEDP) to meet the requirements mentioned above. RRDP distributes replicated data over large-scale heterogeneous network storage systems such that replicas of the same data are placed on different devices and tend not to fall on consecutive devices, achieving a high redundancy degree and failure resilience. SEDP assigns data evenly among devices according to their weight and scales well as systems expand or shrink. To take advantage of both RRDP and SEDP, RSEDP integrates them by categorizing data into hot and cold data based on access frequency, placing hot data by RRDP and distributing the remainder by SEDP. The theoretical analysis and the experimental study show that the combined RSEDP can increase the redundancy degree and failure resilience, and has good scalability and time efficiency with small memory overhead.
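The abstract does not give the internals of RRDP or SEDP, so the sketch below only illustrates the overall hot/cold split. It substitutes weighted rendezvous hashing, a different and well-known technique, for both placement schemes; the threshold, the replica count, and all names are assumptions for illustration, and the sketch makes no attempt to avoid consecutive devices.

```python
import hashlib
import math


def _score(key: str, device: str, weight: float) -> float:
    """Weighted rendezvous-hashing score; a higher score wins."""
    digest = hashlib.sha256(f"{device}:{key}".encode()).digest()
    # Map 52 hash bits to a float strictly inside (0, 1).
    n = int.from_bytes(digest[:8], "big") >> 12
    h = (n + 0.5) / 2**52
    return -weight / math.log(h)


def place(key: str, devices: dict, access_count: int,
          hot_threshold: int = 100, replicas: int = 3) -> list:
    """Return the devices that should hold `key`.

    devices maps device name -> capacity weight. Hot data (accessed at
    least hot_threshold times) gets `replicas` copies on distinct devices;
    cold data gets a single weight-proportional placement.
    """
    ranked = sorted(devices, key=lambda d: _score(key, d, devices[d]),
                    reverse=True)
    copies = replicas if access_count >= hot_threshold else 1
    return ranked[:copies]
```

With rendezvous hashing, adding or removing a device only moves the keys whose top-ranked device changes, which is one way to obtain the kind of scaling behavior the abstract attributes to SEDP.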

20.

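Design and implementation of a storage manager in an iSCSI SAN

Zhou Jingli, Zhang Wei, Yu Shengsheng 《Computer Engineering and Applications》2004,40(12):96-98,118

iSCSI (Internet SCSI) is an emerging network storage protocol that has become an IETF proposed standard and is being adopted by a growing number of network storage systems. This article introduces the iSCSI protocol and the iSCSI SAN approach, presents the design and implementation of a key component of an iSCSI SAN, the storage manager, and uses the storage manager to implement storage virtualization for the iSCSI SAN, demonstrating the storage manager's efficiency, flexibility, and manageability.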