
1.

Cost-oriented proactive fault tolerance approach to high performance computing (HPC) in the cloud

Ifeanyi P. Egwutuoha Shiping Chen David Levy Bran Selic Rafael Calvo 《International Journal of Parallel, Emergent and Distributed Systems》2014,29(4):363-378

Cloud computing offers new computing paradigms, capacity and flexible solutions to high performance computing (HPC) applications. For example, Hardware as a Service (HaaS) allows users to provide a large number of virtual machines (VMs) for computation-intensive applications using the HaaS model. Due to the large number of VMs and electronic components in HPC system in the cloud, any fault during the execution would result in re-running the applications, which will cost time, money and energy. In this paper we presented a proactive fault tolerance (FT) approach to HPC systems in the cloud to reduce the wall-clock execution time and dollar cost in the presence of faults. We also developed a generic FT algorithm for HPC systems in the cloud. Our algorithm does not rely on a spare node prior to prediction of a failure. We also developed a cost model for executing computation-intensive applications on HPC systems in the cloud. We analysed the dollar cost of provisioning spare nodes and checkpointing FT to assess the value of our approach. Our experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in cloud can be reduced by as much as 30%. The frequency of checkpointing of computation-intensive applications can be reduced up to 50% with our FT approach for HPC in the cloud compared with current FT approaches.

2.

Movement-based checkpointing and logging for failure recovery of database applications in mobile environments

Sapna E. George Ing-Ray Chen 《Distributed and Parallel Databases》2008,23(3):189-205

In this paper, we present an efficient failure recovery scheme for mobile database applications based on movement-based checkpointing and logging. Current approaches take checkpoints periodically without regard to the mobility behavior of mobile users. Our movement-based checkpointing scheme takes a checkpoint only after a threshold of mobility handoffs has been exceeded. The optimal threshold is governed by the failure rate, log arrival rate, and the mobility rate of the mobile host. This allows the tuning of the checkpointing rate on a per-user basis. We identify the optimal movement threshold which will minimize the recovery cost per failure as a function of the mobile node’s mobility rate, failure rate and log arrival rate. We derive the mobile database application recoverability, i.e., the probability that the recovery can be done by a specified recovery time deadline. Numeric data are presented to demonstrate the feasibility of our approach with its applicability given.
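The movement-based policy described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `MovementCheckpointer`, its methods, and the threshold value are hypothetical, and the derivation of the optimal threshold from the failure, log arrival, and mobility rates is not reproduced here.

```python
class MovementCheckpointer:
    """Hypothetical sketch: checkpoint only after more than `threshold` handoffs."""

    def __init__(self, threshold):
        self.threshold = threshold  # handoffs tolerated between checkpoints
        self.handoffs = 0           # handoffs since the last checkpoint
        self.log = []               # log records since the last checkpoint
        self.checkpoints = 0        # number of checkpoints taken so far

    def write(self, record):
        # Log arrivals never trigger a checkpoint on their own.
        self.log.append(record)

    def handoff(self):
        # Count the move; checkpoint once the threshold has been exceeded.
        self.handoffs += 1
        if self.handoffs > self.threshold:
            self.checkpoint()

    def checkpoint(self):
        # A real system would persist state here; the sketch only resets counters.
        self.checkpoints += 1
        self.handoffs = 0
        self.log.clear()


cp = MovementCheckpointer(threshold=2)  # threshold chosen arbitrarily
for i in range(7):
    cp.write(f"update-{i}")
    cp.handoff()
```

With a threshold of 2, every third handoff triggers a checkpoint, so a host that rarely moves checkpoints rarely regardless of elapsed time. A periodic scheme would instead checkpoint on a timer.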

3.

Memory exclusion: optimizing the performance of checkpointing systems

James S. Plank Yuqun Chen Kai Li Micah Beck Gerry Kingsley 《Software》1999,29(2):125-142

Checkpointing systems are a convenient way for users to make their programs fault‐tolerant by intermittently saving program state to disk and restoring that state following a failure. The main concern with checkpointing is the overhead that it adds to running time of the program. This paper describes memory exclusion, an important class of optimizations that reduce the overhead of checkpointing. Some forms of memory exclusion are well‐known in the checkpointing community. Others are relatively new. In this paper, we describe all of them within the same framework. We have implemented these optimization techniques in two checkpointers: libckpt, which works on Unix‐based workstations, and CLIP, which works on the Intel Paragon. Both checkpointers are publicly available at no cost. We have checkpointed various long‐running applications with both checkpointers and have explored the performance improvements that may be gained through memory exclusion. Results from these experiments are presented and show the improvements in time and space overhead. Copyright © 1999 John Wiley & Sons, Ltd.

4.

Modular Checkpointing for Atomicity

Lukasz Ziarek Philip Schatz Suresh Jagannathan 《Electronic Notes in Theoretical Computer Science》2007,174(9):85

Transient faults that arise in large-scale software systems can often be repaired by re-executing the code in which they occur. Ascribing a meaningful semantics for safe re-execution in multi-threaded code is not obvious, however. For a thread to correctly re-execute a region of code, it must ensure that all other threads which have witnessed its unwanted effects within that region are also reverted to a meaningful earlier state. If not done properly, data inconsistencies and other undesirable behavior may result. However, automatically determining what constitutes a consistent global checkpoint is not straightforward since thread interactions are a dynamic property of the program. In this paper, we present a safe and efficient checkpointing mechanism for Concurrent ML (CML) that can be used to recover from transient faults. We introduce a new linguistic abstraction called stabilizers that permits the specification of per-thread monitors and the restoration of globally consistent checkpoints. Global states are computed through lightweight monitoring of communication events among threads (e.g. message-passing operations or updates to shared variables). Our checkpointing abstraction provides atomicity and isolation guarantees during state restoration ensuring restored global states are safe. Our experimental results on several realistic, multithreaded, server-style CML applications, including a web server and a windowing toolkit, show that the overheads to use stabilizers are small, and lead us to conclude that they are a viable mechanism for defining safe checkpoints in concurrent functional programs. Our experiments conclude with a case study illustrating how to build open nested transactions from our checkpointing mechanism.

5.

A Low-Cost Checkpointing Technique for Distributed Databases

Jun-Lin Lin Margaret H. Dunham 《Distributed and Parallel Databases》2001,10(3):241-268

For distributed databases, checkpointing is used to ensure an efficient way to perform global reconstruction. However, the need for global reconstruction is infrequent. Most current checkpointing approaches for distributed databases are too expensive during run time. Some of them allow the checkpointing process to run in parallel with normal transactions at the cost of more data and resource contention, which in turn causes longer response time for normal transactions. Thus, an efficient way to checkpoint distributed databases is needed to avoid degrading the system performance. This paper presents a low-cost solution, called Loosely Synchronized Local Fuzzy Checkpointing (LSLFC), to these problems. LSLFC supports global reconstruction, and our performance study shows that LSLFC has little overhead during run time.

6.

Replication-Based Fault Tolerance for MPI Applications

Walters John Paul Chaudhary Vipin 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(7):997-1010

As computational clusters increase in size, their mean time to failure reduces drastically. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for storing checkpoints. This results in a bottleneck and severely limits the scalability of checkpointing, while also proving to be too expensive for dedicated checkpointing networks and storage systems. We propose a scalable replication-based MPI checkpointing facility. Our reference implementation is based on LAM/MPI; however, it is directly applicable to any MPI implementation. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, a Sun-X4500-based solution, an EMC storage area network (SAN), and the Ibrix commercial parallel file system and show that they are not scalable, particularly after 64 CPUs. We demonstrate the low overhead of our checkpointing and replication scheme with the NAS Parallel Benchmarks and the High-Performance LINPACK benchmark with tests up to 256 nodes while demonstrating that checkpointing and replication can be achieved with a much lower overhead than that provided by current techniques. Finally, we show that the monetary cost of our solution is as low as 25 percent of that of a typical SAN/parallel-file-system-equipped storage system.

7.

An efficient exact adjoint of the parallel MIT General Circulation Model, generated via automatic differentiation

Patrick Heimbach Chris Hill Ralf Giering 《Future Generation Computer Systems》2005,21(8):413-1371

We describe computational aspects of automatic differentiation applied to global ocean circulation modeling and state estimation. The task of minimizing a cost function measuring the ocean simulation versus observation misfit is achieved through efficient calculation of the cost gradient w.r.t. a set of controls via the adjoint technique. The adjoint code of the parallel MIT general circulation model is generated using TAMC or its successor TAF. To achieve a tractable problem in both CPU and memory requirements, in the light of control flow reversal, the adjoint code relies heavily on the balancing of storing versus recomputation via the checkpointing method. Further savings are achieved by exploiting self-adjointness of part of the computation. To retain scalability of domain decomposition-based parallelism, hand-written adjoint routines are provided. These complement routines of the parallel support package to perform corresponding operations in reverse mode. The unique feature of the TAF tool which enables the dumping of the adjoint state and restart the adjoint integration is exploited to overcome batch execution limitations on HPC machines for large-scale ocean and climate simulations. Strategies to test the correctness of the adjoint-generated gradient are presented. The size of a typical adjoint application is illustrated for the case of the global ocean state estimation problem undertaken by the SIO-JPL-MIT Consortium “Estimating the Circulation and Climate of the Ocean” (ECCO). Results are given by way of example.

8.

Reliable user‐level rollback recovery implementation for multithreaded processes on windows

Jin‐Min Yang Da‐Fang Zhang Xue‐Dong Yang Wen‐Wei Li 《Software》2007,37(3):331-346

The existing user‐level checkpointing schemes support only a limited portion of multithreaded programs because they are derived from the schemes for single‐threaded applications. This paper addresses the impact of thread suspension point on rollback recovery, and presents a checkpointing scheme for multithreaded processes. Unlike the existing schemes in which the checkpointer suspends every working thread, our scheme employs a distinctive strategy that every working thread suspends itself. This technique manages to avoid the suspension point in the API code or kernel code, ensuring correct rollback recovery. Our scheme supports inter‐thread synchronization and thread lifetime. Copyright © 2006 John Wiley & Sons, Ltd.

9.

Checkpointing for optimistic concurrency control methods

Thomasian A. 《Knowledge and Data Engineering, IEEE Transactions on》1995,7(2):332-339

Restart-oriented concurrency control (CC) methods, such as optimistic CC, outperform blocking-oriented methods, such as standard two-phase locking in a high data contention environment, but this is at the cost of wasted processing due to restarts. Volatile savepoints are considered in this study as a method to reduce this wasted processing and to improve response time. There is the usual tradeoff between the checkpointing overhead and the wasted processing when a transaction is restarted. Our study shows that in a system where objects are accessed and updated uniformly during the lifetime of transactions, significant improvement in performance at high data conflict levels is attainable only when the checkpointing cost is low. This implies few optimally placed checkpoints per transaction. It is observed that checkpointing may result in a significant improvement in performance when accesses to database hot-spots are deferred to the final steps of transaction execution. The parametric studies reported in this paper are facilitated by closed-form analytic solutions expressing system performance, as well as an iterative solution which takes into account hardware resource contention in addition to data contention.

10.

Problem difficulty for tabu search in job-shop scheduling Total citations: 2, self-citations: 0, citations by others: 2

Jean-Paul Watson J. Christopher Beck Adele E. Howe L. Darrell Whitley 《Artificial Intelligence》2003,143(2):189-217

Tabu search algorithms are among the most effective approaches for solving the job-shop scheduling problem (JSP). Yet, we have little understanding of why these algorithms work so well, and under what conditions. We develop a model of problem difficulty for tabu search in the JSP, borrowing from similar models developed for SAT and other NP-complete problems. We show that the mean distance between random local optima and the nearest optimal solution is highly correlated with the cost of locating optimal solutions to typical, random JSPs. Additionally, this model accounts for the cost of locating sub-optimal solutions, and provides an explanation for differences in the relative difficulty of square versus rectangular JSPs. We also identify two important limitations of our model. First, model accuracy is inversely correlated with problem difficulty, and is exceptionally poor for rare, very high-cost problem instances. Second, the model is significantly less accurate for structured, non-random JSPs. Our results are also likely to be useful in future research on difficulty models of local search in SAT, as local search cost in both SAT and the JSP is largely dictated by the same search space features. Similarly, our research represents the first attempt to quantitatively model the cost of tabu search for any NP-complete problem, and may possibly be leveraged in an effort to understand tabu search in problems other than job-shop scheduling.

11.

A policy-based approach for strong mobility of composed Web services

Soumaya Marzouk Mohamed Jmaiel 《Service Oriented Computing and Applications》2013,7(4):293-315

This paper presents a flexible, portable, and transparent solution for strong mobility of composed Web services relying on policy-oriented techniques. The proposed approach provides a checkpoint solution based on automatic code instrumentation using correct source code transformation rules. This checkpoint technique makes it possible to save the execution state of a mobile orchestration process as well as the execution states of its orchestrated partners. Thus, after migration, only non-executed code will be resumed. In addition, our approach enables dynamic adaptation of the employed checkpointing and mobility techniques using aspects. For that, we use policies allowing dynamic selection of the used checkpointing and mobility techniques according to the execution context. Moreover, the proposed solution includes a module allowing the determination of the checkpointing interval satisfying QoS requirements. Experiments show the efficiency of the proposed solution.

12.

Time-Space Tradeoffs for Undirected Graph Traversal by Graph Automata Total citations: 1, self-citations: 0, citations by others: 1

Paul Beame Allan Borodin Prabhakar Raghavan Walter L. Ruzzo Martin Tompa 《Information and Computation》1996,130(2):101

We investigate time-space tradeoffs for traversing undirected graphs, using a variety of structured models that are all variants of Cook and Rackoff's “Jumping Automata for Graphs.” Our strongest tradeoff is a quadratic lower bound on the product of time and space for graph traversal. For example, achieving linear time requires linear space, implying that depth-first search is optimal. Since our bound in fact applies to nondeterministic algorithms for nonconnectivity, it also implies that closure under complementation of nondeterministic space-bounded complexity classes is achieved only at the expense of increased time. To demonstrate that these structured models are realistic, we also investigate their power. In addition to admitting well known algorithms such as depth-first search and random walk, we show that one simple variant of this model is nearly as powerful as a Turing machine. Specifically, for general undirected graph problems, it can simulate a Turing machine with only a constant factor increase in space and a polynomial factor increase in time.

13.

Preprocessing of Intractable Problems

Marco Cadoli Francesco M. Donini Paolo Liberatore Marco Schaerf 《Information and Computation》2002,176(2):89

Some computationally hard problems, e.g., deduction in logical knowledge bases, are such that part of an instance is known well before the rest of it, and remains the same for several subsequent instances of the problem. In these cases, it is useful to preprocess off-line this known part so as to simplify the remaining on-line problem. In this paper we investigate such a technique in the context of intractable, i.e., NP-hard, problems. Recent results in the literature show that not all NP-hard problems behave in the same way: for some of them preprocessing yields polynomial-time on-line simplified problems (we call them compilable), while for other ones their compilability implies some consequences that are considered unlikely. Our primary goal is to provide a sound methodology that can be used to either prove or disprove that a problem is compilable. To this end, we define new models of computation, complexity classes, and reductions. We find complete problems for such classes, “completeness” meaning they are “the least likely to be compilable.” We also investigate preprocessing that does not yield polynomial-time on-line algorithms, but generically “decreases” complexity. This leads us to define “hierarchies of compilability,” that are the analog of the polynomial hierarchy. A detailed comparison of our framework to the idea of “parameterized tractability” shows the differences between the two approaches.

14.

Nonblocking checkpointing for optimistic parallel simulation: description and an implementation

Quaglia F. Santoro A. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(6):593-610

Describes a nonblocking checkpointing mode in support of optimistic parallel discrete event simulation. This mode allows real concurrency in the execution of state saving and other simulation specific operations (e.g., event list update, event execution) with the aim of removing the cost of recording state information from the completion time of the parallel simulation application. We present an implementation of a C library supporting nonblocking checkpointing on a Myrinet-based cluster, which demonstrates the practical viability of this checkpointing mode on standard off-the-shelf hardware. By the results of an empirical study on classical parameterized synthetic benchmarks, we show that, except for the case of minimal state granularity applications, nonblocking checkpointing allows improvement of the speed of the parallel execution, as compared to commonly adopted, optimized checkpointing methods based on the classical blocking mode. A performance study for the case of a personal communication system (PCS) simulation is additionally reported to point out the benefits from nonblocking checkpointing for a real world application.

15.

Embracing change with extreme programming Total citations: 6, self-citations: 0, citations by others: 6

Beck K. 《Computer》1999,32(10):70-77

Traditional software engineering means have been characterized by a rather predictable process in the past. Users tell once and for all exactly what they want. Programmers design the system that will deliver those features. They code it; test it, and all is well. But all was not always well. The users did not tell once and for all exactly what they wanted. They changed their minds, and the users were not the only problem. Programmers could misjudge their progress. The academic software engineering community took the high cost of changing software as a challenge, creating technologies like relational databases, modular programming, and information hiding. This is where extreme programming comes in. Rather than planning, analyzing, and designing for the far-flung future, XP exploits the reduction in the cost of changing software to do all of these activities a little at a time, throughout software development. The paper discusses the major practices of XP.

16.

Further results on the Bellman equation for optimal control problems with exit times and nonnegative Lagrangians

Michael Malisoff 《Systems & Control Letters》2003,50(1):65-79

In a series of papers, we proved theorems characterizing the value function in exit time optimal control as the unique viscosity solution of the corresponding Bellman equation that satisfies appropriate side conditions. The results applied to problems which satisfy a positivity condition on the integral of the Lagrangian. This positive integral condition assigned a positive cost for remaining outside the target on any interval of positive length. In this note, we prove a new theorem which characterizes the exit time value function as the unique bounded-from-below viscosity solution of the Bellman equation that vanishes on the target. The theorem applies to problems satisfying an asymptotic condition on the trajectories, including cases where the positive integral condition is not satisfied. Our results are based on an extended version of “Barbălat's lemma”. We apply the theorem to variants of the Fuller problem and other examples where the Lagrangian is degenerate.

17.

Design and implementation of a checkpoint-based parallel program debugger Total citations: 4, self-citations: 1, citations by others: 4

Liu Jian Wang Dongsheng Shen Meiming Zheng Weimin 《Journal of Computer Research and Development》2002,39(12):1580-1586

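To support the debugging of large-scale, long-running parallel programs, a checkpointing mechanism needs to be introduced into the parallel program debugger. Checkpointing and rollback must address four problems: in-transit messages, orphan messages, the domino effect, and livelock. Parallel program debugging must also address nondeterminism. The proposed deterministic checkpointing method based on state freezing avoids three of these problems (orphan messages, the domino effect, and livelock), handles in-transit messages by recording them, and uses a record/replay approach to resolve nondeterminism in parallel debugging. The method effectively solves many of the problems that arise when a parallel program debugger is combined with checkpointing, and it has the advantages of a clear structure and ease of implementation. Based on this technique, a parallel debugging tool, DENNET, was designed and implemented.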

18.

Joint user profiling with hierarchical attention networks

Xiaojian LIU Yi ZHU Xindong WU 《Frontiers of Computer Science》2023,17(3):173608

User profiling by inferring user personality traits, such as age and gender, plays an increasingly important role in many real-world applications. Most existing methods for user profiling either use only one type of data or ignore handling the noisy information of data. Moreover, they usually consider this problem from only one perspective. In this paper, we propose a joint user profiling model with hierarchical attention networks (JUHA) to learn informative user representations for user profiling. Our JUHA method does user profiling based on both inner-user and inter-user features. We explore inner-user features from user behaviors (e.g., purchased items and posted blogs), and inter-user features from a user-user graph (where similar users could be connected to each other). JUHA learns basic sentence and bag representations from multiple separate sources of data (user behaviors) as the first round of data preparation. In this module, convolutional neural networks (CNNs) are introduced to capture word and sentence features of age and gender while the self-attention mechanism is exploited to weaken the noisy data. Following this, we build another bag which contains a user-user graph. Inter-user features are learned from this bag using propagation information between linked users in the graph. To acquire more robust data, inter-user features and other inner-user bag representations are joined into each sentence in the current bag to learn the final bag representation. Subsequently, all of the bag representations are integrated to learn a comprehensive user representation by the self-attention mechanism. Our experimental results demonstrate that our approach outperforms several state-of-the-art methods and improves prediction performance.

19.

A technique for non-invasive application-level checkpointing Total citations: 1, self-citations: 1, citations by others: 0

Ritu Arora Purushotham Bangalore Marjan Mernik 《The Journal of supercomputing》2011,57(3):227-255

One of the key elements required for writing self-healing applications for distributed and dynamic computing environments is checkpointing. Checkpointing is a mechanism by which an application is made resilient to failures by storing its state periodically to the disk. The main goal of this research is to enable non-invasive reengineering of existing applications to insert an Application-Level Checkpointing (ALC) mechanism. The Domain-Specific Language (DSL) developed in this research serves as a means towards this end and is used for obtaining the ALC-specifications from the end-users. These specifications are used for generating and inserting the actual checkpointing code into the existing application. The performance of the application having the generated checkpointing code is comparable to the performance of the application in which the checkpointing code was inserted manually. With slight modifications, the DSL developed in this research can be used for specifying the ALC mechanism in several base languages (e.g., C/C++, Java, and FORTRAN).
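As a rough illustration of what hand-inserted application-level checkpointing looks like (the kind of code the abstract's DSL is meant to generate), here is a minimal Python sketch. It is not output of the authors' tool: the function `run`, its parameters, the JSON checkpoint format, and the simulated failure are all assumptions made for illustration.

```python
import json
import os
import tempfile


def run(total_steps, ckpt_path, every, fail_at=None):
    """Hypothetical sketch: sum 0..total_steps-1, checkpointing every `every` steps."""
    # Resume from the last checkpoint if one exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            # Write to a temporary file, then rename it over the old checkpoint,
            # so an interrupted write does not damage the previous checkpoint.
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, ckpt_path)
    return state["acc"]


ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
try:
    run(10, ckpt, every=3, fail_at=7)  # fails after the checkpoint at step 6
except RuntimeError:
    pass
result = run(10, ckpt, every=3)        # resumes from step 6 instead of step 0
```

The second call redoes only the work after the last saved state. The checkpointing logic is interleaved with the computation, which is the invasiveness the abstract's approach aims to avoid.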

20.

Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations Total citations: 1, self-citations: 0, citations by others: 1

Ouyang Jinsong Maheshwari Piyush 《The Journal of supercomputing》1999,14(3):207-232

In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and programming effort required to construct reliable application software. In our model fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy to use programming interface including message-passing and file I/O primitives, which hides the complexity of both fault-tolerant network communications and checkpointing and recovering user files from the application level. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.