
1.

Garbage collection in uncoordinated checkpointing algorithms

LIU Yunlong CHEN Junliang 《Journal of Computer Science and Technology》1999,14(3):242-249

In this paper, the hard problem of thorough garbage collection in uncoordinated checkpointing algorithms is studied. After an introduction to the traditional garbage collection scheme, with which only obsolete checkpoints can be discarded, it is shown that this traditional method may fail to discard any checkpoint in some special cases, and that a thorough garbage collection method is needed, with which all checkpoints useless for any future rollback-recovery, including the obsolete ones, can be discarded. Then, the Thorough Garbage Collection Theorem is proposed and proved, which ensures the feasibility of thorough garbage collection and also gives a method to calculate the set of useful checkpoints.

2.

Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems

Yi-Min Wang Pi-Yu Chung In-Jen Lin Fuchs W.K. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(5):546-554

Uncoordinated checkpointing allows process autonomy and general nondeterministic execution, but suffers from potential domino effects and the associated space overhead. Prior to this research, checkpoint space reclamation had been based on the notion of obsolete checkpoints; as a result, a potentially unbounded number of nonobsolete checkpoints may have to be retained on stable storage. In this paper, we derive a necessary and sufficient condition for identifying all garbage checkpoints. By using the approach of recovery line transformation and decomposition, we develop an optimal checkpoint space reclamation algorithm and show that the space overhead for uncoordinated checkpointing is in fact bounded by N(N+1)/2 checkpoints, where N is the number of processes.
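
As a quick illustration of the bound quoted in this abstract, the following Python sketch evaluates N(N+1)/2 for a given number of processes. The function name max_retained_checkpoints is hypothetical and not taken from the paper; the sketch only computes the stated bound and does not implement the reclamation algorithm itself.

```python
def max_retained_checkpoints(num_processes: int) -> int:
    """Return the bound N(N+1)/2 stated in the abstract for N processes."""
    if num_processes < 0:
        raise ValueError("num_processes must be non-negative")
    # N(N+1) is always even, so integer division is exact.
    return num_processes * (num_processes + 1) // 2


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, max_retained_checkpoints(n))
```

The bound grows quadratically in the number of processes and is independent of how long the computation runs, which is the point of contrast with reclamation based only on obsolete checkpoints.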

3.

Adaptive exception handling for scientific workflows

Rafael Tolosana-Calasanz José A. Bañares Omer F. Rana Pedro Álvarez Joaquín Ezpeleta Andreas Hoheisel 《Concurrency and Computation》2010,22(5):617-642

Scientific workflow systems often operate in highly unreliable, heterogeneous and dynamic environments, and have accordingly incorporated different fault tolerance techniques. We propose an exception-handling mechanism, based on techniques adopted in programming languages, for modifying at run-time the structure of a workflow. In contrast to other proposals that achieve the required flexibility by means of the infrastructure, our proposal expresses the exception-handling mechanism within the workflow language, primarily as two exception-handling patterns that are exclusively based on the Reference Nets-within-Nets formalism (a specific type of Petri nets). When an exception is detected, a workflow in our approach can be re-written (replaced), based on the particular failure condition that has been detected. This enables workflow users to have better control and understanding of the behaviour of their workflow without having to be aware of the underlying infrastructure. Copyright © 2009 John Wiley & Sons, Ltd.

4.

Periodicity in optimal hierarchical checkpointing schemes for adjoint computations

Guillaume Aupy Julien Herrmann 《Optimization methods & software》2017,32(3):594-624

We reexamine the work of Aupy et al. on optimal algorithms for hierarchical adjoint computations, where two levels of memories are available. The previous optimal algorithm had a quadratic execution time. Here, with structural arguments, namely periodicity, on the optimal solution, we provide an optimal algorithm in constant time and space, with appropriate pre-processing. We also provide an asymptotically optimal algorithm for the online problem, when the adjoint chain size is not known beforehand. Again, these algorithms rely on the proof that the optimal solution for hierarchical adjoint computations is weakly periodic. We conjecture a closed-form formula for the period. Finally, we assess the convergence speed of the approximation ratio for the online problem through simulations.

5.

Common motifs in scientific workflows: An empirical analysis

《Future Generation Computer Systems》2014

Workflow technology continues to play an important role as a means for specifying and enacting computational experiments in modern science. Reusing and re-purposing workflows allow scientists to do new experiments faster, since the workflows capture useful expertise from others. As workflow libraries grow, scientists face the challenge of finding workflows appropriate for their task, understanding what each workflow does, and reusing relevant portions of a given workflow. We believe that workflows would be easier to understand and reuse if high-level views (abstractions) of their activities were available in workflow libraries. As a first step towards obtaining these abstractions, we report in this paper on the results of a manual analysis performed over a set of real-world scientific workflows from Taverna, Wings, Galaxy and Vistrails. Our analysis has resulted in a set of scientific workflow motifs that outline (i) the kinds of data-intensive activities that are observed in workflows (Data-Operation motifs), and (ii) the different manners in which activities are implemented within workflows (Workflow-Oriented motifs). These motifs are helpful to identify the functionality of the steps in a given workflow, to develop best practices for workflow design, and to develop approaches for automated generation of workflow abstractions.

6.

Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing

《Parallel Computing》2015

We analyse novel fault tolerance schemes for data loss in multigrid solvers, which essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To improve efficiency compared to conventional global checkpointing, we exploit the inherent data compression of the multigrid hierarchy, and relax the synchronicity requirement through a local failure local recovery approach. We experimentally identify the root cause of convergence degradation in the presence of data loss using smoothness considerations. Our resulting schemes form a family of techniques that can be tailored to the expected error probability of (future) large-scale machines. A performance model gives further insight into the benefits and applicability of our techniques.

7.

An analytical model for hybrid checkpointing in time warp distributed simulation

Soliman H.M. Elmaghraby A.S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(10):947-951

The Time Warp distributed simulation algorithm uses checkpointing to save process states after certain event executions for later recovery at the time of a rollback. Two main techniques have been used for checkpointing: periodic state saving and incremental state saving. The former technique introduces large overheads in reconstructing a desired state by coasting forward from an earlier checkpointed state if the computational granularity is large. The latter technique also has large overheads in applications with large rollback distances. A hybrid checkpointing technique is proposed which uses both periodic and incremental state saving simultaneously in such a way that it reduces checkpointing time overheads. A detailed analytical model is developed for the hybrid technique, and comparisons are made using similar analytical models with periodic and incremental state saving techniques. Results show that when the system parameters are chosen to represent large and complex simulated systems, the hybrid approach has less checkpointing time overhead than the other two techniques.
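
To make the two state-saving styles in this abstract concrete, here is a minimal Python sketch of a toy hybrid state saver. It is an illustration under stated assumptions, not the paper's technique or analytical model: the class HybridStateSaver, its methods and the cost rule (pick whichever direction touches fewer log records) are all hypothetical, state is a flat dictionary, and "coasting forward" is simplified to replaying logged new values rather than re-executing events.

```python
from copy import deepcopy


class HybridStateSaver:
    """Toy model: a full snapshot every `period` events (periodic saving)
    plus a (key, old, new) record for every event (incremental saving)."""

    _MISSING = object()  # marks "key did not exist before this event"

    def __init__(self, initial_state, period):
        if period < 1:
            raise ValueError("period must be at least 1")
        self.state = dict(initial_state)
        self.period = period
        self.event_count = 0
        self.snapshots = {0: deepcopy(self.state)}  # event index -> full state
        self.log = []  # log[i] describes event number i + 1

    def execute(self, key, value):
        """Apply one event that assigns `value` to `key`."""
        old = self.state.get(key, self._MISSING)
        self.log.append((key, old, value))
        self.state[key] = value
        self.event_count += 1
        if self.event_count % self.period == 0:
            self.snapshots[self.event_count] = deepcopy(self.state)

    def rollback(self, target):
        """Restore the state as of just after event `target`.

        Returns the number of log records undone or replayed."""
        if not 0 <= target <= self.event_count:
            raise ValueError("target outside executed range")
        base = (target // self.period) * self.period  # last snapshot <= target
        undo_cost = self.event_count - target
        forward_cost = target - base
        if undo_cost <= forward_cost:
            # Incremental style: undo records backwards from the present.
            for key, old, _ in reversed(self.log[target:]):
                if old is self._MISSING:
                    self.state.pop(key, None)
                else:
                    self.state[key] = old
            steps = undo_cost
        else:
            # Periodic style: reload a snapshot, then move forward.
            self.state = deepcopy(self.snapshots[base])
            for key, _, new in self.log[base:target]:
                self.state[key] = new
            steps = forward_cost
        # Forget everything after the rollback point.
        del self.log[target:]
        self.snapshots = {i: s for i, s in self.snapshots.items() if i <= target}
        self.event_count = target
        return steps
```

In this toy model a short rollback is served by undoing a few incremental records, while a long rollback reloads the nearest snapshot and replays at most period - 1 records, which mirrors the trade-off the abstract describes.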

8.

Information flow analysis of scientific workflows

Ping Yang Shiyong Lu Mikhail I. Gofman Zijiang Yang 《Journal of Computer and System Sciences》2010,76(6):390-402

Recently, scientific workflows have emerged as a platform for automating and accelerating data processing and data sharing in scientific communities. Many scientific workflows have been developed for collaborative research projects that involve a number of geographically distributed organizations. Sharing of data and computation across organizations in different administrative domains is essential in such a collaborative environment. Because of the competitive nature of scientific research, it is important to ensure that sensitive information in scientific workflows can be accessed by and propagated to only authorized parties. To address this problem, we present techniques for analyzing how information propagates in scientific workflows. We also present algorithms for incrementally analyzing how information propagates upon every change to an existing scientific workflow.
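
The question this abstract poses, whether sensitive information can reach an unauthorized party, can be illustrated with a plain reachability check over a workflow graph. The Python sketch below is only that: a simplified stand-in, not the paper's analysis or its incremental algorithms. The function propagation_violations, the task names and the party labels are all hypothetical.

```python
from collections import deque


def propagation_violations(edges, source, party_of, authorized):
    """Return tasks reachable from `source` that run at unauthorized parties.

    edges:      dict mapping a task to the tasks that consume its output
    source:     task where the sensitive data enters the workflow
    party_of:   dict mapping each task to the organization running it
    authorized: set of organizations allowed to see the data
    """
    reached = {source}
    queue = deque([source])
    while queue:  # breadth-first traversal of the data-flow edges
        task = queue.popleft()
        for successor in edges.get(task, ()):
            if successor not in reached:
                reached.add(successor)
                queue.append(successor)
    return sorted(t for t in reached if party_of[t] not in authorized)


if __name__ == "__main__":
    edges = {"ingest": ["clean"], "clean": ["analyze", "publish"]}
    party_of = {"ingest": "labA", "clean": "labA",
                "analyze": "labB", "publish": "public"}
    print(propagation_violations(edges, "ingest", party_of, {"labA", "labB"}))
```

A full traversal like this has to be repeated after every edit to the workflow; the abstract's incremental algorithms address exactly that cost, which this sketch does not attempt.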

9.

Localising temporal constraints in scientific workflows

Jinjun Chen Yun Yang 《Journal of Computer and System Sciences》2010,76(6):464-474

Temporal constraints are often set when complex e-science processes are modelled as scientific workflow specifications. However, many existing processes such as climate modelling often have only a few coarse-grained temporal constraints globally. This is not sufficient to control overall temporal correctness, as we cannot find temporal violations locally in time for handling. Local handling affects fewer workflow activities and is hence more cost-effective than global handling with coarse-grained temporal constraints. Therefore, in this paper, we systematically investigate how to localise a group of fine-grained temporal constraints so that temporal violations can be identified locally for better handling cost-effectiveness. The corresponding algorithms are developed. The quantitative evaluation demonstrates that with local fine-grained temporal constraints, we can significantly improve handling cost-effectiveness compared with using only coarse-grained ones.

10.

Scripting distributed scientific workflows using Weaver

Peter Bui Li Yu Andrew Thrasher Rory Carmichael Irena Lanc Patrick Donnelly Douglas Thain 《Concurrency and Computation》2012,24(15):1685-1707


11.

A structure-aware algorithm for fault-tolerant scheduling of scientific workflows

Masoumi Maryam Motallebi Hassan 《The Journal of supercomputing》2022,78(15):17348-17377

Here, we propose a fault-tolerant workflow scheduling algorithm that combines basic redundancies to reduce execution time through minimizing the redundancy overhead....

12.

An index-based checkpointing algorithm for autonomous distributed systems

Baldoni R. Quaglia F. Fornara P. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(2):181-192

This paper presents an index-based checkpointing algorithm for distributed systems with the aim of reducing the total number of checkpoints while ensuring that each checkpoint belongs to at least one consistent global checkpoint (or recovery line). The algorithm is based on an equivalence relation defined between pairs of successive checkpoints of a process which allows us, in some cases, to advance the recovery line of the computation without forcing checkpoints in other processes. The algorithm is well-suited for autonomous and heterogeneous environments, where each process does not know any private information about other processes and private information of the same type of distinct processes is not related (e.g., clock granularity, local checkpointing strategy, etc.). We also present a simulation study which compares the checkpointing-recovery overhead of this algorithm with that of previous solutions.
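
For readers unfamiliar with index-based checkpointing, the Python sketch below shows the basic rule such schemes commonly build on: each process keeps a sequence number, piggybacks it on outgoing messages, and takes a forced checkpoint when it receives a larger one. This is a generic baseline, assumed here for illustration only; it does not implement the equivalence relation described in the abstract, and the class and method names are hypothetical.

```python
class Process:
    """Toy process following a basic index-based checkpointing rule."""

    def __init__(self, name):
        self.name = name
        self.sn = 0  # current checkpoint sequence number (index)
        self.checkpoints = [("initial", 0)]

    def basic_checkpoint(self):
        """Checkpoint taken on the process's own schedule."""
        self.sn += 1
        self.checkpoints.append(("basic", self.sn))

    def send(self):
        """Return the index to piggyback on an outgoing message."""
        return self.sn

    def receive(self, piggybacked_sn):
        """Handle a message; return True if a forced checkpoint was taken."""
        if piggybacked_sn > self.sn:
            # The sender is ahead: checkpoint before delivering the message
            # so that checkpoints with equal indices stay consistent.
            self.sn = piggybacked_sn
            self.checkpoints.append(("forced", self.sn))
            return True
        return False
```

Every message from a process that is ahead forces a checkpoint at the receiver in this baseline; reducing the number of such checkpoints is the aim the abstract states.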

13.

Context-aware resource access control in scientific workflows

Fan Shaokun Dou Wanchun Liu Xiping 《Computer Engineering and Design》2008,29(2):463-465

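The dynamic nature of grid environments makes resource access control during scientific workflow execution an important research topic. A context-aware resource access control mechanism is therefore proposed, and the task context of scientific workflows and its constraints are analyzed and defined. A context-aware resource access control algorithm is described, and on this basis a framework for a context-aware scientific workflow management system is designed. Finally, the algorithm is validated with a weather forecasting scientific workflow example.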

14.

An optimization method for scientific workflow execution plans in cloud environments

Guo Hongle Chen Wanghu Ma Shengjun Li Xintian Qiao Baomin 《Computer Engineering & Science》2019,41(3):433-439

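To reduce the execution cost of scientific workflows in cloud environments, an optimization method for execution plans is proposed. The monkey algorithm is introduced to optimize the current execution plan within and between levels. While guaranteeing the workflow's global deadline constraint, the method logically aggregates tasks in the same level and adjusts tasks between levels to minimize the difference in the number of tasks per level, thereby avoiding idle resources and shortening task waiting time. Experiments show that, compared with similar work, the method reduces resource consumption and total delay time.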

15.

A model for error recovery with global checkpointing

Krishna Kant 《Information Sciences》1983,30(3):225-239

The paper proposes a new technique for providing software fault tolerance in concurrent systems. It combines the traditional global checkpointing mechanism with the recovery block concept in order to come up with an easily implementable error recovery mechanism. This mechanism involves smaller overhead in case of moderate to high process interaction than the schemes considered in the past, which are based upon the idea of local checkpointing. A model for computing the optimum checkpointing interval is also presented. A particular distribution is hypothesized for the coverage of the recovery, and the behavior of the model is studied in detail for this case.

16.

An efficient protocol for checkpointing recovery in distributed systems

Kim J.L. Park T. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(8):955-960

The authors present an efficient synchronized checkpointing protocol that exploits the dependency relation between processes in distributed systems. In this protocol, a process takes a checkpoint when it knows that all processes on which it computationally depends took their checkpoints, hence the process need not always wait for the decision made by the checkpointing coordinator as in the conventional synchronized protocols. As a result, the checkpointing coordination time is substantially reduced and the possibility of total abort of the checkpointing coordination is reduced.

17.

Integer linear programming-based multi-objective scheduling for scientific workflows in multi-cloud environments

Mohammadi Somayeh PourKarimi Latif Pedram Hossein 《The Journal of supercomputing》2019,75(10):6683-6709

Scientific communities are motivated to schedule the data-intensive scientific workflows in multi-cloud environments, where considerable diverse resources are provided by multiple clouds and resource limitation imposed by individual clouds is overcome. However, this scheduling involves two conflicting objectives: minimizing cost and makespan. In general, dealing with such conflicting criteria is a difficult task. But fortunately recent efficient methods for solving multi-objective optimization problems motivated us to provide a multi-objective model considering minimization of cost and makespan as objectives. For solving this model, we use different scalarization procedures such as weighted-sum, Benson's scalarization and weighted min–max under different scenarios. Moreover, we investigate the stability of obtained solutions and propose a new approach for determining the most stable solution related to weighted-sum and weighted min–max as post-optimality analysis. Results indicate that our proposed weighted-sum approach outperforms the previously developed methods in terms of hypervolume.
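
The two scalarization styles named in this abstract, weighted-sum and weighted min-max, can be shown on a toy set of candidate schedules. The Python sketch below is an illustration under stated assumptions, not the paper's integer linear programming model: the candidate values are invented, both objectives are assumed to be already normalized to comparable scales, the reference point for min-max is taken to be the origin, and all function names are hypothetical.

```python
def weighted_sum(objectives, weights):
    """Scalarize by the weighted sum of the objective values."""
    return sum(w * f for w, f in zip(weights, objectives))


def weighted_min_max(objectives, weights):
    """Scalarize by the largest weighted objective value."""
    return max(w * f for w, f in zip(weights, objectives))


def best_schedule(candidates, weights, scalarize):
    """Return the name of the candidate with the smallest scalarized value.

    candidates: dict mapping a schedule name to (cost, makespan)
    """
    return min(candidates, key=lambda name: scalarize(candidates[name], weights))


if __name__ == "__main__":
    # Hypothetical (cost, makespan) pairs, both to be minimized.
    candidates = {
        "cheap_slow": (10.0, 100.0),
        "balanced": (40.0, 50.0),
        "fast_costly": (90.0, 20.0),
    }
    for weights in ((0.5, 0.5), (0.9, 0.1), (0.1, 0.9)):
        print(weights,
              best_schedule(candidates, weights, weighted_sum),
              best_schedule(candidates, weights, weighted_min_max))
```

Changing the weights moves the selected schedule along the cost/makespan trade-off, which is how a scalarization turns the two conflicting objectives into a single-objective problem.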


18.

Optimizing virtual machine allocation for parallel scientific workflows in federated clouds

《Future Generation Computer Systems》2015

Cloud computing has established itself as an interesting computational model that provides a wide range of resources such as storage, databases and computing power for several types of users. Recently, the concept of cloud computing was extended with the concept of federated clouds, where several resources from different cloud providers are inter-connected to perform a common action (e.g. execute a scientific workflow). Users can benefit from both single-provider and federated cloud environments to execute their scientific workflows since they can get the necessary amount of resources on demand. In several of these workflows, there is a demand for high performance and parallelism techniques since many activities are data and computing intensive and can execute for hours, days or even weeks. There are some Scientific Workflow Management Systems (SWfMS) that already provide parallelism capabilities for scientific workflows in single-provider clouds. Most of them rely on creating a virtual cluster to execute the workflow in parallel. However, they also rely on the user to estimate the number of virtual machines to be allocated to create this virtual cluster. Most SWfMS use this initial virtual cluster configuration made by the user for the entire workflow execution. Dimensioning the virtual cluster to execute the workflow in parallel is then a top-priority task, since an under- or over-dimensioned virtual cluster can degrade workflow performance or (unnecessarily) increase financial costs. This dimensioning is far from trivial in a single-provider cloud, and especially in federated clouds, due to the huge number of virtual machine types to choose from in each location and provider. In this article, we propose an approach named GraspCC-fed to produce the optimal (or near-optimal) estimation of the number of virtual machines to allocate for each workflow. GraspCC-fed extends a previously proposed heuristic based on GRASP for executing standalone applications to consider scientific workflows executed in both single-provider and federated clouds. For the experiments, GraspCC-fed was coupled to an adapted version of the SciCumulus workflow engine for federated clouds. We therefore believe that GraspCC-fed can be an important decision support tool for users and can help determine an optimal configuration of the virtual cluster for parallel cloud-based scientific workflows.

19.

An environment for designing exceptions in workflows

《Information Systems》1999,24(3):255-273

When designing a workflow schema, the workflow designer must often explicitly deal with exceptional situations, such as abnormal process termination or suspension of task execution. This paper shows how the designer can be supported by tools allowing him to capture exceptional behavior within a workflow schema, by reusing an available set of pre-configured exception skeletons. Exceptions are expressed by means of triggers, to be executed on top of an active database environment. In particular, the paper deals with the handling of typical workflow exceptional situations which are modeled as generic exception skeletons to be included in a new workflow schema by simply specializing or instantiating them. Such skeletons, called patterns, are stored in a catalog; the paper describes the catalog structure and its management tools constituting an integrated environment for pattern-based exception design and reuse.

20.

An asynchronous distributed architecture model for the Boltzmann machine control mechanism

De Gloria A. Olivieri M. 《Neural Networks, IEEE Transactions on》1996,7(6):1538-1541

We present a study addressing a hardware implementation of the Boltzmann machine that relies on the concept of asynchronous digital systems. The constraint of concurrently switching only unconnected neurons is dynamically satisfied by using an asynchronous distributed control mechanism. The design of the control architecture is derived from a formal definition of the problem by means of trace theory. Computer simulations show the efficiency of the proposed approach.