期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Common motifs in scientific workflows: An empirical analysis

《Future Generation Computer Systems》2014

Workflow technology continues to play an important role as a means for specifying and enacting computational experiments in modern science. Reusing and re-purposing workflows allow scientists to do new experiments faster, since the workflows capture useful expertise from others. As workflow libraries grow, scientists face the challenge of finding workflows appropriate for their task, understanding what each workflow does, and reusing relevant portions of a given workflow. We believe that workflows would be easier to understand and reuse if high-level views (abstractions) of their activities were available in workflow libraries. As a first step towards obtaining these abstractions, we report in this paper on the results of a manual analysis performed over a set of real-world scientific workflows from Taverna, Wings, Galaxy and Vistrails. Our analysis has resulted in a set of scientific workflow motifs that outline (i) the kinds of data-intensive activities that are observed in workflows (Data-Operation motifs), and (ii) the different manners in which activities are implemented within workflows (Workflow-Oriented motifs). These motifs are helpful to identify the functionality of the steps in a given workflow, to develop best practices for workflow design, and to develop approaches for automated generation of workflow abstractions. 相似文献

2.

A new optimization phase for scientific workflow management systems

《Future Generation Computer Systems》2014

Scientific workflows have emerged as an important tool for combining the computational power with data analysis for all scientific domains in e-science, especially in the life sciences. They help scientists to design and execute complex in silico experiments. However, with rising complexity it becomes increasingly impractical to optimize scientific workflows by trial and error. To address this issue, we propose to insert a new optimization phase into the common scientific workflow life cycle. This paper describes the design and implementation of an automated optimization framework for scientific workflows to implement this phase. Our framework was integrated into Taverna, a life-science oriented workflow management system and offers a versatile programming interface (API), which enables easy integration of arbitrary optimization methods. We have used this API to develop an example plugin for parameter optimization that is based on a Genetic Algorithm. Two use cases taken from the areas of structural bioinformatics and proteomics demonstrate how our framework facilitates setup, execution, and monitoring of workflow parameter optimization in high performance computing e-science environments. 相似文献

3.

Using ontologies for verification and validation of workflow-based experiments

《Journal of Web Semantics》2017

相似文献

4.

Using a suite of ontologies for preserving workflow-centric research objects

《Journal of Web Semantics》2015

Scientific workflows are a popular mechanism for specifying and automating data-driven in silico experiments. A significant aspect of their value lies in their potential to be reused. Once shared, workflows become useful building blocks that can be combined or modified for developing new experiments. However, previous studies have shown that storing workflow specifications alone is not sufficient to ensure that they can be successfully reused, without being able to understand what the workflows aim to achieve or to re-enact them. To gain an understanding of the workflow, and how it may be used and repurposed for their needs, scientists require access to additional resources such as annotations describing the workflow, datasets used and produced by the workflow, and provenance traces recording workflow executions.In this article, we present a novel approach to the preservation of scientific workflows through the application of research objects—aggregations of data and metadata that enrich the workflow specifications. Our approach is realised as a suite of ontologies that support the creation of workflow-centric research objects. Their design was guided by requirements elicited from previous empirical analyses of workflow decay and repair. The ontologies developed make use of and extend existing well known ontologies, namely the Object Reuse and Exchange (ORE) vocabulary, the Annotation Ontology (AO) and the W3C PROV ontology (PROVO). We illustrate the application of the ontologies for building Workflow Research Objects with a case-study that investigates Huntington’s disease, performed in collaboration with a team from the Leiden University Medial Centre (HG-LUMC). Finally we present a number of tools developed for creating and managing workflow-centric research objects. 相似文献

5.

A comparison of using Taverna and BPEL in building scientific workflows: the case of caGrid

Wei Tan Paolo Missier Ian Foster Ravi Madduri David De Roure Carole Goble 《Concurrency and Computation》2010,22(9):1098-1117

When the emergence of ‘service‐oriented science,’ the need arises to orchestrate multiple services to facilitate scientific investigation—that is, to create ‘science workflows.’ We present here our findings in providing a workflow solution for the caGrid service‐based grid infrastructure. We choose BPEL and Taverna as candidates, and compare their usability in the lifecycle of a scientific workflow, including workflow composition, execution, and result analysis. Our experience shows that BPEL as an imperative language offers a comprehensive set of modeling primitives for workflows of all flavors; whereas Taverna offers a dataflow model and a more compact set of primitives that facilitates dataflow modeling and pipelined execution. We hope that this comparison study not only helps researchers to select a language or tool that meets their specific needs, but also offers some insight into how a workflow language and tool can fulfill the requirement of the scientific community. Copyright © 2009 John Wiley & Sons, Ltd. 相似文献

6.

Visualization workflows for level-12 HUC scales: Towards an expert system for watershed analysis in a distributed computing environment

《Environmental Modelling & Software》2016

Visualization workflows are important services for expert users to analyze watersheds when using our HydroTerre end-to-end workflows. Analysis is an interactive and iterative process and we demonstrate that the expert user can focus on model results, not data preparation, by using a web application to rapidly create, tune, and calibrate hydrological models anywhere in the continental USA (CONUS). The HydroTerre system captures user interaction for provenance and reproducibility to share modeling strategies with modelers. Our end-to-end workflow consists of four workflows. The first is data workflows using Essential Terrestrial Variables (ETV) data sets that we demonstrated to construct watershed models anywhere in the CONUS (Leonard and Duffy, 2013). The second is data-model workflows that transform the data workflow results to model inputs. The model inputs are consumed in the third workflow, model workflows (Leonard and Duffy, 2014a) that handle distribution of data and model within High Performance Computing (HPC) environments. This article focuses on our fourth workflow, visualization workflows, which consume the first three workflows to form an end-to-end system to create and share hydrological model results efficiently for analysis and peer review. We show how visualization workflows are incorporated into the HydroTerre infrastructure design and demonstrate the efficiency and robustness for an expert modeler to produce, analyze, and share new hydrological models using CONUS national datasets. 相似文献

7.

A formal semantics for the Taverna 2 workflow model

Jacek Sroka Jan Hidders Paolo Missier Carole Goble 《Journal of Computer and System Sciences》2010,76(6):490-508

This paper presents a formal semantics for the Taverna 2 scientific workflow system. Taverna 2 is a successor to Taverna, an open-source workflow system broadly adopted within the e-science community worldwide. The new version improves upon the existing model in two main ways: (i) by adding support for data pipelining, which in turns enables input streams of indefinite length to be processed efficiently; and (ii) by providing new extensibility points that make it possible to add new operators to the workflow model. Consistent with previous work by some of the authors, we use trace semantics to describe the effect of workflow computations, and we show how they can be used to describe the new features in the Taverna 2 model. 相似文献

8.

Automated data provenance capture in spreadsheets,with case studies

Hazeline U. Asuncion 《Future Generation Computer Systems》2013,29(8):2169-2181

One of the most important tasks in eScience is capturing the provenance of data. While scientists frequently use off-the-shelf analysis tools to process and manipulate data, current provenance techniques such as those based on scientific workflows are typically not able to trace internal data manipulations that occur within these tools. In this paper, we focus on one such off-the-shelf tool, MS Excel, which is used by many scientists; specifically, we propose InSituTrac, an automated in situ provenance approach for spreadsheet data in Excel. Our framework captures data provenance unobtrusively in the background, allows for user annotations, provides undo/redo functionality at various levels of granularity, presents the captured provenance in an accessible format, and visualizes captured provenance to support analysis of the provenance log. We highlight several motivating use case scenarios which show how provenance queries can be answered by our approach. Finally, case studies with an atmospheric science research group and a fisheries research group suggest that the automated provenance approach is both efficient and useful to scientists. 相似文献

9.

A Provenance-based Adaptive Scheduling Heuristic for Parallel Scientific Workflows in Clouds

Daniel de Oliveira Kary A. C. S. Oca?a Fernanda Bai?o Marta Mattoso 《Journal of Grid Computing》2012,10(3):521-552

In the last years, scientific workflows have emerged as a fundamental abstraction for structuring and executing scientific experiments in computational environments. Scientific workflows are becoming increasingly complex and more demanding in terms of computational resources, thus requiring the usage of parallel techniques and high performance computing (HPC) environments. Meanwhile, clouds have emerged as a new paradigm where resources are virtualized and provided on demand. By using clouds, scientists have expanded beyond single parallel computers to hundreds or even thousands of virtual machines. Although the initial focus of clouds was to provide high throughput computing, clouds are already being used to provide an HPC environment where elastic resources can be instantiated on demand during the course of a scientific workflow. However, this model also raises many open, yet important, challenges such as scheduling workflow activities. Scheduling parallel scientific workflows in the cloud is a very complex task since we have to take into account many different criteria and to explore the elasticity characteristic for optimizing workflow execution. In this paper, we introduce an adaptive scheduling heuristic for parallel execution of scientific workflows in the cloud that is based on three criteria: total execution time (makespan), reliability and financial cost. Besides scheduling workflow activities based on a 3-objective cost model, this approach also scales resources up and down according to the restrictions imposed by scientists before workflow execution. This tuning is based on provenance data captured and queried at runtime. We conducted a thorough validation of our approach using a real bioinformatics workflow. The experiments were performed in SciCumulus, a cloud workflow engine for managing scientific workflow execution. 相似文献

10.

Science in the Cloud: Allocation and Execution of Data-Intensive Scientific Workflows

Claudia Szabo Quan Z. Sheng Trent Kroeger Yihong Zhang Jian Yu 《Journal of Grid Computing》2014,12(2):245-264

An important challenge for the adoption of cloud computing in the scientific community remains the efficient allocation and execution of data-intensive scientific workflows to reduce execution time and the size of transferred data. The transferred data overhead is becoming significant with emerging scientific workflows that have input/output files and intermediate data products ranging in the hundreds of gigabytes. The allocation of scientific workflows on public clouds can be described through a variety of perspectives and parameters, and has been proved to be NP-complete. This paper proposes an evolutionary approach for task allocation on public clouds considering data transfer and execution time. In our framework, a solution is represented using an allocation chromosome that encodes the allocation of tasks to nodes, and an ordering chromosome that defines the execution order according to the scientific workflow representation. We propose a multi-objective optimization that relies on a cloud cost model and employs tailored evolution operators. Starting from a population of possible solutions, we employ crossover and mutation operators on both chromosomes aiming at optimizing the data transferred between nodes as well as the total workflow runtime. The crossover operators combine parts of solutions to reduce data overhead, whereas the mutation operators swamp between parts of the same chromosome according to pre-defined rules. Our experimental study compares between the proposed approach and current state-of-the art approaches using synthetic and real-life workflows. Our algorithm performs similarly to existing heuristics for small workflows and shows up to 80 % improvements for larger synthetic workflows. To further validate our approach we compare between the allocation and scheduling obtained by our approach with that obtained by popular scientific workflow managers, when real workflows with hundreds of tasks are executed on a public cloud. The results show a 10 % improvement in runtime over existing schedulers, caused by a 80 % reduction in transferred data and optimized allocation and ordering of tasks. This improved data locality has greater impact as it can be employed to improve and study data provenance and facilitate data persistence for scientific workflows. 相似文献

11.

Nonintrusive collection and management of data provenance in scientific workflows

Giorgos Tylissanakis Yiannis Cotronis 《Concurrency and Computation》2012,24(18):2268-2281

In this paper, we introduce an efficient mechanism to collect, store, and retrieve data provenance information in workflows of multiphysics simulations. Using notifications, we enable the nonintrusive collection of information about workflow events during workflow execution. Combining these events with workflow structure information, constant for every execution of a workflow, we obtain the data provenance information for the specific run of the workflow. Data provenance information is structured into a graph that represents workflow events on the basis of their causal dependency. We use a graph database to store this graph and utilize the traversal framework provided, to efficiently retrieve data provenance information from the graph by traversing backwards from a data object to every workflow event that is part of its provenance. Finally, we integrate data provenance information with semantics of workflow services to provide complete and meaningful data provenance information. Copyright © 2012 John Wiley & Sons, Ltd. 相似文献

12.

Deriving scientific workflows from algebraic experiment lines: A practical approach

《Future Generation Computer Systems》2017

The exploratory nature of a scientific computational experiment involves executing variations of the same workflow with different approaches, programs, and parameters. However, current approaches do not systematize the derivation process from the experiment definition to the concrete workflows and do not track the experiment provenance down to the workflow executions. Therefore, the composition, execution, and analysis for the entire experiment become a complex task. To address this issue, we propose the Algebraic Experiment Line (AEL). AEL uses a data-centric workflow algebra, which enriches the experiment representation by introducing a uniform data model and its corresponding operators. This representation and the AEL provenance model map concepts from the workflow execution data to the AEL derived workflows with their corresponding experiment abstract definitions. We show how AEL has improved the understanding of a real experiment in the bioinformatics area. By combining provenance data from the experiment and its corresponding executions, AEL provenance queries navigate from experiment concepts defined at high abstraction level to derived workflows and their execution data. It also shows a direct way of querying results from different trials involving activity variations and optionalities, only present at the experiment level of abstraction. 相似文献

13.

P-PIF: a ProvONE provenance interoperability framework for analyzing heterogeneous workflow specifications and provenance traces

Ajinkya Prabhune Aaron Zweig Rainer Stotzka Jürgen Hesser Michael Gertz 《Distributed and Parallel Databases》2018,36(1):219-264

相似文献

14.

A characterization of workflow management systems for extreme-scale applications

《Future Generation Computer Systems》2017

Automation of the execution of computational tasks is at the heart of improving scientific productivity. Over the last years, scientific workflows have been established as an important abstraction that captures data processing and computation of large and complex scientific applications. By allowing scientists to model and express entire data processing steps and their dependencies, workflow management systems relieve scientists from the details of an application and manage its execution on a computational infrastructure. As the resource requirements of today’s computational and data science applications that process vast amounts of data keep increasing, there is a compelling case for a new generation of advances in high-performance computing, commonly termed as extreme-scale computing, which will bring forth multiple challenges for the design of workflow applications and management systems. This paper presents a novel characterization of workflow management systems using features commonly associated with extreme-scale computing applications. We classify 15 popular workflow management systems in terms of workflow execution models, heterogeneous computing environments, and data access methods. The paper also surveys workflow applications and identifies gaps for future research on the road to extreme-scale workflows and management systems. 相似文献

15.

Abstract,link, publish,exploit: An end to end framework for workflow sharing

《Future Generation Computer Systems》2017

Scientific workflows are increasingly used to manage and share scientific computations and methods to analyze data. A variety of systems have been developed that store the workflows executed and make them part of public repositories However, workflows are published in the idiosyncratic format of the workflow system used for the creation and execution of the workflows. Browsing, linking and using the stored workflows and their results often becomes a challenge for scientists who may only be familiar with one system. In this paper we present an approach for addressing this issue by publishing and exploiting workflows as data on the Web with a representation that is independent from the workflow system used to create them. In order to achieve our goal, we follow the Linked Data Principles to publish workflow inputs, intermediate results, outputs and codes; and we reuse and extend well established standards like W3C PROV. We illustrate our approach by publishing workflows and consuming them with different tools designed to address common scenarios for workflow exploitation. 相似文献

16.

Model-as-you-go: An Approach for an Advanced Infrastructure for Scientific Workflows

Mirko Sonntag Dimka Karastoyanova 《Journal of Grid Computing》2013,11(3):553-583

Most of the existing scientific workflow systems rely on proprietary concepts and workflow languages. We are convinced that the conventional workflow technology that is established in business scenarios for years is also beneficial for scientists and scientific applications. We are therefore working on a scientific workflow system based on business workflow concepts and technologies. The system offers advanced flexibility features to scientists in order to support them in creating workflows in an explorative manner and to increase robustness of scientific applications. We named the approach Model-as-you-go because it enables users to model and execute workflows in an iterative process that eventually results in a complete scientific workflow. In this paper, we present main ingredients of Model-as-you-go, show how existing workflow concepts have to be extended in order to cover the requirements of scientists, discuss the application of the concepts to BPEL, and introduce the current prototype of the system. 相似文献

17.

Workflows for Heliophysics

Anja Le Blanc John Brooke Donal Fellows Marco Soldati David Pérez-Suárez Alessandro Marassi Andrej Santin 《Journal of Grid Computing》2013,11(3):481-503

In this paper we describe how we have introduced workflows into the working practices of a community for whom the concept of workflows is very new, namely the heliophysics community. Heliophysics is a branch of astrophysics which studies the Sun and the interactions between the Sun and the planets, by tracking solar events as they travel throughout the Solar system. Heliophysics produces two major challenges for workflow technology. Firstly it is a systems science where research is currently developed by many different communities who need reliable data models and metadata to be able to work together. Thus it has major challenges in the semantics of workflows. Secondly, the problem of time is critical in heliophysics; the workflows must take account of the propagation of events outwards from the sun. They have to address the four dimensional nature of space and time in terms of the indexing of data. We discuss how we have built an environment for Heliophysics workflows building on and extending the Taverna workflow system and utilising the myExperiment site for sharing workflows. We also describe how we have integrated the workflows into the existing practices of the communities involved in Heliophysics by developing a web portal which can hide the technical details from the users, who can concentrate on the data from their scientific point of view rather than on the methods used to integrate and process the data. This work has been developed in the EU Framework 7 project HELIO, and is being disseminated to the worldwide Heliophysics community, since Heliophysics requires integration of effort on a global scale. 相似文献

18.

DFL: A dataflow language based on Petri nets and nested relational calculus

Jan Hidders Natalia Kwasnikowska Jacek Sroka Jerzy Tyszkiewicz Jan Van den Bussche 《Information Systems》2008

In this paper we propose DFL—a formal, graphical workflow language for dataflows, i.e., workflows where large amounts of complex data are manipulated, and the structure of the manipulated data is reflected in the structure of the workflow. It is a common extension of (1) Petri nets, which are responsible for the organization of the processing tasks, and (2) nested relational calculus, which is a database query language over complex objects, and is responsible for handling collections of data items (in particular, for iteration) and for the typing system. We demonstrate that dataflows constructed in a hierarchical manner, according to a set of refinement rules we propose, are semi-sound, i.e., initiated with a single token (which may represent a complex scientific data collection) in the input node, terminate with a single token in the output node (which represents the output data collection). In particular they never leave any “debris data” behind and an output is always eventually computed regardless of how the computation proceeds. 相似文献

19.

Improving workflow modularity using a concern-specific layer on top of Unify

《Science of Computer Programming》2014

Workflows are a popular means of automating processes in many domains, ranging from high-level business process modeling to lower-level web service orchestration. However, state-of-the-art workflow languages offer a limited set of modularization mechanisms. This results in monolithic workflow specifications, in which different concerns are scattered across the workflow and tangled with one another. This hinders the design, evolution, and reusability of workflows expressed in these languages. We address this problem through the Unify framework. This framework enables uniform modularization of workflows by supporting the specification of all workflow concerns – including crosscutting ones – in isolation of each other. These independently specified workflow concerns are connected to each other using workflow-specific connectors. In order to further facilitate the development of workflows, we enable the definition of concern-specific languages (CSLs) on top of the Unify framework. A CSL facilitates the expression of a family of workflow concerns by offering abstractions that map well to the concerns' domain. Thus, domain experts can add concerns to a workflow using concern-specific language constructs. We exemplify the specification of a workflow in Unify, and show the definition and application of two concern-specific languages built on top of Unify. 相似文献

20.

Zhangbing Zhou Zehui Cheng Yueqin Zhu 《中国科学:信息科学(英文版)》2016,59(11):113101

This article proposes to identify and recommend scientific workflows for reuse and repurposing. Specifically, a scientific workflow is represented as a layer hierarchy that specifies the hierarchical relations between this workflow, its sub-workflows, and activities. Semantic similarity is calculated between layer hierarchies of workflows. A graph-skeleton based clustering technique is adopted for grouping layer hierarchies into clusters. Barycenters in each cluster are identified, which serve as core workflows in this cluster, for facilitating the cluster identification and workflow ranking and recommendation with respect to the requirement of scientists. 相似文献