Similar Literature
1.
Many scientific workflows are data intensive: large volumes of intermediate datasets are generated during their execution. Some valuable intermediate datasets need to be stored for sharing or reuse. Traditionally, they are selectively stored according to the system storage capacity, with the selection determined manually. As doing science on clouds has become popular, more intermediate datasets in scientific cloud workflows can be stored under different storage strategies based on a pay-as-you-go model. In this paper, we build an intermediate data dependency graph (IDG) from the data provenance in scientific workflows. With the IDG, deleted intermediate datasets can be regenerated, and on this basis we develop a novel algorithm that can find a minimum-cost storage strategy for the intermediate datasets in scientific cloud workflow systems. The strategy achieves the best trade-off between computation cost and storage cost by automatically storing the most appropriate intermediate datasets in cloud storage. This strategy can be utilised on demand as a minimum-cost benchmark for all other intermediate dataset storage strategies in the cloud. We utilise Amazon's cloud cost model and apply the algorithm to general random workflows as well as a specific astrophysics pulsar-searching scientific workflow for evaluation. The results show that the benchmark effectively demonstrates cost-effectiveness compared with other representative storage strategies.
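
A minimal sketch of the benchmark idea for a linear IDG, in Python; the dataset sizes, regeneration costs, usage rates, and storage price below are illustrative assumptions, not figures from the paper. Brute-force subset enumeration is only feasible for tiny chains; the paper's algorithm finds the minimum-cost strategy without it.

```python
from itertools import combinations

# Assumed per-dataset parameters: size (GB), cost to regenerate from its
# predecessor ($ per run), and uses per month.
datasets = [
    {"size": 120, "gen_cost": 4.0, "uses": 8},
    {"size": 300, "gen_cost": 9.0, "uses": 2},
    {"size": 60,  "gen_cost": 1.5, "uses": 20},
]
STORAGE_RATE = 0.15  # $/GB/month, an assumed cloud storage price

def monthly_cost(stored):
    total = 0.0
    for i, d in enumerate(datasets):
        if i in stored:
            total += STORAGE_RATE * d["size"]   # storage cost
        else:
            # Regenerating dataset i re-runs every deleted ancestor
            # back to the nearest stored one (linear IDG).
            regen, j = 0.0, i
            while j >= 0 and j not in stored:
                regen += datasets[j]["gen_cost"]
                j -= 1
            total += regen * d["uses"]          # computation cost
    return total

subsets = (set(c) for r in range(len(datasets) + 1)
           for c in combinations(range(len(datasets)), r))
best = min(subsets, key=monthly_cost)
print(best, round(monthly_cost(best), 2))       # the minimum-cost benchmark
```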

2.
李学俊  吴洋  刘晓  程慧敏  朱二周  杨耘 《软件学报》2016,27(7):1861-1875
Scientific workflows are complex, data-intensive applications. How to lay out data effectively in a hybrid-cloud environment is an important problem facing scientific workflows; in particular, the security requirements of hybrid clouds bring new challenges to data-placement research for scientific cloud workflows. Most traditional data-placement methods use a load-balancing-based partition model to place datasets. Such methods achieve well-balanced placements, but their transfer time is not optimal. To address these shortcomings, and considering the characteristics of data placement in hybrid clouds, this paper first designs a matrix-partition model based on the degree of data-dependency destruction, which generates the partition that damages data dependency the least; it then proposes a data-center-oriented data-placement method that, following the partition model, places highly dependent datasets in the same data center whenever possible, thereby reducing the time spent transferring datasets across data centers. Experimental results show that the method effectively shortens cross-data-center data transfer time during scientific workflow execution.
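
A toy illustration of the placement idea, assuming a small symmetric dependency matrix and one private dataset pinned to the private data center; the greedy rule below stands in for the paper's matrix-partition model.

```python
import numpy as np

# dep[i][j]: number of tasks that use both dataset i and dataset j (assumed).
dep = np.array([[0, 5, 1, 0],
                [5, 0, 4, 2],
                [1, 4, 0, 6],
                [0, 2, 6, 0]])
placement = {0: 0}   # dataset 0 is private: pinned to data center 0 (assumed)

# Place the most dependency-heavy datasets first, each into the data center
# holding the datasets it depends on most, so little dependency is broken.
for d in sorted(range(len(dep)), key=lambda i: -dep[i].sum()):
    if d in placement:
        continue
    affinity = [sum(int(dep[d][o]) for o, c in placement.items() if c == dc)
                for dc in (0, 1)]
    placement[d] = int(np.argmax(affinity))

broken = sum(int(dep[i][j]) for i in placement for j in placement
             if i < j and placement[i] != placement[j])
print(placement, "dependency broken:", broken)
```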

3.
The problem of Virtual Machine (VM) placement is critical to the security and efficiency of the cloud infrastructure. Most current research focuses on the influence of deployed VMs on data center load, energy consumption, resource loss, and so on; few works consider the security and privacy of tenant data on the VMs. For instance, with the application of virtualization technology, VMs from different tenants may be placed on one physical host. Hence, attackers may steal secrets from other tenants via side-channel attacks that exploit shared physical resources, which threatens the data security of tenants in cloud computing. To address these issues, this paper proposes an efficient and secure VM placement strategy. Firstly, we define the relevant security and efficiency indices of the cloud computing system. Then, we establish a multi-objective constrained optimization model for VM placement that considers both the security and the performance of the system, and solve this model with a discrete firefly algorithm. Experimental results on the OpenStack cloud platform indicate that this strategy can effectively reduce the likelihood of malicious tenants and targeted tenants residing on the same physical node, and can reduce energy consumption and resource loss in the data center.
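
A sketch of how the security/efficiency trade-off can be expressed as a single fitness to minimize; the tenant labels, weights, and the random search (standing in here for the discrete firefly algorithm) are all assumptions for illustration.

```python
import random

HOSTS = 4
# (tenant type, cores) per VM -- assumed inputs
vms = [("target", 2), ("malicious", 1), ("normal", 3), ("normal", 2)]

def fitness(placement):  # placement[i] = host index of VM i; lower is better
    # Security term: malicious/target pairs sharing a host (side-channel risk).
    co_resident = sum(1 for i, (ti, _) in enumerate(vms)
                        for j, (tj, _) in enumerate(vms)
                        if i < j and placement[i] == placement[j]
                        and {ti, tj} == {"target", "malicious"})
    hosts_used = len(set(placement))       # crude proxy for energy consumption
    return 10 * co_resident + hosts_used   # security weighted higher (assumed)

best = min((tuple(random.randrange(HOSTS) for _ in vms) for _ in range(2000)),
           key=fitness)
print(best, fitness(best))
```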

4.
An important challenge for the adoption of cloud computing in the scientific community remains the efficient allocation and execution of data-intensive scientific workflows to reduce execution time and the size of transferred data. The transferred-data overhead is becoming significant with emerging scientific workflows whose input/output files and intermediate data products range in the hundreds of gigabytes. The allocation of scientific workflows on public clouds can be described through a variety of perspectives and parameters, and has been proved to be NP-complete. This paper proposes an evolutionary approach for task allocation on public clouds that considers data transfer and execution time. In our framework, a solution is represented using an allocation chromosome that encodes the allocation of tasks to nodes, and an ordering chromosome that defines the execution order according to the scientific workflow representation. We propose a multi-objective optimization that relies on a cloud cost model and employs tailored evolution operators. Starting from a population of possible solutions, we employ crossover and mutation operators on both chromosomes, aiming to optimize the data transferred between nodes as well as the total workflow runtime. The crossover operators combine parts of solutions to reduce data overhead, whereas the mutation operators swap parts of the same chromosome according to pre-defined rules. Our experimental study compares the proposed approach with current state-of-the-art approaches using synthetic and real-life workflows. Our algorithm performs similarly to existing heuristics for small workflows and shows up to 80% improvement for larger synthetic workflows. To further validate our approach, we compare the allocation and scheduling obtained by our approach with those obtained by popular scientific workflow managers when real workflows with hundreds of tasks are executed on a public cloud. The results show a 10% improvement in runtime over existing schedulers, driven by an 80% reduction in transferred data and optimized allocation and ordering of tasks. This improved data locality has greater impact as it can be employed to improve and study data provenance and facilitate data persistence for scientific workflows.
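
A sketch of the dual-chromosome encoding and the operators described above; the task/node counts and the specific operator choices are illustrative assumptions.

```python
import random

TASKS, NODES = 6, 3
alloc = [random.randrange(NODES) for _ in range(TASKS)]  # allocation chromosome
order = random.sample(range(TASKS), TASKS)               # ordering chromosome

def mutate(alloc, order):
    a, o = alloc[:], order[:]
    a[random.randrange(TASKS)] = random.randrange(NODES)  # reallocate one task
    i, j = random.sample(range(TASKS), 2)
    o[i], o[j] = o[j], o[i]          # swap two positions in the execution order
    return a, o

def crossover(a1, a2):
    cut = random.randrange(1, TASKS)
    return a1[:cut] + a2[cut:]       # one-point crossover on the allocation

child = crossover(alloc, mutate(alloc, order)[0])
print(alloc, order, child)
```

A real implementation would need a permutation-preserving crossover for the ordering chromosome and a fitness combining transfer volume and makespan, as the abstract describes.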

5.
Scientific workflow applications are complex, data-intensive applications, commonly used in disciplines that involve distributed data sources, such as structural biology, high-energy physics, and neuroscience. Because the data are stored, dispersed, on Internet-based cloud computing platforms, executing a scientific workflow involves large amounts of data transfer. Cloud computing is a pay-per-use model, so data transfer incurs transfer charges; when multiple workflows cooperate with one another, the transfer costs grow even higher. This paper builds, from a global perspective, a transfer-cost model based on a multi-workflow data dependency graph, and studies a data-placement optimization strategy based on binary particle swarm optimization (BPSO) to reduce the fees for renting cloud transfer resources.
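
A minimal sketch of a BPSO bit update for a placement bit-vector, using the conventional sigmoid transfer function; the inertia and learning factors are standard values, not the paper's parameters.

```python
import math
import random

def bpso_step(x, v, pbest, gbest, w=0.7, c1=1.4, c2=1.4):
    """One BPSO iteration over a bit-vector x with velocity v."""
    for i in range(len(x)):
        r1, r2 = random.random(), random.random()
        v[i] = (w * v[i] + c1 * r1 * (pbest[i] - x[i])
                         + c2 * r2 * (gbest[i] - x[i]))
        # Sigmoid maps velocity to the probability that bit i becomes 1.
        x[i] = 1 if random.random() < 1 / (1 + math.exp(-v[i])) else 0
    return x, v

# bit j could mean "dataset is placed in data center j" (assumed encoding)
x = [random.randint(0, 1) for _ in range(8)]
v = [0.0] * 8
x, v = bpso_step(x, v,
                 pbest=[1, 0, 1, 0, 1, 0, 1, 0],
                 gbest=[1, 1, 0, 0, 1, 1, 0, 0])
print(x)
```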

6.
Efficient data-aware methods in job scheduling, distributed storage management, and data management platforms are necessary for the successful execution of data-intensive applications. However, research on methods for data-intensive scientific applications in large-scale distributed cloud and cluster computing environments is insufficient, and data-aware methods are becoming more complex. In this paper, we propose a Data-Locality Aware Workflow Scheduling (D-LAWS) technique and a locality-aware resource management method for data-intensive scientific workflows in HPC cloud environments. D-LAWS applies data locality and data transfer time, based on network bandwidth, to scientific workflow task scheduling, and balances resource utilization and task parallelism at the node level. Our method consolidates VMs and considers task parallelism by data flow when planning the task executions of a data-intensive scientific workflow. We additionally consider more complex workflow models and the data locality pertaining to the placement and transfer of data prior to task execution. We implement and validate the methods based on fairness in cloud environments. Experimental results show that the proposed methods can improve the performance and data locality of data-intensive workflows in cloud environments.
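
A sketch of locality-aware node selection: estimate the transfer time each candidate node would incur for a task's missing inputs and pick the cheapest. The file sizes and bandwidths are assumed.

```python
# input file -> size (GB), and per-node local files + network bandwidth (assumed)
task_inputs = {"a.dat": 40, "b.dat": 25}
nodes = {
    "n1": {"has": {"a.dat"}, "bw_gbps": 10},
    "n2": {"has": {"a.dat", "b.dat"}, "bw_gbps": 1},
    "n3": {"has": set(), "bw_gbps": 10},
}

def transfer_time(node):
    # Only inputs not already local must be moved over the network.
    missing = sum(sz for f, sz in task_inputs.items() if f not in node["has"])
    return missing * 8 / node["bw_gbps"]    # GB -> Gb, divided by Gbps

best = min(nodes, key=lambda n: transfer_time(nodes[n]))
print(best, transfer_time(nodes[best]), "s")   # n2: everything is local
```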

7.
覃浩  王平辉  张若非  覃遵颖 《软件学报》2023,34(3):1292-1309
Key-frame retrieval and attribute search in surveillance video have many applications in transportation, security, education, and other fields. Applying deep learning models to massive video data relieves some of the human effort, but suffers from privacy leakage, heavy computational resource consumption, and long processing times. For these scenarios, this paper proposes a secure and fast video retrieval model for large-scale surveillance video. Specifically, exploiting the fact that the cloud has abundant compute power while surveillance cameras have little, a heavyweight model is deployed in the cloud and customized through knowledge distillation with the proposed tolerant training strategy; the distilled lightweight model is deployed inside the surveillance cameras. Meanwhile, a local encryption algorithm encrypts the sensitive parts of the images, and, combined with cloud-side TEE technology and a user authorization mechanism, privacy is protected at very low resource cost. By properly controlling the "tolerance" of the distillation strategy, the time spent in the camera-side video ingestion stage and the cloud-side retrieval stage can be well balanced, guaranteeing very low retrieval latency while maintaining very high accuracy. Compared with traditional retrieval methods, the model is secure, efficient, scalable, and low-latency. Experimental results show that, on multiple public datasets, the model provides a 9x-133x speedup over traditional retrieval methods.
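
A sketch of temperature-based knowledge distillation with a "tolerance" gate that zeroes the loss on samples the student already handles; this gating is only our reading of the tolerant-training idea, and the margin and temperature values are assumed.

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp(z / T - np.max(z / T, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def tolerant_kd_loss(teacher_logits, student_logits, T=4.0, margin=0.1):
    p_t = softmax(teacher_logits, T)   # soft targets from the cloud model
    p_s = softmax(student_logits, T)   # camera-side lightweight model
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    # "Tolerance": samples where the student already agrees closely enough
    # contribute no loss (assumed interpretation of the strategy).
    agree = (teacher_logits.argmax(-1) == student_logits.argmax(-1)) & (kl < margin)
    return float(np.mean(np.where(agree, 0.0, kl)))

t = np.random.randn(32, 10)
s = t + 0.3 * np.random.randn(32, 10)
print(tolerant_kd_loss(t, s))
```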

8.
Security is increasingly critical for various scientific workflows that are big-data applications and typically take a considerable amount of time to execute on large-scale distributed infrastructures. A cloud computing platform is such an infrastructure, enabling dynamic resource scaling on demand. Nevertheless, under the pay-per-use, hourly-based pricing model, users should pay attention to the cost incurred by renting virtual machines (VMs) from cloud data centers. Meanwhile, workflow tasks are generally heterogeneous and require different instance series (i.e., computing optimized, memory optimized, storage optimized, etc.). In this paper, we propose a security- and cost-aware scheduling (SCAS) algorithm for the heterogeneous tasks of scientific workflows in clouds. Our proposed algorithm is based on a meta-heuristic optimization technique, particle swarm optimization (PSO), whose coding strategy is devised to minimize the total workflow execution cost while meeting the deadline and risk-rate constraints. Extensive experiments using three real-world scientific workflow applications and the CloudSim simulation framework demonstrate the effectiveness and practicality of our algorithm.
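
A sketch of the constrained fitness such a PSO could minimize: execution cost plus penalty terms for violating the deadline and risk-rate constraints. The weights and limits below are assumptions, not the paper's parameters.

```python
def scas_fitness(cost, makespan, risk_rate,
                 deadline=3600, max_risk=0.1, penalty=1e6):
    """Lower is better; infeasible schedules are heavily penalized."""
    f = cost
    if makespan > deadline:                 # deadline constraint violated
        f += penalty * (makespan - deadline) / deadline
    if risk_rate > max_risk:                # risk-rate constraint violated
        f += penalty * (risk_rate - max_risk)
    return f

print(scas_fitness(cost=42.0, makespan=3500, risk_rate=0.05))  # feasible
print(scas_fitness(cost=30.0, makespan=4000, risk_rate=0.20))  # penalized
```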

9.
In recent years, scientific workflows have emerged as a fundamental abstraction for structuring and executing scientific experiments in computational environments. Scientific workflows are becoming increasingly complex and more demanding in terms of computational resources, thus requiring the use of parallel techniques and high-performance computing (HPC) environments. Meanwhile, clouds have emerged as a new paradigm where resources are virtualized and provided on demand. By using clouds, scientists have expanded beyond single parallel computers to hundreds or even thousands of virtual machines. Although the initial focus of clouds was to provide high-throughput computing, clouds are already being used to provide an HPC environment where elastic resources can be instantiated on demand during the course of a scientific workflow. However, this model also raises many open, yet important, challenges such as scheduling workflow activities. Scheduling parallel scientific workflows in the cloud is a very complex task, since we have to take many different criteria into account and explore the elasticity characteristic to optimize workflow execution. In this paper, we introduce an adaptive scheduling heuristic for the parallel execution of scientific workflows in the cloud that is based on three criteria: total execution time (makespan), reliability, and financial cost. Besides scheduling workflow activities based on a 3-objective cost model, this approach also scales resources up and down according to the restrictions imposed by scientists before workflow execution. This tuning is based on provenance data captured and queried at runtime. We conducted a thorough validation of our approach using a real bioinformatics workflow. The experiments were performed in SciCumulus, a cloud workflow engine for managing scientific workflow execution.
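
A sketch of how the three criteria can be folded into one score when choosing a VM for the next activity; the weights and normalization bounds are assumptions, not the paper's cost model.

```python
def schedule_score(makespan_s, failure_prob, cost_usd,
                   w_time=0.5, w_rel=0.3, w_cost=0.2,
                   max_time=7200.0, max_cost=10.0):
    """Weighted 3-objective score: lower is better (assumed weights)."""
    return (w_time * makespan_s / max_time
            + w_rel * failure_prob
            + w_cost * cost_usd / max_cost)

# candidate VM -> (estimated makespan s, failure probability, cost $)
candidates = {"small_vm": (5400, 0.02, 1.2), "large_vm": (1800, 0.05, 4.8)}
best = min(candidates, key=lambda v: schedule_score(*candidates[v]))
print(best)
```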

10.
Workflows are used to orchestrate data-intensive applications in many different scientific domains. Workflow applications typically communicate data between processing steps using intermediate files. When tasks are distributed, these files are either transferred from one computational node to another, or accessed through a shared storage system. As a result, the efficient management of data is a key factor in achieving good performance for workflow applications in distributed environments. In this paper we investigate some of the ways in which data can be managed for workflows in the cloud. We ran experiments using three typical workflow applications on Amazon’s EC2 cloud computing platform. We discuss the various storage and file systems we used, describe the issues and problems we encountered deploying them on EC2, and analyze the resulting performance and cost of the workflows.

11.
Haoyu Luo  Jin Liu  Xiao Liu  Yun Yang 《Software》2018,48(4):775-795
Workflow temporal violations, namely intermediate workflow runtime delays, often occur and have a serious impact on the on-time completion of massive concurrent requests. Therefore, accurate prediction of cloud workflow temporal violations is critical, as the result can serve as an essential reference for temporal violation prevention and handling strategies. Conventional studies mainly focus on the time delays of a single workflow activity or a single workflow instance but overlook the propagation of time delays among them. This is a serious problem, as time delays can propagate in a cloud workflow system due to resource sharing and the dependencies among workflow activities. This paper first proposes a novel temporal violation transmission model, inspired by an epidemic model, to capture the dynamics of time-delay propagation. Afterward, a novel temporal violation prediction strategy is presented to estimate the number of temporal violations that may occur and determine the number of violations that must be handled to achieve the target service-level agreement, namely the on-time completion rate. To the best of our knowledge, this is the first attempt to predict cloud workflow temporal violations at the workflow build-time stage by analyzing the propagation of temporal violations. Experimental results demonstrate that our strategy makes highly accurate predictions and is scalable to a large batch of parallel workflows running in the cloud.
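
A toy discrete-time simulation in the spirit of the epidemic model: at each step a delay either persists or is absorbed, and propagates downstream with some probability. The dependency graph and probabilities are assumed.

```python
import random

deps = {0: [1, 2], 1: [3], 2: [3], 3: []}  # activity -> downstream activities
P_TRANSMIT, P_RECOVER = 0.4, 0.3           # per-step probabilities (assumed)

violated = {0}                             # initially delayed activities
for _ in range(10):
    nxt = set()
    for a in violated:
        if random.random() >= P_RECOVER:   # delay not absorbed by slack time
            nxt.add(a)
        for b in deps[a]:
            if random.random() < P_TRANSMIT:   # delay propagates downstream
                nxt.add(b)
    violated = nxt

# Averaging many such runs estimates how many violations must be handled
# to reach the target on-time completion rate.
print("violations still active:", len(violated))
```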

12.
尚蕾  刘茜萍 《计算机工程》2020,46(5):122-130,138
Data placement for scientific workflows in the cloud has become a hot topic in current workflow research. Analyzing the many-to-many relationships between tasks and datasets in a scientific workflow reveals that different data placement schemes incur different data transfer charges, which to a large extent determine the workflow's operating cost. To reduce the transfer charges for scientific workflow datasets, this paper proposes a data placement method based on task assignment and dataset replicas. The method starts from task assignment, assigning tasks on the basis of quantitatively computed task dependency degrees, and then, according to the assignment result, applies a two-stage data placement method based on dataset replicas to optimize the transfer charges incurred while the scientific workflow runs. A case study shows that, compared with the workflow-level method, this method can effectively reduce the operating cost of scientific workflows.
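
A toy sketch of the two stages, assuming a tiny task/dataset mapping: tasks sharing more datasets are grouped together, then each dataset is replicated to every data center whose tasks need it.

```python
tasks = {"t1": {"d1", "d2"}, "t2": {"d2", "d3"}, "t3": {"d4"}}  # assumed
N_CENTERS = 2

def dependency(a, b):           # task dependency degree: shared datasets
    return len(tasks[a] & tasks[b])

# Stage 1: assign each task to the center holding its most-coupled peers.
centers = {}
for t in sorted(tasks, key=lambda t: -len(tasks[t])):
    scores = [sum(dependency(t, o) for o, c in centers.items() if c == dc)
              for dc in range(N_CENTERS)]
    if any(scores):
        centers[t] = max(range(N_CENTERS), key=scores.__getitem__)
    else:  # no coupling yet: go to the least-loaded center
        centers[t] = min(range(N_CENTERS),
                         key=lambda dc: sum(1 for c in centers.values()
                                            if c == dc))

# Stage 2: place a replica of each dataset in every center whose tasks use it.
replicas = {}
for t, dc in centers.items():
    for d in tasks[t]:
        replicas.setdefault(d, set()).add(dc)
print(centers, replicas)
```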

13.
Cloud computing has established itself as an interesting computational model that provides a wide range of resources, such as storage, databases, and computing power, for several types of users. Recently, the concept of cloud computing was extended with the concept of federated clouds, where resources from different cloud providers are inter-connected to perform a common action (e.g., execute a scientific workflow). Users can benefit from both single-provider and federated cloud environments to execute their scientific workflows, since they can get the necessary amount of resources on demand. Many of these workflows demand high performance and parallelism techniques, since many activities are data- and computing-intensive and can execute for hours, days, or even weeks. Some Scientific Workflow Management Systems (SWfMS) already provide parallelism capabilities for scientific workflows in single-provider clouds. Most of them rely on creating a virtual cluster to execute the workflow in parallel, but they also rely on the user to estimate the number of virtual machines to allocate for this virtual cluster, and most SWfMS use this initial, user-made configuration for the entire workflow execution. Dimensioning the virtual cluster for parallel workflow execution is therefore a top-priority task, since an under- or over-dimensioned virtual cluster can degrade workflow performance or unnecessarily increase financial costs. This dimensioning is far from trivial in a single-provider cloud, and especially so in federated clouds, due to the huge number of virtual machine types to choose from in each location and provider. In this article, we propose an approach named GraspCC-fed to produce an optimal (or near-optimal) estimate of the number of virtual machines to allocate for each workflow. GraspCC-fed extends a previously proposed GRASP-based heuristic for executing standalone applications to consider scientific workflows executed in both single-provider and federated clouds. For the experiments, GraspCC-fed was coupled to an adapted version of the SciCumulus workflow engine for federated clouds. We believe that GraspCC-fed can be an important decision-support tool, helping users determine an optimal virtual cluster configuration for parallel cloud-based scientific workflows.
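
A compact GRASP-style skeleton for dimensioning a virtual cluster; the VM catalogue, workload model, and the omission of the local-search phase are all simplifying assumptions for illustration.

```python
import random

VM_TYPES = {"small": (2, 0.1), "medium": (4, 0.2), "large": (8, 0.4)}  # cores, $/h
WORK_CORE_HOURS, DEADLINE_H = 64, 4   # assumed workload and deadline

def cost(cluster):   # cluster: list of VM type names
    cores = sum(VM_TYPES[t][0] for t in cluster)
    hours = WORK_CORE_HOURS / max(cores, 1)
    if hours > DEADLINE_H:
        return float("inf")           # infeasible: misses the deadline
    return sum(VM_TYPES[t][1] for t in cluster) * hours

def grasp(iters=200, rcl=2):
    best = None
    for _ in range(iters):
        sol = []
        while cost(sol) == float("inf"):   # greedy-randomized construction
            ranked = sorted(VM_TYPES,
                            key=lambda t: VM_TYPES[t][1] / VM_TYPES[t][0])
            sol.append(random.choice(ranked[:rcl]))  # pick from the RCL
        if best is None or cost(sol) < cost(best):   # local search omitted here
            best = sol
    return best

b = grasp()
print(b, round(cost(b), 3))
```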

14.
Cloud backup has been an important issue ever since large quantities of valuable data began to be stored on personal computing devices. Data reduction techniques, such as deduplication, delta encoding, and Lempel-Ziv (LZ) compression, performed at the client side before data transfer can help ease cloud backup by saving network bandwidth and reducing cloud storage space. However, client-side data reduction in cloud backup services faces efficiency and privacy challenges. In this paper, we present Pangolin, a secure and efficient cloud backup service for personal data storage that exploits application awareness. It can speed up backup operations through an application-aware client-side data reduction technique, and mitigate data security risks by integrating selective encryption into data reduction for sensitive applications. Our experimental evaluation, based on a prototype implementation, shows that our scheme can improve data reduction efficiency over state-of-the-art methods by shortening the backup window size to 33%-75%, and that its security mechanism for sensitive applications has negligible impact on backup window size.
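
A sketch of the client-side pipeline: chunk deduplication, LZ compression, and selective encryption for sensitive applications. The XOR "cipher" is a placeholder to keep the example self-contained, not Pangolin's mechanism, and the application classification is assumed.

```python
import hashlib
import zlib

SENSITIVE_APPS = {"mail", "finance"}   # assumed classification
seen_chunks = set()

def encrypt(data: bytes) -> bytes:     # placeholder stand-in -- NOT secure
    return bytes(b ^ 0x5A for b in data)

def reduce_and_protect(chunk: bytes, app: str):
    digest = hashlib.sha256(chunk).digest()
    if digest in seen_chunks:          # duplicate chunk: send a reference only
        return ("ref", digest)
    seen_chunks.add(digest)
    data = zlib.compress(chunk)        # LZ-family compression
    if app in SENSITIVE_APPS:
        data = encrypt(data)           # selective encryption
    return ("data", data)

print(reduce_and_protect(b"hello" * 100, "mail")[0])   # -> "data"
print(reduce_and_protect(b"hello" * 100, "mail")[0])   # -> "ref" (deduplicated)
```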

15.
李俊祺  林伟伟  石方  李克勤 《软件学报》2022,33(11):3944-3966
Virtual machine (VM) consolidation in data centers is a current research hotspot in cloud computing. Minimizing the server energy consumption of a cloud data center while guaranteeing quality of service (QoS) is, in essence, an NP-hard multi-objective optimization problem. To better solve this problem, this paper proposes a hybrid swarm-intelligence, energy-efficient VM consolidation method (HSI-VMC) for heterogeneous-server cloud environments, based on differential evolution and particle swarm optimization. The method comprises a peak-efficiency-based static-threshold overloaded-server detection strategy (PEBST), a migration-ratio-based strategy for selecting VMs to migrate (MRB), a target-server selection strategy, a hybrid discrete heuristic differential-evolution particle-swarm-optimization VM placement algorithm (HDH-DEPSO), and a load-average-based underloaded-server handling strategy (AVG). The combination of the PEBST, MRB, and AVG strategies detects overloaded and underloaded servers from the servers' peak efficiency ratios and average CPU loads and selects suitable VMs for migration, reducing the service-level agreement violation rate (SLAV) caused by load fluctuations as well as the number of VM migrations. HDH-DEPSO combines the strengths of DE and PSO to search for better VM placement solutions, keeping servers running as close to their peak efficiency ratio as possible and reducing server energy consumption. A series of experiments on real cloud workload datasets (PlanetLab/Mix/Gan) shows that, compared with several mainstream energy-efficient VM consolidation methods, HSI-VMC balances multiple QoS metrics better and effectively reduces the server energy consumption of cloud data centers.
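
A sketch of the PEBST and MRB rules as we read them: a static threshold at the utilization of peak performance-per-watt, and VM selection by a migration value ratio. The efficiency curve, threshold value, and ratio definition are simplified assumptions.

```python
def overloaded(cpu_util, peak_eff_util=0.7):
    # PEBST idea: beyond the utilization where performance-per-watt peaks,
    # treat the host as overloaded (static per-host-type threshold, assumed).
    return cpu_util > peak_eff_util

def pick_vm_to_migrate(vms):
    # MRB idea (assumed form): prefer the VM that frees the most CPU per
    # GB of memory moved, i.e. cheapest migration for the biggest relief.
    return max(vms, key=lambda v: v["cpu_share"] / v["ram_gb"])

host = {"cpu_util": 0.85,
        "vms": [{"name": "vm1", "cpu_share": 0.30, "ram_gb": 8},
                {"name": "vm2", "cpu_share": 0.25, "ram_gb": 2}]}
if overloaded(host["cpu_util"]):
    print("migrate:", pick_vm_to_migrate(host["vms"])["name"])   # -> vm2
```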

16.
A deep neural network relies on a series of algorithms that endeavor to recognize underlying relationships in a set of data. To protect the privacy of users' datasets, traditional schemes perform the prediction task with only a single data provider in the system. However, in the real world the data may come from multiple separate data providers rather than a single data source, since each data provider might hold partial features of a complete prediction sample. This requires multiple data providers to cooperate in performing the prediction: each sends its own local data to a well-trained prediction model deployed on a remote cloud server to obtain a predictive label. However, the data owned by multiple data providers usually contain a large amount of private information, which can lead to serious security problems once leaked. To resolve the security and privacy issues of data owned by multiple data providers, in this paper we propose a Privacy-Preserving Neural Network Prediction model (PPNNP) that applies multi-client inner-product functional encryption to the first layer of the prediction model. Multiple data providers encrypt their data and upload it to a well-trained model deployed on a cloud server, and the server makes predictions by calculating the inner products related to them. This provides sufficient privacy and security for the data while allowing different neural network architectures, even with non-linear activation functions, to be deployed on the remote server. We evaluate our scheme on real datasets and provide a comparison with related schemes. Experimental results demonstrate that our scheme reduces the computational cost of the whole process while significantly reducing encryption time. It achieves an accuracy of over 90% across different network architectures, even with non-linear activation functions. Meanwhile, our solution reduces the communication overhead of the whole protocol.
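
A plain-math sketch of the split first-layer computation, with encryption omitted: the first layer over the concatenated features equals the sum of per-provider inner products, which is the quantity the server reconstructs in PPNNP. The shapes and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))      # first-layer weights, held by the server
x1 = rng.standard_normal(3)          # provider 1's partial features
x2 = rng.standard_normal(3)          # provider 2's partial features

# W @ concat(x1, x2) == W[:, :3] @ x1 + W[:, 3:] @ x2
full = W @ np.concatenate([x1, x2])
split = W[:, :3] @ x1 + W[:, 3:] @ x2   # computed per provider slice
print(np.allclose(full, split))          # -> True

# In PPNNP each provider would upload an encryption of its slice, and the
# functional decryption key would reveal only these inner products.
```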

17.
In many-task computing (MTC), applications such as scientific workflows or parameter sweeps communicate via intermediate files; application performance strongly depends on the file system in use. The state of the art uses runtime systems providing in-memory file storage that is designed for data locality: files are placed on those nodes that write or read them. With data locality, however, task distribution conflicts with data distribution, leading to application slowdown, and worse, to prohibitive storage imbalance. To overcome these limitations, we present MemFS, a fully symmetrical, in-memory runtime file system that stripes files across all compute nodes, based on a distributed hash function. Our cluster experiments with Montage and BLAST workflows, using up to 512 cores, show that MemFS has both better performance and better scalability than the state-of-the-art, locality-based file system, AMFS. Furthermore, our evaluation on a public commercial cloud validates our cluster results. On this platform MemFS shows excellent scalability up to 1024 cores and is able to saturate the 10G Ethernet bandwidth when running BLAST and Montage.
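
A minimal sketch of symmetric, hash-based striping: each fixed-size stripe is mapped to a node by hashing the file path and stripe index, independent of which node writes it. The stripe size, hash choice, and node count are assumed.

```python
import hashlib

NODES = 8
STRIPE = 4 * 2**20   # 4 MiB stripes (assumed)

def node_for(path: str, stripe_idx: int) -> int:
    # A distributed hash spreads stripes evenly over all compute nodes.
    key = f"{path}:{stripe_idx}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % NODES

# Layout of a 40 MiB file: stripes land on many nodes, not the writer's.
layout = [node_for("/tmp/m101.fits", off // STRIPE)
          for off in range(0, 40 * 2**20, STRIPE)]
print(layout)
```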

18.
In this paper, we propose a simulation model to study real-world replication workflows for cloud storage systems. With this model, we present three new methods to maximize storage space usage during replica creation, and two novel QoS-aware greedy algorithms for replica placement optimization. Using simulation, our algorithms are evaluated against existing placement algorithms, showing that (i) more evenly distributed replicas for a data set can be achieved by using round-robin methods in the replica creation phase, and (ii) the two proposed greedy algorithms, named GS_QoS and GS_QoS_C1, not only produce more economical results than those of Chen et al. but also guarantee QoS for clients.
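
A sketch of the two phases, with assumed node capacities, latencies, and costs: round-robin replica creation for even space usage, then a greedy pick of the cheapest node that satisfies the client's latency QoS.

```python
from itertools import cycle

nodes = {"n1": {"free_gb": 500, "latency_ms": 30, "cost": 2},
         "n2": {"free_gb": 800, "latency_ms": 10, "cost": 5},
         "n3": {"free_gb": 300, "latency_ms": 20, "cost": 3}}

# Phase 1: round-robin over nodes (largest free space first) so replicas
# are created evenly rather than piling onto one node.
rr = cycle(sorted(nodes, key=lambda n: -nodes[n]["free_gb"]))
creation_order = [next(rr) for _ in range(5)]

# Phase 2: greedy, QoS-aware placement in the spirit of GS_QoS (assumed
# form): cheapest node whose latency meets the client's requirement.
def greedy_qos_placement(max_latency_ms):
    ok = [n for n in nodes if nodes[n]["latency_ms"] <= max_latency_ms]
    return min(ok, key=lambda n: nodes[n]["cost"])

print(creation_order, greedy_qos_placement(25))
```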

19.
Yi Wei  M. Brian Blake 《Computing》2016,98(5):523-538
A cloud platform offers on-demand provisioning of virtualized resources and a pay-per-use charging model to its hosted services to satisfy their fluctuating resource needs. Resource scaling in the cloud is often carried out by specifying static rules or thresholds. As business processes and scientific jobs become more intricate and involve more components, traditional reactive or rule-based resource management methods are no longer able to meet the new requirements. In this paper, we extend our previous work on dynamically managing virtualized resources for service workflows in a cloud environment, and report extensive experimental results for an adaptive resource management algorithm. The algorithm makes resource management decisions based on predictive results and high-level, user-specified thresholds. It is also able to coordinate resources among the component services of a workflow so that unnecessary resource allocations and terminations are avoided. Based on observations from previous experiments, the algorithm is extended with a new resource merge strategy to prevent the average resource size from shrinking. Simulation results on synthetic workload data demonstrate the effectiveness of the extension.
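
A sketch of the two mechanisms described above, with assumed thresholds: scaling decisions driven by predicted utilization, and a merge step that replaces several small allocations with one larger one so that average resource size does not shrink.

```python
def scale_decision(predicted_util, low=0.3, high=0.8):
    """Threshold rule applied to *predicted* (not current) utilization."""
    if predicted_util > high:
        return "scale_up"
    if predicted_util < low:
        return "scale_down"
    return "hold"

def merge_small(vms, min_cores=2):
    """Merge strategy (assumed form): fold undersized VMs into one."""
    small = [v for v in vms if v < min_cores]
    if len(small) >= 2:
        kept = [v for v in vms if v >= min_cores]
        return kept + [sum(small)]   # one consolidated allocation
    return vms

print(scale_decision(0.9), merge_small([1, 1, 4]))   # -> scale_up [4, 2]
```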

20.
Scientific workflows are increasingly used to manage and share scientific computations and methods for analyzing data. A variety of systems have been developed that store executed workflows and make them part of public repositories. However, workflows are published in the idiosyncratic format of the workflow system used for their creation and execution. Browsing, linking, and using the stored workflows and their results often become a challenge for scientists who may only be familiar with one system. In this paper we present an approach that addresses this issue by publishing and exploiting workflows as data on the Web, with a representation that is independent of the workflow system used to create them. To achieve our goal, we follow the Linked Data principles to publish workflow inputs, intermediate results, outputs, and codes, and we reuse and extend well-established standards like W3C PROV. We illustrate our approach by publishing workflows and consuming them with different tools designed to address common scenarios for workflow exploitation.
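
A minimal sketch of publishing one workflow execution as Linked Data with W3C PROV, assuming the rdflib package is available; the example URIs are placeholders.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/wf/")   # placeholder namespace

g = Graph()
g.bind("prov", PROV)
run, inp, out = EX["run1"], EX["input.csv"], EX["result.csv"]
g.add((run, RDF.type, PROV.Activity))      # one workflow execution
g.add((inp, RDF.type, PROV.Entity))        # input dataset
g.add((out, RDF.type, PROV.Entity))        # output dataset
g.add((run, PROV.used, inp))               # execution consumed the input
g.add((out, PROV.wasGeneratedBy, run))     # execution produced the output

print(g.serialize(format="turtle"))        # system-independent RDF on the Web
```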
