Similar Documents
20 similar documents found.
1.
A grid is a distributed computational and storage environment often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly utilized techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt the above-mentioned parameters based on information about grid status, providing high job throughput in the presence of failures while reducing system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment, Dynamic Scheduling in Distributed Environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run employing workload and system parameters derived from logs collected from several large-scale parallel production systems. Experiments have shown that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.

2.
Failures are normal rather than exceptional in cloud computing environments. To improve system availability, replicating popular data to multiple suitable locations is an advisable choice, as users can then access the data from a nearby site. This is, however, not the case for replication schemes that keep a fixed number of copies at a few fixed locations. How to decide a reasonable number of replicas and the right locations for them has become a challenge in cloud computing. In this paper, a dynamic data replication strategy is put forward, together with a brief survey of replication strategies suitable for distributed computing environments. It includes: 1) analyzing and modeling the relationship between system availability and the number of replicas; 2) evaluating and identifying popular data and triggering a replication operation when the data's popularity passes a dynamic threshold; 3) calculating a suitable number of copies to meet a reasonable system byte-effective-rate requirement and placing replicas among data nodes in a balanced way; 4) designing the dynamic data replication algorithm in a cloud. Experimental results demonstrate the efficiency and effectiveness of the improvements brought to the system by the proposed strategy in a cloud.
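To make step 1) concrete, the usual back-of-the-envelope relationship between replica count and availability assumes independent node failures: with per-node availability p, r copies give availability 1 - (1 - p)^r. The sketch below illustrates only that relationship and the popularity trigger of step 2); the threshold handling and the byte-effective-rate calculation of the actual strategy are not reproduced here.

```python
import math

def replicas_for_availability(node_availability: float, target: float) -> int:
    """Smallest replica count r with 1 - (1 - p)^r >= target,
    assuming independent node failures (an illustrative simplification)."""
    if not (0.0 < node_availability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("probabilities must lie in (0, 1)")
    r = math.log(1.0 - target) / math.log(1.0 - node_availability)
    return max(1, math.ceil(r))

def should_replicate(access_count: int, dynamic_threshold: float) -> bool:
    """Trigger a replication operation once popularity passes the threshold."""
    return access_count > dynamic_threshold

# Example: nodes that are each 90% available, 99.5% system availability target
print(replicas_for_availability(0.90, 0.995))  # -> 3
```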

3.
With the scaling up of high-performance computing systems in recent years, their reliability has been declining continuously. Therefore, system resilience has been regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. First, a classification of software resilience approaches is presented; we then introduce the major approaches and techniques, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, and fault detection and prediction. In addition, challenges posed by system scale and heterogeneous architectures are also discussed.

4.
Cloud computing offers new computing paradigms, capacity, and flexible solutions to high performance computing (HPC) applications. For example, Hardware as a Service (HaaS) allows users to provision a large number of virtual machines (VMs) for computation-intensive applications. Due to the large number of VMs and electronic components in cloud-based HPC systems, any fault during execution would result in re-running the application, which costs time, money, and energy. In this paper we present a proactive fault tolerance (FT) approach for HPC systems in the cloud that reduces the wall-clock execution time and dollar cost in the presence of faults. We also develop a generic FT algorithm for HPC systems in the cloud; the algorithm does not rely on a spare node being provisioned prior to the prediction of a failure. We further develop a cost model for executing computation-intensive applications on HPC systems in the cloud and analyse the dollar cost of provisioning spare nodes versus checkpointing FT to assess the value of our approach. Experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be reduced by as much as 30%. The checkpointing frequency of computation-intensive applications can be reduced by up to 50% with our FT approach for HPC in the cloud compared with current FT approaches.
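As a rough illustration of the kind of dollar-cost comparison discussed above (spare-node provisioning versus checkpointing), the sketch below uses a deliberately simplified cost model with made-up prices, overheads, and failure probabilities; it is not the authors' cost model.

```python
def expected_cost_with_spare(hours, nodes, price_per_node_hour, spare_nodes=1):
    """Cost when spare nodes are provisioned for the whole run (illustrative)."""
    return (nodes + spare_nodes) * hours * price_per_node_hour

def expected_cost_with_checkpointing(hours, nodes, price_per_node_hour,
                                     failure_prob, rework_hours, ckpt_overhead=0.05):
    """Cost when relying on checkpointing: pay a runtime overhead plus the
    expected rework after a failure (illustrative, failures assumed rare)."""
    effective_hours = hours * (1 + ckpt_overhead) + failure_prob * rework_hours
    return nodes * effective_hours * price_per_node_hour

# Made-up example: 16 nodes, 10-hour job, $0.50 per node-hour
print(expected_cost_with_spare(10, 16, 0.50))                    # 85.0
print(expected_cost_with_checkpointing(10, 16, 0.50, 0.2, 1.0))  # about 85.6
```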

5.
Design of an efficient multi-attribute evaluation strategy for entities in a cloud computing environment
Given that current methods for entity trust evaluation in cloud computing cannot cope with the dynamic fuzziness of the cloud environment, a dynamic evaluation model for multi-attribute trust relationships is designed, taking a genetic adaptive learning algorithm as its theoretical basis and targeting the need for efficient dynamic evaluation in cloud trust management. The model can evaluate and analyze the multiple attributes of entities in the trust management process, and it learns autonomously during continuous evaluation so that its evaluation and analysis capability gradually improves, eventually meeting the efficiency and dynamism requirements stated above. Experimental analysis shows that, when faced with entity parameters of a cloud computing network environment that change in real time, the model can adaptively adjust its evaluation rules, thereby overcoming the dynamic fuzziness of the cloud environment and comprehensively evaluating multiple attributes of an entity.

6.
The advantages of cloud storage are attracting more and more libraries to adopt it to meet the storage needs of their digital collections. However, as massive amounts of data accumulate, the probability of cloud storage node failure keeps growing. A single fault-tolerance strategy, such as replication alone or erasure coding alone, inevitably has drawbacks and cannot meet the needs of today's fault-tolerance technology. Therefore, based on the access frequency and size of collection documents, an adaptively switching fault-tolerance strategy is proposed that can dynamically choose between a replication scheme and an erasure-coding scheme over the entire data lifecycle. Experimental results show that, compared with a replication-only strategy, the scheme saves 43% of storage space, and compared with an erasure-coding-only strategy, it improves node failure recovery time by 52%.
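A minimal sketch of an adaptive switching rule of the kind described above; the heat and size thresholds and the concrete redundancy layouts (3-way replication versus a 6+3 erasure code) are illustrative assumptions, not the parameters used in the paper.

```python
def choose_fault_tolerance(access_frequency: float, file_size_mb: float,
                           hot_threshold: float = 100.0,
                           large_file_mb: float = 64.0) -> str:
    """Pick a redundancy scheme for one object at its current lifecycle stage.

    Hot or small objects favour replication (fast reads, cheap repair);
    cold or large objects favour erasure coding (low storage overhead).
    Thresholds are illustrative, not the values used in the paper.
    """
    if access_frequency >= hot_threshold or file_size_mb < large_file_mb:
        return "replication(3 copies)"
    return "erasure_code(6 data + 3 parity)"

# Re-evaluate periodically so objects can switch schemes as they cool down.
print(choose_fault_tolerance(access_frequency=250, file_size_mb=512))  # replication
print(choose_fault_tolerance(access_frequency=3, file_size_mb=512))    # erasure code
```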

7.
A checkpointing system suitable for large-scale cluster parallel computing
Distributed checkpointing is an important means of fault tolerance for large-scale parallel computing systems. Protocol overhead and checkpoint image storage have become the two major bottlenecks limiting the scalability of parallel checkpointing systems. Targeting the execution characteristics of parallel applications and the architectural features of high-performance clusters, the C system uses dynamic virtual connections and distributed storage of checkpoint images, respectively, to effectively reduce the overhead of coordinated checkpointing and improve the scalability of the checkpointing system. Preliminary test results show that the design strategies of the C system are well suited to fault tolerance for large-scale parallel computing.

8.
In its simplest structure, cloud computing technology is a massive collection of connected servers residing in a datacenter, continuously changing to provide services to users on demand through a front-end interface. Task failure during execution is no longer an accident but a frequent attribute of scheduling systems in a large-scale distributed environment. Recently, computational intelligence techniques have been widely used to tackle scheduling problems in the cloud environment, but only a few place emphasis on fault tolerance. This paper puts forward a Checkpointed League Championship Algorithm (CPLCA) scheduling scheme for cloud computing systems. It is a fault-tolerance-aware task scheduling mechanism that uses a checkpointing strategy together with task migration to handle unexpected failures of independent task executions. The simulation results show that the proposed CPLCA scheme improves the total average makespan by 41%, 33% and 23% compared with Ant Colony Optimization (ACO), the Genetic Algorithm (GA) and the basic League Championship Algorithm (LCA), respectively. Considering the total average response time, the CPLCA scheme produces an improvement of 54%, 57% and 30% over ACO, GA and LCA, respectively. It also yields a significant decrease in job execution failures, as measured by failure metrics and the performance improvement rate. From the results obtained, CPLCA improves both task scheduling performance and failure awareness, making it more appropriate for scheduling in the cloud computing model.

9.
Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the past 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application-efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overhead and higher efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems.
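The core idea behind hash-based incremental checkpointing can be sketched in a few lines: hash fixed-size blocks of application state and persist only the blocks whose hashes changed since the previous checkpoint. The sketch below is a CPU-only simplification; libhashckpt additionally uses page protection and offloads hashing to GPUs.

```python
import hashlib

BLOCK = 4096  # bytes per block, analogous to a memory page

def incremental_checkpoint(state: bytes, prev_hashes: dict) -> tuple[dict, dict]:
    """Return (dirty_blocks, new_hashes) for the blocks that changed."""
    dirty, hashes = {}, {}
    for i in range(0, len(state), BLOCK):
        block = state[i:i + BLOCK]
        h = hashlib.sha256(block).digest()
        hashes[i] = h
        if prev_hashes.get(i) != h:
            dirty[i] = block          # only these blocks go to stable storage
    return dirty, hashes

state = bytearray(b"\x00" * 4 * BLOCK)
dirty, hashes = incremental_checkpoint(bytes(state), {})   # first checkpoint: all dirty
state[5] = 0xFF                                            # mutate one page
dirty, hashes = incremental_checkpoint(bytes(state), hashes)
print(len(dirty))  # 1 -> only the modified block is re-saved
```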

10.
杨娜, 刘靖. 《软件学报》, 2019, 30(4): 1191-1202
Providing efficient and continuously available fault-tolerance services to guarantee the reliable operation of cloud application systems is of vital importance. Adopting the fault-tolerance-as-a-service model, an optimized method for dynamically provisioning cloud fault-tolerance services is proposed. It describes the fault-tolerance requirements of cloud applications in terms of component reliability and response time, builds on commonly used fault-tolerance techniques such as replication, checkpointing and N-version programming (NVP), and takes full account of the overhead of dynamically switching fault-tolerance services. For the scenarios in which the underlying cloud resources supporting the fault-tolerance services are sufficient and insufficient, respectively, optimal solution methods for feasible fault-tolerance-as-a-service provisioning plans are given. Experimental results show that the proposed method reduces both the fault-tolerance service fees paid by cloud application systems and the cost of the underlying cloud resources supporting the fault-tolerance services, and improves a fault-tolerance service provider's ability to deliver efficient and reliable fault tolerance as a service to multiple cloud applications.

11.
Virtualization technology makes data centers more dynamic and easier to administrate. Today, cloud providers offer customers access to complex applications running on virtualized hardware. Nevertheless, large virtualized data centers become stochastic environments, and the simplification on the user side creates many challenges for the provider, who has to find cost-efficient configurations and deal with dynamic environments to ensure service level objectives (SLOs). We introduce a software solution that reduces the degree of human intervention needed to manage clouds. It is designed as a multi-agent system (MAS) and placed on top of the Infrastructure as a Service (IaaS) layer. Worker agents allocate resources, configure applications, check the feasibility of requests, and generate cost estimates. They are equipped with application-specific knowledge that allows them to estimate the type and number of necessary resources. During runtime, a worker agent monitors the job and adapts its resources to ensure the specified quality of service, even in noisy clouds where job instances are influenced by other jobs. The worker agents interact with a scheduler agent, which takes care of limited resources and performs cost-aware scheduling by assigning jobs to times with low costs. The whole architecture is self-optimizing and able to use public or private clouds. Building a private cloud requires finding a mapping of virtual machines (VMs) to hosts. We present a rule-based mapping algorithm for VMs that offers an interface where policies can be defined and combined in a generic way. The algorithm performs the initial mapping at request time as well as remapping during runtime, and it deals with policy and infrastructure changes. An energy-aware scheduler and the availability of cheap resources provided by a spot market are also analyzed. We evaluated our approach by building an SaaS stack that assigns resources according to an energy function and ensures the SLOs of two different applications, a brokerage system and high-performance computing software. Experiments were performed on a real cloud system and by simulation.
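The generic policy interface for the rule-based VM-to-host mapping can be pictured as a set of scoring rules combined per candidate host, as in the sketch below; the specific policies (energy and balance), weights, and resource fields are illustrative assumptions rather than the paper's actual rule set.

```python
from typing import Callable, Dict, List

Policy = Callable[[dict, dict], float]   # (vm, host) -> score, higher is better

def fits(vm: dict, host: dict) -> bool:
    """Hard constraint: the host must have enough free CPU and RAM."""
    return host["free_cpu"] >= vm["cpu"] and host["free_ram"] >= vm["ram"]

def energy_policy(vm: dict, host: dict) -> float:
    # Prefer hosts that are already powered on (consolidation saves energy).
    return 1.0 if host["powered_on"] else 0.0

def balance_policy(vm: dict, host: dict) -> float:
    # Prefer hosts with more CPU headroom (load spreading).
    return host["free_cpu"] / host["total_cpu"]

def map_vm(vm: dict, hosts: List[dict], policies: Dict[Policy, float]) -> dict:
    """Pick the feasible host with the best weighted combination of policy scores."""
    candidates = [h for h in hosts if fits(vm, h)]
    if not candidates:
        raise RuntimeError("no host satisfies the VM's resource request")
    return max(candidates,
               key=lambda h: sum(w * p(vm, h) for p, w in policies.items()))

hosts = [{"free_cpu": 8, "total_cpu": 16, "free_ram": 32, "powered_on": True},
         {"free_cpu": 14, "total_cpu": 16, "free_ram": 64, "powered_on": False}]
vm = {"cpu": 4, "ram": 16}
print(map_vm(vm, hosts, {energy_policy: 0.7, balance_policy: 0.3}))
# the powered-on host wins under these example weights
```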

12.
Adaptive checkpointing strategy to tolerate faults in economy based grid
In this paper, we develop a fault-tolerant job scheduling strategy in order to tolerate faults gracefully in an economy-based grid environment. We propose a novel adaptive task-checkpointing-based fault-tolerant job scheduling strategy for an economy-based grid. The proposed strategy maintains a fault index of grid resources and dynamically updates the index based on the successful or unsuccessful completion of assigned tasks. Whenever a grid resource broker has tasks to schedule on grid resources, it makes use of the fault index from the fault-tolerant schedule manager in addition to a time optimization heuristic. While scheduling a grid job on a grid resource, the resource broker uses the fault index to apply different intensities of task checkpointing (inserting checkpoints into a task at different intervals). To simulate and evaluate the performance of the proposed strategy, this paper enhances the GridSim Toolkit-4.0 to exhibit fault-tolerance-related behavior. We also compare the checkpointing fault-tolerant job scheduling strategy with the well-known time optimization heuristic in an economy-based grid environment. From the measured results, we conclude that even in the presence of faults, the proposed strategy effectively schedules grid jobs, tolerates faults gracefully, and executes more jobs successfully within the specified deadline and allotted budget. It also improves the overall execution time and minimizes the execution cost of grid jobs.
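A minimal sketch of the fault-index bookkeeping and of how such an index could drive checkpointing intensity; the increment/decrement rule and the interval mapping are illustrative assumptions, not the exact scheme of the proposed strategy.

```python
class FaultIndexTable:
    """Per-resource fault index maintained by the fault-tolerant schedule manager."""

    def __init__(self):
        self.index = {}                      # resource_id -> fault index (>= 0)

    def task_completed(self, resource_id: str, success: bool) -> None:
        """Increase the index on failure, decrease it on success."""
        current = self.index.get(resource_id, 0)
        self.index[resource_id] = max(0, current - 1) if success else current + 1

    def checkpoint_interval(self, resource_id: str, base_interval_s: float = 600.0) -> float:
        """More recorded faults -> denser checkpoints (smaller interval)."""
        return base_interval_s / (1 + self.index.get(resource_id, 0))

table = FaultIndexTable()
table.task_completed("resource-42", success=False)
table.task_completed("resource-42", success=False)
print(table.checkpoint_interval("resource-42"))  # 200.0 -> checkpoint 3x as often
```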

13.
In this study, we describe the further development of Elastic Cloud Computing Cluster (EC3), a tool for creating self-managed, cost-efficient virtual hybrid elastic clusters on top of Infrastructure as a Service (IaaS) clouds. By using spot instances and checkpointing techniques, EC3 can significantly reduce the total execution cost as well as facilitate automatic fault tolerance. Moreover, EC3 can deploy and manage hybrid clusters across on-premises and public cloud resources, thereby introducing cloud bursting capabilities. We present the results of a case study that we conducted to assess the effectiveness of the tool, based on the structural dynamic analysis of buildings. In addition, we evaluated the checkpointing algorithms in a real cloud environment with existing workloads to study their effectiveness. The results demonstrate the feasibility and benefits of this type of cluster for computationally intensive applications.

14.
In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. In our model, fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance of message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface including message-passing and file I/O primitives, which hides the complexity of both fault-tolerant network communications and checkpointing and recovering user files from the application level. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.

15.
Byzantine fault-tolerant (BFT) algorithms are a class of fault-tolerance algorithms that can tolerate various forms of software errors and security vulnerabilities, and they are important for guaranteeing the reliability of cloud computing. Compared with other fault-tolerance algorithms, BFT algorithms offer higher robustness but poor performance, which cannot meet current systems' demands for high throughput and low latency. In-network computing is a data-centric architecture in which the network takes over part of the computation so that data is processed while it flows, thereby improving system performance. To address the problems of BFT systems, this paper proposes an optimization scheme for a Byzantine fault-tolerant consensus algorithm based on in-network computing, offloading part of the algorithm's processing to the network interface card (NIC) and exploiting the multi-stage pipeline formed by the NIC and the processor to increase system throughput. Because a scheme that relies only on in-network computing performs poorly in certain scenarios, multithreading is used to improve the scalability of the optimization scheme. A detailed system evaluation shows that, compared with an ordinary BFT system, the optimization scheme combining in-network computing and multithreading achieves a 46% increase in throughput and a 65% reduction in latency, demonstrating the feasibility and effectiveness of the in-network-computing-based BFT consensus optimization.

16.
There are many security issues in cloud computing service environments, including virtualization, distributed big-data processing, serviceability, traffic management, application security, access control, authentication, and cryptography, among others. In particular, data access using various resources requires an authentication and access control model for integrated management and control in cloud computing environments. Cloud computing services are differentiated according to security policies because of differences in the access rights permitted to service providers and users. The RBAC (role-based access control) and C-RBAC (context-aware RBAC) models do not offer effective and practical solutions for managers and users based on dynamic access control methods, suggesting the need for a new model of dynamic access control that can address the limitations of cloud computing characteristics. This paper proposes Onto-ACM (ontology-based access control model), a semantic analysis model that can address the difference in permitted access control between service providers and users. The proposed model is an intelligent context-aware access model that proactively applies resource access levels based on ontology reasoning and semantic analysis.

17.
The paper proposes a new technique for providing software fault tolerance in concurrent systems. It combines the traditional global checkpointing mechanism with the recovery block concept in order to arrive at an easily implementable error recovery mechanism. This mechanism involves smaller overhead in the case of moderate to high process interaction than the schemes considered in the past, which are based on the idea of local checkpointing. A model for computing the optimum checkpointing interval is also presented. A particular distribution is hypothesized for the coverage of the recovery, and the behavior of the model is studied in detail for this case.
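For context, the best-known first-order model for the optimal checkpointing interval is Young's approximation, T_opt ≈ sqrt(2 * C * MTBF), where C is the time to write one checkpoint and MTBF is the mean time between failures. The paper derives its own model under a hypothesized coverage distribution, so the sketch below shows only this classical baseline.

```python
import math

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Example: 30 s to write a checkpoint, one failure every 6 hours on average
print(young_interval(30.0, 6 * 3600))  # ~1138 s, i.e. checkpoint roughly every 19 minutes
```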

18.
The resource environment of a mobile grid is highly dynamic: at any moment resources may join, leave, fail, or move. A task replication strategy is adopted to tolerate resource unreliability. Resource reliability is characterized by a Weibull distribution and a task replication model is built; the independent task scheduling problem based on the replication strategy is formalized, with its scheduling objective and constraints; and the scheduling problem is solved with a genetic algorithm. Simulation results show that the scheduling algorithm scales well and that scheduling performance varies linearly with resource reliability.
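The Weibull reliability model behind the replication decision can be written down directly: the probability that a resource stays available for a task of length t is R(t) = exp(-(t/η)^k) with shape k and scale η. The sketch below, with illustrative parameter values, shows how such a model could size the number of replicas; the paper itself formalizes the full scheduling problem and solves it with a genetic algorithm.

```python
import math

def weibull_reliability(t: float, shape_k: float, scale_eta: float) -> float:
    """Probability that a resource stays up for the whole task of length t."""
    return math.exp(-((t / scale_eta) ** shape_k))

def replicas_needed(t: float, shape_k: float, scale_eta: float,
                    target_success: float) -> int:
    """Smallest number of independent replicas so that at least one finishes."""
    fail_single = 1.0 - weibull_reliability(t, shape_k, scale_eta)
    n = 1
    while 1.0 - fail_single ** n < target_success:
        n += 1
    return n

# Illustrative values: 2-hour task, shape k=1.5, characteristic life eta=10 h, 99.9% target
print(replicas_needed(2.0, 1.5, 10.0, 0.999))  # -> 3
```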

19.
Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Given the need to provide fault tolerance, support for suspend-resume, and offline migration, an efficient checkpoint-restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that is able to take live incremental snapshots of the whole disk attached to virtual machine (VM) instances. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background using a low-overhead technique we call selective copy-on-write. It includes support for both application-level and process-level checkpointing, as well as support to roll back filesystem changes. Experiments at large scale demonstrate the benefits of our proposal both in synthetic settings and for a real-life HPC application.

20.
Li Chunlin, Tang Jianhang, Luo Youlong. The Journal of Supercomputing, 2019, 75(11): 7209-7243

The replica strategies in traditional cloud computing often result in excessive resource consumption and long response times. In an edge cloud environment, if replica nodes are not managed efficiently, problems such as slow user access and poor system fault tolerance arise. Therefore, this paper proposes replica creation and selection strategies based on the edge cloud architecture: a dynamic replica creation algorithm based on access heat (DRC-AH) and a replica selection algorithm based on node service capability (DRS-NSC). The DRC-AH uses data blocks as the replication granularity and a Grey-Markov chain to dynamically adjust the number of replicas. After replicas are created, when the client receives a user request, the DRS-NSC selects the best replica node to respond to the user. Experiments show that the proposed algorithms have significant advantages in prediction accuracy, response time to user requests, and resource utilization, and improve the performance of the system to a certain extent.
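A minimal sketch of a node-service-capability score of the kind DRS-NSC could use to pick the responding replica node; the specific features (bandwidth, load, latency) and the scoring formula are illustrative assumptions, not the paper's definition.

```python
from dataclasses import dataclass

@dataclass
class ReplicaNode:
    node_id: str
    bandwidth_mbps: float   # link capacity towards the requesting client
    load: float             # current utilisation in [0, 1]
    latency_ms: float       # round-trip time to the client

def service_capability(node: ReplicaNode) -> float:
    """Higher is better: favour fast, lightly loaded, nearby nodes."""
    return node.bandwidth_mbps * (1.0 - node.load) / (1.0 + node.latency_ms)

def select_replica(nodes: list[ReplicaNode]) -> ReplicaNode:
    return max(nodes, key=service_capability)

nodes = [ReplicaNode("edge-1", 1000, 0.7, 5.0),
         ReplicaNode("edge-2", 500, 0.2, 8.0),
         ReplicaNode("core-1", 10000, 0.9, 40.0)]
print(select_replica(nodes).node_id)   # edge-1
```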
