1.

Architectures for Extreme-Scale Computing (total citations: 1; self-citations: 0; citations by others: 1)

Torrellas Josep 《Computer》2009,42(11):28-35

Extreme-scale computers promise orders-of-magnitude improvement in performance over current high-end machines for the same machine power consumption and physical footprint. They also bring some important architectural challenges.

2.

Extreme-scale parallel computing: bottlenecks and strategies

Ze-yao Mo 《Journal of Zhejiang University: Science C (English Edition)》2018,19(10):1251-1260

Extreme-scale numerical simulations seriously demand extreme parallel computing capabilities. To address the challenges of these capabilities toward exascale, we systematically analyze the major bottlenecks of parallel computing research from three perspectives: computational scale, computing efficiency, and programming productivity. For these bottlenecks, we propose a series of urgent key issues and coping strategies. This study will be useful in synchronizing development between the numerical computing capability and supercomputer peak performance.

3.

A survey of recoverable distributed shared virtual memory systems

Morin C. Puaut I. 《IEEE Transactions on Parallel and Distributed Systems》1997,8(9):959-969

Distributed Shared Virtual Memory (DSVM) systems provide a shared memory abstraction on distributed memory architectures. Such systems ease parallel application programming because the shared-memory programming model is often more natural than the message-passing paradigm. However, the probability of failure of a DSVM increases with the number of sites. Thus, fault tolerance mechanisms must be implemented in order to allow processes to continue their execution in the event of a failure. This paper gives an overview of recoverable DSVMs (RDSVMs) that provide a checkpointing mechanism to restart parallel computations in the event of a site failure.

4.

Reliability-aware performance model for optimal GPU-enabled cluster environment

Supada Laosooksathit Raja Nassar Chokchai Leangsuksun Mihaela Paun 《The Journal of supercomputing》2014,68(3):1630-1651

Given that the reliability of a very large-scale system is inversely related to the number of computing elements, fault tolerance has become a major concern in high performance computing, including the most recent deployments with graphics processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures within such systems. However, fault tolerance mechanisms generate additional costs, and these may cause a significant performance drop if the mechanisms are not used carefully. This paper presents a novel fault tolerance scheduling model that explores the interplay between GPGPU application performance and the reliability of a large GPU system. This work focuses on the checkpoint scheduling model that aims to minimize fault tolerance costs. Additionally, a GPU performance analysis is conducted. Furthermore, the effect of a checkpoint/restart mechanism on the application performance is thoroughly studied and discussed.
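
The trade-off this abstract describes (checkpointing too often wastes time, too rarely loses work) can be illustrated with Young's classic first-order approximation for the checkpoint interval. This is a minimal sketch and not the scheduling model of the paper: it assumes independent nodes with exponentially distributed failures, and the function names and sample numbers are hypothetical.

```python
import math


def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """MTBF of the whole system, assuming independent, exponentially
    distributed node failures: it shrinks in proportion to the node count."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return node_mtbf_hours / num_nodes


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's approximation of the checkpoint interval: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def overhead_fraction(interval_hours: float, checkpoint_cost_hours: float,
                      mtbf_hours: float) -> float:
    """First-order expected overhead per unit of useful work:
    checkpoint cost (C / tau) plus expected rework after a failure (tau / 2M)."""
    return (checkpoint_cost_hours / interval_hours
            + interval_hours / (2.0 * mtbf_hours))


if __name__ == "__main__":
    mtbf = system_mtbf(node_mtbf_hours=10_000.0, num_nodes=100)  # illustrative
    tau = young_interval(checkpoint_cost_hours=0.5, mtbf_hours=mtbf)
    print(f"system MTBF = {mtbf} h, checkpoint every {tau} h, "
          f"overhead = {overhead_fraction(tau, 0.5, mtbf):.3f}")
```

Under these assumptions, intervals shorter or longer than the computed one both raise the overhead, which is the effect a checkpoint scheduling model tries to minimize.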

5.

Failure detection algorithm for Fail-Lagging model applied to HPC

Ye Yingjun Zhang Yongdong Ye Weicai 《The Journal of supercomputing》2022,78(12):14009-14033

It is essential to use fault tolerance techniques on exascale high-performance computing systems, but this faces many challenges, such as a higher probability of failure, more complex types of faults, and greater difficulty in failure detection. In this paper, we design the Fail-Lagging model to describe HPC process-level failure. The failure model does not distinguish whether the failed process has crashed or is merely slow, but is compatible with the possible behaviors of a process under various failures, such as crash, slowdown, and recovery. Failure detection in the Fail-Lagging model is implemented by local detection and global decision among processes, which depend on a robust and efficient communication topology. Robust means that failed processes do not easily corrupt the connectivity of the topology, and efficient means that the time complexity of the topology used for collective communication is as low as possible. For this purpose, we design a torus-tree topology for failure detection, which is scalable even to an extremely large number of processes. The Fail-Lagging model supports common fault tolerance methods such as rollback, replication, redundancy, and algorithm-based fault tolerance, and is especially well suited to enabling an efficient forward recovery mode. We demonstrate with large-scale experiments that the torus-tree failure detection algorithm is robust and efficient, and we apply fault tolerance based on the Fail-Lagging model to iterative computation, enabling applications to react to faults in a timely manner.

6.

A Large-Scale Study of Failures on Petascale Supercomputers

Rui-Tao Liu Zuo-Ning Chen 《Journal of Computer Science and Technology》2018,33(1):24-41

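With the rapid development of supercomputers, their scale and complexity keep increasing, and reliability and resilience face greater challenges. Fault tolerance involves many important technologies, such as proactive failure avoidance based on fault prediction, reactive fault tolerance based on checkpointing, and scheduling techniques that improve reliability. Qualitative and quantitative descriptions of the characteristics of system faults are critical for these technologies. This study analyzes the sources of failures on two typical petascale supercomputers, Sunway BlueLight (based on multi-core CPUs) and Sunway TaihuLight (based on heterogeneous manycore CPUs). It uncovers some interesting fault characteristics and finds previously unknown correlations among faults of the main components. Finally, the paper analyzes the failure times of the two supercomputers at various resource granularities and over different time spans, and builds a uniform multi-dimensional failure time model for petascale supercomputers.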

7.

An energy-aware scheduling algorithm under maximum power consumption constraints

《Journal of Manufacturing Systems》2020

This research investigates production scheduling problems under maximum power consumption constraints. Probabilistic models are developed to model dispatching-dependent and stochastic machine energy consumption. A multi-objective scheduling algorithm called the energy-aware scheduling optimization method is proposed in this study to enhance both production and energy efficiency. The explicit consideration of the probabilistic energy consumption constraint and the following factors makes this work distinct from other existing studies in the literature: 1) dispatching-dependent energy consumption of machines, 2) stochastic energy consumption of machines, 3) parallel machines with different production rates and energy consumption patterns, and 4) maximum power consumption constraints. The proposed three-stage algorithm can quickly generate near-optimal solutions and outperforms other algorithms in terms of energy efficiency, makespan, and computation time. While minimizing the total energy consumption in the first and second stages, the proposed algorithm generates a detailed production schedule under the probabilistic constraint of peak energy consumption in the third stage. Numerical results show the superiority of the scheduling solution with regard to quality and computational time on real problem instances from the manufacturing industry. While the scheduling solution is optimal in total energy consumption, the makespan is within 0.6% of the optimal on average.

8.

Fault tolerant algorithms for heat transfer problems

Hatem Ltaief Edgar Gabriel Marc Garbey 《Journal of Parallel and Distributed Computing》2008

With the emergence of new massively parallel systems in the high performance computing area allowing scientific simulations to run on thousands of processors, the mean time between failures of large machines is decreasing from several weeks to a few minutes. The ability of hardware and software components to handle these singular events, called process failures, is therefore becoming increasingly important. In order for a scientific code to continue despite a process failure, the application must be able to retrieve the lost data items. The recovery procedure after failures might be fairly straightforward for elliptic and linear hyperbolic problems. However, the reversibility in time for parabolic problems appears to be the most challenging part because it is an ill-posed problem. This paper focuses on new fault-tolerant numerical schemes for the time integration of parabolic problems. The new algorithm allows the application to recover from process failures and to reconstruct numerically the lost data of the failed process(es), avoiding the expensive roll-back operation required in most checkpoint/restart schemes. As a fault-tolerant communication library, we use the fault tolerant message passing interface developed by the Innovative Computing Laboratory at the University of Tennessee. Experimental results show promising performance. Indeed, the three-dimensional parabolic benchmark code is able to recover and to keep on running after failures, adding only a very small penalty to the overall time of execution.

9.

Fault-Aware Runtime Strategies for High-Performance Computing

Yawei Li Zhiling Lan Gujrati P. Xian-He Sun 《IEEE Transactions on Parallel and Distributed Systems》2009,20(4):460-473

As the scale of parallel systems continues to grow, fault management of these systems is becoming a critical challenge. While existing research mainly focuses on developing or improving fault tolerance techniques, a number of key issues remain open. In this paper, we propose runtime strategies for spare node allocation and job rescheduling in response to failure prediction. These strategies, together with a failure predictor and fault tolerance techniques, constitute a runtime system called FARS (Fault-Aware Runtime System). In particular, we propose a 0-1 knapsack model and demonstrate its flexibility and effectiveness for reallocating running jobs to avoid failures. Experiments, by means of synthetic data and real traces from production systems, show that FARS has the potential to significantly improve system productivity (i.e., performance and reliability).
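
The 0-1 knapsack formulation mentioned in this abstract can be sketched with the textbook dynamic program below. The mapping onto job reallocation is an assumption made for illustration only (each job at risk has a hypothetical value, such as work saved, and a hypothetical weight, such as spare nodes required); the abstract does not state how FARS defines these quantities.

```python
from typing import List, Tuple


def knapsack_01(values: List[int], weights: List[int],
                capacity: int) -> Tuple[int, List[int]]:
    """Classic 0-1 knapsack solved by dynamic programming.

    Returns the best total value and the indices of the chosen items.
    best[i][c] is the best value using the first i items within capacity c.
    """
    n = len(values)
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        value, weight = values[i - 1], weights[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # skip item i-1
            if weight <= c:              # or take it, if it fits
                best[i][c] = max(best[i][c], best[i - 1][c - weight] + value)

    # Walk back through the table to recover which items were taken.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen


if __name__ == "__main__":
    # Hypothetical jobs: value = node-hours saved, weight = spare nodes needed.
    saved = [60, 100, 120]
    needed = [1, 2, 3]
    print(knapsack_01(saved, needed, capacity=5))
```

The table has (n + 1) × (capacity + 1) entries, so the approach suits cases where the capacity (here, the number of spare nodes) is a modest integer.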

10.

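Fault diagnosis of marine engines based on power spectrum envelope energy and SVM

Cui Jianguo Liu Baosheng Wang Guihua Yu Mingyue Gao Yang 《Computer Measurement & Control》2015,23(12):21-21

The engine is one of the important components of a warship, and its stability has an important influence on the ship's normal navigation. Taking a key component of a marine engine (the main pump bearing) as the specific research object, a fault diagnosis method combining power spectrum envelope energy with a support vector machine (SVM) is proposed. First, a large amount of vibration acceleration data characterizing the health state of the main pump bearing is acquired and subjected to power spectrum analysis to obtain the envelope energy of the power spectrum. Feature vectors are then constructed from this envelope energy, and an SVM-based fault diagnosis model is designed to diagnose faults of the main pump bearing. The results show that the method diagnoses main pump bearing faults well, laying a foundation for the engineering application of fault diagnosis of marine engine main pump bearings.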

11.

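Reliability evaluation of k-ary n-cubes

Feng Kai Li Jing 《Journal of Computer Applications》2019,39(11):3323-3327

The functionality of a parallel computer system depends largely on the performance of its interconnection network. To measure precisely the fault tolerance of parallel computer systems whose underlying topology is the k-ary n-cube, the reliability of k-ary (n-1)-cube subnetworks in the k-ary n-cube is studied under the node fault model. For odd k ≥ 3, the mean time to failure for different numbers of k-ary (n-1)-cube subnetworks to remain fault-free is analyzed under both the fixed partition mode and the flexible partition mode, and formulas for computing this subnetwork reliability evaluation parameter are derived. The results show that when a parallel computer system built on a k-ary n-cube with odd k assigns subnetworks to execute user tasks, the flexible partition mode has better fault tolerance than the fixed partition mode under the node fault model.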
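
To make the topology concrete, the sketch below models a k-ary n-cube as n-tuples over {0, ..., k-1} and counts fault-free k-ary (n-1)-cube subnetworks. It assumes, as an interpretation not confirmed by the abstract, that the fixed partition mode always splits along one predetermined dimension while the flexible mode may pick whichever dimension leaves the most fault-free subnetworks. All function names are hypothetical.

```python
from itertools import product
from typing import Iterable, Set, Tuple

Node = Tuple[int, ...]


def all_nodes(k: int, n: int) -> Iterable[Node]:
    """Nodes of the k-ary n-cube: every n-tuple with entries in 0..k-1."""
    return product(range(k), repeat=n)


def neighbors(node: Node, k: int) -> Set[Node]:
    """Nodes that differ from `node` by +1 or -1 (mod k) in one coordinate."""
    result: Set[Node] = set()
    for dim in range(len(node)):
        for step in (1, -1):
            other = list(node)
            other[dim] = (other[dim] + step) % k
            result.add(tuple(other))
    return result


def fault_free_subcubes(faulty: Set[Node], k: int, dim: int) -> int:
    """Fixing coordinate `dim` to each of its k values splits the cube into
    k disjoint k-ary (n-1)-cubes; count those containing no faulty node."""
    hit_values = {node[dim] for node in faulty}
    return k - len(hit_values)


def best_flexible_partition(faulty: Set[Node], k: int, n: int) -> int:
    """Assumed flexible mode: choose the partition dimension that keeps
    the largest number of subcubes fault-free."""
    return max(fault_free_subcubes(faulty, k, dim) for dim in range(n))


if __name__ == "__main__":
    faults = {(0, 1), (2, 1)}            # two faulty nodes in a 3-ary 2-cube
    print(fault_free_subcubes(faults, k=3, dim=0))   # fixed split on dim 0
    print(best_flexible_partition(faults, k=3, n=2))  # best over both dims
```

In this small example the two faults land in different subcubes when splitting along dimension 0 but in the same subcube when splitting along dimension 1, which illustrates why a flexible choice of dimension can tolerate faults better.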

12.

An environment for developing fault-tolerant software

Purtilo J.M. Jalote P. 《IEEE Transactions on Software Engineering》1991,17(2):153-159

An environment that supports execution of programs using both N-version programming and recovery blocks in a uniform manner is described. For N-version programming, the system offers an easy and flexible way of specifying the target machines for the separate versions. The basic unit of fault tolerance supported by this system is at the procedure or function level. Each such program unit can be packaged as its own task, and different fault tolerance techniques can subsequently be employed, even within the same application. The environment also allows versions to be written in different programming languages and executed on different machines. This enhances the independence between the different versions, making the fault tolerance techniques more effective. This environment has been developed for use on Unix-based hosts and currently runs on a network of Sun and DEC workstations.
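
The two techniques named in this abstract can be contrasted at the function level with the minimal sketch below. It is a generic illustration, not the environment described in the paper: versions run sequentially in one process rather than on separate machines, results are assumed to be hashable, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Any, Callable, Sequence


def n_version(versions: Sequence[Callable[..., Any]], *args: Any) -> Any:
    """N-version programming: run every version and return the result
    that a strict majority of all versions agree on."""
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            continue  # a crashed version simply casts no vote
    if not results:
        raise RuntimeError("all versions failed")
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(versions):
        raise RuntimeError("no majority among versions")
    return value


def recovery_block(alternates: Sequence[Callable[..., Any]],
                   acceptance_test: Callable[[Any], bool],
                   *args: Any) -> Any:
    """Recovery block: try alternates in order and return the first result
    that passes the acceptance test. The alternates are assumed to be
    side-effect free, so no state rollback is needed between attempts."""
    for alternate in alternates:
        try:
            result = alternate(*args)
        except Exception:
            continue
        if acceptance_test(result):
            return result
    raise RuntimeError("no alternate passed the acceptance test")


if __name__ == "__main__":
    squares = [lambda x: x * x, lambda x: x ** 2, lambda x: x + 1]  # one faulty
    print(n_version(squares, 3))
    roots = [lambda x: -1.0, lambda x: x ** 0.5]  # primary is faulty
    print(recovery_block(roots, lambda r: r >= 0, 16))
```

N-version programming masks a fault by voting across independently written versions, whereas a recovery block detects it with an acceptance test and falls back to an alternate.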

13.

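A mechanism for improving energy efficiency in a reliable cloud computing environment

Cui Xuetao Zhang Xihuang 《Computer Engineering and Applications》2016,52(1):42-47

To address the low energy efficiency of current cloud computing and reliability problems such as power failures, a physical host consolidation mechanism and a scheduling algorithm are proposed that improve energy efficiency while preserving the reliability of cloud computing. The energy optimization mechanism detects optimization opportunities and executes the scheduling algorithm when faults such as power failures occur. The algorithm adjusts the mapping of virtual machines to physical hosts and allocates the idle CPU capacity of the corresponding physical hosts to the running virtual machines, thereby improving energy efficiency. Experimental results show that, compared with traditional scheduling algorithms, the algorithm improves working efficiency by 15.8% and reduces energy consumption by 9.8%.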

14.

Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing

《Parallel Computing》2015

We analyse novel fault tolerance schemes for data loss in multigrid solvers, which essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To improve efficiency compared to conventional global checkpointing, we exploit the inherent data compression of the multigrid hierarchy, and relax the synchronicity requirement through a local failure local recovery approach. We experimentally identify the root cause of convergence degradation in the presence of data loss using smoothness considerations. Our resulting schemes form a family of techniques that can be tailored to the expected error probability of (future) large-scale machines. A performance model gives further insight into the benefits and applicability of our techniques.

15.

Cost-oriented proactive fault tolerance approach to high performance computing (HPC) in the cloud

Ifeanyi P. Egwutuoha Shiping Chen David Levy Bran Selic Rafael Calvo 《International Journal of Parallel, Emergent and Distributed Systems》2014,29(4):363-378

Cloud computing offers new computing paradigms, capacity and flexible solutions to high performance computing (HPC) applications. For example, Hardware as a Service (HaaS) allows users to provide a large number of virtual machines (VMs) for computation-intensive applications using the HaaS model. Due to the large number of VMs and electronic components in an HPC system in the cloud, any fault during the execution would result in re-running the applications, which will cost time, money and energy. In this paper we present a proactive fault tolerance (FT) approach to HPC systems in the cloud to reduce the wall-clock execution time and dollar cost in the presence of faults. We also develop a generic FT algorithm for HPC systems in the cloud. Our algorithm does not rely on a spare node prior to prediction of a failure. We also develop a cost model for executing computation-intensive applications on HPC systems in the cloud. We analyse the dollar cost of provisioning spare nodes and checkpointing FT to assess the value of our approach. Our experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be reduced by as much as 30%. The frequency of checkpointing of computation-intensive applications can be reduced by up to 50% with our FT approach for HPC in the cloud compared with current FT approaches.

16.

The Effects of an ARMOR-based SIFT environment on the performance and dependability of user applications

Whisnant K. Iyer R.K. Kalbarczyk Z.T. Jones P.H. III Rennels D.A. Some R. 《IEEE Transactions on Software Engineering》2004,30(4):257-277

Few distributed software-implemented fault tolerance (SIFT) environments have been experimentally evaluated using substantial applications to show that they protect both themselves and the applications from errors. We present an experimental evaluation of a SIFT environment used to oversee spaceborne applications as part of the Remote Exploration and Experimentation (REE) program at the Jet Propulsion Laboratory. The SIFT environment is built around a set of self-checking ARMOR processes running on different machines that provide error detection and recovery services to themselves and to the REE applications. An evaluation methodology is presented in which over 28,000 errors were injected into both the SIFT processes and two representative REE applications. The experiments were split into three groups of error injections, with each group successively stressing the SIFT error detection and recovery more than the previous group. The results show that the SIFT environment added negligible overhead to the application's execution time during failure-free runs. Correlated failures affecting a SIFT process and application process are possible, but the division of detection and recovery responsibilities in the SIFT environment allows it to recover from these multiple failure scenarios. Only 28 cases were observed in which either the application failed to start or the SIFT environment failed to recognize that the application had completed. Further investigations showed that assertions within the SIFT processes, coupled with object-based incremental checkpointing, were effective in preventing system failures by protecting dynamic data within the SIFT processes.

17.

Automating the addition of fault tolerance with discrete controller synthesis

Alain Girault Éric Rutten 《Formal Methods in System Design》2009,35(2):190-225

Discrete controller synthesis (DCS) is a formal approach, based on the same state-space exploration algorithms as model-checking. Its interest lies in the ability to automatically obtain systems that satisfy, by construction, formal properties specified a priori. In this paper, our aim is to demonstrate the feasibility of this approach for fault tolerance. We start with a fault-intolerant program, modeled as the synchronous parallel composition of finite labeled transition systems; we specify formally a fault hypothesis; we state some fault tolerance requirements; and we use DCS to obtain automatically a program that has the same behavior as the initial fault-intolerant one in the absence of faults and satisfies the fault tolerance requirements under the fault hypothesis. Our original contribution resides in the demonstration that DCS can be elegantly used to design fault-tolerant systems, with guarantees on key properties of the obtained system, such as the fault tolerance level and the satisfaction of quantitative constraints. We show with numerous examples taken from case studies that our method can address different kinds of failures (crash, value, or Byzantine) affecting different kinds of hardware components (processors, communication links, actuators, or sensors). In addition, we show that our method offers an optimality criterion that is very useful for synthesizing fault-tolerant systems compliant with the constraints of embedded systems, such as power consumption.

18.

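A task-parallel programming model supporting fault tolerance

Wang Yizhuo Chen Xu Ji Weixing Su Yan Wang Xiaojun Shi Feng 《Journal of Software》2016,27(7):1789-1804

The task-parallel programming model has become the mainstream of parallel programming; it improves the performance of parallel computer systems by exploiting task parallelism. This paper proposes a task-parallel programming model that supports fault tolerance, integrating fault tolerance techniques into the task-parallel programming model to improve system reliability while maintaining performance. The model takes the task as the basic unit of scheduling, execution, error detection, and recovery, and implements fault tolerance support at the application level. A Buffer-Commit computation model supports the detection of and recovery from transient errors; application-level diskless checkpointing recovers from permanent errors of the node-failure type; and a fault-tolerant work-stealing task scheduling strategy provides dynamic load balancing. Experimental results show that the model provides fault tolerance against hardware errors at a low performance overhead.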

19.

Optimizing MapReduce for energy efficiency

Nidhi Tiwari Umesh Bellur Santonu Sarkar Maria Indrawan 《Software》2018,48(9):1660-1687

The efficient use of energy is essential to address concerns of cost and sustainability. Many data centers contain MapReduce clusters to process Big Data applications. A large number of machines and fault tolerance capabilities make MapReduce clusters energy inefficient. In this paper, we present a Configurator based on performance and energy models to improve the energy efficiency of MapReduce systems. Our solution is novel as it takes into account the dependence of the performance and energy consumption of a cluster on MapReduce parameters. While this dependence is known, we are the first to model it and design a Configurator to optimize these parameter settings for maximizing the energy efficiency of MapReduce systems. Our empirical evaluations show that the Configurator can result in up to 50% improvement in the energy efficiency of typical MapReduce applications in two architecturally different clusters.

20.

Reliable scalable symbolic computation: The design of SymGridPar2

《Computer Languages, Systems and Structures》2014,40(1):19-35

Symbolic computation is an important area of both Mathematics and Computer Science, with many large computations that would benefit from parallel execution. Symbolic computations are, however, challenging to parallelise as they have complex data and control structures, and both dynamic and highly irregular parallelism. The SymGridPar framework (SGP) has been developed to address these challenges on small-scale parallel architectures. However, the multicore revolution means that the number of cores and the number of failures are growing exponentially, and that the communication topology is becoming increasingly complex. Hence an improved parallel symbolic computation framework is required. This paper presents the design and initial evaluation of SymGridPar2 (SGP2), a successor to SymGridPar that is designed to provide scalability onto 10⁵ cores, and hence also provide fault tolerance. We present the SGP2 design goals, principles and architecture. We describe how scalability is achieved using layering and by allowing the programmer to control task placement. We outline how fault tolerance is provided by supervising remote computations, and outline higher-level fault tolerance abstractions. We describe the SGP2 implementation status and development plans. We report the scalability and efficiency, including weak scaling to about 32,000 cores, and investigate the overheads of tolerating faults for simple symbolic computations.