Reliability-aware performance model for optimal GPU-enabled cluster environment
Authors: Supada Laosooksathit, Raja Nassar, Chokchai Leangsuksun, Mihaela Paun
Affiliations: 1. Department of Computer Science, Louisiana Tech University, Ruston, Louisiana, USA
2. Department of Mathematics and Statistics, Louisiana Tech University, Ruston, Louisiana, USA
3. National Institute for Research and Development for Biological Sciences, Bucharest, Romania
Abstract: Given that the reliability of a very large-scale system is inversely related to its number of computing elements, fault tolerance has become a major concern in high performance computing, including the most recent deployments with graphics processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures within such systems. However, fault tolerance mechanisms generate additional costs, and these may cause a significant performance drop if the mechanisms are not used carefully. This paper presents a novel fault tolerance scheduling model that explores the interplay between GPGPU application performance and the reliability of a large GPU system. The work focuses on a checkpoint scheduling model that aims to minimize fault tolerance costs. Additionally, a GPU performance analysis is conducted, and the effect of a checkpoint/restart mechanism on application performance is studied and discussed.
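The trade-off the abstract describes can be illustrated with a small sketch. It does not reproduce the paper's scheduling model, which is not given here. It uses the classic first-order approximation by Young for the checkpoint interval, and it assumes independent, exponentially distributed node failures so that system MTBF shrinks in proportion to the node count. All function names and the numeric values (node MTBF, checkpoint cost) are hypothetical and chosen only for illustration.

```python
import math


def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """System MTBF assuming independent, exponentially distributed node failures."""
    return node_mtbf_hours / num_nodes


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the checkpoint interval: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def overhead_fraction(interval_hours: float,
                      checkpoint_cost_hours: float,
                      mtbf_hours: float) -> float:
    """First-order expected waste per unit of work: C/T for writing checkpoints
    plus T/(2M) for recomputing work lost since the last checkpoint."""
    return (checkpoint_cost_hours / interval_hours
            + interval_hours / (2.0 * mtbf_hours))


if __name__ == "__main__":
    node_mtbf = 50_000.0   # hypothetical per-node MTBF, in hours
    ckpt_cost = 0.1        # hypothetical checkpoint cost, in hours
    for nodes in (64, 1024, 16384):
        mtbf = system_mtbf(node_mtbf, nodes)
        interval = young_interval(ckpt_cost, mtbf)
        waste = overhead_fraction(interval, ckpt_cost, mtbf)
        print(f"nodes={nodes:6d}  system MTBF={mtbf:9.2f} h  "
              f"interval={interval:6.2f} h  overhead={waste:.3f}")
```

Under these assumptions, a larger node count lowers system MTBF, which shortens the suggested interval and raises the overhead fraction. Checkpointing more or less often than the suggested interval increases the first-order waste, which is the kind of cost a checkpoint schedule tries to minimize.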
This article is indexed in SpringerLink and other databases.