Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing
Affiliation: 1. Institute for Applied Analysis and Numerical Simulation, University of Stuttgart, Pfaffenwaldring 57, 70569 Stuttgart, Germany; 2. Fakultät für Mathematik (LS3), TU Dortmund, Vogelpothsweg 87, 44227 Dortmund, Germany
Abstract: We analyse novel fault-tolerance schemes for data loss in multigrid solvers, which essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To improve efficiency compared to conventional global checkpointing, we exploit the inherent data compression of the multigrid hierarchy and relax the synchronicity requirement through a local-failure local-recovery approach. Using smoothness considerations, we experimentally identify the root cause of convergence degradation in the presence of data loss. The resulting schemes form a family of techniques that can be tailored to the expected error probability of (future) large-scale machines. A performance model gives further insight into the benefits and applicability of our techniques.
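The abstract states the idea but not the algorithm. What follows is a minimal sketch of that idea only, and every assumption in it is hypothetical. It uses a 1D Poisson problem -u'' = 1 with a geometric V-cycle (weighted Jacobi smoothing, full-weighting restriction, linear interpolation). The checkpoint stores the iterate restricted three levels down the hierarchy, about eight times smaller than the fine-grid vector. A simulated loss of one contiguous block of entries is then refilled from the prolongated checkpoint, and all surviving entries are left untouched. The function names (`compress`, `decompress`, `solve`) and all parameters are illustrative and not taken from the paper. For simplicity, the checkpoint is written synchronously rather than asynchronously.

```python
import math


def norm(v):
    # Euclidean norm of a list of floats
    return math.sqrt(sum(x * x for x in v))


def residual(u, f, h):
    # r = f - A u for A = tridiag(-1, 2, -1) / h^2 with zero Dirichlet boundary values
    p = [0.0] + u + [0.0]
    return [f[i] - (2.0 * p[i + 1] - p[i] - p[i + 2]) / h**2 for i in range(len(u))]


def smooth(u, f, h, sweeps=3, omega=2.0 / 3.0):
    # Weighted Jacobi: u <- u + omega * D^{-1} r with diagonal D = 2 / h^2
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [ui + omega * 0.5 * h**2 * ri for ui, ri in zip(u, r)]
    return u


def restrict(v):
    # Full weighting: 2m + 1 fine-grid values -> m coarse-grid values
    return [0.25 * (v[2 * j] + 2.0 * v[2 * j + 1] + v[2 * j + 2])
            for j in range((len(v) - 1) // 2)]


def prolong(e):
    # Linear interpolation: m coarse-grid values -> 2m + 1 fine-grid values
    p = [0.0] + e + [0.0]
    out = []
    for j in range(len(e)):
        out += [0.5 * (p[j] + p[j + 1]), e[j]]
    return out + [0.5 * (p[-2] + p[-1])]


def v_cycle(u, f, h):
    # One geometric multigrid V(3,3)-cycle for A u = f
    if len(u) == 1:
        return [0.5 * h**2 * f[0]]  # coarsest grid: solve (2 / h^2) u = f exactly
    u = smooth(u, f, h)
    rc = restrict(residual(u, f, h))
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h)
    u = [ui + ei for ui, ei in zip(u, prolong(ec))]
    return smooth(u, f, h)


def compress(u, levels):
    # Hypothetical checkpoint format: the iterate restricted `levels` times
    for _ in range(levels):
        u = restrict(u)
    return u


def decompress(c, levels):
    # Prolongate the stored coarse iterate back to the finest grid
    for _ in range(levels):
        c = prolong(c)
    return c


def solve(recover, n=255, levels=3, fail_at=4, lost=range(64, 128), tol=1e-8, max_iter=50):
    # Returns (V-cycles to reach tol, relative residual right after the simulated failure)
    h = 1.0 / (n + 1)
    f = [1.0] * n  # -u'' = 1 on (0, 1), exact solution u = x(1 - x) / 2
    u = [0.0] * n
    r0 = norm(residual(u, f, h))
    ckpt, r_fail = None, None
    for it in range(1, max_iter + 1):
        u = v_cycle(u, f, h)
        if it == fail_at - 1:
            ckpt = compress(u, levels)  # taken synchronously here for simplicity
        if it == fail_at:
            # Hypothetical failure: the entries in `lost` are wiped, then refilled
            restored = decompress(ckpt, levels) if recover else [0.0] * n
            for i in lost:  # local recovery: only the lost entries change
                u[i] = restored[i]
            r_fail = norm(residual(u, f, h)) / r0
        if norm(residual(u, f, h)) / r0 < tol:
            return it, r_fail
    return max_iter, r_fail


if __name__ == "__main__":
    for recover in (False, True):
        its, r_fail = solve(recover)
        print(f"recover={recover}: {its} V-cycles, relative residual after failure {r_fail:.1e}")
```

Running the script compares zero-filling the lost block (`recover=False`) with restoring it from the compressed checkpoint (`recover=True`). Each checkpoint holds only 31 of the 255 fine-grid values, which illustrates the kind of saving the abstract attributes to the multigrid hierarchy. The paper's actual compression, asynchrony and recovery procedure may differ.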
This article is indexed in ScienceDirect and other databases.