Practices on Monitoring, Scheduling, and Interconnection Optimization of Super-Large Computing System


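Cite this article:	Qin Xiaoning, Wang Jiayao, Hu Menglong, Su Yong, Wan Wei, Li Bin, Dai Rong, Wang Zhipeng, Ji Qing. Practices on Monitoring, Scheduling, and Interconnection Optimization of Super-Large Computing System[J]. Frontiers of Data & Computing, 2020, 2(1): 55-69

Authors:	Qin Xiaoning, Wang Jiayao, Hu Menglong, Su Yong, Wan Wei, Li Bin, Dai Rong, Wang Zhipeng, Ji Qing

Affiliations:	College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 210016; Dawning Information Industry (Beijing) Co., Ltd., Beijing 100193; The High School Affiliated to Renmin University of China, Beijing 100080

Funding:	National Key Research and Development Program of China (2018YFB0204400).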

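Abstract:	[Objective] To address the practical challenges posed by super-large-scale computing systems, including monitoring data storms, the stability and flexibility of job scheduling, and the complexity and efficiency of the network, this paper shares experience and solutions from recent real-world practice. [Application background] As computing systems move from petascale toward exascale, the number of nodes can exceed 10,000. The network topology must be selected at the very start of system design, and day-to-day use of the system depends on efficient scheduling and timely monitoring. [Methods] This paper applies a distributed monitoring architecture based on dynamic load balancing, a distributed alarm architecture based on high-speed caching, source-code and configuration optimization of SLURM, and simulation-based comparison of nd-Torus network topologies; together these largely meet the needs of actual production use. [Results] The data show that, for a computing system of ~10,000 nodes, the real-time alarm database table can largely be kept within one million records. The optimized SLURM scheduling system meets the system's business-level scheduling requirements. On the network side, because of its low network diameter and short average communication distance, the 6D-Torus network shows some improvement over the Fat-Tree and 3D-Torus networks in performance and in network adapter and cable usage, with a saturated throughput above 40%. [Conclusions] The distributed monitoring and alarm architectures can effectively solve the monitoring data storm problem. After optimization, SLURM can schedule jobs on super-large-scale computing systems. In terms of the number of cables and switches used, 6D-Torus is more economical than a traditional Fat-Tree network, and it outperforms 3D-Torus, making it better suited to super-large-scale computing systems.
Keywords:	computing; monitoring; job scheduling; network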
Practices on Monitoring, Scheduling, and Interconnection Optimization of Super-Large Computing System

Qin Xiaoning, Wang Jiayao, Hu Menglong, Su Yong, Wan Wei, Li Bin, Dai Rong, Wang Zhipeng, Ji Qing. Practices on Monitoring, Scheduling, and Interconnection Optimization of Super-Large Computing System[J]. Frontiers of Data & Computing, 2020, 2(1): 55-69

Authors:	Qin Xiaoning, Wang Jiayao, Hu Menglong, Su Yong, Wan Wei, Li Bin, Dai Rong, Wang Zhipeng, Ji Qing

Affiliation:	(Institute of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 210016, China; Dawning Information Industry Co., Beijing 100193, China; The High School Affiliated to Renmin University, Beijing 100080, China)

Abstract:	[Objective] As super-large-scale computing systems become more common, a series of challenges has emerged, such as processing massive volumes of monitoring data, keeping job scheduling stable and flexible, and managing the complexity and efficiency of the interconnection fabric. This paper summarizes experience and solutions from recent projects in these three areas. [Context] Computing systems are moving from petascale to exascale, and a single system can easily exceed 10,000 nodes. The network topology must be chosen at the start of system design, while efficient scheduling and timely monitoring are indispensable during operation. [Methods] To address these challenges, this paper adopts a distributed monitoring architecture based on dynamic load balancing and a cache-based distributed alarm architecture, optimizes the source code and configuration of SLURM, and quantitatively simulates the performance of different nd-Torus topologies. [Results] The data show that, for a computing system of ~10,000 nodes, the data volume of the real-time alarm database table can be kept within one million records. The optimized SLURM scheduling system meets business-level scheduling requirements. On the network side, the 6D-Torus topology outperforms the 3D-Torus and Fat-Tree topologies in both efficiency and the number of switches and cables required, owing to its smaller network diameter and shorter average communication distance; its saturated throughput can reach 40%. [Conclusions] The distributed monitoring and alarm architectures effectively solve the problem of processing massive monitoring data. After optimization, SLURM can handle job scheduling on super-large computing systems. Compared with the Fat-Tree and 3D-Torus topologies, 6D-Torus is a better choice for super-large computing systems.

Keywords:	computing; monitoring; job scheduling; network
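
The abstract attributes the advantage of 6D-Torus to its smaller network diameter and shorter average communication distance. The sketch below illustrates that relationship using the standard hop-count formulas for a torus with wraparound links. The two 10,000-node shapes (20 x 20 x 25 and 4 x 4 x 5 x 5 x 5 x 5) and all function names are assumptions chosen for illustration; they are not the configurations simulated in the paper, and the sketch ignores routing, link bandwidth, and switch or cable counts.

```python
import math


def ring_distance(a, b, k):
    """Shortest hop count between positions a and b on a ring of k nodes."""
    d = abs(a - b) % k
    return min(d, k - d)


def torus_diameter(dims):
    """Largest shortest-path hop count: each dimension contributes floor(k / 2)."""
    return sum(k // 2 for k in dims)


def torus_average_distance(dims):
    """Mean hop count from one node to every node (itself included).

    A torus is vertex-transitive and its hop distance is the sum of the
    per-dimension ring distances, so the mean is the sum of per-ring means.
    """
    return sum(
        sum(ring_distance(0, pos, k) for pos in range(k)) / k
        for k in dims
    )


# Hypothetical shapes, both with exactly 10,000 nodes (assumption, not from the paper).
shape_3d = (20, 20, 25)
shape_6d = (4, 4, 5, 5, 5, 5)

for name, shape in (("3D-Torus", shape_3d), ("6D-Torus", shape_6d)):
    print(
        name,
        "nodes:", math.prod(shape),
        "diameter:", torus_diameter(shape),
        "average distance:", round(torus_average_distance(shape), 2),
    )
```

Under these assumed shapes, the diameter drops from 32 hops (3D) to 12 hops (6D) and the average distance from 16.24 to 6.8 hops, which is the kind of gap the abstract refers to.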
This article is indexed in databases including VIP and Wanfang Data.
