
Research and Design of a Distributed High-Performance Web Crawler Based on a Cloud Platform
Cite as: SHI Enming, XIAO Xiaojun, LU Yu. Research and design of distributed high-performance network reptiles based on cloud platform[J]. Telecommunications Science, 2017, 33(8).
Authors: SHI Enming, XIAO Xiaojun, LU Yu
Affiliation: Guangzhou Youyi Information Technology Co., Ltd. (广州优亿信息科技有限公司), Guangzhou 510630, Guangdong, China
Abstract: With the arrival of the big data era, data has become the most valuable resource, and web crawling, as an important means of collecting external data, has become a standard tool for data analysis. This paper presents the design and implementation of a high-performance, flexible, and convenient crawler architecture based on a cloud platform, and describes in detail the overall architecture, the distributed design, and the design of each module. Each crawler module is packaged with Docker, and Kubernetes handles resource scheduling and management for the cluster. Performance is optimized with a combination of strategies, including an MD5 deduplication tree algorithm, DNS optimization, and asynchronous I/O. Experiments show that the crawler has a fairly clear performance advantage over the unoptimized scheme.

Keywords: distributed system architecture; web crawler; Docker; high performance
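
The abstract names an MD5 deduplication tree but does not describe its structure. As a rough illustration only, the sketch below assumes one plausible reading: each URL is hashed with MD5, and the 32 hex digits of the digest are stored as a path in a trie, so a URL counts as already seen when its full path exists. The class name `Md5DedupTree` and its methods are hypothetical and are not taken from the paper.

```python
import hashlib


class Md5DedupTree:
    """Hypothetical URL dedup set: a trie keyed by the hex digits of MD5(url)."""

    def __init__(self):
        self._root = {}   # nested dicts, one level per hex digit
        self._count = 0   # number of distinct digests stored

    def add(self, url: str) -> bool:
        """Insert url. Return True if it was new, False if already seen."""
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()  # 32 hex chars
        node = self._root
        created = False
        for ch in digest:
            child = node.get(ch)
            if child is None:
                child = node[ch] = {}
                created = True
            node = child
        # All digests have the same length, so a new digest always creates
        # at least one node, and a repeated digest creates none.
        if created:
            self._count += 1
        return created

    def __len__(self) -> int:
        return self._count


# Example: filter a crawl frontier so each URL is fetched at most once.
seen = Md5DedupTree()
frontier = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
to_fetch = [u for u in frontier if seen.add(u)]
```

Hashing first gives every key a fixed length regardless of URL length. In a distributed crawler this check would presumably sit in a shared service rather than in each worker's memory, but the abstract does not say how the authors deploy it.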
This article is indexed in Wanfang Data and other databases.