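Research and Design of a Data Acquisition and Analysis Platform
Cite this article: YANG Yu. Research and Design of a Data Acquisition and Analysis Platform[J]. 广东电脑与电讯 (Computer & Telecommunication), 2021, 1(11): 90-94.
Author: YANG Yu
Affiliation: Guizhou Electronic Information Vocational and Technical College (贵州电子信息职业技术学院)
Abstract: With the continuing development of Internet big data technology, web data collection has become one of the popular research areas. Python data collection libraries such as Urllib, Requests, and Selenium are inefficient and prone to blocking, and current data collection and data analysis platforms are independent functional modules that do not form a closed loop, which leads to a poor user experience. To address these problems, a data collection and analysis platform is proposed. First, the Scrapy framework is used to collect the data. Second, the collected data is cleaned with the Kettle tool. Third, the processed results are stored in a MySQL database. Finally, a Web system is built with the Flask framework and Echarts to visualize the analysis results. Data from the Beijing public transport website serves as the crawler test case: information such as bus line types and bus routes is collected, analyzed, and displayed. The analysis results offer some guidance for urban public transport planning, and the platform is stable, reliable, simple to operate, and responsive in real time.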
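The cleaning and storage steps described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: a plain Python function stands in for the Kettle cleaning job, the standard-library `sqlite3` module stands in for MySQL so the example runs without a database server, and the records, the `bus_line` table, and the field names `line_name` and `line_type` are all hypothetical.

```python
import sqlite3

# Hypothetical raw records, standing in for items a Scrapy spider might yield.
RAW_ITEMS = [
    {"line_name": " Route 1 ", "line_type": "Urban"},
    {"line_name": "Route 1", "line_type": "Urban"},       # duplicate once trimmed
    {"line_name": "Route 2", "line_type": ""},            # missing line type
    {"line_name": "Route 300", "line_type": "Suburban "},
]


def clean(items):
    """Trim whitespace, drop incomplete rows, and remove duplicate line names.

    This plain function is a stand-in for the Kettle cleaning step.
    """
    seen = set()
    cleaned = []
    for item in items:
        name = item.get("line_name", "").strip()
        kind = item.get("line_type", "").strip()
        if not name or not kind:
            continue  # skip rows with a missing field
        if name in seen:
            continue  # skip duplicates
        seen.add(name)
        cleaned.append((name, kind))
    return cleaned


def store(rows, conn):
    """Write cleaned rows to a table (SQLite here as a stand-in for MySQL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bus_line ("
        "line_name TEXT PRIMARY KEY, line_type TEXT NOT NULL)"
    )
    conn.executemany("INSERT OR REPLACE INTO bus_line VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
rows = clean(RAW_ITEMS)
store(rows, conn)
stored = conn.execute(
    "SELECT line_name, line_type FROM bus_line ORDER BY line_name"
).fetchall()
print(stored)
```

Keeping the cleaning logic separate from the storage call mirrors the staged pipeline in the abstract, where each stage hands its output to the next.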


Research and Design of a Data Acquisition and Analysis Platform
YANG Yu. Research and Design of a Data Acquisition and Analysis Platform[J]. Computer & Telecommunication, 2021, 1(11): 90-94.
Author: YANG Yu
Abstract: With the continuous development of big data technology, web data collection has become a popular research field. Python data collection libraries such as Urllib, Requests, and Selenium are inefficient and prone to blocking, and current data collection and analysis platforms are independent functional modules that do not form a closed loop and give a poor user experience. To solve these problems, this paper proposes a data collection and analysis platform. First, the Scrapy framework is used to collect the data, and then the Kettle tool is used to clean it. The processed results are saved to a MySQL database. Finally, the Flask framework is combined with Echarts to build a Web system that visualizes the analysis results. The paper uses data from the Beijing public transport website as the crawler test case, collecting, analyzing, and displaying information such as bus line types and bus routes. The analysis results offer some guidance for urban public transport planning, and the platform is stable, reliable, and easy to operate.
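The visualization step can be sketched as follows, under stated assumptions rather than as the paper's code. The sketch counts bus lines per type and builds a dictionary in the shape of an Echarts bar-chart option (`xAxis`, `yAxis`, `series`); in a Flask application, a view function could return this JSON for the page's Echarts instance to render. The sample rows, the chart title, and the function name `build_bar_option` are hypothetical.

```python
import json
from collections import Counter

# Hypothetical cleaned rows: (line name, line type).
ROWS = [
    ("Route 1", "Urban"),
    ("Route 2", "Urban"),
    ("Route 300", "Suburban"),
]


def build_bar_option(rows):
    """Count lines per type and return an Echarts-style bar chart option."""
    counts = Counter(kind for _, kind in rows)
    categories = sorted(counts)  # stable, alphabetical category order
    return {
        "title": {"text": "Bus lines by type"},
        "xAxis": {"type": "category", "data": categories},
        "yAxis": {"type": "value"},
        "series": [{"type": "bar", "data": [counts[c] for c in categories]}],
    }


option = build_bar_option(ROWS)
payload = json.dumps(option)  # JSON string a web endpoint could return
print(payload)
```

Building the option on the server keeps the front end thin: the page only needs to pass the decoded object to the chart.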