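Focused Crawler Based on URL Pattern Sets
Cite this article:Hu Pingrui, Li Shijun. Focused Crawler Based on URL Pattern Sets[J]. Application Research of Computers, 2018, 35(3)
Authors:Hu Pingrui  Li Shijun
Affiliation:College of Computer, Wuhan University (both authors)
Funding:National Natural Science Foundation of China (61272109, 61502350)
Abstract:To improve the performance of focused crawlers, this paper proposes a focused crawler based on URL pattern sets that exploits how sites organize information and the characteristics of URLs. The crawler works in two phases. In the experimental crawling phase, it collects sample data from a site, builds URL patterns with a pattern construction algorithm based on a URL prefix tree, forms a pattern relation graph, and analyzes that graph with the HITS algorithm to compute the importance of each pattern. In the focused crawling phase, without downloading pages in advance, it uses the generated URL patterns to judge whether a page is topic-relevant and whether it can guide the crawler to crawl deeper, and it predicts the priority of links awaiting crawling from the importance of their URL patterns. Experiments show that, compared with existing focused crawlers, this crawler quickly steers crawling toward topic-relevant pages, maintains precision and recall, and effectively improves crawling efficiency.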

Keywords:focused crawler   URL pattern   URL prefix tree   pattern relation graph   URL pattern importance
Received:2016-10-31
Revised:2018-03-18
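
The abstract names a pattern construction algorithm based on a URL prefix tree but does not spell it out. The following is a minimal sketch of one way such a tree could yield patterns, not the authors' algorithm: the tokenization, the `max_branch` threshold, the `*` wildcard, and all function names are assumptions introduced for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def tokenize(url):
    """Split a URL into its host followed by the non-empty path segments."""
    parts = urlsplit(url)
    return [parts.netloc] + [seg for seg in parts.path.split("/") if seg]


class Node:
    """One node of the URL prefix tree (hypothetical structure)."""

    def __init__(self):
        self.children = defaultdict(Node)
        self.terminal = 0  # number of sample URLs that end at this node


def build_prefix_tree(urls):
    """Insert every sample URL into the tree, one token per level."""
    root = Node()
    for url in urls:
        node = root
        for token in tokenize(url):
            node = node.children[token]
        node.terminal += 1
    return root


def merge(nodes):
    """Merge sibling subtrees into a single node, summing their counts."""
    merged = Node()
    merged.terminal = sum(n.terminal for n in nodes)
    for key in {k for n in nodes for k in n.children}:
        merged.children[key] = merge(
            [n.children[key] for n in nodes if key in n.children]
        )
    return merged


def extract_patterns(node, max_branch=3, prefix=()):
    """Return {pattern: support}.

    Assumption: a node with more than max_branch children holds a variable
    segment (article IDs, dates), so its children collapse into '*'.
    """
    patterns = {}
    if node.terminal:
        patterns["/".join(prefix)] = node.terminal
    children = dict(node.children)
    if len(children) > max_branch:
        children = {"*": merge(list(children.values()))}
    for token, child in children.items():
        patterns.update(extract_patterns(child, max_branch, prefix + (token,)))
    return patterns


def matches(pattern, url):
    """Check a URL against a pattern without downloading the page."""
    p, t = pattern.split("/"), tokenize(url)
    return len(p) == len(t) and all(a in ("*", b) for a, b in zip(p, t))
```

With four sample URLs of the form `http://example.com/news/<id>.html` and one `http://example.com/about`, this sketch generalizes the article pages into a single `example.com/news/*` pattern, which can then be matched against unseen URLs.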

Focused Crawler Based on URL Patterns
Hu Pingrui and Li Shijun. Focused Crawler Based on URL Patterns[J]. Application Research of Computers, 2018, 35(3)
Authors:Hu Pingrui and Li Shijun
Affiliation:College of Computer, Wuhan University
Abstract:To improve the performance of focused crawlers, this paper proposes UPFC (Focused Crawler Based on URL Patterns), a crawler that exploits the way sites organize information and the features of URLs, and that works in two phases. In the experimental crawler phase, it collects site samples and builds URL patterns with a pattern construction algorithm based on a URL prefix tree. It then applies the HITS algorithm to the pattern graph to calculate the importance of each pattern. In the focused crawler phase, the URL patterns determine the topic relevance and guiding significance of pages without pre-downloading them, and the priority of links to be crawled is predicted from the importance of their URL patterns. Experimental results show that the crawler is guided to relevant pages quickly, maintains precision and recall, and improves crawling efficiency.
Keywords:focused crawler   URL pattern   URL prefix tree   pattern graph   importance of URL pattern
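
The scoring idea in the abstract (run HITS on the pattern graph, then rank uncrawled links by the importance of their URL pattern) can be illustrated roughly as follows. This is a sketch under stated assumptions, not the paper's implementation: the edge list, the pattern names, and the choice of the authority score as the "importance" value are all hypothetical.

```python
import math


def hits(edges, iterations=50):
    """Plain power-iteration HITS over a directed graph given as (src, dst) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority of a pattern: sum of hub scores of patterns linking to it.
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub of a pattern: sum of authority scores of patterns it links to.
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


def rank_links(link_to_pattern, importance):
    """Order uncrawled links by the importance of the pattern each one matches.

    Assumption: links whose pattern is unknown get importance 0.0.
    """
    return sorted(
        link_to_pattern,
        key=lambda url: importance.get(link_to_pattern[url], 0.0),
        reverse=True,
    )


# Hypothetical pattern graph: index pages link to list and article pages,
# and list pages link to article pages.
edges = [("index", "list"), ("list", "article"), ("index", "article")]
hub, auth = hits(edges)
frontier = {"/a": "index", "/b": "article", "/c": "list"}
ordered = rank_links(frontier, auth)
```

In this toy graph the article pattern receives the highest authority score and the index pattern the highest hub score, so a crawler that ranks by authority would fetch `/b` first. Whether the paper combines hub and authority scores differently is not stated in the abstract.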