
Focused Crawler Based on URL Patterns (基于URL模式集的主题爬虫)

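Cite this article:	Hu Pingrui (胡萍瑞), Li Shijun (李石君). Focused Crawler Based on URL Patterns (基于URL模式集的主题爬虫)[J]. Application Research of Computers (计算机应用研究), 2018, 35(3)

Authors:	Hu Pingrui (胡萍瑞), Li Shijun (李石君)

Affiliation:	College of Computer, Wuhan University (武汉大学计算机学院), both authors

Funding:	National Natural Science Foundation of China (61272109, 61502350)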

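Abstract:	To improve the performance of focused crawlers, this paper proposes a focused crawler based on a set of URL patterns, drawing on how sites organize their information and on the characteristics of URLs. The crawler works in two phases. In the experimental crawling phase, it collects sample data from a site, builds URL patterns with a pattern construction algorithm based on a URL prefix tree, assembles the patterns into a pattern relation graph, and analyzes that graph with the HITS algorithm to compute the importance of each pattern. In the focused crawling phase, it uses the generated URL patterns to judge, without downloading a page first, whether the page is topic-relevant and whether it can guide the crawler toward deeper crawling, and it predicts the priority of links waiting to be crawled from the importance of their URL patterns. Experiments show that, compared with existing focused crawlers, this crawler quickly steers crawling toward topic-relevant pages, maintains precision and recall, and effectively improves crawling efficiency.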
Keywords:	focused crawler; URL pattern; URL prefix tree; pattern relation graph; URL pattern importance
Received:	2016-10-31
Revised:	2018-03-18
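
The abstract does not spell out the pattern construction algorithm, so the following is only a minimal sketch of one way a URL prefix tree could yield patterns. It assumes that URLs are split into host and path segments, and that any tree node with more than `max_branch` distinct children has those children merged into a single `*` wildcard. The names `Node`, `build_tree`, `generalize`, `patterns`, `match`, and the `max_branch` threshold are hypothetical and are not taken from the paper.

```python
from urllib.parse import urlparse


class Node:
    """One node of the URL prefix tree (hypothetical structure)."""

    def __init__(self):
        self.children = {}  # segment -> Node
        self.ends = 0       # number of sample URLs that end at this node


def segments(url):
    """Split a URL into its host followed by the non-empty path segments."""
    parts = urlparse(url)
    return [parts.netloc] + [s for s in parts.path.split("/") if s]


def build_tree(urls):
    """Insert every sample URL into a prefix tree, one segment per level."""
    root = Node()
    for url in urls:
        node = root
        for seg in segments(url):
            node = node.children.setdefault(seg, Node())
        node.ends += 1
    return root


def merge(dst, src):
    """Recursively fold the subtree rooted at src into dst."""
    dst.ends += src.ends
    for seg, child in src.children.items():
        merge(dst.children.setdefault(seg, Node()), child)


def generalize(node, max_branch):
    """Assumed rule: collapse a high-fanout level into a single '*' child."""
    if len(node.children) > max_branch:
        wildcard = Node()
        for child in node.children.values():
            merge(wildcard, child)
        node.children = {"*": wildcard}
    for child in node.children.values():
        generalize(child, max_branch)


def patterns(node, prefix=()):
    """Yield every pattern at which at least one sample URL ended."""
    if node.ends:
        yield "/".join(prefix)
    for seg, child in sorted(node.children.items()):
        yield from patterns(child, prefix + (seg,))


def match(root, url):
    """Return the pattern a URL falls under, or None. No page download needed."""
    node, out = root, []
    for seg in segments(url):
        key = seg if seg in node.children else "*"
        if key not in node.children:
            return None
        out.append(key)
        node = node.children[key]
    return "/".join(out) if node.ends else None


# Made-up sample URLs standing in for the sampled site data.
samples = [
    "http://example.com/news/1001.html",
    "http://example.com/news/1002.html",
    "http://example.com/news/1003.html",
    "http://example.com/list/1",
    "http://example.com/list/2",
]
root = build_tree(samples)
generalize(root, max_branch=2)
```

With these samples, the three article URLs under `news` collapse into one pattern, while the two `list` URLs stay literal because they do not exceed the threshold. A previously unseen URL such as `http://example.com/news/9999.html` can then be assigned to a pattern from its string alone.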
Focused Crawler Based on URL Patterns

Hu Pingrui and Li Shijun. Focused Crawler Based on URL Patterns[J]. Application Research of Computers, 2018, 35(3)

Authors:	Hu Pingrui and Li Shijun

Affiliation:	College of Computer, Wuhan University

Abstract:	To improve the performance of focused crawlers, this paper proposes UPFC (Focused Crawler Based on URL Patterns), a two-phase crawler that exploits how sites organize their information and the features of URLs. In the experimental crawler phase, it collects site samples and builds URL patterns with a pattern construction algorithm based on a URL prefix tree. It then applies the HITS algorithm to the pattern graph to calculate the importance of each pattern. In the focused crawler phase, those URL patterns determine, without pre-downloading pages, whether a page is topic-relevant and whether it can guide the crawler deeper, and the priority of links to be crawled is predicted from the importance of their URL patterns. Experimental results show that, compared with existing focused crawlers, UPFC guides the crawler to relevant pages quickly, maintains precision and recall, and improves crawling efficiency.

Keywords:	focused crawler; URL pattern; URL prefix tree; pattern graph; importance of URL pattern
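
To make the scoring step concrete, here is a rough sketch of HITS run over a small pattern graph, followed by a frontier ordered by pattern importance. The abstract does not say how hub and authority scores are combined into one importance value, so the sum used here is an assumption, as are the example graph, the pattern strings, and the function names `hits` and `crawl_order`.

```python
import heapq
import math


def hits(graph, iterations=50):
    """Plain HITS power iteration on a directed graph {node: [successors]}.

    Returns (hub, authority) dictionaries, each L2-normalized.
    """
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a node: sum of hub scores of the nodes linking to it.
        auth = {n: 0.0 for n in nodes}
        for u, targets in graph.items():
            for v in targets:
                auth[v] += hub[u]
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub of a node: sum of authority scores of the nodes it links to.
        hub = {n: sum(auth[v] for v in graph.get(n, [])) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth


def crawl_order(candidates, importance):
    """Yield URLs by descending pattern importance.

    candidates is a list of (url, pattern) pairs. URLs whose pattern has no
    importance score are dropped, which is one possible filtering policy.
    """
    heap = []
    for order, (url, pattern) in enumerate(candidates):
        if pattern in importance:
            # Negate the score because heapq is a min-heap; order breaks ties.
            heapq.heappush(heap, (-importance[pattern], order, url))
    while heap:
        yield heapq.heappop(heap)[2]


# Made-up pattern graph: an edge means pages of one pattern link to another.
pattern_graph = {
    "example.com": ["example.com/list/*", "example.com/about"],
    "example.com/list/*": ["example.com/list/*", "example.com/news/*"],
    "example.com/news/*": ["example.com/news/*", "example.com/list/*"],
}
hub, auth = hits(pattern_graph)

# Assumed combination rule: importance = hub score + authority score.
importance = {p: hub[p] + auth[p] for p in hub}

candidates = [
    ("http://example.com/about", "example.com/about"),
    ("http://example.com/news/7.html", "example.com/news/*"),
    ("http://example.com/list/3", "example.com/list/*"),
    ("http://example.com/ads/1", "example.com/ads/*"),
]
ordered = list(crawl_order(candidates, importance))
```

In this toy graph the list pattern collects links from every other linking pattern, so it ends up with the highest authority and its URL is crawled first. The `ads` URL is skipped because its pattern never appeared in the graph.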
