
Research on a Heritrix-Based Focused Crawler for Specific Topics

Cite this article:	ZHU Min, LUO Sheng-xian. Research of a Focused Crawler to Specific Topic Based on Heritrix [J]. Microcomputer Development, 2012(2): 65-68.

Authors:	ZHU Min, LUO Sheng-xian

Affiliation:	School of Information Science and Technology, Chengdu University of Technology, Chengdu 610059, Sichuan, China

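Abstract:	After analyzing the component structure of the open-source Heritrix crawler and the problems in the Heritrix project, this work designs a custom crawling logic together with classes that crawl only pages containing specified content, and introduces the BKDRHash algorithm for hashing URLs. The result is a topic-specific web search that retrieves data more efficiently and crawls pages with multiple threads. Finally, pages on one specific topic are analyzed and their content is crawled; the HTMLParser tool converts the crawled page data into a specific format that can serve as a data source for topic-oriented search systems and for data mining, laying the groundwork for further research.
Keywords:	focused crawler; Heritrix; BKDRHash algorithm; HTMLParser; search engine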
Research of a Focused Crawler to Specific Topic Based on Heritrix

ZHU Min, LUO Sheng-xian. Research of a Focused Crawler to Specific Topic Based on Heritrix [J]. Microcomputer Development, 2012(2): 65-68.

Authors:	ZHU Min, LUO Sheng-xian

Affiliation:	School of Information Science and Technology, Chengdu University of Technology, Chengdu 610059, China

Abstract:	Based on an analysis of the component architecture of the open-source Heritrix crawler and of the problems in the Heritrix project, this paper designs a specific crawling logic and classes that directly crawl pages containing particular content. It introduces the BKDRHash algorithm for URL hashing in order to search for pages on a particular topic, improve the efficiency of data retrieval, and support multi-threaded crawling. Finally, pages on one particular topic are analyzed and their content is crawled, and the HTMLParser tool is used to convert the crawled page data into a specific format. The output can serve as a data source for topic-oriented information search systems and for data mining, and prepares the ground for further research.

Keywords:	focused crawler; Heritrix; BKDRHash algorithm; HTMLParser; search engine
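
The abstract names BKDRHash as the URL hashing step but does not show it. The following is a minimal sketch of the widely known BKDR string hash, not the authors' code: the seed value 131, the class name `BkdrUrlHash`, and the `queueIndex` helper that spreads URLs over a fixed number of queues are all assumptions made for illustration, and no Heritrix API is used.

```java
import java.util.ArrayList;
import java.util.List;

class BkdrUrlHash {
    // Assumed seed: 131 is one of the customary BKDR multipliers (31, 131, 1313, ...).
    private static final int SEED = 131;

    /** Returns a non-negative 31-bit BKDR hash of the given string. */
    static int bkdrHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            // Java int arithmetic wraps on overflow, which is the intended behavior here.
            hash = hash * SEED + s.charAt(i);
        }
        return hash & 0x7FFFFFFF; // clear the sign bit so the result is never negative
    }

    /** Hypothetical helper: maps a URL to one of queueCount work queues. */
    static int queueIndex(String url, int queueCount) {
        return bkdrHash(url) % queueCount;
    }

    public static void main(String[] args) {
        int queueCount = 4; // illustrative value, e.g. one queue per worker thread
        List<List<String>> queues = new ArrayList<>();
        for (int i = 0; i < queueCount; i++) {
            queues.add(new ArrayList<>());
        }

        // Placeholder URLs, not taken from the paper.
        String[] urls = {
            "http://example.com/",
            "http://example.com/news/1.html",
            "http://example.org/topic/crawler",
        };
        for (String url : urls) {
            queues.get(queueIndex(url, queueCount)).add(url);
        }
        for (int i = 0; i < queueCount; i++) {
            System.out.println("queue " + i + ": " + queues.get(i));
        }
    }
}
```

Because the same URL always hashes to the same value, each URL lands in one predictable queue, which is one plausible way a hash of this kind could help divide crawl work among threads.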
This article is indexed in VIP (维普) and other databases.
