
Research of a Focused Crawler to Specific Topic Based on Heritrix
Cite this article: ZHU Min, LUO Sheng-xian. Research of a Focused Crawler to Specific Topic Based on Heritrix[J]. Microcomputer Development, 2012(2): 65-68.
Authors: ZHU Min, LUO Sheng-xian
Affiliation: School of Information Science and Technology, Chengdu University of Technology, Chengdu 610059, Sichuan, China
Abstract: By analyzing the component structure of the open-source Heritrix crawler and addressing the problems found in it, this work designs custom crawl logic and classes that perform targeted crawling of pages containing specific content, and introduces the BKDRHash algorithm for URL hashing. Together these implement topic-specific web information retrieval, improve the efficiency of data collection, and enable multi-threaded page crawling. Finally, pages on one specific topic are analyzed and their content is crawled. The HTMLParser tool converts the crawled page data into a specific format, which can serve as a data source for topic-oriented information search systems and data mining and prepares the ground for further research.

Keywords: focused crawler; Heritrix; BKDRHash algorithm; HTMLParser; search engine
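
The abstract names BKDRHash as the URL hashing algorithm but does not say how the hash value is used. The sketch below is a minimal illustration in Java, assuming the hash is used to spread URLs over a fixed number of work queues so that several crawler threads can fetch in parallel. The class name `BkdrUrlHash`, the helper `queueFor`, the seed 131, and the sample URLs are all hypothetical choices made for this example; none of them is taken from the paper or from the Heritrix API.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: not the authors' code and not part of Heritrix.
final class BkdrUrlHash {
    // 131 is a multiplier commonly used with BKDRHash (assumed here).
    private static final int SEED = 131;

    /** Returns a non-negative 31-bit BKDR hash of the URL's UTF-8 bytes. */
    static int bkdrHash(String url) {
        int hash = 0;
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            // int overflow wraps around silently, which is the intended behavior
            hash = hash * SEED + (b & 0xFF);
        }
        return hash & 0x7FFFFFFF; // clear the sign bit
    }

    /** Maps a URL to one of queueCount buckets (hypothetical crawl queues). */
    static int queueFor(String url, int queueCount) {
        if (queueCount <= 0) {
            throw new IllegalArgumentException("queueCount must be positive");
        }
        return bkdrHash(url) % queueCount;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/news/1.html",
            "http://example.com/news/2.html",
            "http://example.com/sports/index.html"
        };
        // URLs from the same host can land in different queues, so one large
        // site no longer has to be served by a single worker thread.
        for (String url : urls) {
            System.out.println(queueFor(url, 8) + "  " + url);
        }
    }
}
```

Hashing the whole URL rather than only the host name is the design choice that spreads a single site across several queues. The trade-off, under this assumption, is that per-host politeness limits have to be enforced separately.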
This article is indexed in the VIP (维普) database, among others.