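MatchLink: A Focused Crawling Method
Cite this article: JIANG Zong-li, LU Guo-xiang. MatchLink: A Focused Crawling Method[J]. Journal of Beijing Polytechnic University, 2007, 33(11): 1227-1232.
Authors: JIANG Zong-li  LU Guo-xiang
Affiliation: College of Computer Science, Beijing University of Technology, Beijing 100022, China
Abstract: To find the information users care about more quickly in the vast amount of information on the Web, a focused crawling method, MatchLink, is proposed. It evaluates the topic relevance of a web page link with the document vector model, computes the topic relevance of the page containing the link with the Naive Bayes algorithm and a multilayer classification method, and downloads topic-relevant pages first according to these 2 relevance scores. Experiments show that its results are better than those of BestFirst and BreadthFirst.

Keywords: focused crawler  document vector model  Naive Bayes
Article ID: 0254-0037(2007)11-1227-06
Received: 2006-08-31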

MatchLink: A Focused Crawling Method
JIANG Zong-li, LU Guo-xiang. MatchLink: A Focused Crawling Method[J]. Journal of Beijing Polytechnic University, 2007, 33(11): 1227-1232.
Authors: JIANG Zong-li  LU Guo-xiang
Abstract: Finding what a user wants in the tremendous amount of Web information is a great challenge for web search engines. By restricting downloads to web pages in a given domain, focused crawlers can save a great deal of work and improve the quality of the information they provide. We put forward a focused crawling method, MatchLink. It uses the document vector model to evaluate the topic relevance of the anchor, and uses the Naive Bayes algorithm and a multilayer classification method to compute the topic relevance of the web page containing the anchor. According to these two relevance scores, topic-relevant web pages are given priority for download. Experiments show that its results are better than those of BestFirst and BreadthFirst.
Keywords:search engines  document handling  Naive Bayes methods
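
The abstract describes ranking links by two relevance scores and downloading the most relevant pages first. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes raw term-frequency vectors with cosine similarity as the document vector model, takes the page relevance as a number supplied by the caller (standing in for the output of a Naive Bayes classifier), and combines the two scores with a weighted sum. The names `cosine_similarity`, `link_priority`, `Frontier`, and the `weight` parameter are hypothetical; the abstract does not state how the two scores are combined.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_priority(anchor_text, topic_text, page_relevance, weight=0.5):
    """Assumed combination rule: weighted sum of link and page relevance."""
    link_relevance = cosine_similarity(anchor_text, topic_text)
    return weight * link_relevance + (1 - weight) * page_relevance


class Frontier:
    """Crawl frontier that returns the highest-priority unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so equal priorities pop in insertion order

    def push(self, url, priority):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

In use, a crawler would call `push` for each link extracted from a downloaded page, passing the anchor text, a topic description, and the classifier's score for the containing page, and then call `pop` to choose the next URL to fetch.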
This article is indexed in databases including CNKI, VIP, and Wanfang Data.