
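A Focused Crawling Strategy Based on Dynamic Tunneling
Cite this article: JIANG Kun, ZHU Lei, WANG Yi-Chuan. A Focused Crawling Strategy Based on Dynamic Tunneling [J]. Computer Systems & Applications, 2020, 29(3): 253-260.
Authors: JIANG Kun, ZHU Lei, WANG Yi-Chuan
Affiliation: Faculty of Computer Science and Engineering, Xi'an University of Technology, Xi'an 710048, China (all three authors)
Funding: National Natural Science Foundation of China (61602374)
Abstract: Topic islands formed by Web pages seriously degrade the performance of the focused crawlers used by search engine systems, and manually adding large numbers of initial seed links to discover new topics cannot guarantee comprehensive coverage of on-topic pages. Building on an analysis of traditional focused crawling strategies based on content analysis, link analysis, and context graphs, this paper proposes a focused crawling strategy based on dynamic tunneling. The strategy combines page topic relevance computation with URL link relevance prediction to determine the topical relevance of the pages lying between topic islands, and it constructs a hierarchical topic judgment model to solve the weak-link problem between topic islands. The strategy also prevents the topic drift that occurs when a focused crawler collects too many off-topic pages, so that tunneling can be controlled dynamically along crawling directions that preserve the semantic information of the topic. In the experiments, the hierarchical structure of topic pages is used to assess page topic relevance and to extract keywords for the "sports" topic, and these keywords are then used to run index query tests on the collected topic pages. The results show that the crawling strategy based on dynamic tunneling handles the topic island problem well and clearly improves the precision and recall of a "sports" topic search engine.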

Keywords: web crawler; topic island; dynamic tunneling; crawling strategy
Received: 2019-07-19
Revised: 2019-08-22

Dynamic Tunneling Heuristic for Focused Crawling
JIANG Kun, ZHU Lei and WANG Yi-Chuan. Dynamic Tunneling Heuristic for Focused Crawling [J]. Computer Systems & Applications, 2020, 29(3): 253-260.
Authors: JIANG Kun, ZHU Lei and WANG Yi-Chuan
Affiliation: Faculty of Computer Science and Engineering, Xi'an University of Technology, Xi'an 710048, China (all three authors)
Abstract: Topic islands formed by Web pages seriously degrade the performance of focused crawlers. Manually adding more initial seed links to find new topics cannot guarantee comprehensive coverage of on-topic Web pages. Based on an analysis of typical crawling strategies, and taking the hierarchy of topic relevance into account, we propose a crawling strategy that uses dynamic tunneling. The strategy applies tunneling based on the topic of Web pages to discover new topics, and it constructs a hierarchical topic model to solve the problem of weak links between two topic islands. The strategy also prevents the topic drift caused by collecting too many off-topic pages, and thus dynamically controls the tunneling depth along crawling directions that preserve the semantic information of the topic. Experimental results show that the proposed method addresses the topic island issue more effectively, thereby improving the recall of focused search engines.
Keywords:focused crawler  topic island  crawling schema  dynamic tunneling
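
The abstract describes tunneling only at a high level, so the following is a minimal Python sketch of the general idea rather than the authors' algorithm. It assumes a toy in-memory link graph and a keyword-overlap relevance score. The names `TOPIC`, `PAGES`, `relevance` and `crawl`, the parameters `threshold` and `extra`, and the rule that an on-topic page with score s grants a tunnel budget of 1 + int(s * extra) are all hypothetical choices made for illustration. Link priority mixes the relevance of the parent page with the relevance of the anchor text, standing in for the combination of page relevance and URL relevance prediction that the abstract mentions.

```python
import heapq

# Hypothetical keyword set for a "sports" topic (not taken from the paper).
TOPIC = {"football", "match", "league", "score"}

def relevance(text):
    """Toy relevance: fraction of topic keywords that occur in the text."""
    return len(set(text.lower().split()) & TOPIC) / len(TOPIC)

def crawl(pages, seed, threshold=0.25, extra=1):
    """Best-first crawl over pages: url -> (text, [(anchor, target), ...]).

    `budget` is how many more consecutive off-topic pages may still be
    expanded on the current path. An on-topic page resets it, and a more
    relevant page grants a deeper tunnel (the "dynamic" part of this sketch).
    """
    frontier = [(-1.0, seed, 1)]  # (negated priority, url, budget)
    seen = {seed}
    collected = []
    while frontier:
        _, url, budget = heapq.heappop(frontier)
        text, links = pages[url]
        score = relevance(text)
        if score >= threshold:
            collected.append(url)
            budget = 1 + int(score * extra)  # reset the tunnel budget
        elif budget == 0:
            continue  # tunnel exhausted: do not follow this page's links
        else:
            budget -= 1  # spend one step of the tunnel
        for anchor, target in links:
            if target in pages and target not in seen:
                seen.add(target)
                # Link priority: parent relevance mixed with anchor relevance.
                priority = 0.5 * score + 0.5 * relevance(anchor)
                heapq.heappush(frontier, (-priority, target, budget))
    return collected

# Two topic islands: "a" and "d" are linked through off-topic "b" and "c";
# "w" sits behind three off-topic pages ("x", "y", "z").
PAGES = {
    "a": ("football match report league score",
          [("transfer news", "b"), ("weather", "x")]),
    "b": ("celebrity gossip", [("more", "c")]),
    "c": ("shopping deals", [("league score table", "d")]),
    "d": ("league score table football", []),
    "x": ("weather forecast", [("more weather", "y")]),
    "y": ("rain tomorrow", [("even more", "z")]),
    "z": ("sunny", [("football match", "w")]),
    "w": ("football match league", []),
}

print(crawl(PAGES, "a"))           # ['a', 'd']
print(crawl(PAGES, "a", extra=2))  # ['a', 'd', 'w']
```

With the default budget the crawler tunnels through the two off-topic pages that separate "a" from "d" but abandons the longer off-topic path toward "w", which limits drift. Raising `extra` deepens the tunnel at the cost of fetching more off-topic pages.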
This article is indexed in Wanfang Data and other databases.