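Focused Crawling Based on Genetic Algorithms
Cite this article: Zhang Hailiang, Yuan Daohua. Focused Crawling Based on Genetic Algorithms[J]. Microcomputer Development, 2012(8): 48-52.
Authors: Zhang Hailiang  Yuan Daohua
Affiliation: College of Computer Science, Sichuan University, Chengdu, Sichuan 610065
Abstract: Current search strategies for focused web crawlers have difficulty finding an optimal solution at global scope. Building on an analysis and study of genetic algorithms, this paper designs a focused crawler scheme based on a genetic algorithm. The scheme introduces a PageRank algorithm combined with text content; uses the vector space model to compute the topic relevance of web pages; judges the importance of a web page from its link structure together with its topic relevance; selects the genetic factors used during crawling according to page importance; and sets a fitness function to filter pages relevant to the topic. Compared with an ordinary focused crawler, the strategy retrieves a large number of pages with high topic relevance, raises the importance of the pages retrieved, meets users' needs when searching for pages on a desired topic, and to some extent resolves the problem described above.

Keywords: genetic algorithm  crawler  focused crawler  topic relevance  web page importance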

Focused Crawling Based on Genetic Algorithms
ZHANG Hai-liang, YUAN Dao-hua. Focused Crawling Based on Genetic Algorithms[J]. Microcomputer Development, 2012(8): 48-52.
Authors: ZHANG Hai-liang  YUAN Dao-hua
Affiliation: College of Computer Science, Sichuan University, Chengdu 610065, China
Abstract: Current search strategies for focused crawlers have difficulty finding a globally optimal solution. Drawing on an analysis and study of genetic algorithms, this paper proposes a focused crawler method based on a genetic algorithm. The method introduces a PageRank algorithm combined with text content, computes page topic similarity with the vector space model, and judges the importance of a web page from its link structure and its topic similarity. Genetic factors are selected on the basis of page importance, and a fitness function is set to select pages relevant to the topic. Compared with an ordinary focused crawler, the crawler based on genetic algorithms obtains a large number of web pages that correlate strongly with the topic, improves the importance of the pages retrieved, and satisfies users' demand for pages on the topics they are interested in. To a certain extent, the problem described above is thereby solved.
Keywords:genetic algorithm  crawler  focused crawler  topic similarity  web importance
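
The abstract names three ingredients: vector space model topic similarity, a page importance score that combines link structure with topic similarity, and a fitness function that filters pages. The Python sketch below illustrates how these pieces could fit together. It is not the authors' implementation: the function names, the linear blend weight `alpha`, the threshold value, and the example URLs are all assumptions made for illustration, and the link score is taken as a precomputed number rather than derived from PageRank.

```python
import math
from collections import Counter


def cosine_relevance(page_text, topic_terms):
    """Vector space model: cosine similarity of raw term-frequency vectors."""
    page_vec = Counter(page_text.lower().split())
    topic_vec = Counter(term.lower() for term in topic_terms)
    dot = sum(page_vec[term] * weight for term, weight in topic_vec.items())
    norm_page = math.sqrt(sum(c * c for c in page_vec.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_vec.values()))
    if norm_page == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_page * norm_topic)


def importance(link_score, relevance, alpha=0.5):
    """Assumed linear blend of a link-structure score and topic relevance."""
    return alpha * link_score + (1 - alpha) * relevance


def select_by_fitness(pages, topic_terms, threshold=0.5, alpha=0.5):
    """Keep pages whose fitness (here: importance) reaches the threshold.

    `pages` is a list of (url, text, link_score) tuples, where link_score is
    a hypothetical precomputed value in [0, 1], for example a normalized
    PageRank. Returns (url, fitness) pairs sorted by descending fitness.
    """
    scored = []
    for url, text, link_score in pages:
        fitness = importance(link_score, cosine_relevance(text, topic_terms), alpha)
        if fitness >= threshold:
            scored.append((url, fitness))
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    topic = ["genetic", "algorithm", "crawler"]
    pages = [
        ("http://example.com/ga", "genetic algorithm crawler", 0.8),
        ("http://example.com/food", "cooking recipes pasta", 0.9),
        ("http://example.com/empty", "", 0.0),
    ]
    for url, fitness in select_by_fitness(pages, topic):
        print(url, round(fitness, 3))
```

In this sketch a page with a high link score but no topical overlap falls below the threshold, which mirrors the abstract's point that importance depends on both link structure and topic similarity. In a fuller genetic-algorithm crawler, the surviving pages would seed the next generation of links to crawl.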
This article is indexed in VIP (维普) and other databases.