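Crawling Dynamic Web Pages in WWW Forums
Cite this article: LI Kui, CHENG Xueqi, GUO Yan, ZHANG Kai. Crawling Dynamic Web Pages in WWW Forums[J]. Computer Engineering, 2007, 33(6): 80-82.
Authors: LI Kui  CHENG Xueqi  GUO Yan  ZHANG Kai
Affiliations: 1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080; Graduate School, Chinese Academy of Sciences, Beijing 100039
2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080
Funding: National Key Basic Research Program of China (973 Program)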
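Abstract: Web forums have become a major channel for publishing information on the Internet, and both retrieval and mining of forum information depend on first acquiring that information. However, traditional breadth-first crawlers designed for static Web pages cannot acquire forum information effectively. Exploiting the structural characteristics of forums, this paper proposes a “board-topic correlation judgment” (BTCJ) algorithm and adopts a crawling strategy based on board expansion. Experiments show that the method significantly outperforms the breadth-first strategy in both accuracy and coverage of forum crawling. It also generalizes well: in practice it has covered more than 12 000 forums of various types.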

Keywords: Internet forums  Information crawling  Dynamic Web pages
Article ID: 1000-3428(2007)06-0080-03
Revised: 2006-03-25

Crawling Dynamic Web Pages in WWW Forums
LI Kui, CHENG Xueqi, GUO Yan, ZHANG Kai. Crawling Dynamic Web Pages in WWW Forums[J]. Computer Engineering, 2007, 33(6): 80-82.
Authors: LI Kui  CHENG Xueqi  GUO Yan  ZHANG Kai
Affiliation: 1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080; 2. Graduate School, Chinese Academy of Sciences, Beijing 100039
Abstract: Web forums have become one of the dominant channels for releasing and exchanging information on the Internet, and crawling is the groundwork for searching and mining information from them. However, traditional crawlers, which usually use a “breadth-first” strategy, cannot fetch information from Web forums effectively. Exploiting the inner structural features of forums, this paper presents a crawling strategy based on a “board-topic correlation judgment” algorithm. Compared with the “breadth-first” strategy, this solution performs remarkably better in both precision and recall. In practice, the algorithm has been applied to over 12 000 different Web forums with good results.
Keywords:WWW forums  Information crawling  Dynamic Web page
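
The abstract does not spell out how the board-topic correlation judgment works, so the following is only a minimal sketch of the general idea of board-based expansion, not the authors' algorithm. It assumes a toy in-memory link graph and hypothetical URL patterns (`/board?id=…`, `/topic?id=…`) to tell board pages from topic pages. The names `crawl`, `board_expansion`, `breadth_first`, and `LINKS` are illustrative assumptions. The sketch contrasts a plain breadth-first crawl with one that only follows links to boards, plus board-to-topic links.

```python
import re
from collections import deque

# Hypothetical URL patterns (an assumption for this sketch). Real forums
# differ, and the paper judges board-topic links with its BTCJ algorithm
# rather than with fixed patterns like these.
BOARD_RE = re.compile(r"/board\?id=\d+(&page=\d+)?$")
TOPIC_RE = re.compile(r"/topic\?id=\d+$")


def is_board(url):
    return bool(BOARD_RE.search(url))


def is_topic(url):
    return bool(TOPIC_RE.search(url))


def crawl(seed, links, follow):
    """Queue-based crawl over an in-memory link graph.

    follow(src, dst) decides whether the link src -> dst is enqueued.
    Returns the pages in the order they were fetched.
    """
    seen, order, queue = {seed}, [], deque([seed])
    while queue:
        url = queue.popleft()
        order.append(url)
        for dst in links.get(url, []):
            if dst not in seen and follow(url, dst):
                seen.add(dst)
                queue.append(dst)
    return order


def breadth_first(src, dst):
    # Baseline: follow every link.
    return True


def board_expansion(src, dst):
    # Follow links to board pages, and links from a board page to a topic.
    if is_topic(dst):
        return is_board(src)
    return is_board(dst)


# Toy forum: an index, two boards (one paginated), topics, and noise pages.
LINKS = {
    "/index": ["/board?id=1", "/board?id=2", "/login", "/help"],
    "/board?id=1": ["/topic?id=10", "/topic?id=11",
                    "/board?id=1&page=2", "/post?board=1"],
    "/board?id=1&page=2": ["/topic?id=12"],
    "/board?id=2": ["/topic?id=20", "/user?id=7"],
    "/topic?id=10": ["/user?id=7", "/reply?topic=10"],
    "/user?id=7": ["/user?id=7&tab=posts"],
}

bfs_pages = crawl("/index", LINKS, breadth_first)
focused_pages = crawl("/index", LINKS, board_expansion)

bfs_topics = {u for u in bfs_pages if is_topic(u)}
focused_topics = {u for u in focused_pages if is_topic(u)}
print(len(bfs_pages), len(focused_pages), focused_topics == bfs_topics)
```

On this toy graph both strategies reach the same topic pages, but the board-based crawl fetches fewer pages because it never enqueues login, help, reply, or user-profile links. Whether that holds on a real forum depends on how reliably board and topic pages can be recognized, which is the part the paper's algorithm addresses.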
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.