
Crawling Dynamic Web Pages in WWW Forums (WWW论坛中的动态网页采集)

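Cite this article:	LI Kui, CHENG Xueqi, GUO Yan, ZHANG Kai. WWW论坛中的动态网页采集 [Crawling Dynamic Web Pages in WWW Forums][J]. 计算机工程 (Computer Engineering), 2007, 33(6): 80-82.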

Authors:	LI Kui, CHENG Xueqi, GUO Yan, ZHANG Kai

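Affiliations:	1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080; Graduate School of the Chinese Academy of Sciences, Beijing 100039 2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080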

Funding:	National Basic Research Program of China (973 Program)

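Abstract:	Web forums have become a major form of information publishing on the Internet, and both retrieval and mining of forum information depend on first acquiring that information. However, traditional breadth-first crawlers designed for static Web pages cannot acquire forum information effectively. Exploiting the structural characteristics of forums, this paper proposes a “board-topic correlation judgment” (BTCJ) algorithm and adopts a crawling strategy based on board expansion. Experiments show that the method significantly outperforms the breadth-first strategy in both crawling precision and coverage. It also generalizes well: in practical use it has covered more than 12 000 forums of various types.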
Keywords:	Internet forums; information crawling; dynamic Web pages
Article ID:	1000-3428(2007)06-0080-03
Revised:	2006-03-25
Crawling Dynamic Web Pages in WWW Forums

LI Kui, CHENG Xueqi, GUO Yan, ZHANG Kai. Crawling Dynamic Web Pages in WWW Forums[J]. Computer Engineering, 2007, 33(6): 80-82.

Authors:	LI Kui, CHENG Xueqi, GUO Yan, ZHANG Kai

Affiliation:	1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080; 2. Graduate School, Chinese Academy of Sciences, Beijing 100039

Abstract:	Web forums have become one of the dominant channels for publishing and exchanging information on the Internet, and crawling is the groundwork for searching and mining forum information. However, traditional crawlers, which typically use a breadth-first strategy, cannot fetch information from Web forums effectively. Exploiting the structural features of forums, this paper presents a crawling strategy based on a “board-topic correlation judgment” algorithm. Compared with the breadth-first strategy, this solution performs markedly better in both precision and recall. In practice, the algorithm has been applied to more than 12 000 different Web forums with good results.

Keywords:	WWW forums; information crawling; dynamic Web page
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
