
Information Extraction from Web Pages Based on an Extended DOM Tree

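Cite this article:	Wang Lei, Jiang Jianzhong, Guo Junli. Information extraction from Web pages based on an extended DOM tree [J]. Computer Applications and Software, 2007, 24(6): 137-139.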

Authors:	Wang Lei, Jiang Jianzhong, Guo Junli

Affiliation:	Department of Communication Engineering, PLA Information Engineering University, Zhengzhou 450002, Henan, China

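Abstract:	As the Internet develops, the amount of information offered by Web pages keeps growing, and that information becomes ever denser. Most Web pages contain multiple information blocks that are compactly laid out and share similar patterns in HTML syntax. For Web pages with multiple information blocks, an information extraction method is proposed: first, an extended DOM (Document Object Model) tree is created and the page is broken into discrete information items; then, following the hierarchical structure of the extended DOM tree and combining the necessary visual features and semantic information, the discretized information items are reintegrated; finally, the subtrees containing information blocks are determined, and the DOM tree is traversed depth-first to carry out the extraction. The algorithm can extract information from Web pages with multiple information blocks.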
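Keywords:	DOM tree; information extraction; wrapper; semi-structured; extended DOM tree; Web page; algorithm; depth traversal; subtree; integration; discretization; semantic information; visual features; hierarchical structure; Document Object Model
Revised:	2006-04-29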
INFORMATION EXTRACTION FROM WEB PAGE BASED ON EXTENDED DOM TREE

Wang Lei, Jiang Jianzhong, Guo Junli. INFORMATION EXTRACTION FROM WEB PAGE BASED ON EXTENDED DOM TREE [J]. Computer Applications and Software, 2007, 24(6): 137-139.

Authors:	Wang Lei, Jiang Jianzhong, Guo Junli

Affiliation:	Department of Communication Engineering, Information Engineering University of PLA, Zhengzhou 450002, Henan, China

Abstract:	With the development of the Internet, both the amount and the density of information on Web pages have grown steadily. A single Web page often contains several information blocks that are laid out compactly and follow similar patterns in their HTML markup. A method of information extraction is proposed for Web pages that contain multiple information blocks. First, an extended DOM tree is defined and built, and the given Web page is broken into discrete pieces of information. Then, by combining the hierarchy of the extended DOM tree with visual features and semantic information, these discrete pieces are aggregated into information blocks. Finally, the subtrees containing information blocks are identified, and the blocks are extracted by a depth-first traversal of the extended DOM tree. The algorithm is applicable to Web pages containing several information blocks.
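
The three steps in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the extended DOM tree or the visual and semantic cues, so everything specific here is an assumption: each node is extended only with its own text and a structural signature, sibling subtrees that repeat the same signature are treated as information blocks, and visual and semantic features are left out entirely. The names `Node`, `TreeBuilder`, `parse`, `extract_blocks`, and `min_repeat` are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """DOM node extended with its own text and a structural signature (assumed extension)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

    def signature(self):
        """Tag pattern of the subtree, ignoring text, e.g. 'div(a(),span())'."""
        return self.tag + "(" + ",".join(c.signature() for c in self.children) + ")"


class TreeBuilder(HTMLParser):
    """Step 1: build the tree and attach each text fragment to its enclosing node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text = (self.current.text + " " + data.strip()).strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def items(node):
    """Depth-first traversal yielding the discrete text items of a subtree."""
    if node.text:
        yield node.text
    for child in node.children:
        yield from items(child)


def extract_blocks(node, min_repeat=2):
    """Steps 2 and 3: treat repeated sibling subtrees as blocks and collect their items."""
    sigs = [c.signature() for c in node.children]
    blocks = []
    for child, sig in zip(node.children, sigs):
        texts = list(items(child))
        if texts and sigs.count(sig) >= min_repeat:
            blocks.append(texts)  # this child is one information block
        else:
            blocks.extend(extract_blocks(child, min_repeat))
    return blocks


SAMPLE_PAGE = """
<html><body>
<h1>News</h1>
<div><a>Title A</a><span>2006-01-01</span></div>
<div><a>Title B</a><span>2006-01-02</span></div>
<div><a>Title C</a><span>2006-01-03</span></div>
<p>Footer</p>
</body></html>
"""

if __name__ == "__main__":
    for block in extract_blocks(parse(SAMPLE_PAGE)):
        print(block)
```

On the sample page the three `div` elements share one signature, so each becomes a block holding its title and date, while the heading and footer are skipped. A purely structural signature is the simplest stand-in for the paper's combination of hierarchy, visual, and semantic information; real pages would likely need a looser similarity measure than exact signature equality.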

Keywords:	DOM tree; Information extraction; Wrapper; Semi-structured
This article is indexed in CNKI, VIP (维普), Wanfang Data, and other databases.
