
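Web Page Information Extraction Based on an Extended DOM Tree
Cite this article: Wang Lei, Jiang Jianzhong, Guo Junli. Web Page Information Extraction Based on an Extended DOM Tree [J]. Computer Applications and Software, 2007, 24(6): 137-139.
Authors: Wang Lei, Jiang Jianzhong, Guo Junli
Affiliation: Department of Communication Engineering, Information Engineering University of PLA, Zhengzhou 450002, Henan
Abstract: As the Internet has developed, both the volume and the density of information on Web pages have grown steadily. Most Web pages contain multiple information blocks that are compactly laid out and follow similar patterns in their HTML syntax. For Web pages containing multiple information blocks, this paper proposes an information extraction method. First, an extended DOM (Document Object Model) tree is built and the page is broken into discrete information items. Next, the discrete items are regrouped according to the hierarchy of the extended DOM tree, combined with the necessary visual features and semantic information. Finally, the subtrees containing information blocks are identified and the information is extracted by a depth-first traversal of the DOM tree. The algorithm can extract information from Web pages with multiple information blocks.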

Keywords: DOM tree; information extraction; wrapper; semi-structured
Revised: 2006-04-29

INFORMATION EXTRACTION FROM WEB PAGE BASED ON EXTENDED DOM TREE
Wang Lei, Jiang Jianzhong, Guo Junli. INFORMATION EXTRACTION FROM WEB PAGE BASED ON EXTENDED DOM TREE [J]. Computer Applications and Software, 2007, 24(6): 137-139.
Authors: Wang Lei, Jiang Jianzhong, Guo Junli
Affiliation: Department of Communication Engineering, Information Engineering University of PLA, Zhengzhou 450002, Henan, China
Abstract: With the development of the Internet, both the amount and the density of information on web pages have increased steadily. A single web page often contains several information blocks that are close together in layout and follow similar patterns in their HTML syntax. A method of information extraction is proposed for web pages with multiple information blocks. First, an extended DOM tree is defined, and a given web page is split into discrete pieces of information. Then, by combining the hierarchy information with visual features and semantic information, these discrete pieces are aggregated into information blocks. Finally, the information blocks are extracted by a depth-first traversal of the extended DOM tree. The algorithm is applicable to web pages containing several information blocks.
Keywords: DOM tree; Information extraction; Wrapper; Semi-structured
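
The three stages named in the abstract (build a tree, regroup discrete pieces, extract blocks by depth-first traversal) can be illustrated with a rough sketch. The abstract does not define the extended DOM tree or the visual and semantic features, so everything below is an assumption: `Node`, `TreeBuilder`, `find_block_container`, and `extract_blocks` are hypothetical names, and the "similar HTML pattern" test is reduced to comparing the tag sequence of each node's direct children. It uses only Python's standard `html.parser` module and is not the authors' algorithm.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (small illustrative subset).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """DOM node with extra fields (a hypothetical stand-in for an 'extended' node)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.texts = []  # discrete text pieces found directly under this node
        self.depth = 0 if parent is None else parent.depth + 1

    def signature(self):
        # Structural pattern: own tag plus the tags of the direct children.
        return (self.tag, tuple(child.tag for child in self.children))


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.texts.append(text)


def iter_nodes(node):
    """Depth-first, pre-order traversal of the tree."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_block_container(root):
    """Return (node, signature) for the node whose children repeat one pattern most."""
    best, best_sig, best_count = None, None, 1
    for node in iter_nodes(root):
        counts = Counter(child.signature() for child in node.children)
        if counts:
            sig, count = counts.most_common(1)[0]
            if count > best_count:
                best, best_sig, best_count = node, sig, count
    return best, best_sig


def extract_blocks(html):
    """Return one list of text pieces per repeated block; empty if none repeat."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    container, sig = find_block_container(builder.root)
    if container is None:
        return []
    return [
        [text for node in iter_nodes(child) for text in node.texts]
        for child in container.children
        if child.signature() == sig
    ]
```

Calling `extract_blocks` on a page whose `<ul>` holds several identically structured `<li>` items returns one list of text strings per item, while navigation links that do not repeat the dominant pattern are skipped. A fuller treatment would also weigh layout and semantic cues, as the abstract describes.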
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.