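A Constraint-Based Method for Extracting Semistructured Information
Cite this article: Huang Yuqing, Zou Tao. A Constraint-Based Method for Extracting Semistructured Information [J]. Computer Applications and Software, 2002, 19(1): 53-59.
Authors: Huang Yuqing, Zou Tao
Affiliation: Multimedia Computer Research Institute, Nanjing University, Nanjing 210093
Abstract: To integrate and query irregular, dynamic information on the Web in a database fashion, this paper uses the Object Exchange Model (OEM) to build an information model of the Web. To represent each part of a page as a corresponding OEM object, the paper (1) designs an extraction algorithm for semistructured information; (2) defines a data extraction format that satisfies constraint conditions and designs a candidate algorithm that outputs the correct extraction format; and (3) reports test results. The method can extract both structured and semistructured information and is more general than existing extraction methods.

Keywords: data extraction format; OEM model; data extraction format constraints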

EXTRACTING SEMISTRUCTURED INFORMATION FROM WEB
Huang Yuqing, Zou Tao. EXTRACTING SEMISTRUCTURED INFORMATION FROM WEB [J]. Computer Applications and Software, 2002, 19(1): 53-59.
Authors: Huang Yuqing, Zou Tao
Abstract: To integrate and query irregular, dynamic information on the Web in a database fashion, the Object Exchange Model (OEM) is used to construct an information model of the Web. To express each component of a page as an OEM object, this paper makes three contributions: (1) an algorithm that extracts semistructured data from HTML pages; (2) a definition of a data extraction format that satisfies a set of constraints, together with a candidate algorithm that outputs the correct extraction format; and (3) test results. The method can extract both structured and semistructured data and is more widely applicable than other current methods.
Keywords: data extraction format; OEM model; data extraction format constraint
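
As a rough illustration of the OEM representation the abstract refers to, the sketch below models an OEM object as a labeled node whose value is either an atomic value or a list of child objects. This is a minimal sketch of the general OEM idea and not the authors' extraction algorithm or extraction format; the class name `OEMObject`, the `find` helper, the object-identifier counter, and the sample page contents are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Union

# Hypothetical generator of unique object identifiers (OIDs).
_oid_counter = count(1)


@dataclass
class OEMObject:
    """A labeled OEM-style object: atomic value or list of sub-objects."""
    label: str
    value: Union[str, int, float, List["OEMObject"]]
    oid: int = field(default_factory=lambda: next(_oid_counter))

    @property
    def type(self) -> str:
        # Complex objects hold children; atomic ones report their Python type.
        return "complex" if isinstance(self.value, list) else type(self.value).__name__

    def find(self, label: str) -> List["OEMObject"]:
        """Return all descendants carrying the given label, depth-first."""
        found: List["OEMObject"] = []
        if isinstance(self.value, list):
            for child in self.value:
                if child.label == label:
                    found.append(child)
                found.extend(child.find(label))
        return found


# Hypothetical page fragment expressed as OEM objects.
page = OEMObject("article", [
    OEMObject("title", "EXTRACTING SEMISTRUCTURED INFORMATION FROM WEB"),
    OEMObject("author", "Huang Yuqing"),
    OEMObject("author", "Zou Tao"),
    OEMObject("year", 2002),
])

authors = [a.value for a in page.find("author")]
```

Because labels travel with each object rather than living in a fixed schema, repeated or missing fields (here, two `author` objects) need no schema change, which is what makes this kind of model suitable for irregular Web data.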
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.