
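Application of XML Technology to Data Extraction from the Chemical Deep Web
Cite this article: Zhuo Liuyi, Li Xiaoxia, Guo Li. Application of XML technology to data extraction from the chemical deep Web [J]. Computers and Applied Chemistry, 2006, 23(11): 1137-1141
Authors: Zhuo Liuyi  Li Xiaoxia  Guo Li
Affiliations: Lab of Multi-Phase Reaction, Institute of Process Engineering, Chinese Academy of Sciences, Beijing 100080; Graduate University of Chinese Academy of Sciences, Beijing 100049; Lab of Multi-Phase Reaction, Institute of Process Engineering, Chinese Academy of Sciences, Beijing 100080
Abstract: Chemical databases on the Internet are valuable chemical information resources, and how to use their data effectively is the problem that the chemical deep Web must address. This paper summarizes the characteristics of the chemical deep Web and, based on XML technology, extracts data from the semi-structured HTML pages returned by database queries, turning it into data that programs can call directly for further computation. During extraction, JTidy is first used to normalize the HTML into XHTML documents that are well-formed and whose content is intact; XSLT data transformation templates containing XPath path expressions then perform the data transformation and extraction. Because the quality of the XPath expressions determines whether an XSLT template can keep extracting chemical data effectively over the long term, the paper focuses on how to write robust XPath expressions. It stresses that XPath expressions should locate data in the source tree by content and attribute features, that coupling between expressions should be kept as low as possible, and that likely changes to chemical sites should be anticipated and handled in the XSLT template to improve the long-term validity of the expressions. The paper thus provides methodological guidance for creating XSLT data extraction templates for the chemical deep Web.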

Keywords: Web data extraction  chemical deep Web  XML  XSLT  XPath  chemical databases
Article ID: 1001-4160(2006)11-1137-1141
Received: 2006-02-28
Revised: 2006-02-28; 2006-05-28
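
The abstract's advice to locate data by content features rather than by position can be illustrated with a small sketch. This is not the authors' XSLT template: it uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath, and the two sample pages, the `extract` helper, and both path strings are hypothetical. The pages also omit the XHTML namespace that a real tidied document would typically carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML result page, as it might look after an HTML tidying step.
PAGE_V1 = """<html><body>
<table>
  <tr><th>Name</th><td>Benzene</td></tr>
  <tr><th>Melting point</th><td>5.5 C</td></tr>
</table>
</body></html>"""

# The same record after a hypothetical site redesign:
# a banner table and an extra property row were added.
PAGE_V2 = """<html><body>
<table><tr><td>Site banner</td></tr></table>
<table>
  <tr><th>Name</th><td>Benzene</td></tr>
  <tr><th>Formula</th><td>C6H6</td></tr>
  <tr><th>Melting point</th><td>5.5 C</td></tr>
</table>
</body></html>"""

# Position-based path: depends on table and row order.
BRITTLE = "./body/table[1]/tr[2]/td"
# Content-based path: anchored on the label text of the row.
ROBUST = ".//tr[th='Melting point']/td"


def extract(page, path):
    """Return the text of the first node matching path, or None."""
    root = ET.fromstring(page)
    node = root.find(path)
    return None if node is None else node.text


for label, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(label, "brittle:", extract(page, BRITTLE), "robust:", extract(page, ROBUST))
```

In this sketch the position-based path stops matching once the layout changes, while the path anchored on the row label still finds the value, which is the kind of long-term validity the abstract argues for.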

Chemical deep Web data extraction with XML-based technology
Zhuo Liuyi, Li Xiaoxia, Guo Li. Chemical deep Web data extraction with XML-based technology [J]. Computers and Applied Chemistry, 2006, 23(11): 1137-1141
Authors: Zhuo Liuyi  Li Xiaoxia  Guo Li
Affiliation: 1. Lab of Multi-Phase Reaction, Institute of Process Engineering, Chinese Academy of Sciences, Beijing, 100080, China; 2. Graduate University of Chinese Academy of Sciences, Beijing, 100049, China
Abstract: Internet chemical databases are valuable resources that form the chemical deep Web. Data in the chemical deep Web is accessible only through queries, and the result pages generated from the databases are mostly HTML documents meant for human browsing, not for data exchange in computational applications. In this paper we introduce an approach to extracting data from the chemical deep Web based on XML technologies, in which HTML documents are first normalized into XHTML and then mapped to the desired XML application format by an XSLT created for the targeted database using XML path expressions and regular expressions. The paper describes a methodology for creating XSLT with XML Path (XPath) expressions that are capable of extracting data from HTML pages returned by Web-based chemical database searches. The robustness of the XPath expressions is emphasized, which is critical given the vulnerability of extraction technologies to the continually changing content, structure, and formatting of pages on the chemical Web. We summarize the data extraction rules in terms of their dependence on content, structural, or formatting features, and provide practical tips on how to create robust data extraction patterns for the chemical deep Web. These rules will be used to generate better XSLT documents for data extraction in our ChemDB Portal.
Keywords: Web data extraction  chemical deep Web  XML  XSLT  XPath  chemical databases
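
The mapping step described in the abstract, from normalized XHTML to a target XML format using path expressions and regular expressions, might look roughly like the following. It is a Python stand-in for what the paper does with XSLT, not the paper's template: the sample page, the `to_xml` function, the `VALUE_UNIT` pattern, and the output element names `compound`, `name`, and `boilingPoint` are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XHTML page (already normalized; namespace omitted for brevity).
XHTML = """<html><body><table>
<tr><th>Name</th><td>Benzene</td></tr>
<tr><th>Boiling point</th><td>80.1 C</td></tr>
</table></body></html>"""

# Splits a cell such as "80.1 C" into a numeric value and a unit token.
VALUE_UNIT = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*(\S+)\s*$")


def to_xml(xhtml):
    """Map the hypothetical result page to a small, made-up XML format."""
    src = ET.fromstring(xhtml)
    out = ET.Element("compound")

    # Locate each cell by the label of its row, not by its position.
    name = src.find(".//tr[th='Name']/td")
    ET.SubElement(out, "name").text = name.text if name is not None else ""

    bp = src.find(".//tr[th='Boiling point']/td")
    match = VALUE_UNIT.match(bp.text) if bp is not None and bp.text else None
    if match:
        prop = ET.SubElement(out, "boilingPoint", unit=match.group(2))
        prop.text = match.group(1)

    return ET.tostring(out, encoding="unicode")


print(to_xml(XHTML))
```

Keeping each property lookup independent of the others means a change to one row of the page affects only that property, in line with the low-coupling rule the paper recommends.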
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.