
Application of XML Technology to Chemical Deep Web Data Extraction

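Cite this article:	Zhuo Liuyi, Li Xiaoxia, Guo Li. Application of XML technology to chemical deep Web data extraction[J]. Computers and Applied Chemistry, 2006, 23(11): 1137-1141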

Authors:	Zhuo Liuyi, Li Xiaoxia, Guo Li

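Affiliation:	Lab of Multi-Phase Reaction, Institute of Process Engineering, Chinese Academy of Sciences, Beijing, 100080; Graduate University of Chinese Academy of Sciences, Beijing, 100049; Lab of Multi-Phase Reaction, Institute of Process Engineering, Chinese Academy of Sciences, Beijing, 100080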

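Abstract:	Chemical databases on the Internet are valuable chemical information resources, and making effective use of their data is the problem that the chemical deep Web must solve. This paper summarizes the characteristics of the chemical deep Web and uses XML technology to extract data from the semi-structured HTML pages returned by database queries, turning them into data that programs can call directly for further computation. In the extraction process, JTidy is first used to normalize the HTML into a well-formed XHTML document with correct content, and an XSLT transformation template containing XPath path expressions then performs the data transformation and extraction. The quality of the XPath expressions determines whether an XSLT transformation template can keep extracting chemical data effectively over the long term. The paper focuses on how to write robust XPath expressions, emphasizing that they should locate data in the source tree by content and attribute features, keep the coupling between expressions as low as possible, and anticipate likely changes to chemical sites, with corresponding measures taken in the XSLT transformation template to improve the long-term validity of the expressions. The work provides methodological guidance for creating XSLT data extraction templates for the chemical deep Web.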
Keywords:	Web data extraction; chemical deep Web; XML; XSLT; XPath; chemical databases
Article ID:	1001-4160(2006)11-1137-1141
Received:	2006-02-28
Revised:	2006-02-28; 2006-05-28
Chemical deep Web data extraction with XML-based technology

Zhuo Liuyi, Li Xiaoxia, Guo Li. Chemical deep Web data extraction with XML-based technology[J]. Computers and Applied Chemistry, 2006, 23(11): 1137-1141

Authors:	Zhuo Liuyi, Li Xiaoxia, Guo Li

Affiliation:	1. Lab of Multi-Phase Reaction, Institute of Process Engineering, Chinese Academy of Sciences, Beijing, 100080, China; 2. Graduate University of Chinese Academy of Sciences, Beijing, 100049, China

Abstract:	Chemical databases on the Internet are valuable resources that form the chemical deep Web. Data in the chemical deep Web is accessible only through queries, and the result pages generated from the databases are mostly HTML documents intended for human browsing, not for data exchange in computational applications. In this paper we introduce an approach to extracting data from the chemical deep Web based on XML technologies, in which HTML documents are first normalized into XHTML and then mapped to the desired XML application format by creating XSLT for the targeted database using XML path expressions and regular expressions. The paper describes a methodology for creating XSLT with XML path (XPath) expressions that are capable of extracting data from HTML pages returned by Web-based chemical database searches. The robustness of the XPath expressions is emphasized, which is critical given the vulnerability of extraction technologies to the continually changing content, structure, and formatting of pages on the chemical Web. We summarize the data extraction rules in terms of their dependence on content, structural, or formatting features, and provide practical tips on how to create robust data extraction patterns for the chemical deep Web. These rules will be used to generate better XSLT documents for data extraction in our ChemDB Portal.

Keywords:	Web data extraction; chemical deep Web; XML; XSLT; XPath; chemical databases
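
The abstract's point about robust XPath expressions can be illustrated with a small sketch. It is not the authors' JTidy and XSLT toolchain: it assumes the page has already been normalized, and it uses the limited XPath subset in Python's standard-library xml.etree.ElementTree. The two page fragments, the `props` class name, and the names `BRITTLE`, `ROBUST`, and `extract` are all hypothetical. One expression locates the value by row position, the other by the label text and a table attribute, and the site is assumed to insert a new row later.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free XHTML fragments standing in for a normalized
# result page, before and after the site adds an extra row to the table.
PAGE_V1 = """
<html><body>
  <table class="props">
    <tr><th>Name</th><td>Benzene</td></tr>
    <tr><th>Melting point</th><td>5.5 C</td></tr>
  </table>
</body></html>
"""

PAGE_V2 = """
<html><body>
  <table class="props">
    <tr><th>CAS No.</th><td>71-43-2</td></tr>
    <tr><th>Name</th><td>Benzene</td></tr>
    <tr><th>Melting point</th><td>5.5 C</td></tr>
  </table>
</body></html>
"""

# Depends on structure only: "the cell in the second row of any table".
BRITTLE = ".//table/tr[2]/td"

# Depends on content and attribute features: the row whose header cell reads
# "Melting point", inside the table marked class="props".
ROBUST = ".//table[@class='props']/tr[th='Melting point']/td"


def extract(xhtml, path):
    """Return the stripped text of the first node matching path, or None."""
    root = ET.fromstring(xhtml)
    node = root.find(path)
    if node is None or node.text is None:
        return None
    return node.text.strip()


if __name__ == "__main__":
    for label, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(label, "brittle:", extract(page, BRITTLE))
        print(label, "robust: ", extract(page, ROBUST))
```

On the first fragment both expressions return the melting point. On the second, the positional expression silently returns the compound name instead, while the content-anchored expression is unaffected by the inserted row. A full XPath 1.0 engine, such as the one used by an XSLT processor, offers further content-based axes that ElementTree does not support.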
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
