
Internet Merchandise Information Extraction Technology

Cite this article:	YU Lu-bo, CHEN Chao. Internet Merchandise Information Extraction Technology [J]. Computer Engineering, 2008, 34(5): 274-276.

Authors:	YU Lu-bo, CHEN Chao

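Affiliation:	1. Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027; 2. MOE-Microsoft Key Laboratory of Multimedia Computing and Communication, Hefei 230026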

Funding:	Open Fund of the MOE-Microsoft Key Laboratory of Multimedia Computing and Communication

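Abstract:	To address the diversity of formats in web page information extraction, an information extraction algorithm based on path statistical clustering is proposed. The algorithm exploits the characteristics of e-commerce web pages and gives a general mathematical expression for web page statistical information. On this basis, it applies statistical clustering to segment information blocks and extract the information. Extraction from the pages of real e-commerce websites demonstrates the effectiveness of the algorithm: segmentation accuracy reaches 92.27% and information extraction accuracy reaches 98.24%.
Keywords:	web page segmentation; web page information extraction; wrapper; path clustering
Article number:	1000-3428(2008)05-0274-03
Received:	2007-04-06
Revised:	2007-04-06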
WWW Merchandise Information Extraction

YU Lu-bo,CHEN Chao.WWW Merchandise Information Extraction[J].Computer Engineering,2008,34(5):274-276.

Authors:	YU Lu-bo, CHEN Chao

Affiliation:	(1. Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027; 2. MOE-Microsoft Key Laboratory of Multimedia Computing and Communication, Hefei 230026)

Abstract:	To address the problem of format diversity in web page information extraction, this paper proposes a new information extraction method based on XPath clustering. The method exploits the characteristics of e-commerce websites and gives a general mathematical formula for web page statistical information. On this basis, it clusters the statistical information to segment information blocks and extract the information. Experiments on real e-commerce websites validate the algorithm: segmentation accuracy is 92.27% and information extraction accuracy is 98.24%.

Keywords:	Web page segmentation; Web page information extraction; wrapper; XPath clustering
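
The abstract does not give the statistical formula or the clustering procedure, so the following is only a minimal sketch of the general idea of grouping page content by tag path, not the authors' algorithm. It assumes that repeated merchandise fields share the same root-to-node tag path, and it substitutes a simple frequency threshold for the paper's statistical clustering. The names PathCollector, cluster_by_path and min_count, and the sample page, are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record the root-to-node tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.texts_by_path = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/" + "/".join(self.stack)
            self.texts_by_path[path].append(text)


def cluster_by_path(html, min_count=2):
    """Group text nodes by tag path; keep paths seen at least min_count times.

    Assumption: a plain frequency threshold stands in for the statistical
    clustering described in the abstract.
    """
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return {
        path: texts
        for path, texts in collector.texts_by_path.items()
        if len(texts) >= min_count
    }


# Hypothetical product listing page used only for illustration.
page = """
<html><body>
<h1>Shop</h1>
<ul>
  <li><span>Camera</span><b>$199</b></li>
  <li><span>Phone</span><b>$299</b></li>
  <li><span>Laptop</span><b>$899</b></li>
</ul>
<p>Contact us</p>
</body></html>
"""

clusters = cluster_by_path(page)
for path, texts in clusters.items():
    print(path, texts)
```

In this sketch the product names and prices fall into two path groups, while the one-off heading and footer text are dropped because their paths occur only once. A real page would likely need indexed paths or attribute information to separate blocks that share a tag path.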
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.