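Research on Web page segmentation and main-text block extraction based on the CURE algorithm
Cite this article: Wang Chao, Xu Jiefeng. Research on Web page segmentation and main-text block extraction based on the CURE algorithm[J]. Microcomputer & its Applications, 2012, 31(12): 11-14.
Authors: Wang Chao, Xu Jiefeng
Affiliation: Department of Computer Science and Technology, College of Computer and Communication Engineering, China University of Petroleum (East China), Qingdao, Shandong 266000, China
Abstract: This paper studies a Web page segmentation method based on CURE clustering, together with rules for extracting the main-text block. Node attributes are added to the page's DOM tree, converting it into an extended DOM tree that carries information-node offsets. The CURE algorithm is then used to cluster the information nodes, and each resulting cluster represents a different block of the page. Finally, three main features of the main-text block are extracted and used to construct an information-block weight formula, which is then used to identify the main-text block.

Keywords: Web information extraction; clustering algorithm; page segmentation; main-text block extraction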

Approach based on CURE algorithm of Web page segmentation and information extraction
Wang Chao, Xu Jiefeng. Approach based on CURE algorithm of Web page segmentation and information extraction[J]. Microcomputer & its Applications, 2012, 31(12): 11-14.
Authors: Wang Chao, Xu Jiefeng
Affiliation: Computer Science and Technology Department, College of Computer and Communication Engineering, China University of Petroleum, Qingdao 266000, China
Abstract: This paper discusses an approach to Web page segmentation based on the CURE algorithm, along with rules for extracting the text block. The main idea is to add attributes to the nodes of a standardized DOM tree, converting it into an extended DOM tree that carries information-node offsets. The CURE algorithm is then used to cluster the information nodes, and each resulting cluster represents a different block of the page. Finally, three main features of the text block are extracted and used to construct an information-block weight formula that can distinguish text blocks.
Keywords: Web information extraction; clustering algorithm; page block; text block extraction
This article is indexed in CNKI, Wanfang Data, and other databases.
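
The clustering step described in the abstract can be illustrated with a minimal sketch of CURE-style agglomerative clustering. It is not the authors' implementation: the representation of each information node as a `(left, top)` offset pair, the function names, and the parameter values `c` (representatives per cluster) and `alpha` (shrink factor) are all assumptions made for illustration. The weight formula is left out because the abstract does not state its three features.

```python
import math
from itertools import combinations


def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def representatives(points, c=3, alpha=0.3):
    """Pick up to c well-scattered points, then shrink them toward the centroid."""
    center = centroid(points)
    # Start with the point farthest from the centroid.
    chosen = [max(points, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(c, len(points)):
        # Next, take the point whose nearest chosen representative is farthest away.
        chosen.append(max(points, key=lambda p: min(math.dist(p, q) for q in chosen)))
    return [tuple(x + alpha * (m - x) for x, m in zip(p, center)) for p in chosen]


def cure(points, k, c=3, alpha=0.3):
    """Merge the two closest clusters until only k remain (naive, unoptimized)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        reps = [representatives(cl, c, alpha) for cl in clusters]
        # Cluster distance = smallest distance between their representatives.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                math.dist(a, b) for a in reps[ij[0]] for b in reps[ij[1]]
            ),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Hypothetical (left, top) offsets of information nodes on a page.
node_offsets = [
    (0, 0), (0, 10), (0, 20),        # e.g. a navigation column
    (0, 500), (0, 510), (0, 520),    # e.g. a footer region
    (300, 0), (300, 10),             # e.g. a content column
]
blocks = cure(node_offsets, k=3)
for block in blocks:
    print(sorted(block))
```

Each returned cluster plays the role of one page block. A full CURE implementation would also use random sampling, partitioning, and a heap with a k-d tree to scale to large inputs; this sketch keeps only the representative-point merging rule.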