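Research of Improved Tree Path Model in Web Page Clustering
Cite this article: WANG Ya-pu, WANG Zhi-jian, YE Feng. Research of Improved Tree Path Model in Web Page Clustering[J]. Computer Science, 2015, 42(5): 109-113.
Authors: WANG Ya-pu, WANG Zhi-jian, YE Feng
Affiliations: 1. College of Computer and Information, Hohai University, Nanjing 211100
2. College of Computer and Information, Hohai University, Nanjing 211100; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016
Funding: This work was supported by the Jiangsu Water Conservancy Science and Technology Project "'Smart River' Research and Its Application in the Management of the Chuhe River in Luhe" (2013025) and by the Fundamental Research Funds for the Central Universities, Hohai University (2009B21614).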
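Abstract: Similarity computation is the foundation of text mining and a key step in the information extraction process. For web pages with complex structure, current similarity methods based on the traditional tree path model are not yet accurate enough. The traditional tree path model does not consider the order in which paths occur, and it compares paths by exact matching, which makes it hard to describe the similarity between paths precisely when they match only partially. Starting from the structural similarity of web pages, this paper therefore proposes an improved tree path model. The model takes full account of the relationships between sibling nodes, path positions, and path weights, remedying the traditional model's inability to express document structure and hierarchy information. Experimental results show that the model improves the ability to recognize structural similarity between web pages: it separates pages with large structural differences well, reflects the differences between pages generated from the same template reasonably well, and gives better results in web page clustering.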

Keywords: information extraction; web page structure; similarity; tree path model; clustering

Research of Improved Tree Path Model in Web Page Clustering
WANG Ya-pu, WANG Zhi-jian and YE Feng. Research of Improved Tree Path Model in Web Page Clustering[J]. Computer Science, 2015, 42(5): 109-113.
Authors: WANG Ya-pu, WANG Zhi-jian and YE Feng
Affiliation: 1. College of Computer and Information, Hohai University, Nanjing 211100, China; 2. College of Computer and Information, Hohai University, Nanjing 211100, China; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
Abstract: Computing similarity is the basis of text mining and a crucial step in information extraction. For Web pages with complex structure, similarity computation based on the traditional tree path model is not sufficiently accurate. The traditional tree path model does not take the order of paths into consideration, and it compares paths by exact matching, so it cannot describe the similarity between paths accurately when they match only partially. Starting from the structural similarity of Web pages, this paper therefore proposes an improved tree path model. The model takes full account of the relationship between siblings, the path location, and the path weights, and makes up for the inability of the traditional tree path model to express document structure and hierarchical information. The experimental results show that the model improves the ability to recognize structural similarity between Web pages. It distinguishes Web pages with large structural differences well, effectively reflects the differences between Web pages generated from the same template, and gives better results in Web page clustering.
Keywords:Information extraction  Web page structure  Similarity  Tree path model  Clustering
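
The limitation of the traditional tree path model described in the abstract can be illustrated with a minimal sketch. This is not the authors' improved model, and it is not taken from the paper: the names `PathCollector`, `tree_paths`, and `path_similarity` are hypothetical, and the use of Jaccard similarity over the set of root-to-leaf tag paths is an assumption made only to show a baseline that matches paths exactly and ignores their order.

```python
from html.parser import HTMLParser

# Tags that never have children or a closing tag (partial list, for the sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collect root-to-leaf tag paths such as 'html/body/div/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []     # tags currently open
        self.is_leaf = []   # parallel to stack: True until a child tag opens
        self.paths = set()  # unordered set, so the order of paths is discarded

    def handle_starttag(self, tag, attrs):
        if self.is_leaf:
            self.is_leaf[-1] = False  # the parent now has a child
        if tag in VOID_TAGS:
            self.paths.add("/".join(self.stack + [tag]))
            return
        self.stack.append(tag)
        self.is_leaf.append(True)

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1] != tag:
            return  # mismatched close tags are ignored in this sketch
        if self.is_leaf[-1]:
            self.paths.add("/".join(self.stack))
        self.stack.pop()
        self.is_leaf.pop()


def tree_paths(html):
    """Return the set of root-to-leaf tag paths of an HTML string."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def path_similarity(html_a, html_b):
    """Jaccard similarity over exactly matching paths (assumed baseline)."""
    paths_a, paths_b = tree_paths(html_a), tree_paths(html_b)
    if not paths_a and not paths_b:
        return 1.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)


page_a = "<html><body><div><p>x</p></div><ul><li>a</li></ul></body></html>"
page_b = "<html><body><ul><li>b</li></ul><div><p>y</p></div></body></html>"
page_c = "<html><body><table><tr><td>z</td></tr></table></body></html>"

print(path_similarity(page_a, page_b))  # 1.0: swapped siblings are invisible
print(path_similarity(page_a, page_c))  # 0.0: no path matches exactly
```

In this sketch, `page_a` and `page_b` differ only in the order of their sibling blocks yet score as identical, and any path that differs in a single tag contributes nothing to the score. These are the two gaps (path order and partial matching) that the abstract says the improved model addresses through sibling relationships, path positions, and path weights.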
This article is indexed in Wanfang Data and other databases.