
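Analysis of Two Algorithms for Repeated Phrase Extraction
Cite this article: JIANG Hua, YIN Bo. Analysis of two algorithms for repeated phrase extraction[J]. Journal of Computer Applications, 2009, 29(2): 403-405.
Authors: JIANG Hua, YIN Bo
Affiliation: School of Computer and Control, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
Funding: Doctoral Science and Technology Fund of Guilin University of Electronic Technology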
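Abstract: To address the de-duplication of duplicate Web pages, two algorithms for extracting repeated phrases were systematically analyzed and compared. The STC algorithm performs very well in time cost, while the inverted-index method for repeated sequences is better in space complexity. The repeated-sequence method was improved by combining it with STC: for duplicate pages produced by topic-oriented reposting, repeated strings are extracted first and then used as the index for repeated extraction with the STC algorithm. Experimental results show that the improved algorithm greatly increases time efficiency while retaining its original space characteristics.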

Keywords: repeated phrases; repeated sequences; suffix tree
Received: 2008-09-02
Revised: 2008-10-22

New algorithm based on repeat sequence deletion
JIANG Hua, YIN Bo. New algorithm based on repeat sequence deletion[J]. Journal of Computer Applications, 2009, 29(2): 403-405.
Authors:JIANG Hua  YIN Bo
Affiliation: Department of Computer and Control, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China
Abstract: To address Web page de-duplication, two repeated sequence (RS) extraction algorithms were compared and analyzed. Because STC performs well in time cost while the inverted index method is superior in space complexity, STC was used to improve the RS algorithm. Experimental results show that this method finds similar Web pages efficiently. The algorithm reaches high precision in mono-language deletion of duplicated Web pages, and it also reaches its maximum precision when applied to the deletion of duplicated Web pages.
Keywords:repeated sequences  repeated segments  suffix tree
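
The inverted-index idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes fixed-length word n-grams as the "repeated sequences", whitespace tokenization (Chinese text would first need a word segmenter), and it builds no suffix tree, so the STC step is not shown. All function names are hypothetical.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Yield every contiguous run of n tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def build_inverted_index(docs, n=3):
    """Map each word n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: whitespace tokenization is good enough for the demo.
        for gram in ngrams(text.split(), n):
            index[gram].add(doc_id)
    return index


def repeated_sequences(docs, n=3, min_docs=2):
    """Return the n-grams that occur in at least min_docs documents."""
    index = build_inverted_index(docs, n)
    return {gram: ids for gram, ids in index.items() if len(ids) >= min_docs}


def shared_counts(repeated):
    """Count how many repeated n-grams each pair of documents shares."""
    counts = defaultdict(int)
    for ids in repeated.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                counts[(a, b)] += 1
    return dict(counts)


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps",
        "b": "a quick brown fox sleeps",
        "c": "nothing in common here",
    }
    repeated = repeated_sequences(docs, n=3)
    print(repeated)
    print(shared_counts(repeated))
```

Document pairs that share many repeated n-grams are candidates for near-duplicates. In a pipeline like the one the abstract outlines, such extracted strings would then serve as the index for the suffix-tree-based stage.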
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.