
Analysis of Two Algorithms for Extracting Repeated Words and Sentences

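Cite this article:	JIANG Hua, YIN Bo. Analysis of two algorithms for extracting repeated words and sentences[J]. Journal of Computer Applications, 2009, 29(2): 403-405.

Authors:	JIANG Hua, YIN Bo

Affiliation:	School of Computer and Control, Guilin University of Electronic Technology, Guilin, Guangxi 541004

Funding:	Doctoral Science and Technology Fund of Guilin University of Electronic Technology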

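Abstract:	To address the de-duplication of duplicate web pages, two algorithms for extracting repeated words and sentences are systematically analyzed and compared. The STC algorithm performs very well in time cost, while the inverted-index method based on repeated sequences is superior in space complexity. The repeated-sequence method is improved by combining it with the STC algorithm: for duplicate web pages produced by topic-oriented reposting, repeated strings are extracted first, and these repeated strings are then used as the index for STC-based repeat extraction. Experimental results show that the improved algorithm greatly increases time efficiency while preserving the original space characteristics.
Keywords:	repeated words and sentences; repeated sequences; suffix tree
Received:	2008-09-02
Revised:	2008-10-22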
New algorithm based on repeat sequence deletion

JIANG Hua, YIN Bo. New algorithm based on repeat sequence deletion[J]. Journal of Computer Applications, 2009, 29(2): 403-405.

Authors:	JIANG Hua, YIN Bo

Affiliation:	Department of Computer and Control, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China

Abstract:	Targeting current de-duplication algorithms, two repeated sequence (RS) extraction algorithms were compared and analyzed. Because STC performs favorably in time cost while the inverted index method is superior in space complexity, STC was used to improve the RS algorithm. Experimental results show that this method can find similar Web pages efficiently. The algorithm reaches high precision in mono-language de-duplication, and it attains its maximum precision when applied to the deletion of duplicated Web pages.

Keywords:	repeated sequences; repeated segments; suffix tree
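
The abstract describes the pipeline only at a high level: extract repeated strings, then use them as index keys for STC-style grouping of pages. The following is a minimal sketch of that idea, not the authors' implementation. It assumes whitespace-tokenized text and approximates repeated sequences with fixed-length word n-grams rather than a suffix tree; the function names `repeated_phrases` and `base_clusters` and the parameters `n` and `min_docs` are hypothetical.

```python
from collections import defaultdict


def repeated_phrases(docs, n=3, min_docs=2):
    """Build an inverted index from word n-grams to document ids and keep
    only the n-grams shared by at least `min_docs` documents.

    Assumption: fixed-length n-grams stand in for repeated sequences.
    """
    index = defaultdict(set)  # phrase -> set of document ids
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(doc_id)
    return {p: ids for p, ids in index.items() if len(ids) >= min_docs}


def base_clusters(index):
    """Group repeated phrases by the exact set of documents sharing them,
    in the spirit of STC base clusters (an illustrative simplification)."""
    clusters = defaultdict(list)
    for phrase, ids in index.items():
        clusters[frozenset(ids)].append(" ".join(phrase))
    return dict(clusters)


docs = [
    "the quick brown fox jumps over the lazy dog",
    "breaking news the quick brown fox jumps again",
    "an unrelated page about suffix trees",
]
index = repeated_phrases(docs)
clusters = base_clusters(index)
for doc_ids, phrases in clusters.items():
    print(sorted(doc_ids), phrases)
```

In this toy input, the first two pages share a run of text and end up in one group, while the third page shares nothing and is left out. Keeping only the repeated phrases as index keys is what limits memory use in the sketch; a suffix tree would instead find repeats of any length, typically at a higher space cost.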
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
