
Research on the Web Page Duplicate Detection Algorithms Shingling and Simhash

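Cite this article:	Ma Chengqian, Mao Xuguang. Research on the Web Page Duplicate Detection Algorithms Shingling and Simhash[J]. Computer and Digital Engineering, 2009, 37(1): 15-17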

Author names:	Ma Chengqian, Mao Xuguang

Author affiliation:	School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070

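Abstract:	With the explosive growth of information on the Internet, people have come to expect more from information retrieval. Among the massive number of web pages, many are duplicates. Web page duplicate detection can save network bandwidth, reduce storage cost, and improve search engine performance. Among duplicate detection algorithms, shingling and simhash are two of the more important and classic ones. This paper introduces both algorithms, including their principles, their existing problems, and improvements to them.
Keywords:	web page duplicate detection, search engine, shingling, simhash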
Research on Near-duplicate Detection Algorithm Shingling and Simhash

Ma Chengqian, Mao Xuguang. Research on Near-duplicate Detection Algorithm Shingling and Simhash[J]. Computer and Digital Engineering, 2009, 37(1): 15-17

Authors:	Ma Chengqian, Mao Xuguang

Affiliation:	Department of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070

Abstract:	With the explosive growth of information on the Internet, people demand more advanced information retrieval techniques. Among the enormous number of web pages, a great many are duplicates. Near-duplicate detection can save network bandwidth, reduce storage cost, and improve the quality of search engines. Shingling and simhash are two important and classic algorithms for near-duplicate detection. This paper introduces the two algorithms, including their principles, the problems they face, and how to impro...

Keywords:	shingling simhash
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.