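Research on the Web Page Duplicate-Detection Algorithms Shingling and Simhash (网页查重算法Shingling和Simhash研究)
Cite this article: Ma Chengqian, Mao Xuguang. 网页查重算法Shingling和Simhash研究 [J]. 计算机与数字工程 (Computer and Digital Engineering), 2009, 37(1): 15-17
Authors: Ma Chengqian (马成前), Mao Xuguang (毛许光)
Affiliation: School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070
Abstract: With the explosive growth of information on the web, users expect more of information retrieval. Among the vast number of web pages, many are duplicates. Detecting duplicate pages saves network bandwidth, lowers storage costs, and improves search engine performance. Shingling and simhash are two of the more important and classic duplicate-detection algorithms. This paper introduces both, covering their principles, their known problems, and improvements to them.

Keywords: web page duplicate detection; search engine; shingling; simhash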

Research on Near-duplicate Detection Algorithm Shingling and Simhash
Ma Chengqian, Mao Xuguang. Research on Near-duplicate Detection Algorithm Shingling and Simhash [J]. Computer and Digital Engineering, 2009, 37(1): 15-17
Authors: Ma Chengqian, Mao Xuguang
Affiliation: Department of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070
Abstract: With the explosive growth of information on the internet, people demand more advanced information retrieval techniques. Among the enormous number of web pages, a great many are duplicates. Near-duplicate detection can save network bandwidth, reduce storage cost, and enhance the quality of the search engine. Shingling and simhash are two important and classic algorithms for near-duplicate detection. This paper introduces the two algorithms, including their principles, the problems they face, and how to impro...
Keywords: shingling; simhash
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.