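Application of the Simhash Algorithm in Text Deduplication
Cite this article:ZHANG Hang, SHENG Zhiwei, ZHANG Shibin, YANG Min. Application of Simhash Algorithm in Text Deduplication[J]. Computer Engineering and Applications, 2020, 56(11): 246-251.
Authors:ZHANG Hang  SHENG Zhiwei  ZHANG Shibin  YANG Min
Affiliation:School of Cybersecurity, Chengdu University of Information Technology, Chengdu 610225
Funding:Sichuan Provincial University Scientific Research Innovation Team Project; Sichuan Provincial Key R&D Program Project; Sichuan Provincial Science and Technology Support Program Project; Sichuan Provincial Applied Basic Research Project; Sichuan Provincial Academic and Technical Leader Training Support Fund; Sichuan Provincial Department of Education Project; National Key R&D Program of China
Abstract:To improve the deduplication effect and accuracy of the Simhash algorithm, and to address its inability to reflect distribution information, a Simhash algorithm based on information entropy weighting (E-Simhash for short) is proposed. The algorithm introduces TF-IDF and information entropy, optimizes the weight and threshold calculation in Simhash, and adds text distribution information, so that the generated fingerprint better reflects the proportion of key information. The correlation between fingerprint information and weight is also analyzed. Simulation experiments show that optimizing the weight calculation effectively improves the performance of Simhash: E-Simhash outperforms the traditional Simhash algorithm in deduplication rate, recall rate, and F value, and achieves good results in text deduplication.

Keywords:Simhash  information entropy  term frequency-inverse document frequency (TF-IDF)  weight optimization  text deduplication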
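
For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of a weighted Simhash fingerprint and a Hamming-distance duplicate check. It is not the authors' implementation. The function names, the use of MD5 as the token hash, the 64-bit fingerprint length, and the distance threshold of 3 are all assumptions chosen for illustration; the token weights are supplied by the caller and could come from any weighting scheme.

```python
import hashlib


def simhash(weights, bits=64):
    """Build a Simhash fingerprint from a {token: weight} mapping.

    Assumptions for illustration: MD5 truncated to 8 bytes as the token
    hash, so `bits` must not exceed 64.
    """
    acc = [0.0] * bits
    for token, w in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            acc[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i in range(bits):
        # A positive accumulated value sets the corresponding fingerprint bit.
        if acc[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_duplicate(a, b, threshold=3):
    """Treat two fingerprints as duplicates if they differ in few bits.

    The threshold value is an illustrative assumption.
    """
    return hamming(a, b) <= threshold
```

Because every token contributes to each bit in proportion to its weight, changing how the weights are computed directly changes which tokens dominate the fingerprint, which is the lever the abstract describes.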

Application of Simhash Algorithm in Text Deduplication
ZHANG Hang,SHENG Zhiwei,ZHANG Shibin,YANG Min.Application of Simhash Algorithm in Text Deduplication[J].Computer Engineering and Applications,2020,56(11):246-251.
Authors:ZHANG Hang  SHENG Zhiwei  ZHANG Shibin  YANG Min
Affiliation:School of Cybersecurity, Chengdu University of Information Technology, Chengdu 610225, China
Abstract:To improve the deduplication effect and accuracy of the Simhash algorithm, and to address its inability to reflect distribution information, this paper proposes an improved Simhash algorithm based on information entropy weighting, abbreviated as E-Simhash. By introducing TF-IDF and information entropy, the algorithm optimizes the weight and threshold calculation in Simhash and adds text distribution information, so that the final fingerprint better reflects the proportion of key information. The correlation between fingerprint information and weight is also analyzed. Experimental results show that optimizing the weight calculation effectively improves the performance of Simhash: E-Simhash outperforms the traditional Simhash algorithm in deduplication rate, recall rate, and F value, and also performs well in Chinese similarity detection, which supports the effectiveness and accuracy of the proposed method.
Keywords:Simhash  information entropy  term frequency-inverse document frequency  weight optimization  text deduplication  
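
The abstract states that the token weights combine TF-IDF with information entropy but does not give the formula. The sketch below shows one plausible way such a weight could be computed; it is an assumption for illustration, not the weighting defined in the paper. The function name `tfidf_entropy_weights`, the IDF variant `ln(N/df) + 1`, and the combination `TF-IDF * (1 + H)` are all hypothetical choices.

```python
import math
from collections import Counter


def tfidf_entropy_weights(docs):
    """Return one {token: weight} dict per tokenized document.

    Hypothetical weighting (not taken from the paper):
        weight = tf * (ln(N / df) + 1) * (1 + H)
    where H is the Shannon entropy (in bits) of the token's occurrence
    distribution across the documents of the corpus.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]

    doc_freq = Counter()   # number of documents containing each token
    total = Counter()      # total occurrences of each token in the corpus
    for c in counts:
        doc_freq.update(c.keys())
        total.update(c)

    entropy = {}
    for token in total:
        h = 0.0
        for c in counts:
            if c[token]:
                p = c[token] / total[token]
                h -= p * math.log2(p)
        entropy[token] = h

    weights = []
    for doc, c in zip(docs, counts):
        length = len(doc) or 1
        weights.append({
            token: (k / length)
            * (math.log(n_docs / doc_freq[token]) + 1.0)
            * (1.0 + entropy[token])
            for token, k in c.items()
        })
    return weights
```

Under this assumed scheme, a token concentrated in a single document has zero entropy and keeps its plain TF-IDF weight, while a token spread across documents receives an additional entropy factor. Whether the paper scales weights in this direction is not stated in the abstract.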
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).