A word order sensitive similarity measure based on edit distance
Cite this article: ZHANG Lei, CUI Rongyi. A word order sensitive similarity measure based on edit distance[J]. Journal of Yanbian University (Natural Science), 2020, 0(2): 140-144.
Authors: ZHANG Lei, CUI Rongyi
Affiliation: College of Engineering, Yanbian University, Yanji 133002, Jilin, China
Abstract: Cosine similarity cannot reflect differences in the order of terms under the bag-of-words model. To address this shortcoming, this paper proposes a document similarity measure based on edit distance. First, the problems of the tf-idf-based bag-of-words model and of the cosine similarity calculation are analyzed. Second, the Jaccard coefficient and the edit distance are used to describe the order differences between the words in the common substrings of two strings, and a word order sensitive similarity measure is proposed. Finally, experimental data are used to verify the effectiveness of the algorithm. Compared with the original cosine similarity method, the F1 score of the proposed method improves by 0.0825 on Top1 and by 0.1126 on Top3, which indicates that the method can effectively improve the performance of an information retrieval system and has good application value.

Keywords: text similarity; bag-of-words model; edit distance; word order
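
The abstract does not state the formulas, so the following Python sketch only illustrates the idea under explicit assumptions. It first shows that cosine similarity over a bag of words ignores word order, then combines a Jaccard coefficient with a word-level edit distance over the shared words to obtain an order-sensitive score. The function names, the use of raw term counts in place of tf-idf weights, and the product used to combine the two quantities are all hypothetical choices made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of raw term-count vectors (assumption: no idf weighting)."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient of the two vocabularies."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two sequences (unit-cost insert, delete, substitute)."""
    prev = list(range(len(seq_b) + 1))
    for i, x in enumerate(seq_a, start=1):
        curr = [i]
        for j, y in enumerate(seq_b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def order_sensitive_similarity(tokens_a, tokens_b):
    """Hypothetical score: Jaccard overlap scaled by order agreement of shared words."""
    common = set(tokens_a) & set(tokens_b)
    shared_a = [t for t in tokens_a if t in common]
    shared_b = [t for t in tokens_b if t in common]
    longest = max(len(shared_a), len(shared_b))
    if longest == 0:
        return 0.0
    # Normalized word-level edit distance between the shared-word sequences.
    order_agreement = 1.0 - edit_distance(shared_a, shared_b) / longest
    return jaccard(tokens_a, tokens_b) * order_agreement


doc_a = "dog bites man".split()
doc_b = "man bites dog".split()
cos_score = cosine_similarity(doc_a, doc_b)             # about 1.0: same bag of words
order_score = order_sensitive_similarity(doc_a, doc_b)  # about 0.33: word order differs
```

In this toy pair the two documents contain the same words, so cosine similarity treats them as identical, while the sketched score drops because two of the three shared words must be substituted to turn one word sequence into the other.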
This article is indexed in CNKI and other databases.