
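Uyghur Document Similarity Calculation and Plagiarism Detection Based on Hierarchical Matching
Cite this article: Yasen Aizezi, Aishan Wumaier, Alimu Aisha. Uyghur Document Similarity Calculation and Plagiarism Detection Based on Hierarchical Matching [J]. Application Research of Computers, 2019, 36(6).
Authors: Yasen Aizezi, Aishan Wumaier, Alimu Aisha
Affiliations: Department of Information Security Engineering, Xinjiang Police College; College of Information Science and Engineering, Xinjiang University; Network Center, Xinjiang University
Funding: National Natural Science Foundation of China (61762086, 61662077, 61363064); National Social Science Foundation of China (13CFX055); Scientific Research Program of Higher Education Institutions of Xinjiang Uygur Autonomous Region (XJEDU2016I052, XJEDU2017M046)
Abstract: To address similarity calculation and plagiarism detection between documents written in Uyghur, a content-based Uyghur plagiarism detection (U-PD) method is proposed. First, a preprocessing stage performs word segmentation, stop-word removal, stem extraction, and synonym replacement on the Uyghur text; stem extraction is implemented with an N-gram statistical model. Next, the BKDRhash algorithm computes a hash value for each text block, and these values form the hash fingerprint of the whole document. Finally, using the hash fingerprints, the RKR-GST matching algorithm matches the document against a document library at the document, paragraph, and sentence levels to obtain the document similarity, on which plagiarism detection is based. Experimental evaluation on Uyghur documents shows that the proposed method accurately detects plagiarized documents and is feasible and effective.

Keywords: Uyghur documents; similarity; plagiarism detection; document hash fingerprint; hierarchical matching
Received: 2017/12/18
Revised: 2019/5/8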

Uyghur Document Similarity Calculation and Plagiarism Detection Based on Hierarchical Matching
Yasen AIZEZI, Aishan WUMAIER and Alimu AISHA. Uyghur Document Similarity Calculation and Plagiarism Detection Based on Hierarchical Matching [J]. Application Research of Computers, 2019, 36(6).
Authors: Yasen AIZEZI, Aishan WUMAIER and Alimu AISHA
Affiliation: Department of Information Security Engineering, Xinjiang Police College, Urumqi, Xinjiang
Abstract: To address similarity calculation and plagiarism detection for documents written in Uyghur, a content-based Uyghur plagiarism detection (U-PD) method is proposed. First, in a preprocessing stage, the Uyghur text is segmented into words, stop words are removed, stems are extracted, and synonyms are replaced; stem extraction is based on an N-gram statistical model. Then, the hash value of each text block is calculated with the BKDRhash algorithm, and the hash fingerprint of the entire document is constructed from these values. Finally, using the hash fingerprints, the document is matched against the document library at the document, paragraph, and sentence levels with the RKR-GST matching algorithm to obtain the document similarity, which serves as the basis for plagiarism detection. Experimental evaluation on Uyghur documents shows that the proposed method detects plagiarized documents accurately and is feasible and effective.
Keywords: Uyghur documents; similarity; plagiarism detection; document hash fingerprinting; hierarchical matching
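
The fingerprinting step described in the abstract can be illustrated with a minimal sketch. It assumes the text has already been preprocessed into a list of stems, treats a "text block" as a run of `k` consecutive tokens, and uses the commonly cited BKDR seed 131 with 32-bit truncation; the abstract states none of these parameters, so they are assumptions. The function names `bkdr_hash`, `fingerprint`, and `similarity` are hypothetical, and the Jaccard overlap is a simple stand-in for comparison, not the RKR-GST matching used in the paper.

```python
def bkdr_hash(text: str, seed: int = 131, bits: int = 32) -> int:
    """BKDR string hash: h = h * seed + byte, truncated to `bits` bits.

    seed=131 and bits=32 are common choices, assumed here (not from the paper).
    """
    mask = (1 << bits) - 1
    h = 0
    for byte in text.encode("utf-8"):
        h = (h * seed + byte) & mask
    return h


def fingerprint(tokens: list[str], k: int = 3) -> set[int]:
    """Hash every run of k consecutive tokens; the set is the document fingerprint.

    The block size k is an assumption made for illustration.
    """
    return {
        bkdr_hash(" ".join(tokens[i:i + k]))
        for i in range(len(tokens) - k + 1)
    }


def similarity(fp_a: set[int], fp_b: set[int]) -> float:
    """Jaccard overlap of two fingerprints (a stand-in, not RKR-GST)."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# Placeholder tokens standing in for preprocessed Uyghur stems.
doc_a = "s1 s2 s3 s4 s5 s6".split()
doc_b = "s1 s2 s3 s4 t1 t2".split()
print(similarity(fingerprint(doc_a), fingerprint(doc_b)))
```

The same three functions could be applied to a whole document, to each paragraph, or to each sentence to mimic the document-level, paragraph-level, and sentence-level comparison, although the paper's actual matching relies on RKR-GST tiles and not on set overlap.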