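Design of a redundancy-elimination algorithm for deduplication metadata based on the condensed nearest neighbor rule
Cite this article: Wen-bin YAO, Peng-di YE, Xiao-yong LI, Jing-kun CHANG. Design of a redundancy-elimination algorithm for deduplication metadata based on the condensed nearest neighbor rule[J]. Journal on Communications, 2015, 36(8): 1-7.
Authors: Wen-bin YAO  Peng-di YE  Xiao-yong LI  Jing-kun CHANG
Affiliations: 1. Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876; 2. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876; 3. Locomotive and Car Research Institute, China Academy of Railway Sciences, Beijing 100081; 4. Key Laboratory of Trustworthy Distributed Computing and Service (Ministry of Education), Beijing University of Posts and Telecommunications, Beijing 100876
Funding: National Natural Science Foundation of China (61370069); National High Technology Research and Development Program of China ("863" Program) (2012AA012600); Fundamental Research Funds for the Central Universities (BUPT2011RCZJ16)
Abstract: As the number of deduplication runs grows, metadata such as the manifest files that store fingerprint indexes keeps accumulating in the system and incurs a non-negligible storage overhead. Compressing the metadata generated during deduplication, and thereby shrinking the lookup index, without affecting the deduplication ratio is therefore an important factor in further improving deduplication efficiency and storage utilization. Because deduplication metadata contains a large amount of redundant data, a redundancy-elimination algorithm for deduplication metadata based on the condensed nearest neighbor rule, Dedup2, is proposed. The algorithm first uses a clustering algorithm to partition the deduplication metadata into several classes, then applies the condensed nearest neighbor algorithm to eliminate highly similar entries and obtain a deduplication subset, and finally deduplicates data objects against this subset by exploiting file similarity. Experimental results show that Dedup2 compresses the deduplication index by more than 50% while maintaining a nearly unchanged deduplication ratio.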

Keywords: deduplication  deduplication metadata  condensed nearest neighbor rule
Received: 2014-07-07

Deduplication algorithm based on condensed nearest neighbor rule for deduplication metadata
Wen-bin YAO, Peng-di YE, Xiao-yong LI, Jing-kun CHANG. Deduplication algorithm based on condensed nearest neighbor rule for deduplication metadata[J]. Journal on Communications, 2015, 36(8): 1-7.
Authors: Wen-bin YAO  Peng-di YE  Xiao-yong LI  Jing-kun CHANG
Abstract: Building an effective deduplication index in memory reduces disk accesses and speeds up chunk fingerprint lookup, which is a major challenge for deduplication algorithms in massive-data environments. Because deduplication data sets contain many highly similar samples, a deduplication algorithm based on the condensed nearest neighbor rule, called Dedup2, is proposed. Dedup2 uses a clustering algorithm to divide the original deduplication metadata into several categories. Within these categories, it applies the condensed nearest neighbor rule to remove highly similar data from the deduplication metadata, which yields a subset of the metadata. New data objects are then deduplicated against this subset based on the principle of data similarity. Experimental results show that Dedup2 reduces the size of the deduplication data set by more than 50% while maintaining a similar deduplication ratio.
Keywords: deduplication  deduplication metadata  condensed nearest neighbor rule
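
The condensing step named in the abstract can be illustrated with the classic condensed nearest neighbor rule (Hart's rule). The sketch below is only an illustration under stated assumptions, not the authors' Dedup2 implementation: the abstract does not say how metadata entries are represented, which distance is used, or which clustering algorithm assigns the categories. Here each entry is assumed to be a small numeric feature vector, the distance is assumed to be Euclidean, and the cluster labels are assumed to be given. The names `nearest_label` and `condense` are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_label(store, point):
    """Return the label of the stored sample closest to `point` (1-NN)."""
    return min(store, key=lambda s: dist(s[0], point))[1]


def condense(samples):
    """Condensed nearest neighbor rule (Hart's rule).

    `samples` is a list of (feature_vector, cluster_label) pairs.
    Returns a subset that still classifies every input sample
    correctly under 1-NN, so near-duplicate samples are dropped.
    """
    if not samples:
        return []
    store = [samples[0]]
    changed = True
    while changed:  # repeat passes until no sample is added
        changed = False
        for sample in samples:
            if sample in store:
                continue
            point, label = sample
            # Keep a sample only if the current subset misclassifies it.
            if nearest_label(store, point) != label:
                store.append(sample)
                changed = True
    return store


# Toy data (assumed): two tight clusters standing in for two
# categories of highly similar metadata entries.
samples = [
    ((0.0, 0.0), 0), ((0.1, 0.0), 0), ((0.0, 0.1), 0), ((0.1, 0.1), 0),
    ((5.0, 5.0), 1), ((5.1, 5.0), 1), ((5.0, 5.1), 1), ((5.1, 5.1), 1),
]
subset = condense(samples)
print(len(samples), "->", len(subset))
```

On this toy input the retained subset is much smaller than the input, yet a nearest-neighbor lookup against it assigns every original sample to the same category as before. That is the property the abstract relies on when it deduplicates new data objects against the reduced subset.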