Similar Documents
20 similar documents found (search time: 31 ms)
1.
We describe an efficient and easily applicable data deduplication framework with heuristic-prediction-based adaptive block skipping for real-world datasets such as disk images, which saves deduplication-related overheads and improves deduplication throughput while maintaining good deduplication efficiency. Under the framework, deduplication operations are skipped for data chunks predicted to be likely non-duplicates, in conjunction with a hit-and-matching-extension process for duplicate identification within skipped blocks and a hysteresis-based hash indexing process that updates the hash indices for re-encountered skipped chunks. For performance evaluation, the proposed framework was integrated into the existing Data Domain and sparse indexing deduplication algorithms. Experimental results on a real-world dataset of 1.0 TB of disk images showed that deduplication-related overheads were significantly reduced with adaptive block skipping, leading to a 30%~80% improvement in deduplication throughput when deduplication metadata were stored on disk for Data Domain, and 25%~40% RAM savings with a 15%~20% throughput improvement when an in-RAM sparse index was used for sparse indexing. In both cases, the reduction in deduplication ratio was below 5%.
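A minimal sketch of the skipping idea, assuming a simple consecutive-miss predictor; the heuristic, the thresholds, and names such as `skip_threshold` are illustrative assumptions, not the paper's actual algorithm:

```python
import hashlib

def dedup_with_skipping(chunks, skip_threshold=3, skip_block=8):
    """Toy dedup loop with heuristic block skipping: after `skip_threshold`
    consecutive index misses, the next `skip_block` chunks bypass the index
    lookup entirely (treated as likely non-duplicates), trading a little
    deduplication ratio for fewer index probes."""
    index = set()              # stands in for the on-disk fingerprint index
    consecutive_misses = 0
    to_skip = 0
    stored = lookups = 0
    for chunk in chunks:
        if to_skip > 0:        # predictor says "likely unique": skip the lookup
            to_skip -= 1
            stored += 1
            continue
        fp = hashlib.sha1(chunk).digest()
        lookups += 1
        if fp in index:
            consecutive_misses = 0      # duplicate found, keep probing normally
        else:
            index.add(fp)
            stored += 1
            consecutive_misses += 1
            if consecutive_misses >= skip_threshold:
                to_skip = skip_block    # predict a run of non-duplicates
                consecutive_misses = 0
    return stored, lookups

print(dedup_with_skipping([b"A", b"B", b"C", b"D", b"E", b"A", b"B", b"C"]))
```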

2.
Data deduplication for file communication across wide area networks (WANs), in applications such as file synchronization and mirroring in cloud environments, usually achieves significant bandwidth savings at the cost of significant deduplication time overheads. These overheads include the time required for deduplication at the two geographically distributed nodes (e.g., the disk access bottleneck) and the duplication query/answer operations between the sender and the receiver, since each query or answer introduces at least one round-trip time (RTT) of latency. In this paper, we present a data deduplication system across WAN with metadata feedback and metadata utilization (MFMU), in order to harness these deduplication-related time overheads. In the proposed MFMU system, selective metadata feedbacks from the receiver to the sender reduce the number of duplication query/answer operations. In addition, to harness the metadata-related disk I/O operations at the receiver, as well as the bandwidth overhead introduced by the metadata feedbacks, a metadata utilization component based on a hysteresis hash re-chunking mechanism is introduced. Our experimental results demonstrate that MFMU achieved an average of 20%~40% deduplication acceleration, with the bandwidth saving ratio not reduced by the metadata feedbacks, compared with the "baseline" content-defined chunking (CDC) used in LBFS (Low-Bandwidth Network File System) and existing state-of-the-art bimodal chunking based deduplication solutions.
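For reference, a toy content-defined chunker in the spirit of the CDC baseline; it uses a simplistic rolling hash instead of LBFS's Rabin fingerprints, and the mask and size bounds are illustrative:

```python
import os

def cdc_chunks(data: bytes, mask=0x1FFF, min_size=2048, max_size=65536):
    """Cut a chunk boundary wherever (hash & mask) == 0, so boundaries follow
    content rather than fixed offsets and survive insertions and deletions."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF   # toy hash, not a true Rabin fingerprint
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])          # trailing partial chunk
    return chunks

sizes = [len(c) for c in cdc_chunks(os.urandom(1 << 20))]
print(len(sizes), "chunks, avg", sum(sizes) // len(sizes), "bytes")
```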

3.
Solid-state drives (SSDs) have been widely used as a caching tier for disk-based RAID systems to speed up data-intensive applications. However, traditional cache schemes fail to effectively boost parity-based RAID storage systems (e.g., RAID-5/6), which have poor random write performance due to the small-write problem. Worse, intensive cache writes can wear out the SSD quickly, causing performance degradation and cost increases. In this article, we present the design and implementation of KDD, an efficient SSD-based caching system that Keeps Data and Deltas in SSD. When write requests hit in the cache, KDD dispatches the data to the RAID storage without updating the parity blocks, mitigating the small-write penalty, and compactly stores the compressed deltas in the SSD to reduce cache write traffic while guaranteeing reliability in case of disk failures. In addition, KDD organizes the metadata partition on the SSD as a circular log to make the cache persistent with low overhead. We evaluate the performance of KDD via both simulations and prototype implementations. Experimental results show that KDD effectively reduces the small-write penalty while extending the lifetime of the SSD-based cache by up to 6.85 times.
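One plausible way to obtain the compact deltas KDD caches, assuming XOR-plus-compression of the old and new block images; the paper's actual delta encoding is not specified in this abstract:

```python
import zlib

def make_delta(old: bytes, new: bytes) -> bytes:
    """XOR the two block images and compress: a small in-place update XORs
    to mostly zero bytes, which compresses very well."""
    assert len(old) == len(new)
    return zlib.compress(bytes(a ^ b for a, b in zip(old, new)))

def apply_delta(old: bytes, delta: bytes) -> bytes:
    """Recover the new block from the old block plus the cached delta."""
    return bytes(a ^ b for a, b in zip(old, zlib.decompress(delta)))

old = bytes(4096)                            # a 4 KB block before the write
new = old[:100] + b"newdata!" + old[108:]    # 8 bytes updated in place
delta = make_delta(old, new)
assert apply_delta(old, delta) == new
print(len(delta), "bytes cached instead of a 4 KB block")
```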

4.
In the exascale computing era, supercomputing systems generally adopt multi-tier storage architectures to satisfy the capacity and performance requirements of application data access. Because the storage media at different tiers differ substantially, it is difficult to manage them under a unified namespace, and applications often have to modify their data access workflows to fully exploit the performance and capacity advantages of multi-tier storage. To address the unified-namespace problem, this paper proposes a block-level caching technique for non-volatile dual in-line memory modules (NVDIMMs) and a file-level caching technique for burst buffer (BB) storage. The NVDIMM-based block-level cache flexibly controls cache windows to support asynchronous reads and writes at block granularity, unifying the namespaces of the NVDIMM and BB tiers; the BB-based file-level cache keeps data in the BB tier and dynamically migrates and manages file replicas, unifying the namespaces of the BB tier and the conventional disk file system. Tests on the Sunway exascale prototype system show that the two techniques solve the problem of transparent acceleration for multi-tier storage well. With a 16 MB cache window, the NVDIMM block-level cache improves 128 KB sequential read/write bandwidth by 27% and 36% and 8 KB random read/write bandwidth by 20% and 37% compared with BB; the BB-based file cache exploits the high bandwidth of BB to serve data accesses, improving 128 KB sequential read/write bandwidth by 55% and 141% and 8 KB random read/write bandwidth by 163% and 209% compared with the global file system. Tests with real applications also show that both caching techniques provide transparent storage acceleration.

5.
孙耀  刘杰  叶丹  钟华 《软件学报》2016,27(12):3192-3207
Request load balancing is a core problem for metadata management in distributed file systems. Aiming to maximize the throughput of the metadata server cluster, we design and implement a distributed caching framework on top of the existing metadata management layer that manages hot metadata exclusively and balances the continuously changing load. Compared with existing metadata load-balancing architectures, this two-tier architecture is more flexible and more responsive to load changes, and it avoids breaking the metadata namespace structure through the redistribution and migration of hot metadata. Our analysis shows that metadata items are small but numerous, and the penalty of prefetching the wrong metadata is far lower than that of prefetching the wrong data. Exploiting these characteristics, we propose a metadata prefetching policy and a prefetch-based metadata cache replacement algorithm that strengthen the performance of the distributed caching layer; the two-tier framework also takes cache consistency into account. Finally, we validate the effectiveness of the framework and methods in a real distributed file system.

6.
On-board disk cache is an effective approach to improving disk performance by reducing the number of physical accesses to the magnetic media. Disk drive manufacturers are increasing the on-board disk cache size to match the capacity growth of the backend magnetic media; some disk drives nowadays have a cache of 32 MB. Modern computer systems use large amounts of memory to improve performance, so any data brought into host memory will be re-accessed there, not in the on-board disk cache. This has a significant impact on the behavior of the disk cache, because computer systems are complex systems whose components are correlated with one another; a specific component cannot be isolated from the overall system when we analyze its performance behavior. This paper employs four block-level real traces to explore the performance behavior of the on-board disk cache, considering the impact of the cache hierarchy in computer systems. The analysis yields three major implications: (1) the I/O stream at block level contains negligible temporal locality, so a read/write cache can achieve only marginal benefits; (2) a static write cache does not achieve performance gains, since the write stream does not interfere much with the read stream, so it is better to leave the on-board disk cache shared by both streams; (3) the read cache dominates the contribution to the hit ratio besides prefetching, so it is better to focus on improving the read performance of the disk cache rather than its write performance.
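A small trace-replay sketch of implication (1): run a block-level stream through an LRU cache and observe that, with re-references already absorbed by host memory, mostly-sequential block streams yield a negligible hit ratio. The trace and cache size here are made up:

```python
from collections import OrderedDict

def lru_hit_ratio(trace, cache_blocks):
    """Replay a sequence of logical block addresses through an LRU cache."""
    cache, hits = OrderedDict(), 0
    for lba in trace:
        if lba in cache:
            hits += 1
            cache.move_to_end(lba)           # refresh recency on a hit
        else:
            cache[lba] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)    # evict the least recently used block
    return hits / len(trace)

trace = list(range(10000)) + list(range(5000, 15000))   # two long sequential scans
print(f"LRU hit ratio: {lru_hit_ratio(trace, cache_blocks=2048):.2%}")
```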

7.
Deduplication is commonly used in both enterprise storage systems and cloud storage. To overcome the performance challenge of selective restore operations in deduplication systems, solid-state-drive-based (i.e., SSD-based) read caches can be deployed to speed up restores by dynamically caching popular restore contents. Unfortunately, frequent data updates induced by classical cache schemes (e.g., LRU and LFU) significantly shorten SSD lifetime while slowing down I/O processes in the SSD. To address this problem, we propose a new solution, LOP-cache, to greatly improve the write durability of the SSD as well as I/O performance by enlarging the proportion of long-term popular (LOP) data among the data written into the SSD-based cache. LOP-cache keeps LOP data in the SSD cache for long periods to reduce the number of cache replacements, and it prevents unpopular or unnecessary data in deduplication containers from being written into the SSD cache. We implemented LOP-cache in a prototype deduplication system to evaluate its performance. Our experimental results show that LOP-cache shortens the latency of selective restore by an average of 37.3% at the cost of a small SSD-based cache with only 5.56% of the capacity of the deduplicated data. Importantly, LOP-cache improves SSD lifetime by a factor of 9.77. The evidence shows that LOP-cache offers a cost-efficient SSD-based read-cache solution to boost the performance of selective restore in deduplication systems.
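A minimal sketch of popularity-gated admission in the LOP-cache spirit: data enters the SSD cache only after proving popular, so unpopular restore data never costs a flash write. The `admit_after` threshold and the omission of eviction are illustrative simplifications, not the paper's algorithm:

```python
from collections import Counter

class PopularityGatedCache:
    """SSD read cache that admits a chunk only after `admit_after` accesses."""
    def __init__(self, capacity, admit_after=3):
        self.capacity, self.admit_after = capacity, admit_after
        self.counts = Counter()      # long-term access counts
        self.cache = {}              # stands in for the SSD-resident cache

    def access(self, key, load_from_disk):
        self.counts[key] += 1
        if key in self.cache:
            return self.cache[key]   # SSD hit: no disk read, no flash write
        value = load_from_disk(key)  # HDD read
        if self.counts[key] >= self.admit_after and len(self.cache) < self.capacity:
            self.cache[key] = value  # one flash write, spent on popular data only
        return value

ssd = PopularityGatedCache(capacity=2)
for k in ["a", "b", "a", "a", "b", "c"]:
    ssd.access(k, lambda key: f"container:{key}")
print(sorted(ssd.cache))             # only chunks that proved popular
```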

8.
The flash-based SSD is used as a tiered cache between RAM and HDD. Conventional schemes do not utilize the nonvolatile feature of the SSD and cannot cache write requests, yet writes are a significant, often dominant, fraction of storage workloads. To cache write requests, the SSD cache must manage its data and metadata persistently and consistently, and guarantee no data loss even after a crash. Persistent cache management may require frequent metadata changes and cause high overhead. Some researchers insist that a nonvolatile persistent cache requires new primitives that are not supported by the general SSDs on the market. We propose a fully persistent read/write cache that improves both read and write performance, does not require any special primitive, has low overhead, guarantees the integrity of the cache metadata and the consistency of the cached data even during a crash or power failure, and recovers the flash cache quickly without any data loss. We implemented the persistent read/write cache as a block device driver in Linux. Our scheme targets virtual desktop infrastructure (VDI) servers, so the evaluation was performed with massive real desktop traces of five users over ten days. The evaluation shows that our scheme outperforms an LRU version of the SSD cache by 50% and the read-only version of our scheme by 37%, on average, across all experiments. This paper describes most parts of our scheme in detail; detailed pseudo-code is included in the Appendix.

9.
A Directory-Path-Based Metadata Management Method (cited by 7: 0 self-citations, 7 others)
刘仲  周兴铭 《软件学报》2007,18(2):236-245
We propose a metadata management method that separates directory-path attributes from directory objects, extending the existing object storage structure. The method effectively avoids the massive metadata updates and migrations caused by directory attribute modifications; it improves the utilization and hit ratio of the metadata server cache by reducing the overlapping caching of prefix directories; it reduces disk I/O operations by cutting the cost of directory-path traversal and fully exploiting the storage locality of directories; and it avoids overloading any single server through dynamic load balancing across metadata servers. Experimental results show that the method offers clear advantages in improving system performance, balancing metadata distribution, and reducing metadata migration.
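A sketch of the lookup this separation enables, assuming metadata placement by hashing the directory-object ID plus filename; the ID, key format, and server count are hypothetical:

```python
import hashlib

def mds_for(key: str, num_servers: int) -> int:
    """Map a metadata key to a metadata server by hashing."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_servers

# Path attributes live in the directory object, so renaming or chmod-ing a
# directory updates one object instead of rewriting every child's metadata.
dir_object_id = 4711                         # hypothetical ID for /home/alice
print(mds_for(f"{dir_object_id}/report.txt", num_servers=8))
```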

10.
In information-centric networking, in-network caching has the potential to improve network efficiency and content distribution performance by satisfying user requests with cached content rather than downloading the requested content from remote sources. In this respect, users who request, download, and keep content may contribute to in-network caching by sharing their downloaded content with other users in the same network domain (i.e., user-assisted in-network caching). In this paper, we examine various aspects of user-assisted in-network caching in the hope of efficiently utilizing user resources to achieve in-network caching. Through simulations, we first show that user-assisted in-network caching has attractive features, such as self-scalable caching, a near-optimal cache hit ratio (achievable when the content is fully cached by in-network caching) based on stable caching, and performance improvements over in-network caching. We then examine caching strategies for user-assisted in-network caching: three strategies based on a centralized server that maintains all content availability information and informs each user of what to cache, and three strategies based on each user's own content availability information. We show that the caching strategy affects the distribution of upload overhead across users and the number of cache hits in each segment. One interesting observation is that, even with a small storage space (i.e., 0.1% of the content size per user), the centralized and distributed approaches improve the cache hit ratio by 50% and 45%, respectively. With an overall view of caching information, the centralized approach can achieve a higher cache hit ratio than the distributed approach. Based on this observation, we discuss a distributed approach with a larger view of caching information than the basic distributed approach and, through simulations, confirm that a larger view leads to a higher cache hit ratio. Another interesting observation is that the random distributed strategy yields performance comparable to more complex strategies.

11.
Data deduplication (dedupe for short) is a special data compression technique that has been widely adopted to save backup time as well as storage space, particularly in backup storage systems. Consequently, most dedupe research has focused on improving dedupe write performance. However, dedupe read performance in backup storage is also crucial for storage recovery. This paper designs a new dedupe storage read cache for backup applications that improves read performance by exploiting a special characteristic: the read sequence is the same as the write sequence. For better cache utilization, it looks ahead at future references within a moving window and evicts the victim with the smallest future access from the cache. Moreover, to further improve read cache performance, it maintains a small log buffer to judiciously cache future-access data chunks. Extensive experiments with real-world backup workloads demonstrate that the proposed read cache scheme improves read performance by up to 64.3%.
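A minimal sketch of window-based eviction: because restores replay the backup write sequence, the future references inside a moving window are known, and the victim is the cached chunk needed latest (or never) within that window. The chunk names and window contents are made up:

```python
def pick_victim(cache, window):
    """Return the cached chunk whose next reference lies farthest in `window`;
    chunks never referenced within the window are ideal victims."""
    def next_use(chunk):
        try:
            return window.index(chunk)
        except ValueError:
            return len(window)       # not needed within the look-ahead window
    return max(cache, key=next_use)

cache = {"c1", "c2", "c3"}
window = ["c3", "c9", "c1", "c3"]    # upcoming chunk references during restore
print(pick_victim(cache, window))    # -> c2: never needed in the window
```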

12.
Cache acceleration exploits the high random access performance of solid state disks (SSDs) to improve the random read/write performance of mechanical hard disks. Traditional cache acceleration techniques, however, struggle to serve the hot-data access patterns of the big data era, such as high concurrency and intermittent but frequent accesses. To improve overall cache performance, this paper proposes a cache policy based on a virtual storage layer (CVSL), which combines caching with tiered storage: through heat statistics and logical data migration, it realizes cache control based on logical data tiering. Experimental results show that, compared with traditional cache policies, the CVSL policy improves random read/write performance by 9%~10% without noticeable fluctuation and achieves a good cache hit ratio, meeting the design goals.

13.
In large-scale distributed storage systems, high-performance and scalable metadata service has become an important research topic. In the metadata server (MDS), metadata are decomposed into directory objects and file objects. A directory object holds locating metadata, providing file location and access control; a file object holds descriptive metadata, describing the data characteristics of a file. Each MDS manages all directory objects and its own file objects. Meanwhile, a hash value keyed on the directory-object ID and the filename serves as the index into the local metadata lookup table, and each MDS compresses its local lookup table into a summary with the Bloom filter algorithm. This exploits the MDS cache and raises its hit ratio, reduces disk I/O operations, supports dynamic MDS scaling, and enables fast metadata lookup.
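A compact Bloom filter of the kind each MDS could publish as its lookup-table summary, keyed on directory-object ID plus filename; the bit-array size and hash construction are illustrative:

```python
import hashlib

class BloomFilter:
    """Set membership summary: false positives are possible, false negatives
    are not, so a negative answer avoids any disk I/O at the remote MDS."""
    def __init__(self, bits=1 << 20, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.bitmap = bytearray(bits // 8)

    def _positions(self, key: str):
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key: str):
        for p in self._positions(key):
            self.bitmap[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: str):
        return all(self.bitmap[p // 8] >> (p % 8) & 1 for p in self._positions(key))

summary = BloomFilter()
summary.add("4711/report.txt")       # key: directory-object ID + filename
print("4711/report.txt" in summary, "4711/missing.txt" in summary)
```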

14.
Considering the current price gap between hard disk and flash memory SSD storage, for applications dealing with large-scale data it is economically more sensible to use flash memory drives to supplement disk drives rather than to replace them. This paper presents FaCE, a new low-overhead caching strategy that uses flash memory as an extension to the RAM buffer of database systems. FaCE aims at improving transaction throughput as well as shortening the recovery time from a system failure. To achieve these goals, we propose two novel algorithms for flash cache management, namely multi-version FIFO replacement and group second chance, which provide flash write optimization as well as disk access reduction. In addition, FaCE takes advantage of the nonvolatility of flash memory to fully support database recovery by extending the scope of a persistent database to include the data pages stored in the flash cache. We have implemented FaCE in the PostgreSQL open-source database server and demonstrated its effectiveness on TPC-C benchmarks in comparison with existing caching methods such as Lazy Cleaning and Linux Bcache.

15.
With the arrival of the big data era, growing backup data volumes pose new challenges to storage space. Data deduplication is becoming popular in backup storage systems, but the massive data accesses it incurs place a heavy burden on disks. To address the disk bottleneck of chunk-index lookups in deduplication, this paper proposes combining file similarity with data stream locality to improve disk I/O performance. The combination plays to the strengths of both: similarity optimizes index lookup and can detect duplicate data that identical-data detection alone cannot identify, while data stream locality preserves the sequence of the data stream, raising the cache hit ratio and reducing the number of disk accesses. Storing the chunk index in a Bloom filter saves substantial lookup time and space overhead. The paper analyzes in depth the key parameters involved in the proposed solution, such as chunk size and segment size, and their effect on the false positive rate. The experimental evaluation and performance analysis provide an important data basis for further system performance optimization.

16.
Modern single- and multi-processor computer systems incorporate, either directly or through a LAN, a number of storage devices with diverse performance characteristics. These storage devices have to deal with workloads of unpredictable burstiness. A storage-aware caching scheme, one that partitions the cache among the disks and aims at balancing the work across them, is necessary in this environment, and maintaining proper sizes for these partitions is crucial. Adjusting the partition size after each epoch (a fixed time interval) assumes that the workload in the subsequent epoch will show characteristics similar to those observed in the current epoch; in an environment with highly bursty and time-varying workloads, such an approach is optimistic. Moreover, existing storage-aware caching schemes assume a linear relationship between cache size and hit ratio, but in practice a disk's partition may accumulate cache blocks (thus choking the remaining disks) without increasing the hit ratio significantly. This disk choking phenomenon may degrade the performance of the disk system. In this paper, we address the issues of continuous repartitioning and disk choking. First, we present a caching scheme that continuously adjusts the partition sizes, forgoing any periodic activity. Then, considering the disk choking issue, we present a repartitioning framework based on the notion of marginal gains. Experimental results show the effectiveness of our approach: our scheme outperforms the existing storage-aware caching schemes.
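A sketch of marginal-gain-driven repartitioning: the partition whose extra blocks would earn the fewest additional hits donates space to the one that would earn the most, continuously rather than per epoch. How the gains are measured (e.g., from ghost-list hits) and the `step` granularity are assumptions:

```python
def repartition(sizes, marginal_hits, step=64):
    """Move `step` cache blocks from the lowest-gain partition to the
    highest-gain one; a choked disk that hoards blocks without earning extra
    hits reports a near-zero marginal gain and becomes the donor."""
    donor = min(sizes, key=lambda d: marginal_hits.get(d, 0))
    taker = max(sizes, key=lambda d: marginal_hits.get(d, 0))
    if donor != taker and sizes[donor] >= step:
        sizes[donor] -= step
        sizes[taker] += step
    return sizes

sizes = {"disk0": 4096, "disk1": 4096, "disk2": 4096}   # blocks per partition
gains = {"disk0": 2, "disk1": 90, "disk2": 15}          # estimated marginal hits
print(repartition(sizes, gains))
```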

17.
Router nodes in named data networking (NDN) have caching capability, which greatly improves the efficiency of data delivery and retrieval in the network. However, since router cache capacity is limited, designing effective caching strategies remains a pressing task. To address this problem, we propose a dynamic popularity-based cache decision and replacement policy (DPDR). DPDR jointly considers content popularity and cache capacity, dynamically adjusting a popularity threshold with an additive-increase multiplicative-decrease (AIMD) algorithm and caching the contents whose popularity exceeds the threshold. We also propose a cache replacement algorithm that combines factors such as the popularity of cached contents and the time they were last accessed, evicting the content with the smallest replacement value from the cache. Extensive simulation results show that, compared with other algorithms, the proposed policy effectively improves the cache hit ratio, shortens the average hit distance, and increases network throughput.
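A minimal sketch of the AIMD threshold update, assuming cache pressure drives the adjustment; the utilization watermarks and step sizes are illustrative, not DPDR's actual triggers:

```python
def update_threshold(threshold, utilization, add=1.0, mult=0.5,
                     high=0.9, low=0.6):
    """Additive increase / multiplicative decrease of the popularity bar:
    raise it slowly under cache pressure (admit less), cut it sharply when
    the cache sits underused (admit more)."""
    if utilization > high:
        return threshold + add
    if utilization < low:
        return max(1.0, threshold * mult)
    return threshold

threshold = 4.0
for u in [0.95, 0.95, 0.5]:          # two congested rounds, then an idle one
    threshold = update_threshold(threshold, u)
    print(f"utilization {u:.0%} -> popularity threshold {threshold}")
```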

18.
Recently, a hybrid disk drive that integrates a small amount of flash memory within a mechanical drive has received significant attention. The hybrid drive extends the storage hierarchy by using flash memory to cache data from the mechanical disk. Unfortunately, current caching architectures fail to fully exploit the potential of the hybrid drive. Furthermore, current disk input/output (I/O) schedulers are optimized for rotational mechanical disk drives and thus must be re-targeted for the hybrid disk drive. In this paper, we propose a new data caching scheme, called Profit Caching, for hybrid drives. Profit Caching is a self-optimizing caching algorithm. It considers and seamlessly integrates all possible data characteristics that impact the performance of hybrid drives, including read count, write count, sequentiality, randomness, and recency, to determine the caching policy. Moreover, we propose a hybrid disk-aware Completely Fair Queuing (HA-CFQ) scheduler to avoid unnecessary I/O anticipations of the CFQ scheduler. We have implemented Profit Caching and the HA-CFQ scheduler in the Linux kernel. Coupled with a trace-driven simulator, we have also conducted detailed experiments under a variety of workloads. Experimental results show that Profit Caching provides significantly improved performance compared with the previous schemes. In particular, the throughput of Profit Caching outperforms the previous Random Access First and FlashCache caching schemes by factors of up to 1.8 and 7.6, respectively. In addition, the HA-CFQ scheduler reduces the total execution time of the CFQ scheduler by up to 1.74%. Finally, the experimental results show that the runtime overhead of Profit Caching is insignificant and can be ignored. Copyright © 2014 John Wiley & Sons, Ltd.

19.
唐震  吴恒  王伟  魏峻  黄涛 《软件学报》2017,28(8):1982-1998
New storage media, represented by SSDs, are widely used in virtualized environments, typically as read/write caches for virtual machines that accelerate disk I/O. Existing work mostly focuses on SSD cache capacity planning and evaluates cache allocation by read/write hit ratios, without fully accounting for the SSD's service capability ceiling. Such approaches fit typical distributed application scenarios poorly: virtual machines may contend for SSD cache resources, causing performance violations for the applications inside them. This paper implements an adaptive SSD caching system for multi-objective optimization in virtualized environments that takes the SSD's service ceiling into account. It dynamically senses virtual machine and application states through an adaptive closed loop, detects local SSD cache contention, generates optimized virtual machine placement plans with a clustering method, and determines the order and timing of virtual machine migrations according to the global SSD cache supply capacity. Experimental results show that the approach effectively relieves contention for SSD cache resources in typical distributed application scenarios while meeting application requirements on virtual machine placement, improving application performance while preserving reliability. In Hadoop scenarios, it reduces task execution time by 25% on average and improves the throughput of I/O-intensive applications by 39% on average; in a ZooKeeper scenario, it handles virtual machine outages caused by single-point failures of a virtualized host at a performance cost of less than 5%.

20.
Deduplication-Based Storage Optimization for Virtual Desktops (cited by 1: 0 self-citations, 1 other)
Virtual desktop infrastructure (VDI) relies on massive cloud infrastructure in data centers to provide, on demand, the hardware and software resources required for virtual desktop deployment, but it faces low storage utilization and slow virtual machine boot. Exploiting the heavy data redundancy in virtual desktop storage, this work applies data deduplication to shrink the storage footprint of VDI, and uses local disks on the servers as caches, together with solid-state drives in the shared storage pool, to optimize virtual machine boot performance. A prototype implementation shows that static (fixed-size) chunking suits virtual desktop deduplication better than content-defined chunking, that the optimal chunk size is 4 KB, and that 85% of the storage space can be reclaimed; with I/O optimization through server-local disk caches and flash-based SSDs, virtual machine boot speed improves by 35%.
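A minimal sketch of the configuration the paper found best, static 4 KB chunking with fingerprint-based deduplication; the toy image and the SHA-1 fingerprint are illustrative choices:

```python
import hashlib

def dedup_image(image: bytes, chunk_size=4096):
    """Hash each aligned 4 KB block, store one copy per unique fingerprint,
    and keep a per-image recipe of fingerprints for reconstruction."""
    store, recipe = {}, []
    for off in range(0, len(image), chunk_size):
        chunk = image[off:off + chunk_size]
        fp = hashlib.sha1(chunk).hexdigest()
        store.setdefault(fp, chunk)      # first writer wins; duplicates are free
        recipe.append(fp)
    saved = 1 - sum(len(c) for c in store.values()) / max(len(image), 1)
    return store, recipe, saved

image = bytes(4096 * 80) + b"unique!!" * 512     # toy desktop image with redundancy
_, _, saved = dedup_image(image)
print(f"storage space saved: {saved:.0%}")
```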
