期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Accelerating big data analytics on HPC clusters using two-level storage

《Parallel Computing》2017

Data-intensive applications that are inherently I/O bound have become a major workload on traditional high-performance computing (HPC) clusters. Simply employing data-intensive computing storage such as HDFS or using parallel file systems available on HPC clusters to serve such applications incurs performance and scalability issues. In this paper, we present a novel two-level storage system that integrates an upper-level in-memory file system with a lower-level parallel file system. The former renders memory-speed high I/O performance and the latter renders consistent storage with large capacity. We build a two-level storage system prototype with Tachyon and OrangeFS, and analyze the resulting I/O throughput for typical MapReduce operations. Theoretical modeling and experiments show that the proposed two-level storage delivers higher aggregate I/O throughput than HDFS and OrangeFS and achieves scalable performance for both read and write. We expect this two-level storage approach to provide insights on system design for big data analytics on HPC clusters. 相似文献

2.

An optimized approach for storing and accessing small files on cloud storage

《Journal of Network and Computer Applications》2012,35(6):1847-1862

Hadoop distributed file system (HDFS) is widely adopted to support Internet services. Unfortunately, native HDFS does not perform well for large numbers but small size files, which has attracted significant attention. This paper firstly analyzes and points out the reasons of small file problem of HDFS: (1) large numbers of small files impose heavy burden on NameNode of HDFS; (2) correlations between small files are not considered for data placement; and (3) no optimization mechanism, such as prefetching, is provided to improve I/O performance. Secondly, in the context of HDFS, the clear cut-off point between large and small files is determined through experimentation, which helps determine ‘how small is small’. Thirdly, according to file correlation features, files are classified into three types: structurally-related files, logically-related files, and independent files. Finally, based on the above three steps, an optimized approach is designed to improve the storage and access efficiencies of small files on HDFS. File merging and prefetching scheme is applied for structurally-related small files, while file grouping and prefetching scheme is used for managing logically-related small files. Experimental results demonstrate that the proposed schemes effectively improve the storage and access efficiencies of small files, compared with native HDFS and a Hadoop file archiving facility. 相似文献

3.

多模态医疗数据中海量小文件存储优化方法

曾梦邹北骥张文生杨雪冰朱承璋《软件学报》2023,34(3):1451-1469

Hadoop分布式文件系统(HDFS)通常用于大文件的存储和管理,当进行海量小文件的存储和计算时,会消耗大量的NameNode内存和访问时间,成为制约HDFS性能的一个重要因素.针对多模态医疗数据中海量小文件问题,提出一种基于双层哈希编码和HBase的海量小文件存储优化方法.在小文件合并时,使用可扩展哈希函数构建索引文件存储桶,使索引文件可以根据需要进行动态扩展,实现文件追加功能.在每个存储桶中,使用MWHC哈希函数存储每个文件索引信息在索引文件中的位置,当访问文件时,无须读取所有文件的索引信息,只需读取相应存储桶中的索引信息即可,从而能够在O(1)的时间复杂度内读取文件,提高文件查找效率.为了满足多模态医疗数据的存储需求,使用HBase存储文件索引信息,并设置标识列用于标识不同模态的医疗数据,便于对不同模态数据的存储管理,并提高文件的读取速度.为了进一步优化存储性能,建立了基于LRU的元数据预取机制,并采用LZ4压缩算法对合并文件进行压缩存储.通过对比文件存取性能、NameNode内存使用率,实验结果表明,所提出的算法与原始HDFS、HAR、MapFile、TypeStorage以及... 相似文献

4.

HDFS中海量小文件合并与预取优化方法的研究

郑通郭卫斌范贵生《计算机科学》2017,44(Z11):516-519, 541

HDFS在存储海量文件时具有明显的优势, 但在存储小文件占绝大多数的海量文件时,HDFS单个NameNode的存储架构会导致其性能严重降低。为此,提出一种基于合并思想的方案,即将小文件合并为大文件,同时建立小文件到合并文件的映射关系,并将其存于HBase中。为了提高读取速度,建立了基于LRU的预取机制。实验表明,该方法能明显提高HDFS在处理海量文件时的整体性能。相似文献

5.

基于磁光虚拟存储系统的文件调度算法

王子炫魏力张育平《计算机与现代化》2019,(5):7

基于光盘库的Hadoop分布式文件系统（HDFS光盘库）在单位存储成本、数据安全性、使用寿命等方面非常符合当前大数据存储要求，但是HDFS不适合存储大量小文件和实时数据读取。为了使HDFS光盘库能更好地运用到更多大数据存储场景，本文提出一种更加适合大数据存储的磁光虚拟存储系统（MOVS, Magneto-optical Virtual Storage System）。系统在HDFS光盘库与用户之间加入磁盘缓存，并在磁盘缓存内通过文件标签分类、虚拟存储、小文件合并等技术将磁盘缓存内小文件合并为适合HDFS光盘库存储的大文件，提高系统的数据传输速度。系统还使用了文件预取、缓存替换等文件调度算法对磁盘缓存内文件进行动态更新，减少用户访问HDFS光盘库次数。实验结果表明，MOVS相对HDFS光盘库在响应时间和数据传输速度方面得到很大改善。相似文献

6.

Hadoop中处理海量小文件的方法

李旭李长云张清清胡淑新周玲芳《计算机系统应用》2015,24(11):157-161

针对Hadoop中提供底层存储的HDFS对处理海量小文件效率低下、严重影响性能的问题.设计了一种小文件合并、索引和提取方案,并与原始的HDFS以及HAR文件归档方案进行对比,通过一系列实验表明,本文的方案能有效减少Namenode内存占用,提高HDFS的I/O性能. 相似文献

7.

ParSA: High-throughput scientific data analysis framework with distributed file system

《Future Generation Computer Systems》2015

Scientific data analysis and visualization have become the key component for nowadays large scale simulations. Due to the rapidly increasing data volume and awkward I/O pattern among high structured files, known serial methods/tools cannot scale well and usually lead to poor performance over traditional architectures. In this paper, we propose a new framework: ParSA (parallel scientific data analysis) for high-throughput and scalable scientific analysis, with distributed file system. ParSA presents the optimization strategies for grouping and splitting logical units to utilize distributed I/O property of distributed file system, scheduling the distribution of block replicas to reduce network reading, as well as to maximize overlapping the data reading, processing, and transferring during computation. Besides, ParSA provides the similar interfaces as the NetCDF Operator (NCO), which is used in most of climate data diagnostic packages, making it easy to use this framework. We utilize ParSA to accelerate well-known analysis methods for climate models on Hadoop Distributed File System (HDFS). Experimental results demonstrate the high efficiency and scalability of ParSA, getting the maximum 1.3 GB/s throughput on a six nodes Hadoop cluster with five disks per node. Yet, it can only get 392 MB/s throughput on a RAID-6 storage node. 相似文献

8.

基于对象存储的负载均衡存储策略

熊安萍刘进进邹洋《计算机工程与设计》2012,33(7):2678-2682,2689

对象存储文件系统中将大数据文件分片存储到多个存储节点上,以获取更好的并行I/O性能,提高系统吞吐率.现有对象存储文件系统的存储策略并未充分考虑存储对象本身负载的动态变化,不利于提高系统资源利用率.针对此问题,考虑存储对象的空间及I/O等负载实时变化,提出了一种简单、灵活、高效的负载均衡存储策略,并对该策略进行了实现.实验结果表明,该策略能有效提高对象存储系统资源的利用率和吞吐率,保证对象存储文件系统高效的读写性能. 相似文献

9.

基于MapFile 的HDFS 小文件存储效率问题

洪旭升林世平《计算机系统应用》2012,21(11):179-182

针对HDFS最初是为流式访问大文件而开发的,而对于大量小文件的存储效率不高问题,采用MapFile设计一个HDFS中存储小文件的方案．该方案的主要思想是在上传HDFS时增加一个文件类型判断模块,建立一个小文件队列,将小文件序列化存入一个MapFile容器,合并成大文件,并建立相应的索引文件,有效降低文件数目和提高访问效率．通过和现有的HadoopArchives（HARfiles）文件归档解决小文件问题的方案对比,实验结果表明,基于MapFile的存储小文件方案可以更为有效的提高小文件存储性能和减少HDFS文件系统的节点内存消耗．相似文献

10.

基于新型存储器件的分布式文件系统性能优化

董聪张晓程文迪石佳《计算机应用》2020,40(12):3594-3603

新型存储器件的I/O性能通常比传统固态驱动器（SSD）高一个数量级,然而使用新型存储器件的分布式文件系统相对于使用SSD的分布式文件系统性能并没有显著的提高,这说明目前的分布式文件系统并不能充分发挥新型存储器件的性能。针对这个问题,对Hadoop分布式文件系统（HDFS）的数据写入流程及传输过程进行了量化分析。通过量化分析HDFS数据写入过程各阶段的时间开销,发现在写入数据的各个阶段中,节点间数据传输的时间占比较大。因此提出了对应的优化方案,通过异步写入的方式并行化数据传输与处理过程,使得不同数据包的处理阶段叠加起来,减少了数据包整体的处理时间,从而提升了HDFS的写入性能。实验结果表明,所提方案将HDFS的写入吞吐量提升了15%~24%,总体的写入执行时间降低了28%~36%。相似文献