Similar Documents
20 similar documents found (search time: 31 ms).
1.
MapReduce is a programming model from Google for cluster-based computing in domains such as search engines, machine learning, and data mining. MapReduce provides automatic data management and fault tolerance to improve the programmability of clusters. MapReduce's execution model includes an all-map-to-all-reduce communication, called the shuffle, across the network bisection. Some MapReductions move large amounts of data (e.g., as much as the input data), stressing the bisection bandwidth and introducing significant runtime overhead. Optimizing such shuffle-heavy MapReductions is important because (1) they include key applications (e.g., inverted indexing for search engines and data clustering for machine learning) and (2) they run longer than shuffle-light MapReductions (e.g., 5x longer). In MapReduce, the asynchronous nature of the shuffle results in some overlap between the shuffle and map. Unfortunately, this overlap is insufficient in shuffle-heavy MapReductions. We propose MapReduce with communication overlap (MaRCO) to achieve nearly full overlap via the novel idea of including reduce in the overlap. While MapReduce lazily performs reduce computation only after receiving all the map data, MaRCO employs eager reduce to process partial data from some map tasks while overlapping with other map tasks' communication. MaRCO's approach of hiding the latency of the inevitably high shuffle volume of shuffle-heavy MapReductions is fundamental to achieving good performance. We implement MaRCO in Hadoop's MapReduce and show that on a 128-node Amazon EC2 cluster, MaRCO achieves a 23% average speed-up over Hadoop for shuffle-heavy MapReductions.
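The eager-reduce idea can be illustrated with a small sketch (the function names and the in-process "shuffle" below are illustrative, not MaRCO's actual Hadoop implementation): partial map outputs are folded into a running accumulator as they arrive, so reduce work overlaps the remaining shuffle. This is safe when the reduce operation is associative and commutative, as in word counting.

```python
# Illustrative sketch of eager (partial) reduce overlapping the shuffle.
# Assumes the reduce function is associative/commutative (e.g., summing counts),
# which is what makes folding partial map outputs safe.
from collections import defaultdict

def map_task(chunk):
    """Toy mapper: word count over one input chunk."""
    out = defaultdict(int)
    for word in chunk.split():
        out[word] += 1
    return out

def eager_reduce(map_outputs):
    """Fold each map task's partial output as soon as it 'arrives',
    instead of waiting for the whole shuffle to finish."""
    acc = defaultdict(int)
    for partial in map_outputs:          # arrives incrementally in a real cluster
        for key, value in partial.items():
            acc[key] += value            # partial (eager) reduce
    return dict(acc)                     # final reduce is just the accumulated state

if __name__ == "__main__":
    chunks = ["a b a", "b c", "a c c"]
    print(eager_reduce(map_task(c) for c in chunks))   # {'a': 3, 'b': 2, 'c': 3}
```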

2.
MapReduce: A New Distributed Parallel Computing Programming Model   (Cited by: 3; self-citations: 0; by others: 3)
MapReduce is a distributed parallel computing programming model proposed by Google for the parallel processing of large-scale data. Inspired by functional programming languages, the MapReduce model splits a large-scale data processing job into a number of independently runnable Map tasks, dispatches them to different machines for execution, and produces intermediate files in a specific format; a number of Reduce tasks then merge these intermediate files to obtain the final output. When using the MapReduce model for large-scale data processing, users can focus their effort on writing the Map and Reduce functions, while the complex issues of parallel computing, such as the distributed file system, job scheduling, fault tolerance, and inter-machine communication, are handled by the MapReduce system, which greatly lowers the overall programming difficulty. MapReduce is increasingly becoming the mainstream programming model for cloud computing platforms. The open-source MapReduce system provided by the Apache Hadoop project still needs further improvement.
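As a concrete illustration of the model described above, here is a minimal word-count pair of Map and Reduce functions in the Hadoop Streaming style; the script name wc.py and the local test pipeline are illustrative assumptions, not part of the cited work.

```python
#!/usr/bin/env python3
# Minimal word count in the Hadoop Streaming style: the mapper and reducer read
# lines from stdin and write tab-separated key/value pairs to stdout. Hadoop
# (or a sort pipeline when testing locally) performs the shuffle between stages.
import sys

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    current, count = None, 0
    for line in lines:                       # input arrives sorted by key after the shuffle
        word, value = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"

if __name__ == "__main__":
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    out = mapper(sys.stdin) if stage == "map" else reducer(sys.stdin)
    sys.stdout.write("\n".join(out) + "\n")
```

Locally the shuffle can be emulated with a sort, e.g. `cat input.txt | python3 wc.py map | sort | python3 wc.py reduce`.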

3.
To improve communication efficiency and reduce the volume of shuffled data when large-scale clusters run MapReduce jobs, this work first builds a distributed cooperative data-mapping model using a strategy that trades storage locality for communication locality. It then extracts the locality features of job data through random sampling and machine learning to place map input data effectively. Finally, it exploits the global, flexible control of software-defined networking to select nodes with good communication links and map compute tasks onto them. Experiments show a clear benefit for shuffle-intensive jobs, with communication delay reduced by 4.3%–5.8%. The scheme reduces shuffle traffic and data-migration delay, and is applicable to various scheduling policies and network topologies.

4.
Nowadays, we are witnessing the fast production of very large amounts of data, particularly by the users of online systems on the Web. However, processing this big data is very challenging since both space and computational requirements are hard to satisfy. One solution for dealing with such requirements is to take advantage of parallel frameworks, such as MapReduce or Spark, which make it possible to build powerful computing and storage units on top of ordinary machines. Although these key-based frameworks have been praised for their high scalability and fault tolerance, they show poor performance in the case of data skew. There are important cases where a high percentage of processing on the reduce side ends up being done by only one node. In this paper, we present FP-Hadoop, a Hadoop-based system that renders the reduce side of MapReduce more parallel by efficiently tackling the problem of reduce data skew. FP-Hadoop introduces a new phase, denoted intermediate reduce (IR), where blocks of intermediate values are processed by intermediate reduce workers in parallel. With this approach, even when all intermediate values are associated with the same key, the main part of the reducing work can be performed in parallel, taking advantage of the computing power of all available workers. We implemented a prototype of FP-Hadoop and conducted extensive experiments over synthetic and real datasets. We achieved excellent performance gains compared to native Hadoop, e.g., more than 10x in reduce time and 5x in total execution time.
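A minimal sketch of the intermediate-reduce idea (assuming a decomposable, e.g. associative, reduce function; the names below are illustrative rather than FP-Hadoop's actual code): the value list of a skewed key is split into blocks, the blocks are reduced in parallel, and a final step merges the partial results.

```python
# Sketch of the "intermediate reduce" idea: split the value list of a skewed key
# into blocks, reduce the blocks in parallel, then merge the partial results.
# Works when the reduce function is decomposable (associative), e.g. a sum.
from concurrent.futures import ProcessPoolExecutor

def block_reduce(values):
    return sum(values)                       # per-block (intermediate) reduce

def final_reduce(partials):
    return sum(partials)                     # merge the partial results

def reduce_skewed_key(values, block_size=1000, workers=4):
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(block_reduce, blocks))
    return final_reduce(partials)

if __name__ == "__main__":
    values = [1] * 10_000                    # all 10,000 values share one key
    print(reduce_skewed_key(values))         # 10000
```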

5.
Due to cluster resource competition and the task scheduling policy, some map tasks are assigned to nodes without their input data, which causes significant data access delay. Data locality is becoming one of the most critical factors affecting the performance of MapReduce clusters. As machines in MapReduce clusters have large memory capacities, which are often underutilized, prefetching input data into memory is an effective way to improve data locality. However, it still poses serious challenges to cluster designers as to what and when to prefetch. To use prefetching effectively, we have built HPSO (High Performance Scheduling Optimizer), a prefetching-service-based task scheduler that improves data locality for MapReduce jobs. The basic idea is to predict the most appropriate nodes for future map tasks based on the current pending tasks and then preload the needed data into memory without delaying the launch of new tasks. To this end, we have implemented HPSO in Hadoop-1.1.2. The experimental results show that the method reduces the number of map tasks that incur remote data access delays and improves the performance of Hadoop clusters.
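A toy sketch of the prefetching idea (the node model, the prediction rule, and the `read_block` callback are illustrative assumptions, not HPSO's implementation): predict which node is likely to run the next pending map task, here simply the node expected to free a slot first, and preload that task's input block into the node's memory in the background.

```python
# Toy sketch of prefetching-based scheduling: predict which node will run each
# pending map task (here: the node expected to free a slot first), then preload
# the task's input block into that node's in-memory cache ahead of time.
import threading

class Node:
    def __init__(self, name, busy_until):
        self.name = name
        self.busy_until = busy_until     # estimated time at which the node frees a slot
        self.cache = {}                  # in-memory block cache

def predict_node(nodes):
    return min(nodes, key=lambda n: n.busy_until)

def prefetch(node, block_id, read_block):
    def _load():
        node.cache[block_id] = read_block(block_id)   # stand-in for reading from HDFS
    t = threading.Thread(target=_load, daemon=True)
    t.start()
    return t

if __name__ == "__main__":
    nodes = [Node("n1", busy_until=5.0), Node("n2", busy_until=2.0)]
    target = predict_node(nodes)
    prefetch(target, "block-42", read_block=lambda b: f"<data of {b}>").join()
    print(target.name, target.cache)     # n2 {'block-42': '<data of block-42>'}
```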

6.
MapReduce, a parallel computational model, has been widely used for processing big data in distributed clusters. Consisting of alternating map and reduce phases, MapReduce has to shuffle the intermediate data generated by mappers to reducers. The key challenge in ensuring a balanced workload on MapReduce is to reduce partition skew among reducers without detailed distribution information on the mapped data. In this paper, we propose an incremental data allocation approach to reduce partition skew among reducers on MapReduce. The proposed approach divides mapped data into many micro-partitions and gradually gathers statistics on their sizes during mapping. The micro-partitions are then incrementally allocated to reducers in multiple rounds. We propose to execute incremental allocation in two steps: micro-partition scheduling and micro-partition allocation. We propose a Markov decision process (MDP) model to optimize the problem of multiple-round micro-partition scheduling for allocation commitment. We present an optimal solution with a time complexity of O(K · N²), where K is the number of allocation rounds and N is the number of micro-partitions. Alternatively, we also present a greedy but more efficient algorithm with a time complexity of O(K · N ln N). Then, we propose a min-max programming model to handle the allocation mapping between micro-partitions and reducers and, because the problem is NP-complete, present an effective heuristic solution. Finally, we have implemented the proposed approach on Hadoop, an open-source MapReduce platform, and empirically evaluated its performance. Our extensive experiments show that, compared with the state-of-the-art approaches, the proposed approach achieves considerably better data load balance among reducers as well as better overall parallel performance.
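The greedy alternative can be sketched as the classic longest-processing-time heuristic (an illustrative sketch, not the paper's algorithm): sort micro-partitions by size and always assign the next one to the currently least-loaded reducer via a min-heap, which costs O(N log N) per allocation round.

```python
# Greedy micro-partition allocation: assign the largest unallocated micro-partition
# to the currently least-loaded reducer (LPT-style heuristic, O(N log N) with a heap).
import heapq

def greedy_allocate(partition_sizes, num_reducers):
    heap = [(0, r) for r in range(num_reducers)]      # (current_load, reducer_id)
    heapq.heapify(heap)
    assignment = {}
    for pid, size in sorted(enumerate(partition_sizes),
                            key=lambda x: x[1], reverse=True):
        load, reducer = heapq.heappop(heap)
        assignment[pid] = reducer
        heapq.heappush(heap, (load + size, reducer))
    loads = [load for load, _ in sorted(heap, key=lambda x: x[1])]
    return assignment, loads

if __name__ == "__main__":
    sizes = [90, 10, 25, 40, 5, 70, 30]
    assignment, loads = greedy_allocate(sizes, num_reducers=3)
    print(assignment)     # micro-partition -> reducer
    print(loads)          # per-reducer total load
```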

7.
To address the problems of sensitivity to initial centers, high inter-node communication overhead, and low cluster efficiency in partition-based clustering algorithms for big data, a parallel clustering algorithm, PBGDLSH-MR, based on grid density and locality-sensitive hashing is proposed. First, a grid density strategy (GDS) is applied to the initial dataset to obtain the initial centers, effectively avoiding the sensitivity to initial centers caused by random selection. Second, a data partitioning scheme based on locality-sensitive hashing (DP-LSH) is proposed to project strongly related data objects into the same sub-dataset, producing the data partitions for the map stage, and a similarity measure (SI) is designed to evaluate the partitioning results, thereby reducing inter-node communication overhead. Next, an adaptive grouping strategy (AGS) is designed to handle data skew within the partitions, which effectively improves cluster efficiency. Finally, cluster centers are mined in parallel with the MapReduce computing model to produce the final clustering result. Experimental results show that PBGDLSH-MR achieves better clustering quality and effectively improves the efficiency of parallel computation in big-data environments.
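The locality-sensitive hashing step can be illustrated with a minimal random-hyperplane LSH (an illustrative assumption; the paper's DP-LSH scheme and SI measure are not reproduced here): vectors whose signatures agree are routed to the same map-side partition, so strongly related objects tend to land in the same sub-dataset.

```python
# Minimal random-hyperplane LSH: similar vectors tend to get the same signature,
# so they can be routed to the same map-side partition.
import random

def make_hyperplanes(num_bits, dim, seed=42):
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]

def lsh_signature(vector, hyperplanes):
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def partition_of(vector, hyperplanes, num_partitions):
    return lsh_signature(vector, hyperplanes) % num_partitions

if __name__ == "__main__":
    planes = make_hyperplanes(num_bits=8, dim=3)
    a, b, c = [1.0, 0.9, 1.1], [1.1, 1.0, 1.0], [-5.0, 2.0, -3.0]
    print([partition_of(v, planes, num_partitions=4) for v in (a, b, c)])
    # a and b are similar, so they usually share a partition; c usually differs
```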

8.
Partitioning skew has been shown to be a major issue that can significantly prolong the execution time of MapReduce jobs. Most of the existing off-line heuristics for partitioning skew mitigation are inefficient, since they have to wait for the completion of all map tasks. Some solutions can tackle this problem on-line, but they impose additional overhead by repartitioning the workload of overloaded tasks. In this paper, we present OPTIMA, an on-line partitioning skew mitigation technique for MapReduce. OPTIMA predicts the workload distribution of reduce tasks at run-time, leverages deviation detection to identify overloaded tasks, and pro-actively adjusts the resource allocation of these tasks to reduce their execution time. We provide an upper bound on OPTIMA's time complexity, while allowing it to operate fully on-line. Through experiments using both real and synthetic workloads running on an 11-node Hadoop cluster, we observed that OPTIMA effectively mitigates partitioning skew and improves job completion time by up to 36.73% in our experiments.

9.
Hadoop is a widely used distributed computing framework for processing massive data, but it has shortcomings when processing graph-structured data. Because of the strong coupling in graph data, results cannot be obtained with a single MapReduce pass; iterative computation is required, and a single iteration may even need several MapReduce jobs. Restarting MapReduce jobs is expensive, and static data may be transferred unnecessarily during the iterations. Building on Hadoop, this paper proposes a map-side storage strategy: the static data is stored at the map side, and the computation that combines static and dynamic data is completed there, reducing the total running time of the whole iterative computation. Experiments comparing a modified Hadoop platform against the original iteration scheme show that the map-side storage strategy reduces the running time to a certain extent.
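A minimal sketch of the map-side storage idea (the PageRank-like update and the in-memory cache below are illustrative assumptions, not the paper's implementation): the static data, here a graph's adjacency lists, is loaded once and reused across iterations, so each round only the dynamic data (the current rank values) needs to move.

```python
# Sketch of map-side caching for iterative MapReduce: the static data (here, the
# graph's adjacency lists) is loaded once and kept in memory on the map side,
# so each iteration only shuffles the dynamic data (current rank values).
_STATIC_CACHE = {}

def load_static_data(path):
    """Load the adjacency lists once; later iterations reuse the cached copy."""
    if path not in _STATIC_CACHE:
        # stand-in for reading from local disk / HDFS
        _STATIC_CACHE[path] = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    return _STATIC_CACHE[path]

def map_iteration(ranks, path):
    """One map pass: join dynamic ranks with the cached static adjacency lists."""
    adj = load_static_data(path)
    contribs = {}
    for node, rank in ranks.items():
        for nbr in adj.get(node, []):
            contribs[nbr] = contribs.get(nbr, 0.0) + rank / len(adj[node])
    return contribs

if __name__ == "__main__":
    ranks = {"a": 1.0, "b": 1.0, "c": 1.0}
    for _ in range(3):                               # iterations reuse the cache
        ranks = {n: 0.15 + 0.85 * c for n, c in map_iteration(ranks, "graph.txt").items()}
    print(ranks)
```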

10.
In the MapReduce model, a job is divided into a series of map tasks and reduce tasks. The execution time of a job can be seriously prolonged by slow tasks, especially in heterogeneous environments. To finish the slow tasks as soon as possible, current MapReduce schedulers launch a backup task on other nodes for each slow task. However, traditional MapReduce schedulers cannot detect slow tasks correctly since they cannot estimate the progress of tasks accurately (Hadoop home page http://hadoop.apache.org/, 2011; Zaharia et al. in 8th USENIX symposium on operating systems design and implementation, ACM, New York, pp. 29–42, 2008). To solve this problem, this paper proposes a History-based Auto-Tuning (HAT) MapReduce scheduler, which calculates the progress of tasks accurately and adapts to the continuously varying environment automatically. HAT tunes the weight of each phase of a map task and a reduce task according to their values in historical tasks and uses the accurate phase weights to calculate the progress of current tasks. Based on the accurately calculated progress, HAT estimates the remaining time of tasks and launches backup tasks for the tasks that have the longest remaining time. Experimental results show that HAT can significantly improve the performance of MapReduce applications, by up to 37% compared with Hadoop and up to 16% compared with the LATE scheduler.
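The core of the history-based estimation can be sketched as follows (the phase weights and task figures are made-up illustrations, not HAT's implementation): per-phase weights learned from completed tasks give a progress score, the remaining time is extrapolated from the elapsed time, and the tasks with the longest remaining time become backup candidates.

```python
# Sketch of history-based progress estimation: per-phase weights learned from
# completed tasks give a more accurate progress score, from which the remaining
# time of each running task is estimated; the slowest tasks get backup copies.
def progress(phase_fractions, phase_weights):
    """phase_fractions: how far each phase of the task has run, each in 0..1."""
    return sum(f * w for f, w in zip(phase_fractions, phase_weights))

def remaining_time(elapsed, prog):
    """Estimate remaining time assuming a roughly constant progress rate."""
    return float("inf") if prog == 0 else elapsed * (1 - prog) / prog

def backup_candidates(tasks, weights, top_k=1):
    """tasks: {task_id: (elapsed_seconds, phase_fractions)}; pick longest remaining."""
    est = {tid: remaining_time(el, progress(fr, weights)) for tid, (el, fr) in tasks.items()}
    return sorted(est, key=est.get, reverse=True)[:top_k]

if __name__ == "__main__":
    weights = (0.7, 0.3)                     # learned from history, not a default 50/50 split
    tasks = {"t1": (60, (1.0, 0.2)), "t2": (60, (0.5, 0.0)), "t3": (30, (1.0, 1.0))}
    print(backup_candidates(tasks, weights, top_k=1))   # the task with the most time left
```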

11.
Data-intensive applications process large volumes of data using a parallel processing method. MapReduce is a programming model designed for data-intensive applications over massive data sets and an execution framework for large-scale data processing on clusters of commodity servers. Fault tolerance, an easy programming structure, and high scalability are considered strong points of MapReduce; however, its configuration parameters must be fine-tuned to the specific deployment, which makes configuration and performance tuning complex. This paper explains the tuning of Hadoop configuration parameters, which directly affect MapReduce job workflow performance under various conditions, in order to achieve maximum performance. On the basis of the empirical data we collected, it became apparent that three main methodologies can affect the execution time of MapReduce running on cluster systems. Therefore, in this paper, we present a model that consists of three main modules: (1) extending a data redistribution technique in order to find the high-performance nodes, (2) utilizing the number of map/reduce slots in order to make execution more efficient in terms of execution time, and (3) developing a new hybrid routing schedule for the shuffle phase in order to define the scheduler task while reducing the memory-management overhead.

12.
An important property of today's big data processing is that the same computation is often repeated on datasets evolving over time, such as web and social network data. While repeating full computation over the entire datasets is feasible with distributed computing frameworks such as Hadoop, it is obviously inefficient and wastes resources. In this paper, we present HadUP (Hadoop with Update Processing), a modified Hadoop architecture tailored to large-scale incremental processing with conventional MapReduce algorithms. Several approaches have been proposed to achieve a similar goal using task-level memoization. However, task-level memoization detects changes to datasets at a coarse-grained level, which often makes such approaches ineffective. Instead, HadUP detects and computes changes to datasets at a fine-grained level using a deduplication-based snapshot differential algorithm (D-SD) and update propagation. As a result, it provides high performance, especially in environments where task-level memoization has no benefit. HadUP adds only a small extra programming cost because it can reuse the code of Hadoop's map and reduce functions. Therefore, developing HadUP applications is quite easy.

13.
As distributed systems grow in scale and computational complexity, the mean time to repair of distributed computations and the communication overhead introduced by fault-tolerant computation keep rising. Combining distributed coded computation with replica redundancy, a new fault-tolerance algorithm is proposed. The map nodes apply the idea of distributed coded computation, redundantly assigning data to multiple compute nodes to create coded intermediate results, which reduces the amount of data each compute node transfers in the shuffle phase. The reduce nodes decode the received coded intermediate results, verifying their correctness and obtaining the final result. Experimental results show that, within a MapReduce-based distributed computing framework, compared with triple modular redundancy and two-phase triple modular redundancy fault-tolerance algorithms, the proposed algorithm reduces communication overhead and mean time to repair while completing fault-tolerant computation, and improves the availability and reliability of the distributed system.

14.
MapReduce, first proposed by Google, is a remarkable programming model for processing very large amounts of data. An open-source implementation of MapReduce, called Hadoop, is now used for developing a wide range of applications. Although developing a correct and efficient program on MapReduce is much easier than developing one with MPI and the like, it is still nontrivial if the target application requires involved functionalities of Hadoop MapReduce. In these situations, functional models of MapReduce computation play an important role because we can use them to better understand, prove the correctness of, and even optimize MapReduce programs. In this paper, we develop two functional models, a low-level one and a high-level one, which capture the semantics of Hadoop MapReduce computation. We discuss the detailed semantics mainly in terms of the following two computations: the computation of the Mapper and Reducer classes, and the computation in the Shuffle phase with the secondary-sorting technique. In addition, we develop MapReduce algorithms for the scan computational pattern (prefix sums) on the newly proposed models.
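In the same spirit, a small executable functional model of MapReduce semantics can be written as follows (a sketch under simplifying assumptions, not the models developed in the paper): the mapper is flat-mapped over the records, the shuffle is a group-by-key with the values sorted to mimic secondary sorting, and the reducer is applied per key group.

```python
# A small executable functional model of MapReduce semantics: flat-map the records,
# shuffle = group by key with the values sorted (mimicking secondary sorting),
# then apply the reducer to each key group.
from itertools import groupby

def map_reduce(records, mapper, reducer):
    mapped = [kv for record in records for kv in mapper(record)]
    # shuffle: sort by (key, value) so each reducer sees its values in order
    mapped.sort(key=lambda kv: (kv[0], kv[1]))
    shuffled = ((k, [v for _, v in grp]) for k, grp in groupby(mapped, key=lambda kv: kv[0]))
    return [out for k, vs in shuffled for out in reducer(k, vs)]

if __name__ == "__main__":
    # word count expressed in the model
    records = ["a b a", "b c"]
    mapper = lambda line: [(w, 1) for w in line.split()]
    reducer = lambda k, vs: [(k, sum(vs))]
    print(map_reduce(records, mapper, reducer))   # [('a', 2), ('b', 2), ('c', 1)]
```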

15.
Various methods and techniques have been proposed in the past for improving the performance of queries on structured and unstructured data. This paper proposes a parallel B-Tree index in the MapReduce framework to improve the efficiency of random reads over existing approaches. The benefit of using the MapReduce framework is that it hides the complexity of implementing parallelism and fault tolerance from users and presents these features in a user-friendly way. The proposed index reduces the number of data accesses for range queries and thus improves efficiency. The B-Tree index on MapReduce is implemented as a chained-MapReduce process that reduces intermediate data access time between successive map and reduce functions and improves efficiency. Finally, five performance metrics are used to validate the proposed index for range search queries in MapReduce, such as the effect of varying the cluster size and the range-query coverage on execution time, the number of map tasks, and the size of input/output (I/O) data. The effect of varying the Hadoop Distributed File System (HDFS) block size, together with an analysis of the heap memory and intermediate data generated during the map and reduce functions, also shows the superiority of the proposed index. Experimental results show that the parallel B-Tree index, along with a chained-MapReduce environment, performs better than Hadoop's default non-indexed dataset and a B-Tree-like global index (Zhao et al., 2012) in MapReduce.

16.
This article uses the parallel computing framework MapReduce to explore the computation of data cubes. Data cube computation faces two key problems: computation time and cube size. As the number of dimensions grows, both the computation time and the size of the cube grow exponentially. Although MapReduce is an excellent parallel computing framework, its partitioning algorithm is not well suited to skewed data, so some tasks take too long and delay the completion of the whole job. This work optimizes data partitioning through data sampling, and the experimental results show that the performance of data cube computation improves significantly. To deal with the excessive size of the data cube, the final results are written in the Reduce stage to the NoSQL database HBase for storage; HBase scales out horizontally and also makes later queries over the data cube convenient.
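The sampling-based partitioning can be sketched as a quantile range partitioner (an illustrative assumption; the HBase output path is omitted): sample the mapped keys, take quantile boundaries, and route each key to a reducer by range so that a skewed key distribution is spread more evenly across reducers.

```python
# Sampling-based range partitioner: sample the mapped keys, take quantile
# boundaries, then route each key to a reducer by range; the resulting partitions
# stay roughly balanced even when the key distribution is skewed.
import bisect
import random

def sample_boundaries(keys, num_reducers, sample_size=1000, seed=0):
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # num_reducers - 1 boundaries taken at the sample quantiles
    return [sample[(i * len(sample)) // num_reducers] for i in range(1, num_reducers)]

def partition(key, boundaries):
    return bisect.bisect_right(boundaries, key)

if __name__ == "__main__":
    random.seed(1)
    keys = [int(random.expovariate(0.05)) for _ in range(10_000)]   # skewed distribution
    bounds = sample_boundaries(keys, num_reducers=4)
    counts = [0] * 4
    for k in keys:
        counts[partition(k, bounds)] += 1
    print(bounds, counts)   # partition counts are roughly balanced despite the skew
```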

17.
With increasingly inexpensive storage and growing processing power, the cloud has rapidly become the environment of choice to store and analyze data for a variety of applications. Most large-scale data computations in the cloud heavily rely on the MapReduce paradigm and on its Hadoop implementation. Nevertheless, this exponential growth in popularity has significantly impacted power consumption in cloud infrastructures. In this paper, we focus on MapReduce processing and investigate the impact of dynamically scaling the frequency of compute nodes on the performance and energy consumption of a Hadoop cluster. To this end, a series of experiments are conducted to explore the implications of Dynamic Voltage and Frequency Scaling (DVFS) settings on power consumption in Hadoop clusters. By enabling various existing DVFS governors (i.e., performance, powersave, ondemand, conservative and userspace) in a Hadoop cluster, we observe significant variation in performance and power consumption across different applications: the different DVFS settings are only sub-optimal for several representative MapReduce applications. Furthermore, our results reveal that the current CPU governors do not exactly reflect their design goal and may even become ineffective at managing the power consumption of Hadoop clusters. This study aims to provide a clearer understanding of the interplay between performance and power management in Hadoop clusters and therefore offers useful insight into designing power-aware techniques for Hadoop systems.

18.
iMapReduce: A Distributed Computing Framework for Iterative Computation   (Cited by: 2; self-citations: 0; by others: 2)
Iterative computation is pervasive in many applications such as data mining, web ranking, graph analysis, online social network analysis, and so on. These iterative applications typically involve massive data sets containing millions or billions of data records. This creates demand for distributed computing frameworks that process massive data sets on a cluster of machines. MapReduce is an example of such a framework. However, MapReduce lacks built-in support for iterative processing, which requires parsing data sets iteratively. Besides specifying MapReduce jobs, users have to write a driver program that submits a series of jobs and performs convergence testing at the client. This paper presents iMapReduce, a distributed framework that supports iterative processing. iMapReduce allows users to specify the iterative computation with separate map and reduce functions, and supports automatic iterative processing within a single job. More importantly, iMapReduce significantly improves the performance of iterative implementations by (1) reducing the overhead of creating new MapReduce jobs repeatedly, (2) eliminating the shuffling of static data, and (3) allowing asynchronous execution of map tasks. We implement an iMapReduce prototype based on Apache Hadoop, and show that iMapReduce can achieve up to 5 times speedup over Hadoop for implementing iterative algorithms.
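The overhead that iMapReduce removes is easiest to see from the client-side driver loop that plain MapReduce requires for iterative algorithms. The sketch below (with a stand-in `run_mapreduce_job` and a toy averaging update, both illustrative assumptions) submits a fresh "job" per iteration and tests convergence at the client, which is exactly the pattern iMapReduce folds into a single job.

```python
# The client-side driver loop that plain MapReduce needs for iterative algorithms:
# each iteration is a brand-new job plus a convergence test at the client.
# iMapReduce's point is to move this loop inside a single job, avoiding the
# per-iteration job-startup cost and the re-shuffling of static data.
def run_mapreduce_job(state):
    """Stand-in for submitting one MapReduce job; here it just averages neighbours."""
    n = len(state)
    return [(state[i] + state[i - 1] + state[(i + 1) % n]) / 3 for i in range(n)]

def converged(old, new, eps=1e-6):
    return max(abs(a - b) for a, b in zip(old, new)) < eps

def driver(initial_state, max_iters=100):
    state = initial_state
    for i in range(max_iters):
        new_state = run_mapreduce_job(state)     # job startup overhead on every iteration
        if converged(state, new_state):
            return new_state, i + 1
        state = new_state
    return state, max_iters

if __name__ == "__main__":
    result, iters = driver([0.0, 1.0, 2.0, 3.0])
    print(iters, result)
```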

19.
MapReduce is a popular programming model for supporting parallel computations on large datasets. Among the several existing MapReduce implementations, Hadoop has attracted a lot of attention from both industry and research. In a Hadoop job, map and reduce tasks coordinate to produce a solution to the input problem, exhibiting precedence constraints and synchronization delays that are characteristic of a pipeline communication between maps (producers) and reduces (consumers). Here we address the challenge of designing analytical models to estimate the performance of MapReduce workloads, notably Hadoop workloads, focusing particularly on the intra-job pipeline parallelism between map and reduce tasks belonging to the same job. We propose a hierarchical model that combines a precedence graph model and a queuing network model to capture the intra-job synchronization constraints. We first show how to build a precedence graph that represents the dependencies among multiple tasks of the same job. We then apply it jointly with an approximate Mean Value Analysis (aMVA) solution to predict mean job response time, throughput, and resource utilization. We validate our solution against a queuing network simulator and a real setup in various scenarios, finding very close agreement in both cases. In particular, our model produces estimates of average job response time that deviate from measurements of a real setup by less than 15%.
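As a building block of that kind of model, a compact exact Mean Value Analysis loop for a closed, single-class queueing network looks as follows (the service demands are made-up numbers, and this is plain MVA rather than the paper's combined precedence-graph/aMVA solution): it yields mean response time, throughput, and per-station utilization for a given multiprogramming level.

```python
# Exact Mean Value Analysis (MVA) for a closed, single-class queueing network:
# the kind of building block an (approximate) MVA solution uses to estimate
# response time, throughput and utilization for a given number of concurrent jobs.
def mva(service_demands, num_jobs):
    """service_demands: per-station service demand D_k (seconds per job)."""
    queue = [0.0] * len(service_demands)          # mean queue length at each station
    for n in range(1, num_jobs + 1):
        # response time at station k with n jobs in the system: R_k = D_k * (1 + Q_k(n-1))
        resp = [d * (1 + q) for d, q in zip(service_demands, queue)]
        throughput = n / sum(resp)
        queue = [throughput * r for r in resp]
    utilization = [throughput * d for d in service_demands]
    return sum(resp), throughput, utilization

if __name__ == "__main__":
    demands = [0.4, 0.2, 0.1]                     # e.g., CPU, disk, network (made-up values)
    R, X, U = mva(demands, num_jobs=8)
    print(f"response time={R:.2f}s  throughput={X:.2f} jobs/s  utilization={U}")
```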

20.
Big data has drawn substantial attention from the computer science research community, business executives, and decision makers. As the volume of data grows, it requires performance-oriented, data-intensive processing frameworks such as MapReduce, which can scale computation on large commodity clusters. Hadoop MapReduce processes data in the Hadoop Distributed File System as jobs scheduled by the YARN fair and capacity schedulers. However, with advancements and dynamic changes in hardware and operating environments, the performance of clusters is greatly affected. Various efforts in the literature have been made to address the issues of heterogeneity (i.e., clusters consisting of virtual machines and machines with different hardware), network communication, data locality, better resource utilization, and run-time scheduling. In this paper, we present a survey of the various research efforts made so far to improve Hadoop MapReduce scheduling. We classify the scheduling algorithms and techniques proposed in the literature based on the areas they address and present a taxonomy. Furthermore, we also discuss open issues and challenges in MapReduce scheduling for improving its performance. Copyright © 2015 John Wiley & Sons, Ltd.
