A Dynamic MapReduce Scheduler for Heterogeneous Workloads



Chao Tian 1,2, Haojie Zhou 1, Yongqiang He 1,2, Li Zha 1
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
2 Graduate University of the Chinese Academy of Sciences, Beijing 100039, China
tianchao@, zhouhaojie@, heyongqiang@, char@

Abstract: MapReduce is an important programming model for building data centers containing tens of thousands of nodes. In a practical data center of that scale, it is common for I/O-bound jobs and CPU-bound jobs, which demand different resources, to run simultaneously in the same cluster. The MapReduce framework has not addressed how to run these two kinds of jobs in parallel. In this paper, we give a new view of the MapReduce model and classify MapReduce workloads into three categories based on their CPU and I/O utilization. With this workload classification, we design a new dynamic MapReduce workload prediction mechanism, MR-Predict, which detects the workload type on the fly. We propose a Triple-Queue Scheduler based on the MR-Predict mechanism. The Triple-Queue Scheduler improves the usage of both CPU and disk I/O resources under heterogeneous workloads, and it improves Hadoop throughput by about 30% under such workloads.

Keywords: MapReduce; scheduling; heterogeneous workloads

I. INTRODUCTION

As the Internet keeps growing, many Internet service providers need to process enormous amounts of data. The MapReduce framework is becoming a leading solution for this. MapReduce is designed for building large clusters that consist of thousands of nodes built from commodity hardware. Hadoop, a popular open source implementation of the MapReduce framework developed primarily by Yahoo, is already used for processing hundreds of terabytes of data on at least 10,000 cores [3]. In this environment, many people share the same cluster for different purposes.
As a result, different kinds of workloads need to run in the same data center. For example, these clusters may be used for mining data from logs, which mostly depends on CPU capability. At the same time, they may be used for processing web text, which mainly depends on I/O bandwidth.

The performance of a parallel system such as MapReduce is closely tied to its task scheduler, and many researchers have shown interest in the scheduling problem. The current scheduler in Hadoop uses a single queue and schedules jobs with a FCFS method. Yahoo's capacity scheduler [4] and Facebook's fair scheduler [5] use multiple queues for allocating different resources in the cluster. With these schedulers, users can assign jobs to queues, which manually guarantees each job a specific resource share.

In this work, we concentrate on how to improve the hardware utilization rate when different kinds of workloads run on a cluster under the MapReduce framework. In practice, different kinds of jobs often run simultaneously in a data center. These jobs impose different workloads on the cluster, including I/O-bound and CPU-bound workloads. Currently, Hadoop's scheduler is not aware of workload characteristics and prefers to simultaneously run map tasks from the job at the head of the queue. Because tasks from the same job always have the same characteristics, this may reduce the throughput of the whole system, which seriously affects the productivity of the data center. However, the usage of I/O and CPU is actually complementary [7]. A task that performs I/O is blocked and cannot use the CPU until the I/O completes. When diverse workloads run in this environment, machines can contribute different resources to different kinds of work.

We design a new triple-queue scheduler, which consists of a workload prediction mechanism, MR-Predict, and three queues (a CPU-bound queue, an I/O-bound queue, and a wait queue).
We classify MapReduce workloads into three types, and our workload prediction mechanism automatically predicts the class of a newly arrived job based on this classification. Jobs in the CPU-bound queue and the I/O-bound queue are assigned separately so that different types of workloads run in parallel. Our experiments show that our approach can increase system throughput by up to 30% when diverse workloads coexist.

The rest of the paper is organized as follows. Section II describes related work. Section III presents our analysis of the MapReduce scheduling procedure and gives a classification of MapReduce workloads. Section IV introduces our new scheduler. Section V validates the performance gain of our new scheduler through a suite of experiments.

II. RELATED WORK

The scheduling of a set of tasks in a parallel system has been investigated by many researchers, and many scheduling algorithms have been proposed [11, 12, 13, 16, 17]. [16, 17] focus on scheduling tasks on heterogeneous hardware, and [11, 12, 13] focus on system performance under diverse workloads. The heterogeneity of workloads is also among our assumptions.

This work is supported in part by the National Science Foundation of China (Grant No. 90412010), the Hi-Tech Research and Development (863) Program of China (Grant No. 2006AA01A106, 2006AA01Z121), and the National Basic Research (973) Program of China (Grant No. 2005CB321807).

2009 Eighth International Conference on Grid and Cooperative Computing

In fact, it is nontrivial to balance the use of resources in applications that have different workloads, such as large computation and I/O requirements [10]. [6, 14] discuss how I/O-bound jobs affect system performance, and [7] presents a gang scheduling algorithm that runs CPU-bound jobs and I/O-bound jobs in parallel to increase hardware utilization.
Our work shares some ideas with these works. However, we characterize the different kinds of workloads in the MapReduce system.

The scheduling problem in MapReduce has also attracted much attention. [2] addresses how to perform speculative execution robustly in a heterogeneous hardware environment. [9] derives a new family of scheduling policies specifically targeted at sharable workloads.

Hadoop is a popular open source implementation of MapReduce [1] and the Google File System [15]. Yahoo and Facebook have also designed schedulers for Hadoop: the Capacity scheduler [4] and the Fair scheduler [5]. These two schedulers use multiple queues for allocating different resources in the cluster, and both provide short response times to small jobs in a shared Hadoop cluster. Our work has something in common with these two schedulers, but it focuses on hardware utilization under heterogeneous workloads. We provide an automatic workload prediction mechanism that detects the workload type on the fly. The triple-queue scheduler then uses two different queues to assign tasks of different types.

III. MAPREDUCE ANALYSIS AND WORKLOAD CLASSIFICATION

In this section, we analyze the working procedure of a MapReduce job and give a classification of MapReduce workloads. We then discuss the scheduling model in Hadoop and describe the problem that arises in this model when diverse workloads run on the current implementation of Hadoop.

A. MapReduce procedure analysis

MapReduce contains a map phase, which groups data by a specified key, and a reduce phase, which aggregates data shuffled from the map nodes. Map tasks are a bag of independent tasks that use different inputs and are assigned to different nodes in the cluster. Reduce tasks, on the other hand, depend on the output of map tasks and keep fetching map result data from other nodes. Shuffle actions, which are I/O-bound, are often interleaved with the map phase to maximize the utilization of I/O resources.
As shown in Figure 1, after all needed intermediate data has been shuffled, the reducer begins to compute.

Figure 1. The MapReduce data process phases

We decompose the MapReduce procedure into two sub-phases: the Map-Shuffle phase and the Reduce-Computing phase. In the Map-Shuffle phase, every node performs five actions: 1) initialize input data; 2) compute the map task; 3) store the output result to local disk; 4) shuffle map task result data out; and 5) shuffle reduce input data in. The Map-Shuffle phase is the first step. In the Reduce-Computing phase, tasks can directly begin to run the application logic because the input data has already been shuffled into memory or onto local disk.

In this view of MapReduce, the Map-Shuffle phase is critical to the whole procedure. In this phase, every node performs its map task logic, which is similar within one job, and shuffles result data to all reducer nodes. If just one map node slows down, none of the reducer nodes can begin computing. Our work focuses on this phase of MapReduce. We predict the behavior of map tasks by analyzing the history of a job's Map-Shuffle phase. Our scheduler can distinguish I/O-bound map tasks from CPU-bound map tasks and then runs the different kinds of workloads in parallel.

B. Classification of workloads on MapReduce

According to the utilization of I/O and CPU, we classify workloads in the Map-Shuffle phase of MapReduce. As stated above, every node in the Map-Shuffle phase performs five actions. The ratio between the amount of map input data (MID) and map output data (MOD) in a single map task depends on the type of workload. We define a variable ρ that characterizes the application logic of a particular workload, where:

$$\mathrm{MOD} = \rho \cdot \mathrm{MID} \tag{1}$$

Shuffle-out data (SOD) is the same as MOD, because MOD is the source of the shuffle. Unlike the others, the shuffle-in data (SID) on a node is determined not by the map tasks but by the ratio of the number of local reducers to the total number of reducers.

The first class of workload consists of CPU-bound tasks.
The CPU utilization of a job in this class can be driven up to 100% by running more tasks in parallel. We assume that tasks in the same job have the same ρ value. We define n as the number of tasks running concurrently on one node, MTCT as the Map Task Completed Time, and DIOR as the Disk I/O Rate.

We use Formula 2 to define the CPU-bound type of workload. For a map task, the disk I/O operations are input, output, shuffle out, and shuffle in. While the program runs, every node has n map tasks running concurrently, and these tasks share the disk I/O bandwidth when the system runs stably. If the sum MID + MOD + SOD + SID of n map tasks divided by MTCT is still smaller than the disk I/O bandwidth, then this kind of task is CPU-bound. This is an upper bound for CPU-bound jobs. Formula 2 expresses this condition and gives an estimate of the program's behavior:

$$\frac{n \cdot (\mathrm{MID} + \mathrm{MOD} + \mathrm{SOD} + \mathrm{SID})}{\mathrm{MTCT}} = \frac{n \cdot \big((1 + 2\rho)\,\mathrm{MID} + \mathrm{SID}\big)}{\mathrm{MTCT}} < \mathrm{DIOR} \tag{2}$$

The second class of workload differs from the first. Its map tasks are CPU-bound as long as there is no shuffle action, but once the shuffle begins they block on I/O bandwidth. In this class, the ratio of the I/O data of the map tasks to the runtime is less than DIOR. However, when the reduce phase begins, the shuffle generates a large amount of disk I/O and makes the map tasks block on disk I/O. The CPU utilization of this kind of job does not reach 100%. We name this class Sway. It is defined by:

$$\frac{n \cdot (\mathrm{MID} + \mathrm{MOD} + \mathrm{SOD} + \mathrm{SID})}{\mathrm{MTCT}} = \frac{n \cdot \big((1 + 2\rho)\,\mathrm{MID} + \mathrm{SID}\big)}{\mathrm{MTCT}} \ge \mathrm{DIOR} \tag{3}$$

Formula 4 defines the I/O-bound type of workload. In this class, every map task generates many I/O operations in a short time, and when n map tasks run concurrently on every node, the tasks contend with one another. Even if the reduce shuffle has not started, the map tasks are still bound by I/O. Formula 4 shows this relation.
From our analysis of a single task, we conclude that if n tasks of this kind run concurrently in the system, the application blocks on disk I/O:

$$\frac{n \cdot (\mathrm{MID} + \mathrm{MOD})}{\mathrm{MTCT}} = \frac{n \cdot (1 + \rho)\,\mathrm{MID}}{\mathrm{MTCT}} \ge \mathrm{DIOR} \tag{4}$$

We assume that every reducer receives input data of similar size, so the SID in MapReduce depends on the distribution of reducers in the cluster. The shuffle input data on each node depends on the ratio of the running reducer number (RRN) to the whole reducer number (WRN). SID is therefore given by:

$$\mathrm{SID} = \frac{\mathrm{RRN}}{\mathrm{WRN}} \cdot \mathrm{SOD} \cdot (\text{number of nodes}) \tag{5}$$

In these formulas, MTCT is a variable that may be affected by the runtime status of the workload on each node. The workload on a machine may make the MTCT longer than its ideal value. This means that if a job is classified as I/O-bound according to our formulas, it is definitely I/O-bound, but if a job is classified as CPU-bound, it may not actually be CPU-bound. We therefore use a defect-modify module to correct this. In detail, we measure the execution time of the jobs that run simultaneously with the job under test; if their execution time does not get longer, we can confirm that the job under test is CPU-bound.

C. Hadoop schedule model

The current Hadoop scheduler serves jobs from a FCFS queue with priority. In the MapReduce framework, task assignment happens when a TaskTracker, which is in charge of running tasks in the cluster, sends a heartbeat to the JobTracker. A TaskTracker sends a heartbeat at every interval (5 seconds by default) or when a task on that node finishes. Hadoop also uses a slot number for each node to control the maximum number of tasks running on that node. The slot number effectively describes the maximum parallelism of the machine. It can be configured in an XML file on every machine based on the hardware of that node. The current scheduler in Hadoop always assigns tasks from the job at the head of the queue. As a result, map tasks from the same job always run together.
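The single-queue behavior described above can be illustrated with a minimal Python sketch. It is not Hadoop's implementation: the class names `Job`, `TaskTracker`, and `FcfsScheduler`, the method `on_heartbeat`, and the convention that a larger `priority` value is served first are all hypothetical and chosen only for illustration. The sketch shows why map tasks of the job at the head of the queue end up running together on a node.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    name: str
    priority: int       # assumed convention: larger value is served first
    pending_maps: int   # map tasks that have not been assigned yet


@dataclass
class TaskTracker:
    name: str
    map_slots: int      # maximum number of concurrent map tasks on this node
    running: List[str] = field(default_factory=list)

    def free_slots(self) -> int:
        return self.map_slots - len(self.running)


class FcfsScheduler:
    """Hypothetical single queue ordered by priority, then by arrival."""

    def __init__(self) -> None:
        self.queue: List[Job] = []

    def submit(self, job: Job) -> None:
        self.queue.append(job)
        # list.sort is stable, so jobs of equal priority keep arrival order.
        self.queue.sort(key=lambda j: -j.priority)

    def on_heartbeat(self, tracker: TaskTracker) -> List[str]:
        """Fill the tracker's free slots, always from the job at the head."""
        assigned: List[str] = []
        while tracker.free_slots() > 0:
            head = next((j for j in self.queue if j.pending_maps > 0), None)
            if head is None:
                break  # nothing left to assign
            head.pending_maps -= 1
            tracker.running.append(head.name)
            assigned.append(head.name)
        return assigned
```

With two jobs of equal priority submitted one after the other, a node with 8 free slots receives 8 map tasks of the first job on its next heartbeat, so tasks with the same resource profile compete for the same device.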
Because map tasks from the same job have similar behavior, this kind of single-queue scheduler cannot efficiently use both the CPU and I/O resources of the cluster. Contention among similar tasks decreases system throughput. In our work, we use three queues to execute different types of jobs separately. Our experiments show that our approach can increase system throughput by 30%.

IV. TRIPLE-QUEUE SCHEDULER

The current Hadoop scheduler implementation assigns tasks sequentially from one queue. Other map tasks are not assigned until the tasks of the job at the head of the queue finish. This FCFS strategy works well when the jobs in the queue are of the same class. However, I/O-bound tasks leave the CPUs idle much of the time, when other tasks could run. The effect on the disks is the opposite: I/O-bound tasks keep the disks busy, while CPU-bound tasks leave them idle. This leads to inadequate use of resources, and it can happen in a real data center, where diverse kinds of workloads often run simultaneously. The main idea of our work is that balancing different kinds of tasks can increase the utilization of both CPU and I/O bandwidth. The rationale for running them in parallel is that these different tasks hardly interfere with each other's work, since they use different devices [7]. Therefore, they work well separately in the system. If the I/O time is not negligible relative to the CPU time, overlapping I/O activity with CPU work can be efficient [6].

A. MR-Predict

In this triple-queue scheduler, predicting the workload type is essential. The characteristics of a task can be assessed by looking at its history. We assume that tasks from one job have similar characteristics, which means we can predict the behavior of tasks from the tasks that have already run. We propose a new MapReduce workload prediction mechanism called MR-Predict.
As shown in Figure 2, our scheduler determines a job's type based on Formulas 2, 3, and 4 proposed in the previous section, and jobs then run in two different queues with feedback. When a new job arrives, it is first put into the waiting queue. The scheduler then assigns one map task of that job to every TaskTracker that has idle slots. When these map tasks finish, we calculate MTCT, MID, and MOD from the data of these tasks. Using these values and our classification, each job can be assigned to one of the three types.

Figure 2. Pseudo-code of the schedule policy

Among the classes discussed above, I/O-bound is a conservative class. Because every node may already be running jobs, their contention for resources leads to longer task runtimes. Consequently, certain kinds of tasks are likely to be classified as CPU-bound according to our formulas. Therefore, our workload prediction system includes a defect-modify mechanism.
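The classification step of MR-Predict can be sketched in Python as follows. This is an illustrative reading of Formulas 2, 3, and 4, not the authors' code: the function name `classify_job`, its parameters, and the class labels are hypothetical. Two assumptions are made explicit in the comments: Formula 4 is checked first so that the Sway class covers only jobs whose map-only I/O rate stays below DIOR, and SID defaults to 0 because its value depends on the reducer distribution of Formula 5. The sample values in the usage note are those the paper reports in Table I with n = 8 and DIOR = 31.2 MB/s.

```python
def classify_job(mid: float, mod: float, mtct: float,
                 n: int, dior: float, sid: float = 0.0) -> str:
    """Classify a job from the statistics of its test-run map tasks.

    mid, mod, sid are in MB, mtct in seconds, dior in MB/s;
    n is the number of map tasks running concurrently on one node.
    """
    sod = mod  # shuffle-out data equals map output data

    # Disk I/O rate of the map tasks alone (left side of Formula 4).
    map_rate = n * (mid + mod) / mtct
    # Disk I/O rate including shuffle traffic (left side of Formulas 2 and 3).
    total_rate = n * (mid + mod + sod + sid) / mtct

    # Assumption: test Formula 4 first, so a job that already saturates
    # the disk without any shuffle is reported as I/O-bound, not Sway.
    if map_rate >= dior:
        return "io-bound"    # Formula 4
    if total_rate >= dior:
        return "sway"        # Formula 3
    return "cpu-bound"       # Formula 2
```

For example, `classify_job(64, 64, 8, n=8, dior=31.2)` evaluates a map-only rate of 128 MB/s and returns `"io-bound"`, while `classify_job(64, 64, 35, n=8, dior=31.2)` stays below DIOR for the map tasks alone but exceeds it once shuffle-out data is counted, and returns `"sway"`.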
After a task is assigned to the CPU-bound queue, the system monitors the tasks running in the I/O-bound queue. If the MTCT of recent tasks rises to a certain threshold, which we define as 140% in our system, we conclude that the task we just assigned should not have been allocated to the CPU-bound queue, and we re-allocate it to the I/O-bound queue.

B. Schedule policy

The triple-queue task scheduler contains a CPU-bound queue, which holds jobs of the CPU-bound class, an I/O-bound queue, which holds jobs of the I/O-bound class, and a waiting queue, which holds all jobs before their type is determined. When a new job arrives, it is added to the waiting queue. The scheduler then assigns one map task to every TaskTracker in order to predict the job type. As shown in Figure 3, if both the CPU-bound queue and the I/O-bound queue are empty at that time, the job at the head of the waiting queue moves to an idle queue and keeps running until its type is determined. Then, as shown in Figure 4, if the undetermined job is found to be in the wrong queue, it moves to the right one.

Figure 3. The schedule policy when both queues are empty (queues shown: CPU-bound queue, I/O-bound queue, wait queue)

Figure 4. Jobs that are detected standing in the wrong queue switch to the other queue (queues shown: CPU-bound queue, I/O-bound queue, wait queue)

The CPU-bound queue and the I/O-bound queue each have their own map slot number and reduce slot number, which users can configure based on the cluster hardware. Each queue works independently and serves jobs with a FCFS-with-priority strategy, just like Hadoop's current job queue. On every node, each queue has its own slots for running its kind of job, unless one queue becomes empty. In that situation, the idle slots are fully used by the running job until a new job is added to the empty queue.

V. EVALUATION

We start our evaluation by compiling statistics of jobs to verify our workload classification.
Then we run a couple of experiments to validate that a single-queue scheduler cannot raise the utilization of both I/O and CPU. Finally, we run a suite of mixed-type jobs to validate that the triple-queue scheduler works well in an environment with multiple workloads.

We test our triple-queue scheduler on a local cluster of 6 DELL 1950 nodes connected by gigabit Ethernet. Each node has two quad-core 2.0 GHz CPUs, 4 GB of memory, and 2 SATA disks.

A. Resource utilization

We run a couple of jobs to evaluate the hardware utilization of the current single-queue scheduler. We choose three jobs that belong to the three different workload types.

According to [2], short jobs are a major use of MapReduce. For example, Yahoo won the TeraSort 1 TB benchmark using 910 nodes in 209 seconds [8], and the average MapReduce job at Google in September 2007 was 395 seconds long [1]. We therefore choose an input data set of 15 GB for each job, which simulates the situation in a real production environment. In these experiments, we set the number of map slots and the number of reduce slots to 8, so the n in Formulas 2, 3, and 4 is 8.

We use a test program called DIO to obtain the DIOR value of an ideally performing Hadoop system. This program runs without a reduce phase, and all it does is read from and write to the disk. It is therefore entirely I/O-intensive. We estimate the system's DIOR value from the average runtime of a single task of this program. In our experiment, the DIOR is approximately 31.2 MB/s.

TABLE I. TEST JOBS

Program      MID     MOD     MTCT
TeraSort     64 MB   64 MB   8 s
Grep-Count   64 MB   1 MB    92 s
WordCount    64 MB   64 MB   35 s
DIO          64 MB   64 MB   4.1 s

As shown in Table I, the first job is TeraSort [8], a well-known total order sort benchmark. TeraSort is essentially a sequential I/O benchmark. Its MID is 64 MB and its MOD is about 64 MB on average. The MTCT of this program is 8 seconds. According to Formula 4, this job belongs to the I/O-bound class.

The second job is Grep-Count, which is based on the commonly used Linux program Grep.
The Grep-Count program accepts a regular expression and a set of files as input parameters. Unlike Grep in Linux, Grep-Count outputs the number of occurrences of identical lines that match the input regular expression in the documents, rather than all lines that match it. The computing complexity depends on the input regular expression. In our test case, we use [.]* as the regular expression, which makes the job CPU-bound. Its MID is 64 MB and its MOD is about 1 MB on average. According to Formula 2, this job belongs to the CPU-bound class.

The third job is WordCount. It splits the input text into words, shuffles every word in the map phase, and counts the occurrences of each word in the reduce phase. Its MID is 64 MB and its MOD is about 64 MB on average. According to Formula 3, this job belongs to the Sway class.

Figure 5. The average CPU utilization rate of the cluster

In the test results, the CPU utilization of TeraSort always stays at a low level, because the task is I/O-bound and the CPU utilization cannot rise. In contrast, in the Grep-Count tests the CPU utilization is always close to 100%, which confirms that this task is CPU-intensive. WordCount behaves differently from both. When the program starts, the reduce tasks have not yet begun and the CPU utilization reaches 80%, but after the reduce tasks begin, the CPU utilization decreases rapidly.

B. Triple queue scheduler experiments

In the previous section, our experiments showed that the current Hadoop scheduler indeed cannot efficiently use both the CPU and I/O resources. In this section, we design a new scenario in which multiple jobs run together. We define three clients that submit the three different kinds of jobs in separate sessions. Each client submits one kind of job and submits the same job again after the previous one finishes.
This scenario simulates a real production environment that periodically runs the same kinds of jobs every day.

We use the tasks from the previous section as test tasks. Three different clients constantly request to run these three kinds of tasks. Every job runs five times, so 15 jobs run in total. The three jobs are of different types. From the throughput and completion time of these tasks, we can analyze the performance improvement that results from running I/O-bound and CPU-bound tasks in parallel.

In our experiments, the order in which tasks execute influences the test result. The Reduce-Computing phase of TeraSort and WordCount is a little longer, while that of Grep-Count is a little shorter. TeraSort is CPU-bound in the Reduce-Computing phase, and WordCount is I/O-bound in the Reduce-Computing phase. In the Hadoop scheduler, the map slots become idle after the map phase of TeraSort finishes; if the next job in the queue is Grep-Count, the map tasks of Grep-Count can run in parallel with TeraSort. This improves the overall performance of the system to some degree. Therefore, in this test we report the data under the best, the worst, and the average condition.

Figure 6. Makespan of the Triple-Queue scheduler against the Hadoop native scheduler

Figure 7. The throughput of map tasks

As shown in Figures 6 and 7, the test data confirms that with the Hadoop scheduler the system performance varies with the order of the jobs. The makespan [18] of the Hadoop native scheduler is about 7635 seconds in the best condition and 8540 seconds in the worst condition. The throughput of map tasks is 5.89 in the best condition and 5.44 in the worst condition.

The triple-queue scheduler can significantly improve the system performance.
It improves the throughput by 30% in the Map-Shuffle phase and reduces the makespan by 20% under parallel workloads.

VI. CONCLUSION AND FUTURE WORK

This paper discusses MapReduce performance under heterogeneous workloads. We analyze typical workloads on a MapReduce system and classify them into three categories. We propose the Triple-Queue scheduler based on this classification. The Triple-Queue scheduler dynamically determines the category of a job. It contains a waiting queue, which test-runs a newly joined job and predicts its workload type from the result, as well as a CPU-bound queue and an I/O-bound queue for running different kinds of jobs in parallel. According to the experiment results, the scheduler correctly distributes jobs into the different queues in most situations. Each job then runs with the resources assigned to its queue. Our experiments show that the Triple-Queue scheduler can increase the throughput of map tasks by 30% and reduce the makespan by 20%.

In our work, we assume that the distribution of the workload is uniform, and we predict the future behavior of tasks based on this distribution. In future work, we will try to predict workloads with different kinds of distribution and take the hardware heterogeneity of the Hadoop cluster environment into account.

REFERENCES

[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[2] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica, "Improving MapReduce Performance in Heterogeneous Environments," Proceedings of the 8th Symposium on Operating Systems Design & Implementation.
[3] Yahoo! Launches World's Largest Hadoop Production Application, Yahoo! Developer Network, /blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html.
[4] Hadoop's Capacity Scheduler, /core/docs/current/capacity_scheduler.html.
[5] Matei Zaharia, "The Hadoop Fair Scheduler," /blogs/hadoop/FairSharePres.ppt.
[6] E. Rosti, G. Serazzi, E. Smirni, and M.S. Squillante, "The Impact of I/O on Program Behavior and Parallel Scheduling," Proc. SIGMETRICS Conf. Measurement and Modeling of Computing Systems, 1998, pp. 56-65.
[7] Yair Wiseman and Dror G. Feitelson, "Paired Gang Scheduling," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 6, June 2003.
[8] Tera Sort Benchmark, /hosted/sortbenchmark/.
[9] Parag Agrawal, Daniel Kifer, and Christopher Olston, "Scheduling Shared Scans of Large Data Files," PVLDB '08, August 2008, pp. 23-28.
[10] E. Rosti, G. Serazzi, E. Smirni, and M.S. Squillante, "Models of Parallel Applications with Large Computation and I/O Requirements," IEEE Trans. Software Eng., vol. 28, no. 3, Mar. 2002, pp. 286-307.
[11] M.J. Atallah, C.L. Black, D.C. Marinescu, H.J. Siegel, and T.L. Casavant, "Models and algorithms for co-scheduling compute-intensive tasks on a network of workstations," Journal of Parallel and Distributed Computing 16, 1992, pp. 319-327.
[12] D.G. Feitelson and L. Rudolph, "Gang scheduling performance benefits for fine-grained synchronization," Journal of Parallel and Distributed Computing 16(4), December 1992, pp. 306-318.
[13] J.K. Ousterhout, "Scheduling techniques for concurrent systems," in Proc. of 3rd Int. Conf. on Distributed Computing Systems, May 1982, pp. 22-30.
[14] W. Lee, M. Frank, V. Lee, K. Mackenzie, and L. Rudolph, "Implications of I/O for Gang Scheduled Workloads," Job Scheduling Strategies for Parallel Processing, 1997, pp. 215-237.
[15] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google file system," in Proceedings of the 19th Symposium on Operating Systems Principles, 2003, pp. 29-43.
[16] H. Lee, D. Lee, and R.S. Ramakrishna, "An Enhanced Grid Scheduling with Job Priority and Equitable Interval Job Distribution," The First International Conference on Grid and Pervasive Computing, Lecture Notes in Computer Science, vol. 3947, May 2006, pp. 53-62.
[17] A.J. Page and T.J. Naughton, "Dynamic task scheduling using genetic algorithms for heterogeneous distributed computing," in 19th IEEE International Parallel and Distributed Processing Symposium, 2005.
[18] M. Pinedo, "Scheduling: Theory, Algorithms, and Systems," Prentice Hall, Englewood Cliffs, NJ, 1995.
