Distributed MapReduce Engine with Fault Tolerance


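The Role of MapReduce in Big Data Processing

I. Introduction

With the arrival of the big data era, data volumes have grown so large that traditional data processing methods can no longer cope.

To address this, Google introduced MapReduce, a distributed computing framework, in 2004. The framework processes large-scale data efficiently and offers scalability and fault tolerance.

This article describes the role MapReduce plays in big data processing.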

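II. Overview of the MapReduce Framework

1. Basic principles

MapReduce is a distributed computing model that splits a large job into many small tasks and executes them in parallel.

Specifically, the framework has two phases: the map phase and the reduce phase.

In the map phase, each node processes its share of the input and emits the results as key-value pairs; in the reduce phase, the values that share the same key are merged and the final result is output.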
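The two phases can be illustrated with a word count. This is a minimal single-process sketch, not Hadoop's API: the names `map_phase`, `reduce_phase` and `run_job` are hypothetical, and a real framework would run the map calls on many nodes in parallel.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(key, values):
    """Reduce: merge all values that share a key into one result."""
    return key, sum(values)

def run_job(documents):
    # Map every input record independently (a real framework does this in parallel).
    intermediate = []
    for doc in documents:
        intermediate.extend(map_phase(doc))
    # Group the intermediate values by key before reducing.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = run_job(["big data", "big cluster"])
```

Here `counts` maps each word to the number of times it appears across both input records.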

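2. Main components

The MapReduce framework consists of three main components:

(1) JobTracker: manages the execution of the whole job, assigns tasks to the nodes, and monitors the state of each node.

(2) TaskTracker: executes the individual tasks, running the Map and Reduce operations and writing the results to HDFS.

(3) HDFS: the distributed file system used to store large-scale data.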

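III. The Role of MapReduce in Big Data Processing

1. Efficient data processing

Because MapReduce is distributed, it can split a large job into many small tasks and run them in parallel on multiple nodes.

This makes full use of the cluster's computing resources and speeds up data processing.

MapReduce also supports data-local computation: tasks are scheduled, as far as possible, on nodes close to the data they read, which reduces network transfer overhead.

2. Scalability

MapReduce scales well.

Because it is distributed, a cluster can be enlarged by adding nodes to keep up with growing data volumes.

It also supports adding and removing nodes dynamically, which makes cluster sizing more flexible.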

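How uengine Works

uengine is a cloud-based technology platform that gives users substantial computing power and resource management capabilities.

It works by breaking a user's computing job into many small tasks and assigning them to different computing resources for parallel processing, which improves efficiency and performance.

First, uengine is built on distributed computing.

Distributed computing means decomposing a large computing job into many small tasks and distributing them across multiple compute nodes for parallel processing.

uengine uses the cloud to organize large-scale computing resources into one large cluster in which every node can execute tasks independently, so tasks run in parallel.

This greatly improves efficiency and performance and lets users finish computing jobs sooner.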

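uengine also relies on task scheduling and resource management.

Both are essential in distributed computing.

uengine uses a central scheduler to manage and schedule users' computing tasks.

Based on task priority and the availability of computing resources, the central scheduler assigns each task to a suitable compute node.

uengine also manages resources, adjusting the allocation of computing resources dynamically to match user demand, which further improves efficiency and performance.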
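A central scheduler of this kind can be sketched as a priority queue plus a table of free slots. This is an illustrative toy, not uengine's actual interface: the class `CentralScheduler`, its methods, and the rule "send the task to the node with the most free slots" are all assumptions made for the example.

```python
import heapq

class CentralScheduler:
    def __init__(self, node_slots):
        # node_slots maps a node name to its number of free task slots.
        self.free = dict(node_slots)
        self.queue = []
        self.counter = 0  # tie-breaker that preserves submission order

    def submit(self, task, priority):
        # A lower number means a higher priority.
        heapq.heappush(self.queue, (priority, self.counter, task))
        self.counter += 1

    def dispatch(self):
        """Assign queued tasks, highest priority first, while any slot is free."""
        assignments = []
        while self.queue:
            node = max(self.free, key=self.free.get, default=None)
            if node is None or self.free[node] == 0:
                break  # no capacity left; remaining tasks stay queued
            _, _, task = heapq.heappop(self.queue)
            self.free[node] -= 1
            assignments.append((task, node))
        return assignments
```

Tasks that cannot be placed remain in the queue, so calling `dispatch()` again after slots are released picks them up in priority order.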

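uengine also relies on data distribution and data sharing.

Both matter a great deal in large-scale computing jobs.

uengine distributes data across the compute nodes and lets the nodes share it, which avoids transferring and storing the same data repeatedly, reduces communication overhead, and improves efficiency and performance.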

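Finally, uengine relies on fault tolerance and load balancing.

Fault tolerance means that the system can recover and fail over automatically when it encounters hardware or network failures.

uengine achieves this by backing up data between compute nodes and retrying failed tasks.

Load balancing means that the system evens out the load across compute nodes so that computing resources are used sensibly.

uengine achieves this by dynamically adjusting task assignment and node load, which improves efficiency and performance.

In summary, uengine is a cloud-based technology platform that combines distributed computing, task scheduling and resource management, data distribution and sharing, and fault tolerance and load balancing to provide substantial computing power and resource management capabilities.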

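"Big Data Ecosystem Courseware: Introduction to the MapReduce Batch Processing Framework"

MapReduce components in Hadoop
Hadoop Distributed File System (HDFS)
HDFS is the default file system for MapReduce and provides large-scale, reliable data storage.
MapReduce framework
MapReduce is a core component of Hadoop, responsible for task scheduling and data processing.
Apache YARN
YARN is Hadoop's resource manager; it coordinates cluster resources and allocates them to MapReduce jobs.

Scalability
MapReduce can scale out by adding more nodes to handle larger volumes of data.
Fault tolerance
Because the framework handles node failures automatically, MapReduce is highly fault-tolerant.
Flexibility
MapReduce can run in any environment that supports parallel computing and does not depend on specific hardware or software.

MapReduce application scenarios
1 Large-scale data processing
MapReduce can process massive data sets, for example in log analysis, data mining and social network analysis.
2 Search engines
MapReduce is widely used for key search engine tasks such as index construction and query processing.
3 Machine learning
MapReduce supports large-scale machine learning tasks such as model training and feature extraction.

1. Enable verbose logging when running MapReduce tasks
2. Use visualization tools to analyze task execution and performance bottlenecks
3. Trace logs and troubleshoot errors for failed tasks
4. Debug locally with simulated data and small inputs

Comparison of MapReduce with other batch processing frameworks
MapReduce, Apache Spark, Apache Flink
MapReduce: the classic batch processing model, suited to large-scale data processing.

3 Spill handling
When intermediate results grow too large, they must be spilled to avoid running out of memory.

MapReduce performance optimization
• Tune the cluster's resource configuration and task scheduling policy
• Optimize code logic and algorithms to cut unnecessary computation and data transfer
• Use compression and serialization to reduce the volume of data transferred
• Increase the parallelism of Reduce tasks where appropriate to speed up processing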

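The Hadoop Ecosystem and the Purpose of Each Component

Hadoop is an ecosystem made up of many components. Its core components and their purposes are:

1. Hadoop Distributed File System (HDFS): Hadoop's distributed file system, used to store large data sets.

It is designed for high reliability and high throughput and runs on low-cost commodity hardware.

Through streaming data access it gives applications high-throughput access to their data, which suits applications with large data sets.

2. MapReduce: Hadoop's distributed computing framework, used to process and analyze large data sets in parallel.

The MapReduce model splits a data processing job into a Map phase and a Reduce phase so that data can be processed effectively in a distributed, parallel environment of many machines.

3. YARN: Hadoop's resource management and job scheduling system.

It manages cluster resources, schedules tasks, and monitors applications.

4. Hive: a data warehouse tool built on Hadoop that provides an SQL-like query language and data warehouse functionality.

5. Kafka: a high-throughput distributed message queue used to collect and transport real-time data streams.

6. Pig: a data analysis platform for large data sets that provides a high-level query and data transformation language (Pig Latin).

7. Ambari: a Hadoop cluster management and monitoring tool with a visual interface and cluster configuration management.

In addition, HBase is a distributed column-oriented database that works alongside Hadoop.

Data stored in HBase can be processed with MapReduce, which combines data storage and parallel computation.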

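Design and Optimization of Distributed Machine Learning Algorithms for Heterogeneous Environments

With the arrival of the big data era, machine learning is applied ever more widely across fields.

However, traditional machine learning algorithms struggle with large-scale data: computing resources run short and computation is slow.

Distributed machine learning emerged to solve these problems.

It uses multiple computers working together, partitioning the data so that the parts are processed and trained on in parallel, which raises computing speed and can improve model accuracy.

In practice, however, we often face the challenge of heterogeneous environments.

A heterogeneous environment is one made up of computing resources of different types, performance levels and storage capacities.

This heterogeneity places new demands on the design and optimization of distributed machine learning algorithms.

First, task partitioning is a key problem in a heterogeneous environment.

Because the computing resources differ widely in type and performance, tasks must be divided among them in a way that exploits each resource's strengths while keeping communication between tasks to a minimum.

Second, once tasks are partitioned, communication between the computing resources becomes an important problem.

Communication speeds between heterogeneous resources vary widely, so communication overhead can become the bottleneck of the whole algorithm.

We therefore need efficient communication mechanisms that reduce this overhead and keep the fast resources fully used.

Third, model training itself is challenging in a heterogeneous environment.

Because the resources differ widely in type and performance, the training algorithm must be adaptive and effective.

Only then can every resource be fully used and good model accuracy be achieved.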

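To meet these challenges, researchers have proposed many methods for designing and optimizing distributed machine learning algorithms for heterogeneous environments.

For task partitioning, researchers have proposed a range of strategies.

Examples include partitioning based on the characteristics of the data and of the computing resources, and partitioning based on load balancing.

A suitable strategy can be chosen for the situation at hand so that every computing resource is fully used.

For communication, researchers have proposed several efficient mechanisms.

For example, mechanisms based on data compression and quantization reduce communication overhead.

Mechanisms based on asynchronous communication and distributed shared memory improve communication efficiency.

Again, a suitable method can be chosen for the situation at hand to reduce the communication overhead of the whole algorithm.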
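Quantization-based communication can be illustrated with a uniform 8-bit quantizer for a list of gradient values. This is a generic sketch and does not reproduce any specific method from the literature; the function names `quantize` and `dequantize` are hypothetical.

```python
def quantize(values, levels=256):
    """Map floats onto integer codes in [0, levels - 1], plus (lo, step) metadata."""
    lo, hi = min(values), max(values)
    # Fall back to a step of 1.0 when all values are equal, to avoid dividing by zero.
    step = (hi - lo) / (levels - 1) or 1.0
    codes = [round((v - lo) / step) for v in values]
    return codes, lo, step

def dequantize(codes, lo, step):
    """Recover approximate floats on the receiving worker."""
    return [lo + c * step for c in codes]
```

With 256 levels each value travels as a one-byte code instead of a 4- or 8-byte float, and the reconstruction error per value is at most half a quantization step.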

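Distributed Computing Engines

A distributed computing engine is a technology that spreads computing tasks across multiple compute nodes for parallel execution.

It breaks a large computing job into many small tasks, assigns them to different compute nodes, and finally combines the partial results into the final result.

This greatly improves computing efficiency, shortens computing time, and can also lower computing costs.

At the core of a distributed computing engine is a distributed computing framework, a software system that distributes computing tasks across multiple nodes for parallel execution.

Popular frameworks today include Hadoop, Spark and Flink.

All of them offer high reliability, high scalability and high concurrency, and can serve computing needs at different scales.

Hadoop is one of the earliest distributed computing frameworks and is mainly used to process large data sets.

Its core consists of HDFS (the Hadoop Distributed File System) and the MapReduce computing model.

HDFS is a distributed file system that spreads large data sets across multiple nodes; MapReduce is a model that breaks a computing job into many small tasks that run in parallel.

Spark is a newer distributed computing framework, mainly used for real-time data and iterative computation.

Its core consists of the RDD (Resilient Distributed Dataset) and the DAG (directed acyclic graph) computing model.

An RDD is a distributed in-memory data structure that can cache a data set in memory to speed up computation.

The DAG model breaks a computing job into multiple stages that run in parallel.

Flink is a newer distributed computing framework, mainly used to process streaming data and batch data.

Its core consists of the DataStream and DataSet computing models.

DataStream is a stream processing model that handles data streams in real time; DataSet is a batch processing model that handles data sets in bulk.

In short, a distributed computing engine distributes computing tasks across multiple nodes for parallel execution, which greatly improves efficiency, shortens computing time, and can lower costs.

The popular frameworks Hadoop, Spark and Flink all offer high reliability, scalability and concurrency and serve computing needs at different scales.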

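Large-Scale Data Processing Engines

A large-scale data processing engine is a computing engine that can process large volumes of data efficiently.

In the big data era, enterprises and organizations face ever-growing data demands that traditional computing engines can no longer meet.

Large-scale data processing engines emerged in response. Using distributed computing, parallel processing and optimized algorithms, they process large volumes of data quickly and provide real-time, intelligent data analysis, mining and decision support.

One core characteristic is distributed computing.

A traditional engine usually runs on a single machine, and as data grows it hits limits on computing power and storage.

A large-scale engine instead breaks a computing job into subtasks, assigns them to different compute nodes, and raises performance and capacity through parallel computation.

This makes full use of the cluster's resources and also lets overall performance grow by scaling the cluster out.

Another important characteristic is elasticity.

The engine can expand its computing resources on demand, adjusting the number and configuration of compute nodes as data volume and computing needs change.

This elasticity lets the engine keep pace with growing data demands while maintaining high performance and efficiency.

The engine must also be highly available and fault-tolerant.

Because data processing involves a great many computing and storage operations, node failures and data loss are unavoidable.

The engine must therefore use mechanisms such as redundant backups and data replication to keep compute nodes running reliably and data intact.

In addition, the engine needs to support many kinds of data processing tasks and algorithms.

As data types multiply and needs diversify, it should handle both structured and unstructured data and support a variety of analysis and mining tasks, such as text analysis, image processing and machine learning.

To that end, it needs to offer rich data processing interfaces and algorithm libraries.

In summary, a large-scale data processing engine processes large volumes of data efficiently and is characterized by distributed computing, elasticity, high availability and fault tolerance, which lets it meet the data processing needs of enterprises and organizations in the big data era.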

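An Illustrated Look at How MapReduce Works: Shuffle and Sort in Map and Reduce Tasks

This article covers two topics: 1. how a MapReduce job runs; 2. the shuffle and sort process in map and reduce tasks.

(Figure: MapReduce flow diagram, drawn in Visio 2010.)

Flow analysis:

1. A job is started on the client.

2. The client requests a Job ID from the JobTracker.

3. The resource files needed to run the job are copied to HDFS, including the JAR file that packages the MapReduce program, the configuration files, and the input split information computed by the client.

These files are stored in a folder that the JobTracker creates specifically for the job.

The folder is named after the job's Job ID.

The JAR file has 10 replicas by default (controlled by the mapred.submit.replication property); the input split information tells the JobTracker, among other things, how many map tasks to launch for the job.

4. After receiving the job, the JobTracker places it in a job queue to wait for the job scheduler (much like process scheduling in an operating system). When the scheduler, following its scheduling algorithm, reaches the job, it creates one map task per input split and assigns the map tasks to TaskTrackers for execution.

Each TaskTracker has a fixed number of map slots and reduce slots, determined by the host's core count and memory size.

Note that map tasks are not assigned to arbitrary TaskTrackers. The guiding concept is data locality (data-local).

A map task is assigned to a TaskTracker that holds the data block the task processes, and the program JAR is copied to that TaskTracker to run. This is known as "moving the computation, not the data".

Data locality is not considered when assigning reduce tasks.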
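The data-local placement rule in step 4 can be sketched as follows. This is a simplified illustration and not the JobTracker's real scheduling code: the function `assign_map_tasks` and its input shapes are assumptions, and real schedulers also consider rack locality, which is omitted here.

```python
def assign_map_tasks(splits, free_slots):
    """splits: {split id: [nodes holding a replica]}; free_slots: {node: free map slots}."""
    slots = dict(free_slots)
    plan = {}
    for split_id, replica_nodes in splits.items():
        # Prefer a node that already stores the split's data block.
        local = [n for n in replica_nodes if slots.get(n, 0) > 0]
        candidates = local or [n for n, free in slots.items() if free > 0]
        if not candidates:
            break  # no free slot anywhere; remaining splits wait for the next round
        node = candidates[0]
        slots[node] -= 1
        plan[split_id] = (node, node in replica_nodes)  # (chosen node, is data-local?)
    return plan
```

A split falls back to a remote node only when none of the nodes holding its block has a free map slot.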

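5. Each TaskTracker periodically sends a heartbeat to the JobTracker to report that it is still running. The heartbeat also carries other information, such as the progress of the current map tasks.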
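The JobTracker's side of the heartbeat in step 5 might look like the sketch below. It is a toy model under stated assumptions: the class `HeartbeatMonitor`, the timeout value, and the use of plain numbers for timestamps are all hypothetical and are not taken from Hadoop's source.

```python
class HeartbeatMonitor:
    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a tracker is presumed dead
        self.last_seen = {}      # tracker id -> time of the last heartbeat
        self.progress = {}       # tracker id -> {task id: fraction complete}

    def heartbeat(self, tracker, now, task_progress):
        """Record that a tracker is alive and store the progress it reported."""
        self.last_seen[tracker] = now
        self.progress[tracker] = dict(task_progress)

    def dead_trackers(self, now):
        """Trackers whose last heartbeat is older than the timeout."""
        return sorted(t for t, seen in self.last_seen.items()
                      if now - seen > self.timeout)
```

Tasks that were running on a tracker reported by `dead_trackers` are the ones a master would reschedule elsewhere.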

The Application of MapReduce in Distributed Search Engines

Abstract: MapReduce is a distributed parallel programming model that enables parallel computation over large data sets. Lucene is a search engine development kit under Apache; as its index files keep growing, Lucene search runs into a bottleneck.

The open-source project Hadoop also implements the MapReduce mechanism, together with the HDFS distributed file system, which has given many enterprises at home and abroad the opportunity to use MapReduce for large-scale distributed data processing.

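Notes on the Classic Spark Paper: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Why design Spark

Computing frameworks such as Map/Reduce are already widely used in big data analytics, so why design Spark? Map/Reduce provides high-level interfaces that make it quick and convenient to access cluster computing resources, but it lacks an abstraction for leveraging distributed memory.

As a result, whenever intermediate data must be reused between computations, the only option is to write it to intermediate storage and read it back, which adds latency and degrades I/O performance.

Some frameworks do address data reuse: Pregel keeps intermediate results in memory for iterative graph computation, and HaLoop offers an iterative Map/Reduce interface. But these are designed for specific workloads and are not general-purpose.

To address these problems, Spark proposes a new data abstraction, the RDD (Resilient Distributed Dataset). RDDs are fault-tolerant, parallel data structures that let users explicitly keep data in memory, control its partitioning to optimize data placement, and manipulate it with a set of high-level operators.

Fault tolerance in the RDD data structure

The main challenge in designing RDDs is defining an efficient fault-tolerance mechanism.

Existing abstractions for cluster memory offer fine-grained updates to mutable state (the note author asks: what is mutable state?).

With such an interface, the only ways to provide fault tolerance are to replicate memory across machines or to log updates. Both are too expensive for data-intensive workloads, because replicating and transferring data takes a great deal of bandwidth and incurs substantial storage overhead.

In contrast, RDDs provide coarse-grained transformations (such as map, filter and join) that apply the same operation to many data items.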
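Coarse-grained transformations matter for fault tolerance because, as the RDD paper goes on to explain, it is enough to record the transformations (the lineage) and replay them to rebuild a lost partition. The sketch below illustrates that idea in plain Python; `SketchRDD` is a hypothetical toy class and not Spark's API.

```python
class SketchRDD:
    """Toy stand-in for an RDD: source partitions plus the transformations applied."""

    def __init__(self, source_partitions, lineage=()):
        self.source = source_partitions  # stable input data, e.g. blocks in a file system
        self.lineage = tuple(lineage)    # ordered record of ("map" | "filter", function)

    def map(self, fn):
        return SketchRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, fn):
        return SketchRDD(self.source, self.lineage + (("filter", fn),))

    def compute_partition(self, index):
        # Rebuild one partition by replaying the lineage over its source partition.
        data = list(self.source[index])
        for kind, fn in self.lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data
```

If the node holding partition 1 fails, only `compute_partition(1)` has to be rerun; no replica of the derived data and no log of individual updates is needed.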

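How Hadoop MapReduce Works

Hadoop MapReduce is a distributed computing model for processing large data sets.

It has two main components: Map and Reduce.

Map phase: in a MapReduce job, the data is split into small blocks that are distributed to different nodes and processed in parallel.

A Map task runs on each node.

In the Map phase, each node independently processes the data blocks assigned to it.

The blocks are fed to a mapping function that converts the input data into <Key, Value> pairs.

The mapping function produces many intermediate <Key, Value> pairs, where the Key identifies a group and the Value is the data associated with that Key.

Shuffle phase: after the Map phase, the intermediate <Key, Value> pairs are partitioned and sorted by Key.

Values with the same Key are then grouped together and transferred to a Reduce node.

During this step data moves between nodes so that it ends up in partitions suitable for the Reduce operation.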
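The partition, sort and group steps of the shuffle can be sketched in a few lines. This is a single-process illustration, not Hadoop's implementation: the function names are hypothetical, and `zlib.crc32` stands in for the partitioner's hash function.

```python
import zlib
from itertools import groupby

def partition_for(key, num_reducers):
    # crc32 is stable across runs, unlike Python's salted built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Route every (key, value) pair to a reducer, then sort and group by key."""
    partitions = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            partitions[partition_for(key, num_reducers)].append((key, value))
    grouped = []
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
        grouped.append([(k, [v for _, v in group])
                        for k, group in groupby(part, key=lambda kv: kv[0])])
    return grouped
```

Because the partition index depends only on the key, all values for a given key reach the same reducer regardless of which map task produced them.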

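Reduce phase: in the Reduce phase, each Reduce node processes all the Values associated with the Keys assigned to it.

The Reduce node passes these Values as input to a reduction function.

The reduction function can merge the Values, compute over them, or apply other operations to produce the final output.

The main idea of the whole MapReduce process is to break a large job into smaller subtasks, run those subtasks in parallel, and merge their results to produce the final output.

This model makes full use of the processing power of a distributed cluster and so handles large data sets efficiently.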

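How MapReduce Works

MapReduce is a programming model and algorithm for processing large data sets.

It follows a divide-and-conquer approach: the data is split into small blocks, the blocks are processed in parallel on different compute nodes, and the intermediate results are merged to obtain the final result.

The process has two phases: the Map phase and the Reduce phase.

In the Map phase, the large input data set is split into small blocks that are processed in parallel by different compute nodes.

Each compute node runs the user-defined Map function, which turns each input block into a series of intermediate (key, value) pairs.

The Map function calls run independently of one another, which is what makes parallel processing possible.

In the Reduce phase, all intermediate results are sorted by key and grouped.

Each compute node then runs the user-defined Reduce function, which aggregates the values of each group and produces the final result.

In the Reduce phase, data is processed in key order to ensure correctness and consistency.

MapReduce thus rests on splitting the data, processing it in parallel, and merging the results.

Dividing a large data set into small blocks and processing them on many compute nodes at once greatly increases the efficiency and speed of data processing.

Merging the intermediate results then yields the final result.

This distributed computing model is well suited to large-scale data and is widely used in areas such as search engines, data analysis and machine learning.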

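A Brief Description of Hadoop's Core Components and Their Uses

Hadoop is an open-source distributed computing system maintained by the Apache Software Foundation.

It handles large volumes of data and supports data storage, processing and analysis.

Its core components are HDFS (the Hadoop Distributed File System), the MapReduce computing framework, and YARN (resource management).

A brief introduction to each core component follows.

1. HDFS

HDFS, the Hadoop Distributed File System, is one of Hadoop's most central components.

It is a distributed file system designed for big data that stores large volumes of data with high reliability and high scalability.

Its core goal is to store massive data in a distributed manner while providing high reliability, high performance, high scalability and high fault tolerance.

2. The MapReduce computing framework

MapReduce is a computing framework in Hadoop that supports distributed computation and is one of Hadoop's core technologies.

It processes massive data by splitting the data into small blocks and running Map and Reduce tasks in parallel on many compute nodes; the Shuffle step routes the Map output to the Reduce tasks, which merge the results.

The framework has greatly reduced the difficulty of processing massive data and has allowed distributed computing to be applied at scale in commercial settings.

3. YARN

YARN is the new-generation resource manager introduced in Hadoop 2.x; its job is to manage the resources of a Hadoop cluster.

It supports running many kinds of applications in parallel, both MapReduce and non-MapReduce.

YARN aims to be a flexible, efficient and scalable resource manager that supports many different types of applications.

Besides these three core components, Hadoop has other important components and tools, such as Hive (data warehouse), Pig (data analysis) and HBase (NoSQL database).

These are important parts of the Hadoop ecosystem and make it easier for users to work with big data.

In short, Hadoop is one of the most popular big data processing frameworks today, and its core components and tools give users rich data processing and analysis capabilities.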

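Comparative Analysis of Mainstream Big Data Computing Engines

Outline
Introduction to distributed batch processing engines; introduction to distributed stream processing engines

MapReduce use cases
MapReduce is designed and developed on the basis of Google's MapReduce paper on distributed computing and is used for parallel computation over large data sets (larger than 1 TB). Its characteristics are:
- Easy to program: the programmer only describes what to do; how to do it is left to the system's execution framework.
- Good scalability: cluster capacity can be extended by adding machines.
- High fault tolerance: strategies such as migrating computation or data improve the availability and fault tolerance of the cluster.

Storm use cases
Storm can reliably process large data streams in real time, a process also known as "stream processing";
Storm supports many types of applications, including real-time analytics, online machine learning, continuous computation, distributed RPC (DRPC) and ETL;
fast data processing, scalability and fault tolerance.
Storm stream processing applications
Storm principles
Sentiment analysis based on Storm

Spark Streaming use cases
Compared with Storm, Spark Streaming supports higher throughput; it builds on the iterative computation of the Spark core and offers near-real-time processing; it has good fault tolerance and failure recovery.
Spark Streaming principles
Incoming records are computed as short batches, and each batch is converted into an RDD.

Spark core concepts: wide and narrow dependencies
Dependencies between parent and child RDDs are either narrow or wide. In a narrow dependency, each partition of the parent RDD is used by at most one partition of the child RDD. In a wide dependency, a partition of the child RDD depends on all partitions of the parent RDD, so a parent partition feeds multiple child partitions.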
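The narrow versus wide distinction can be checked mechanically from a map of which parent partitions each child partition reads. This is an illustrative sketch, not Spark's internal representation; the function `classify_dependency` and its input format are assumptions.

```python
def classify_dependency(child_to_parents):
    """child_to_parents: {child partition: set of parent partitions it reads}."""
    uses = {}  # parent partition -> set of child partitions that read it
    for child, parents in child_to_parents.items():
        for parent in parents:
            uses.setdefault(parent, set()).add(child)
    # Narrow: every parent partition feeds at most one child partition.
    if all(len(children) <= 1 for children in uses.values()):
        return "narrow"
    return "wide"

one_to_one = {0: {0}, 1: {1}}       # e.g. map or filter
all_to_all = {0: {0, 1}, 1: {0, 1}} # e.g. grouping by key
```

A one-to-one pattern classifies as narrow, while an all-to-all pattern, which requires a shuffle, classifies as wide.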
Spark SQL: the query engine of the Spark ecosystem
Who is using MapReduce?

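Distributed Computing Engines

A distributed computing engine is a computing framework for processing large-scale data. It distributes computing tasks across multiple compute nodes for parallel execution, which raises computing efficiency and processing capacity.

Distributed computing engines have made large-scale data processing more efficient and reliable and have become one of the important technologies of modern computing.

The core idea is to break a computing job into subtasks and assign the subtasks to multiple compute nodes for parallel execution.

The compute nodes may be servers, computers or cloud resources located in different places.

The engine manages these nodes automatically, assigns each task to the most suitable node, and handles node failures and data transfer during computation, which keeps the computation efficient and reliable.

Distributed computing engines are used very widely, in fields including data mining, machine learning, image processing and natural language processing.

In data mining, for example, an engine helps process massive data and extract valuable information and patterns from it.

In machine learning, it speeds up model training and optimization, which helps improve model accuracy and generalization.

In image processing and natural language processing, it helps process large volumes of images and text and extract useful features and information.

Many distributed computing engines are available, such as Apache Hadoop, Apache Spark and Apache Flink.

All of them are efficient, reliable and scalable, and can meet the needs of different scenarios.

They also continue to evolve, adding optimizations and new features that keep improving their ability to process large-scale data.

Distributed computing engines are one of the important technologies of modern computing: they help process large-scale data, raise computing efficiency and processing capacity, and provide strong support for many application scenarios.

As the technology continues to develop, their range of applications will only grow.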

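How MapReduce Works

MapReduce is a distributed computing model for processing large data sets.

Its operation can be summarized as two processes: Map and Reduce.

In the Map process, the input data set is split into small blocks that are processed in parallel by multiple Map tasks.

Each Map task applies the same operation to every element of its input and produces intermediate key-value pairs.

These intermediate key-value pairs are held in an in-memory buffer.

Next comes the Shuffle process, which sorts and partitions the intermediate key-value pairs produced by the Map tasks by key and passes all pairs with the same key to the same Reduce task.

Shuffle thus guarantees that intermediate pairs with the same key are sent to the same Reduce task.

In the Reduce process, each Reduce task processes one group of intermediate key-value pairs, in parallel with the others.

The Reduce task takes the pairs from the in-memory buffer and merges and computes over them by key.

The final result is written to an output file.

Throughout the MapReduce process, data is read, processed and written within the distributed cluster, so the resources of many machines are used to speed up processing.

Its parallel processing capability and reliability make the MapReduce model a good choice for processing large data sets.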

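Technical Characteristics and Use Cases of MapReduce

MapReduce is a parallel programming model for processing large-scale data.

It was proposed by Google and used in Google's large-scale data processing systems, and was later widely adopted by open-source projects such as Hadoop.

MapReduce has many distinctive characteristics and many suitable use cases.

I. Technical characteristics of MapReduce

1. Distributed processing: MapReduce breaks a problem into independent tasks and processes them in parallel on multiple machines.

This raises computing speed and suits large-scale data processing.

2. Fault tolerance: the framework automatically detects errors during processing and recomputes the affected work, which keeps the results correct.

3. Scalability: the framework scales out easily; adding compute nodes raises processing capacity.

4. Ease of use: the programming model is relatively simple, and most data processing tasks can be expressed with the two basic operations Map and Reduce.

5. Suited to non-interactive computation: MapReduce suits one-off, large-scale data processing and is not suited to applications that need immediate interaction.

6. Suited to data-parallel computation: MapReduce suits parallel computation over data sets and is not suited to compute-heavy tasks that lack an obvious data-parallel structure.

7. Suited to high-latency environments: the framework makes effective use of the network to transfer data and suits data processing in high-latency environments.

II. Use cases for MapReduce

1. Data mining and analysis: MapReduce suits large-scale data mining and analysis and handles massive structured and unstructured data conveniently.

2. Distributed search engines: MapReduce can be used to build distributed search engines, using parallel computation to improve search efficiency.

3. Log processing and analysis: many internet companies use MapReduce to process large volumes of log data for performance monitoring, user behavior analysis and similar work.

4. Data cleaning and preprocessing: large-scale data processing often requires cleaning and preprocessing, and MapReduce handles such tasks well.

5. Image processing and recognition: MapReduce can process large volumes of image data in parallel for applications such as image feature extraction and object detection.

6. Natural language processing: when analyzing and processing text data, MapReduce improves processing speed and efficiency.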

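The Map and Reduce Process in Distributed Data Parallelism

Distributed data parallelism is a parallel computing pattern for processing large data sets. It splits the data set into multiple partitions and executes computing tasks on multiple compute nodes in parallel.

Map and Reduce are its two core steps.

Map phase: the input data set is split into small data blocks that are processed in parallel on the compute nodes.

Each compute node processes one or more data blocks, passing the elements of each block through a mapping function to produce intermediate results in the form of key-value pairs.

The mapping function can be user-defined and converts each element of the input into a key-value pair.

The output of the Map phase is a set of intermediate results, each associated with the compute node that produced it.

Reduce phase: the intermediate results produced by the Map phase are processed and merged in parallel on multiple compute nodes.

Each compute node takes one or more intermediate results and sorts and merges them by key.

A user-defined merge function then combines the values that share a key to produce the final output.

The output of the Reduce phase is the final result, which reflects the processing and aggregation of the distributed data set.

The whole process runs as follows:

1. Partition the input data set: split the large input data set into small data blocks, each containing multiple elements.

2. Run the Map tasks in parallel: assign each data block to a compute node and run the mapping function on every node in parallel, converting each element of the block into an intermediate key-value pair.

3. Collect and sort the intermediate results: gather the intermediate results produced on each compute node and sort them by key for the subsequent merge.

4. Run the Reduce tasks in parallel: assign the sorted intermediate results to compute nodes and run the merge function on every node in parallel, combining the values that share a key to produce the final output.

5. Collect and output the final results: gather the final outputs produced on each compute node and combine them into one complete result set.

The advantage of distributed data parallelism is that it processes large data sets efficiently: splitting the data set into small blocks and processing them on many compute nodes at once greatly increases the speed and efficiency of data processing.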


Distributed MapReduce Engine with Fault Tolerance

Lixing Song, Shaoen Wu
Dept. of Computer Science
Ball State University
Muncie, IN
{lsong,swu}@

Honggang Wang
Dept. of Electrical and Computer Engineering
University of Massachusetts Dartmouth
Dartmouth, MA
hwang1@

Qing Yang
Computer Science Department
Montana State University
Bozeman, MT
qing.yang@

Abstract: Hadoop is the de facto engine that drives current cloud computing practice. The current Hadoop architecture suffers from a single point of failure: its job management lacks fault tolerance. If the job management fails, the job loses all state information and has to restart from scratch, even if its tasks are still active on cloud nodes. In this work, we propose a distributed MapReduce engine for Hadoop based on the Distributed Hash Table (DHT) algorithm that drives today's scalable peer-to-peer networks. The distributed Hadoop engine provides the fault-tolerance capability needed to support efficient job computation in cloud computing, where numerous jobs run at any moment. We have implemented the proposed distributed solution in Hadoop and evaluated its performance under job failures in various network deployments.

I. INTRODUCTION

Cloud computing [1]–[4] is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. With the dramatic growth of cloud computing applications and business in recent years, Hadoop [5], [6], thanks to its open source implementation, has become the de facto engine enabling current cloud computing. For example, large businesses such as Yahoo!, Facebook, Amazon and HP all adopt Hadoop as the foundation of their cloud computing services to billions of customers.
Hadoop was first developed by Yahoo! and then released as open source to the public. It has evolved from its first version into today's second version, released in 2012.

One way that cloud computing is able to scale up and down easily is through a programming model called MapReduce. In Hadoop, the MapReduce model has a master node that receives user jobs, breaks them into a number of tasks, and delegates (maps) the tasks to other slave computing nodes. The master also assigns other nodes to merge the pieces of completed computation from the mappers back into integrated pieces (and eventually into the completed job). These slave nodes are called reducers. MapReduce is highly

1) flexible, in that it can work on large tasks that require large numbers of mappers and reducers just as easily as on small tasks that require only very limited computing resources;
2) scalable, in that it can process a small number of user jobs as well as a huge number of user jobs by dispatching the actual computation onto thousands, even millions, of off-the-shelf computing nodes;
3) fault-tolerant for tasks (but NOT for jobs), in that if a mapper or reducer fails, the master can still complete the job by re-delegating the task of the failed node to another node.

However, there is one major potential point of failure in the current Hadoop MapReduce architecture that can result in significant performance degradation: the master. If the master node fails, the user jobs it manages cannot be completed because there is currently no redundancy for the master node.

To address the single point of failure (SPOF) issue on the master node of the MapReduce engine, this paper proposes a fault-tolerant architecture for the master node in MapReduce that employs the Distributed Hash Table algorithm [7] at the core of today's widely used peer-to-peer networks, e.g. BitTorrent [8], [9]. The distributed, peer-to-peer networked master nodes allow multiple masters to manage one MapReduce user job, or at least to be aware that it exists. Therefore, in the event that one master node goes down, at least one more master node is still available and running for the same user job and can take over the job management duty from the failed master node. How a master node determines or locates its peer master nodes for the managed user jobs is enabled by the DHT algorithm.

In the rest of the paper, Section II briefly describes the current Hadoop MapReduce engine architecture, discusses the SPOF issue of this architecture, and presents the motivations for this work. The design and implementation of the proposed distributed MapReduce solution are detailed in Section III. Section IV presents our evaluation of the performance of the proposed distributed architecture. Related work is reviewed in Section V. Finally, Section VI concludes the paper and discusses our future work.

II. BACKGROUND AND MOTIVATIONS

When this work was performed, Hadoop MapReduce was still in its first version (MapReduce1). Before this work was completed, Hadoop MapReduce had evolved into its second version (MapReduce2). However, the research problem addressed in this work, based on MapReduce1, still exists in MapReduce2. Therefore, in this section we review both versions, although the implementation and evaluation were based on MapReduce1.

A. Hadoop MapReduce Engine Architecture

MapReduce1: Fig. 1 shows the framework architecture of MapReduce1. There are two key components in the architecture: the master node (a.k.a. jobtracker node) and the slave nodes (a.k.a. tasktracker nodes). A user job is decomposed (mapped) into a number of tasks by the MapReduce engine at the master node, and those tasks are dispatched to the remote slave nodes that are close to the data of interest. The JobTracker managing the user job at the master node tracks the status and progress of all tasks of the job through heartbeats exchanged with the TaskTrackers that manage the tasks at the slave nodes. After all tasks are finished, the JobTracker condenses (reduces) the returned task results into a final result and returns it to the user.

Fig. 1. MapReduce1 Architecture

MapReduce2: MapReduce1 was upgraded to MapReduce2 (a.k.a. YARN, Yet Another Resource Negotiator [10]) mainly to address a scalability issue. In MapReduce1, the JobTracker at the master node takes on two responsibilities: a) job scheduling, i.e. identifying proper remote slave nodes for the mapped tasks, and b) tracking the progress and status of the tasks of a managed user job. MapReduce1 has a scalability bottleneck of about 4,000 nodes and has difficulty supporting the very large clusters required in many business cases today. MapReduce2 splits these two responsibilities to support very large scales.

Fig. 2 shows the framework architecture of MapReduce2. The ResourceManager at the resource manager node takes care of job scheduling by identifying the nodes to run the mapped tasks of a user job. The ApplicationMaster at the master node tracks the progress and status of the tasks running on slave nodes. The ApplicationMaster processes the returned task results and updates the user.

Fig. 2. MapReduce2 Architecture

B. Research Problems and Motivations

Let us inspect the consequence of failures in the Hadoop MapReduce architecture. As mentioned earlier, if a task fails at a slave node, both MapReduce1 and MapReduce2 can handle the failure by initiating another task, possibly at another node. However, if the master physical node crashes (less likely), or the JobTracker/ApplicationMaster fails (more likely), the status of a running job at the master node is gone, even if its tasks are still running on slave nodes. As a result, the user loses track of his or her job. In MapReduce1, the job has to be restarted from scratch. In MapReduce2, the ResourceManager starts another instance of the user job, also from scratch. If a user job requires extensive computation, e.g. climate simulation, rerunning the job takes a significant amount of resources and time. In MapReduce2, the problem becomes even worse if the ResourceManager fails, because no more user requests can be accepted and the management of the cluster resources is out of control. Therefore, the Hadoop MapReduce engine has a SPOF problem.

It has been shown that "a single failure can lead to large, variable and unpredictable job running times" [11], [12]. This project is therefore motivated to address this problem and to provide fault tolerance for job management. The goal is that a job is NOT required to restart when the job management fails at one node.

III. DISTRIBUTED MAPREDUCE ENGINE

To address the SPOF problem in user job management in the MapReduce architecture, we propose a distributed MapReduce engine that offers fault tolerance in managing user jobs.

A. Conceptual Architecture

The distributed MapReduce engine consists of a group of physically distributed nodes that collectively serve as a master network, as shown in Fig. 3. A user job is managed by more than one node in this master network. The master nodes of a user job synchronize their images of the job, but at any one time only one master node serves as the active master node for the job: the tasks of the job communicate with only one master node, the active master node. When the active master node, or its JobManager, fails, the standby master node(s) and the slave nodes can detect the failure through the heartbeats. The new active master node is then elected from the standby master nodes according to a policy discussed later. The communication about the status of the tasks at the slave nodes is directed to the new active master node so that the job continues running. To the system and the user, it seems that nothing happened in the job processing.
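The active/standby behaviour described above can be sketched as a small state machine. This is an illustration under stated assumptions, not the paper's implementation: the class `MasterGroup`, its method names, the timeout value and the use of sorted node IDs as the succession order are hypothetical, and the sketch assumes the job image is already synchronized to the standbys.

```python
class MasterGroup:
    """Toy model of one job's master network: an active master plus standbys."""

    def __init__(self, node_ids, heartbeat_timeout):
        self.order = sorted(node_ids)   # succession order agreed in advance
        self.timeout = heartbeat_timeout
        self.active = self.order[0]
        self.last_heartbeat = {n: 0.0 for n in self.order}
        self.job_image = {}             # task id -> progress, assumed synchronized

    def record_heartbeat(self, node_id, now):
        self.last_heartbeat[node_id] = now

    def update_task(self, task_id, progress):
        self.job_image[task_id] = progress

    def _alive(self, node_id, now):
        return now - self.last_heartbeat[node_id] <= self.timeout

    def check(self, now):
        """Return the active master, promoting the next live standby if needed."""
        if self._alive(self.active, now):
            return self.active
        start = self.order.index(self.active)
        for offset in range(1, len(self.order)):
            candidate = self.order[(start + offset) % len(self.order)]
            if self._alive(candidate, now):
                self.active = candidate
                break
        return self.active
```

Because the job image survives the switch, the tasks only need to redirect their status updates to the node returned by `check`; the job itself is not restarted.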
The next discusses how the master network is formed and how the switch takes place when an active master node or its JobManager fails.B.Formation of Master Network with DHTThe key component in the proposed architecture is a dis-tributed master network.This network could be implemented with a separated physical network of nodes.For resource efﬁciency,we rather propose it be implemented as a virtual network.When a regular cloud node runs the job managementFig.3.Distributed MapReduce Conceptual Architecturefor a user job,this node is part of the master network and be referred as a mater node.It can also actually run regular computation task as a slave node for other ly,a physical node can serve as both a master node for certain jobs and a slave node for other jobs.As a result,the whole cloud could be a virtual master network at the extremity.It should be noted that the existence of the master node network for a job relies on the job existence—it is purged as a job is completed.In this proposed architecture,the master network is formed with DHT algorithm.First,a small number of cloud nodes are speciﬁed as server nodes in the DHT framework.When a user job arrives at a job management,the job management node acts as the active master node for the job and identiﬁes its peer master nodes through the server nodes with the DHT algorithm as in Chord[7].The number of master nodes can be speciﬁed by the system(we recommend2to4nodes as the master nodes for a job since the physical node has a very low down rate).Because a network problem cloud result in the failure of all its nodes,a policy in identifying the peer master nodes is that a peer node should at best effort be in a deferent LAN for better fault-tolerance.Then,each master node of a user job will have a list of its peers in the order of their node IDs on the DHT ring.Then,the job management initiates the mapped tasks onto remote slave nodes.The task updates are performed as in cur-rent practice between the tasks and the 
their job management. Meanwhile,the active master node synchronizes the job status information among its standby peer master nodes,which is discussed next.C.Job Image SynchronizationTo guarantee the job continuity in the event of failure of the active master node,the master nodes of a job must have the same job images.There are two ways to achieve this.One is to have the tasks of a job to multicast their status and progress to all master nodes,but incurs expensive communication cost in the network.The other approach,which is adopted by us,is to have the tasks communicate their active master node only, but this requires the synchronization of job images among the master nodes.We propose an Incremental synchronization scheme with light cache.The active master node only sends the incremental update that is the difference from the last update to its peer master nodes.To guarantee the reception of the updates,the active master has to be acknowledged by the peer masters. Therefore,the synchronization is carried on TCP in unicast because TCP ensures the delivery and unicast allows ac-knowledgement(on the contrary,multicast does not support acknowledgement).Since the active master may fail at any moment,it is possible that some task progress received after the last synchronization but before the failure is lost in the failure.To address this issue,the task nodes have to keep their task progress for the time of a synchronization cycle.So,when the active node fails,the new active node solicits the missed task updates from the task nodes to keep up-to-date.D.Active Master SwitchingWhen the master node network of a job is formed as discussed in Section III-B,the master nodes are compiled into a list in ascending order according to their node IDs.This list is disseminated to all master nodes and tasks nodes so that every node knows which node is the next active master is the current active master ly,the next active master node is pre-elected based on the node IDs.If the standby 
master nodes do not receive the periodic updates as scheduled from their active peer master,they will send a status inquiry heartbeat to the active master.If no response of the inquiry is received,they consider the active master fails and the next master node in the master list will take over the active master role.All other master nodes will synchronize to this new active master.E.ImplementationImplementation on MapReduce1:We have implemented the proposed distributed MapReduce engine on Hadoop code base.The implementation was based on MapReduce1be-cause MapReduce2was not released when our implementa-tion started.The DHT algorithm was implemented into the JobTracker as in Fig. 1.The implementation of DHT is based on the source code of Open Chord[13].The mas-ter synchronization capability was also implemented in the JobTracker.Because of the MapReduce1architecture,in our implementation,the master nodes are a group of speciﬁc nodes that doe not run regular tasks as MapReduce2does.To accommodate the distributed JobTracker nodes,TaskTracker is changed to enable the short-term cache of updates and keep a list of the master nodes.Possible Implementation on MapReduce2:It should be noted that the implementation of the proposed distributed MapReduce solution onto MapReduce2is still feasible,but in a”ﬂatter”mode.As in Fig.2,MapReduce2retains the ResourceManager as a separated node,but the AppManager has been dispatched into a regular cloud node as its tasks do.Therefore,the implementation of the DHT should be on the ResourceManager that is responsible for locating the nodes for AppManagers because the ResourceManager should generate and dispatch multiple AppManagers,rather than only one as it does without fault-tolerance currently.However,the synchronization should be implemented on the AppManagers because they need to communicate to each other.The taskJVM should implement the short-term cache and the list of masternodes as the TaskTracker does in MapReduce1.Moreover,the 
ResourceManager itself should have a distributed architecture too because it has a SPOF as well.The distributed Resource-Manager architecture could be accompllished with DHT as in the distributed JobTracker implementation in the MapReduce1.IV.E VALUATIONWe have evaluated the distributed MapReduce solution implemented on MapReduce1.The evaluation focus is on the latency,success ratio of the master switch in failure cases, and the incurred network trafﬁc overhead.Weﬁrst present the evaluation platform and methodologies,then discuss the evaluated metrics and the experiment results.A.Platform and MethodologiesThe evaluation was carried out on a Hadoop cloud con-sisting of virtual machines.We have two Ubuntu-Linux host machines conﬁgured into two LANs and each of them has ﬁve virtual machines installed and conﬁgured with the Hadoop cloud computing platform.The Hadoop platform includes both the MapReduce1package the Hadoop Distributed File System (HDFS).The evaluation conﬁguration is illustrated in Fig.4.We set the master synchronization cycle as the ten times of the task update cycle.The short-term cache duration on the task nodes was then linked to the synchronization cycle in ly,the task nodes cached the last ten updates locally.The two virtual networks of the virtual machines were conﬁgured with network masks of192.168.0.255and 192.168.1.255respectively.The IP address of each node was used as the node ID for the DHT.Because of the limitation of the physical computing resources,only80jobs were submitted to run in the cloud.We conﬁgured each master network to contain only two master nodes(i.e.one active and the other standby).With the preference that master nodes should be separated in different networks if possible,the active master nodes were basically in the network of192.168.0.0and the standby master nodes were in192.168.1.0.We emulated the active master failure by purging the instance of a JobTracker of an active master node.Fig.4.Experiment Platform of 
Hadoop

B. Metrics and Results

Our evaluation focused on three metrics: switch latency, switch success ratio, and network overhead. The evaluation results for the three metrics are shown in Table I.

Switch latency: This metric refers to the time from the failure of an active master until a new active master takes over the job management with its tasks. To avoid the error introduced by the system time difference between the two host machines, we limited this experiment to a single network; namely, both the active and the standby master nodes were in the same network on the same host machine. We measured 1000 master switches and averaged the latency. The latency was normalized by the synchronization cycle. From the measurements in Table I, we observe that the switch latency is a little over half a synchronization cycle. This is reasonable: the worst case occurs when a failure happens immediately after a synchronization, and the standby node has to wait for almost the whole cycle before it detects the failure. Therefore, the maximum latency should be one synchronization cycle plus the inquiry message timeout. The minimum latency should be close to the inquiry message timeout, which occurs when a failure happens just before the end of a synchronization cycle. Hence, with a uniform distribution of failure times, the expected latency should be half of the synchronization cycle plus the inquiry message timeout.

Switch success ratio: This metric is defined as the number of successful master switches divided by the number of switch attempts (i.e., the number of active master failures). This metric is important in that it shows how much fault tolerance is provided by the distributed MapReduce architecture. Ideally, the ratio should be 100%, but it is reduced by network transmission problems such as packet loss. With the 1000 master switches tested, the measurement shows a success ratio of 97%, which indicates that the distributed solution is effective in providing fault tolerance to user jobs.

Network overhead: This metric
measures the network traffic overhead incurred by the formation of the master network, synchronization, and active master switches. It should grow with the size of the master network because more copies of a message have to be sent during synchronization and master switch. The measurement result is therefore normalized by the number of master switches. As observed from Table I, the traffic is about 5 messages per master switch in our case of only two master nodes and three tasks associated with a job. With each message taking nearly 0.5 μs in Gbps networks, the network cost resulting from a switch in the distributed MapReduce solution is about 2.5 μs.

TABLE I. EVALUATION RESULTS

Metric                                          Results
Latency (normalized by synchronization cycle)   0.51
Success Ratio                                   0.97
Network Overhead                                0.5 μs

V. RELATED WORK

Hadoop is essentially an open-source, massively scalable, queryable store and archive platform that enables cloud computing and includes a file system, queryable databases, an archival store, and a flexible schema [14]. Hadoop can, and normally does, use the Hadoop Distributed File System (HDFS), which follows the write-once, read-many philosophy: instead of the hard disk constantly having to seek to modify data throughout the set, any new data is appended to the end of the current dataset. MapReduce is the programming model that was developed by Google [15] and that Hadoop uses to process data [6]. The datasets processed in Hadoop can be, and often are, much larger than any one computer can ever process [16].
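As a rough illustration of the key-value programming model mentioned above, the following single-process Python sketch counts words with a map step, a grouping (shuffle) step, and a reduce step. It is neither Hadoop's API nor the engine proposed in this paper; the names map_words, reduce_counts, and run_mapreduce are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical helper names; this is an in-memory sketch, not Hadoop code.

def map_words(line: str) -> List[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum all values that were emitted for the same key."""
    return key, sum(values)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], List[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Apply the mapper to each record, group values by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # "shuffle": collect values per key
    return dict(reducer(key, values) for key, values in groups.items())


if __name__ == "__main__":
    lines = ["map reduce map", "reduce map"]
    print(run_mapreduce(lines, map_words, reduce_counts))
```

In a real cluster the map and reduce calls run on different nodes and the grouping step moves data across the network; the sketch keeps everything in one process to show only the data flow.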
MapReduce organizes how that data is processed over many computers (anywhere from a few to a few thousand) [16].

The Distributed Hash Table (DHT) was first proposed in the work on Chord [7] by MIT to support overlay peer-to-peer networks for content distribution; other DHT implementations include CAN [17], Pastry [18], and Tapestry [19]. Chord is a distributed lookup protocol that specializes in mapping keys to nodes [8]. Data location can be easily implemented on top of Chord by associating a key with each data item and storing the key/data pair at the node to which the key maps. Chord adapts efficiently as nodes join and leave the system, and can answer queries even if the system is continuously changing. This makes it an especially good fit for MapReduce, because MapReduce already maps key-value pairs to process data, and mappers and reducers are constantly being brought up and shut down [16]. Chord features load balancing because it acts as a distributed hash function and spreads keys evenly over the participating nodes. Chord is also decentralized; no node is of greater importance than any other node. Chord scales automatically, without any tuning needed to succeed at scale.

Some other fault-tolerance efforts have been proposed for cloud computing. Cassandra [20] is a peer-to-peer based solution proposed by Facebook engineers to address fault tolerance in distributed database management. It eliminates the SPOF problem in distributed data storage. Other efforts in addressing data fault tolerance include work such as [21]. YARN [10] is proposed in the MapReduce2 of Hadoop to address the fault tolerance in HDFS. It solves the SPOF problem in an HDFS cluster. So far, no solution has been proposed to address the SPOF problem in job management, as this work does.

VI. CONCLUSION

This paper presents a fault-tolerant MapReduce engine for cloud computing. The fault tolerance is enabled by a distributed solution based on the DHT algorithm. In the solution, a network of master nodes
is formed to provide job management. A failed active master node is replaced by its next standby peer node for job management, so that the running job is maintained. The solution has been implemented in the Hadoop MapReduce1 engine, and the evaluation shows high fault tolerance with low latency and networking cost. Our next step is to implement this solution in the current Hadoop MapReduce2 for more extensive performance evaluation and to release it as open source.

VII. ACKNOWLEDGEMENT

The authors would like to thank Gordon Pettey, who helped with the extensive source code implementation of the solution. Special thanks are given to the National Science Foundation for award #1041292, which supported Gordon's work on this project as an REU student. We would also like to thank our reviewers for their valuable comments, which made the work better.

REFERENCES

[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, “A view of cloud computing,” Commun. ACM, vol. 53, no. 4, pp. 50–58, Apr. 2010. [Online]. Available: /10.1145/1721654.1721672
[2] P. Mell and T. Grance, “The NIST definition of cloud computing (draft),” NIST special publication, vol. 800, no. 145, p. 7, 2011.
[3] T. Velte, A. Velte, and R. Elsenpeter, Cloud computing, a practical approach. McGraw-Hill, Inc., 2009.
[4] Q. Zhang, L. Cheng, and R. Boutaba, “Cloud computing: state-of-the-art and research challenges,” Journal of Internet Services and Applications, vol. 1, no. 1, pp. 7–18, 2010.
[5] D. Borthakur, “The hadoop distributed file system: Architecture and design,” 2007.
[6] T. White, Hadoop: the definitive guide. O’Reilly, 2012.
[7] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan, “Chord: a scalable peer-to-peer lookup protocol for internet applications,” IEEE/ACM Trans. Netw., vol. 11, no. 1, pp. 17–32, 2003.
[8] B. Cohen, “The bittorrent protocol specification,” 2008.
[9] D. Qiu and R. Srikant, “Modeling and performance analysis of bittorrent-like peer-to-peer networks,” ACM SIGCOMM Computer Communication
Review, vol. 34, no. 4, pp. 367–378, 2004.
[10] A. C. Murthy, C. Douglas, M. Konar, O. O’Malley, S. Radia, S. Agarwal, and V. KV, “Architecture of next generation apache hadoop mapreduce framework,” Apache Hadoop, Tech. Rep., 2011.
[11] F. Dinu and T. Ng, “Understanding the effects and implications of compute node related failures in hadoop,” in Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing. ACM, 2012, pp. 187–198.
[12] F. Dinu and T. E. Ng, “Analysis of hadoop’s performance under failures,” Rice University, Tech. Rep., 2012.
[13] L. Karsten and K. Sven, “.”
[14] D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash et al., “Apache hadoop goes realtime at facebook,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, 2011, pp. 1071–1080.
[15] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[16] J. Lin and C. Dyer, “Data-intensive text processing with mapreduce,” Synthesis Lectures on Human Language Technologies, vol. 3, no. 1, pp. 1–177, 2010.
[17] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, A scalable content-addressable network. ACM, 2001, vol. 31, no. 4.
[18] A. Rowstron and P. Druschel, “Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems,” in Middleware 2001. Springer, 2001, pp. 329–350.
[19] B. Y. Zhao, L. Huang, J. Stribling, S. C. Rhea, A. D. Joseph, and J. D. Kubiatowicz, “Tapestry: A resilient global-scale overlay for service deployment,” Selected Areas in Communications, IEEE Journal on, vol. 22, no. 1, pp. 41–53, 2004.
[20] A. Lakshman and P. Malik, “Cassandra: a decentralized structured storage system,” ACM SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35–40, 2010.
[21] S. Y. Ko, I. Hoque, B. Cho, and I. Gupta, “Making cloud intermediate data fault-tolerant,” in Proceedings of the 1st ACM symposium on Cloud computing. ACM, 2010, pp. 181–192.