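Overview of the MapReduce data processing flow

I. Introduction
With the rise of big data, processing massive datasets efficiently has become increasingly important. MapReduce is a distributed computing model that processes large datasets in parallel, which makes data processing more efficient and scalable. This article briefly describes the MapReduce data processing flow and offers some personal observations.

II. Basic concepts of MapReduce
Before looking at the processing flow, we first review the basic concepts of MapReduce.
1. Map function: the Map function is the first stage of a MapReduce job. It can apply user-defined operations as needed, such as extracting specific information, cleaning data, or performing computations.
2. Reduce function: the Reduce function is the second stage of a MapReduce job. It receives the key-value pairs emitted by the Map function and aggregates the values that share the same key. It can summarize, count, or sort the data to produce the final result.
3. Distributed computing framework: MapReduce relies on a distributed computing framework such as Hadoop, which improves processing efficiency by distributing the data and the computation tasks across the nodes of a cluster.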
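III. The MapReduce data processing flow
The MapReduce data processing flow can be divided into the following stages:
1. Input splitting: the input data is split into multiple blocks, which are stored on the nodes of the cluster.
2. Map stage:
   1) Input mapping: each node loads its assigned blocks into memory and applies the Map function, converting the data into key-value pairs.
   2) Map processing: the nodes process their own blocks in parallel, executing the operations defined by the Map function. The output of this stage becomes the input of the Reduce stage.
3. Shuffle stage: during the shuffle, data is redistributed and exchanged between nodes. The steps are:
   1) Redistribution by key: based on the key of each key-value pair, all records with the same key are sent to the same node.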
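The stages above can be illustrated with a small in-memory word count. This is only a sketch under stated assumptions: it runs in a single JVM rather than on a cluster, and the class and method names (`WordCountSketch`, `map`, `shuffle`, `reduce`, `run`) are hypothetical and not part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory walk-through of split -> map -> shuffle -> reduce (hypothetical names). */
class WordCountSketch {

    /** Map step: turn one input split (here: one line of text) into (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle step: collect all values that share a key; the TreeMap keeps keys sorted. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: sum the values collected for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs the three steps over the given splits. */
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String split : splits) {
            intermediate.addAll(map(split));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(intermediate).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, cluster=1, data=2, node=1}
        System.out.println(run(List.of("big data big cluster", "data node")));
    }
}
```

In a real cluster each call to `map` would run on the node holding the split, and the grouping done by `shuffle` would involve network transfer between nodes.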
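1. Basic concept of the shuffle mechanism: in the MapReduce framework, the shuffle is the process by which the intermediate results produced in the Mapper stage are transferred to the Reducer nodes for further processing.
2. Role and importance of the shuffle mechanism: the shuffle plays a crucial role in the MapReduce framework.
3. Implementation of the shuffle mechanism: in real MapReduce frameworks, the shuffle involves concrete steps such as partitioning, sorting, and grouping the data.
4. Optimizing the shuffle mechanism: to improve the execution efficiency and performance of MapReduce jobs, researchers and engineers have proposed many optimizations for the shuffle.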
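The main principles of the MapReduce programming model can be summarized as follows:
1. Data splitting: MapReduce divides a large dataset into small blocks, each typically between 64 MB and 1 GB.
2. Map operation: the Map operation is the first step of MapReduce. It typically includes the following steps:
   (1) Input: read a data block from the input data
   (2) Mapping: convert the input data into intermediate key-value pairs
   (3) Buffering: buffer the resulting intermediate key-value pairs in memory
3. Shuffle operation: the Shuffle operation is the second step of MapReduce. It regroups the intermediate key-value pairs produced by the Map operation and groups them by key. It typically includes the following steps:
   (1) Copy: copy the intermediate key-value pairs from the Map output to the nodes that run the Reduce operation, according to their keys
   (2) Sort: sort the intermediate key-value pairs by key, which simplifies the Reduce operation
   (3) Partition: divide the intermediate key-value pairs into partitions so that all pairs with the same key fall into the same partition (in Hadoop this assignment is made on the map side, before the copy)
4. Reduce operation: the Reduce operation is the third step of MapReduce. It typically includes the following steps:
   (1) Input: obtain the grouped intermediate key-value pairs from the output of the Map operation
   (2) Buffering: buffer the intermediate key-value pairs in memory
   (3) Grouping: group the buffered intermediate key-value pairs by key
   (4) Reduce: apply the Reduce operation to each group and output the result
5. Control and coordination on the Master node: in the MapReduce programming model, the Master node assigns, manages, and coordinates the tasks.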
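The partitioning step of the shuffle can be sketched in a few lines. The formula below (non-negative hash code modulo the number of partitions) is the one used by Hadoop's default `HashPartitioner`; everything else, including the class name `HashPartitionSketch` and its helper methods, is a hypothetical illustration that runs in memory.

```java
import java.util.ArrayList;
import java.util.List;

/** Assigns keys to reduce partitions by hash (hypothetical class and method names). */
class HashPartitionSketch {

    /** Masks the sign bit so the result is never negative, then takes the remainder. */
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Distributes keys over numPartitions buckets; equal keys always share a bucket. */
    static List<List<String>> partition(List<String> keys, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, numPartitions)).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // "a".hashCode() is 97 and "b".hashCode() is 98, so with 4 partitions
        // this prints [[], [a, a], [b], []]
        System.out.println(partition(List.of("a", "b", "a"), 4));
    }
}
```

Because the partition depends only on the key, every pair with the same key reaches the same reducer, which is what makes per-key aggregation possible.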
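How Hadoop MapReduce works
Hadoop MapReduce is a distributed computing model for processing large datasets.
These data blocks are fed to a map function, which converts the input data into <Key, Value> pairs.
The map function generates many intermediate <Key, Value> pairs, where the Key is an identifier and the Value is associated with that Key …
Shuffle stage: after the Map stage, the intermediate <Key, Value> pairs are …
2. Request a Job ID from the JobTracker.
The folder is named after the job's Job ID.
For map and reduce tasks, each TaskTracker has a fixed number of map slots and reduce slots, determined by the number of cores and the amount of memory on the host.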
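2. Map stage: each Mapper reads one data block and applies the same map operation to its records.
3. Partitioning of intermediate results: the intermediate results are partitioned by key, so that all key-value pairs with the same key fall into the same partition.
4. Shuffle stage: the intermediate results are sorted by key, and key-value pairs with the same key are grouped together.
5. Reduce stage: each Reducer reads one partition of the intermediate results and applies the same aggregation operation to each group.
6. Output: the output of the Reduce stage can be stored in the file system or passed to other systems for further processing.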
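The main steps of the MapReduce distributed computing framework
Let's chat about the main steps of the MapReduce distributed computing framework. Think of MapReduce as a very large team at work.
Impressive, isn't it? It is a bit like a magic show: with the two steps, map and reduce, working together, data that started out huge and messy is brought neatly under control. It handles massive datasets the way a strongman lifts a heavy load.
Think about it: without the careful preparation up front, how could the later step produce a clean result? MapReduce is also very flexible.
It adapts to complex and changing environments. The world around us often works the same way: a big project is completed through countless small steps, and a team succeeds when everyone does their own part and the pieces come together into one result. In short, the main steps of MapReduce matter a great deal: they let us take on tasks that look impossible and make data processing much less daunting.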
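II. Basic concepts of MapReduce
1. Map stage: the Map stage is the first step of data processing in MapReduce. It splits the input dataset into a number of independent tasks and assigns them to different compute nodes for parallel processing.
2. Shuffle stage: the Shuffle stage is a very important part of MapReduce. It partitions and sorts the output of the Map stage and groups the data that share the same key, in preparation for the Reduce stage.
3. Reduce stage: the Reduce stage is the last step of MapReduce. It receives the output of the Shuffle stage, merges and aggregates the data that share the same key, and writes the final result.

III. Basic uses of MapReduce
1. Data processing: MapReduce can process massive data such as log files and text efficiently. With the Map and Reduce stages it supports analysis and computation over the data, for example word-frequency counting and data filtering.
2. Distributed computing: MapReduce breaks a dataset into many small tasks and assigns them to multiple compute nodes for parallel processing, so it can make full use of the cluster's computing resources and improve processing speed and efficiency.
3. Fault tolerance: MapReduce is highly fault tolerant. When a compute node in the cluster fails, the system automatically reassigns its tasks to other healthy nodes so that the job still completes.

IV. How MapReduce works
1. Parallel computing model: MapReduce uses a pipeline-style parallel computing model. It divides data processing into stages, and the tasks within each stage can run in parallel, which makes full use of the cluster's computing resources.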
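Processing large-scale data efficiently with Hadoop MapReduce
With the rapid development of the Internet, processing large-scale data has become a major challenge. Traditional data processing methods can no longer meet this demand, so Hadoop MapReduce has become an efficient way to process large-scale data.

I. Basic principles of Hadoop MapReduce
Hadoop MapReduce is a distributed computing model. It divides large-scale data into many small blocks and distributes these blocks to different compute nodes for parallel processing. Its basic principle can be summarized in the following steps:
1. Input splitting: the data is divided into many small blocks, each typically 64 MB or 128 MB in size.
2. Map stage: on each compute node, the input data is processed to produce intermediate results.
3. Shuffle stage: the intermediate results of the Map stage are sorted by key, and the values that share a key are merged together for later processing.
4. Reduce stage: the grouped intermediate results from the Shuffle stage are processed to obtain the final result.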
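As a worked example of the input-splitting step, the number of splits (and therefore of Map tasks) can be estimated by ceiling division of the input size by the split size. This is a simplified sketch: the class and method names are hypothetical, and real Hadoop input formats may deviate from the plain ceiling, for example by folding a small trailing remainder into the last split.

```java
/** Estimates how many input splits a file produces (hypothetical helper). */
class SplitCountSketch {

    static final long MB = 1024L * 1024L;

    /** Ceiling division: a partially filled last block still counts as one split. */
    static long splitCount(long fileSizeBytes, long splitSizeBytes) {
        if (splitSizeBytes <= 0) {
            throw new IllegalArgumentException("split size must be positive");
        }
        if (fileSizeBytes <= 0) {
            return 0;
        }
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        // A 1 GB file with 128 MB splits: prints 8
        System.out.println(splitCount(1024 * MB, 128 * MB));
        // A 300 MB file with 128 MB splits: prints 3 (128 + 128 + 44)
        System.out.println(splitCount(300 * MB, 128 * MB));
    }
}
```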
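II. Optimizing the performance of Hadoop MapReduce
Although Hadoop MapReduce can process large-scale data efficiently, some performance bottlenecks remain in practice. The following methods help to optimize its performance.
1. Data locality: in a MapReduce job, data transfer is a time-consuming operation.
2. Data compression: processing large-scale data usually requires a large amount of disk space.
3. Merging small files: in Hadoop, every small file occupies its own block entry and metadata on the NameNode, and is typically processed by its own Map task, so a large number of small files wastes NameNode memory and scheduling effort.
4. Tuning the number of tasks: in Hadoop MapReduce, the number of tasks has a large impact on performance.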
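Hadoop MapReduce case study

Introduction
Hadoop MapReduce is a distributed computing framework for parallel computation over large datasets. This article looks at the concepts, architecture, and use cases of Hadoop MapReduce.

Hadoop MapReduce overview
Hadoop MapReduce is a programming model provided by the Apache Hadoop project for computing over large datasets. Its core principle is to divide the data into a number of small blocks and to create one Map task for each block.

Hadoop MapReduce case: financial data analysis

Data preparation
In this case study we use Hadoop MapReduce to analyse a financial dataset. First we need to prepare the data, which can be obtained from a public financial data source such as Yahoo Finance.

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FinanceMap extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        // Reused output objects, to avoid allocating new ones for every record.
        private Text date = new Text();
        private DoubleWritable amount = new DoubleWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line is expected to hold three comma-separated fields.
            String[] parts = value.toString().split(",");
            if (parts.length == 3) {
                date.set(parts[0]);                          // first field: date
                amount.set(Double.parseDouble(parts[2]));    // third field: amount
                context.write(date, amount);                 // emit (date, amount)
            }
        }
    }

Reduce task
Next, we need to create a Reduce task that aggregates the output of the Map tasks.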
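The source does not show the Reduce task, so the following is only an in-memory sketch of what the combined map and reduce logic could compute. It assumes that the aggregation is a sum of amounts per date and that records have the three-field form `date,<middle field>,amount` used by the mapper above; the class name `FinanceSumSketch`, the method `sumByDate`, and the sample records are hypothetical. Unlike the mapper above, the sketch also skips lines whose amount is not a number.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java stand-in for the map + reduce pair: total amount per date (assumed aggregation). */
class FinanceSumSketch {

    static Map<String, Double> sumByDate(List<String> csvLines) {
        Map<String, Double> totals = new TreeMap<>();
        for (String line : csvLines) {
            String[] parts = line.split(",");
            if (parts.length != 3) {
                continue; // same filter as the mapper: only three-field records
            }
            double amount;
            try {
                amount = Double.parseDouble(parts[2].trim());
            } catch (NumberFormatException e) {
                continue; // e.g. a header row; an extra guard not present in the mapper
            }
            // "Reduce": fold every amount that shares a date into one running total.
            totals.merge(parts[0], amount, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // prints {2020-01-01=4.0, 2020-01-02=4.0}
        System.out.println(sumByDate(List.of(
                "date,symbol,amount",
                "2020-01-01,AAA,1.5",
                "2020-01-01,BBB,2.5",
                "2020-01-02,AAA,4.0")));
    }
}
```

In an actual Hadoop job the summation would live in a class extending `Reducer<Text, DoubleWritable, Text, DoubleWritable>`, with the framework supplying the values grouped by date.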
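MapReduce is used as follows:
1. **Write the Map function:** the Map function converts the input data into (Key, Value) pairs according to a given rule, where the Key denotes the category of the data and the Value denotes its content. For example, we can write a Map function that converts every line of a file into a (Key, Value) pair in which the Key is the line number and the Value is the content of the line.
2. **Write the Reduce function:** the Reduce function merges and processes the output of the Map stage according to a given rule. Its input is (Key, [Value1, Value2, ...]), where the Key denotes the category and Value1, Value2, ... are all the data items that belong to that category.
3. **Configure the MapReduce job:** in Hadoop, you need to set the input and output paths of the job and specify the classes that implement the Map and Reduce functions. For example, the following code creates a MapReduce job:
```java
Job job = Job.getInstance(configuration, "MyJob");
job.setJarByClass(MyJob.class);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path("input"));
FileOutputFormat.setOutputPath(job, new Path("output"));
```
4. **Run the MapReduce job:** once the job is configured, you can submit it to the Hadoop cluster to run.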
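The shuffle process in MapReduce

I. Overview of the shuffle process in MapReduce
MapReduce is a distributed computing model for processing large datasets. In MapReduce, the shuffle is the key link between the Map tasks and the Reduce tasks. It partitions and sorts the output of the Map tasks according to specific rules and then transfers each partition to the node that runs the corresponding Reduce task.

II. Role of the shuffle process
The shuffle plays an important role in the MapReduce model, mainly in the following respects:
1. Data partitioning: the shuffle divides the output of the Map tasks into partitions so that data with the same key end up in the same partition.
2. Data sorting: within each partition, the shuffle sorts the data so that records with the same key are adjacent.
3. Data transfer: the shuffle transfers the sorted data from the Map task nodes to the corresponding Reduce task nodes, where it becomes the input of the Reduce tasks.

III. Implementation of the shuffle process
The shuffle has two sides: the Map side and the Reduce side.
1. Map side: in a Map task, the shuffle mainly involves partitioning and sorting the map output. The Map task divides its output into partitions and sorts the data within each partition.
2. Reduce side: in a Reduce task, the shuffle mainly involves fetching the partition data from the Map task nodes and merging it. A Reduce task fetches its partition from every Map task node, merges the sorted pieces into a single sorted input, and then produces the final result.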
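The two sides of the shuffle described above can be sketched as a sort on the Map side followed by a k-way merge on the Reduce side. This is an in-memory illustration under stated assumptions: keys are plain strings, each "run" stands for the sorted output that one Map task produced for a single partition, and the names `ShuffleMergeSketch`, `sortRun`, and `mergeRuns` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Map-side sort and reduce-side merge of sorted runs (hypothetical names). */
class ShuffleMergeSketch {

    /** Map side: sort the keys that one Map task emitted for a given partition. */
    static List<String> sortRun(List<String> keys) {
        List<String> run = new ArrayList<>(keys);
        Collections.sort(run);
        return run;
    }

    /** Reduce side: k-way merge of the already sorted runs fetched from the Map tasks. */
    static List<String> mergeRuns(List<List<String>> sortedRuns) {
        // Each queue entry is {runIndex, positionInRun}, ordered by the key it points at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                (a, b) -> sortedRuns.get(a[0]).get(a[1])
                        .compareTo(sortedRuns.get(b[0]).get(b[1])));
        for (int i = 0; i < sortedRuns.size(); i++) {
            if (!sortedRuns.get(i).isEmpty()) {
                heads.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            List<String> run = sortedRuns.get(head[0]);
            merged.add(run.get(head[1]));
            if (head[1] + 1 < run.size()) {
                heads.add(new int[] {head[0], head[1] + 1}); // advance within the same run
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> fromMap1 = sortRun(List.of("b", "a", "c"));
        List<String> fromMap2 = sortRun(List.of("c", "a"));
        // prints [a, a, b, c, c]: equal keys end up adjacent, ready for grouping
        System.out.println(mergeRuns(List.of(fromMap1, fromMap2)));
    }
}
```

Merging sorted runs avoids re-sorting everything on the Reduce side, and it leaves equal keys adjacent so the reduce function can consume one group at a time.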
MapReduce principles
MapReduce is a distributed computing model designed to meet today's demand for processing massive amounts of data.
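How MapReduce works
MapReduce is a programming model for big data processing. It breaks a large task into small tasks and runs them in parallel on different compute nodes.
MapReduce was designed from the start to process big data, and its distributed processing capability and scalability have made it an important building block of the cloud computing era.
The basic idea of MapReduce is to divide a large dataset into many small blocks, process the blocks independently of one another, and finally merge the results of the subtasks into the final result.
The MapReduce model has two stages: the Map stage and the Reduce stage.
In the Map stage, the model divides the input dataset into a number of blocks of roughly equal size, starts one Map task per block, and feeds the block to the Map function.
The Map function transforms each input key-value pair into a series of new key-value pairs.
These new key-value pairs are partitioned and sorted, and are then passed to the Reduce stage.
In the Reduce stage, the model aggregates and sorts the key-value pairs emitted by all Map tasks and sends the pairs that share the same key to the same Reduce task.
The Reduce function aggregates its input key-value pairs by key and writes the aggregated result to a new file.
The strengths of the MapReduce model are its high scalability and high fault tolerance.
Because Map tasks do not interfere with each other and Map function calls do not interact, the model is very easy to parallelize.
To cope with node failures, the model persists the data produced while tasks are processed, so that failed tasks can be recovered.
In short, the MapReduce model is a very good way to process big data.
It splits the input dataset into small blocks, starts one Map task per block, feeds the output of the Map tasks to the Reduce function for aggregation, and so obtains the final result.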
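A Map task maps the input data to (key, value) pairs and emits these pairs as intermediate results.
A Reduce task applies a specific operation to the (key, value) pairs of each key and produces the final output.
Commonly used distributed computing frameworks today include Apache Hadoop and Apache Spark.
The Map function maps the input data to (key, value) pairs and emits these pairs as intermediate results.
The concrete implementation of the Map function depends on the application and its requirements.
The Reduce function applies a specific operation to the (key, value) pairs of each key and produces the final output.
Reduce function calls are also independent of each other and can run in parallel; each Reduce task processes one group of (key, value) pairs.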
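How the MapReduce mechanism works
MapReduce is a parallel computing framework for processing large-scale data.
The following describes in detail how MapReduce works.

I. Distributed computing
1. MapReduce uses distributed computing to process large-scale data.
2. Distributed computing makes full use of the computing resources in the cluster, speeds up data processing, and improves the fault tolerance of the system.
3. Distributed computing also handles parallel computation over the data effectively and improves computing efficiency.

II. Data partitioning
1. In MapReduce, the data is divided into many input pairs.
2. Each input pair consists of a key and a value.
3. The data can be partitioned by key, so that data with the same key are assigned to the same compute node for processing.

III. Map function
1. The map function is an important part of MapReduce.
2. The map function processes each data block and emits a number of intermediate key-value pairs.
3. The output of the map function serves as the input of the reduce function for the subsequent processing.

IV. Reduce function
1. The reduce function is the other important part of MapReduce.
2. The reduce function groups the intermediate key-value pairs by key and then applies the reduce operation to each group.
3. The output of the reduce function is the final result, which can be saved to the file system.

V. MapReduce workflow
1. When a MapReduce job is submitted, the input data is first divided into data blocks, and the blocks are assigned to different compute nodes.
2. Each node runs the map function on its data blocks and produces intermediate key-value pairs.
3. The reduce functions on the nodes aggregate the intermediate key-value pairs and produce the final result.
From the above it can be seen that the MapReduce mechanism consists mainly of four key elements: distributed computing, data partitioning, the map function, and the reduce function.
MapReduce can be used to analyse large-scale data in social networks.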
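How MapReduce works

1 MapReduce principles (part 1)

1.1 The MapReduce programming model
MapReduce follows a divide-and-conquer approach: an operation on a large dataset is distributed to the worker nodes managed by a master node, and the final result is obtained by combining the intermediate results of the individual nodes.
Put simply, MapReduce is "decomposing the task and collecting the results".
In distributed computing, the MapReduce framework takes care of the complex problems of parallel programming, such as distributed storage, job scheduling, load balancing, fault tolerance, and network communication, and abstracts the processing into two functions, map and reduce: map decomposes the task into multiple tasks, and reduce collects the results of the decomposed tasks.

1.2 The MapReduce processing flow
In Hadoop, every MapReduce task is initialized as a Job, and each Job can be divided into two stages: the map stage and the reduce stage.
The map function receives input of the form <key, value> and produces intermediate output of the same <key, value> form; the reduce function receives input of the form <key, (list of values)> and processes this collection of values, and each reduce produces …
Everything starts from the user program at the top, which links against the MapReduce library and implements the basic Map and Reduce functions.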
Scalable Distributed Reasoning using MapReduce

Jacopo Urbani, Spyros Kotoulas, Eyal Oren, and Frank van Harmelen
Department of Computer Science, Vrije Universiteit Amsterdam, the Netherlands

Abstract. We address the problem of scalable distributed reasoning, proposing a technique for materialising the closure of an RDF graph based on MapReduce. We have implemented our approach on top of Hadoop and deployed it on a compute cluster of up to 64 commodity machines. We show that a naive implementation on top of MapReduce is straightforward but performs badly and we present several non-trivial optimisations. Our algorithm is scalable and allows us to compute the RDFS closure of 865M triples from the Web (producing 30B triples) in less than two hours, faster than any other published approach.

1 Introduction

In this paper, we address the problem of scalable distributed reasoning. Most existing reasoning approaches are centralised, exploiting recent hardware improvements and dedicated data structures to reason over large-scale data [8,11,17].
However, centralised approaches typically scale in only one dimension: they become faster with more powerful hardware.

Therefore, we are interested in parallel, distributed solutions that partition the problem across many compute nodes. Parallel implementations can scale in two dimensions, namely hardware performance of each node and the number of nodes in the system. Some techniques have been proposed for distributed reasoning, but, as far as we are aware, they do not scale to orders of 10^8 triples.

We present a technique for materialising the closure of an RDF graph in a distributed manner, on a cluster of commodity machines. Our approach is based on MapReduce [3] and it efficiently computes the closure under the RDFS semantics [6]. We have also extended it considering the OWL Horst semantics [9], but the implementation is not yet competitive and should be considered as future work. This paper can be seen as a response to the challenge posed in [12] to exploit the MapReduce framework for efficient large-scale Semantic Web reasoning.

This paper is structured as follows: we start, in Section 2, with a discussion of the current state-of-the-art, and position ourselves in relation to these approaches. We summarise the basics of MapReduce with some examples in Section 3. In Section 4 we provide an initial implementation of forward-chaining RDFS materialisation with MapReduce. We call this implementation "naive" because it directly translates known distributed reasoning approaches into MapReduce. This implementation is easy to understand but performs poorly because of load-balancing problems and because of the need for fixpoint iteration. Therefore, in Section 5, an improved implementation is presented using several intermediate MapReduce functions. Finally, we evaluate our approach in Section 6, showing runtime and scalability over various datasets of increasing size, and speedup over increasing amounts of compute nodes.

2 Related work

Hogan et al. [7] compute the closure of an RDF graph using two passes over
the data on a single machine. They implement only a fragment of the OWL Horst semantics, to allow efficient materialisation, and to prevent "ontology hijacking". Our approach borrows from their ideas, but by using well-defined MapReduce functions our approach allows straightforward distribution over many nodes, leading to improved results.

Mika and Tummarello [12] use MapReduce to answer SPARQL queries over large RDF graphs, and mention closure computation, but do not provide any details or results. In comparison, we provide algorithm details, make the code available open-source, and report on experiments of up to 865M triples.

MacCartney et al. [13] show that graph-partitioning techniques improve reasoning over first-order logic knowledge bases, but do not apply this in a distributed or large-scale context. Soma and Prasanna [15] present a technique for parallel OWL inferencing through data partitioning. Experimental results show good speedup but on relatively small datasets (1M triples) and runtime is not reported. In contrast, our approach needs no explicit partitioning phase and we show that it is scalable over increasing dataset size.

In previous work [14] we have presented a technique based on data-partitioning in a self-organising P2P network. A load-balanced auto-partitioning approach was used without upfront partitioning cost. Conventional reasoners are locally executed and the data is intelligently exchanged between the nodes. The basic principle is substantially different from the work here presented and experimental results were only reported for relatively small datasets of up to 15M triples.

Several techniques have been proposed based on deterministic rendezvous-peers on top of distributed hashtables [1,2,4,10]. However, these approaches suffer from load-balancing problems due to the data distributions [14].

3 What is the MapReduce framework?

MapReduce is a framework for parallel and distributed processing of batch jobs [3] on a large number of compute nodes. Each job consists of two phases: a
map and a reduce. The mapping phase partitions the input data by associating each element with a key. The reduce phase processes each partition independently. All data is processed based on key/value pairs: the map function processes a key/value pair and produces a set of new key/value pairs; the reduce merges all intermediate values with the same key into final results.

Algorithm 1 Counting term occurrences in RDF NTriples files

    map(key, value):
      // key: line number
      // value: triple
      emit(value.subject, blank);     // emit a blank value, since
      emit(value.predicate, blank);   // only amount of terms matters
      emit(value.object, blank);

    reduce(key, iterator values):
      // key: triple term (URI or literal)
      // values: list of irrelevant values for each term
      int count = 0;
      for (value in values)
        count++;                      // count number of values, equalling occurrences
      emit(key, count);

Fig. 1. MapReduce processing

We illustrate the use of MapReduce through an example application that counts the occurrences of each term in a collection of triples. As shown in Algorithm 1, the map function partitions these triples based on each term. Thus, it emits intermediate key/value pairs, using the triple terms (s, p, o) as keys and a blank, irrelevant, value. The framework will group all intermediate pairs with the same key, and invoke the reduce function with the corresponding list of values, summing the number of values into an aggregate term count (one value was emitted for each term occurrence).

This job could be executed as shown in Figure 1. The input data is split in several blocks. Each computation node operates on one or more blocks, and performs the map function on that block. All intermediate values with the same key are sent to one node, where the reduce is applied.

This simple example illustrates some important elements of the MapReduce programming model:
– since the map operates on single pieces of data without dependencies, partitions can be created arbitrarily and can be scheduled in parallel across many nodes. In this example, the input triples can be split across nodes arbitrarily, since the computations on these triples (emitting the key/value pairs) are independent of each other.
– the reduce operates on an iterator of values because the set of values is typically far too large to fit in memory. This means that the reducer can only partially use correlations between these items while processing: it receives them as a stream instead of a set. In this example, operating on the stream is trivial, since the reducer simply increments the counter for each item.
– the reduce operates on all pieces of data that share some key, assigned in a map. A skewed partitioning (i.e. skewed key distribution) will lead to imbalances in the load of the compute nodes. If term x is relatively popular the node performing the reduce for term x will be slower than others. To use MapReduce efficiently, we must find balanced partitions of the data.

Table 1. RDFS rules [6]

    1:  s p o (if o is a literal)                              ⇒  _:n rdf:type rdfs:Literal
    2:  p rdfs:domain x  &  s p o                              ⇒  s rdf:type x
    3:  p rdfs:range x  &  s p o                               ⇒  o rdf:type x
    4a: s p o                                                  ⇒  s rdf:type rdfs:Resource
    4b: s p o                                                  ⇒  o rdf:type rdfs:Resource
    5:  p rdfs:subPropertyOf q  &  q rdfs:subPropertyOf r      ⇒  p rdfs:subPropertyOf r
    6:  p rdf:type rdf:Property                                ⇒  p rdfs:subPropertyOf p
    7:  s p o  &  p rdfs:subPropertyOf q                       ⇒  s q o
    8:  s rdf:type rdfs:Class                                  ⇒  s rdfs:subClassOf rdfs:Resource
    9:  s rdf:type x  &  x rdfs:subClassOf y                   ⇒  s rdf:type y
    10: s rdf:type rdfs:Class                                  ⇒  s rdfs:subClassOf s
    11: x rdfs:subClassOf y  &  y rdfs:subClassOf z            ⇒  x rdfs:subClassOf z
    12: p rdf:type rdfs:ContainerMembershipProperty            ⇒  p rdfs:subPropertyOf rdfs:member
    13: o rdf:type rdfs:Datatype                               ⇒  o rdfs:subClassOf rdfs:Literal

4 Naive RDFS reasoning with MapReduce

The closure of an RDF input graph under the RDFS semantics [6] can be computed by applying all RDFS rules iteratively on the input until no new data is derived (fixpoint). The RDFS rules, shown in Table 1, have one or two antecedents. For brevity, we ignore the former (rules 1, 4a, 4b, 6, 8, 10, 12 and 13) since these can be evaluated at any point in time without a join. Rules with two antecedents are more challenging
to implement since they require a join over two parts of the data.

4.1 Encoding an example RDFS rule in MapReduce

Applying the RDFS rules means performing a join over some terms in the input triples. Let us consider for example rule 9 from Table 1, which derives rdf:type based on the sub-class hierarchy. We can implement this join with a map and reduce function, as shown in Figure 2 and Algorithm 2:

Fig. 2. Encoding RDFS rule 9 in MapReduce.

Algorithm 2 Naive sub-class reasoning (RDFS rule 9)

    map(key, value):
      // key: line number (irrelevant)
      // value: triple
      switch triple.predicate
        case "rdf:type":
          emit(triple.object, triple);    // group (s rdf:type x) on x
        case "rdfs:subClassOf":
          emit(triple.subject, triple);   // group (x rdfs:subClassOf y) on x

    reduce(key, iterator values):
      // key: triple term, eg x
      // values: triples, eg (s type x), (x subClassOf y)
      superclasses = empty;
      types = empty;
      // we iterate over triples
      // if we find subClass statement, we remember the super-classes
      // if we find a type statement, we remember the type
      for (triple in values):
        switch triple.predicate
          case "rdfs:subClassOf":
            superclasses.add(triple.object)   // store y
          case "rdf:type":
            types.add(triple.subject)         // store s
      for (s in types):
        for (y in superclasses):
          emit(null, triple(s, "rdf:type", y));

In the map, we process each triple and output a key/value pair, using as value the original triple, and as key the triple's term (s, p, o) on which the join should be performed. To perform the sub-class join, triples with rdf:type should be grouped on their object (eg. "x"), while triples with rdfs:subClassOf should be grouped on their subject (also "x"). When all emitted tuples are grouped for the reduce phase, these two will group on "x" and the reducer will be able to perform the join.

4.2 Complete RDFS reasoning: the need for fixpoint iteration

If we perform this map once (over all input data), and then the reduce once, we will not find all corresponding conclusions. For example, to compute the transitive closure of a chain of n rdfs:subClassOf-inclusions, we would need to iterate the above map/reduce steps n
times.

Obviously, the above map and reduce functions encode only rule 9 of the RDFS rules. We would need to add other, similar, map and reduce functions to implement each of the other rules. These other rules are interrelated: one rule can derive triples that can serve as input for another rule. For example, rule 2 derives rdf:type information from rdfs:domain statements. After applying that rule, we would need to re-apply our earlier rule 9 to derive possible superclasses.

Thus, to produce the complete RDFS closure of the input data using this technique we need to add more map/reduce functions, chain these functions to each other, and iterate these until we reach some fixpoint.

5 Efficient RDFS reasoning with MapReduce

The previously presented implementation is straightforward, but is inefficient because it produces duplicate triples (several rules generate the same conclusions) and because it requires fixpoint iteration. We encoded, as example, only rule 9 and we launched a simulation over the Falcon dataset, which contains 35 million triples. After 40 minutes the program had not yet terminated, but had already generated more than 50 billion triples. Considering that the unique derived triples from Falcon are no more than 1 billion, the ratio of unique derived triples to duplicates is at least 1:50. Though the amount of duplicate triples depends on the specific data set, a valid approach should be able to efficiently deal with real-world examples like Falcon.

In the following subsections, we introduce three optimisations to greatly decrease the number of jobs and time required for closure computation:

5.1 Loading schema triples in memory

Typically, schema triples are far less numerous than instance triples [7]. As also shown in Table 2, our experimental data (footnote 1) indeed exhibit a low ratio between schema and instance triples. In combination with the fact that RDFS rules with two antecedents include at least one schema triple, we can infer that joins are made between a large set of instance triples and a small set of schema
triples. For example, in rule 9 of Table 1 the set of rdf:type triples is typically far larger than the set of rdfs:subClassOf triples. As our first optimisation, we can load the small set of rdfs:subClassOf triples in memory and launch a MapReduce job that streams the instance triples and performs joins with the in-memory schema triples.

5.2 Data grouping to avoid duplicates

The join with the schema triples can be physically executed either during the map or during the reduce phase of the job. After initial experiments, we have concluded that it is faster to perform the join in the reduce, since doing so in the map results in producing large numbers of duplicate triples.

Footnote 1: from the Billion Triple challenge 2008, http://www.cs.vu.nl/~pmika/swc/btc.html

Table 2. Schema triples (amount and fraction of total triples) in datasets

    schema type                                          amount       fraction
    domain, range (p rdfs:domain D, p rdfs:range R)      30,000       0.004%
    sub-property (a rdfs:subPropertyOf b)                70,000       0.009%
    sub-class (a rdfs:subClassOf b)                      2,000,000    0.2%

Let us illustrate our case with an example based on rule 2 (rdfs:domain).
Assume an input with ten different triples that share the same subject and predicate but have a different object. If the predicate has a domain associated with it and we execute the join in the mappers, the framework will output a copy of the new triple for each of the ten triples in the input. These triples can be correctly filtered out by the reducer, but they will cause significant overhead since they will need to be stored locally and be transferred over the network.

We can avoid the generation of duplicates if we first group the triples by subject and then we execute the join over the single group. We can do it by designing a mapper that outputs an intermediate tuple that has as key the triple's subject and as value the predicate. In this way the triples will be grouped together and we will execute the join only once, avoiding generating duplicates.

In general, we set as key those parts of the input triples that are also used in the derived triple. The parts depend on the applied rule. In the example above, the only part of the input that is also used in the output is the subject. Since the key is used to partition the data, for a given rule, all triples that produce some new triple will be sent to the same reducer. It is then trivial to output that triple only once in the reducer. As value, we emit those elements of the triple that will be matched against the schema.

5.3 Ordering the application of the RDFS rules

We analyse the RDFS ruleset with regard to input and output of each rule, to understand which rule may be triggered by which other rule. By ordering the execution of rules we can limit the number of iterations needed for full closure.
As explained before, we ignore some of the rules with a single antecedent (1, 4, 6, 8, 10) without loss of generality: these can be implemented at any point in time without a join, using a single pass over the data. We first categorise the rules based on their output:
– rules 5 and 12 produce schema triples with rdfs:subPropertyOf as predicate,
– rules 11 and 13 produce schema triples with rdfs:subClassOf as predicate,
– rules 2, 3, and 9 produce instance triples with rdf:type as predicate,
– rule 7 may produce arbitrary triples.

We also categorise the rules based on the predicates in their antecedents:
– rules 5 and 11 operate only on sub-class or sub-property triples,
– rules 9, 12 and 13 operate on triples with type, sub-class, and sub-property,
– rules 2, 3 and 7 can operate on arbitrary triples.

Fig. 3. Relation between the various RDFS rules

Figure 3 displays the relation between the RDFS rules, connecting rules based on their input and output (antecedents and consequents). An ideal execution should proceed from the bottom of the picture to the top: first apply the transitivity rules (rules 5 and 11), then apply rule 7, then rules 2 and 3, then rule 9 and finally rules 12 and 13.

It may seem that rules 12 and 13 could produce triples that would serve as input to rules 5 and 11; however, looking carefully we see that this is not the case: rule 12 outputs (?s rdfs:subPropertyOf rdfs:member), rule 13 outputs (?s rdfs:subClassOf rdfs:Literal). For rules 5 and 11 to fire on these, rdfs:member and rdfs:Literal must have been defined as sub-classes or sub-properties of something else. However, in RDFS none of these is a sub-class or sub-property of anything. They could of course be super-classed by arbitrary users on the Web. However, such "unauthorised" statements are dangerous because they can cause ontology hijacking and therefore we ignore them following the advice of [7]. Hence, the output of rules 12 and 13 cannot serve as input to rules 5 and 11. Similarly, rules 2 and 3 cannot fire.

Furthermore, rule 9 cannot fire after rule 13, since this would require using literals as
subjects, which we ignore as being non-standard RDF. The only rules that could fire after rule 12 are rules 5 and 7. For complete RDFS inferencing, we would need to evaluate these rules for each container-membership property found in the data, but as we will show, in typical datasets these properties occur very rarely.

As our third optimisation, we conclude that instead of having to iterate over all RDFS rules until fixpoint, it is sufficient to process them only once, in the order indicated in Figure 3.

Fig. 4. Data flow. The solid lines refer to data split partitioned using MapReduce while the dashed lines refer to shared data.

5.4 The complete picture

In this section, we present an updated algorithm implementing the above optimisations. The complete algorithm consists of five sequential MapReduce jobs, as shown in Figure 4. First, we perform dictionary encoding and extract the schema triples to a shared distributed file system. Then, we launch the RDFS reasoner that consists of a sequence of four MapReduce jobs.

The first job applies the rules that involve the sub-property relations. The second applies the rules concerning domain and range. The third cleans up the duplicated statements produced in the first step and the last applies the rules that use the sub-class relations. In the following subsections, each of these jobs is explained in detail.

Distributed dictionary encoding in MapReduce
To reduce the physical size of the input data, we perform a dictionary encoding, in which each triple term is rewritten into a unique and small identifier. We have developed a novel technique for distributed dictionary encoding using MapReduce, rewriting each term into an 8-byte identifier; the encoding scales linearly with the input data. Due to space limitations, we refer the reader to [16]. Encoding all 865M triples takes about 1 hour on 32 nodes. Note that schema triples are extracted here.

First job: apply rules on sub-properties
The first job applies rules 5 and 7, which concern sub-properties, as shown in Algorithm
3. Since the schema triples are loaded in memory, these rules can be applied simultaneously.

Algorithm 3 RDFS sub-property reasoning

    map(key, value):
      // key: null
      // value: triple
      if (subproperties.contains(value.predicate))   // for rule 7
        key = "1" + value.subject + "-" + value.object
        emit(key, value.predicate)
      if (subproperties.contains(value.object) &&
          value.predicate == "rdfs:subPropertyOf")   // for rule 5
        key = "2" + value.subject
        emit(key, value.object)

    reduce(key, iterator values):
      // key: flag + some triple terms (depends on the flag)
      // values: triples to be matched with the schema
      values = values.unique   // filter duplicate values
      switch (key[0])
        case 1:   // we are doing rule 7: subproperty inheritance
          for (predicate in values)
            // iterate over the predicates emitted in the map and collect superproperties
            superproperties.add(subproperties.recursive_get(predicate))
          for (superproperty in superproperties)
            // iterate over superproperties and emit instance triples
            emit(null, triple(key.subject, superproperty, key.object))
        case 2:   // we are doing rule 5: subproperty transitivity
          for (predicate in values)
            // iterate over the predicates emitted in the map, and collect superproperties
            superproperties.add(subproperties.recursive_get(predicate))
          for (superproperty in superproperties)
            // emit transitive subproperties
            emit(null, triple(key.subject, "rdfs:subPropertyOf", superproperty))

To avoid generation of duplicates, we follow the principle of setting as the tuple's key the triple's parts that are used in the derivation. This is possible because all inferences are drawn on an instance triple and a schema triple and we load all schema triples in memory. That means that for rule 5 we output as key the triple's subject while for rule 7 we output a key consisting of subject and object. We add an initial flag to keep the groups separated since later we have to apply a different logic that depends on the rule. In case we apply rule 5, we output the triple's object as value, otherwise we output the predicate.

The reducer reads the flag of the group's key and applies the corresponding rule. In both
cases, it first filters out duplicates in the values. Then it recursively matches the tuple's values against the schema and saves the output in a set. Once the reducer has finished with this operation, it outputs the new triples using the information in the key and in the derivation output set.

This algorithm will not derive a triple more than once, but duplicates may still occur between the derived triples and the input triples. Thus, at a later stage, we will perform a separate duplicate removal job.

Second job: apply rules on domain and range
The second job applies rules 2 and 3, as shown in Algorithm 4. Again, we use a similar technique to avoid generating duplicates. In this case, we emit as key the triple's subject and as value the predicate. We also add a flag so that the reducers know if they have to match it against the domain or against the range schema. Tuples about domain and range will be grouped together if they share the same subject since the two rules might derive the same triple.

Algorithm 4 RDFS domain and range reasoning

    map(key, value):
      // key: null
      // value: triple
      if (domains.contains(value.predicate)) then   // for rule 2
        key = value.subject
        emit(key, value.predicate + "d")
      if (ranges.contains(value.predicate)) then    // for rule 3
        key = value.object
        emit(key, value.predicate + "r")

    reduce(key, iterator values):
      // key: subject of the input triples
      // values: predicates to be matched with the schema
      values = values.unique   // filter duplicate values
      for (predicate in values)
        switch (predicate.flag)
          case "r":   // rule 3: find the range for this predicate
            types.add(ranges.get(predicate))
          case "d":   // rule 2: find the domain for this predicate
            types.add(domains.get(predicate))
      for (type in types)
        emit(null, triple(key, "rdf:type", type))

Third job: delete duplicate triples
The third job is simpler and eliminates duplicates between the previous two jobs and the input data. Due to space limitations, we refer the reader to [16].

Fourth job: apply rules on sub-classes
The last job applies rules 9, 11, 12, and 13, which are concerned with sub-class
relations. The procedure, shown in Algorithm 5, is similar to the previous job with the following difference: during the map phase we do not filter the triples but forward everything to the reducers instead. In doing so, we are able to also eliminate the duplicates against the input.

Algorithm 5 RDFS sub-class reasoning

  map(key, value):
    // key: source of the triple (irrelevant)
    // value: triple
    if (value.predicate == "rdf:type")
      key = "0" + value.subject
      emit(key, value.object)
    if (value.predicate == "rdfs:subClassOf")
      key = "1" + value.subject
      emit(key, value.object)

  reduce(key, iterator values):
    // key: flag + triple.subject
    // iterator: list of classes
    values = values.unique // filter duplicate values
    for (class in values)
      superclasses.add(subclasses.get_recursively(class))
    switch (key[0])
      case 0: // we’re doing rdf:type
        for (class in superclasses)
          if !values.contains(class)
            emit(null, triple(key.subject, "rdf:type", class))
      case 1: // we’re doing subClassOf
        for (class in superclasses)
          if !values.contains(class)
            emit(null, triple(key.subject, "rdfs:subClassOf", class))

6 Experimental results

We use the Hadoop framework, an open-source Java implementation of MapReduce. Hadoop is designed to efficiently run and monitor MapReduce applications on clusters of commodity machines. It uses a distributed file system and manages execution details such as data transfer, job scheduling, and error management.

Our experiments were performed on the DAS-3 distributed supercomputer³ using up to 64 compute nodes with 4 cores and 4 GB of main memory each, using Gigabit Ethernet as an interconnect.

³ http://www.cs.vu.nl/das3

Table 3. Closure computation using datasets of increasing size on 32 nodes

  dataset    input     output    time      σ
  Wordnet    1.9M      4.9M      3’39”     9.1%
  Falcon     32.5M     863.7M    4’19”     3.8%
  Swoogle    78.8M     1.50B     7’15”     8.2%
  DBpedia    150.1M    172.0M    5’20”     8.6%
  others     601.5M
  all        864.8M    30.0B     56’57”    1.2%

We have experimented on real-world data from the Billion Triple Challenge 2008⁴. An overview of these datasets is shown in Table 3, where dataset all refers to all the challenge datasets combined except for Webscope, whose
access is limited under a license. All the code used for our experiments is publicly available⁵.

6.1 Results for RDFS reasoning

We evaluate our system in terms of time required to calculate the full closure. We report the average and the relative deviation σ (the standard deviation divided by the average) of three runs. The results, along with the number of output triples, are presented in Table 3. Figure 6 shows the time needed for each reasoning phase. Our RDFS implementation shows very high performance: for the combined dataset of 865M triples, it produced 30B triples in less than one hour. This amounts to a total throughput of 8.77 million triples/sec. for the output and 252,000 triples/sec. for the input. These results do not include dictionary encoding, which took, as mentioned, one hour for all datasets combined. Including this time, the throughput becomes 4.27 million triples/sec. and 123,000 triples/sec. respectively, which, to the best of our knowledge, still outperforms any results reported both in the literature [11] and on the Web⁶.

⁴ http://www.cs.vu.nl/~pmika/swc/btc.html
⁵ https:///~jrbn/+junk/reasoning-hadoop

Besides absolute performance, an important metric in parallel algorithms is how performance scales with additional compute nodes. Table 4 shows the speedup gained with increasing number of nodes and the resulting efficiency, on the Falcon and DBpedia datasets. Similar results hold for the other datasets. To the best of our knowledge, the only published speedup results for distributed reasoning on a dataset of this size can be found in [14]; for both datasets, and all numbers of nodes, our implementation outperforms this approach.

The speedup results are also shown in Figure 5. They show that our high throughput rates are already obtained when utilising only 16 compute nodes.
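The throughput figures quoted in this section can be recomputed from the "all" row of Table 3. The following Python sketch is only an illustrative check and is not part of the published code: the helper names are hypothetical, the inputs are the rounded values printed in the table, and `parallel_efficiency` assumes the common textbook definition (speedup divided by the growth in node count), which may differ from the normalisation used in Table 4.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a runtime printed as minutes and seconds into seconds."""
    return minutes * 60 + seconds


def throughput(triples: float, seconds: float) -> float:
    """Triples handled per second."""
    return triples / seconds


def parallel_efficiency(t_base: float, t_n: float, n_base: int, n: int) -> float:
    """Speedup over the baseline run divided by the growth in node count.

    Assumed textbook definition; Table 4 may normalise differently.
    """
    speedup = t_base / t_n
    return speedup / (n / n_base)


# Values from the "all" row of Table 3, rounded as printed there.
INPUT_TRIPLES = 864.8e6
OUTPUT_TRIPLES = 30.0e9
REASONING_SECONDS = to_seconds(56, 57)
ENCODING_SECONDS = 60 * 60  # "one hour" of dictionary encoding

# Reasoning time only.
output_tp = throughput(OUTPUT_TRIPLES, REASONING_SECONDS)
input_tp = throughput(INPUT_TRIPLES, REASONING_SECONDS)

# Reasoning plus dictionary encoding.
total_seconds = REASONING_SECONDS + ENCODING_SECONDS
output_tp_enc = throughput(OUTPUT_TRIPLES, total_seconds)
input_tp_enc = throughput(INPUT_TRIPLES, total_seconds)

print(f"reasoning only: {output_tp / 1e6:.2f}M out/s, {input_tp / 1e3:.0f}K in/s")
print(f"with encoding:  {output_tp_enc / 1e6:.2f}M out/s, {input_tp_enc / 1e3:.0f}K in/s")
```

Because the table values are rounded, the recomputed numbers agree with the figures in the text only to within a fraction of a percent.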
We attribute the decreasing efficiency on larger numbers of nodes to the fixed Hadoop overhead for starting jobs on nodes: on 64 nodes, our computation per node is not big enough to compensate platform overhead. Figure 6 shows the division of runtime over the computation phase from Figure 4, and confirms the widely-held intuition that subclass-reasoning is the most expensive part of RDFS inference on real-world datasets.

We have verified the correctness of our implementation on the (small) Wordnet dataset. We have not stored the output of our algorithm: 30B triples (each of them occupying 25 bytes using our dictionary encoding) produce 750 GB of data. Mapping these triples back to the original terms would require approx. 500 bytes per triple, amounting to some 15 TB of disk space.

In a distributed setting load balancing is an important issue. The Hadoop framework dynamically schedules tasks to optimize the node workload. Furthermore, our algorithms are designed to prevent load balancing problems by intelligently grouping triples (see sections 5.1 and 5.2). During experimentation, we did not encounter any load balancing issues.

⁶ e.g. at /topic/LargeTripleStores

6.2 Results for OWL reasoning

We have also encoded the OWL Horst rules [9] to investigate whether our approach can be extended for efficient OWL reasoning. The OWL Horst rules are more complex than the RDFS rules, and we need to launch more jobs to compute the full closure. Due to space restrictions, we refer to [16] for the algorithms and the implementation.

On the LUBM(50) benchmark dataset [5], containing 7M triples, we compute the OWL Horst closure on 32 nodes in about 3 hours, resulting in about