Implementation of MapReduce-based image conversion module in cloud computing environment


An Example-Based Overview of How MapReduce Works

MapReduce is a programming model and an associated method for the parallel processing of large-scale data sets (typically stored in a distributed file system).

It is commonly used to process and transform big data sets for applications in data mining, machine learning, databases, and other fields.

The core idea of MapReduce is to break a complex problem into many small sub-problems, assign the sub-problems to multiple processors (separate machines or processor cores), and finally combine the partial results into the final result.

The process consists of two phases: a Map phase and a Reduce phase.

1. Map phase: the input data set is divided into many small blocks, and each block is processed to produce intermediate results.

This processing is performed by a user-defined function that takes an input block and produces a set of key-value pairs.

These key-value pairs are then grouped and sent to the Reduce phase.

For example, suppose we want to count word frequencies in a large text file.

The Map phase splits the text into words and emits a key-value pair for each word occurrence (the key is the word, the value is a count of 1 for that occurrence).

2. Reduce phase: the intermediate results produced by the Map phase are aggregated, and a user-defined Reduce function processes the grouped key-value pairs to produce the final result.

The Reduce function takes a key and its associated group of values and produces an output value.

Continuing the word-count example, the Reduce phase sums the counts for every distinct word and writes the result to a new file that lists each word and its frequency.
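As a concrete illustration of the word-count example, here is a minimal sketch of the two user-defined functions written against the standard Hadoop MapReduce Java API. It is an illustrative sketch rather than code taken from the original article; the class names are arbitrary.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: split each input line into words and emit (word, 1) per occurrence.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);          // intermediate (word, 1) pair
                }
            }
        }

        // Reduce phase: sum the counts collected for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);            // final (word, total count) pair
            }
        }
    }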

The strength of MapReduce is that it can exploit the computing resources of many machines or processors to process large data sets in parallel.

At the same time, MapReduce provides a simple, easy-to-use programming interface, so users can process large data sets with little effort.

In practice, MapReduce has been widely applied in fields such as data mining, machine learning, and databases.

With MapReduce we can process and analyze large data sets with relative ease and extract more valuable information and knowledge from them.

Note, however, that MapReduce is not suitable for every kind of large-scale data processing task.

Some tasks may require other parallel processing models and methods.

Basic Concepts of MapReduce

MapReduce is a big-data processing model for the parallel processing of large data sets.

It was proposed by Google in 2004 and became one of the core components of Apache Hadoop.

The MapReduce model was designed to simplify parallel computing tasks, so that developers can process large-scale data efficiently on a distributed system.

The basic concepts of the model are as follows. 1. Input data set: MapReduce splits the input data set into many small blocks, and each block can consist of one or more key-value pairs.

2. Map function: the map function is the core operation that processes the input blocks in parallel.

It processes every key-value pair of an input block and produces a series of intermediate key-value pairs.

The map function can be customized as needed, for example to extract keywords or count occurrences.

3. Intermediate data set: MapReduce groups the intermediate key-value pairs by key, collecting all values that share the same key into the intermediate data set.

4. Reduce function: the reduce function processes the list of values for each intermediate key and produces the final output.

It usually performs an aggregation, such as computing a sum or an average.

5. Output data set: the results produced by the reduce function are stored in the output data set.

The MapReduce model works as follows. 1. Split the input data set: the large input data set is split into many small blocks, which are assigned to different compute nodes.

2. Map: each compute node applies the map function to its assigned blocks and produces intermediate key-value pairs.

3. Group: the intermediate values are grouped by key, so that values with the same key are handled together in the subsequent reduce step.

4. Reduce: each compute node applies the reduce function to its groups of intermediate values and produces the final output.

5. Merge the output: the outputs of all compute nodes are merged into the final output data set.

Advantages of the MapReduce model include: - Scalability: very large data sets can be processed, and distributed computing resources can be used in parallel to raise throughput.

- Fault tolerance: MapReduce has built-in fault-tolerance mechanisms; when a compute node fails, its tasks can be reassigned to other nodes.

- Flexibility: developers can define their own map and reduce functions to implement a wide range of data processing operations (a minimal job driver that wires such functions together is sketched below).
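The following sketch shows how user-defined map and reduce functions are wired into a runnable Hadoop job. It uses the Hadoop 2.x Java API; the classes WordCount.TokenizerMapper and WordCount.IntSumReducer refer to the word-count sketch shown earlier on this page, and the input and output paths are taken from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");

            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);   // user-defined map function
            job.setReducerClass(WordCount.IntSumReducer.class);    // user-defined reduce function
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS

            // Submit the job and wait; the framework handles splitting, scheduling,
            // shuffling, and fault tolerance.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }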

How Hadoop MapReduce Works

What is Hadoop MapReduce? Hadoop MapReduce is an important component of the Hadoop distributed computing framework.

It provides the ability to process large volumes of data in a distributed fashion; combined with the Hadoop Distributed File System (HDFS), it supports efficient data processing and analysis.

The basic principle of MapReduce: a large data set is broken into many small blocks, and multiple nodes in the cluster process these blocks in parallel.

The key is to distribute the data to the nodes for processing and then merge the partial results into the final result.

Concretely, MapReduce consists of two kinds of tasks: Map and Reduce.

Map tasks process independent portions of the raw input data and hand their outputs to the Reduce tasks.

Reduce tasks receive the Map outputs and merge them into a single result.

The execution of a MapReduce job can be divided into the following steps. Input: the input data is split into small blocks, which are distributed to multiple compute nodes.

Map phase: each compute node starts Map tasks that process its assigned blocks.

A Map task turns each block into a set of key-value pairs and groups them.

Shuffle phase: the key-value pairs in the Map outputs are sorted by key and grouped by key.

This step, called the shuffle, merges the outputs produced on different compute nodes into larger, partitioned lists of key-value pairs (a sketch of the partitioning function that drives this routing appears at the end of this section).

Reduce phase: the nodes start Reduce tasks that take the shuffled key-value lists, process them, and write the results to the output files.

MapReduce has the following advantages. It is well suited to big-data problems.

Because MapReduce uses distributed computing, a large data set can be split into many small blocks and processed in parallel, which greatly reduces processing time.

It is easy to scale.

Adding compute nodes is enough to expand the computing capacity.

It is highly fault tolerant.

Because processing is distributed, the failure of a single node does not bring down the whole computation.
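The routing of intermediate keys to reduce tasks is decided by a partitioning function. Hadoop's default HashPartitioner does essentially what the sketch below shows; writing it out only makes the key-to-reducer mapping explicit. A custom partitioner like this would be registered on the job with job.setPartitionerClass(WordPartitioner.class).

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Mask off the sign bit so the partition index is never negative,
            // then spread keys evenly over the reduce tasks.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }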

Google's Three Cloud Computing Papers (Chinese Translations)

Google is one of the world's largest providers of search and cloud computing services.

Google's cloud computing architecture and algorithms have received wide attention in the industry; the company introduced these technologies, and shared its practices, through a series of papers.

This article groups and discusses three of Google's cloud computing papers ("MapReduce: Simplified Data Processing on Large Clusters", "The Google File System", and "Bigtable: A Distributed Storage System for Structured Data") to help readers better understand the relevant technologies.

1. MapReduce: Simplified Data Processing on Large Clusters. The MapReduce paper, written by Jeffrey Dean and Sanjay Ghemawat, is one of the key works in Google's cloud computing line.

MapReduce is a large-scale data processing technique whose main purpose is the distributed and parallel execution of processing tasks on a large cluster.

MapReduce decomposes the computation into two parts: a Map phase and a Reduce phase.

In the Map phase the data is extracted and keyed; in the Reduce phase the data is collected and the results are computed.

The two phases can be executed in parallel on many physical nodes, which greatly improves computational efficiency.

In addition, the work builds on the GFS distributed file system, which provides strong file-system support for MapReduce.

2. The Google File System. GFS is described in a paper written jointly by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung.

It addresses the problems of a distributed file system that must cope with Google's very large data sets and with situations where two, three, or more machines fail.

GFS can handle data sets of more than 100 TB, speed up data reads and writes, and manage large-scale data storage clusters.

Modified Ant Colony Optimization Algorithm for the Discounted {0-1} Knapsack Problem

ZHANG Ming(1,2), DENG Wenhan(1,2), LIN Juan(1,2), ZHONG Yiwen(1,2)
1. College of Computer and Information Science, Fujian Agriculture and Forestry University, Fuzhou 350002, China; 2. Key Laboratory of Smart Agriculture and Forestry (Fujian Agriculture and Forestry University), Fuzhou 350002, China
The knapsack problem (KP) is a classic NP-hard problem with many variants, including the multidimensional knapsack problem, the unbounded knapsack problem, the group knapsack problem, and the discounted {0-1} knapsack problem (DKP). DKP was first proposed in [1] in 2007; it abstracts promotional sales in shopping malls and has practical applications in commerce, investment decision making, resource allocation, and cryptography.
(1) Based on the structure of DKP, item selection probabilities are computed by competition within each item group, which lowers the time complexity of the algorithm.
(2) Heuristic information is discarded without lowering the accuracy of the algorithm, which reduces the number of parameters used and simplifies parameter settings.
(3) A hybrid optimization operator based on value density and value is adopted to improve the search capability of the algorithm.
(4) The algorithm built from the above improvements shows good performance in solving DKP instances.

Design and Implementation of a Metering Device Online Monitoring and Intelligent Diagnosis System Based on Data Mining

Abstract: To strengthen real-time monitoring of field metering devices, data acquisition equipment, and distribution network operation, and to improve the accuracy of analysis of customers' abnormal electricity-use behavior, an online monitoring and intelligent diagnosis system for metering devices was developed. Using data mining tools, it evaluates the operating state of metering devices and acquisition equipment and performs intelligent diagnosis of contract-violating electricity use, electricity theft, and metering device faults.

The system adopts an SOA architecture and consists of three parts: an expert database of abnormal indicators, real-time monitoring and intelligent diagnosis, and a web application. It has been in trial operation at the Anhui Electric Power Company, where it plays an important role in ensuring the safe and stable operation of the power grid, in anti-electricity-theft work, and in reducing the disputes caused by metering deviations.

Keywords: expert database of abnormal indicators; terminal event determination algorithm; data mining analysis; overall state analysis. CLC number: TM933; document code: B; article number: 1001-1390(2014)00-0000-00.

Design and Implementation of Metering Device Online Monitoring and Intelligent Diagnosis System Based on Data Mining. Abstract: In order to strengthen the real-time monitoring of field metering devices, collecting devices, and distribution network operation, and to enhance the accuracy of analysis of abnormal power-use behavior, a metering device online monitoring and intelligent diagnosis system has been developed in this paper. It realizes the state evaluation of operating metering devices and acquisition equipment using data mining methods, and carries out intelligent diagnosis of breach-of-contract electricity use and electricity stealing as well as metering device faults. Adopting an SOA architecture, the system consists of three parts: the expert database of abnormal indices, real-time monitoring and intelligent diagnosis, and the web application. The system has been run on a trial basis at the Anhui electric power company, and the results show that it plays an important role in ensuring the secure and stable operation of power grids, in anti-electricity-stealing work, and in reducing disputes caused by metering deviation. Key words: expert database of abnormal index; algorithm for determining the terminal event; data mining analysis; overall state analysis.

0 Introduction. Since the State Grid Corporation of China started building the electricity consumption information acquisition system in 2009, a cumulative total of 133 million smart electricity meters had been installed and 139 million customers were covered by data acquisition as of December 2013.

A MapReduce Implementation of Decision Tree Classification
public class DecisionTreeC45Job extends AbstractJob {

    /**
     * Prepare the data set: write the training data (with default values
     * already filled in) to a local file and upload it to HDFS again.
     */
    public String prepare(Data trainData) {
        String path = FileUtils.obtainRandomTxtPath();
        DataHandler.writeData(path, trainData);
        System.out.println(path);
        String name = path.substring(path.lastIndexOf(File.separator) + 1);
        String hdfsPath = HDFSUtils.HDFS_TEMP_INPUT_URL + name;
        HDFSUtils.copyFromLocalFile(conf, path, hdfsPath);
        return hdfsPath;
    }
}

// Split the data set and upload the resulting splits to HDFS.
DataSplit dataSplit = DataHandler.split(new Data(data.getInstances(), attribute, splitPoints));
for (DataSplitItem item : dataSplit.getItems()) {
    // ... each split item is written to HDFS (the rest of this loop is not included in the excerpt)
}

// Load the test set from HDFS.
Path testSetPath = new Path(testSet);
FileSystem testFS = testSetPath.getFileSystem(conf);
Path[] testHdfsPaths = HDFSUtils.getPathFiles(testFS, testSetPath);
FSDataInputStream fsInputStream = testFS.open(testHdfsPaths[0]);
Data testData = DataLoader.load(fsInputStream, true);
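DataHandler, FileUtils, HDFSUtils, DataLoader, and the other helper classes above are project-specific and are not shown in this excerpt. For orientation only, a helper such as HDFSUtils.copyFromLocalFile presumably wraps the standard Hadoop FileSystem API along the following lines; this is an assumption, not the project's actual code.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public final class HdfsCopy {
        // Copy a local file into HDFS using the standard FileSystem API.
        public static void copyToHdfs(Configuration conf, String localPath, String hdfsPath)
                throws Exception {
            FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
            fs.copyFromLocalFile(new Path(localPath), new Path(hdfsPath));
        }
    }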

An Illustrated Explanation of How MapReduce Works

Preface: MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB).

The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming languages, along with features borrowed from vector programming languages.

It makes it much easier for programmers who have never written distributed parallel programs to run their own programs on a distributed system.

Current software implementations let the programmer specify a Map function that maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function that ensures that all mapped values sharing the same key are handled together.

Now to the main topic. This article analyzes two things. Contents: 1. the execution flow of a MapReduce job; 2. the shuffle and sort process in the Map and Reduce tasks. Main text: 1. Execution flow of a MapReduce job (the original article includes a flow diagram drawn in Visio 2010). Flow analysis: 1. A job is started on the client.

2. The client requests a Job ID from the JobTracker.

3. The resources needed to run the job are copied to HDFS, including the JAR file that packages the MapReduce program, the configuration files, and the input split information computed by the client.

These files are stored in a directory that the JobTracker creates specifically for the job.

The directory is named after the job's Job ID.

The JAR file has 10 replicas by default (controlled by the mapred.submit.replication property; see the configuration sketch at the end of this section); the input split information tells the JobTracker how many map tasks should be started for the job.

4. After receiving the job, the JobTracker places it in a job queue and waits for the job scheduler to schedule it (much like process scheduling in an operating system). When the scheduler picks the job according to its scheduling algorithm, it creates one map task per input split and assigns the map tasks to TaskTrackers for execution.

For map and reduce tasks, each TaskTracker has a fixed number of map slots and reduce slots, determined by the number of cores and the amount of memory on the host.
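For reference, the JAR replication factor mentioned above is an ordinary Hadoop configuration property. A minimal sketch of setting it programmatically, using the Hadoop 1.x-era property name and assuming org.apache.hadoop.conf.Configuration is imported:

    Configuration conf = new Configuration();
    // Number of HDFS replicas created for the job JAR and other submission
    // files when a job is submitted; the default is 10.
    conf.setInt("mapred.submit.replication", 10);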

Lin Ziyu, Principles and Applications of Big Data Technology: Answers to the Chapter 7 Exercises

Answers to the Chapter 7 exercises of Principles and Applications of Big Data Technology (by Li Li). 1. Describe the relationship between MapReduce and Hadoop.

Google first proposed the distributed parallel programming model MapReduce; Hadoop MapReduce is its open-source implementation.

Google's MapReduce runs on the distributed file system GFS; similarly, Hadoop MapReduce runs on the distributed file system HDFS.

Comparatively, Hadoop MapReduce has a much lower barrier to entry than Google's MapReduce: even programmers without any experience in distributed programming can easily develop distributed programs and deploy them on a computer cluster.

2. MapReduce is a powerful tool for processing big data, but not every task can be handled with MapReduce.

Describe the requirements that a task or data set must satisfy to be suitable for MapReduce.

A data set suitable for MapReduce must satisfy one precondition: it can be decomposed into many small data sets, each of which can be processed completely in parallel.

3. The MapReduce model uses a Master (JobTracker) - Slave (TaskTracker) structure. Describe the functions of the JobTracker and the TaskTracker.

The MapReduce framework adopts a Master/Slave architecture, with one Master and several Slaves.

The JobTracker runs on the Master and a TaskTracker runs on each Slave. Every computing job submitted by a user is divided into several tasks.

The JobTracker is responsible for scheduling jobs and tasks, monitoring their execution, and rescheduling tasks that have failed.

The TaskTracker is responsible for executing the tasks assigned to it by the JobTracker.

4.; 5. What impact does a TaskTracker failure have, and how is such a failure handled? 6. The core of the MapReduce computing model is the Map function and the Reduce function. Describe the input, output, and processing of each.

The input of the Map function is file blocks from the distributed file system; these blocks can be in any format, such as documents or binary data.

How MapReduce Works

MapReduce is a programming model and algorithm for processing large-scale data sets.

It works on the principle of divide and conquer: the data is split into many small blocks, the blocks are processed in parallel on different compute nodes, and the intermediate results are then merged to obtain the final result.

The MapReduce process can be divided into two phases: the Map phase and the Reduce phase.

In the Map phase, the large input data set is split into many small blocks that are processed in parallel by different compute nodes.

Each compute node runs a user-defined Map function that turns each input block into a series of intermediate (key, value) pairs.

Because the Map function instances execute independently, this step is naturally parallel.

In the Reduce phase, all intermediate results are sorted by key and grouped.

Each compute node then runs a user-defined Reduce function that aggregates the values of each group and generates the final result.

Processing in the Reduce phase is ordered, which ensures the correctness and consistency of the data.

The working principle of MapReduce is therefore based on splitting the data, processing it in parallel, and merging the results.

By dividing a large data set into small blocks and processing them simultaneously on many compute nodes, the efficiency and speed of data processing can be greatly improved.

The final result is then obtained by merging the intermediate results.

This distributed computing model has clear advantages for large data sets and is widely used in fields such as search engines, data analysis, and machine learning.

CVaaS: Computer Vision as a Cloud Software Service

CVaaS: Computer Vision as a Cloud Software Service. Chad DeLoatch, James O'Neill, Jeff Chih-Yuan Lin. University of Illinois at Urbana-Champaign. February 22, 2011.

Abstract. Computer vision [1] is the transformation of data from a still or video camera into either a decision or a new representation. All such transformations are done for achieving some particular goal. The input data may include some contextual information, such as "the camera is mounted in a car" or "a laser rangefinder indicates an object is 1 meter away". The decision might be "there is a person in this scene" or "there are 14 tumor cells on this slide". A new representation might mean turning a color image into a grayscale image or removing camera motion from an image sequence. Cloud computing is now receiving a lot of attention from scientists as a technology for running scientific applications. While the definition of cloud computing tends to be in the eye of the beholder, in this context we consider the cloud as a cluster that offers virtualized services, service-oriented provisioning of resources and, most important of all, pay-as-you-go pricing. Cloud computing is now offered by a number of commercial providers, and open-source clouds are also available. The benefits are obvious: end users pay only for what they use, the strain on local system maintenance and power consumption is reduced, and resources are available on demand (elasticity) and are not subject to scheduling overheads. This paper provides an overview of CVaaS (Computer Vision as a Cloud Software Service) as a software platform for performing advanced computer vision analysis and processing on large real-time and batch image datasets in the cloud.

1 Introduction

CVaaS is a software platform for performing advanced computer vision analysis and processing on large real-time and batch image datasets in the cloud. The platform provides a uniform development model for performing advanced computer vision analysis and processing tasks such as image processing (smoothing, morphology, thresholding, etc.), image transformations, image segmentation, image tracking/motion detection, and object detection. CVaaS uses the cloud infrastructure (IaaS) and platform services (PaaS) to execute computer vision tasks in parallel and provide a robust and scalable execution environment. The rest of this paper is organized as follows.
Section 2 discusses the related work. Section 3 provides details of the roadmap, implementation schedule, and milestones for CVaaS.

2 Related Work

MODISAzure [4] is a cloud application that uses the Microsoft Azure cloud platform to process and analyze large environmental satellite image datasets. MODISAzure is one of the first eScience applications to use the Microsoft Windows Azure cloud platform. A typical execution of MODISAzure produces an analysis of environmental characteristics for each day being studied (in a bag-of-tasks style) and then aggregates the day-of-year results. The GIS image processing platform [3] uses cloud computing as a platform for processing 3D GIS data that requires a great amount of computational resources because of the complexity and large amounts of spatial information.

3 CVaaS and Computer Vision Cloud Processing

The majority of the service will be implemented using Hadoop [2], OpenCV [6], and Amazon Web Services (AWS) [8] technologies. Hadoop is a software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. OpenCV is an open-source computer vision library. OpenCV was designed for computational efficiency and with a strong focus on real-time applications. OpenCV is written in optimized C and can take advantage of multicore processors. OpenCV is aimed at providing the basic tools needed to solve computer vision problems. AWS is a collection of remote computing services (also called web services) that together make up a cloud computing platform. The following sections contain the service components, implementation schedule, and milestones.

3.1 Cloud Environment Setup
Use the existing UIUC AWS EC2 instance or create a new AWS EC2 instance with Hadoop. Configure Hadoop and install OpenCV. Estimated duration: 3 days.

3.2 Hadoop/OpenCV Service Component Library
Design and implement a service component library in C++ or Java. The library will use Hadoop and OpenCV to perform advanced computer vision analysis and processing tasks in parallel. Estimated duration: 2.5 weeks.

3.3 AWS Web Service
Design and implement an AWS web service. The web service will implement and publish an API that will allow clients to perform computer vision analysis and processing in the cloud. The service will use the service component library described above to perform the computer vision operations. Estimated duration: 1 week.

3.4 iPhone or Android Mobile Client
Design and implement a mobile application in Objective-C or Java. The mobile application will use the AWS iOS and Android SDKs to communicate with the AWS web service described above to perform computer vision analysis and processing in the cloud. The mobile application may also provide real-time and batch image datasets as inputs for the service. Estimated duration: 1 week.

3.5 Analyze Results and Complete Final Paper
Analyze results and complete the final paper and presentation.
Estimated duration: 2 weeks.

References
[1] Computer Vision. The Computer Vision.
[2] Hadoop. Apache Software Foundation.
[3] Cloud Computing Platform for GIS Image Processing in U-city. Jong Won Park, Chang Ho Yun, Shin-gyu Kim, Heon Y. Yeom, Yong Woo. School of Electrical Engineering, University of Seoul; School of Computer Science and Engineering, Seoul National University. Advanced Communication Technology (ICACT), 2011 13th International Conference, 13-16 Feb. 2011, pp. 1151-1155.
[4] Assessing the Value of Cloudbursting: A Case Study of Satellite Image Processing on Windows Azure. M. Humphrey, Z. Hill, C. van Ingen, K. Jackson, and Y. Ryu. IEEE e-Science 2011 Conference, Stockholm, Sweden, Dec 5-8, 2011.
[5] The Application of Cloud Computing to the Creation of Image Mosaics and Management of Their Provenance. G. Bruce Berriman, Infrared Processing and Analysis Center, California Institute of Technology; Ewa Deelman, Paul Groth, and Gideon Juve, Infrared Processing and Analysis Center, California Institute of Technology.
[6] OpenCV. Willow Garage/Intel. /wiki.
[7] Using Transaction Based Parallel Computing to Solve Image Processing and Computational Physics Problems. Harold Trease, Daniel Fraser, Rob Farber, Stephen Elbert, Pacific Northwest National Laboratory. Cloud Computing and Its Applications.
[8] Amazon Web Services. Amazon.
[9] Astronomy in the Cloud: Using MapReduce for Image Coaddition. K. Wiley, A. Connolly, J. Gardner, and S. Krugho, Survey Science Group, Department of Astronomy, University of Washington; and M. Balazinska, B. Howe, Y. Kwon, and Y. Bu, Database Group, Department of Computer Science, University of Washington. Publications of the Astronomical Society of the Pacific, Vol. 123, No. 901 (March 2011), pp. 366-380.
[10] Learning OpenCV. Gary Bradski, Adrian Kaehler. O'Reilly, Copyright 2008.

MapReduce: Simplified Data Processing on Large Clusters

Presentation outline: Introduction, Programming Model, Implementation, Refinements, Performance, Experience.

Introduction - what is MapReduce? The individual computations are straightforward, but the input data is large, so the computation has to be distributed across many machines to finish in a reasonable time. That in turn requires parallelizing the computation, distributing the data, and handling failures, all of which make the code complex. MapReduce is the answer: it lets the programmer express the simple computation while the library hides the messy details of parallelization, fault tolerance, and data distribution.

Programming model. The MapReduce library takes key-value pairs as input and produces key-value pairs as output; the user supplies a Map function and a Reduce function. The word-count pseudo-code:

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

For an example document named "Hello" with contents "I am a programmer, you are also a programmer.", Map emits one (word, 1) pair per occurrence, and Reduce produces (I, 1), (am, 1), (a, 2), (programmer, 2), (you, 1), (are, 1), (also, 1). Other computations that fit the model include distributed grep, count of URL access frequency, reverse web-link graph, term-vector per host, inverted index, and distributed sort.

Implementation - how it works. The overall flow of a MapReduce operation: (1) the library splits the input files into M pieces; (2) a master assigns map and reduce tasks to workers; (3) a map worker reads its split, applies the Map function, and buffers the intermediate key-value pairs in memory; (4) the buffered pairs are written to local disk, partitioned into R regions by the partitioning function, and the master learns the locations of these buffered pairs; (5-7) reduce workers read and sort the intermediate data, apply the Reduce function to each key and its corresponding set of intermediate values, and append the output to the final output file for their partition. When all map and reduce tasks have been completed, the master wakes up the user program.

Master data structures: the state of each task and the identity of the worker machine. Fault tolerance: the master pings every worker periodically; if there is no response within a certain amount of time, the master marks the worker as failed (master failure is handled with checkpoints). Locality: the master schedules a map task on a machine that contains a replica of the corresponding input data. Task granularity: there are practical bounds on how large M and R can be, since the master makes O(M + R) scheduling decisions and keeps O(M x R) state. Backup tasks: near the end of a job, backup executions of the remaining in-progress tasks are started, and a task is marked as completed as soon as either its primary or its backup execution completes.

Refinements: partitioning function, ordering guarantees (within a given partition, the intermediate key-value pairs are processed in increasing key order), combiner function, input and output types, side effects, skipping bad records, local execution, status information, and counters. The combiner is motivated by word count: without it, a Reduce task receives a very long list of pairs such as (the, 1) repeated many times; a combiner performs partial merging on the map side so the reducer receives, for example, (the, 100) instead.
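In Hadoop, the combiner described above is enabled directly through the job API. The fragment below is a minimal illustration that assumes the WordCount classes and the Job object named job sketched earlier on this page; because integer addition is associative and commutative, the reducer class can safely double as the combiner.

    // Run IntSumReducer on the map side as a combiner: each map task emits
    // per-word partial sums instead of a long list of (word, 1) pairs, and the
    // reduce side simply re-sums those partial sums.
    job.setCombinerClass(WordCount.IntSumReducer.class);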

The Working Principle of MapReduce

MapReduce is a programming model for big-data processing that breaks a large task into small tasks and executes them in parallel on different compute nodes.

MapReduce was designed from the outset to handle big data; its distributed processing capability and scalability have made it an important part of the cloud computing era.

Its basic working principle is to divide a large data set into many small blocks, process each block in turn, and then merge the results of the sub-tasks into the final result.

The MapReduce model has two phases: a Map phase and a Reduce phase.

In the Map phase, the model divides the input data set into blocks of roughly equal size and starts one Map task per block, feeding the block to the Map function.

The Map function transforms each input key-value pair, through a series of conversions, into a series of new key-value pairs.

These new key-value pairs are partitioned and sorted and finally passed to the Reduce phase for processing.

In the Reduce phase, the model aggregates and sorts the key-value pairs emitted by all Map tasks, and all pairs with the same key are sent to the same Reduce task.

The Reduce function aggregates all input key-value pairs by key and writes the aggregated results to a new output file.

The strengths of the MapReduce model are its high scalability and high fault tolerance.

Because the Map tasks do not interfere with one another and the Map function instances do not interact, a MapReduce computation can be parallelized very easily.

At the same time, to cope with possible node failures, the MapReduce model persists the data produced during task processing so that failed work can be recovered.

In short, the MapReduce model is a very good way to process big data.

The model splits the input data set into small blocks, starts one Map task per block, feeds the Map outputs into the Reduce function for aggregation, and finally obtains the overall result.

Its scalability and fault tolerance are both strong, which has made it a very good solution for today's big-data processing.

Implementing the MapReduce Programming Model

What is MapReduce? MapReduce is a programming model and algorithm for processing large-scale data sets.

It was proposed by Google and described in detail in a paper published in 2004.

MapReduce divides a data processing task into two phases: Map and Reduce.

In the Map phase, the input data is split into many small blocks, and each block is processed by one Map task.

A Map task maps the input data to (key, value) pairs and emits these pairs as intermediate results.

In the Reduce phase, the outputs of all Map tasks are merged and grouped by key.

Each group of data is processed by one Reduce task.

The Reduce task performs a specific operation on each (key, value) group and generates the final output.

By dividing a large data set into many small blocks and processing the blocks in parallel, MapReduce can process large data sets efficiently while providing scalability and fault tolerance.

Implementing the MapReduce programming model requires several key elements. Distributed computing framework: a distributed computing framework is needed to manage task scheduling, data distribution, result collection, and related operations.

Commonly used distributed computing frameworks include Apache Hadoop and Apache Spark.

Map function: a Map function must be defined to perform the specific mapping operation.

The Map function maps the input data to (key, value) pairs and emits these pairs as intermediate results.

Map functions are usually independent and can execute in parallel; each Map task processes part of the input data.

The concrete implementation of the Map function depends on the application scenario and its requirements.

Reduce function: besides the Map function, a Reduce function must be defined to perform the specific merging and computation.

The Reduce function operates on each (key, value) group and generates the final output.

Reduce functions are also independent and can execute in parallel; each Reduce task processes one group of (key, value) pairs.

The concrete implementation of the Reduce function likewise depends on the application scenario and its requirements.

Data splitting and distribution: the input data must be split into many small blocks and distributed to the different Map tasks for processing, as sketched below.
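In Hadoop, this splitting step is handled by the InputFormat. The fragment below is a small sketch that assumes a Job object named job (as in the driver example earlier on this page) and the same imports, plus org.apache.hadoop.mapreduce.lib.input.TextInputFormat; the path /data/input is a placeholder.

    // Each input split (by default roughly one HDFS block) becomes one map task.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/data/input"));

    // Optionally bound the split size, so very large files are broken into
    // more (or fewer) map tasks than the block size alone would give.
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // 128 MB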

The MapReduce Paper (Chinese Translation)

Google MapReduce, Chinese translation (translator: alex). MapReduce is a programming model, and an associated implementation, for processing and generating very large data sets.

The user first writes a Map function that processes a key/value-pair data set and produces an intermediate key/value-pair data set, and then writes a Reduce function that merges all intermediate values associated with the same intermediate key.

Many real-world tasks fit this processing model, as the paper describes in detail.

Programs written in the MapReduce style can be parallelized automatically on large clusters of commodity machines.

At run time, the system takes care of the details: how to partition the input data, how to schedule execution across the machines of the cluster, how to handle machine failures, and how to manage the necessary communication between the machines of the cluster.

The MapReduce architecture allows programmers without any experience in parallel computing or distributed systems to make effective use of the abundant resources of a distributed system.

Our MapReduce implementation runs on a flexibly scalable cluster of commodity machines: a typical MapReduce computation runs on several thousand machines and processes terabytes of data.

Programmers find the system easy to use: hundreds of MapReduce programs have already been implemented, and more than a thousand MapReduce jobs run on Google's clusters every day.

Over the past five years, many Google programmers, including the authors of this paper, have implemented hundreds of special-purpose computations to process huge amounts of raw data.

These computations process raw data such as crawled documents (produced by web-crawler-like programs) and web request logs, and also compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, and the set of queries requested most frequently in a given day.

Most of these data processing computations are conceptually straightforward.

However, because the input data is huge, the computations have to be distributed across hundreds or thousands of machines in order to finish in an acceptable amount of time.

How to handle parallel computation, how to distribute the data, and how to handle errors? Taken together, these questions require large amounts of code, which makes the originally simple computation hard to manage.

MapReduce Patterns, Algorithms, and Use Cases - Hadoop - TechTarget Business Intelligence
class Mapper
    method Map(id n, object N)
        Emit(id n, object N)
        for all id m in N.OutgoingRelations do
            Emit(id m, message getMessage(N))

class Reducer
    method Reduce(id m, [s1, s2, ...])
        // (the body of this reducer is not included in the excerpt)

class N
    State in {True = 2, False = 1, null = 0},
        initialized 1 or 2 for end-of-line categories, 0 otherwise
    method getMessage(object N)
        return N.State
    method calculateState(state s, data [d1, d2, ...])
        return max([d1, d2, ...])

If the counts to be accumulated cover not just the contents of a single document but all the documents handled by one Mapper node, a Combiner is needed:

class Mapper
    method Map(docid id, doc d)
        for all term t in doc d do
            Emit(term t, count 1)

class Reducer
    method Reduce(term t, counts [c1, c2, ...])
        sum = 0
        for all count c in [c1, c2, ...] do
            sum = sum + c
        Emit(term t, count sum)

Implementation of MapReduce-based Image Conversion Module in Cloud Computing Environment

Hyeokju Lee, Division of Internet & Multimedia Engineering, Konkuk University, Seoul, Korea, hjlee09@konkuk.ac.kr
Myoungjin Kim, Joon Her, and Hanku Lee*, Division of Internet & Multimedia Engineering, Konkuk University, Seoul, Korea, {tough105, herj00n, hlee}@konkuk.ac.kr

Abstract - In recent years, the rapid advancement of the Internet and the growing number of people using social networking services (SNSs) have facilitated the sharing of multimedia data. However, multimedia data processing techniques such as transcoding and transmoding impose a considerable burden on the computing infrastructure as the amount of data increases. Therefore, we propose a MapReduce-based image-conversion module in a cloud computing environment in order to reduce the burden on computing power. The proposed module consists of two parts: a storage system, i.e., the Hadoop distributed file system (HDFS), for image data, and a MapReduce program with the Java Advanced Imaging (JAI) library for image transcoding. It can process image data in distributed and parallel cloud computing environments, thereby minimizing the computing infrastructure overhead. In this paper, we describe the implementation of the proposed module using Hadoop and JAI. In addition, we evaluate the proposed module in terms of processing time under varying experimental conditions.

Keywords - Cloud Computing; Hadoop; MapReduce; HDFS; Image Conversion

I. INTRODUCTION

The wide availability of inexpensive hardware such as personal computers, digital cameras, smartphones, and other easy-to-use technologies has enabled the average user to dabble in multimedia. The phenomenal growth of Internet technologies such as social networking services (SNSs) allows users to disseminate multimedia objects. SNS and media content providers are constantly working toward providing multimedia-rich experiences to end users. Although the ability to share multimedia objects makes the Internet more attractive to consumers, clients and underlying networks are not always able to keep up with this growing demand. Users access multimedia objects not only from traditional desktops but also from mobile devices, such as smart phones and smart pads, whose resources are constrained in terms of processing, storage, and display capabilities.

Multimedia processing is characterized by large amounts of data, requiring large amounts of processing, storage, and communication resources, thereby imposing a considerable burden on the computing infrastructure [1]. The traditional approach to transcoding multimedia data requires specific and expensive hardware because of the high-capacity and high-definition features of multimedia data. Therefore, general-purpose devices and methods are not cost effective, and they have limitations. Recently, transcoding based on cloud computing has been investigated in some studies [2][3].

In this study, we design and implement an image-conversion module based on MapReduce and HDFS (Hadoop distributed file system) in order to address the problems mentioned above. The proposed module consists of two parts. The first part stores a large amount of image data into HDFS for distributed parallel processing. The second part processes the stored image data in HDFS using the MapReduce framework and Java Advanced Imaging (JAI) for converting image data into target formats.
We use the SequenceFiles method to address the problem of processing small files in the Map function. We perform two experiments to demonstrate the proposed module's transcoding performance. In the first experiment, we compare the proposed module with a non-Hadoop-based single program running on two different machines. In addition, we conduct a performance evaluation of the proposed module according to the Java Virtual Machine (JVM) reuse option for the problem of many small files.

The remainder of this paper is organized as follows. In section 2, we introduce Hadoop HDFS, MapReduce, and JAI. The module architecture and its features are proposed in section 3. In section 4, we describe the implementation of the module. The results of the evaluation are presented in section 5. Finally, section 6 concludes this paper with suggestions for future research.

Related Work

A. HDFS

HDFS is the primary storage system used by Hadoop applications [5]. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable and extremely rapid computations. HDFS has a master-slave structure and uses the TCP/IP protocol to communicate with each node. Figure 1 shows the structure of HDFS.
However, multimedia data processing is time consuming and requires large computing resources. To solve this problem, we designed an image conversing module that exploits the advantages of cloud computing. The proposed module can resize and convert images in a distributed and parallel manner.The proposed module use HDFS as storage for distributed parallel processing. The image data is distributed in HDFS. For distributed parallel processing, the proposed module uses the Hadoop MapReduce framework. In addition, the proposedmodule uses the JAI library in Mapper for image resizing and conversion. Figure 3 shows the proposed module architecture.Figure 3. Image Conversion Module ArchitectureAs shown in Figure 3, the proposed module stores image data into HDFS. HDFS automatically distributes the image data to each data node. The Map function processes each image data in a distributed and parallel manner.The proposed module does not have a summary or construction stage. Thus, there is no need to implement the Reduce function in the proposed module; only the Map function is implemented.III. I MPLEMENTATION OF IMAGE CONVERSION MODULE Figure 4 shows the programming elements of the image conversion module. This diagram shows the implementation ofthe proposed module. The steps for programming the processes are presented below:Figure 4. Programming Elements of Image Conversion ModuleFirst, The Conversion module reads image data from HDFS using the RecordReader method of the class InputFormat. InputFormat transforms the image data into sets of Keys (file names) and Values (bytes).Second, InputFormat passes the sets of Keys and Values to the Mapper. The Mapper processes the image data using the user defined settings and methods for image conversion via the JAI library. The conversion module converts the image data into specific formats suitable for a variety of devices such as smart phones, pads and personal computers in a fully distributed manner. The Mapper completes the the image conversion and passes the results to OutputFormat as Key (file name) and Value (byte).Finally, the Mapper passes the set of Key and Value to OutputFormat. The RecordWriter method of the OutputFormat class writes the result as a file to HDFS.In this study, the image conversion module was implemented on the basis of Hadoop. However, small chunked files bring problems for the Hadoop MapReduce process. Map tasks usually process a single block of input data at each time instant. If there are many small files, then each Map task processes only a small amount of input data, and as a result, there are many unscheduled Map tasks, each of which imposes extra bookkeeping overhead. Consider a 1-GB file, broken into 16 64-MB blocks, and approximately 10,000 100-KB files. The 10,000 files may require tens or hundreds of times more processing time than an equivalent single-input file.To alleviate the bookkeeping overhead, we exploit some inherent features of Hadoop. In particular, we run multiple Map tasks in one JVM by reusing JVM tasks, thereby avoiding some overhead associated with JVM startup.The other method considered is the SequenceFiles method in the Map function. The SequenceFiles method uses the filename as the key and the file contents as the value. This method is optimized for use with many small files.In the proposed module, we use the BytesWritable interface of Hadoop for inputting the data contents of the image. 
The proposed module converts the size and format of the image using the following options: maxWidth, maxHeight, Image Format.IV. E VALUATIONThe cloud server used in the experiments for evaluation is a single enterprise scale cluster that consists of 28 computational nodes. Table 1 lists the specifications of the evaluation cluster. Because the structure of the cluster is homogeneous, it provides a uniform evaluation environment.Nine data sets were used to verify the performance of the proposed module. The average size of an image files was approximately 19.8 MB. Table 2 lists the specific information about the data sets used.During the experiment, the following default options in Hadoop were used. (1) The number of block replications was set to 3, and (2) the block size was set to 64 MB.We evaluated the performance of the proposed module and optimized it. We planned and executed two experiments. In the first experiment, we compared the proposed module with a non-Hadoop-based single program on two different single machines. The specifications of the two machines are listed in Table 3.TABLE II.I MAGE DATASETSSize (GB) 124810204050100Numbers 521042084165201040211125945188Format JPG SourceFlickerTABLE I. E VALUATION C LUSTER S PECIFICATIONS (28 NODES ) CPUIntel Xeon 4 Core DP E5506 2.13GHz * 2EARAM 4GB Registered ECC DDR * 4EAHDD 1TB SATA-2 7,200 RPMOS Linux CentOS 5.5 Java Java 1.6.0_23 Hadoop Hadoop-0.20.2JAIJAI (Java Advanced Imaging) 1.1.3We measured each running time taken in our cloud server using MapReduce programming and taken in machine A and B applying only sequential programing using JAI libraries without MapReduce, respectively. Figure 5 shows the result of the first experiment.Figure 5. Elapsed Time for Proposed Module with Two Different MachinesThe elapsed times in machines A and B are less than the run time taken in less than 2 nodes in our cluster. In cases 1 and 2, the performance without Hadoop in machines A and B is better than that in our cluster because in MapReduce programming, the nodes distribute the processing, thereby causing overhead associated with the creation of map tasks, job scheduling, and transporting speed of HDFS.In the second experiment, we compare the JVM reuse option 1 with the JVM reuse option -1. Option 1 reuses JVM only one time. There is no limit on the number of times Option -1 reuses JVM. Figure 6 shows the result of the second experiment.Figure 6. Elapsed Time for JVM Reuse OptionThe elapsed time for the two options remains the same until the number of files reaches 520. However, after 1040 files, the difference between the performances of the two options grows. When the number of processing files exceeds a certain level, the task of creating Map generates JVM overhead. The option of reusing JVM is a possible solution to reduce overhead created by processing numerous small files on HDFS, as can be seen in the results presented above.V.C ONCLUSIONS & F UTURE W ORKRecently, the wide availability of inexpensive hardware, such as personal computers, digital cameras, smart phones, and other easy-to-use technologies has enabled the average user to dabble in multimedia. The phenomenal growth of Internet technologies such as SNSs allows users to disseminate multimedia objects. However, the increasing amount of multimedia data imposes a considerable burden on the internet infrastructure required to process image conversion function to provide reliable services to numerous heterogeneous devices. 
Thus, we designed and implemented a MapReduce-base image conversion module in a cloud computing Environment. The proposed module is based on Hadoop HDFS and the MapReduce framework for distributed parallel processing of large-scale image data. We redesigned and implemented InputFormat and OutputFormat in the MapReduce framework for image data. We used the JAI library for converting the image format and resizing the images. We exploited the advantages of cloud computing to handle multimedia data processing.We performed two experiments to evaluate the proposed module. In the first experiment, we compared the proposed module with a non-Hadoop-based single program using the JAI library. The proposed module shows better performance than the single program. In the second experiment, we changed the mapred.job.reuse.jvm.num.task option in the mapred-site.xml file, and we evaluated its performance. The results of the second experiment show that when the proposed module processes large numbers of small files, it exhibits better performance.Future research should focus not only on image data but also on video data. We plan to implement an integrated multimediaTABLE III. S PECIFICATIONS O F T HE T WO MACHINESMachine A Specifications CPU AMD Athlon II-X4 632 2.9GHzRAM 4GB DDRHDD 600GB SATA-2 7,200 RPMOS Linux CentOS 5.5 Java Java 1.6.0_23JAI JAI (Java Advanced Imaging) 1.1.3Machine B SpecificationsCPU Intel Xeon 4 Core DP E5506 2.13GHz * 2EARAM 4GB Registered ECC DDR * 4EAHDD 1TB SATA-2 7,200 RPMOS Linux CentOS 5.5 Java Java 1.6.0_23JAIJAI (Java Advanced Imaging) 1.1.3process system and a multimedia share system for SNS in a cloud-computing environment.A CKNOWLEDGMENTThis research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised bythe NIPA (National IT Industry Promotion Agency (NIPA-2011 – (C1090-1101-0008)).R EFERENCES[1]Sun-Moo Kang, Bu-Ihl Kim, Hyun-Sok Lee, Young-so Cho, Jae-SupLee, Byeong-Nam Yoon, “A study on a public multimedia seviceprovisioning architecture for enterprise networks”, Network Operationsand Management Symposium, 1998, NOMS 98., IEEE, 15-20 Feb 1998,44-48 vol.1, ISBN : 0 -7803-4351-4[2]Hari Kalva, Aleksandar Colic, Garcia, Borko Furht, “Parallelprogramming for multimedia applications”, MULTIMEDIA TOOLSAND APPLICATIOS, volume 51, number 2, 901-818, DOI:10.1007/s11042-010-o656-2[3]Gracia, A., Kalva, H., “Cloud transcoding for mobile video contentdelivery”, Consmer Electronics(ICCE), 2011 IEEE InternationalConference on, 9-12 Jan. 2011, 379-380, ISSN : 2158-3994[4]/blog/2009/02/the-small-files-problem/[5]Hadoop Distributed File System : /hdfs/[6]Jeffrey Dean, Sanjay Ghemawat, “MapReduce : Simplified DataProcessing on large Cluster”, OSDI`04 : Sixth Symposium on OperatingSystem Design and Implementation, San Francisco, CA, December,2004.[7]Java Advanced Imaging Library :/javase/technologies/desktop/media/jai/[8]Shivnath Babu, “Towards Automatic Optimization of MapReducePrograms”, The 1st ACM symposium on Cloud Computing, 2010.。
