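This article walks through three cloud computing papers published by Google ("MapReduce: Simplified Data Processing on Large Clusters", "The Google File System", and "Bigtable: A Distributed Storage System for Structured Data"), one at a time, to help readers better understand the technologies behind cloud computing.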

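1. MapReduce: Simplified Data Processing on Large Clusters

The MapReduce paper is one of Google's most important contributions to cloud computing. Its authors are Jeffrey Dean and Sanjay Ghemawat.

MapReduce is a large-scale data processing technique whose main purpose is to process tasks on a large cluster through distributed and parallel execution.

MapReduce splits the computation logic into two parts: a Map phase and a Reduce phase.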
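
The two phases can be illustrated with a small, single-process sketch. This is not Google's implementation: it only mimics the data flow in memory, using word counting as the example task. The function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical and chosen for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: merge all values that share a key into a single result."""
    return key, sum(values)


def run_job(documents):
    """Run map over every record, group by key, then reduce each group."""
    intermediate = []
    for doc in documents:
        intermediate.extend(map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


counts = run_job(["the quick fox", "the lazy dog", "The fox"])
```

In a real cluster the map calls run on many machines at once and the grouping step moves data across the network. The sketch keeps everything in one process so the division of work between the two phases stays visible.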




2. The Google File System

The GFS paper was written jointly by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung.




Implementation of MapReduce-based Image Conversion Module in Cloud Computing Environment

Hyeokju Lee
Division of Internet & Multimedia Engineering
Konkuk University
Seoul

Myoungjin Kim, Joon Her, and Hanku Lee*
Division of Internet & Multimedia Engineering
Konkuk University
Seoul, Korea
{tough105, herj00n, hlee}

Abstract: In recent years, the rapid advancement of the Internet and the growing number of people using social networking services (SNSs) have facilitated the sharing of multimedia data. However, multimedia data processing techniques such as transcoding and transmoding impose a considerable burden on the computing infrastructure as the amount of data increases. Therefore, we propose a MapReduce-based image conversion module in a cloud computing environment in order to reduce the burden on computing power. The proposed module consists of two parts: a storage system, i.e., the Hadoop distributed file system (HDFS), for image data, and a MapReduce program with the Java Advanced Imaging (JAI) library for image transcoding. It can process image data in distributed and parallel cloud computing environments, thereby minimizing the computing infrastructure overhead. In this paper, we describe the implementation of the proposed module using Hadoop and JAI. In addition, we evaluate the proposed module in terms of processing time under varying experimental conditions.

Keywords: Cloud Computing; Hadoop; MapReduce; HDFS; Image Conversion

I. INTRODUCTION

The wide availability of inexpensive hardware such as personal computers, digital cameras, smartphones, and other easy-to-use technologies has enabled the average user to dabble in multimedia. The phenomenal growth of Internet technologies such as social networking services (SNSs) allows users to disseminate multimedia objects. SNS and media content providers are constantly working toward providing multimedia-rich experiences to end users.
Although the ability to share multimedia objects makes the Internet more attractive to consumers, clients and underlying networks are not always able to keep up with this growing demand. Users access multimedia objects not only from traditional desktops but also from mobile devices, such as smart phones and smart pads, whose resources are constrained in terms of processing, storage, and display capabilities.

Multimedia processing is characterized by large amounts of data, requiring large amounts of processing, storage, and communication resources, thereby imposing a considerable burden on the computing infrastructure [1]. The traditional approach to transcoding multimedia data requires specific and expensive hardware because of the high-capacity and high-definition features of multimedia data. Therefore, general-purpose devices and methods are not cost effective, and they have limitations. Recently, transcoding based on cloud computing has been investigated in some studies [2][3].

In this study, we design and implement an image conversion module based on MapReduce and HDFS (Hadoop distributed file system) in order to address the problems mentioned above. The proposed module consists of two parts. The first part stores a large amount of image data in HDFS for distributed parallel processing. The second part processes the image data stored in HDFS using the MapReduce framework and Java Advanced Imaging (JAI) to convert the image data into target formats. We use the SequenceFiles method to address the problem of processing small files in the Map function.

We perform two experiments to demonstrate the proposed module's transcoding performance. In the first experiment, we compare the proposed module with a non-Hadoop-based single program running on two different machines.
In addition, we evaluate the performance of the proposed module with respect to the Java Virtual Machine (JVM) reuse option, which addresses the problem of many small files.

The remainder of this paper is organized as follows. In section 2, we introduce Hadoop HDFS, MapReduce, and JAI. The module architecture and its features are proposed in section 3. In section 4, we describe the implementation of the module. The results of the evaluation are presented in section 5. Finally, section 6 concludes this paper with suggestions for future research.

Related Work

A. HDFS

HDFS is the primary storage system used by Hadoop applications [5]. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable and extremely rapid computations. HDFS has a master-slave structure and uses the TCP/IP protocol to communicate with each node. Figure 1 shows the structure of HDFS.

Figure 1. HDFS Structure

As shown in Figure 1, NameNode manages the namespace and controls file access by the client, and DataNode manages the storage of each node in the cluster. In addition, DataNode executes block commands issued by NameNode.

B. MapReduce

MapReduce is a programming model for the parallel processing of distributed large-scale data [6]. MapReduce processes an entire large-scale data set by dividing it among multiple servers. Figure 2 shows the structure of MapReduce.

Figure 2. MapReduce Structure

MapReduce frameworks provide a specific programming model and a run-time system for processing and creating large datasets, which makes them amenable to various real-world tasks [8]. The MapReduce framework also handles automatic scheduling, communication, and synchronization for processing huge datasets, and it provides fault tolerance. The MapReduce programming model is executed in two main steps, called mapping and reducing. Mapping and reducing are defined by mapper and reducer functions, which are data processing functions.
Each phase has a list of key-value pairs as input and output. In the mapping step, MapReduce takes the input datasets and feeds each data element to the mapper in the form of a key-value pair. In the reducing step, all the outputs from the mapper are processed, and the reducer creates the final result by merging them.

C. JAI

JAI is an open-source Java library used for image processing [7]. JAI supports various image formats (BMP, JPEG, PNG, PNM, TIFF) and encoder/decoder functions. In addition, most of the functions related to image conversion are provided through an API, and thus JAI can be used as a simple framework for image processing.

II. IMAGE CONVERSION MODULE ARCHITECTURE

A. Social Media Cloud Computing Service Model

In this study, we designed and implemented a MapReduce-based image conversion module in a cloud computing environment to solve the problem of computing infrastructure overhead. Such overhead increases the burden on the Internet infrastructure owing to the increase in multimedia data shared through the Internet. The traditional approach of transcoding multimedia data usually involves general-purpose devices and offline processes. However, multimedia data processing is time consuming and requires large computing resources. To solve this problem, we designed an image conversion module that exploits the advantages of cloud computing. The proposed module can resize and convert images in a distributed and parallel manner.

The proposed module uses HDFS as storage for distributed parallel processing. The image data is distributed in HDFS. For distributed parallel processing, the proposed module uses the Hadoop MapReduce framework. In addition, the proposed module uses the JAI library in the Mapper for image resizing and conversion. Figure 3 shows the proposed module architecture.

Figure 3. Image Conversion Module Architecture

As shown in Figure 3, the proposed module stores image data in HDFS.
HDFS automatically distributes the image data to each data node. The Map function processes each image in a distributed and parallel manner.

The proposed module does not have a summary or construction stage. Thus, there is no need to implement the Reduce function in the proposed module; only the Map function is implemented.

III. IMPLEMENTATION OF IMAGE CONVERSION MODULE

Figure 4 shows the programming elements of the image conversion module and how the proposed module is implemented. The processing steps are presented below.

Figure 4. Programming Elements of Image Conversion Module

First, the conversion module reads image data from HDFS using the RecordReader method of the InputFormat class. InputFormat transforms the image data into sets of Keys (file names) and Values (bytes).

Second, InputFormat passes the sets of Keys and Values to the Mapper. The Mapper processes the image data using the user-defined settings and methods for image conversion via the JAI library. The conversion module converts the image data into specific formats suitable for a variety of devices such as smart phones, pads, and personal computers in a fully distributed manner. The Mapper completes the image conversion and produces the results as Key (file name) and Value (bytes).

Finally, the Mapper passes the set of Key and Value to OutputFormat. The RecordWriter method of the OutputFormat class writes the result as a file to HDFS.

In this study, the image conversion module was implemented on the basis of Hadoop. However, small files cause problems for the Hadoop MapReduce process. Map tasks usually process a single block of input data at a time. If there are many small files, then each Map task processes only a small amount of input data, and as a result, there are many unscheduled Map tasks, each of which imposes extra bookkeeping overhead.
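
A rough way to see this bookkeeping cost is to count Map tasks. The sketch below assumes one Map task per block and that no block is shared between files. This is a simplification of how Hadoop computes input splits, not its actual logic. The helper map_task_count is hypothetical, and the sizes are the ones used in the comparison that follows.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size used in the experiments


def map_task_count(file_sizes, block_size=BLOCK_SIZE):
    """Estimate Map tasks: one per block, and at least one per file."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)


# One 1 GB file versus 10,000 files of 100 KB each (about the same total data).
one_large = map_task_count([1024 * 1024 * 1024])
many_small = map_task_count([100 * 1024] * 10_000)
```

Under these assumptions the single large file needs 16 Map tasks, while the small files need 10,000, even though both inputs hold roughly 1 GB of data.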
Consider a 1-GB file broken into sixteen 64-MB blocks, and approximately 10,000 100-KB files. The 10,000 files may require tens or hundreds of times more processing time than an equivalent single input file.

To alleviate the bookkeeping overhead, we exploit some inherent features of Hadoop. In particular, we run multiple Map tasks in one JVM by reusing the JVM across tasks, thereby avoiding some of the overhead associated with JVM startup.

The other method considered is the SequenceFiles method in the Map function. The SequenceFiles method uses the filename as the key and the file contents as the value. This method is optimized for use with many small files.

In the proposed module, we use the BytesWritable class of Hadoop for inputting the data contents of the image. The proposed module converts the size and format of the image using the following options: maxWidth, maxHeight, and Image Format.

IV. EVALUATION

The cloud server used in the experiments for evaluation is a single enterprise-scale cluster that consists of 28 computational nodes. Table 1 lists the specifications of the evaluation cluster. Because the structure of the cluster is homogeneous, it provides a uniform evaluation environment.

TABLE I. EVALUATION CLUSTER SPECIFICATIONS (28 NODES)
CPU:     Intel Xeon 4 Core DP E5506 2.13GHz * 2EA
RAM:     4GB Registered ECC DDR * 4EA
HDD:     1TB SATA-2 7,200 RPM
OS:      Linux CentOS 5.5
Java:    Java 1.6.0_23
Hadoop:  Hadoop-0.20.2
JAI:     JAI (Java Advanced Imaging) 1.1.3

Nine data sets were used to verify the performance of the proposed module. The average size of an image file was approximately 19.8 MB. Table 2 lists the specific information about the data sets used.

TABLE II. IMAGE DATASETS
Size (GB):  1, 2, 4, 8, 10, 20, 40, 50, 100
Numbers:    52, 104, 208, 416, 520, 1040, 2111, 2594, 5188
Format:     JPG
Source:     Flickr

During the experiment, the following default options in Hadoop were used: (1) the number of block replications was set to 3, and (2) the block size was set to 64 MB.

We evaluated the performance of the proposed module and optimized it. We planned and executed two experiments. In the first experiment, we compared the proposed module with a non-Hadoop-based single program on two different single machines. The specifications of the two machines are listed in Table 3.

We measured the running time on our cloud server using MapReduce programming, and on machines A and B using only sequential programs built on the JAI library without MapReduce. Figure 5 shows the result of the first experiment.

Figure 5. Elapsed Time for Proposed Module with Two Different Machines

The elapsed times on machines A and B are shorter than the run time measured with fewer than 2 nodes in our cluster. In cases 1 and 2, the performance without Hadoop on machines A and B is better than that of our cluster because in MapReduce programming the nodes distribute the processing, which causes overhead associated with the creation of map tasks, job scheduling, and the transfer speed of HDFS.

In the second experiment, we compare the JVM reuse option 1 with the JVM reuse option -1. With option 1, each JVM is used for only one task. With option -1, there is no limit on the number of times a JVM is reused. Figure 6 shows the result of the second experiment.

Figure 6. Elapsed Time for JVM Reuse Option

The elapsed time for the two options remains the same until the number of files reaches 520. However, after 1040 files, the difference between the performances of the two options grows. When the number of files to process exceeds a certain level, creating Map tasks generates JVM overhead. As the results above show, reusing the JVM is a possible way to reduce the overhead created by processing numerous small files on HDFS.

V. CONCLUSIONS & FUTURE WORK

Recently, the wide availability of inexpensive hardware, such as personal computers, digital cameras, smart phones, and other easy-to-use technologies has enabled the average user to dabble in multimedia.
The phenomenal growth of Internet technologies such as SNSs allows users to disseminate multimedia objects. However, the increasing amount of multimedia data imposes a considerable burden on the Internet infrastructure that must perform image conversion to provide reliable services to numerous heterogeneous devices. Thus, we designed and implemented a MapReduce-based image conversion module in a cloud computing environment. The proposed module is based on Hadoop HDFS and the MapReduce framework for distributed parallel processing of large-scale image data. We redesigned and implemented InputFormat and OutputFormat in the MapReduce framework for image data. We used the JAI library for converting the image format and resizing the images. We exploited the advantages of cloud computing to handle multimedia data processing.

We performed two experiments to evaluate the proposed module. In the first experiment, we compared the proposed module with a non-Hadoop-based single program using the JAI library. The proposed module shows better performance than the single program. In the second experiment, we changed the mapred.job.reuse.jvm.num.tasks option in the mapred-site.xml file and evaluated its effect on performance. The results of the second experiment show that the proposed module performs better with this option when it processes large numbers of small files.

Future research should focus not only on image data but also on video data. We plan to implement an integrated multimedia process system and a multimedia share system for SNS in a cloud computing environment.

TABLE III. SPECIFICATIONS OF THE TWO MACHINES

Machine A Specifications
CPU:   AMD Athlon II-X4 632 2.9GHz
RAM:   4GB DDR
HDD:   600GB SATA-2 7,200 RPM
OS:    Linux CentOS 5.5
Java:  Java 1.6.0_23
JAI:   JAI (Java Advanced Imaging) 1.1.3

Machine B Specifications
CPU:   Intel Xeon 4 Core DP E5506 2.13GHz * 2EA
RAM:   4GB Registered ECC DDR * 4EA
HDD:   1TB SATA-2 7,200 RPM
OS:    Linux CentOS 5.5
Java:  Java 1.6.0_23
JAI:   JAI (Java Advanced Imaging) 1.1.3

ACKNOWLEDGMENT

This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2011-(C1090-1101-0008)).

REFERENCES

[1] Sun-Moo Kang, Bu-Ihl Kim, Hyun-Sok Lee, Young-so Cho, Jae-Sup Lee, Byeong-Nam Yoon, "A study on a public multimedia service provisioning architecture for enterprise networks", Network Operations and Management Symposium, 1998, NOMS 98., IEEE, 15-20 Feb 1998, 44-48 vol.1, ISBN: 0-7803-4351-4
[2] Hari Kalva, Aleksandar Colic, Garcia, Borko Furht, "Parallel programming for multimedia applications", Multimedia Tools and Applications, volume 51, number 2, 901-818, DOI: 10.1007/s11042-010-0656-2
[3] Gracia, A., Kalva, H., "Cloud transcoding for mobile video content delivery", Consumer Electronics (ICCE), 2011 IEEE International Conference on, 9-12 Jan. 2011, 379-380, ISSN: 2158-3994
[4] /blog/2009/02/the-small-files-problem/
[5] Hadoop Distributed File System: /hdfs/
[6] Jeffrey Dean, Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
[7] Java Advanced Imaging Library: /javase/technologies/desktop/media/jai/
[8] Shivnath Babu, "Towards Automatic Optimization of MapReduce Programs", The 1st ACM Symposium on Cloud Computing, 2010.
