MapReduce Data Analysis

Author 2: Erik Paulson, University of Wisconsin
1 MapReduce and parallel DBMSs: friends or foes?
2 A comparison of approaches to large-scale data analysis
3 Clustera: an integrated computation and data management system
Like the first author, he works mainly on comparing Hadoop (MapReduce) with parallel database management systems for large-scale data analysis.
ABSTRACT: There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system's performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.
Author 3: Alexander Rasin, Brown University
1 CORADD: correlation aware database designer for materialized views and indexes
2 MapReduce and parallel DBMSs: friends or foes?
3 HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
4 Correlation maps: a compressed access method for exploiting soft functional dependencies
5 A comparison of approaches to large-scale data analysis
6 H-Store: a high-performance, distributed main memory transaction processing system
Building on this paper, the author designed HadoopDB, a system that combines MapReduce with a parallel database management system.
A Comparison of Approaches to Large-Scale Data Analysis
About the Authors
Author 1: Andrew Pavlo, Brown University
1 MapReduce and parallel DBMSs: friends or foes?
2 A comparison of approaches to large-scale data analysis
3 H-Store: a high-performance, distributed main memory transaction processing system
4 The NMI build & test laboratory: continuous integration framework for distributed computing software
5 Smoother transitions between breadth-first-spanning-tree-based drawings
He works mainly on comparing Hadoop (MapReduce) with parallel database management systems for large-scale data analysis.