火龙果软件-MapReduce课件
合集下载
相关主题
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
reduce (out_key, intermediate_value list) -> out_value list
6
精选课件
map
火龙果整理 uml.org.cn
Records from the data source (lines out of files, rows of a database, etc) are fed into the map function as key*value pairs: e.g., (filename, line).
hundreds/thousands of CPUs … Want to make this easy
"Google Earth uses 70.5 TB: 70 TB for the raw imagery and 500 GB for the index data."
From: http://googlesystem.blogspot.com/2006/09/howmuch-data-does-google-store.html
map() produces one or more intermediate values along with an output key from the input.
7
精选课件
reduce
火龙果整理 uml.org.cn
After the map phase is over, all the intermediate values for a given output key are combined together into a list
火龙果整理 uml.org.cn
reduce(String output_key, Iterator intermediate_values): // output_key: a word // output_values: a list of counts int result = 0; for each v in intermediate_values: result += ParseInt(v);
Emit(AsString(result));
10
精选课件
Example
Page 1: the weather is good Page 2: today is good Page 3: good weather is good.
火龙果整Байду номын сангаас uml.org.cn
11
精选课件
Map output
8
精选课件
Parallelism
火龙果整理 uml.org.cn
map() functions run in parallel, creating different intermediate values from different input data sets
reduce() functions also run in parallel, each working on a different output key
4
精选课件
MapReduce
火龙果整理 uml.org.cn
Automatic parallelization & distribution
Fault-tolerant Provides status and monitoring tools
Clean abstraction for programmers
3
精选课件
From: www.cs.cmu.edu/~bryant/presentations/DISC-FCRC07.ppt
火龙果整理 uml.org.cn
Motivation: Large Scale Data Processing
Want to process lots of data ( > 1 TB) Want to parallelize across
reduce() combines those intermediate values into one or more final values for that same output key
(in practice, usually only one final value per key)
火龙果整理 uml.org.cn
MapReduce课件
1
精选课件
Outline
MapReduce overview Discussion Questions MapReduce
火龙果整理 uml.org.cn
2
精选课件
Motivation
火龙果整理 uml.org.cn
200+ processors 200+ terabyte database 1010 total clock cycles 0.1 second response time 5¢average advertising revenue
map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: EmitIntermediate(w, "1");
5
精选课件
Programming Model
Borrows from functional programming Users implement interface of two functions:
火龙果整理 uml.org.cn
map (in_key, in_value) -> (out_key, intermediate_value) list
All values are processed independently
Bottleneck: reduce phase can’t start until map phase is completely finished.
9
精选课件
Example: Count word occurrences