分布式流式数据处理框架:功能对比以及性能评估
合集下载
相关主题
- 1、下载文档前请自行甄别文档内容的完整性,平台不提供额外的编辑、内容补充、找答案等附加服务。
- 2、"仅部分预览"的文档,不可在线预览部分如存在完整性等问题,可反馈申请退款(可完整预览的文档不适用该条件!)。
- 3、如文档侵犯您的权益,请联系客服反馈,我们会尽快为您处理(人工客服工作时间:9:00-18:30)。
Functional Comparison and Performance Evaluation
Overview
➢ Streaming Core ➢ MISC ➢ Performance Benchmark
Choose your weapon!
2
Execution Model + Fault Tolerance Mechanism
Aapchwk.baidu.com Flink*
At least once
• Ackers know about if a record is processed successfully or not. If it failed, replay it.
• There is no state consistency guarantee.
• Trident: ✓ persistentAggregate ✓ State
8
*
API
Compositional
• Highly customizable operator based on basic building blocks • Manual topology definition and optimization
Apache Gearpump*
Apache Spark Streaming*
Apache Storm Trident*
Exactly once
• State is persisted in durable storage
• Checkpoint is linked with state storage per Batch
• Flink ScalaAPI: ✓ mapWithState
• Gearpump ✓ persistState
Apache Spark Streaming*
Apache Storm Trident*
Yes
• Spark 1.5: ✓ updateStateByKey
• Spark 1.6: ✓ mapWithState
Acker
offset
state str ack
JobManager/ HDFS
offset state str job status
Driver
HDFS
critical This is the
part, as it affects many features
5
Continuous Streaming
Continuous Streaming
Micro-Batch
Apache Storm*
Twitter Heron*
Aapche Flink*
Apache Gearpump*
Apache Spark Streaming*
Apache Storm Trident*
Source Operator Sink
7
*
Native State Operator
Apache Storm*
Twitter Heron*
Yes*
• Storm: ✓ KeyValueState
• Heron: X User Maintain
Aapche Flink*
Apache Gearpump*
Yes
• Flink JavaAPI: ✓ ValueState ✓ ListState ✓ ReduceState
Apache Storm Trident*
Low Latency High Overhead Low Throughput
High Latency Low Overhead High Throughput
6
*
Delivery Guarantee
Apache Storm*
Twitter Heron*
Storage
Micro-Batch
Checkpoint per Batch
Apache Spark Streaming*
Apache Storm Trident*
Storage
Source Operator
Sink Source
Operator
Sink Source Operator
Sink
id
id
Ack per Record
Apache Storm*
Twitter Heron*
Continuous Streaming
Checkpoint “per Batch”
Aapche Flink*
Apache Gearpump*
Micro-Batch
Checkpoint per Batch
Apache Spark Streaming*
Apache Spark Streaming*
Apache Storm Trident* Aapche Flink* Apache
Gearpump*
DataStream<String> text = env.readTextFile(params.get("input")); DataStream<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer()).keyBy(0).sum(1);
Source Operator
Sink
4
Continuous Streaming
Ack per Record
Apache Storm*
Twitter Heron*
Storage
Continuous Streaming
Checkpoint “per Batch”
Aapche Flink*
Apache Gearpump*
Apache Storm*
Twitter Heron* Apache Gearpump*
TopologyBuilder builder = new TopologyBuilder(); builder.setSpout(“input", new RandomSentenceSpout(), 1); builder.setBolt("split", new SplitSentence(), 3).shuffleGrouping("spout"); builder.setBolt("count", new WordCount(), 2).fieldsGrouping("split", new Fields("word"));
“foo, foo, bar”
“foo”, “foo”, “bar”
{“foo”:2, “bar”: 1}
Spout
Bolt
Bolt
10
*
Declarative
• Higher order function as operators (filter, mapWithState…) • Logical plan optimization
Overview
➢ Streaming Core ➢ MISC ➢ Performance Benchmark
Choose your weapon!
2
Execution Model + Fault Tolerance Mechanism
Aapchwk.baidu.com Flink*
At least once
• Ackers know about if a record is processed successfully or not. If it failed, replay it.
• There is no state consistency guarantee.
• Trident: ✓ persistentAggregate ✓ State
8
*
API
Compositional
• Highly customizable operator based on basic building blocks • Manual topology definition and optimization
Apache Gearpump*
Apache Spark Streaming*
Apache Storm Trident*
Exactly once
• State is persisted in durable storage
• Checkpoint is linked with state storage per Batch
• Flink ScalaAPI: ✓ mapWithState
• Gearpump ✓ persistState
Apache Spark Streaming*
Apache Storm Trident*
Yes
• Spark 1.5: ✓ updateStateByKey
• Spark 1.6: ✓ mapWithState
Acker
offset
state str ack
JobManager/ HDFS
offset state str job status
Driver
HDFS
critical This is the
part, as it affects many features
5
Continuous Streaming
Continuous Streaming
Micro-Batch
Apache Storm*
Twitter Heron*
Aapche Flink*
Apache Gearpump*
Apache Spark Streaming*
Apache Storm Trident*
Source Operator Sink
7
*
Native State Operator
Apache Storm*
Twitter Heron*
Yes*
• Storm: ✓ KeyValueState
• Heron: X User Maintain
Aapche Flink*
Apache Gearpump*
Yes
• Flink JavaAPI: ✓ ValueState ✓ ListState ✓ ReduceState
Apache Storm Trident*
Low Latency High Overhead Low Throughput
High Latency Low Overhead High Throughput
6
*
Delivery Guarantee
Apache Storm*
Twitter Heron*
Storage
Micro-Batch
Checkpoint per Batch
Apache Spark Streaming*
Apache Storm Trident*
Storage
Source Operator
Sink Source
Operator
Sink Source Operator
Sink
id
id
Ack per Record
Apache Storm*
Twitter Heron*
Continuous Streaming
Checkpoint “per Batch”
Aapche Flink*
Apache Gearpump*
Micro-Batch
Checkpoint per Batch
Apache Spark Streaming*
Apache Spark Streaming*
Apache Storm Trident* Aapche Flink* Apache
Gearpump*
DataStream<String> text = env.readTextFile(params.get("input")); DataStream<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer()).keyBy(0).sum(1);
Source Operator
Sink
4
Continuous Streaming
Ack per Record
Apache Storm*
Twitter Heron*
Storage
Continuous Streaming
Checkpoint “per Batch”
Aapche Flink*
Apache Gearpump*
Apache Storm*
Twitter Heron* Apache Gearpump*
TopologyBuilder builder = new TopologyBuilder(); builder.setSpout(“input", new RandomSentenceSpout(), 1); builder.setBolt("split", new SplitSentence(), 3).shuffleGrouping("spout"); builder.setBolt("count", new WordCount(), 2).fieldsGrouping("split", new Fields("word"));
“foo, foo, bar”
“foo”, “foo”, “bar”
{“foo”:2, “bar”: 1}
Spout
Bolt
Bolt
10
*
Declarative
• Higher order function as operators (filter, mapWithState…) • Logical plan optimization