Spark Getting Started Tutorial and Lessons Learned


Guiding questions:
1. What steps are involved when an application runs in cluster mode?
2. What are the characteristics of running on YARN?
3. What error came up when the HTTP file server process was shut down?

I. Environment Preparation

The test environment is the QuickStart VM provided by CDH.
Hadoop version: 2.5.0-cdh5.2.0
Spark version: 1.1.0

II. Hello Spark

Move
/usr/lib/spark/examples/lib/spark-examples-1.1.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar
to
/usr/lib/spark/lib/spark-examples-1.1.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar

Run the example: ./bin/run-example SparkPi 10
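
For context when reading the log analysis below, here is a rough sketch of what a SparkPi-style program does; it is written under the assumption that the bundled example follows the standard Monte Carlo approach, and the object and variable names are illustrative rather than taken from the Spark source. The argument 10 passed above becomes the number of slices (partitions), which is why 10 tasks appear later in the logs.

import scala.math.random
import org.apache.spark.{SparkConf, SparkContext}

object SparkPiSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkPiSketch")
    val sc = new SparkContext(conf)

    val slices = if (args.length > 0) args(0).toInt else 2   // "10" in the run above
    val n = 100000 * slices

    // map: each sample becomes 1 if it lands inside the unit circle, else 0
    // reduce: the action that triggers the job seen as "Starting job: reduce" in the logs
    val count = sc.parallelize(1 to n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)

    println("Pi is roughly " + 4.0 * count / n)
    sc.stop()   // triggers the shutdown sequence seen at the end of the logs
  }
}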

Log analysis: the program checks the IP, host, and SecurityManager

Starts sparkDriver: Akka starts a TCP listener at [akka.tcp://sparkDriver@192.168.128.131:42960]

Starts MapOutputTracker and BlockManagerMaster

Starts a block manager, i.e.

ConnectionManagerId(192.168.128.131,41898), which contains a MemoryStore

Starts an HTTP file server (a Jetty SocketConnector):

SocketConnector@0.0.0.0:55161

Starts the Spark UI at http://192.168.128.131:4040, and serves the local application jar over HTTP

Connects to the HeartbeatReceiver:

akka.tcp://sparkDriver@192.168.128.131:42960/user/HeartbeatReceiver

Starting job: reduce — the job is analyzed and has Stage 0 (MappedRDD[1])

Adds and launches the tasks: Submitting 10 missing tasks from Stage 0. The executor fetches the application jar over HTTP and adds it to its classloader. After a task completes, its result is sent back to the driver. scheduler.DAGScheduler finishes the stage once all of its tasks are done,

and scheduler.TaskSetManager on localhost collects the completed tasks (see the small sketch after this walkthrough for how the task count relates to partitions)

Job finished, then the shutdown sequence begins: Stop Spark Web UI

Stop DAGScheduler

MapOutputTrackerActor stopped

stop ConnectionManager

MemoryStore cleared

BlockManager stopped

Shutting down remote daemon.

Successfully stopped SparkContext
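
The walkthrough above shows exactly one job ("Starting job: reduce"), one stage, and 10 tasks. As a minimal, hypothetical sketch (variable names are mine, and an existing SparkContext sc is assumed): the number of tasks in a stage equals the number of partitions of the RDD, and it is the action, not the transformation, that starts the job.

// Minimal sketch: partitions determine the task count, the action triggers the job.
val rdd = sc.parallelize(1 to 1000000, 10)   // 10 partitions -> "Submitting 10 missing tasks from Stage 0"
val doubled = rdd.map(_ * 2)                 // transformation only; no job started yet (a MappedRDD)
val total = doubled.reduce(_ + _)            // action: "Starting job: reduce" appears in the driver log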

III. Cluster Mode

Run flow:
1. The SparkContext connects to a cluster manager (either Spark's own standalone cluster manager or Mesos/YARN).

2. The Spark application asks the cluster manager for executors (the processes that run computations and store data for the application).

3. The application jar or Python files are distributed to the executors.

4. The SparkContext sends tasks to the executors to run (a minimal driver sketch follows the terminology list below).

Cluster manager types:
Standalone — Spark's built-in cluster manager, which makes it easy to spin up a cluster.
Apache Mesos — a general-purpose cluster manager that can also run Hadoop MapReduce and other service applications.
Hadoop YARN — the cluster manager in Hadoop 2.

Key terminology:
Application — User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar — A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; however, these will be added at runtime.
Driver program — The process running the main() function of the application and creating the SparkContext.
Cluster manager — An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Deploy mode — Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node — Any node that can run application code in the cluster.
Executor — A process launched for an application on a worker node, which runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task — A unit of work that will be sent to one executor.
Job — A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage — Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
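
To tie the flow above together, here is a minimal, hypothetical driver sketch; the host name and application name are placeholders, not from the original article. The master URL set on the SparkConf decides which cluster manager the SparkContext connects to, while the deploy mode (client vs. cluster) is normally chosen when the application is submitted.

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MyApp")
      // The master URL selects the cluster manager:
      //   "spark://host:7077"            -> Spark standalone
      //   "mesos://host:5050"            -> Apache Mesos
      //   "yarn-client" / "yarn-cluster" -> Hadoop YARN (Spark 1.1 syntax)
      .setMaster("spark://master-host:7077")   // placeholder host

    val sc = new SparkContext(conf)   // connects to the cluster manager and requests executors
    try {
      // tasks are sent to the executors and run there
      val result = sc.parallelize(1 to 100).map(_ + 1).reduce(_ + _)
      println(result)
    } finally {
      sc.stop()
    }
  }
}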
