Spark SQL Optimization

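1. Memory optimization
1.1 RDD
By default, RDD cache uses memory only.
You can see that with the default cache, only 3 of the 4 partitions were cached in memory, taking 4.4 GB.
Using Kryo serialization + MEMORY_ONLY_SER:
You can see that all four partitions were cached in full, and the cache took only 1445.8 MB.
So how should you choose between these two caching approaches? The official documentation recommends:
In other words, use the default cache when cluster resources are sufficient, and use Kryo serialization + MEMORY_ONLY_SER when resources are tight.
1.2 DataFrame and DataSet
DataSet does not use Java or Kryo serialization; it serializes with a specialized encoder.
The default cache stores data in memory and on disk.
The same amount of data was also cached in full, using only 612.3 MB of memory.
The serialized cache used about 30 MB more than the default cache, 646.2 MB in total.
For df and ds, just use the default cache.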
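To put the figures above side by side, here is a small arithmetic sketch. It assumes the four partitions are roughly equal in size (the notes do not say so), so the full deserialized footprint is extrapolated from the 3 cached partitions. The function name is illustrative only.

```python
def estimate_full_size_mb(cached_mb, cached_partitions, total_partitions):
    """Extrapolate the size of all partitions, assuming they are equally sized."""
    return cached_mb / cached_partitions * total_partitions


# Figures reported in the notes (1 GB taken as 1024 MB).
rdd_default_mb = estimate_full_size_mb(4.4 * 1024, 3, 4)  # default cache held 3 of 4 partitions
rdd_kryo_ser_mb = 1445.8   # Kryo + MEMORY_ONLY_SER, all 4 partitions
ds_default_mb = 612.3      # DataSet, default cache
ds_serialized_mb = 646.2   # DataSet, serialized cache

kryo_saving = rdd_default_mb / rdd_kryo_ser_mb
ds_extra_mb = ds_serialized_mb - ds_default_mb

print(f"RDD default cache, all partitions (estimated): {rdd_default_mb:.1f} MB")
print(f"Kryo-serialized RDD cache is about {kryo_saving:.1f}x smaller")
print(f"DataSet serialized cache uses {ds_extra_mb:.1f} MB more than the default")
```

Under that assumption, serialization pays off for RDDs but not for DataSets, which matches the recommendation above.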
2. Too many small files
2.1 Setting parallelism for RDDs
spark.default.parallelism
Meaning: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
Default: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
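The default rule quoted above can be modelled as a small decision function. This is a simplified sketch of the documented behaviour, not Spark's implementation; the function names and the cluster-mode strings are made up for illustration.

```python
def default_parallelism(cluster_mode, local_cores=1, total_executor_cores=0):
    """Default partition count for operations with no parent RDD (e.g. parallelize)."""
    if cluster_mode == "local":
        return local_cores
    if cluster_mode == "mesos-fine-grained":
        return 8
    # Any other cluster manager: total executor cores, but never fewer than 2.
    return max(total_executor_cores, 2)


def shuffle_partitions(parent_partition_counts, configured=None):
    """Partition count for shuffle operations such as reduceByKey and join."""
    if configured is not None:
        return configured  # spark.default.parallelism was set by the user
    return max(parent_partition_counts)  # largest parent RDD wins


print(default_parallelism("local", local_cores=8))
print(default_parallelism("yarn", total_executor_cores=12))
print(shuffle_partitions([4, 10]))
```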
2.2 Spark SQL
Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.
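Recommendation: if no other jobs run afterwards, the cache does not need to be released; if other jobs follow, release it.
Using the operator API: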
result.unpersist
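spark.sql.shuffle.partitions: Configures the number of partitions to use when shuffling data for joins or aggregations.
Default: 200
Without reducing the partition count, the join leaves 200 small files on Hadoop.
The first three stages read the files (their parallelism cannot be controlled); the last two stages are the join, with a parallelism of 200.
1. Use the coalesce operator to reduce the number of partitions; the target cannot exceed the existing partition count.
2. If the target is smaller than the number of vcores, some vcores stay idle and the job runs slower.
For example, coalescing to 1 gives a parallelism of 1: only one vcore does any work and no shuffle takes place. If the data volume is large and the target is very small, this can cause an OOM.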
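The two constraints on coalesce can be seen in a toy model. The sketch below represents partitions as plain Python lists and merges whole partitions without redistributing rows; Spark's real coalesce also takes data locality into account, so treat this as an illustration of the semantics only.

```python
def coalesce(partitions, target):
    """Merge whole partitions into at most `target` groups, without a shuffle."""
    if target < 1:
        raise ValueError("target must be at least 1")
    n = min(target, len(partitions))  # coalesce cannot grow the partition count
    merged = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        # Neighbouring input partitions end up in the same output partition.
        merged[index * n // len(partitions)].extend(part)
    return merged


parts = [[row] for row in range(200)]  # 200 partitions, as after a default join

print(len(coalesce(parts, 20)))     # 20 output partitions, so 20 output files
print(len(coalesce(parts, 500)))    # still 200: asking for more has no effect
print(len(coalesce(parts, 1)[0]))   # all 200 rows land in a single partition
```

The last call shows why a very small target is risky: every row ends up in one task's memory.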
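You can see that after reducing the partition count to 20, there are only 20 files on Hadoop.
3. Using CPU resources efficiently
Before optimization, you can see that the final 200 tasks are not spread evenly across the machines; the load falls entirely on hadoop103. If the data volume is small, some vcores on hadoop103 may have no data and sit idle, so CPU resources are not used efficiently.
Setting spark.sql.shuffle.partitions to 2 to 3 times the total number of vcores gives the best result.
Without adding coalesce to reduce partitions, you can see there are 36 tasks.
The tasks are also distributed evenly, which achieves the optimization goal, avoids idle vcores, and uses CPU resources efficiently; the job time drops to 2.5 minutes.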
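The sizing rule above is simple arithmetic. In the sketch below, the cluster size of 12 vcores is a hypothetical value chosen for illustration (the notes do not state the cluster size), and the helper name is made up.

```python
import math


def suggested_shuffle_partitions(total_vcores, factor=3):
    """Apply the 2x to 3x rule of thumb to the cluster's total vcore count."""
    if total_vcores < 1:
        raise ValueError("total_vcores must be positive")
    if not 2 <= factor <= 3:
        raise ValueError("factor is expected to be between 2 and 3")
    return math.ceil(total_vcores * factor)


total_vcores = 12  # hypothetical cluster size, not taken from the notes
low = suggested_shuffle_partitions(total_vcores, 2)
high = suggested_shuffle_partitions(total_vcores, 3)
print(f"set spark.sql.shuffle.partitions between {low} and {high}")
```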
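4. Broadcast join
The small table is collected on the driver and distributed to every executor, which avoids the shuffle and eliminates that stage.
It is only suitable for joining a small table with a large table.
A normal join between two large tables uses SortMergeJoin.
Tables of 10 MB or less are broadcast automatically.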
spark.sql.autoBroadcastJoinThreshold 10485760 (10 MB)
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.
4.1 API
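Disable automatic broadcast join by setting the parameter to -1.
You can see that only one join stage with 36 tasks remains, with an extra broadcast exchange step.
The join becomes a BroadcastHashJoin and the run time becomes 2 minutes.
4.2 Parameter
The value must be given in bytes, not with an M suffix, e.g. 10485760.
The default is 10 MB. In production the value can be raised, for example to 20 MB, which can avoid data skew when joining a small table with a large table:
set("spark.sql.autoBroadcastJoinThreshold","20971520")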
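Since the threshold must be written in bytes, a quick conversion helps avoid mistakes. The sketch below also models the decision described above (broadcast when the table size is at most the threshold, never when the threshold is -1). The helper names are illustrative and the code is a model, not Spark's planner.

```python
MB = 1024 * 1024


def threshold_bytes(megabytes):
    """Convert a size in MB to the byte value the parameter expects."""
    return megabytes * MB


def auto_broadcast(table_size_bytes, threshold=10 * MB):
    """Return True if a table of this size would be broadcast automatically."""
    if threshold == -1:
        return False  # -1 disables automatic broadcast joins
    return table_size_bytes <= threshold


print(threshold_bytes(10))   # default threshold in bytes
print(threshold_bytes(20))   # value used in the set(...) call above
print(auto_broadcast(15 * MB))                       # too large for the default
print(auto_broadcast(15 * MB, threshold_bytes(20)))  # fits after raising to 20 MB
```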
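5. Data skew
The count is not just 5 million rows, because a partition holds more than one key: for example, partition 1 holds keys 101 and 102, and partition 2 holds keys 103 and 104. In essence, skew comes from all rows with the same key being gathered into one task.
5.1 Incorrect ways to fix data skew
5.2 Fixing data skew
1. Broadcast join
2. Scatter (salt) the large table and expand the small table. This solves the skew but may take longer, because the small table grows.
Use it only when the skew is so severe that the job cannot produce a result.
Join on the salted courseId.
You can see that dataframe.map turns the DataFrame into a Dataset.
Inside the loop the key should be i + "_" + courseid; it was written incorrectly.
You can see that the skew has been mitigated.
However, the run time becomes 50 seconds,
which is longer than the 34 seconds in 3 where only the partition count was reduced, even though 3 still had data skew.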
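The scatter-and-expand approach can be illustrated without Spark. The sketch below uses plain Python lists as tables: the large table gets a random prefix on each key, and the small table is replicated once per prefix using the i + "_" + courseid pattern mentioned above. The table contents, the salt count of 10, and the function names are all made up for illustration.

```python
import random
from collections import Counter


def salt_large(rows, salts, rng):
    """Prefix every key in the large table with a random salt in [0, salts)."""
    return [(f"{rng.randrange(salts)}_{course_id}", value) for course_id, value in rows]


def expand_small(rows, salts):
    """Replicate every small-table row once per salt: i + "_" + courseid."""
    return [(f"{i}_{course_id}", value) for course_id, value in rows for i in range(salts)]


def join(left, right):
    """Plain inner join on the first field of each row."""
    lookup = {}
    for key, value in right:
        lookup.setdefault(key, []).append(value)
    return [(key, lv, rv) for key, lv in left for rv in lookup.get(key, [])]


# Hypothetical data: key 101 is heavily skewed.
large = [(101, n) for n in range(1000)] + [(102, n) for n in range(10)]
small = [(101, "spark"), (102, "hive")]

salted_large = salt_large(large, 10, random.Random(0))
expanded_small = expand_small(small, 10)
result = join(salted_large, expanded_small)

largest_group = max(Counter(key for key, _ in salted_large).values())
print(len(result), len(expanded_small), largest_group)
```

The join output has the same number of rows as the unsalted join, but the hot key is split across several salted keys. The price is a small table that is 10 times larger, which is consistent with the longer run time noted above.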
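This optimizes sort time; the effect is very noticeable with large data volumes.
In Spark, writing bucketed tables requires saveAsTable; insertInto does not support bucketing.
First, join two bucketed tables.
After bucketing, the number of tasks equals the number of buckets.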
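Why the task count follows the bucket count can be shown with a toy model. The sketch below assigns rows to buckets by a hash of the key and joins bucket i of one table only with bucket i of the other, one task per bucket pair. The CRC32 hash is a stand-in chosen for this example (Spark uses its own hash function), and the table contents are hypothetical.

```python
import zlib


def bucket_id(key, num_buckets):
    """Stable stand-in hash; equal keys always map to the same bucket."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def write_bucketed(rows, num_buckets):
    """Split rows into buckets by key and keep each bucket sorted by key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return [sorted(bucket, key=lambda row: row[0]) for bucket in buckets]


def bucketed_join(left_buckets, right_buckets):
    """Join matching buckets pairwise; returns the rows and the task count."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both tables must use the same number of buckets")
    results, tasks = [], 0
    for left, right in zip(left_buckets, right_buckets):
        tasks += 1  # one task handles one pair of buckets
        lookup = {}
        for key, value in right:
            lookup.setdefault(key, []).append(value)
        results.extend((key, lv, rv) for key, lv in left for rv in lookup.get(key, []))
    return results, tasks


orders = [(course_id, f"order{n}") for n, course_id in enumerate([101, 102, 103, 104, 101])]
courses = [(101, "spark"), (102, "hive"), (103, "flink"), (104, "kafka")]
results, tasks = bucketed_join(write_bucketed(orders, 4), write_bucketed(courses, 4))
print(tasks, len(results))
```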
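7. Using off-heap memory
Before 3.0
After 3.0
Test with modified memory settings
max(2G * 0.1, 384M)
2G + 2G + 384M > 4G
Nominally 4 GB is requested, but the actual request is larger than 4 GB.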
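The arithmetic above can be checked with a short sketch. It follows the formula in the notes (overhead is the larger of 10% of executor memory and 384 MB, and the off-heap size is added to the request); the function names and default values are illustrative, not Spark APIs.

```python
def memory_overhead_mb(executor_memory_mb, factor=0.1, minimum_mb=384):
    """Overhead is the larger of factor * executor memory and the 384 MB floor."""
    return max(int(executor_memory_mb * factor), minimum_mb)


def container_request_mb(executor_memory_mb, offheap_mb=0):
    """Total request: on-heap memory + off-heap memory + overhead."""
    return executor_memory_mb + offheap_mb + memory_overhead_mb(executor_memory_mb)


heap_mb = 2 * 1024     # 2G on-heap
offheap_mb = 2 * 1024  # 2G off-heap
total_mb = container_request_mb(heap_mb, offheap_mb)
print(memory_overhead_mb(heap_mb), total_mb, total_mb > 4 * 1024)
```

With 2G on-heap, 10% is only about 204 MB, so the 384 MB floor applies and the total request comes to 4480 MB, which is above 4 GB.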
Off-heap memory usage
On-heap and off-heap memory can borrow from each other.
When to use off-heap memory.