<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Rdd/DataFrame/DataSet Performance Tuning in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144671#M35677</link>
    <description>&lt;P&gt;Unfortunately, I'm doing a full outer join, so I can't filter.&lt;/P&gt;</description>
    <pubDate>Tue, 26 Jul 2016 05:17:54 GMT</pubDate>
    <dc:creator>jestinm</dc:creator>
    <dc:date>2016-07-26T05:17:54Z</dc:date>
    <item>
      <title>Rdd/DataFrame/DataSet Performance Tuning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144668#M35674</link>
      <description>Hello, right now I'm using DataFrames to perform df1.groupBy(key).count() on one DataFrame and join the result with another DataFrame, df2.
&lt;P&gt;The first, df1, is very large (many gigabytes) compared to df2 (250 Mb).&lt;/P&gt;
Right now I'm running this on a cluster of 5 nodes, 16 cores each, 90 GB RAM each.
It is taking me about 1 hour and 40 minutes to perform the groupBy, count, and join, which seems very slow to me.
Currently I have set the following in my &lt;STRONG&gt;spark-defaults.conf&lt;/STRONG&gt;:

&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;spark.executor.instances&lt;/TD&gt;&lt;TD&gt;24&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;spark.executor.memory&lt;/TD&gt;&lt;TD&gt;10g&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;spark.executor.cores&lt;/TD&gt;&lt;TD&gt;3&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;spark.driver.memory&lt;/TD&gt;&lt;TD&gt;5g&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;spark.sql.autoBroadcastJoinThreshold&lt;/TD&gt;&lt;TD&gt;200Mb&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;I have a couple of questions regarding tuning for performance as a beginner.&lt;/P&gt;&lt;OL&gt;
&lt;LI&gt;Right now I'm running Spark 1.6.0. Would moving to Spark 2.0 DataSet (or even DataFrames) be much better?&lt;/LI&gt;&lt;LI&gt;What if I used RDDs instead? I know that reduceByKey is better than groupByKey, and DataFrames don't have that method.&lt;/LI&gt;&lt;LI&gt;I think I can do a broadcast join and have set a threshold. Do I need to set it above my second DataFrame size? Do I need to explicitly call broadcast(df2)?&lt;/LI&gt;&lt;LI&gt;What's the point of driver memory?&lt;/LI&gt;&lt;LI&gt;Can anyone point out something wrong with my tuning numbers, or any additional parameters worth checking out?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Thank you very much!&lt;/P&gt;&lt;P&gt;Sincerely,
Jestin&lt;/P&gt;</description>
      <pubDate>Sat, 23 Jul 2016 22:33:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144668#M35674</guid>
      <dc:creator>jestinm</dc:creator>
      <dc:date>2016-07-23T22:33:22Z</dc:date>
    </item>
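    <!--
    A note on the spark.sql.autoBroadcastJoinThreshold row in the question above: in Spark 1.6 this setting is typically given as a plain byte count rather than a "200Mb"-style string, so it is worth computing the value explicitly. A minimal sketch in plain Python (the sizes are illustrative assumptions, not taken from the thread):

    ```python
    # Hypothetical size for illustration: df2 is about 250 MB.
    df2_size_mb = 250

    # Give the threshold headroom above df2's size so the broadcast
    # join is actually chosen (e.g. 300 MB, expressed in bytes).
    threshold_bytes = 300 * 1024 * 1024

    # The line you would put in spark-defaults.conf:
    conf_line = "spark.sql.autoBroadcastJoinThreshold %d" % threshold_bytes
    print(conf_line)  # spark.sql.autoBroadcastJoinThreshold 314572800
    ```
    -->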
    <item>
      <title>Re: Rdd/DataFrame/DataSet Performance Tuning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144669#M35675</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11524/jestinm.html" nodeid="11524"&gt;@jestin ma&lt;/A&gt; &lt;/P&gt;&lt;P&gt;I wonder if doing a filter would help rather than a join and achive the same results. So instead of join, is it possible to do something like this?&lt;/P&gt;&lt;P&gt;df1.filter(df2).groupBy(key).count().&lt;/P&gt;</description>
      <pubDate>Sun, 24 Jul 2016 01:01:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144669#M35675</guid>
      <dc:creator>mqureshi</dc:creator>
      <dc:date>2016-07-24T01:01:13Z</dc:date>
    </item>
    <item>
      <title>Re: Rdd/DataFrame/DataSet Performance Tuning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144670#M35676</link>
      <description>&lt;P&gt;Use a broadcast variable for the smaller table to join it to the larger table. This implements a broadcast join (the same idea as a map-side join) and saves you quite a bit of network I/O and time.&lt;/P&gt;</description>
      <pubDate>Sun, 24 Jul 2016 06:05:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144670#M35676</guid>
      <dc:creator>jwiden</dc:creator>
      <dc:date>2016-07-24T06:05:34Z</dc:date>
    </item>
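    <!--
    The broadcast join suggested in the reply above can be pictured without Spark: the small table is shipped whole to every executor, and each partition of the large table joins against it locally, so the large side is never shuffled across the network. A toy sketch in plain Python (names and data are illustrative, not Spark API):

    ```python
    # Small table (like df2): key to attribute, small enough to copy to every node.
    small = {"a": 1, "b": 2}

    # Large table (like df1) split into partitions that live on different nodes.
    partitions = [[("a", "x"), ("c", "y")], [("b", "z"), ("a", "w")]]

    # Each partition joins locally against its own copy of `small`;
    # no shuffle of the large side is needed. None models a non-matching key,
    # which a full outer join would keep.
    joined = [
        (key, val, small.get(key))
        for part in partitions
        for key, val in part
    ]
    print(joined)
    ```
    -->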
    <item>
      <title>Re: Rdd/DataFrame/DataSet Performance Tuning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144671#M35677</link>
      <description>&lt;P&gt;Unfortunately, I'm doing a full outer join, so I can't filter.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2016 05:17:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144671#M35677</guid>
      <dc:creator>jestinm</dc:creator>
      <dc:date>2016-07-26T05:17:54Z</dc:date>
    </item>
    <item>
      <title>Re: Rdd/DataFrame/DataSet Performance Tuning</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144672#M35678</link>
      <description>&lt;P&gt;Some thoughts/questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;What does the key distribution look like? If lumpy, perhaps a repartition would help? Looking at the Spark UI might give some insight into the bottleneck.&lt;/LI&gt;&lt;LI&gt;Bumping up spark.sql.autoBroadcastJoinThreshold to 300M might help ensure that the map-side join (broadcast join) happens. Check &lt;A target="_blank" href="http://spark.apache.org/docs/latest/sql-programming-guide.html#performance-tuning"&gt;here&lt;/A&gt; though because it notes "...that currently statistics are only supported for Hive Metastore tables where the command &lt;CODE&gt;ANALYZE TABLE &amp;lt;tableName&amp;gt; COMPUTE STATISTICS noscan&lt;/CODE&gt; has been run."&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Other answers:&lt;/P&gt;&lt;OL&gt;
&lt;LI&gt;Right now I'm running Spark 1.6.0. Would moving to Spark 2.0 DataSet (or even DataFrames) be much better?&lt;OL&gt;&lt;LI&gt;Doubt it. There are probably opportunities to tune with what you have, and that tuning would be needed in 2.0 anyway.&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;LI&gt;What if I used RDDs instead? I know that reduceByKey is better than groupByKey, and DataFrames don't have that method.&lt;OL&gt;&lt;LI&gt;If you want to post more of your code, we can comment on that. It's hard to tell whether the RDD API's more granular control would help you without the bigger picture.&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;LI&gt;I think I can do a broadcast join and have set a threshold. Do I need to set it above my second DataFrame size? Do I need to explicitly call broadcast(df2)?&lt;OL&gt;&lt;LI&gt;Yes, the threshold matters and should be above the data size. Think of this as a map-side join. No, you should not need to call broadcast explicitly. However, if you did, you could not broadcast the DataFrame itself; it would have to be a collection loaded in the driver. Check here for info on broadcast variables: &lt;A href="https://spark.apache.org/docs/0.8.1/scala-programming-guide.html#broadcast-variables" target="_blank"&gt;https://spark.apache.org/docs/0.8.1/scala-programming-guide.html#broadcast-variables&lt;/A&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;LI&gt;What's the point of driver memory?&lt;OL&gt;&lt;LI&gt;Something like "collect" brings results back to the driver. If you're collecting a lot of results, you'll need to worry about the driver-memory setting.&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;LI&gt;Can anyone point out something wrong with my tuning numbers, or any additional parameters worth checking out?&lt;OL&gt;&lt;LI&gt;Looks good, but we could give more assistance with the full code. Also, look at the Spark UI and walk the DAG to see where the bottleneck is.&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Tue, 26 Jul 2016 22:21:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Rdd-DataFrame-DataSet-Performance-Tuning/m-p/144672#M35678</guid>
      <dc:creator>clukasik</dc:creator>
      <dc:date>2016-07-26T22:21:35Z</dc:date>
    </item>
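    <!--
    On the reduceByKey point in the answer above: the win over groupByKey is the map-side combine, where each partition pre-aggregates its own counts so that only small per-partition maps cross the network during the shuffle. A plain-Python sketch of that idea (not Spark API; the data is illustrative):

    ```python
    from collections import Counter

    # Two partitions of keys, as they might sit on two executors.
    partitions = [["a", "b", "a"], ["b", "b", "c"]]

    # Map-side combine: each partition counts its own records locally first.
    local_counts = [Counter(p) for p in partitions]

    # The shuffle then merges only the small per-partition maps,
    # instead of moving every individual record across the network.
    total = Counter()
    for c in local_counts:
        total.update(c)
    print(dict(total))  # {'a': 2, 'b': 3, 'c': 1}
    ```
    -->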
  </channel>
</rss>

