Created 07-23-2016 03:33 PM
I'm joining two DataFrames, df1 and df2. The first, df1, is very large (many gigabytes) compared to df2 (about 250 MB).
Right now I'm running this on a cluster of 5 nodes, 16 cores each, 90 GB RAM each. It is taking me about 1 hour and 40 minutes to perform the groupBy, count, and join, which seems very slow to me. Currently I have set the following in my spark-defaults.conf:

spark.executor.instances | 24
spark.executor.memory | 10g
spark.executor.cores | 3
spark.driver.memory | 5g
spark.sql.autoBroadcastJoinThreshold | 200Mb
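For reference, a minimal sketch of the kind of job being described, assuming an illustrative join column named "key" (the actual schema isn't shown in the post):

// df1 is the large DataFrame, df2 the small one; "fullouter" matches the
// full outer join the poster mentions later in the thread.
val joined = df1.join(df2, Seq("key"), "fullouter")
val counts = joined.groupBy("key").count()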
As a beginner, I have a couple of questions regarding performance tuning.
Thanks a lot!
Sincerely, Jestin
Created 07-26-2016 03:21 PM
Some thoughts/questions: check whether
ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan
has been run; without table statistics, Spark may not pick a broadcast join automatically even with spark.sql.autoBroadcastJoinThreshold set.

Other answers:
Created 07-23-2016 06:01 PM
I wonder if doing a filter rather than a join would help and achieve the same results. So instead of the join, is it possible to do something like this?
df1.filter(df2).groupBy(key).count()
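As a sketch only, that filter idea could be expressed as a left semi join, which keeps just the df1 rows whose key also appears in df2 (the column name "key" is an assumption, not the actual schema):

// Semi join: use df2 purely as a filter on df1, then aggregate.
val filtered = df1.join(df2, Seq("key"), "leftsemi")
val counts = filtered.groupBy("key").count()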
Created 07-25-2016 10:17 PM
Unfortunately, I'm doing a full outer join, so I can't filter.
Created 07-23-2016 11:05 PM
Use a broadcast variable for the smaller table to join it to the larger table. This will implement a broadcast join (the same as a map-side join) and save you quite a bit of network I/O and time.
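With the DataFrame API this is typically done via the broadcast hint rather than a raw broadcast variable; a minimal sketch, again assuming a join column named "key":

import org.apache.spark.sql.functions.broadcast

// Ask Spark to ship the small df2 to every executor so the join can be done
// map-side, avoiding a shuffle of the large df1.
val joined = df1.join(broadcast(df2), Seq("key"), "fullouter")

Note that the hint can only be honoured for join types Spark can execute as a broadcast hash join; a full outer join, as in this case, will generally still fall back to a shuffle-based join.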