Created 07-10-2017 05:43 AM
I have two DataFrames, one with 131 rows and another with 54 million rows. I need to join the first with the second, which generates roughly 6 billion rows.
It's taking forever, even with a broadcast hash join and several rounds of executor/memory tuning.
I need help with the syntax.
Created 07-11-2017 01:43 AM
// Raising the broadcast threshold (here to 500 MB) should make Spark
// perform the broadcast join automatically:
conf.set("spark.sql.autoBroadcastJoinThreshold", 1024 * 1024 * 500)

// Alternatively, explicitly mark the small DataFrame to be broadcast in the
// join (this can be used in conjunction with the first option):
import org.apache.spark.sql.functions.broadcast
val resDF = bigDF.join(broadcast(smallDF), bigDF("k1") === smallDF("k1"))
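For reference, a fuller sketch of the explicit-broadcast approach, assuming a join key named k1 on both sides as in the snippet above (the DataFrame and column names are illustrative, taken from this thread):

    import org.apache.spark.sql.functions.broadcast

    // smallDF: the 131-row lookup table; bigDF: the 54M-row table.
    // Passing the key as Seq("k1") joins on the common column and keeps
    // only one copy of k1 in the result, avoiding a duplicate column.
    val resDF = bigDF.join(broadcast(smallDF), Seq("k1"))

Note that broadcast() is only a hint; Spark may still fall back to a shuffle join if the plan cannot broadcast the table.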
Created 07-13-2017 10:51 PM
I was planning to avoid broadcast; that's why I asked. Thanks!