I have two DataFrames: one with 131 rows and another with 54 million rows. I need to join the first one with the second, thereby generating about 6 billion rows.
It's taking forever, even with a broadcast hash join and several executor/memory tuning attempts.
I need help with the syntax.
// Setting the broadcast threshold high enough should make Spark perform the
// broadcast join automatically (here ~500 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1024 * 1024 * 500)

// The other alternative is to explicitly mark the small DataFrame to be
// broadcast in the join (can be used in conjunction with the first option).
// Note that column equality in Scala uses ===, not ==, and columns are
// referenced via the DataFrame, e.g. bigDF("k1").
import org.apache.spark.sql.functions.broadcast

val resDF = bigDF.join(broadcast(smallDF), bigDF("k1") === smallDF("k1"))
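To check whether Spark actually chose a broadcast hash join (rather than a sort-merge join), you can inspect the physical plan. A minimal sketch, assuming `bigDF` and `smallDF` are the question's DataFrames with a common key column `k1` (the names are taken from the snippet above, not verified against your code):

import org.apache.spark.sql.functions.broadcast

val resDF = bigDF.join(broadcast(smallDF), bigDF("k1") === smallDF("k1"))

// The physical plan should contain a "BroadcastHashJoin" node; if you see
// "SortMergeJoin" instead, the broadcast hint was not applied (e.g. the
// small side exceeded the broadcast threshold).
resDF.explain()

One caveat on the original problem: if the join really produces ~6 billion output rows from 54 million input rows, the bottleneck is writing that much output, not the join strategy itself, so broadcasting can only help so much.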