
How to perform a Cartesian product join on Spark 1.6.1

Expert Contributor

I have two DataFrames, one with 131 rows and another with 54 million rows. I need to join the first with the second, which generates roughly 7 billion rows (131 × 54 million).

It's taking forever, even with a broadcast hash join and several rounds of executor/memory tuning.

I need help with the syntax.


Super Collaborator
//Raising the broadcast threshold should make Spark perform the broadcast join automatically
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (1024 * 1024 * 500).toString)

//The other alternative is to explicitly mark the DataFrame to be broadcast in the join (can be used in conjunction with the first option)
import org.apache.spark.sql.functions.broadcast
val resDF = bigDF.join(broadcast(smallDF), bigDF("k1") === smallDF("k1"))
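For the Cartesian product syntax itself, a minimal sketch in the Spark 1.6 DataFrame API: calling `join` with no join expression produces the cross product. The names `df1` and `df2` are placeholders for the two DataFrames in the question.

```scala
import org.apache.spark.sql.DataFrame

// Sketch: join with no condition yields the full Cartesian product.
// With 131 x 54M input rows this materializes ~7 billion output rows,
// so expect heavy shuffle and executor pressure.
def cartesianJoin(df1: DataFrame, df2: DataFrame): DataFrame =
  df1.join(df2)
```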

Expert Contributor

I was planning to avoid a broadcast join; that's why I asked. Thanks!
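If avoiding broadcast entirely is the goal, the RDD API offers `cartesian`, which pairs every row of one RDD with every row of the other. This is a hedged sketch, not from the thread; `smallDF` and `bigDF` follow the naming in the earlier reply, and the repartition count is an illustrative placeholder to tune.

```scala
import org.apache.spark.sql.Row

// Sketch: drop to the RDD level and take the explicit Cartesian product.
// The result type is RDD[(Row, Row)]; repartitioning afterwards spreads
// the ~7 billion pairs across more tasks (200 is just a placeholder).
val pairs = smallDF.rdd.cartesian(bigDF.rdd)
val spread = pairs.repartition(200)
```

Note that `cartesian` shuffles no less data than the join does; it simply avoids holding either side in executor memory the way a broadcast does.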