
Is there a way to do a broadcast join in Spark 2.1 in Java?

Contributor

I noticed that in Spark 1.6 we can do the following in Java:

join(broadcast(right), ...)

but it looks like the broadcast function is not available in Spark 2.1.0.

1 ACCEPTED SOLUTION

Contributor

I am not talking about broadcast variables; I am talking about the broadcast hint in a join:

join(broadcast(right), ...)

The broadcast here is a function defined specifically for DataFrames:

public static org.apache.spark.sql.DataFrame broadcast(org.apache.spark.sql.DataFrame dataFrame) { /* compiled code */ }

It is different from the broadcast variable explained in your link, which is created through the SparkContext, as below:

sc.broadcast(...)
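For what it's worth, in Spark 2.x the same hint exists on org.apache.spark.sql.functions, but its signature takes a Dataset<T> rather than a DataFrame. A minimal Java sketch (the input paths, table contents, and join column "key" here are hypothetical placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.broadcast;

public class BroadcastJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("BroadcastJoinSketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical inputs: a large fact table and a small dimension table
        Dataset<Row> large = spark.read().parquet("/path/to/large");
        Dataset<Row> small = spark.read().parquet("/path/to/small");

        // Hint the optimizer to broadcast the smaller side of the join
        Dataset<Row> joined = large.join(broadcast(small), "key");
        joined.show();

        spark.stop();
    }
}
```

Since DataFrame is just a type alias for Dataset<Row> in Spark 2.x, code that compiled against the 1.6 signature should only need the import adjusted.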


6 REPLIES

Expert Contributor

@X Long

I do not believe it was removed in Spark 2.1.0. Here is the documentation for broadcast variables (for Scala, Java, and Python): http://spark.apache.org/docs/2.1.0/programming-guide.html#broadcast-variables

You may also need to set the spark.sql.autoBroadcastJoinThreshold parameter if you are running into errors. This parameter sets the maximum size (in bytes) of a table that will be broadcast to all worker nodes when performing a join.
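For reference, the threshold can be adjusted at runtime on the session (a config sketch; `spark` is assumed to be an existing SparkSession):

```java
// Raise the automatic broadcast threshold to 50 MB (the default is 10 MB, i.e. 10485760 bytes).
// Setting it to "-1" disables automatic broadcast joins entirely.
spark.conf().set("spark.sql.autoBroadcastJoinThreshold", "52428800");
```

The same value can also be passed at submit time via --conf spark.sql.autoBroadcastJoinThreshold=52428800.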

If you are running into an error, can you please post that as well? Thanks!


Expert Contributor

What is the error you are getting while trying to use it, then?

This is what I used in Spark 1.6.1:

import org.apache.spark.sql.functions.broadcast

val joined_df = df1.join(broadcast(df2), "key")

Contributor

It works now; it was a problem with my IDE.

Thanks!

Master Guru

Hi @X Long, how about up-voting some answers? The guys tried to help you, but could not have imagined that the issue was something trivial with your IDE. Give and take. Thanks!