Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Is there a way to do a broadcast join in Spark 2.1 in Java?

New Member

I noticed that we can do:

join(broadcast(right), ...)

in Spark 1.6 in Java, but it looks like the broadcast function is not available in Spark 2.1.0.

1 ACCEPTED SOLUTION

New Member

I am not talking about broadcast variables; I am talking about the broadcast hint in a join:

join(broadcast(right), ...)

The 'broadcast' here is a function defined specifically for DataFrames:

public static org.apache.spark.sql.DataFrame broadcast(org.apache.spark.sql.DataFrame dataFrame) { /* compiled code */ }

It is different from the broadcast variable explained in your link, which needs to be called on a SparkContext, as below:

sc.broadcast(...)

View solution in original post

6 REPLIES

Expert Contributor

@X Long

I do not believe it was removed in Spark 2.1.0. Here's the documentation for Broadcast Variables (for Scala, Java, and Python): http://spark.apache.org/docs/2.1.0/programming-guide.html#broadcast-variables

You may also need to set the spark.sql.autoBroadcastJoinThreshold parameter if you are running into errors. This parameter sets the maximum size (in bytes) for a table that will be broadcast to all worker nodes when performing a join.
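For example, assuming an existing SparkSession named spark, the threshold could be adjusted at runtime like this (the 100 MB value is purely illustrative):

```java
// Illustrative sketch: "spark" is assumed to be an existing SparkSession.
// Raise the auto-broadcast threshold to ~100 MB (the default is 10 MB).
spark.conf().set("spark.sql.autoBroadcastJoinThreshold",
        String.valueOf(100L * 1024 * 1024));

// Setting it to -1 disables automatic broadcast joins entirely.
spark.conf().set("spark.sql.autoBroadcastJoinThreshold", "-1");
```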

If you are running into an error, can you please post that as well? Thanks!

New Member

I am not talking about broadcast variables; I am talking about the broadcast hint in a join:

join(broadcast(right), ...)

The 'broadcast' here is a function defined specifically for DataFrames:

public static org.apache.spark.sql.DataFrame broadcast(org.apache.spark.sql.DataFrame dataFrame) { /* compiled code */ }

It is different from the broadcast variable explained in your link, which needs to be called on a SparkContext, as below:

sc.broadcast(...)
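For the record, the hint is still available in Spark 2.x under org.apache.spark.sql.functions, now operating on Dataset&lt;Row&gt; rather than DataFrame. A minimal Java sketch, assuming two existing Dataset&lt;Row&gt; variables df1 and df2 that share a join column named "key":

```java
import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// ...
// Hint the optimizer to ship df2 (the smaller table) to every executor,
// so the shuffle join becomes a map-side broadcast hash join.
Dataset<Row> joined = df1.join(broadcast(df2), "key");
```

The hint only instructs the optimizer; if df2 is too large to fit in executor memory, broadcasting it can cause failures, so it should be reserved for genuinely small tables.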

Expert Contributor

What is the error you are getting while trying to use it, then?

This is what I used in Spark 1.6.1:

import org.apache.spark.sql.functions.broadcast

val joined_df = df1.join(broadcast(df2), "key")

New Member

It works now; it was a problem with my IDE.

Thanks!

Master Guru

Hi @X Long, how about up-voting some of the answers? The folks here tried to help you and could not have known that the issue was something trivial with your IDE. Give and take. Thanks!