Is there a way to do broadcast join in Spark 2.1 in java

I noticed that we can write

join(broadcast(right),...)

in Spark 1.6 in Java, but it looks like the broadcast function is not available in Spark 2.1.0.

6 REPLIES

Re: Is there a way to do broadcast join in Spark 2.1 in java
@X Long

I do not believe it was removed in Spark 2.1.0. Here's the documentation for Broadcast Variables (for Scala, Java, and Python): http://spark.apache.org/docs/2.1.0/programming-guide.html#broadcast-variables

You may also need to check the spark.sql.autoBroadcastJoinThreshold parameter if you are running into errors. This parameter sets the maximum size (in bytes) of a table that will be broadcast to all worker nodes when performing a join.
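As a rough illustration of the parameter mentioned above, here is a minimal Java sketch that sets spark.sql.autoBroadcastJoinThreshold when building a session and changes it again at runtime. It assumes spark-sql 2.1 is on the classpath and runs in local mode; the class name, the buildSession helper, and the 50 MB value are made up for the example and are not a recommendation.

```java
import org.apache.spark.sql.SparkSession;

public class BroadcastThresholdSketch {

    static final String KEY = "spark.sql.autoBroadcastJoinThreshold";

    // Hypothetical helper: builds a local session with the given threshold in bytes.
    static SparkSession buildSession(long thresholdBytes) {
        return SparkSession.builder()
                .appName("broadcast-threshold-sketch")
                .master("local[*]")          // assumption: local mode, for illustration only
                .config(KEY, thresholdBytes)
                .getOrCreate();
    }

    public static void main(String[] args) {
        // 50 MB is an arbitrary illustrative value.
        SparkSession spark = buildSession(50L * 1024 * 1024);

        // Read the current value back (returned as a string).
        System.out.println(KEY + " = " + spark.conf().get(KEY));

        // Setting the threshold to -1 turns automatic broadcasting off.
        spark.conf().set(KEY, -1L);
        System.out.println(KEY + " = " + spark.conf().get(KEY));

        spark.stop();
    }
}
```

The threshold only governs joins that Spark decides to broadcast on its own; an explicit broadcast hint on a DataFrame is a separate mechanism.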

If you are running into an error, can you please post that as well? Thanks!

Re: Is there a way to do broadcast join in Spark 2.1 in java (accepted solution)

I am not talking about broadcast variables; I am talking about the broadcast hint in a join:

join(broadcast(right),...)

The 'broadcast' here is a function defined specifically for DataFrames:

public static org.apache.spark.sql.DataFrame broadcast(org.apache.spark.sql.DataFrame dataFrame) { /* compiled code */ }

It is different from the broadcast variable explained in your link, which has to be created through a Spark context, as below:

sc.broadcast(...)
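To make the distinction concrete, here is a minimal Java sketch for Spark 2.x, where a Java DataFrame is a Dataset<Row> and the hint is the static function org.apache.spark.sql.functions.broadcast. It assumes spark-sql 2.1 is on the classpath and runs in local mode; the class name, the table helper, and the sample rows are invented for illustration.

```java
import static org.apache.spark.sql.functions.broadcast;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class BroadcastJoinSketch {

    // Broadcast *hint*: marks `right` as small enough to copy to every executor,
    // then joins on the shared column "key".
    static Dataset<Row> joinWithHint(Dataset<Row> left, Dataset<Row> right) {
        return left.join(broadcast(right), "key");
    }

    // Hypothetical helper: builds a two-column table (key: int, <valueColumn>: string).
    static Dataset<Row> table(SparkSession spark, String valueColumn, List<Row> rows) {
        StructType schema = new StructType()
                .add("key", DataTypes.IntegerType)
                .add(valueColumn, DataTypes.StringType);
        return spark.createDataFrame(rows, schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("broadcast-join-sketch")
                .master("local[*]")   // assumption: local mode, for illustration only
                .getOrCreate();

        // Made-up sample data.
        Dataset<Row> left = table(spark, "item", Arrays.asList(
                RowFactory.create(1, "apple"), RowFactory.create(2, "pear")));
        Dataset<Row> right = table(spark, "color", Arrays.asList(
                RowFactory.create(1, "red"), RowFactory.create(2, "green")));

        Dataset<Row> joined = joinWithHint(left, right);
        joined.explain();   // prints the physical plan chosen for the join
        joined.show();

        // Broadcast *variable*: a different feature, created from the context
        // and read on executors through value().
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        Broadcast<List<Integer>> keys = jsc.broadcast(Arrays.asList(1, 2));
        System.out.println(keys.value());

        spark.stop();
    }
}
```

If the static import does not resolve in an IDE, it is worth checking that the project depends on the spark-sql artifact for the Spark version in use, since functions.broadcast lives there rather than in spark-core.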

Re: Is there a way to do broadcast join in Spark 2.1 in java

What error are you getting when you try to use it, then?

This is what I used in Spark 1.6.1 (Scala):

import org.apache.spark.sql.functions.broadcast

val joined_df = df1.join(broadcast(df2), "key")

Re: Is there a way to do broadcast join in Spark 2.1 in java

It works now; it was a problem with my IDE.

Thanks

Re: Is there a way to do broadcast join in Spark 2.1 in java

Hi @X Long, how about up-voting some answers? The guys tried to help you but could not have imagined that the issue was something trivial with your IDE. Give and take. Thanks!
