Created 03-08-2017 05:33 PM
I noticed that in Spark 1.6 (Java) we can write:
join(broadcast(right), ...)
but it looks like the broadcast function is not available in Spark 2.1.0.
Created 03-08-2017 05:42 PM
Hi @X Long
The official documentation does include it:
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
Here is one tutorial using Spark 2:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-broadcast.html
Created 03-08-2017 05:46 PM
I do not believe it was removed in Spark 2.1.0. Here's the documentation for Broadcast Variables (for Scala, Java, and Python): http://spark.apache.org/docs/2.1.0/programming-guide.html#broadcast-variables
You may also need to set the spark.sql.autoBroadcastJoinThreshold parameter if you are running into errors. This parameter sets the maximum size (in bytes) of a table that will be broadcast to all worker nodes when performing a join.
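For example, one way to adjust it at runtime on a SparkSession (illustrative only; the session name spark and the 50 MB value are just placeholders):
// Illustrative sketch: raise the automatic broadcast threshold to ~50 MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)
// Setting it to -1 disables automatic broadcast joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)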
If you are running into an error, can you please post it as well? Thanks!
Created 03-08-2017 06:03 PM
I am not talking about broadcast variables; I am talking about the broadcast hint in a join:
join(broadcast(right), ...)
The 'broadcast' here is a function defined specifically for DataFrames:
public static org.apache.spark.sql.DataFrame broadcast(org.apache.spark.sql.DataFrame dataFrame) { /* compiled code */ }
It is different from the broadcast variable explained in your link, which needs to be called on a SparkContext, as below:
sc.broadcast(...)
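To make the distinction concrete, here is a minimal Scala sketch of both usages; the DataFrame names, the "key" column, and the lookup map are made up for illustration:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-hint-vs-variable").getOrCreate()
import spark.implicits._

// Hypothetical small DataFrames sharing a "key" column.
val df1 = Seq((1, "a"), (2, "b")).toDF("key", "value")
val df2 = Seq((1, "x"), (2, "y")).toDF("key", "other")

// 1) Broadcast *hint*: marks df2 as small enough to ship to every executor,
//    so the planner can pick a broadcast hash join.
val joined = df1.join(broadcast(df2), "key")

// 2) Broadcast *variable*: a read-only value distributed once per executor,
//    created via the SparkContext; unrelated to the join hint above.
val lookup = spark.sparkContext.broadcast(Map(1 -> "x", 2 -> "y"))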
Created 03-08-2017 06:11 PM
What is the error you are getting when you try to use it, then?
This is what I used in Spark 1.6.1
import org.apache.spark.sql.functions.broadcast
val joined_df = df1.join(broadcast(df2), "key")
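If you want to confirm the hint is actually taking effect (just a suggestion, reusing the same joined_df), the physical plan should show a BroadcastHashJoin:
// Should print a plan containing BroadcastHashJoin when the hint is honored.
joined_df.explain()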
Created 03-08-2017 10:23 PM
It works now; it was a problem with my IDE.
Thanks!
Created 03-09-2017 01:40 AM
Hi @X Long, how about up-voting some answers? The guys tried to help you and could not have imagined that the issue was something trivial with your IDE. Give and take. Tnx!