Is there a way to broadcast a Dataframe/RDD without a collect first?


Is there a way to broadcast a Dataframe/RDD without doing the collect first?

I am thinking this could avoid a copy to the driver first.

I did notice that there is a broadcast function that is used in the broadcast join for the DataFrame.

   public static DataFrame broadcast(DataFrame df)

Marks a DataFrame as small enough for use in broadcast joins. The following example marks the right DataFrame for broadcast hash join using joinKey.

   // left and right are DataFrames
   left.join(broadcast(right), "joinKey")
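
For anyone who wants to try the snippet end to end, here is a minimal sketch, assuming Spark 2.x or later (so SparkSession is available) and a local master. The sample data, the column names other than joinKey, the app name, and the object name are all made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; only the "joinKey" column name comes from the snippet above.
    val left = Seq((1, "a"), (2, "b"), (3, "c")).toDF("joinKey", "leftValue")
    val right = Seq((1, "x"), (2, "y")).toDF("joinKey", "rightValue")

    // broadcast() only marks the DataFrame with a hint; it returns a DataFrame
    // and nothing is shipped to executors until an action runs the join.
    val joined = left.join(broadcast(right), "joinKey")

    // Print the physical plan so you can check which join strategy was chosen.
    joined.explain()
    joined.show()

    spark.stop()
  }
}
```

Checking the output of explain() is the easiest way to confirm whether the hint was actually honored for a given query.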

It seems that Spark determines automatically when broadcasting is needed once it finds that a join operation is required.

What I am wondering is whether the above broadcast function still works if I want to use it in some other, more general context, i.e., does the broadcast still occur?

The other thing is, after the broadcasting, does the partition concept still exist for the DataFrame, e.g., can I still apply functions like mapPartitions to the DataFrame?
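
To make both questions concrete, here is a rough sketch, again assuming Spark 2.x or later and a local master, with made-up data and names. It relies on two things: the broadcast function only attaches a join hint and returns an ordinary DataFrame, so partition-level operations remain available on it; and SparkContext.broadcast expects a local value rather than an RDD, so the explicit broadcast variable in the second half is still built from a collect. Treat it as an illustration of the two options, not as a way around the copy to the driver.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastGeneralUseSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-general-use-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val small = Seq((1, "x"), (2, "y")).toDF("joinKey", "rightValue")
    val large = Seq(1, 2, 3, 4).toDF("joinKey")

    // 1) The hinted DataFrame is still a partitioned DataFrame:
    //    mapPartitions on its underlying RDD works as usual.
    val hinted = broadcast(small)
    val rowsPerPartition = hinted.rdd
      .mapPartitions(rows => Iterator(rows.size))
      .collect()
    println(s"rows per partition of the hinted DataFrame: ${rowsPerPartition.mkString(", ")}")

    // 2) General-purpose broadcast outside a join: collect the small data to
    //    the driver, then ship it once per executor as a broadcast variable.
    val lookup: Map[Int, String] = small
      .collect()
      .map(row => row.getInt(0) -> row.getString(1))
      .toMap
    val lookupBc = spark.sparkContext.broadcast(lookup)

    // Each partition of the large data reads the broadcast value locally.
    val resolved = large.rdd.mapPartitions { rows =>
      val table = lookupBc.value
      rows.map { row =>
        val key = row.getInt(0)
        (key, table.getOrElse(key, "unknown"))
      }
    }
    resolved.collect().foreach(println)

    spark.stop()
  }
}
```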

Thanks

1 ACCEPTED SOLUTION
