I used explain() and here is what I found.
Without explicitly calling functions.broadcast(), there was no mention of broadcasting at all. When I did call broadcast(), a broadcast hint appeared in the logical plans, but the final physical plan still did not mention broadcasting.
Either way, I don't see anything like a broadcast RDD in the output (is that because I'm using DataFrames?)
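To make the question concrete, here is a plain-Python sketch (not Spark itself, and not what explain() prints) of what a broadcast hash join does conceptually: the small table is shipped to every task and the join becomes a local hash-map lookup, so the large side never needs to shuffle. The table contents are made up for illustration.

```python
# Conceptual sketch (plain Python, not Spark): a broadcast hash join
# ships the small table to every task and turns the join into a local
# hash-map probe, so no shuffle of the large side is needed.

small = [(1, "a"), (2, "b")]                           # hypothetical small dimension table
large = [(1, 10.0), (2, 20.0), (1, 30.0), (3, 40.0)]   # hypothetical large fact table

# "Broadcast": build a hash map of the small side once.
lookup = dict(small)

# Each task then probes the map row by row (inner join semantics).
joined = [(k, v, lookup[k]) for k, v in large if k in lookup]

print(joined)  # [(1, 10.0, 'a'), (2, 20.0, 'b'), (1, 30.0, 'a')]
```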
In order for the broadcast join to work, you need to run ANALYZE TABLE [tablename] COMPUTE STATISTICS NOSCAN. Otherwise, it won't know the table's size and can't decide to do the broadcast map join.

Also, you mention you are trying to do a groupBy. You should use reduceByKey instead; that will dramatically improve performance. Spark will do the count on the map side, then distribute the results to the reducers (if you want to think about it in map-reduce terms). Switching to reduceByKey alone should solve your performance issue.
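For reference, this is roughly how that looks in Spark SQL (the table name `my_table` is a placeholder). The size threshold that governs automatic broadcasting is the `spark.sql.autoBroadcastJoinThreshold` property:

```sql
-- Collect table-level statistics (size, row count) without scanning rows;
-- "my_table" is a placeholder name.
ANALYZE TABLE my_table COMPUTE STATISTICS NOSCAN;

-- Tables below this size in bytes are candidates for broadcast; -1 disables it.
SET spark.sql.autoBroadcastJoinThreshold = 10485760;
```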
Agreed 100%. If you can accomplish the same task using reduceByKey, use it: it implements a combiner, so it basically does the aggregation locally and then shuffles the combined results for each partition. Just keep an eye on GC when doing this.
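A plain-Python sketch (not Spark itself) of why the combiner matters for a word count: with groupByKey semantics every record crosses the shuffle, while reduceByKey pre-aggregates inside each partition so only one record per key per partition is shuffled. The partition contents are made up for illustration.

```python
from collections import Counter

# Two hypothetical partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

# groupByKey-style: every record is shuffled, then counted on the reducer.
shuffled_group = sum(len(p) for p in partitions)   # 7 records cross the network

# reduceByKey-style: combine locally first, then shuffle one record
# per key per partition.
local = []
for p in partitions:
    c = Counter()
    for k, v in p:
        c[k] += v
    local.append(c)
shuffled_reduce = sum(len(c) for c in local)       # 4 records cross the network

# Reducer side: merge the partial counts.
totals = Counter()
for c in local:
    totals.update(c)

print(shuffled_group, shuffled_reduce, dict(totals))  # 7 4 {'a': 4, 'b': 3}
```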
I'm doing a groupBy using DataFrames. I considered converting to an RDD, doing reduceByKey, and then converting back to a DataFrame, but DataFrames already apply under-the-hood optimizations, so I assumed I wouldn't need to worry about doing the local aggregation myself.
If you call groupByKey on a DataFrame, it implicitly converts the DataFrame to an RDD, and you lose all the benefits of the optimizer for that step.
I should have read the post a little closer; I thought you were doing a groupByKey. You are correct: you need to use groupBy to keep the execution within the DataFrame API and out of Python. However, you said you are doing an outer join. If it is a left join and the right side is larger than the left, then do an inner join first, and do your left join on the result. The inner-join result will most likely be small enough to be broadcast for the left join. This is a pattern Holden described at Strata this year in one of her sessions.
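The two-step pattern above can be sketched in plain Python on toy dicts (not the Spark API; the tables are made up for illustration): the inner join keeps only matching keys, and the left join then reattaches the unmatched left rows with nulls, probing only the small inner result.

```python
# Toy illustration (plain Python, not Spark) of the pattern:
# a left outer join against a large right side, rewritten as an
# inner join first, then a left join against the (small) inner result.

left = {1: "x", 2: "y", 3: "z"}            # smaller left table: id -> value
right = {2: "B", 3: "C", 4: "D", 5: "E"}   # larger right table: id -> value

# Step 1: inner join -- keeps only ids present on both sides.
inner = {k: (left[k], right[k]) for k in left if k in right}

# Step 2: left join of `left` against the small inner result.
# In Spark, `inner` would now be small enough to broadcast.
outer = {k: inner.get(k, (left[k], None)) for k in left}

print(outer)  # {1: ('x', None), 2: ('y', 'B'), 3: ('z', 'C')}
```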