
Tuning parallelism: increase or decrease?


Re: Tuning parallelism: increase or decrease?

Explorer

I used explain() and here is what I found.

I found that without explicitly calling functions.broadcast(), there was no mention of broadcasting.

When I called broadcast(), there was a broadcast hint in the logical plans but the final physical plan did not mention broadcasting.

However, I do not see anything like a broadcast RDD anywhere in the plan (is this because I'm using DataFrames?)


Re: Tuning parallelism: increase or decrease?

Expert Contributor

Can you share the explain plan?


Re: Tuning parallelism: increase or decrease?

In order for the broadcast join to work, you need to run ANALYZE TABLE [tablename] COMPUTE STATISTICS NOSCAN. Otherwise, Spark won't know the table's size and can't choose a broadcast map join. Also, you mention you are trying to do a groupBy. You should use reduceByKey instead; that will dramatically increase the performance. Spark will do the count in the map, then distribute the partial results to the reducers (if you want to think about it in map-reduce terms). Switching to reduceByKey alone should solve your performance issue.

Re: Tuning parallelism: increase or decrease?

Expert Contributor

Agreed 100%. If you can accomplish the same task using reduceByKey, it implements a combiner, so it basically does the aggregation locally, then shuffles the combined results from each partition. Just keep an eye on GC when doing this.
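The combiner behavior described above can be illustrated in plain Python (a simulation of the idea, not Spark code): each partition is aggregated locally first, so only one record per key per partition crosses the simulated shuffle.

```python
# Plain-Python simulation of reduceByKey's map-side combine (not Spark code).
from collections import defaultdict

def map_side_combine(partitions):
    """Aggregate (key, count) pairs within each partition before 'shuffling',
    so at most one record per key per partition is sent over the wire."""
    shuffled = defaultdict(int)
    for part in partitions:
        local = defaultdict(int)           # the combiner: per-partition totals
        for key, value in part:
            local[key] += value
        for key, value in local.items():   # only combined records move
            shuffled[key] += value
    return dict(shuffled)

# Two partitions of word-count-style records.
parts = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1)],
]
totals = map_side_combine(parts)
```

A plain groupByKey, by contrast, would ship every individual record across the shuffle before counting.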


Re: Tuning parallelism: increase or decrease?

Explorer

I'm doing a groupBy using DataFrames. I considered converting to an RDD, doing reduceByKey, then converting back to a DataFrame, but DataFrames already provide under-the-hood optimizations, so I shouldn't need to implement local aggregation myself.


Re: Tuning parallelism: increase or decrease?

Expert Contributor

If you call groupByKey on a DataFrame, it implicitly converts the DataFrame to an RDD, and you lose all the benefits of the optimizer for that step.


Re: Tuning parallelism: increase or decrease?

I should have read the post a little closer; I thought you were doing a groupByKey. You are correct: you should use groupBy to keep the execution within the DataFrame engine and out of Python. However, you said you are doing an outer join. If it is a left join and the right side is larger than the left, do an inner join first, then do your left join on that result. The result of the inner join will most likely be small enough to be broadcast for the left join. This is a pattern Holden described at Strata this year in one of her sessions.
