Spark Dataframe join on multiple columns is too slow

Contributor

I am using a 4-node cluster with 160 GB of RAM, configured with 8 cores per worker node. I am fetching 3 datasets from S3: one is around 4 GB and the other two together are around 440 GB, so the total is roughly 450 GB. I have done a lot of filtering before performing the DataFrame joins so that I can get better performance.

I have also tried coalesce and repartitioning, but no luck in terms of execution time; it comes out about the same either way. I am not sure how Spark decides the number of partitions for S3 input, but one dataset gets 200 partitions while the bigger dataset gets 90,000 partitions/tasks, and coalescing or repartitioning those did not improve performance for me. It takes around 1 hour to execute an action on the 450 GB joined DataFrame.

I am not sure whether I can use a broadcast hash join here: one of the datasets is 4 GB and can fit in memory, but I need to join on around 6 columns. Apart from that, I also tried saving the joined DataFrame as a table via registerTempTable and running the action on it to avoid a lot of shuffling, but that did not work either; it still shows 5 stages and 90,000 tasks for the action.

Is one hour a normal execution time, or how can I improve performance here by reducing shuffling? Any tips are appreciated. Thank you.


Re: Spark Dataframe join on multiple columns is too slow

Contributor

Hello, can you explain a bit more about the data: volume and key distribution?

Re: Spark Dataframe join on multiple columns is too slow

How many executors are you using, and what is the size of each executor? Also see the slides and video about executor size selection.