How Spark distributes partitions to executors and how to distribute them evenly
- Labels: Apache Spark
Created 04-19-2021 04:30 PM
I'm facing a severe performance issue: a job suddenly takes 4x as long to complete, with no code changes. After debugging and investigating, I found that most of the data is read by a single executor (13 GB on one executor vs. 200 MB on each of the rest). Initially I thought it was a classic uneven-partitions issue, so I started testing different partition numbers (and partitioning criteria), but that did not fix it. I then did a rows-per-partition analysis and found that all partitions have a similar number of rows, so uneven partitions are not the problem; it seems the scheduler assigns most partitions to a single executor instead of spreading them out. My questions are: how does Spark decide which partitions go to which executor, and how can I control that behavior to make the distribution even?
I asked this on SO too: https://stackoverflow.com/questions/67133177/how-spark-distributes-partitions-to-executors
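For reference, the rows-per-partition check I mentioned looks roughly like this; the table name is just a placeholder for the output of my actual Hive query:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Placeholder: in the real job this is the result of the 15-table Hive SQL join.
val df = spark.table("some_database.some_table")

// Count rows per partition; the counts come out similar across partitions,
// so the data itself is evenly partitioned.
df.groupBy(spark_partition_id().as("partition_id"))
  .count()
  .orderBy("partition_id")
  .show(500, truncate = false)
```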
Created 04-20-2021 12:16 AM
Hello,
This sounds like a Spark data skew problem. Some good approaches to consider:
1. Broadcast join
2. Key Salting
You can refer to this article for more detail: https://itnext.io/handling-data-skew-in-apache-spark-9f56343e58e8 (and see the sketch below).
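To illustrate, here is a minimal sketch of both ideas as DataFrame code; all table, column, and salt names are placeholders, so adapt them to your own schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, col, concat_ws, explode, floor, lit, rand, sequence}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

val facts = spark.table("db.big_skewed_table")   // large side with a hot join key
val dims  = spark.table("db.small_table")        // small enough to broadcast

// 1. Broadcast join: the small side is shipped to every executor,
//    so the large side is never shuffled on the join key.
val broadcastJoined = facts.join(broadcast(dims), Seq("join_key"))

// 2. Key salting: spread each hot key over numSalts artificial sub-keys.
val numSalts = 16

// Add a random salt to the large side.
val saltedFacts = facts
  .withColumn("salt", floor(rand() * numSalts))
  .withColumn("salted_key",
    concat_ws("_", col("join_key").cast("string"), col("salt").cast("string")))

// Replicate every row of the other side once per possible salt value.
val saltedDims = dims
  .withColumn("salt", explode(sequence(lit(0), lit(numSalts - 1))))
  .withColumn("salted_key",
    concat_ws("_", col("join_key").cast("string"), col("salt").cast("string")))
  .drop("join_key", "salt")

// Join on the salted key, then drop the helper columns.
val saltedJoined = saltedFacts
  .join(saltedDims, Seq("salted_key"))
  .drop("salt", "salted_key")
```

The replication factor (numSalts here) trades a more even shuffle against duplicating the smaller side numSalts times, so it is usually kept fairly small.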
Created 04-20-2021 11:26 AM
Thank you, I appreciate the comment. This issue occurs after a Hive SQL query that joins around 15 tables (some of them big), so I don't think a broadcast join applies. Salting would mean breaking the query apart and running the joins with Spark functions instead of Hive SQL, which could be time-consuming given the number of tables. So my question is: is there any other way to force Spark to distribute the partitions evenly across executors?
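For context, what I already tried when testing different partition numbers and criteria looks roughly like this (the counts and column name are just examples):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val df = spark.table("some_database.some_table")  // placeholder for the query output

// Changing these affects how many partitions exist and how rows are split
// across them, but it did not change which executor the scheduler assigns them to.
spark.conf.set("spark.sql.shuffle.partitions", "400")

val byNumber   = df.repartition(400)                      // partition count only
val byCriteria = df.repartition(400, col("customer_id"))  // count plus a hash column
```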
