I have spark job that reading large amount of HDFS files.
I run the job in resource pool with limited executores and vcores per executor, but i still see high HDFS I/O.
Also i reduced number of executor for the job but still affecting the cluster HDFS I/O.
Any suggestion how to protect the cluster from this?
I'm not clear what the problem is. Resource pools and Spark mechanisms don't allocate I/O. Smaller jobs don't cause less I/O, though, they might take longer and spread out that I/O. If you're seriously I/O bound, then both large and small jobs might hit the same limit. It's not clear the I/O is "too high" or that it's not normal.
Example: I have spark job that running on 30 GB of HDFS, does the HDFS I/O will be the same if i run this Spark job with 5 executors and 3 vcores per executors vs 15 executors and 5 vcores per executors, i mean is there away to optimize the I/O by tunning executors and vcores per executors.
It's hard to know how to optimize I/O without know the underlying hardware, like how many servers, how many disks and how many cores for cpu. Creating more executors may distribute across more server if you are seeing hotspots. You don't want too many cores per executor as you may have too many threads trying to read from the hard drives. You'll want to play around with some settings and monitor your resources to get a better understanding.
Does this mean that running spark job with 80 executors, 5 vcore per executor and memory 6GB will reulting in more I/O than running 200 executor, 2 vcore per executor and 2.5 GB
It's unkown as you won't be able to determine how many executors get placed on the same server by YARN. This will all depend on how much resource you allocate for YARN and how much is currently available for each server, and how many servers are available.
It is usually best to have around 5 vcores per executor as this is a good balance between limiting overhead of starting more executors (less jvms, sharing of broadcast variables and block process local) and low enough to not oversubscribe I/O and allowing more granular placement into YARN containers.
Sandy Ryza has a great blog post on tuning spark executors.