Spark and high HDFS I/O

Super Collaborator

Hi,

 

I have a Spark job that reads a large number of HDFS files.

 

I run the job in a resource pool with a limited number of executors and limited vcores per executor, but I still see high HDFS I/O.

 

I also reduced the number of executors for the job, but it is still affecting the cluster's HDFS I/O.
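
For context, this is roughly how I submit the job; the queue name, executor counts, memory, and path below are placeholder values for illustration, not my real setup:

// Sketch of the current submission, with placeholder values.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hdfs-heavy-job")
  .config("spark.yarn.queue", "limited_pool")          // the resource pool the job runs in
  .config("spark.dynamicAllocation.enabled", "false")  // keep the executor count fixed
  .config("spark.executor.instances", "10")            // limited number of executors
  .config("spark.executor.cores", "3")                 // limited vcores per executor
  .config("spark.executor.memory", "4g")
  .getOrCreate()

// The job itself just scans a large amount of HDFS data (hypothetical path and format).
val df = spark.read.parquet("hdfs:///data/large_dataset")
println(df.count())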

 

Any suggestions on how to protect the cluster from this?

8 REPLIES

Re: Spark and high HDFS I/O

Super Collaborator

Any insights?

Re: Spark and high HDFS I/O

Master Collaborator

I'm not clear on what the problem is. Resource pools and Spark mechanisms don't allocate I/O. Smaller jobs don't cause less I/O; they may just take longer and spread that I/O out. If you're seriously I/O bound, then both large and small jobs might hit the same limit. It's also not clear that the I/O is actually "too high" rather than normal.

Re: Spark and high HDFS I/O

Super Collaborator

Example: I have a Spark job that reads 30 GB from HDFS. Will the HDFS I/O be the same if I run this job with 5 executors and 3 vcores per executor vs. 15 executors and 5 vcores per executor? In other words, is there a way to optimize the I/O by tuning the number of executors and vcores per executor?
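
To make the comparison concrete, these are the two layouts I mean, written out with SparkConf purely for illustration:

import org.apache.spark.SparkConf

// Layout A: 5 executors x 3 vcores = at most 15 tasks reading HDFS at once.
val confA = new SparkConf()
  .set("spark.executor.instances", "5")
  .set("spark.executor.cores", "3")

// Layout B: 15 executors x 5 vcores = at most 75 tasks reading HDFS at once.
val confB = new SparkConf()
  .set("spark.executor.instances", "15")
  .set("spark.executor.cores", "5")

// Either way the job reads the same 30 GB in total; my question is whether
// the number of concurrent readers changes the pressure on HDFS.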

Re: Spark and high HDFS I/O

Super Collaborator

Can anyone help here?

Re: Spark and high HDFS I/O

Expert Contributor

It's hard to know how to optimize I/O without knowing the underlying hardware: how many servers, how many disks, and how many CPU cores. Creating more executors may distribute the load across more servers if you are seeing hotspots. You don't want too many cores per executor, as you may end up with too many threads trying to read from the hard drives. You'll want to play around with some settings and monitor your resources to get a better understanding.

Re: Spark and high HDFS I/O

Super Collaborator

Does this mean that running the Spark job with 80 executors, 5 vcores per executor, and 6 GB of memory will result in more I/O than running it with 200 executors, 2 vcores per executor, and 2.5 GB?

Re: Spark and high HDFS I/O

Expert Contributor

It's unknown, as you won't be able to determine how many executors YARN places on the same server. This all depends on how much resource you allocate for YARN, how much is currently available on each server, and how many servers are available.

 

It is usually best to have around 5 vcores per executor. This is a good balance: it limits the overhead of starting more executors (fewer JVMs, shared broadcast variables, and process-local block reads), while staying low enough not to oversubscribe I/O and allowing more granular placement into YARN containers.
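
As a rough sketch of how that guideline plays out (the node count and per-node resources here are assumed numbers for illustration, not your cluster):

// Assumed cluster, for illustration only: 10 worker nodes, each with
// 16 vcores and 64 GB usable by YARN.
val nodes = 10
val vcoresPerNode = 16
val coresPerExecutor = 5  // the ~5 vcores per executor guideline

// Leave roughly one core per node for the OS and Hadoop daemons.
val executorsPerNode = (vcoresPerNode - 1) / coresPerExecutor   // = 3

// Leave one executor slot for the YARN ApplicationMaster.
val totalExecutors = nodes * executorsPerNode - 1               // = 29

println(s"--num-executors $totalExecutors --executor-cores $coresPerExecutor")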

 

Sandy Ryza has a great blog post on tuning Spark executors [1].

 

1.  https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/


Re: Spark and high HDFS I/O

Super Collaborator

Thanks, I already saw this blog.