Created 02-19-2016 03:28 PM
Given that Spark does in-memory processing while Hadoop MapReduce is largely disk-based (higher disk I/O), I was wondering about sizing containers and RAM: do we need more RAM to run the same use case with Spark than with Hadoop MapReduce?
Created 02-19-2016 04:45 PM
The answer is a little more, but not for the reason you might think. Many people see Spark and think "in memory!", but that is only partially true. Spark can cache intermediate data in memory and reuse it instead of reading it again from HDFS, but only if the developer explicitly tells Spark to put the data in memory (see the sketch below).
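A minimal sketch of what "explicitly telling Spark" looks like with the RDD API; the HDFS path and filter condition are hypothetical, used only for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheSketch"))

    // Hypothetical input path; nothing is held in memory at this point.
    val lines = sc.textFile("hdfs:///data/events")

    // Only an explicit persist/cache call keeps data in memory.
    // MEMORY_AND_DISK keeps what fits in RAM and spills the rest to local disk.
    val cached = lines.persist(StorageLevel.MEMORY_AND_DISK)

    // The second action reuses the cached partitions instead of re-reading from HDFS.
    println(cached.count())
    println(cached.filter(_.contains("ERROR")).count())

    cached.unpersist()
    sc.stop()
  }
}
```

Without the persist() call, each action above would trigger a fresh read from HDFS, which is the behavior people often assume Spark avoids automatically.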
The reason for the slightly higher memory requirement is that Spark uses "executors" instead of mappers and reducers, and executors carry a higher memory overhead than standard mappers/reducers. So you will need a little more RAM at a minimum. If your data set doesn't fit completely into memory, Spark spills to local disk just like MR and continues processing. So even if you don't have TONS of memory, Spark will still be able to process your application.
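A rough sketch of how that overhead shows up when sizing executors on YARN; the 4 GB heap, 512 MB overhead, and 2 cores are assumed example values, not recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MemorySizingSketch {
  def main(args: Array[String]): Unit = {
    // If your MR containers were sized around 4 GB, a comparable Spark executor
    // needs that heap PLUS off-heap overhead that YARN must fit into the container.
    val conf = new SparkConf()
      .setAppName("MemorySizingSketch")
      .set("spark.executor.memory", "4g")                 // executor JVM heap
      .set("spark.yarn.executor.memoryOverhead", "512")   // off-heap overhead in MB (Spark 1.x property name)
      .set("spark.executor.cores", "2")                   // tasks sharing that one heap

    val sc = new SparkContext(conf)
    // ... job logic goes here ...
    sc.stop()
  }
}
```

The YARN container for this executor ends up at roughly 4 GB + 512 MB, which is where the "a little more RAM" comes from compared with a 4 GB map or reduce task.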
Created 02-19-2016 06:41 PM
Yep, what @Joe Widen said, and sprinkle in a bit of "it depends" on top for good measure. If I put my consulting cap on again, I'd ask "what are you trying to do?" Meaning, is the job in question more about memory usage or about overall compute time to complete? There are factors at play that don't allow a perfect yes-or-no answer to the question above, so if you really need a more data-driven decision, I'd recommend picking an indicative use case from your data processing and running a small bake-off on your cluster based on what matters most to you. Chances are your decision will be based more on what you plan to develop going forward than on the answer to this question about memory usage.