Created on 05-09-2016 07:40 PM - edited 09-16-2022 03:18 AM
Hello,
I have "stupid" question of a beginner in SparkSQL. 🙂 Just to correctly understand how it works:
If I have a HBase table with 28Tera of data (90Billions of lines) and only a few tera of memory in the cluster, what's going to happen?
The dataframe will be bigger that the available memory, so is it going to swap? It will crash? I would like to clearly understand the mecanisme that spark used to manage that.
Also do you have any recommendation in term of infrastructure to handle this kind of database?
Thanks in advance for all the info 🙂
Michel
Created 05-10-2016 11:02 AM
Contrary to popular believe Spark is not in-memory only
a) Simple read no shuffle ( no joins, ... )
For the initial reads Spark like MapReduce reads the data in a stream and processes it as it comes along. I.e. unless there is a reason spark will NOT materialize the full RDDs in memory ( you can tell him to do it however if you want to cache a small dataset ) An RDD is resilient because spark knows how to recreate it ( re read a block from hdfs for example ) not because its stored in mem in different locations. ( that can be done too though. )
So if you filter out most of your data or do an efficient aggregation that aggregates on the map side you will never have the full table in memory.
b) Shuffle
This is done very similarly to MapReduce as it writes the map outputs to disc and reads them with the reducers through http. However spark uses an aggressive filesystem buffer strategy on the Linux filesystem so if the OS has memory available the data will not be actually written to physical disc.
c) After Shuffle
RDDs after shuffle are normally cached by the engine ( otherwise a failed node or RDD would require a complete re run of the job ) however as abdelkrim mentions Spark can spill these to disc unless you overrule that.
d) Spark Streaming
This is a bit different. Spark streaming expects all data to fit in memory unless you overwrite settings.
Created 05-09-2016 10:31 PM
There are several details around Spark memory management especially since Spark 1.6. But to simply answer your question, if the dataset does not fit in memory, Spark spills to disk. It's clear that this behavior is necessary for Spark to work with large datasets but it impacts the performances. You can have an idea on this in the initial Spark's design dissertation by Matei Zaharia (section 2.6.4 Behavior with Insufficient Memory).
Created 05-10-2016 11:02 AM
Contrary to popular believe Spark is not in-memory only
a) Simple read no shuffle ( no joins, ... )
For the initial reads Spark like MapReduce reads the data in a stream and processes it as it comes along. I.e. unless there is a reason spark will NOT materialize the full RDDs in memory ( you can tell him to do it however if you want to cache a small dataset ) An RDD is resilient because spark knows how to recreate it ( re read a block from hdfs for example ) not because its stored in mem in different locations. ( that can be done too though. )
So if you filter out most of your data or do an efficient aggregation that aggregates on the map side you will never have the full table in memory.
b) Shuffle
This is done very similarly to MapReduce as it writes the map outputs to disc and reads them with the reducers through http. However spark uses an aggressive filesystem buffer strategy on the Linux filesystem so if the OS has memory available the data will not be actually written to physical disc.
c) After Shuffle
RDDs after shuffle are normally cached by the engine ( otherwise a failed node or RDD would require a complete re run of the job ) however as abdelkrim mentions Spark can spill these to disc unless you overrule that.
d) Spark Streaming
This is a bit different. Spark streaming expects all data to fit in memory unless you overwrite settings.