Support Questions

Find answers, ask questions, and share your expertise

dataframe bigger than available memory

Rising Star

Hello,

I have a "stupid" beginner's question about SparkSQL. 🙂 Just to make sure I understand correctly how it works:

If I have an HBase table with 28 TB of data (90 billion rows) and only a few terabytes of memory in the cluster, what is going to happen?

The DataFrame will be bigger than the available memory, so is it going to swap? Will it crash? I would like to clearly understand the mechanism Spark uses to manage this.

Also, do you have any recommendations in terms of infrastructure to handle this kind of database?

Thanks in advance for all the info 🙂

Michel

1 ACCEPTED SOLUTION

Master Guru

Contrary to popular belief, Spark is not in-memory only.

a) Simple read no shuffle ( no joins, ... )

For the initial reads, Spark, like MapReduce, reads the data as a stream and processes it as it comes along. In other words, unless there is a reason to, Spark will NOT materialize the full RDDs in memory (you can tell it to do so, however, if you want to cache a small dataset). An RDD is resilient because Spark knows how to recreate it (re-read a block from HDFS, for example), not because it is stored in memory in different locations (although that can be done too).

So if you filter out most of your data, or do an efficient aggregation that aggregates on the map side, you will never have the full table in memory.
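To make this concrete, here is a plain-Python sketch of the same streaming model (not Spark's actual code, just an illustration): each record flows through the filter one at a time, so the full "table" is never held in memory at once, no matter how big the source is.

```python
# Illustration of lazy, record-at-a-time processing (generators stand in
# for Spark's streamed block reads; the data below is made up).

def read_records():
    # Stand-in for a block-by-block HBase/HDFS scan.
    for i in range(1_000_000):
        yield {"id": i, "value": i % 100}

def pipeline():
    records = read_records()                        # lazy: nothing read yet
    kept = (r for r in records if r["value"] == 0)  # filter, still lazy
    return sum(1 for _ in kept)                     # only now does data flow

print(pipeline())  # 10000
```

At no point does the pipeline hold more than one record at a time; the same principle is why a filtering Spark job over 28 TB does not need 28 TB of memory.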

b) Shuffle

This works very much like MapReduce: the map outputs are written to disk and read by the reducers over HTTP. However, Spark relies on the Linux filesystem's aggressive buffering, so if the OS has memory available, the data will often never actually be written to physical disk.
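A minimal sketch of that shuffle pattern, in plain Python rather than Spark's real implementation: each "map task" hash-partitions its output into one file per reducer, and each "reduce task" reads back only its own partition file.

```python
# Hypothetical mini-shuffle: hash-partition map output to per-reducer
# files on disk, then read each partition back on the reduce side.
import os
import pickle
import tempfile
from collections import defaultdict

NUM_REDUCERS = 2

def map_side_write(records, out_dir):
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % NUM_REDUCERS].append((key, value))
    for part, rows in buckets.items():
        with open(os.path.join(out_dir, f"part-{part}.pkl"), "wb") as f:
            pickle.dump(rows, f)   # map output lands on disk (or page cache)

def reduce_side_read(part, out_dir):
    path = os.path.join(out_dir, f"part-{part}.pkl")
    if not os.path.exists(path):
        return []
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as d:
    map_side_write([("a", 1), ("b", 2), ("a", 3)], d)
    merged = sorted(reduce_side_read(0, d) + reduce_side_read(1, d))
    print(merged)  # [('a', 1), ('a', 3), ('b', 2)]
```

Because each side streams through files rather than exchanging whole datasets in memory, the shuffle itself never requires the full table to fit in RAM.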

c) After Shuffle

RDDs after a shuffle are normally cached by the engine (otherwise a failed node or lost RDD would require a complete re-run of the job); however, as abdelkrim mentions, Spark can spill these to disk unless you override that.
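The "spill instead of fail" idea can be sketched as a store that keeps blocks in memory up to a budget and writes the rest to disk. This is a deliberately simplified, hypothetical illustration, far simpler than Spark's actual BlockManager, but the contract is the same: a `get` works whether the block ended up in memory or on disk.

```python
# Toy MEMORY_AND_DISK-style block store (hypothetical, for illustration).
import os
import pickle
import tempfile

class SpillableStore:
    def __init__(self, max_in_memory, spill_dir):
        self.max_in_memory = max_in_memory
        self.spill_dir = spill_dir
        self.memory = {}

    def put(self, block_id, data):
        if len(self.memory) < self.max_in_memory:
            self.memory[block_id] = data
        else:
            with open(os.path.join(self.spill_dir, block_id), "wb") as f:
                pickle.dump(data, f)   # spill to disk instead of failing

    def get(self, block_id):
        if block_id in self.memory:
            return self.memory[block_id]
        with open(os.path.join(self.spill_dir, block_id), "rb") as f:
            return pickle.load(f)

with tempfile.TemporaryDirectory() as d:
    store = SpillableStore(max_in_memory=1, spill_dir=d)
    store.put("rdd_0_0", [1, 2, 3])   # fits in memory
    store.put("rdd_0_1", [4, 5, 6])   # budget exhausted -> spilled to disk
    print(store.get("rdd_0_1"))  # [4, 5, 6]
```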

d) Spark Streaming

This is a bit different: Spark Streaming expects all data to fit in memory unless you override the settings.


2 REPLIES


Hi @Michel Sumbul

There are several details around Spark memory management, especially since Spark 1.6. But to answer your question simply: if the dataset does not fit in memory, Spark spills to disk. This behavior is necessary for Spark to work with large datasets, but it impacts performance. You can get an idea of this in Spark's original design dissertation by Matei Zaharia (section 2.6.4, "Behavior with Insufficient Memory").
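The classic way an engine processes more data than fits in memory is external sorting, and a tiny plain-Python version (not Spark code, just the underlying idea) shows why spilling keeps memory bounded: sort the data in memory-sized chunks, write each sorted run to disk, then stream-merge the runs.

```python
# External merge sort sketch: memory use is bounded by chunk_size,
# regardless of the total input size.
import heapq
import os
import tempfile

def external_sort(values, chunk_size, tmp_dir):
    runs = []
    for start in range(0, len(values), chunk_size):
        chunk = sorted(values[start:start + chunk_size])  # fits in "memory"
        path = os.path.join(tmp_dir, f"run-{len(runs)}.txt")
        with open(path, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)         # spill sorted run
        runs.append(path)
    files = [open(p) for p in runs]
    try:
        streams = [(int(line) for line in f) for f in files]
        return list(heapq.merge(*streams))                # streaming merge
    finally:
        for f in files:
            f.close()

with tempfile.TemporaryDirectory() as d:
    data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
    print(external_sort(data, chunk_size=3, tmp_dir=d))
    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The trade-off the reply describes is visible here: the job still completes when data exceeds memory, but at the cost of extra disk I/O for the spilled runs.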
