
Spark DataFrame memory problem

We are using Apache Spark 1.6.1 and work with both Spark DataFrames and RDDs.

        Data size: 215 GB Parquet file

        Cluster: 12 nodes, 4.97 TB total memory, 920 total cores

 

 

  • Our starting configuration for the RDD job is the one below:

 

spark-shell --master yarn-client --driver-memory 16G --num-executors 24 --executor-cores 30 --driver-cores 16 --executor-memory 16G

 

Our RDD processing pipeline looks like this:

read parquet file -> filter -> lowercase -> findAllIn (with regex) -> map -> replace -> flatMap -> saveAsTextFile
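In simplified form it is roughly the sketch below (the paths, column name and regex are only placeholders, and sqlContext is the one spark-shell provides):

// Simplified sketch of the RDD pipeline; paths, column name and regex are placeholders.
// sqlContext is the instance that spark-shell creates.
val pattern = "[a-z]+".r                                   // placeholder regex

val words = sqlContext.read.parquet("/data/input")         // read parquet file
  .rdd
  .map(_.getString(0))                                     // assume the text is in the first column
  .filter(_.nonEmpty)                                      // filter
  .map(_.toLowerCase)                                      // lowercase
  .map(line => pattern.findAllIn(line).toList)             // findAllIn (with regex)
  .map(_.map(_.replace("?", "")))                          // placeholder replace
  .flatMap(identity)                                       // flatMap

words.saveAsTextFile("/data/output_rdd")                   // saveAsTextFile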

 

The RDD job processes this data successfully in about 10 minutes, but we want a better processing time and lower memory usage by using DataFrames.

 

  • When we try the scenario below, which does the same work as the RDD job, we get a memory error.

Our DataFrame processing pipeline looks like this:

read parquet file -> filter -> Spark SQL -> findAllIn (with regex) -> filter -> saveAsParquet
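In simplified form, the DataFrame version is roughly like this (again, paths, column name and regex are only placeholders):

// Simplified sketch of the DataFrame pipeline; paths, column name and regex are placeholders.
// sqlContext is the instance that spark-shell creates.
import sqlContext.implicits._

val pattern = "[a-z]+".r                                    // placeholder regex

val df = sqlContext.read.parquet("/data/input")             // read parquet file
df.filter("text IS NOT NULL").registerTempTable("docs")     // filter
val lowered = sqlContext.sql("SELECT lower(text) AS text FROM docs")  // Spark SQL

// findAllIn is plain Scala, so this step drops back to an RDD of rows
val tokens = lowered.rdd
  .flatMap(row => pattern.findAllIn(row.getString(0)))      // findAllIn (with regex)
  .filter(_.nonEmpty)                                       // filter
  .toDF("token")

tokens.write.parquet("/data/output_df")                     // saveAsParquet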

The DataFrame job could not process the same data that the RDD job handled. We tried many configurations for executor memory, driver memory, executor memory overhead, and the number of executor cores, but we could not solve this memory problem.
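For example, we tried variants along these lines (the numbers here are only illustrative):

spark-shell --master yarn-client \
  --driver-memory 16G \
  --num-executors 24 \
  --executor-cores 5 \
  --executor-memory 24G \
  --conf spark.yarn.executor.memoryOverhead=4096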

According to the Spark and Cloudera web pages, DataFrames should be better than RDDs in both memory usage and execution time. There are also many answers on the web for this kind of memory error, but they are all for Spark 2.x.

So we think our Spark version is very old, and we want to upgrade to 2.x.
