12-14-2017 05:21 AM
We are using Apache Spark version 1.6.1 and working with Spark DataFrames and RDDs.
Data size: 215 GB Parquet file
We have 12 nodes, 4.97 TB total memory, 920 total cores
spark-shell --master yarn-client --driver-memory 16G --num-executors 24 --executor-cores 30 --driver-cores 16 --executor-memory 16G
Our RDD processing pipeline looks like:
read parquet file->filter-> lowercase -> findallin(with regex) -> map -> replace -> flatmap-> saveAsTextFile
The RDD pipeline processes the data successfully in about 10 minutes, but we want a better processing time and lower memory usage with DataFrames.
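For context, here is a minimal sketch of what the RDD pipeline above could look like. The paths, the column position, and the regex pattern are all hypothetical placeholders, not our real values:

```scala
import org.apache.spark.sql.SQLContext

// Spark 1.6 style; `sc` is the SparkContext provided by spark-shell.
val sqlContext = new SQLContext(sc)

// Hypothetical pattern and paths, for illustration only.
val pattern = """\b\w+\b""".r

sqlContext.read.parquet("/data/input.parquet")
  .rdd
  .filter(row => !row.isNullAt(0))                      // filter
  .map(row => row.getString(0).toLowerCase)             // lowercase
  .map(line => pattern.findAllIn(line).toList)          // findAllIn (with regex)
  .map(tokens => tokens.map(_.replace("-", " ")))       // replace
  .flatMap(identity)                                    // flatMap
  .saveAsTextFile("/data/output")                       // saveAsTextFile
```

Each step maps one-to-one onto the arrow chain above.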
Our DataFrame processing pipeline looks like:
Read parquet file-> filter -> spark sql -> findallin (with Regex)->filter -> saveAsParquet
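Since findAllIn is not a built-in DataFrame function, in Spark 1.6 it has to be wrapped in a UDF. A minimal sketch of the DataFrame pipeline above, again with hypothetical column names, paths, and pattern:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{udf, col, size}

val sqlContext = new SQLContext(sc)

// Hypothetical pattern and column name "text", for illustration only.
val pattern = """\b\w+\b""".r
val findAll = udf((s: String) => pattern.findAllIn(s.toLowerCase).toSeq)

val df = sqlContext.read.parquet("/data/input.parquet")

df.filter(col("text").isNotNull)                  // filter
  .select(findAll(col("text")).as("matches"))     // spark sql + findAllIn UDF
  .filter(size(col("matches")) > 0)               // filter
  .write.parquet("/data/output.parquet")          // saveAsParquet
```

One difference worth noting: the UDF runs outside Catalyst's optimized code path, so the regex step may not benefit from Tungsten the way built-in expressions do.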
The DataFrame pipeline could not process the same data the way the RDD pipeline did. We tried many configurations of executor memory, driver memory, spark.yarn.executor.memoryOverhead, and executor core counts, but we could not solve this memory problem.
According to the Spark and Cloudera web pages, DataFrames are better than RDDs in both memory usage and execution time. There are also many answers on the web for this kind of memory error, but they are all for Spark 2.x.
So we think our Spark version is very old, and we want to upgrade to 2.x.