I wrote several Spark applications that save Dataset data to Phoenix tables. When I process huge datasets, some of my Spark jobs fail with an ExecutorLostFailure exception. The jobs are retried and seem to finish successfully on their second attempt.
Here is the code that saves the dataframe to my Phoenix table:
dfToSave.write()
        .format("org.apache.phoenix.spark")
        .mode("overwrite")
        .option("table", "PHOENIX_TABLE_NAME")
        .option("zkUrl", "server.name:2181:/hbase-unsecure")
        .save();
Here is the output of one of the jobs in the Spark History UI:
ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 3.0 GB of 3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Why do I get this error when I use the Spark plugin for Apache Phoenix? Are there any configurations to manage the memory consumption of the Phoenix-Spark job?
The container was killed because the executor's total memory footprint (JVM heap plus off-heap overhead) exceeded the limit YARN allocated for it. You can increase spark.yarn.executor.memoryOverhead, as the error message suggests, and/or raise --executor-memory (and probably yarn.scheduler.maximum-allocation-mb as well) to a value that can hold your data in memory. In some cases repartitioning the dataframe into more, smaller partitions is a better option, since it reduces the amount of data each task holds at once.
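A minimal sketch of how those settings could be passed at submit time; the jar name, class name, and memory sizes below are placeholders, not values from your job:

```shell
# Illustrative spark-submit: more heap per executor plus extra off-heap
# overhead, so heap + overhead stays under the YARN container limit.
# spark.yarn.executor.memoryOverhead is given in MiB.
spark-submit \
  --class com.example.PhoenixSaveJob \
  --master yarn \
  --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  phoenix-save-job.jar
```

If you try repartitioning instead, a call like `dfToSave.repartition(200)` before the `.write()` chain (the partition count here is an arbitrary example; tune it to your data size) spreads the rows over more tasks, each holding less data in memory.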