I wrote several Spark applications that save Dataset data to Phoenix tables. When I process huge datasets, some of my Spark jobs fail with an ExecutorLostFailure exception. The jobs are retried and seem to finish successfully on their second attempt.
Here is the code that saves the DataFrame to my Phoenix table:
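(Simplified sketch of the write; MY_TABLE and the zkUrl value are placeholders for the real table name and ZooKeeper quorum.)

```
import org.apache.spark.sql.{DataFrame, SaveMode}

// Simplified sketch of the save step; "MY_TABLE" and the zkUrl value are
// placeholders. The phoenix-spark data source expects SaveMode.Overwrite,
// but the write is executed as an upsert on the Phoenix table.
def saveToPhoenix(df: DataFrame): Unit = {
  df.write
    .format("org.apache.phoenix.spark")
    .mode(SaveMode.Overwrite)
    .option("table", "MY_TABLE")
    .option("zkUrl", "zk-host:2181")
    .save()
}
```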
Here is the output of one of the jobs in the Spark History UI:
ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 3.0 GB of 3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Why do I get this error when I use the Spark plugin for Apache Phoenix? Are there any configurations to manage the memory consumption of the Phoenix-Spark job?
You might want to increase --executor-memory (and probably yarn.scheduler.maximum-allocation-mb as well) to a value that can hold your data size in memory. In some cases repartitioning the data into more, smaller partitions is a better option; a sketch of both is below.
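As a rough illustration only (the app name, memory sizes, and partition count are placeholder values that depend on your data volume and on what YARN allows per container), the executor memory and the spark.yarn.executor.memoryOverhead that the error message itself suggests raising can be set before the session is created, and the Dataset can be repartitioned before the write. The same memory settings can equally be passed at submit time, e.g. --executor-memory 6g --conf spark.yarn.executor.memoryOverhead=1024.

```
import org.apache.spark.sql.SparkSession

// Placeholder values; tune them to the actual data volume and to what YARN's
// yarn.scheduler.maximum-allocation-mb allows per container.
val spark = SparkSession.builder()
  .appName("PhoenixSaveJob")
  .config("spark.executor.memory", "6g")                 // executor heap
  .config("spark.yarn.executor.memoryOverhead", "1024")  // off-heap headroom in MB
  .getOrCreate()

// More (and therefore smaller) partitions mean each task holds less data in
// memory during the Phoenix write; 200 is only an example value.
val dataset = spark.range(0, 1000000).toDF("ID")   // stand-in for the real Dataset
val repartitioned = dataset.repartition(200)
```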