I am running Spark 2.1.cloudera1 over YARN on CDH 5.7.1.
I have a PySpark job submitted in yarn-cluster mode that always fails.
There is no error in the driver's stdout/stderr, but in the logs of
the NodeManager that hosted the driver I see the error:
yarn container is running beyond physical memory limits
The Spark application is very big: it has 1000+ jobs and should take about 20 hours.
Unfortunately, I can't post my code,
but I can confirm that driver-side actions (e.g. collect) are only run over a few rows,
and the code shouldn't crash because of driver memory.
Just for context, I gave the driver 70 GB of memory (spark.driver.memory),
but it seems that after roughly 4 hours it crashes anyway.
I tried to tune some parameters, but nothing helped.
Does anyone have suggestions for parameters I can try, or know what to do in a case like this?
I think the problem is with YARN, because this job runs fine on my Spark standalone cluster with the same configuration.
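For anyone hitting the same error: "running beyond physical memory limits" usually means the whole container (JVM heap plus off-heap overhead) exceeded what YARN allocated, not that the heap itself filled up. A minimal sketch of the driver-side settings involved — the conf keys are real Spark 2.1-on-YARN properties, but the numbers here are hypothetical:

```python
# Driver-side memory settings relevant to the YARN physical-memory kill.
# The values below are made-up examples; only the keys come from Spark 2.1 on YARN.
conf = {
    "spark.driver.memory": "70g",                # JVM heap for the driver
    "spark.yarn.driver.memoryOverhead": "8192",  # off-heap headroom, in MB (Spark 2.1 key name)
}

heap_mb = 70 * 1024      # 70g heap expressed in MB
overhead_mb = 8192       # explicit overhead from the conf above

# YARN sizes the driver container as heap + overhead and kills the process
# when its physical memory (RSS) exceeds that total.
container_mb = heap_mb + overhead_mb
print(container_mb)  # 79872
```

The container request (here ~78 GB) must also fit under yarn.scheduler.maximum-allocation-mb, or YARN will refuse or kill it.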
It seems your jobs are running on YARN. You mentioned that you tried some parameters, but not which parameters or what values you used for them.
In any case, this issue generally appears when some of your parameters do not meet the criteria below:
1. yarn.scheduler.minimum-allocation-mb <= mapreduce.map.memory.mb
2. yarn.scheduler.maximum-allocation-mb <= yarn.nodemanager.resource.memory-mb
3. mapreduce.map.memory.mb <= yarn.scheduler.maximum-allocation-mb
4. mapreduce.map.java.opts = (mapreduce.job.heap.memory-mb.ratio) * (mapreduce.map.memory.mb)
5. Do the same for the reducer.
Hope this helps.
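The criteria above can be expressed as a quick sanity check. The values below are hypothetical examples, not recommendations; the 0.8 heap ratio is the usual Hadoop default for mapreduce.job.heap.memory-mb.ratio:

```python
# Hypothetical cluster values, all in MB; substitute your own.
yarn_scheduler_min_alloc = 4096
yarn_scheduler_max_alloc = 102400
yarn_nm_resource_mem     = 131072
mapreduce_map_memory     = 8192
heap_ratio               = 0.8   # mapreduce.job.heap.memory-mb.ratio

# 1. the scheduler's minimum allocation must not exceed the map task request
assert yarn_scheduler_min_alloc <= mapreduce_map_memory
# 2. the scheduler's ceiling must fit on a single NodeManager
assert yarn_scheduler_max_alloc <= yarn_nm_resource_mem
# 3. the map task request must fit under the scheduler's ceiling
assert mapreduce_map_memory <= yarn_scheduler_max_alloc
# 4. the JVM heap (-Xmx) for the map task is container size times the heap ratio
map_java_opts_xmx = int(heap_ratio * mapreduce_map_memory)
print(map_java_opts_xmx)  # 6553
```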
The basic parameter is spark.driver.memory, which was set to 70g.
I also tried more parameters, such as:
spark.driver.extraJavaOptions (giving it -Xms)
and others whose names I can't remember (something like spark.driver.nonHeapEnabled).
I don't understand how the parameters you listed should help:
yarn.scheduler.minimum/maximum-allocation-mb are set through Cloudera Manager (min 4 GB, max 100 GB).
It's a Spark job, so why would parameters like mapreduce.map.memory.mb apply?
And as for item 5: there is no reducer.
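One thing worth checking: with spark.driver.memory=70g and no explicit overhead setting, Spark 2.1 on YARN derives the driver's memory overhead as max(384 MB, 10% of the heap). The arithmetic below just applies that documented default:

```python
# Default driver memory overhead for Spark 2.1 on YARN:
# max(384 MB, 0.10 * spark.driver.memory).
driver_memory_mb = 70 * 1024  # spark.driver.memory = 70g, in MB
overhead_mb = max(384, int(0.10 * driver_memory_mb))
container_mb = driver_memory_mb + overhead_mb
print(overhead_mb, container_mb)  # 7168 78848
```

So the driver container gets roughly 7 GB of off-heap headroom; if off-heap usage (driver threads, Python workers, netty buffers) grows past that over a 20-hour run, YARN kills the container even though the 70 GB heap is nowhere near full. Raising spark.yarn.driver.memoryOverhead is the usual fix for this symptom.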
Please go to CM -> Spark -> Configuration -> search for "YARN (MR2 Included) Service". If it is set to "YARN (MR2 Included)", then the Spark service instance has a dependency on the YARN (MR2 Included) service.
I assume it is enabled in your case (I may be wrong), because your error, 'yarn container is running beyond physical memory limits', points to a YARN-related issue.
If my understanding above is correct, then I hope this answers your question!