We are submitting a Spark job to a YARN cluster with this configuration: --master yarn --deploy-mode cluster --executor-cores 5 --num-executors 3 --executor-memory 8G --driver-memory 3g --conf spark.yarn.executor.memoryOverhead=6144 --conf spark.cores.max=30 --conf spark.memory.fraction=0.9 --conf spark.memory.storageFraction=0.1
The cluster has 3 nodes, with 60 GB of memory available on each node.
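As a sanity check on the configuration above, the YARN container requested for each executor is the executor memory plus the memory overhead. A minimal sketch of that arithmetic (values taken from the command above; the 60 GB per-node figure is from the question):

```shell
# Per-executor YARN container size = --executor-memory + spark.yarn.executor.memoryOverhead
executor_mem_mb=8192   # --executor-memory 8G
overhead_mb=6144       # spark.yarn.executor.memoryOverhead=6144
container_mb=$((executor_mem_mb + overhead_mb))
echo "${container_mb} MB per executor"   # 14336 MB, well under 60 GB per node
```

With 3 executors across 3 nodes, each node hosts roughly one 14 GB container, so raw memory pressure is an unlikely culprit, which is consistent with our first analysis.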
The erratic behavior is that the job sometimes fails with an "XML not found" exception, even though all the XML files are present at the jar paths stated on spark-submit, while at other times it runs successfully with the same command arguments.
In our first analysis we checked RAM and cluster memory availability and found no issues there. We also compared the logs of failed and successful runs and found no discrepancies.
We have no clue what else needs to be checked so that the job stops failing intermittently under the same command.
Please help shed light on which components we should examine to resolve this issue.
You have not provided the stack trace, so it is hard to say definitively what is failing until it is posted. Based on the description given, though, I would suggest passing the XML files to spark-submit with the "--files" option and trying again.
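A minimal sketch of what that submit could look like; the application jar, class name, and XML paths here are placeholders, and only the original flags from the question are kept:

```shell
# Ship the XMLs with the job so every executor gets a local staged copy,
# rather than relying on the files being readable at a fixed path on each node.
# app.jar, com.example.Main, and the XML paths are hypothetical placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-cores 5 \
  --num-executors 3 \
  --executor-memory 8G \
  --driver-memory 3g \
  --conf spark.yarn.executor.memoryOverhead=6144 \
  --files /path/to/config1.xml,/path/to/config2.xml \
  --class com.example.Main \
  app.jar
```

Files distributed with --files land in each container's working directory (resolvable via SparkFiles.get in the application), so the job no longer depends on the XMLs being present at the same absolute path on whichever nodes YARN happens to schedule it, which would also explain why the failure is intermittent.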