The free JVM heap space of my YARN ResourceManager decreases continuously.
A possible cause:
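To quantify the shrinkage, I have been sampling the ResourceManager heap with the JDK's `jstat` tool. A minimal sketch (the `pgrep` pattern is an assumption; adjust it to however the RM process appears on your nodes):

```shell
# Find the ResourceManager's PID (pattern is an assumption -- adjust for
# your installation, e.g. check `ps aux | grep -i resourcemanager`).
RM_PID=$(pgrep -f 'org.apache.hadoop.yarn.server.resourcemanager.ResourceManager')

# Print GC utilization every 10 seconds: old-gen occupancy (O) staying
# high across full GCs (FGC) with little reclaimed is consistent with
# a slow heap leak.
jstat -gcutil "$RM_PID" 10000
```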
Many of the failed Spark jobs seem to have leftover tasks running long after failure.
From the Spark web UI, under "Tasks" for a specific job stage, I find rows like:
| Index | ID  | Attempt | Status  | Locality Level | Executor ID / Host    | Launch Time         | Duration | GC Time | Shuffle Read Size / Records | Errors |
|-------|-----|---------|---------|----------------|-----------------------|---------------------|----------|---------|-----------------------------|--------|
| 0     | 764 | 0       | RUNNING | PROCESS_LOCAL  | 81 / node2.domain.com | 2018/01/29 12:43:50 | 189.6 h  |         | /                           |        |
(This is not a streaming job; in YARN it shows state "Finished" and final status "Failed".)
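The YARN state and final status can be confirmed from the CLI as well; a sketch (the application ID below is a placeholder, not one of my real IDs):

```shell
# Print the YARN application report; for the jobs described above it
# shows "State : FINISHED" together with "Final-State : FAILED",
# even though the Spark UI still lists tasks as RUNNING.
yarn application -status application_1517000000000_0001
```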
The "Aggregated Metrics by Executor" table for the same job stage does not contain an executor with ID 81.
It does contain a number of executors that were lost, whose addresses are therefore shown as "CANNOT FIND ADDRESS".
I have tried tracking the IDs of the tasks that never ended through the stderr logs, but found no common pattern. Some tasks are created only to be lost to errors like FetchFailed or java.io.FileNotFoundException. Most others, from what I can tell, are created and then never heard from again.
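This is how I searched for one of the stuck tasks in the aggregated logs; a sketch, assuming log aggregation is enabled (the application ID is a placeholder, and the exact log-line wording may differ across Spark versions):

```shell
# Fetch the aggregated container logs for the finished application and
# look for any mention of task 764 (the stuck task from the stage table).
# For the vanished tasks there is typically a "Starting task ..." line
# from the driver's TaskSetManager and then no later status for that
# attempt.
yarn logs -applicationId application_1517000000000_0001 \
  | grep -n 'task 764'
```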
I have multiple clusters all showing these symptoms. Versions are:
CDH 5.4.7 & 5.4.4
Any help is much appreciated; I can add more information if you tell me where to find it.