
Spark on YARN tasks never finish


The problem:

The free heap space of my ResourceManager JVM decreases continuously.
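
For context, the decrease can be tracked with something like the following Python sketch against the ResourceManager's JMX servlet (assuming the default web UI port 8088; rm.domain.com is a placeholder hostname):

# Minimal sketch: sample the ResourceManager's heap usage from its JMX servlet.
# Assumes the default RM web UI port 8088; rm.domain.com is a placeholder.
import json
import time
from urllib.request import urlopen

RM_JMX_URL = "http://rm.domain.com:8088/jmx?qry=java.lang:type=Memory"

while True:
    with urlopen(RM_JMX_URL) as resp:
        heap = json.load(resp)["beans"][0]["HeapMemoryUsage"]
    free_mb = (heap["max"] - heap["used"]) / (1024.0 * 1024.0)
    print(time.strftime("%Y-%m-%d %H:%M:%S"), "free heap: %.1f MB" % free_mb)
    time.sleep(60)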

 

A possible cause:

Many of the failed Spark jobs seem to have leftover tasks running long after failure.

From the Spark web UI, under Tasks for a specific job stage, I find rows like:

Index  ID   Attempt  Status   Locality Level  Executor ID / Host     Launch Time          Duration  GC Time  Shuffle Read Size / Records  Errors
0      764  0        RUNNING  PROCESS_LOCAL   81 / node2.domain.com  2018/01/29 12:43:50  189.6 h

(This is not a streaming job, and the application has state "Finished" and final status "Failed" in YARN.)
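
For reference, the YARN-side state and final status can be confirmed with something like this sketch against the ResourceManager REST API (hostname and application ID below are placeholders):

# Sketch: read an application's state and final status from the
# ResourceManager REST API. Hostname and application ID are placeholders.
import json
from urllib.request import urlopen

RM = "http://rm.domain.com:8088"
APP_ID = "application_0000000000000_0000"  # placeholder

with urlopen("%s/ws/v1/cluster/apps/%s" % (RM, APP_ID)) as resp:
    app = json.load(resp)["app"]

print(app["state"], app["finalStatus"])  # e.g. FINISHED FAILED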

The table with "Aggregated Metrics by Executor" for the same job stage does not contain an executor with ID 81.

It does contain a number of executors that were lost; their addresses therefore show as "CANNOT FIND ADDRESS".

 

I have tried tracing the IDs of tasks that never finished through the executors' stderr logs, but found no common pattern. Some tasks are created, only to be lost to errors such as FetchFailed or java.io.FileNotFoundException. Others, the majority by my impression, are created and then never heard from again.
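
A search like that can be sketched as follows, assuming YARN log aggregation is enabled (the application ID and task ID below are placeholders, and the string match on the TID is crude):

# Sketch: pull the aggregated logs of a finished application and print every
# line mentioning one task. Application ID and task ID are placeholders.
import subprocess

APP_ID = "application_0000000000000_0000"  # placeholder
TID = "764"                                # task ID as shown in the Spark UI

logs = subprocess.check_output(["yarn", "logs", "-applicationId", APP_ID])
for line in logs.decode("utf-8", "replace").splitlines():
    if "TID %s" % TID in line:
        print(line)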

 

I have multiple clusters all showing these symptoms. Versions are:

CDH 5.4.7 & 5.4.4

YARN 2.6.0

Spark 1.3

 

Any help is much appreciated; I can add more information if you tell me where to find it.

 

Best regards

Jonathan McGowan