We noticed that when a YARN job fails and the log aggregation option is enabled, we can't find the containers' logs for that failed job in the usual HDFS folder (/tmp/logs/…). We can see the logs for all jobs that finished successfully, but nothing for the failed ones (our log aggregation retention is set to 7 days, and this problem occurred only 2 days ago).
We're wondering whether this could be a bug in the aggregation process, but we would like further information about this issue from Cloudera Support, either to confirm that or to get another explanation.
Any tips about this problem?
Many thanks in advance for your kind cooperation.
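For reference, one quick way to check whether any aggregated logs were actually written for an application (a sketch assuming the default `yarn.nodemanager.remote-app-log-dir` of /tmp/logs and the default "logs" suffix; the application ID below is the one from the NodeManager excerpt later in this thread):

```shell
# List aggregated logs for the current user; the layout is
# <remote-app-log-dir>/<user>/<suffix>/<application-id>
hdfs dfs -ls /tmp/logs/$USER/logs/

# Fetch the aggregated logs for a specific application ID
yarn logs -applicationId application_1544179594403_0020
```

If `yarn logs` reports that the log directory does not exist, the upload never happened on the NodeManager side, which points back at the aggregation step rather than at retention.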
Yes, I set the Log Retain Duration parameter to 7 days, but nothing changed. I also set the parameter yarn.nodemanager.delete.debug-delay-sec to the equivalent of 7 days, but I can't find any logs on HDFS or in the local location either.
Maybe it is an issue with CDH 5.8.0, because with CDH 5.13.1 I didn't see this problem.
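For comparison, these are the yarn-site.xml properties involved (a sketch of the values described above; both durations are expressed in seconds, so 7 days is 604800):

```xml
<!-- Enable aggregation of container logs to HDFS -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<!-- Keep aggregated logs in HDFS for 7 days -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
</property>
<!-- Delay deletion of local container logs for 7 days (debugging aid) -->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>604800</value>
</property>
```

In a Cloudera Manager deployment these would normally be set through the YARN service configuration rather than edited by hand.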
Look for the log-aggregation-related messages in the NodeManager log file on one of the nodes where a container for the application was running.
In the normal case you should see something like:
2018-12-07 20:27:59,994 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping application application_1544179594403_0020
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Application just finished : application_1544179594403_0020
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Uploading logs for container container_e06_1544179594403_0020_01_000001. Current good log dirs are /yarn/container-logs
Do you see these messages for the failing application, or do you see an error/exception instead?
If you can paste the relevant log for the failing application, I can take a look.
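To pull just those aggregation messages out of a large NodeManager log, a simple grep is enough. The sketch below runs against a small inline sample built from the lines quoted above, so it is self-contained; against a real cluster you would point the grep at the actual NodeManager log file instead:

```shell
# Build a small sample from the NodeManager lines quoted above
cat > nm-sample.log <<'EOF'
2018-12-07 20:27:59,994 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping application application_1544179594403_0020
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Application just finished : application_1544179594403_0020
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Uploading logs for container container_e06_1544179594403_0020_01_000001. Current good log dirs are /yarn/container-logs
EOF

# Keep only the log-aggregation messages for the application of interest
grep 'AppLogAggregatorImpl' nm-sample.log | grep 'application_1544179594403_0020'
```

On the sample above this prints the "Application just finished" line; the "Uploading logs" line names the container, not the application, so it would be matched by the first grep alone.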