Created on 02-15-2018 03:06 AM - edited 09-16-2022 05:51 AM
Hi all,
We noticed that when a YARN job fails and log aggregation is enabled, we can’t find the containers’ logs for the failed job in the usual HDFS folder (/tmp/logs/…). We can see the logs for all jobs that finished successfully, but nothing for those that failed (Log Aggregation Retention is set to 7 days, and the problem occurred 2 days ago).
We’re wondering whether this could be a bug in the aggregation process, but we would like further information about this issue from Cloudera Support, either to confirm that or to get another explanation.
Any tips about this problem?
Many thanks in advance for your kind cooperation.
Regards,
Alex
Created 12-05-2018 04:45 AM
Hi Alex,
Did you check the Log Retain Duration setting? Log Retain Duration is the time, in seconds, to retain logs.
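Since the value is given in seconds, it is easy to enter the wrong number. A minimal sketch of the conversion for a 7-day retention, assuming the field takes a plain integer number of seconds (the helper name `days_to_seconds` is made up for illustration):

```python
from datetime import timedelta


def days_to_seconds(days: int) -> int:
    """Convert a retention period in days to whole seconds."""
    return int(timedelta(days=days).total_seconds())


# 7 days expressed in seconds
print(days_to_seconds(7))  # 604800
```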
Created 12-07-2018 07:06 AM
Hi AKR,
Yes, I set the Log Retain Duration parameter to 7 days, but nothing changed. I also set the parameter yarn.nodemanager.delete.debug-delay-sec to 7 days, but I still can't find any logs on HDFS or in the local location.
Maybe it is an issue with CDH 5.8.0, because I didn't see this problem with CDH 5.13.1.
Regards,
Alex
Created 12-07-2018 08:35 PM
Hi Alex,
Look for the log aggregation messages in the NodeManager log file on one of the nodes where a container for the application was running.
In the normal case you should see:
2018-12-07 20:27:59,994 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping application application_1544179594403_0020
...
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Application just finished : application_1544179594403_0020
..
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Uploading logs for container container_e06_1544179594403_0020_01_000001. Current good log dirs are /yarn/container-logs
Do you see these messages for the failing application or do you see some error/exception instead?
If you can paste the relevant log for the failing application I can take a look.
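To pull out the relevant lines, something like the rough sketch below may help. It is only an illustration: the function name `find_aggregation_lines` is hypothetical, and the NodeManager log path is passed on the command line because it differs per cluster. It keeps lines that mention the application (or one of its containers) and that either come from the logaggregation package shown above or contain ERROR/Exception. Continuation lines of a stack trace usually do not repeat the application ID, so check the lines around each hit as well.

```python
import sys
from typing import Iterable, List


def find_aggregation_lines(lines: Iterable[str], app_id: str) -> List[str]:
    """Return log lines for app_id that relate to log aggregation or errors."""
    # Container IDs embed the same numeric suffix as the application ID,
    # e.g. application_1544179594403_0020 -> 1544179594403_0020
    suffix = app_id.replace("application_", "", 1)
    hits = []
    for line in lines:
        if suffix not in line:
            continue
        is_aggregation = "logaggregation" in line.lower()
        is_error = "ERROR" in line or "Exception" in line
        if is_aggregation or is_error:
            hits.append(line.rstrip("\n"))
    return hits


# Usage: python find_agg.py <nodemanager-log-file> <application-id>
if __name__ == "__main__" and len(sys.argv) == 3:
    log_path, application_id = sys.argv[1], sys.argv[2]
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for hit in find_aggregation_lines(handle, application_id):
            print(hit)
```

Run it on a node where one of the failed application's containers ran, and paste whatever it prints.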
Regards
Bimal