Failed YARN job logs are missing when log aggregation is enabled
Labels: Apache YARN, Cloudera Manager, HDFS, MapReduce
Created on ‎02-15-2018 03:06 AM - edited ‎09-16-2022 05:51 AM
Hi all,
We noticed that when a YARN job fails and log aggregation is enabled, we can't find the container logs for the failed job in the usual HDFS folder (/tmp/logs/…). We can see the logs for all jobs that finished successfully, but nothing for the failed ones. Log Aggregation Retention is set to 7 days and the problem occurred 2 days ago, so the logs should not have expired yet.
We're wondering whether this could be a bug in the aggregation process, but we would like further information about this issue from Cloudera Support, either to confirm that or to get another explanation.
Any tips about this problem?
Many thanks in advance for the kind cooperation.
Regards,
Alex
Created ‎12-05-2018 04:45 AM
Hi Alex,
Did you check the Log Retain Duration setting? Log Retain Duration is the time, in seconds, to retain logs.
Created ‎12-07-2018 07:06 AM
Hi AKR,
Yes, I set the Log Retain Duration parameter to 7 days, but nothing changed. I also set the parameter yarn.nodemanager.delete.debug-delay-sec to 7 days, but I still can't find any logs, either on HDFS or in the local log location.
Maybe it is an issue with CDH 5.8.0, because with CDH 5.13.1 I didn't see this problem.
Regards,
Alex
Created ‎12-07-2018 08:35 PM
Hi Alex,
Look for the log aggregation messages in the NodeManager log file on one of the nodes where a container of the application was running.
In the normal case you should see:
2018-12-07 20:27:59,994 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping application application_1544179594403_0020
...
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Application just finished : application_1544179594403_0020
..
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Uploading logs for container container_e06_1544179594403_0020_01_000001. Current good log dirs are /yarn/container-logs
Do you see these messages for the failing application or do you see some error/exception instead?
If you can paste the relevant log for the failing application I can take a look.
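The check described above can be sketched as a small Python filter. This is only an illustration under stated assumptions: the function names are hypothetical, the location of the NodeManager log file depends on your cluster, and it assumes that failures show up as lines containing "ERROR" or "Exception" that also mention the application or one of its containers.

```python
import re
import sys

# Class name taken from the sample NodeManager messages quoted in this thread.
AGGREGATOR_CLASS = "AppLogAggregatorImpl"


def app_key(app_id):
    """Return the numeric part shared by an application ID and its container IDs."""
    match = re.fullmatch(r"application_(\d+_\d+)", app_id)
    if match is None:
        raise ValueError(f"not a YARN application ID: {app_id!r}")
    return match.group(1)


def aggregation_lines(lines, app_id):
    """Collect lines about app_id that come from the log aggregator
    or that look like an error or exception (an assumed heuristic)."""
    # The negative lookahead keeps ..._0020 from matching ..._00201.
    pattern = re.compile(re.escape(app_key(app_id)) + r"(?!\d)")
    hits = []
    for line in lines:
        if not pattern.search(line):
            continue
        if AGGREGATOR_CLASS in line or "ERROR" in line or "Exception" in line:
            hits.append(line.rstrip("\n"))
    return hits


# Hypothetical usage: python nm_log_check.py <nodemanager-log-file> <application-id>
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for hit in aggregation_lines(handle, sys.argv[2]):
            print(hit)
```

If the filter prints no AppLogAggregatorImpl lines for the failed application, or prints only error lines, that output is the relevant part to paste here.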
Regards
Bimal
