Explorer
Posts: 18
Registered: ‎01-29-2018

Failed yarn job logs are missing, when aggregation is enabled


Hi all,

 

We noticed that when a YARN job fails while log aggregation is enabled, we can't find the failed job's container logs in the usual HDFS folder (/tmp/logs/…). We can see the logs of all jobs that finished successfully, but nothing for the failed ones. Log Aggregation Retention is set to 7 days and the problem occurred 2 days ago, so the logs should still be within the retention window.

 

We're wondering whether this could be a bug in the aggregation process, and we would like further information from Cloudera Support to either confirm that or offer another explanation.

 

Any tips about this problem?

 

Many thanks in advance for the kind cooperation.

 

Regards,

 

Alex

Cloudera Employee AKR
Cloudera Employee
Posts: 22
Registered: ‎09-20-2018

Re: Failed yarn job logs are missing, when aggregation is enabled

Hi Alex,

 

Did you check the Log Retain Duration setting? Log Retain Duration is the time, in seconds, to retain logs.

 

 

 

 

Explorer
Posts: 18
Registered: ‎01-29-2018

Re: Failed yarn job logs are missing, when aggregation is enabled

Hi AKR,

Yes, I set the Log Retain Duration parameter to 7 days, but nothing changed. I also set yarn.nodemanager.delete.debug-delay-sec to 7 days, but I still can't find any logs on HDFS or in the local log location.
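
For reference, both settings are expressed in seconds, so "7 days" has to be entered as a second count. A minimal sketch of the conversion, where the helper name days_to_seconds is only illustrative and not part of any Hadoop tooling:

```python
from datetime import timedelta


def days_to_seconds(days):
    """Convert a retention period in days to the whole-second value the settings expect."""
    return int(timedelta(days=days).total_seconds())


# Value to enter for a 7-day retention or debug delay.
SEVEN_DAYS_IN_SECONDS = days_to_seconds(7)
print(SEVEN_DAYS_IN_SECONDS)  # 604800
```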

Maybe it is an issue with CDH 5.8.0, because I didn't see this problem with CDH 5.13.1.


Regards,


Alex

 

 

 

Cloudera Employee
Posts: 61
Registered: ‎04-24-2017

Re: Failed yarn job logs are missing, when aggregation is enabled

Hi Alex,

 

Look for the log aggregation messages in the NodeManager log file on one of the nodes where a container of the application was running.

In the normal case you should see:

 

2018-12-07 20:27:59,994 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping application application_1544179594403_0020


...

org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Application just finished : application_1544179594403_0020
..

org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Uploading logs for container container_e06_1544179594403_0020_01_000001. Current good log dirs are /yarn/container-logs

 

 

Do you see these messages for the failing application, or do you see an error/exception instead?

If you can paste the relevant log for the failing application I can take a look.
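
To pull out the relevant lines, a minimal sketch along these lines may help. It assumes you already have the NodeManager log as an iterable of text lines; the function name find_aggregation_lines is hypothetical, and the matching on "logaggregation", "ERROR" and "Exception" is only a heuristic based on the sample messages above, not an exhaustive list of what the NodeManager can log.

```python
def find_aggregation_lines(log_lines, app_id):
    """Return NodeManager log lines about log aggregation (or errors) for one application.

    log_lines: any iterable of text lines from a NodeManager log.
    app_id:    an application ID such as "application_1544179594403_0020".
    """
    prefix = "application_"
    if not app_id.startswith(prefix):
        raise ValueError("expected an ID of the form application_<timestamp>_<sequence>")

    # Container IDs embed the same "<timestamp>_<sequence>" part as the
    # application ID, so matching on it also catches per-container messages.
    id_part = app_id[len(prefix):]

    hits = []
    for line in log_lines:
        if id_part not in line:
            continue
        # Heuristic: keep lines from the logaggregation package, plus
        # anything that looks like an error or an exception.
        if "logaggregation" in line or "ERROR" in line or "Exception" in line:
            hits.append(line.rstrip("\n"))
    return hits
```

You could call it with an open file handle for the NodeManager log (its location depends on your installation) and the ID of the failed application, then paste whatever it returns into the thread.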

 

Regards
Bimal
