Created 02-10-2016 03:29 PM
We have log aggregation enabled in the Yarn configuration for our cluster (yarn.log-aggregation-enable).
But it doesn't seem to work.
When I try to drill into the history of a job in the resource manager GUI, the link for "logs" always takes me to a page that says: "aggregation is not enabled".
I've opened a ticket asking for help on this, and they told us we need to upgrade, so we did, but it didn't help.
I opened another ticket and am currently waiting for a response.
In the meantime, has anyone seen this?
Is there a known hack to fix it?
Any advice about where to look for the solution?
We're currently on 2.2.8.
Created 02-11-2016 06:33 PM
@Zack Riesland, thanks for looking at this with me over webex. It turns out the ownership of the mr-history directory was what kept aggregation from working on the web side; it needed to be owned by mapred and hdfs. The initial failure to start was due to that incorrect class, which we fixed.
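For anyone landing here later, the ownership fix could look roughly like the sketch below. The path `/mr-history` and the reading of "mapred and hdfs" as user `mapred`, group `hdfs` are assumptions, not something confirmed in this thread, and the `run` and `fix_mr_history_ownership` helpers are made up for illustration. It only prints the commands unless `DRY_RUN=0` is set, and the chown would normally have to be run as the HDFS superuser.

```shell
#!/usr/bin/env bash
# Sketch only: path and owner are assumptions; check them against your cluster.
MR_HISTORY_DIR="${MR_HISTORY_DIR:-/mr-history}"      # assumed location of the mr-history directory
MR_HISTORY_OWNER="${MR_HISTORY_OWNER:-mapred:hdfs}"  # assumed user:group reading of "mapred and hdfs"

# Hypothetical helper: print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

fix_mr_history_ownership() {
  # Show the current owner and group of the directory itself.
  run hdfs dfs -ls -d "$MR_HISTORY_DIR"
  # Reset ownership recursively (needs HDFS superuser rights when executed for real).
  run hdfs dfs -chown -R "$MR_HISTORY_OWNER" "$MR_HISTORY_DIR"
}

fix_mr_history_ownership
```

Review the printed commands first, then rerun with `DRY_RUN=0` once the path and owner match your environment.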
Created 02-10-2016 04:09 PM
Have you tried to retrieve the logs using the YARN CLI?
yarn logs -applicationId <id of the application>
This will stream the aggregated log back to the screen, provided you have access in HDFS to the log files. If you lack permission to see the log files, you will get a message about aggregation not being enabled. In that case, modify the command to use the application owner.
yarn logs -appOwner <user id> -applicationId <id of the application>
Created 02-10-2016 04:18 PM
Thanks Terry,
Both of these approaches work - I get back the relevant logs for a given application ID.
But I'm interested in (and tasked with) getting the UI links to work, to keep things simple for everyone on our team.
Created 02-10-2016 04:36 PM
I just got a reply from my support ticket and it was literally a link to this thread.
So I guess we better figure it out here!
Created 02-10-2016 04:39 PM
@Zack Riesland if you have a dev cluster, try going step by step and enabling TS. Also look for any deprecated properties.
Created 02-10-2016 04:44 PM
We don't have a dev cluster. I am the only one who needs the cluster today, so I can break stuff as long as it is put back together by tonight's ingest.
It sounds like that's the only way I'm going to get this to work...
Created 02-10-2016 04:50 PM
@Zack Riesland you can roll back to the old configs when you are done if it doesn't work. Take some standard precautions, like backing up yarn-site.xml, etc. Definitely post your results here.
Created 02-10-2016 04:53 PM
Thanks,
Another followup:
The instructions say:
yarn.timeline-service.entity-group-fs-store.active-dir
and yarn.timeline-service.entity-group-fs-store.done-dir
must exist on the cluster on HDFS. Active-dir should have permission 01777, owned by YARN, group admin-group. Done-dir should have permission 0700, owned by yarn, group admin-group.
2 things:
1) 01777 isn't a valid permission set
2) when it says 'admin-group', does it literally mean that the group should be set to 'admin-group', or just a group with admin privileges? Almost everything in HDFS seems to be in either 'hadoop' or 'hdfs'.
Created 02-10-2016 05:06 PM
@Zack Riesland 1. I believe they mean 1777, where the leading 1 sets the sticky bit. https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html
2. Yes, it means a group with admin privileges, not literally 'admin-group'.
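Putting the two answers above together, preparing the directories might look something like this sketch. The paths `/ats/active` and `/ats/done` and the group `hadoop` are placeholders I am assuming for illustration; use whatever your yarn-site.xml sets for the two `entity-group-fs-store` properties and whichever group has admin privileges on your cluster. The `run` and `prepare_timeline_dirs` helpers are hypothetical, and nothing is executed unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch only: paths and group are assumed; read the real values from yarn-site.xml.
ACTIVE_DIR="${ACTIVE_DIR:-/ats/active}"  # assumed value of yarn.timeline-service.entity-group-fs-store.active-dir
DONE_DIR="${DONE_DIR:-/ats/done}"        # assumed value of yarn.timeline-service.entity-group-fs-store.done-dir
ADMIN_GROUP="${ADMIN_GROUP:-hadoop}"     # assumed group with admin privileges

# Hypothetical helper: print each command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

prepare_timeline_dirs() {
  run hdfs dfs -mkdir -p "$ACTIVE_DIR" "$DONE_DIR"
  run hdfs dfs -chown "yarn:$ADMIN_GROUP" "$ACTIVE_DIR" "$DONE_DIR"
  run hdfs dfs -chmod 1777 "$ACTIVE_DIR"  # rwxrwxrwt: world-writable with the sticky bit
  run hdfs dfs -chmod 700 "$DONE_DIR"     # rwx------: owner only
  # List the directories themselves to confirm owner, group, and mode.
  run hdfs dfs -ls -d "$ACTIVE_DIR" "$DONE_DIR"
}

prepare_timeline_dirs
```

With the real commands executed, the final listing should show a trailing `t` in the mode of the active directory, which is how the sticky bit appears.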
Created 02-10-2016 04:43 PM
@Mark Herring FYI