
Workaround for log aggregation bug

Super Collaborator

We have log aggregation enabled in the YARN configuration for our cluster (yarn.log-aggregation-enable).

But it doesn't seem to work.

When I try to drill into the history of a job in the resource manager GUI, the link for "logs" always takes me to a page that says: "aggregation is not enabled".

I've opened a ticket asking for help on this, and they told us we need to upgrade, so we did, but it didn't help.

I opened another ticket and am currently waiting for a response.

In the meantime, has anyone seen this?

Is there a known hack to fix it?

Any advice about where to look for the solution?

We're currently on 2.2.8

1 ACCEPTED SOLUTION

Expert Contributor

@Zack Riesland, thanks for looking at this with me over WebEx. It turns out the ownership of the mr-history directory was preventing aggregation from working on the web side; it needed to be owned by mapred and hdfs. The initial failure to start was due to that incorrect class, which we fixed.
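For anyone landing here later, this is a rough sketch of how the ownership could be checked and corrected from the command line. Two things are assumptions on my part and not stated in the post: that the directory lives at /mr-history on HDFS, and that "mapred and hdfs" means owner mapred with group hdfs. The function names and the DRY_RUN switch are made up for illustration.

```shell
# Assumption: the mr-history directory is /mr-history on HDFS.
# Override MR_HISTORY_DIR if your cluster uses a different path.
MR_HISTORY_DIR="${MR_HISTORY_DIR:-/mr-history}"

# Hypothetical helper: with DRY_RUN=1, print the command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Show the current owner and group of the directory itself (-d: do not list its contents).
show_mr_history_owner() {
  run hdfs dfs -ls -d "$MR_HISTORY_DIR"
}

# Assumption: "owned by mapred and hdfs" is read as owner mapred, group hdfs.
fix_mr_history_owner() {
  run hdfs dfs -chown -R mapred:hdfs "$MR_HISTORY_DIR"
}
```

Running with DRY_RUN=1 first lets you review the commands before anything changes. Changing ownership in HDFS generally requires the HDFS superuser, so the real run would probably need to happen as that user.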


20 REPLIES

Super Collaborator

Have you tried to retrieve the logs using the YARN CLI?

yarn logs -applicationId <id of the application> 

This will stream back the aggregated log to the screen ... if you have access in HDFS to see the log files. You will see a message about aggregation not being enabled if you lack permissions to see the log files. In that case, modify the command to use the application owner.

yarn logs -appOwner <user id> -applicationId <id of the application>
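If you end up doing this often, the two commands can be folded into one small helper. This is only a sketch: the function name fetch_yarn_logs and the DRY_RUN switch are hypothetical, and the only yarn options used are the two shown above.

```shell
# Hypothetical wrapper around the two `yarn logs` forms.
# Usage: fetch_yarn_logs <application id> [application owner]
# With DRY_RUN=1 it prints the command instead of running it.
fetch_yarn_logs() {
  app_id="$1"
  owner="${2:-}"   # optional: the user who submitted the application

  # Build the command line; add -appOwner only when an owner was given.
  if [ -n "$owner" ]; then
    set -- yarn logs -appOwner "$owner" -applicationId "$app_id"
  else
    set -- yarn logs -applicationId "$app_id"
  fi

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

Since the output is streamed to the screen, redirecting it to a file (for example `fetch_yarn_logs <id of the application> > app.log`) makes long logs easier to search.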

Super Collaborator

Thanks Terry,

Both of these approaches work - I get back the relevant logs for a given application ID.

But I'm interested in (and tasked with) getting the UI links to work, to keep things simple for everyone on our team.

Super Collaborator

I just got a reply from my support ticket and it was literally a link to this thread.

So I guess we better figure it out here!

Master Mentor

@Zack Riesland if you have a dev cluster, try going step by step and enabling the Timeline Server (TS). Also look for any deprecated properties.

Super Collaborator

We don't have a dev cluster. I am the only one who needs the cluster today, so I can break stuff as long as it is put back together by tonight's ingest.

It sounds like that's the only way I'm going to get this to work...

Master Mentor

@Zack Riesland you can roll back to the old configs when you are done if it doesn't work. Take some standard precautions, like backing up yarn-site.xml, etc. Definitely post your results here.

Super Collaborator

Thanks,

Another followup:

The instructions say:

yarn.timeline-service.entity-group-fs-store.active-dir and yarn.timeline-service.entity-group-fs-store.done-dir must exist on the cluster on HDFS. Active-dir should have permission 01777, owned by YARN, group admin-group. Done-dir should have permission 0700, owned by yarn, group admin-group.

2 things:

1) 01777 isn't a valid permission set

2) when it says 'admin-group', does it literally mean that the group should be set to 'admin-group', or just a group with admin privileges? Almost everything in HDFS seems to be in either 'hadoop' or 'hdfs'.

Master Mentor

@Zack Riesland 1. I believe they mean 1777, where the leading 1 sets the sticky bit. https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html

2. Yes, it means a group with admin privileges, not literally a group named admin-group.
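Putting the two answers together, the directory setup from the instructions might look roughly like the sketch below. The function name, its arguments, and the DRY_RUN switch are hypothetical. The paths and the group are passed in deliberately, because they depend on what yarn.timeline-service.entity-group-fs-store.active-dir and done-dir are set to on your cluster and on which group has admin privileges there.

```shell
# Hypothetical helper: with DRY_RUN=1, print the command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: setup_ats_dirs <active-dir> <done-dir> <admin group>
# The arguments are placeholders for your cluster's configured values.
setup_ats_dirs() {
  active_dir="$1"
  done_dir="$2"
  group="$3"

  # Both directories must exist on HDFS.
  run hdfs dfs -mkdir -p "$active_dir" "$done_dir"
  # 1777: sticky bit plus read/write/execute for everyone (reading 01777 as octal 1777).
  run hdfs dfs -chmod 1777 "$active_dir"
  # 700: only the owner may access the done directory.
  run hdfs dfs -chmod 700 "$done_dir"
  # Owner yarn, group set to whichever group has admin privileges.
  run hdfs dfs -chown yarn:"$group" "$active_dir" "$done_dir"
}
```

Calling it with DRY_RUN=1 prints the four hdfs commands so they can be reviewed before being run for real.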
