
In which format are YARN container logs stored in HDFS?

Guru

I went into /app-logs/<username>/ to get the logs, but I can't tell what format these files are stored in. I tried fetching a file and checking its format with 'file', but it just reports 'data'. hdfs dfs -text also yields garbled text. We are looking to run some Pig jobs over the container logs to gain some insights.

1 ACCEPTED SOLUTION

Guru
REGISTER /tmp/tez-tfile-parser-0.8.2-SNAPSHOT.jar;
-- Load the aggregated YARN container logs (TFile format) with the Tez TFile loader
yarnlogs = LOAD '/app-logs/hdfs/logs/**/*' USING org.apache.tez.tools.TFileLoader();
-- Keep only the log lines that mention the fetcher free event
lines_with_fetchertime = FILTER yarnlogs BY $2 matches '.*freed by fetcher.*';

This is the code I used to extract specific text from the logs. However, TFileLoader in tez-tools does not seem to scale well when you pass it a folder with a ton of logs. tez-tools is also, I believe, not part of HDP; you need to build it separately. It worked well on smaller datasets but ran into issues on bigger ones.

Thanks


5 REPLIES


According to the Azure blog, the YARN container logs under /app-logs are not directly readable: they are written in TFile, a binary format indexed by container. Normally you can use the yarn CLI tool, which emits the content to stdout:

yarn logs -applicationId <applicationId>
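Once the yarn CLI has dumped an application's logs to plain text, standard tools can search them. A minimal sketch, assuming the dump was saved to a file named app.log (the sample log content below is invented for illustration):

```shell
# In practice the dump would come from the yarn CLI, e.g.:
#   yarn logs -applicationId <applicationId> > app.log
# Here a small invented sample stands in for a real dump.
cat > app.log <<'EOF'
Container: container_1465053092553_0001_01_000002 on node1
2016-06-01 12:00:01 INFO fetcher.Fetcher: freed by fetcher in 120 ms
2016-06-01 12:00:02 INFO orderedgrouped.ShuffleScheduler: shuffle complete
EOF

# Count the lines that mention the fetcher free event
grep -c 'freed by fetcher' app.log   # prints 1
```

This only works while the application's logs are still retrievable, i.e. the cluster (or its log store) still exists.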

Guru

Good pointer on TFile. We can read TFiles: I loaded them in Pig using org.apache.tez.tools.TFileLoader, which is in tez-tools (built from the source on git).

Master Mentor

Could you share code or an example of loading them into Pig?


New Contributor

This method of using the yarn command does not cover the use case of running an HDInsight cluster on demand, where the cluster is created to run the pipeline and then deleted. One approach is to use https://github.com/shanyu/hadooplogparser .

Is there any option to configure YARN log aggregation to produce plain text instead of the TFile binary format?