
Query on Hadoop logs

Rising Star

Any idea what the various log files typically created under the /var/log/hadoop/* folders are? Is there a defined naming convention and a mapping to the Hadoop daemons? The reason I ask is that I see many files listed under the /var/log/hadoop/hdfs folder, but I don't understand, and can't find documentation on, the purpose of each log file. Any help please.


4 REPLIES

Master Guru

In general, the .log files are the Java log files of the daemons themselves; any operational issues show up there.

The .out files capture the output of the Java process starter, so if you get any system-level faults, such as the JVM failing to start or a segmentation fault, you will find them there.

All logs roll over: the file without a timestamp at the end is the newest one, and log4j keeps a number of older, rolled-over logs with a timestamp in the name.

Apart from that, the naming is straightforward:

hadoop-hdfs-datanode: the log of the DataNode running on that node

hadoop-hdfs-namenode: the log of the NameNode

hadoop-hdfs-secondarynamenode: the log of the Secondary NameNode

hdfs-audit: the audit log of HDFS, which records all user activity in the cluster (who did what)

gc files: garbage-collection logs, if GC logging is enabled for the NameNode/DataNode processes

So if you have any problems, you will normally find them in the hadoop-hdfs .log files; if the problem is related to JVM configuration or startup, look in the .out files, but usually the .log files are what you want.
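For illustration, on a node running both a NameNode and a DataNode, /var/log/hadoop/hdfs might contain files along these lines (the hostname and timestamps are placeholders; exact names depend on your distribution and log4j settings):

hadoop-hdfs-namenode-node1.example.com.log               # current NameNode log
hadoop-hdfs-namenode-node1.example.com.log.2016-03-01    # rolled-over NameNode log
hadoop-hdfs-namenode-node1.example.com.out               # process-starter output (JVM/startup faults)
hadoop-hdfs-datanode-node1.example.com.log               # current DataNode log
hdfs-audit.log                                           # audit trail of user actions
gc.log-201603011200                                      # GC log, if GC logging is enabled

The "hdfs" in the middle of the daemon log names is the user the daemon runs as, and the hostname is typically appended by the startup scripts.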

Rising Star

Thanks @Benjamin Leonhardi

Further to my question, what is the best strategy for removing old log files? Can I simply remove all the logs apart from the "current" ones without any issues? Are there any best practices around log management? Thanks

Master Guru

You can simply delete anything ending in a timestamp that is old enough for you. Another approach I have seen is using "find -mtime" to delete all logs older than x days. Or you can configure the log4j settings of your Hadoop components (Ambari -> HDFS -> Advanced hdfs-log4j).
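As a sketch of the "find -mtime" approach (the path, file pattern, and 30-day retention are just examples; check what the first command prints before running the second):

# show rolled-over HDFS daemon logs older than 30 days
find /var/log/hadoop/hdfs -name 'hadoop-hdfs-*.log.*' -mtime +30 -print
# then delete them
find /var/log/hadoop/hdfs -name 'hadoop-hdfs-*.log.*' -mtime +30 -delete

The pattern deliberately matches only files with a rollover suffix after ".log", so the current log files are left alone.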

Unfortunately, the very useful DailyRollingFileAppender does not currently support deleting older files. (A newer version does, and some Hadoop components may already support that parameter.) However, you could change the log appender to the RollingFileAppender, which provides a MaxBackupIndex attribute that keeps at most x log files. (Don't use it for Oozie, though, since the Oozie admin features depend on the DailyRollingFileAppender.)
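If you go the RollingFileAppender route, a minimal sketch of the relevant hdfs-log4j properties could look like this (the appender name RFA, the 256MB size, and the backup count of 10 are illustrative; the daemons also need to use this appender, e.g. by pointing hadoop.root.logger / HADOOP_ROOT_LOGGER at RFA instead of DRFA):

# size-based rollover: at most 10 backups of 256MB each (example values)
log4j.appender.RFA=org.apache.log4j.RollingFileAppender
log4j.appender.RFA.File=${hadoop.log.dir}/${hadoop.log.file}
log4j.appender.RFA.MaxFileSize=256MB
log4j.appender.RFA.MaxBackupIndex=10
log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n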

So as usual a plethora of options 🙂

http://www.tutorialspoint.com/log4j/log4j_logging_files.htm

Edit: the DailyRollingFileAppender shipped with HDFS seems to be a newer version and has the following setting commented out in HDP 2.4. You can try commenting it in and setting it to a number you are comfortable with; the line below would keep 30 days of log files around.

#log4j.appender.DRFA.MaxBackupIndex=30
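For context, the surrounding DRFA block in hdfs-log4j looks roughly like the following once that line is commented in (this assumes the bundled appender really does honour MaxBackupIndex, as noted above):

log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=${hadoop.log.dir}/${hadoop.log.file}
# roll daily; the date pattern becomes the suffix of the rolled file
log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
# keep roughly 30 days of rolled logs
log4j.appender.DRFA.MaxBackupIndex=30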

Rising Star

Perfect, Thanks!!