Support Questions


Check opening files on HDFS

Explorer

Dear all,

Currently, I have some files on HDFS that are being written to by some jobs, but I don't know which jobs they are.

So, how do I check which job each open file belongs to?

Thanks,

1 ACCEPTED SOLUTION

Contributor

If this problem happens a lot — that is, you always need to know the mapping from file operations (create, delete, rename, etc.) to upper-level applications — I think you can suggest that users use the caller context feature, which is available in HDP 2.2 and later.

The feature introduces a new setting, hadoop.caller.context.enabled. When it is set to true, additional fields are written into NameNode audit log records to help identify the job or query that issued each NameNode operation. This feature is enabled by default starting with this release of HDP.

New behavior: the feature adds a new key-value pair at the end of each audit log record. The newly added key is callerContext and its value is context:signature, so the overall format is callerContext=context:signature. If the signature is null or empty, the value is the context only, in the format callerContext=context. If the hadoop.caller.context.enabled config key is false, the key-value pair is not emitted and the audit log format is unchanged. It is also possible to limit the maximum length of the context and signature, via the hadoop.caller.context.max.size (default 128 bytes) and hadoop.caller.context.signature.max.size (default 40 bytes) config keys, respectively.
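As a rough sketch, the three property keys described above might be set in core-site.xml like this (property names are from the description above; the values shown are just the documented defaults, with the feature switched on):

```xml
<!-- core-site.xml: enable caller context in NameNode audit log records -->
<property>
  <name>hadoop.caller.context.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- maximum length of the context field, in bytes (default 128) -->
  <name>hadoop.caller.context.max.size</name>
  <value>128</value>
</property>
<property>
  <!-- maximum length of the signature field, in bytes (default 40) -->
  <name>hadoop.caller.context.signature.max.size</name>
  <value>40</value>
</property>
```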

There is a chance that the new information in the audit log may break existing scripts/automation that was being used to analyze the audit log. In this case the scripts may need to be fixed. We do not recommend disabling this feature as it can be a useful troubleshooting aid.

Please refer to release notes.


7 REPLIES

Super Guru
@tunglq it

Are you curious to check for HDFS open files or local FS open files?

Master Guru

Hi @tunglq it, which files do you have in mind? After installing an HDP cluster there are a number of system (cluster) files written to HDFS even when no jobs are running (Ambari Metrics in embedded mode and Ranger audit can write files all the time, but they are a special case). Folders like apps, hdp, system, tmp, user, etc. are created and populated automatically during cluster installation and upgrades.

Explorer

Thank you for your answer!

For example, I have one file on HDFS, hdfs://cluster/data/logs.txt, and it is being written by some job. Now I want to know: which job is writing this file logs.txt?

Thanks,

Super Guru

@tunglq it

There is no straightforward way to identify which file was written by which job. However, with a little manual work we can get there by parsing all of the job logs with a script and looking for occurrences of that specific file path or name. In most cases, if you ran a MapReduce job, the ApplicationMaster container log is likely to have that information; if not, it is better to parse each job's container logs one by one with a script.
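The log-parsing idea above could be sketched roughly as follows. This is a hypothetical helper, not a ready-made tool: it assumes you have already downloaded each application's aggregated log locally (e.g. with `yarn logs -applicationId <app_id> > logs/<app_id>.log`; application IDs can be listed with `yarn application -list`), and the function name and directory layout are illustrative.

```shell
#!/bin/sh
# Hypothetical sketch: find which previously downloaded job logs mention a
# given HDFS path. Assumes each application's aggregated log was saved as
# <logdir>/<application_id>.log beforehand, e.g.:
#   yarn logs -applicationId application_123_0001 > logs/application_123_0001.log

find_jobs_touching() {
    path="$1"    # HDFS path or file name to search for, e.g. /data/logs.txt
    logdir="$2"  # directory containing the downloaded application logs
    # -l prints only the names of the log files that mention the path;
    # the file name encodes the application id, which identifies the job.
    grep -l -- "$path" "$logdir"/*.log 2>/dev/null
}
```

Each file name printed points at a candidate application; you would then inspect that application's log to confirm it actually wrote the file.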

Will that help?

Super Guru
@tunglq it

You need to write a custom script (say, in Bash or Perl) that checks the MapReduce log files; from those you can capture the source/destination of any HDFS file the job is using.

Some more logic within the script may help you track which files on HDFS are currently in use.
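For the "currently in use" part, HDFS itself can report files that are open for write, which may save some scripting. For example (this requires access to a running cluster, and the exact output format can vary by Hadoop version):

```shell
# List files under /data that are currently open for write on HDFS.
hdfs fsck /data -openforwrite -files
```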

Contributor

First, check the HDFS audit log, which records the client that created the file. Then, based on the client name, you can search the YARN application logs to identify which job was writing the file.
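A minimal sketch of that first step, assuming the typical audit log line format (fields like ugi=, cmd=create, and src=<path>). The log location below is an assumption; it varies by distribution and configuration, so adjust it for your cluster.

```shell
#!/bin/sh
# Hypothetical sketch: pull the creation records for a given file out of the
# NameNode audit log. AUDIT_LOG is an assumed location; adjust for your cluster.
AUDIT_LOG=/var/log/hadoop/hdfs/hdfs-audit.log

audit_creates_for() {
    path="$1"  # HDFS path, e.g. /data/logs.txt
    # Audit lines for file creation typically contain cmd=create and src=<path>;
    # the ugi= field in the matching lines identifies the user/client behind it.
    grep "cmd=create" "$AUDIT_LOG" | grep -- "src=$path"
}
```

The ugi= value from the matching lines gives you the client to look for when you then search the YARN application logs.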
