Created 04-15-2016 09:19 AM
Dear all,
Currently, I have some files on HDFS that are being written by some jobs, but I don't know which jobs.
So, how can I check which jobs have these files open?
Thanks,
Created 05-28-2016 01:17 AM
If this problem comes up often, that is, if you regularly need to map file operations (create, delete, rename, etc.) to the higher-level applications that issued them, I suggest having users rely on the caller context feature, which was released in HDP 2.2 and up.
The feature introduces a new setting, hadoop.caller.context.enabled. When set to true, additional fields are written into NameNode audit log records to help identify the job or query that introduced each NameNode operation. This feature is enabled by default starting with this release of HDP.

New Behavior: This feature brings a new key-value pair at the end of each audit log record. The newly added key is callerContext, and its value is context:signature. The overall format would be callerContext=context:signature. If the signature is null or empty, the value will be context only, in the format of callerContext=context. If the hadoop.caller.context.enabled config key is false, the key-value pair will not be shown. The audit log format is not changed in this case. It is also possible to limit the maximum length of context and signature. Consider the hadoop.caller.context.max.size config key (default 128 bytes) and the hadoop.caller.context.signature.max.size config key (default 40 bytes) respectively. There is a chance that the new information in the audit log may break existing scripts/automation that was being used to analyze the audit log. In this case the scripts may need to be fixed. We do not recommend disabling this feature as it can be a useful troubleshooting aid.
Please refer to the release notes.
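To make the callerContext=context:signature format concrete, here is a rough Python sketch that pulls the key-value pairs out of one audit record. It assumes the record is a logger prefix followed by tab-separated key=value fields starting at allowed=, and that the first colon separates context from signature. The sample line and the function names are made up for illustration, so check them against your own audit log before relying on this.

```python
def parse_audit_record(line):
    """Split one audit record into a dict of its key=value fields."""
    # Assumption: everything before "allowed=" is the logger prefix
    # (timestamp, level, logger name) and the rest is tab-separated.
    start = line.find("allowed=")
    if start == -1:
        return {}
    fields = {}
    for part in line[start:].rstrip("\n").split("\t"):
        key, sep, value = part.partition("=")
        if sep:
            fields[key] = value
    return fields


def split_caller_context(value):
    """Return (context, signature); signature is None when it is absent."""
    # Assumption: the first colon separates context from signature.
    context, sep, signature = value.partition(":")
    return context, (signature if sep else None)


# Hypothetical record, not copied from a real cluster.
SAMPLE = (
    "2016-05-28 01:17:00,000 INFO FSNamesystem.audit: allowed=true\t"
    "ugi=etl (auth:SIMPLE)\tip=/10.0.0.5\tcmd=create\tsrc=/data/logs.txt\t"
    "dst=null\tperm=etl:hdfs:rw-r--r--\tproto=rpc\t"
    "callerContext=mr_attempt_0001:sig"
)

if __name__ == "__main__":
    record = parse_audit_record(SAMPLE)
    print(record.get("src"), split_caller_context(record.get("callerContext", "")))
```

When hadoop.caller.context.enabled is false the callerContext key is simply missing, which is why the sketch looks it up with a default instead of indexing directly.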
Created 04-15-2016 09:46 AM
Are you looking to check for open files on HDFS or open files on the local file system?
Created 04-15-2016 09:46 AM
Hi @tunglq it, which files do you have in mind? After installing an HDP cluster there are a number of system (cluster) files written to HDFS even when no jobs are running (Ambari Metrics in embedded mode and Ranger audit can write files all the time, but they are a special case). Folders like apps, hdp, system, tmp, user, etc. are created and populated automatically during the cluster upgrade.
Created 04-15-2016 10:13 AM
Thank you for your answer!
For example, I have one file on HDFS, hdfs://cluster/data/logs.txt, and some job is writing to it. Now I want to know which job is writing to this file, logs.txt.
Thanks,
Created 04-15-2016 10:26 AM
There is no straightforward way to identify which file was written by which job. It takes a little manual work: parse all the job logs with a script and look for occurrences of that specific file path or name. In most cases, if you ran a MapReduce job, the ApplicationMaster container logs are likely to have that information; if not, it is better to parse each job's container logs one by one with a script.
Will that help?
Created 04-15-2016 10:57 AM
You need to write a custom script (say, bash or Perl) that checks the MapReduce log files, so you can capture the src/dest of any HDFS file the job is using.
Some more logic within the script may help you track which files on HDFS are currently in use.
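As a starting point for such a script, here is a small Python sketch (Python instead of bash/Perl, only for illustration). It assumes the job and container logs have already been copied to a local directory, for example with yarn logs -applicationId, and it simply reports every log line that mentions the HDFS path. The function name and the directory used in the usage lines are hypothetical.

```python
import os


def find_logs_mentioning(log_root, needle):
    """Yield (log_file, line_number, line) for each line containing needle."""
    for dirpath, _dirnames, filenames in os.walk(log_root):
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps the scan going on odd bytes in logs.
                with open(full_path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            yield full_path, number, line.rstrip("\n")
            except OSError:
                # Unreadable file (permissions, vanished): skip it.
                continue


if __name__ == "__main__":
    # Hypothetical local directory holding the collected job logs.
    for log_file, number, line in find_logs_mentioning("./job-logs", "/data/logs.txt"):
        print(f"{log_file}:{number}: {line}")
```

A plain substring match is used on purpose: the path may show up with or without the hdfs://cluster prefix, so searching for the part after the authority catches both forms.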
Created 05-28-2016 01:10 AM
First, check the HDFS audit log, which tells you the client that created the file. Then, based on the client name, search the YARN application logs to identify which job was writing the file.
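The first step could look roughly like the Python sketch below. It assumes each audit record is a logger prefix followed by tab-separated key=value fields (allowed, ugi, ip, cmd, src and so on), and it treats create and append as the write operations of interest. The function name and the sample lines are made up, so adjust them to what your audit log really contains.

```python
def writers_of(audit_lines, path, commands=("create", "append")):
    """Return (prefix, ugi, ip, cmd) for each write-type record on path."""
    hits = []
    for line in audit_lines:
        # Assumption: key=value fields start at "allowed=" and are tab-separated.
        start = line.find("allowed=")
        if start == -1:
            continue
        fields = {}
        for part in line[start:].rstrip("\n").split("\t"):
            key, sep, value = part.partition("=")
            if sep:
                fields[key] = value
        if fields.get("src") == path and fields.get("cmd") in commands:
            prefix = line[:start].strip()  # timestamp, level, logger name
            hits.append((prefix, fields.get("ugi"), fields.get("ip"), fields["cmd"]))
    return hits


# Hypothetical records, not copied from a real cluster.
SAMPLE_LINES = [
    "2016-04-15 09:00:01,000 INFO FSNamesystem.audit: allowed=true\tugi=etl (auth:SIMPLE)"
    "\tip=/10.0.0.5\tcmd=create\tsrc=/data/logs.txt\tdst=null\tperm=etl:hdfs:rw-r--r--",
    "2016-04-15 09:00:02,000 INFO FSNamesystem.audit: allowed=true\tugi=bob (auth:SIMPLE)"
    "\tip=/10.0.0.9\tcmd=open\tsrc=/data/logs.txt\tdst=null\tperm=null",
]

if __name__ == "__main__":
    for hit in writers_of(SAMPLE_LINES, "/data/logs.txt"):
        print(hit)
```

The user, IP address and timestamp found this way narrow down which application logs are worth searching in the second step.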