HDFS Best way to trigger execution at File arrival


Is there a clean way to trigger the execution of a script or an Oozie workflow as soon as a file has been completely written to HDFS on HDP, that is, when the file lands on HDFS?

I can't use NiFi, so please don't suggest NiFi.

While going through forums I only found people saying "not available in the current HDFS API", or people setting up an Oozie job that polls a directory on a regular basis. The issue is that the more directories you have to watch, the more polling jobs you need, which is a waste of resources. Polling also always introduces a processing delay, which has to be balanced against the unnecessary workload of a higher polling frequency. The best option seems to be to get notified when a file is saved and then check whether its path matches a regexp.

The idea below is definitely not enterprise class and is based on parsing the NameNode log. Is there a better and cleaner way to do this, and has anything been missed?

Consider monitoring /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log for lines matching *"INFO hdfs.StateChange ("*"completeFile"*"is closed by"*.

The code could look like the below; it detects that a file has been written to a given directory or directory tree.

# Follow the NameNode log (-F keeps following across log rotation).
tail -F /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log | while IFS= read -r line; do
    case "$line" in
        *"INFO"*"hdfs.StateChange ("*"completeFile"*"is closed by"*)
            # Extract the HDFS path between "completeFile: " and " is closed by".
            v_filename=$(printf '%s\n' "$line" | sed -e 's?^.* completeFile: \(.*\) is closed by.*?\1?')
            v_dirname=$(dirname "$v_filename")
            echo "File created [$v_filename] Dirname [$v_dirname]"
            case "$v_dirname" in
                "/data/ingest"*)
                    echo "WATCH DIRECTORY directory $v_dirname : file $v_filename"
                    # Trigger the script or Oozie workflow here.
                    ;;
            esac
            ;;
    esac
done

Any comments, improvements, or a more industrial solution?


Super Collaborator

HDFS has an inotify feature which essentially translates those log entries into events that can be consumed.

Here's a Java based example:

Alternatively, rather than having Oozie monitor many directories and waste resources, a script can run 'hdfs dfs -ls -R /folder|grep|sed' every minute or so. That's still not event based, so it depends on how fast a reaction you need versus how easily you can implement or use the inotify API.
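
To make the polling option a bit more concrete, here is a minimal sketch of one script that snapshots the recursive listing and reports only the files that were not there on the previous pass. The function names, the /data/ingest path, the regexp and the state file location are placeholders made up for illustration, not part of any HDFS tooling. It assumes the usual 8-column 'hdfs dfs -ls' output with the path in the last column.

```shell
# Polling sketch: report files that appeared under a directory since the last pass.
# All names below (list_paths, new_files, poll_once) are hypothetical helpers.

# Read `hdfs dfs -ls -R` output on stdin and print the sorted paths of regular files.
# File lines start with "-" (directories start with "d"); the first 7 columns are dropped.
list_paths() {
    sed -n -E '/^-/ s/^([^ ]+ +){7}//p' | LC_ALL=C sort
}

# Print paths found in the current listing ($2) but not in the previous one ($1)
# that match the extended regexp $3. Both listings must come from list_paths.
new_files() {
    LC_ALL=C comm -13 "$1" "$2" | grep -E -- "$3" || true
}

# One polling pass: snapshot the tree, print new matching files, keep the snapshot.
# On the very first pass the state file is empty, so every matching file is reported.
poll_once() {
    local watch_dir="$1"
    local pattern="$2"
    local state="$3"
    local current
    current=$(mktemp)
    hdfs dfs -ls -R "$watch_dir" | list_paths > "$current"
    [ -f "$state" ] || : > "$state"
    new_files "$state" "$current" "$pattern"
    mv "$current" "$state"
}

# Example loop (placeholder values), one listing per minute for the whole tree:
#   while true; do
#       poll_once /data/ingest '\.csv$' /tmp/ingest.state |
#           while IFS= read -r f; do echo "new file: $f"; done
#       sleep 60
#   done
```

A single loop like this covers a whole directory tree with one listing per interval, instead of one polling job per directory. Keep in mind that a listing also shows files that are still being written, so the writer should upload under a temporary name and rename when done ('hdfs dfs -put' does this with a ._COPYING_ suffix), or the regexp should exclude such in-progress names.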