HDFS Best way to trigger execution at File arrival

Is there a clean way to trigger the execution of a script or an Oozie workflow as soon as a file has been completely written to HDFS on HDP, that is, when the file lands on HDFS?

I can't use NiFi, so please don't suggest NiFi.

While browsing forums I only found people saying this is "not available in the current HDFS API", or people setting up an Oozie job that polls a directory at regular intervals. The issue is that the more directories you have to watch, the more polling jobs you need, which is a waste of resources. Polling also always introduces a processing delay, which has to be balanced against the unnecessary workload of a higher polling frequency. The best option seems to be getting notified when a file is saved and checking whether its name matches a regexp.

The idea below is definitely not enterprise class and is based on parsing the NameNode log. Is there a better and cleaner way to do this, and has anything been missed?

Consider monitoring /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log for lines matching the pattern *"INFO hdfs.StateChange (FSNamesystem.java:completeFile"*"completeFile"*"is closed by"*.

The code could look like the example below, which detects that a file has been written to a given directory or directory tree.

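# Follow the NameNode log and react to every "completeFile ... is closed by" entry.
# tail -F (instead of -f) keeps following the log after it is rotated.
tail -F /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log | while IFS= read -r line; do
    case "$line" in
        *"INFO  hdfs.StateChange (FSNamesystem.java:completeFile"*"completeFile"*"is closed by"*)
            # Extract the HDFS path between "completeFile: " and " is closed by".
            v_filename=$(printf '%s\n' "$line" | sed -e 's?^.* completeFile: \(.*\) is closed by.*?\1?')
            v_dirname=$(dirname "$v_filename")
            echo "File created [$v_filename] Dirname [$v_dirname]"
            #echo "line $line"
            case "$v_dirname" in
                "/data/ingest"*)
                    # The file is inside the watched directory tree.
                    echo "WATCH DIRECTORY directory $v_dirname : file $v_filename"
                    #FileTriggerExec.sh "$v_dirname" "$v_filename"
                    ;;
            esac
            ;;
    esac
done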
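
The post asks for matching the saved file against a regexp, while the loop above only checks a directory prefix. A minimal sketch of that filtering step could look like this. The function name path_matches and the WATCH_REGEX value are hypothetical and chosen only for illustration.

```shell
# Hypothetical example pattern: CSV files anywhere under /data/ingest.
# This value is an assumption, not something taken from the original post.
WATCH_REGEX='^/data/ingest/.*\.csv$'

# Succeeds (exit status 0) when the path matches the extended regexp.
# $1 = HDFS path, $2 = extended regular expression
path_matches() {
    printf '%s\n' "$1" | grep -Eq -- "$2"
}
```

Inside the loop, the inner case statement could then be replaced by something like: if path_matches "$v_filename" "$WATCH_REGEX"; then FileTriggerExec.sh "$v_dirname" "$v_filename"; fi. One thing to keep in mind: files uploaded with hdfs dfs -put are first written under a temporary name ending in ._COPYING_ and renamed afterwards, so the name seen in the log may carry that suffix and would not match a pattern anchored on the final extension.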

Any comments, improvements, or a more industrial solution?

1 REPLY

HDFS has an inotify feature which essentially translates those log entries into events that can be consumed.

https://issues.apache.org/jira/browse/HDFS-6634

Here's a Java based example: https://github.com/onefoursix/hdfs-inotify-example

Alternatively, rather than having Oozie monitor many directories and waste resources, a script can run 'hdfs dfs -ls -R /folder|grep|sed' every minute or so. That is still not event based, so it depends on how fast a reaction you need versus how easily you can implement and use the inotify API.
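
As a rough sketch of the polling alternative, the listing can be reduced to file paths and compared with the snapshot from the previous pass, so that only new files are reported. The function names and the state file are hypothetical, and the sketch assumes paths without whitespace.

```shell
# Print only the file paths from `hdfs dfs -ls -R` output read on stdin.
# File entries start with "-" (directories start with "d"); the path is the
# 8th column. Assumption: paths contain no whitespace.
listing_to_paths() {
    awk '$1 ~ /^-/ { print $8 }' | sort
}

# Print paths that are in the current snapshot ($2) but not in the previous
# one ($1). Both files must be sorted, as produced by listing_to_paths.
new_paths() {
    comm -13 "$1" "$2"
}

# One polling pass (hypothetical helper): list, diff, then rotate the snapshot.
# $1 = HDFS directory, $2 = local state file holding the previous snapshot
poll_once() {
    raw=$(mktemp) || return 1
    current=$(mktemp) || { rm -f "$raw"; return 1; }
    # Leave the previous snapshot untouched if the listing fails.
    if ! hdfs dfs -ls -R "$1" > "$raw"; then
        rm -f "$raw" "$current"
        return 1
    fi
    listing_to_paths < "$raw" > "$current"
    [ -f "$2" ] || : > "$2"
    new_paths "$2" "$current"
    mv "$current" "$2"
    rm -f "$raw"
}
```

A scheduler such as cron could call poll_once /folder /var/tmp/folder.snapshot every minute and pass each printed path to the trigger script. The first pass reports every existing file as new. A file that is still being written also shows up in the listing, so this approach does not by itself guarantee the file is complete.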