Is there a clean, reliable way to trigger a script or an Oozie workflow as soon as a file has been completely written to HDFS on HDP, that is, at the moment the file lands on HDFS?
I can't use NiFi, so please don't answer with NiFi.
Searching the forums, I only found people saying this is "not available in the current HDFS API", or people running an Oozie job that polls a directory on a regular basis. The issue is that the more directories you have to watch, the more polling jobs you need, which is a waste of resources. Polling also always introduces a processing delay, which has to be traded off against the unnecessary workload of a higher polling frequency. The best approach seems to be to get notified when a file is saved and then check whether its path matches a regexp.
The idea below is definitely not enterprise class and is based on parsing the NameNode log. Is there a better, cleaner way to do this, and has anything been missed?
Consider monitoring /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log
for lines matching *"INFO hdfs.StateChange (FSNamesystem.java:completeFile"*"completeFile"*"is closed by"*.
The code could look like the following; it detects that a file has been closed in a given directory or directory tree.
tail -F /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log | while IFS= read -r line; do
  case "$line" in
    *"INFO hdfs.StateChange (FSNamesystem.java:completeFile"*"completeFile"*"is closed by"*)
      # Extract the full HDFS path of the file that was just closed
      v_filename=$(printf '%s\n' "$line" | sed -e 's?^.* completeFile: \(.*\) is closed by.*?\1?')
      v_dirname=$(dirname "$v_filename")
      echo "File created [$v_filename] Dirname [$v_dirname]"
      #echo "line $line"
      case "$v_dirname" in
        /data/in*)  # example pattern only: replace with the directory tree to watch
          echo "WATCH DIRECTORY directory $v_dirname : file $v_filename"
          #FileTriggerExec.sh "$v_dirname" "$v_filename"
          ;;
      esac
      ;;
  esac
done
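To try the extraction and the regexp match without a running NameNode, the two steps can be pulled into small functions and fed a hand-written line. This is only a sketch: closed_file and path_matches are hypothetical helper names, and the sample line is illustrative (its timestamp, source line number and client id are made up), so check it against a real line from your own log.

```shell
#!/bin/sh
# Print the HDFS path from a NameNode "completeFile: ... is closed by" log line,
# or print nothing if the line is not such an event.
closed_file() {
    printf '%s\n' "$1" | sed -n -e 's?^.* completeFile: \(.*\) is closed by.*?\1?p'
}

# Succeed if path $1 matches the extended regular expression $2.
path_matches() {
    printf '%s\n' "$1" | grep -Eq -- "$2"
}

# Illustrative line only; timestamp, line number and client id are invented.
sample='2020-01-01 10:00:00,000 INFO hdfs.StateChange (FSNamesystem.java:completeFile(3503)) - DIR* completeFile: /data/in/sales.csv is closed by DFSClient_NONMAPREDUCE_1_1'

f=$(closed_file "$sample")
# The regexp below is an example; use whatever pattern identifies your files.
if path_matches "$f" '^/data/in/.*\.csv$'; then
    echo "trigger for $f in $(dirname "$f")"
fi
```

Matching on a regexp instead of a shell glob in the inner case makes it possible to keep one pattern per workflow in a small table and run a single tail for all watched directories.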
Any comments, improvements, or a more industrial solution?
Alternatively, rather than having Oozie monitor many directories and waste resources, a script can run 'hdfs dfs -ls -R /folder|grep|sed' every minute or so. That is still not event based, so it depends on how fast a reaction you need versus how easily you can implement or use the inotify API.
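A minimal sketch of that polling approach, assuming the hdfs client is on the PATH and that file paths sit in the last column of the 'hdfs dfs -ls -R' output. The function names (list_files, new_entries, poll_dir) are hypothetical, and the echo stands in for whatever trigger you want to run.

```shell
#!/bin/sh
# Print one path per line for regular files under $1.
# Lines for regular files start with '-', directory lines start with 'd'.
list_files() {
    hdfs dfs -ls -R "$1" | grep '^-' | sed -e 's?^.* \(/.*\)$?\1?' | sort
}

# Print the lines that are in sorted listing file $2 but not in sorted listing file $1.
new_entries() {
    comm -13 "$1" "$2"
}

# Poll directory $1 every $2 seconds and report each file not seen on the previous pass.
poll_dir() {
    prev=$(mktemp) && cur=$(mktemp) || return 1
    list_files "$1" > "$prev"
    while sleep "$2"; do
        list_files "$1" > "$cur"
        new_entries "$prev" "$cur" | while IFS= read -r f; do
            echo "NEW FILE $f"   # replace with the real trigger
        done
        cp "$cur" "$prev"
    done
}
```

Calling poll_dir /folder 60 would watch the whole tree with one listing per minute instead of one Oozie job per directory. Note that a listing shows a file as soon as it exists, not when it is closed, so the trigger may need to skip files that are still being written.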