
Nifi flow: Bash script execution, regex on stdout for FetchFile to parse correct filenames.

Rising Star

[Attached screenshots: 39784-flow.png, 39785-extext.png, 39786-fetch.png, 39787-updateattr.png]

In this flow, the goal is to take some zip archives from an FTP server, unzip them and send them to HDFS. The complication is that, due to rerouting issues in this specific FTP location, I cannot use the built-in Get/List/FetchFTP processors, as they fail to follow the proper rerouting.

What I can do is use a command line utility on the NiFi server that can handle the rerouting, and ncftp does the trick in this case. So my plan is to use ExecuteProcess to run the bash script:

ncftpget -f login.txt /home/user/test /path/to/remote/*.zip > /dev/null 2>&1
ls /home/user/test | grep ".zip"

The first line gets the wanted zip archives from the FTP server and discards all of its output and error streams, because we only want to parse the output of the second line, which lists the contents of the specified directory and keeps the entries with a 'zip' extension.

What I am trying to do is recreate the proper filenames between ExtractText --> UpdateAttribute and pass them to FetchFile --> UnpackContent --> PutHDFS.
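As a side note on the listing step: in grep ".zip" the dot matches any character and the pattern is not anchored, so a name such as notes_zipped.txt would also pass the filter. A minimal sketch of a stricter listing, assuming a hypothetical helper called list_zips (the name is not from the original script) and leaving the ncftpget download untouched:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original script): print the entries
# of a directory whose names end in ".zip", one per line.
list_zips() {
  local dir="$1"
  # -1 asks ls for one entry per line; the escaped dot and the trailing $
  # keep only names that literally end in ".zip".
  ls -1 "$dir" | grep '\.zip$'
}

# Example call, using the directory from the post:
# list_zips /home/user/test
```

The function prints nothing (and returns a non-zero status from grep) when the directory holds no zip archives, which may be worth handling before the result is passed further down the flow.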

So the output of the ExecuteProcess is something like:

file1_timestamp.zip
file2_timestamp.zip
file3_timestamp.zip
.....
file100_timestamp.zip

Next, an ExtractText processor with an added property 'filename' set to '\w+.zip' looks for this regex in the flowfile content and outputs a flowfile with new attributes filename.1, filename.2 ... filename.100, one for each match. Subsequently, UpdateAttribute specifies the local path where our bash script has placed the zip archives ('/home/user/test' in this case), as well as the proper filename, so that ${path}/${filename} is passed to the rest of the flow for fetching, unpacking and finally putting to HDFS.

The problem I have is that only the first match is passed to the rest of the flow, as only this match corresponds to the 'filename' attribute. The other filenames are written by the ExtractText processor to the attributes 'filename.2', 'filename.3'... 'filename.100'. I would like to find a way to update the attributes passed to FetchFile with some kind of incremental counter. I tried to configure the FetchFile processor with the File to Fetch property set to ${path}/${filename: nextInt()}, but this just looks for 'file_timestamp.zip#' filenames in the specified path, which of course are not there.

Accepted Solutions

Re: Nifi flow: Bash script execution, regex on stdout for FetchFile to parse correct filenames.

Expert Contributor

Hi @balalaika

It would be better to just output the file listing with one filename per line and then use a SplitText processor, followed by a FetchFile. You can get that listing with the -1 option for ls:

ls -1 /home/user/test | grep ".zip"

SplitText will then generate one flowfile per filename, which FetchFile can easily pick up.

You will still need to add an ExtractText between them, but with a simple (.*) rule. This transfers the flowfile content, which is the actual filename, into an attribute that FetchFile can use as the filename.
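To make the per-line idea concrete outside NiFi, here is a rough shell analogy, not NiFi configuration: each input line stands for one flowfile produced by SplitText, the loop variable plays the role of the attribute filled by the (.*) rule, and the printed path is what FetchFile would be asked to read. The function name build_fetch_paths is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for SplitText -> ExtractText -> FetchFile:
# read filenames from stdin, one per line, and print the full path
# that would be fetched for each of them.
build_fetch_paths() {
  local base_dir="$1"
  local name
  # Each line corresponds to one flowfile after SplitText.
  while IFS= read -r name; do
    # Skip empty lines so no bare "dir/" path is produced.
    [ -n "$name" ] || continue
    # Equivalent of ${path}/${filename} in the flow.
    printf '%s/%s\n' "$base_dir" "$name"
  done
}

# Example call, using the directory from the post:
# ls -1 /home/user/test | grep '\.zip$' | build_fetch_paths /home/user/test
```

Because every line is handled on its own, no counter such as filename.2 or filename.3 is needed; each filename is processed independently.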

Re: Nifi flow: Bash script execution, regex on stdout for FetchFile to parse correct filenames.

Rising Star

SplitText in between did the trick, amazing tip, thanks a lot! By the way, the ls -l output also contains other things such as permissions, so an ExtractText rule that parses only the zip filenames is also needed, but thanks anyway!

Re: Nifi flow: Bash script execution, regex on stdout for FetchFile to parse correct filenames.

Expert Contributor

Glad it works, but I didn't say to use ls -l (lowercase letter L, which outputs the additional details). I said ls -1 (the digit one), which outputs only the filenames, one per line.
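The difference is easy to check in a scratch directory. A small sketch, assuming a temporary directory created with mktemp and a made-up file name:

```shell
#!/usr/bin/env bash
# Create a throwaway directory containing a single example archive name.
demo_dir=$(mktemp -d)
touch "$demo_dir/file1_timestamp.zip"

# ls -1 (digit one): only the names, one per line.
short_listing=$(ls -1 "$demo_dir")

# ls -l (letter L): long format with permissions, owner, size and date.
long_listing=$(ls -l "$demo_dir")

echo "ls -1 output:"
echo "$short_listing"
echo "ls -l output:"
echo "$long_listing"
```

With ls -1 the content of each split flowfile is already the bare filename, so the simple (.*) rule in ExtractText is enough.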
