NiFi flow: Bash script execution, regex on stdout for FetchFile to parse correct filenames.

Expert Contributor

39784-flow.png

39785-extext.png

39786-fetch.png

39787-updateattr.png

In the attached flow, the goal is to take some zip archives from an FTP server, unzip them and send them to HDFS. The complication is that, due to rerouting issues in this specific FTP location, I cannot use the built-in Get/List/FetchFTP processors, as they fail to follow the proper rerouting.

What I can do is use a command line utility on the NiFi server that can handle rerouting, and indeed ncftp does the trick in this case. So my plan is to use ExecuteProcess to run this bash script:

ncftpget -f login.txt /home/user/test /path/to/remote/*.zip > /dev/null 2>&1
ls /home/user/test | grep '\.zip$'

The first line gets the wanted zip archives from the FTP server and discards all of its output and error streams, since we only want to parse the output of the second line, which lists the contents of the specified directory and keeps the entries with a 'zip' extension.
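As a side note on the listing step, grep treats an unescaped dot as "any character", so a loose pattern can let through names that merely contain "zip". The sketch below is only an illustration: it uses a throwaway temporary directory and made-up filenames (notes.txt, unzip.log) in place of /home/user/test.

```shell
# Sketch only: a throwaway directory stands in for /home/user/test.
dir="$(mktemp -d)"
touch "$dir/file1_timestamp.zip" "$dir/file2_timestamp.zip" "$dir/notes.txt" "$dir/unzip.log"

# Unescaped dot: any character followed by "zip" matches, so unzip.log slips through.
loose="$(ls "$dir" | grep ".zip")"

# Escaped dot plus end-of-line anchor: only names that end in ".zip" are kept.
strict="$(ls "$dir" | grep '\.zip$')"

echo "loose:";  printf '%s\n' "$loose"
echo "strict:"; printf '%s\n' "$strict"

# Remove the temporary directory again.
rm -rf "$dir"
```

If the download directory only ever contains the fetched archives, both patterns give the same result; the anchored one just avoids surprises.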

What I am trying to do is recreate the proper filenames in ExtractText --> UpdateAttribute and pass them to FetchFile --> UnpackContent --> PutHDFS.

So the output of the ExecuteProcess is something like:

file1_timestamp.zip
file2_timestamp.zip
file3_timestamp.zip
.....
file100_timestamp.zip

Next, an ExtractText processor with an added property 'filename' set to '\w+.zip' searches the flowfile content for this regex and outputs a flowfile with a new attribute for each match ('filename', 'filename.2' ... 'filename.100'). Subsequently, UpdateAttribute sets the local path where our bash script placed the zip archives ('/home/user/test' in this case), as well as the proper filename, so that ${path}/${filename} is passed to the rest of the flow for fetching, unpacking and finally putting to HDFS.

The problem I have is that only the first match is passed to the rest of the flow, since only that match corresponds to the 'filename' attribute. ExtractText puts the other filenames in the attributes 'filename.2', 'filename.3' ... 'filename.100'. I would like to find a way to update the attribute passed to FetchFile with some kind of incremental counter. I tried configuring FetchFile's File to Fetch property as ${path}/${filename: nextInt()}, but this just looks for 'file_timestamp.zip#' filenames in the specified path, which of course are not there.

1 ACCEPTED SOLUTION

Super Collaborator

Hi @balalaika

It would be better to just output the file listing with one filename per line and then have a SplitText processor, followed by a FetchFile. You can do this with the -1 parameter for ls:

ls -1 /home/user/test | grep '\.zip$'

SplitText will then generate one flowfile per file, which FetchFile can easily pick up.

You will still need to add an ExtractText between them, but with a simple (.*) rule (this transfers the flowfile content, which is the actual filename, into an attribute that FetchFile can use as the filename).
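To make the idea concrete, here is a rough shell analogue of what SplitText --> ExtractText --> FetchFile does with the listing. It is not NiFi configuration, just a sketch: the loop plays the role of SplitText (one line per iteration), the variable filename stands in for the attribute that ExtractText fills, and the names to_fetch and count are made up for the example.

```shell
# Sketch only: shell analogue of SplitText -> ExtractText -> FetchFile.
path="/home/user/test"   # the value UpdateAttribute would set
to_fetch=""              # hypothetical: newline-separated list of full paths
count=0                  # hypothetical: number of "flowfiles" produced

# Each iteration corresponds to one flowfile coming out of SplitText.
while IFS= read -r filename; do
  [ -n "$filename" ] || continue          # ignore empty lines in the listing
  # The whole line is the filename, which is what a (.*) rule would capture.
  to_fetch="${to_fetch}${path}/${filename}
"
  count=$((count + 1))
done <<EOF
file1_timestamp.zip
file2_timestamp.zip
file3_timestamp.zip
EOF

# One full path per line, i.e. what ${path}/${filename} evaluates to per flowfile.
printf '%s' "$to_fetch"
```

Because every flowfile carries exactly one filename, no counter or numbered attribute is needed anymore.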

Expert Contributor

SplitText in between did the trick, amazing tip, thanks a lot! Btw the ls -l output also contains other stuff like permissions etc., so a rule for ExtractText to parse only the zip filenames is also needed, but thanks anyway!

Super Collaborator

Glad it worked, but I didn't say to use ls -l (the letter l, which outputs the additional stuff), but ls -1 (the number one), which outputs only the filename, one per line.
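The difference is easy to check in a scratch directory. This is only a sketch; the temporary directory and the single test file are made up for the illustration.

```shell
# Sketch only: compare ls -1 (digit one) with ls -l (letter l).
dir="$(mktemp -d)"
touch "$dir/file1_timestamp.zip"

one_per_line="$(ls -1 "$dir")"   # names only, one per line
long_format="$(ls -l "$dir")"    # long listing: permissions, owner, size, date, name

echo "ls -1:"; printf '%s\n' "$one_per_line"
echo "ls -l:"; printf '%s\n' "$long_format"

# Remove the temporary directory again.
rm -rf "$dir"
```

With ls -1 each line is already a bare filename, so the (.*) rule in ExtractText is enough.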
