Support Questions

foivos · 10-12-2017

In the flow below, the goal is to fetch some zip archives from an FTP server, unzip them and send them to HDFS. The complication is that, due to rerouting issues at this specific FTP location, I cannot use the built-in GetFTP/ListFTP/FetchFTP processors, as they fail to follow the rerouting properly.

What I can do is use a command-line utility on the NiFi server that can handle the rerouting, and ncftp does the trick in this case. So my plan is to use ExecuteProcess to run this bash script:

ncftpget -f login.txt /home/user/test /path/to/remote/*.zip > /dev/null 2>&1
ls /home/user/test | grep ".zip"

The first line downloads the wanted zip archives from the FTP server and discards all of its output and error streams, since we only want to parse the output of the second line, which lists the contents of the specified directory and keeps the names with a 'zip' extension.
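One caveat with the listing line: in grep an unescaped dot matches any single character and the pattern is not anchored, so ".zip" would also match a name such as myzips.txt. Below is a minimal sketch of a stricter listing. The helper name list_zips, the throwaway directory and the sample file names are made up for illustration and are not part of the original script.

```shell
#!/bin/sh

# list_zips DIR: print the names in DIR that end in ".zip", one per line.
# (list_zips is a hypothetical helper, not part of the original script.)
list_zips() {
    # -1 forces one name per line; '\.zip$' escapes the dot and anchors
    # the match at the end of the name.
    ls -1 "$1" | grep '\.zip$'
}

# Demo on a throwaway directory instead of /home/user/test.
demo_dir=$(mktemp -d)
touch "$demo_dir/file1_20171012.zip" \
      "$demo_dir/file2_20171012.zip" \
      "$demo_dir/myzips.txt"

# myzips.txt is filtered out here; with grep ".zip" it would be listed too.
listing=$(list_zips "$demo_dir")
printf '%s\n' "$listing"

rm -rf "$demo_dir"
```

The download step is left out on purpose, since it depends on the FTP server and login.txt from the post.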

What I am trying to do is recreate the proper filenames between ExtractText --> UpdateAttribute and pass them to FetchFile --> UnpackContent --> PutHDFS.

So the output of the ExecuteProcess is something like:

file1_timestamp.zip
file2_timestamp.zip
file3_timestamp.zip
.....
file100_timestamp.zip

Next, an ExtractText processor with an added property 'filename' set to '\w+.zip' looks for this regex in the flowfile content and outputs a flowfile with new attributes, one for each match. Subsequently, UpdateAttribute sets the local path where our bash script placed the zip archives ('/home/user/test' in this case) as well as the proper filename, so that ${path}/${filename} is passed to the rest of the flow for fetching, unpacking and finally putting to HDFS.

The problem I have is that only the first match is passed to the rest of the flow, as only this match corresponds to the 'filename' attribute. ExtractText stores the other filenames in the attributes 'filename.2', 'filename.3' ... 'filename.100'. I would like to find a way to update the attributes passed to FetchFile with some kind of incremental counter. I tried configuring the File to Fetch property of FetchFile as ${path}/${filename: nextInt()}, but this just looks for 'file_timestamp.zip#' filenames in the specified path, which of course are not there.

aanghel · 10-12-2017

Hi @balalaika

It would be better to just output the file listing with one filename per line and then use a SplitText processor, followed by FetchFile. You can do this with the -1 option of ls:

ls -1 /home/user/test | grep ".zip"

SplitText will just generate one flowfile for each file, which is then easily picked up by FetchFile.

You will still need to add an ExtractText between them, but with a simple (.*) rule (this is to transfer the flowfile content, which is the actual filename, into an attribute that FetchFile can use as the filename).
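Outside NiFi, the per-line idea can be imitated in plain shell. This is only an illustration of the logic, not NiFi configuration: each loop iteration stands in for one flowfile produced by SplitText, and the variable filename stands in for the attribute filled by the (.*) rule. The sample listing is copied from the question, and the variable names are assumptions made for this sketch.

```shell
#!/bin/sh

# Sample listing, as ExecuteProcess would emit it (one name per line).
listing='file1_timestamp.zip
file2_timestamp.zip
file3_timestamp.zip'

# Local directory where the script placed the archives (from the post).
path=/home/user/test

# One iteration per line mirrors one flowfile per line after SplitText;
# "$filename" plays the role of the attribute extracted with (.*).
targets=$(printf '%s\n' "$listing" | while IFS= read -r filename; do
    [ -n "$filename" ] || continue        # skip blank lines
    printf '%s/%s\n' "$path" "$filename"  # what FetchFile would be asked to read
done)

printf '%s\n' "$targets"
```

Each printed line corresponds to one ${path}/${filename} value, so every archive gets its own fetch instead of only the first match.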

foivos · 10-12-2017

SplitText in between did the trick, amazing tip, thanks a lot! By the way, the ls -l output also contains other stuff like permissions etc., so a rule for ExtractText to parse only the zip filenames is also needed, but thanks anyway!

aanghel · 10-12-2017

Glad it worked, but I didn't say to use ls -l (which outputs the additional stuff) but ls -1 (with the digit 1), which outputs only the filenames, one per line.
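The difference between the two flags is easy to check on a scratch directory. A small sketch, assuming a temporary directory and a made-up file name:

```shell
#!/bin/sh

# Scratch directory with a single archive-like file (name is illustrative).
demo_dir=$(mktemp -d)
touch "$demo_dir/file1_timestamp.zip"

# -1 (digit one): names only, one per line.
names_only=$(ls -1 "$demo_dir")

# -l (letter L): long format with permissions, owner, size, date, name.
long_format=$(ls -l "$demo_dir")

printf 'ls -1:\n%s\n\n' "$names_only"
printf 'ls -l:\n%s\n' "$long_format"

rm -rf "$demo_dir"
```

With -1 the output is exactly the filename, so the simple (.*) rule in ExtractText is enough.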

aanghel · 10-12-2017

@balalaika ^^

Cloudera Community

Nifi flow: Bash script execution, regex on stdout for FetchFile to parse correct filenames.