How to define a NIFI processor that will unzip a file that contains files in a directory tree

Rising Star

I've used the GetHTTP processor to get a zip file from the internet, and I then use PutFile to write it to the file system. I now need to unzip the file and preserve the directory structure that the zip file specifies. Can I do this unzip with a NiFi processor? Once unzipped, I will need to do additional NiFi processing on specific files from the original zip file. I tried UnpackContent, but its output was a set of FlowFiles that lost the directory structure.

Would I need a custom script for this (e.g. using the ExecuteScript processor)? Or perhaps I should integrate Storm with NiFi to handle the unzip? That seems overly complex, and I don't even know whether it's a proper task for a Storm process.

Please advise. I'd think a simple unzip action would be, well, simple.

1 ACCEPTED SOLUTION

Super Mentor
@David Sargrad

The link example you provided in your comment is trying to deal with a zip that contains zipped files (a zip of zips).

If you are talking about a single zip that contains a directory tree with subfiles, this is relatively easy to do.

After ingesting your zip file via GetHTTP, feed it to an "UnpackContent" processor and then to a "PutFile" processor.

[Screenshot of the flow: 92820-screen-shot-2018-10-12-at-84631-am.png]

When the "UnpackContent" processor unzips the source file, it creates a new FlowFile for each unique file found. A variety of FlowFile attributes are set on each of those generated FlowFiles, including the "path" attribute.

[Screenshot of the attributes of one unpacked FlowFile: 92818-screen-shot-2018-10-12-at-83428-am.png]

In the above example I created a directory named "zip-root" and created 4 sub-directories within it. I then created one file in each of those sub-directories and zipped up the zip-root directory into zip-root.zip (zip -r zip-root.zip zip-root). The above screenshot shows just one of the unpacked files.
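For anyone who wants to reproduce a similar test archive without the zip command, here is a minimal Python sketch. The sub-directory and file names (dir1 to dir4, file.txt) are made up for illustration, since the post does not give them, and unlike zip -r this version stores only file entries, not separate directory entries.

```python
import tempfile
import zipfile
from pathlib import Path


def build_sample_zip(workdir: Path) -> Path:
    """Create zip-root/ with 4 sub-directories, one file each, and zip it."""
    root = workdir / "zip-root"
    for i in range(1, 5):
        subdir = root / f"dir{i}"  # hypothetical sub-directory name
        subdir.mkdir(parents=True)
        (subdir / "file.txt").write_text(f"contents of file {i}\n")

    archive = workdir / "zip-root.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store names relative to workdir so entries start with "zip-root/".
                zf.write(path, path.relative_to(workdir).as_posix())
    return archive


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive = build_sample_zip(Path(tmp))
        with zipfile.ZipFile(archive) as zf:
            for name in zf.namelist():
                print(name)
```

Each entry name keeps its full relative directory, which is the information the unpack step later exposes as attributes.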

After "UnpackContent" executed, it produced 4 new FlowFiles (one for each file found in those sub-directories within the zip).
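To make the role of the attributes concrete, here is a rough Python sketch of what the unpack step does conceptually: one record per file entry, each carrying a filename and a path. This is not NiFi's implementation, and the exact format of the path value (for example, whether it ends in a slash) is an assumption here, so check the attributes of a real FlowFile in your own flow.

```python
import io
import zipfile
from pathlib import PurePosixPath


def unpack(zip_bytes: bytes) -> list[dict]:
    """Return one FlowFile-like record per file entry in the archive."""
    records = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries carry no content of their own
            entry = PurePosixPath(info.filename)
            parent = entry.parent.as_posix()
            records.append({
                "attributes": {
                    "filename": entry.name,
                    # Assumed convention: directory part with a trailing slash.
                    "path": "" if parent == "." else parent + "/",
                },
                "content": zf.read(info),
            })
    return records


if __name__ == "__main__":
    # Build a small in-memory archive (hypothetical names) and unpack it.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for i in range(1, 5):
            zf.writestr(f"zip-root/dir{i}/file.txt", f"file {i}\n")
    for record in unpack(buf.getvalue()):
        print(record["attributes"])
```

The directory tree is not lost by unpacking; it just moves out of the archive layout and into a per-file attribute.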

The "path" FlowFile attribute on each of these generated FlowFiles can be used to maintain the original directory structure when writing out the FlowFiles via "PutFile" as follows:

[Screenshot of the PutFile configuration: 92819-screen-shot-2018-10-12-at-84346-am.png]

You can see from the above configuration that as each FlowFile is processed by the PutFile processor, it will be placed in a directory based on the value assigned to the "path" attribute set on each incoming FlowFile. Here I decided that my target base directory should be /tmp/target/, and the original directory structure from the zip is then recreated beneath it.
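The screenshot is not reproduced here, but from the description the PutFile "Directory" property was presumably set to something like /tmp/target/${path}, using NiFi Expression Language to read the attribute. The Python sketch below imitates that behaviour outside NiFi, purely as an illustration: put_file is a hypothetical helper, not a NiFi API, and it joins a base directory with the path and filename attributes, creating missing directories along the way.

```python
import tempfile
from pathlib import Path


def put_file(base_dir: Path, attributes: dict, content: bytes) -> Path:
    """Write content to base_dir/<path>/<filename>, creating directories."""
    # Strip leading slashes so an absolute-looking path attribute is still
    # placed under base_dir. ("../" segments are not handled in this sketch.)
    relative_dir = attributes.get("path", "").lstrip("/")
    target_dir = base_dir / relative_dir
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / attributes["filename"]
    target.write_bytes(content)
    return target


if __name__ == "__main__":
    # Hypothetical FlowFile attributes, as UnpackContent might set them.
    flowfiles = [
        ({"path": f"zip-root/dir{i}/", "filename": "file.txt"}, b"data\n")
        for i in range(1, 5)
    ]
    with tempfile.TemporaryDirectory() as base:  # stands in for /tmp/target/
        for attributes, content in flowfiles:
            written = put_file(Path(base), attributes, content)
            print(written.relative_to(base))
```

In NiFi itself no script is needed for this; the sketch only shows why combining a fixed base directory with the per-file path attribute reproduces the tree.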

Thank you,

Matt

If you found this answer addressed your question, please take a moment to log in and click the "ACCEPT" link.

3 REPLIES

Rising Star

Come on NiFi gurus, properly unzipping (without losing the zipped directory structure) should be a simple and easy thing to do in NiFi.

I can't imagine that it's as complex as it seems to be here:

https://community.hortonworks.com/questions/191223/how-to-uncompress-a-zip-file-which-has-a-folder-i...

Please advise.

Rising Star

Thank you. I like your answer very much. I do think the referenced example was not about a zip of zips (just a simple zip of a directory tree), yet I think your answer is right. The "path" attribute does the job. I'll try this, and thanks.