Support Questions

Find answers, ask questions, and share your expertise

Can we read and compress folders using NiFi.?

avatar
Super Collaborator

Hi, we have a use case where we need to save the source files as it is.

for each day we will have a folder with 1000 files. we will process 1000 files and send required 100 files to HDFS.

but we also want to compress and save the whole folder for any future references.(we like to keep the source as it is at a folder level not at file level)

is there anyway i can do this using NiFi.?

Regards,

Sai

1 ACCEPTED SOLUTION

avatar

Hello @Saikrishna Tarapareddy

There's no Processor which compress a folder that I'm aware of. But you can do that by ExecuteProcess processor:

10136-zipfolderbyexecuteprocess.png

This is an example ExecuteProcess configuration to use zip command.

Command Arguments:

-r /tmp/source-files-${now():format('yyyyMMdd')}.zip /tmp/source-files/

NiFi expression can be used in command arguments, above expression sets the zip filename using current date.

In your use-case, I'd use this single processor scheduled with 'Cron Driven' Scheduling Strategy, in addition to the main flow which processes the files and send those to HDFS.

Hope this helps.

Thanks,

Koji

View solution in original post

3 REPLIES 3

avatar

Hello @Saikrishna Tarapareddy

There's no Processor which compress a folder that I'm aware of. But you can do that by ExecuteProcess processor:

10136-zipfolderbyexecuteprocess.png

This is an example ExecuteProcess configuration to use zip command.

Command Arguments:

-r /tmp/source-files-${now():format('yyyyMMdd')}.zip /tmp/source-files/

NiFi expression can be used in command arguments, above expression sets the zip filename using current date.

In your use-case, I'd use this single processor scheduled with 'Cron Driven' Scheduling Strategy, in addition to the main flow which processes the files and send those to HDFS.

Hope this helps.

Thanks,

Koji

avatar
Super Collaborator

@kkawamura,

Thank you. i will try that. so with above example it zips all the files under source-files into one zip file.?

also is there anyway that i can process this at subfolder level.?

in my case , under a folder "January" i will have 31 subfolders 01012016 to 01312016 (a folder for each day with 8000 files in each folder). if i point the above command at folder "January" it will try to zip all 31 subfolders into one which may come nearly 80G and it may be difficult for me to transfer to HDFS thru site2site.

i was looking at 1 zip for each subfolder. so in this case i will endup with 31 zip files.

Thanks again,

Sai

avatar

@Saikrishna Tarapareddy

In that case, I'd use another command to list the sub folders first, then pass each sub folder to a zip command. NiFi flow looks like below.

List Sub Folders (ExecuteProcess): I used find command here.

find souce-files -type d -depth 1

10167-zipsubfolders.png

The command produces a flow file containing sub folders each line. So, split those outputs, then use ExtractText to extract sub folder path to an attribute 'targetDir'.

10168-extracttext.png

You need to add a dynamic property by clicking the plus sign, then name the property with an attribute name to extract the content to. The Value is a regular expression to extract desired text from the content.

Used ExecuteStreamCommand this time, to use incoming flow files.

- Command Path: zip

- Command Arguments: -r;${targetDir}-${now():format('yyyyMMdd')}.zip;${targetDir}

- Ignore STDIN: true

Then it will zip each sub folders.

Thanks,

Koji