What is the best way to decompress/extract different types of incoming files in Apache NiFi?

avatar
Contributor

Hi all,

thanks in advance!

My question is about Apache NiFi:

What's the best way to decompress/extract different types of incoming files?

In my use case I receive a lot of files in different compressed or archived formats (e.g. .tar.gz, .zip, .rar, .tar, or uncompressed .txt/.json), and I need all of them decompressed.

I tried running every file through every possible CompressContent/UnpackContent processor, but this does not work and is probably not the best approach performance-wise:

GetFile -> (...) -> CompressContent (uncompressing gzip) -> UnpackContent (extracting .tar) -> UnpackContent (extracting .zip) -> (...) -> PutFile

For example: a "*.json" file should pass through those processors unchanged, while a ".tar.gz" file should first be decompressed (its name changes to ".tar") and then extracted by an UnpackContent processor, so that I end up with uncompressed files in every case.

I hope there is a good solution, thanks once again.

best regards

6 REPLIES

avatar

Hi @Salda Murrah

You can set the "Compression Format" property to "use mime.type attribute". This way, the processor looks for an attribute called mime.type and dynamically infers the format, and hence the decompression algorithm, from it.

For this to work, you need to use an UpdateAttribute processor to add a mime.type attribute and set its value according to your logic. Keep in mind that UpdateAttribute has rules logic in its advanced configuration that can be useful for your use case: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-update-attribute-nar/1.4.0/or...

avatar

@Salda Murrah I forgot to mention IdentifyMimeType, which can be used to automatically identify the type of your file: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.4.0/org.apache...
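To illustrate the idea of identifying a file by its content rather than its name, here is a minimal Python sketch that checks well-known magic bytes for gzip, zip, and tar. It is not how IdentifyMimeType is implemented; the helper name `sniff_mime_type` and the returned MIME strings are assumptions chosen for illustration only.

```python
import gzip
import io
import tarfile
import zipfile


def sniff_mime_type(data: bytes) -> str:
    """Guess a MIME type from magic bytes (hypothetical helper, not NiFi code)."""
    if data[:2] == b"\x1f\x8b":        # gzip magic number
        return "application/gzip"
    if data[:4] == b"PK\x03\x04":      # zip local file header
        return "application/zip"
    if data[257:262] == b"ustar":      # tar magic at offset 257
        return "application/x-tar"
    return "application/octet-stream"  # unknown content: leave it untouched


def sample_tar() -> bytes:
    """Build a one-file tar archive in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        payload = b"hello"
        info = tarfile.TarInfo(name="a.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def sample_zip() -> bytes:
    """Build a one-file zip archive in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("a.txt", "hello")
    return buffer.getvalue()


samples = {
    "gzip": gzip.compress(b"hello"),
    "tar": sample_tar(),
    "zip": sample_zip(),
    "json": b'{"a": 1}',
}
detected = {name: sniff_mime_type(data) for name, data in samples.items()}
```

Content-based detection like this does not depend on the file extension, so a renamed or extension-less archive would still be routed to the right decompression step.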

avatar
Contributor

I managed to solve it by using IdentifyMimeType processors.

42517-bildschirmfoto-vom-2017-11-09-091647.png

But there is still a setting that causes my hard drive to fill up completely.

I guess UnpackContent keeps receiving new files to extract, but I need to keep the source files when they are picked up by the GetFile processor. Do you have any idea what I need to change?

avatar
Contributor

Thanks so far.

But do I still need both processors (UnpackContent and CompressContent) for my use case?

I am not sure how this should work: if I add a mime.type attribute, will the UnpackContent processor do what I want?

I tried setting the mime.type attribute to application/${filename:substringAfterLast('.')} and it extracted .zip and .tar files successfully, but the compressed .gz files were still left compressed.

It looks like this:

42502-unpackcontent.png
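To see what values the expression application/${filename:substringAfterLast('.')} produces, here is a small Python sketch that mimics it. The function names are hypothetical, and returning the whole subject when no delimiter is present is my assumption about how the expression behaves.

```python
def substring_after_last(subject: str, delimiter: str) -> str:
    """Return the text after the last delimiter, or the whole subject if absent.

    The no-delimiter behaviour is an assumption made for this sketch.
    """
    index = subject.rfind(delimiter)
    if index == -1:
        return subject
    return subject[index + len(delimiter):]


def derived_mime_type(filename: str) -> str:
    """Mimic application/${filename:substringAfterLast('.')}."""
    return "application/" + substring_after_last(filename, ".")


for name in ("test.zip", "test.tar", "test.tar.gz", "data.json"):
    print(name, "->", derived_mime_type(name))
```

For test.tar.gz this yields application/gz, while the registered MIME type for gzip is application/gzip. That mismatch may be why the .gz files were not decompressed, although this is not confirmed in the thread.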

avatar

I am not sure I understand your use case. Why do you use UnpackContent after CompressContent? CompressContent can decompress your .gz file with the decompress option.

avatar
Contributor

But I didn't manage to extract .tar or .zip files with CompressContent.

I have many different files (.tar.gz, .tar, .zip ...) which should all be decompressed/extracted at the end.

My idea was to first identify all .gz files and decompress them (see the first two processors in my screenshot), and then extract all remaining archives (.tar and .zip), which happens in the following two processors.

For example: receiving a 'test.tar.gz', decompressing it to 'test.tar', and then extracting it to 'test'.
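The two-step idea described above (first decompress gzip, then unpack tar) can be sketched outside NiFi with Python's standard library. This only illustrates why both steps are needed for a .tar.gz file; the function names and sample file names are made up for the example.

```python
import gzip
import io
import tarfile


def build_tar_gz(files: dict) -> bytes:
    """Create an in-memory .tar.gz so the example needs no files on disk."""
    tar_buffer = io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return gzip.compress(tar_buffer.getvalue())


def decompress_gzip(data: bytes) -> bytes:
    """Step 1: remove the gzip layer (test.tar.gz -> test.tar)."""
    return gzip.decompress(data)


def unpack_tar(data: bytes) -> dict:
    """Step 2: extract the regular files from an uncompressed tar archive."""
    extracted = {}
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tar:
        for member in tar.getmembers():
            if member.isfile():
                extracted[member.name] = tar.extractfile(member).read()
    return extracted


archive = build_tar_gz({"test/a.txt": b"alpha", "test/b.json": b"{}"})
tar_bytes = decompress_gzip(archive)   # still one archive, now uncompressed
result = unpack_tar(tar_bytes)         # the individual files
```

Decompression and unpacking are separate operations here, which matches the split between a decompression step and an unpacking step in the flow.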