What is the best way to decompress/extract different types of incoming files in Apache Nifi?

Explorer

Hi all,

thanks in advance!

My issue is regarding Apache Nifi:

What's the best way to decompress/extract different types of incoming files?

In my use case I receive a lot of files that are compressed or packaged in different ways (e.g. .tar.gz, .zip, .rar, .tar, or uncompressed .txt/.json), but I need all of them decompressed.

What I tried is to run every file through every possible CompressContent/UnpackContent processor, but it does not actually work and is probably not the best approach performance-wise:

GetFile -> (...) -> CompressContent (uncompressing gzip) -> UnpackContent (extracting .tar) -> UnpackContent (extracting .zip) -> (...) -> PutFile

For example: a "*.json" file should pass through those processors unchanged, while a "tar.gz" file should be decompressed (its name changes to ".tar") and then extracted by an UnpackContent processor, so that I end up with uncompressed files.

I hope there is a good solution, thanks once again.

best regards


Re: What is the best way to decompress/extract different types of incoming files in Apache Nifi?

Hi @Salda Murrah

You can set the "Compression Format" property to "use mime.type attribute". This way, the processor looks for an attribute called mime.type and dynamically infers the format, and hence the decompression algorithm, from it.

For this to work, you need an UpdateAttribute processor that adds a mime.type attribute and sets its value according to your logic. Keep in mind that UpdateAttribute has rules logic in its advanced configuration that can be useful for your use case: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-update-attribute-nar/1.4.0/or...

Re: What is the best way to decompress/extract different types of incoming files in Apache Nifi?

@Salda Murrah I forgot to mention IdentifyMimeType, which can be used to automatically identify the type of your file: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.4.0/org.apache...
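To illustrate what content-based detection means (as opposed to trusting the file extension), here is a minimal Python sketch that guesses the type from the leading bytes. It is not how IdentifyMimeType is implemented; the function name `sniff_type` is hypothetical and the returned MIME strings are assumptions chosen for the example.

```python
def sniff_type(data: bytes) -> str:
    """Guess a compression/archive type from well-known magic bytes."""
    if data[:2] == b"\x1f\x8b":        # gzip magic number
        return "application/gzip"
    if data[:4] == b"PK\x03\x04":      # zip local file header
        return "application/zip"
    if data[257:262] == b"ustar":      # tar magic field at offset 257
        return "application/x-tar"
    # Anything else (.txt, .json, ...) is treated as opaque content here.
    return "application/octet-stream"
```

The point is only that a .tar.gz, a .zip and a .json can be told apart without looking at the filename at all.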

Re: What is the best way to decompress/extract different types of incoming files in Apache Nifi?

Explorer

I managed to solve it by using IdentifyMimeType processors.

[Screenshot: 42517-bildschirmfoto-vom-2017-11-09-091647.png]

But some setting still causes my hard drive to fill up completely.

I guess UnpackContent keeps receiving new files to extract, but I need to keep the source files when the GetFile processor picks them up. Do you have any idea what I need to change?

Re: What is the best way to decompress/extract different types of incoming files in Apache Nifi?

Explorer

Thanks so far.

But do I still need both processors (UnpackContent and CompressContent) for my use case?

I am not sure how this is supposed to work: if I add a mime.type attribute, will the UnpackContent processor do what I want?

I tried setting the mime.type attribute to application/${filename:substringAfterLast('.')} and it extracted .zip and .tar successfully, but the .gz files were still compressed.

It looks like this:

[Screenshot: 42502-unpackcontent.png]
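As a rough illustration of what the extension-based expression in this post produces, here is a small Python analogue. The helper names are hypothetical, and the behaviour when no dot is present (returning the whole name) is an assumption about the Expression Language function, not something stated in this thread.

```python
def substring_after_last(value: str, sep: str) -> str:
    """Rough analogue of substringAfterLast: text after the last separator."""
    _, found, tail = value.rpartition(sep)
    # Assumption: when the separator is absent, the whole value is returned.
    return tail if found else value


def mime_from_filename(filename: str) -> str:
    """Build the attribute value the way the expression in the post does."""
    return "application/" + substring_after_last(filename, ".")


print(mime_from_filename("test.zip"))     # application/zip
print(mime_from_filename("test.tar"))     # application/tar
print(mime_from_filename("test.tar.gz"))  # application/gz
```

Note that a .tar.gz name yields "application/gz". As far as I know that is not one of the values the processors map to gzip (something like application/gzip would be expected, but please check the documentation for your version), which may be why the .gz files stayed compressed.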

Re: What is the best way to decompress/extract different types of incoming files in Apache Nifi?

I am not sure I understand your use case. Why do you use UnpackContent after CompressContent? CompressContent can decompress your .gz files with the decompress option.

Re: What is the best way to decompress/extract different types of incoming files in Apache Nifi?

Explorer

But I didn't manage to extract .tar or .zip files with CompressContent.

I have many different files (.tar.gz, .tar, .zip ...) which should all be decompressed/extracted at the end.

My idea was to first identify all .gz files and decompress them (see the first two processors in my screenshot), and then extract all the other files (.tar and .zip), which happens in the following two processors.

For example: receiving a 'test.tar.gz', decompressing it to 'test.tar', and then extracting it to 'test'.
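The two-stage idea described here (remove the gzip layer first, then unpack the archive, and let everything else pass through) can be sketched outside NiFi with the Python standard library. This is only an illustration of the intended logic, not a NiFi configuration; the function name `decompress_then_unpack` is hypothetical, and .rar is left out because the standard library cannot read it.

```python
import gzip
import io
import tarfile
import zipfile


def decompress_then_unpack(name: str, data: bytes) -> dict:
    """Return {entry name: content} after removing gzip and tar/zip layers."""
    # Stage 1: gzip layer (the role CompressContent plays in the flow).
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
        if name.endswith(".gz"):
            name = name[:-3]          # test.tar.gz -> test.tar

    # Stage 2: archive layer (the role UnpackContent plays in the flow).
    buf = io.BytesIO(data)
    if zipfile.is_zipfile(buf):
        with zipfile.ZipFile(buf) as zf:
            return {i.filename: zf.read(i) for i in zf.infolist() if not i.is_dir()}
    buf.seek(0)
    try:
        with tarfile.open(fileobj=buf, mode="r:") as tf:
            return {m.name: tf.extractfile(m).read() for m in tf if m.isfile()}
    except tarfile.ReadError:
        pass                          # not a tar archive

    # Stage 3: plain files (.txt, .json) pass through untouched.
    return {name: data}
```

With this ordering a 'test.tar.gz' goes through both stages, a 'test.zip' only through the second, and a 'test.json' through neither, which is the behaviour asked for in the original question.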