What is the best way to decompress/extract different types of incoming files in Apache NiFi?
- Labels: Apache NiFi
Created ‎11-07-2017 10:11 PM
Hi all,
thanks in advance!
My question is about Apache NiFi:
What is the best way to decompress/extract different types of incoming files?
In my use case I receive a lot of files that are compressed in different ways (e.g. .tar.gz, .zip, .rar, .tar, or uncompressed .txt/.json), but I need all of them decompressed.
What I tried is to run every file through every possible CompressContent/UnpackContent processor, but it does not actually work, and it is probably not the best approach performance-wise:
GetFile -> (...) -> CompressContent (decompressing gzip) -> UnpackContent (extracting .tar) -> UnpackContent (extracting .zip) -> (...) -> PutFile
For example: a "*.json" file should pass through those processors unchanged, while a ".tar.gz" file should be decompressed (its name changes to ".tar") and then extracted by an UnpackContent processor, so that I end up with uncompressed files.
I hope there is a good solution. Thanks once again.
Best regards
Created ‎11-08-2017 08:15 AM
You can set the "Compression Format" property to "use mime.type attribute". This way, the processor looks for an attribute called mime.type and infers the format, and hence the decompression algorithm, from it.
For this to work, you need to use an UpdateAttribute processor to add a mime.type attribute and set its value according to your logic. Keep in mind that UpdateAttribute supports rules in its advanced configuration, which can be useful for your use case: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-update-attribute-nar/1.4.0/or...
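To illustrate the kind of logic such a rule might encode, here is a minimal Python sketch (not NiFi configuration) that maps a file name to a MIME type by its outermost extension. The function name, the lookup table, and the default value are hypothetical, and the MIME strings are assumptions based on commonly used type names; which exact strings a given processor accepts should be checked against its documentation.

```python
import os

# Hypothetical lookup table. The MIME strings are assumptions, not a list
# taken from the NiFi documentation.
EXTENSION_TO_MIME = {
    ".gz": "application/gzip",
    ".tgz": "application/gzip",
    ".tar": "application/x-tar",
    ".zip": "application/zip",
}


def guess_mime_type(filename, default="application/octet-stream"):
    """Return a MIME type based on the outermost extension of filename."""
    # splitext only looks at the last extension, so "test.tar.gz" yields ".gz".
    extension = os.path.splitext(filename.lower())[1]
    return EXTENSION_TO_MIME.get(extension, default)
```

Note that only the outermost extension is considered, so a file such as test.tar.gz is treated as gzip first; the inner .tar only becomes visible after decompression. An explicit table also avoids producing non-standard names: simply prefixing the extension with "application/" would turn ".gz" into "application/gz", whereas the registered gzip type is "application/gzip".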
Created ‎11-08-2017 01:54 PM
@Salda Murrah I forgot to mention IdentifyMimeType, which can be used to automatically identify the type of your file: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.4.0/org.apache...
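As a rough picture of what content-based detection means, the Python sketch below looks at the leading "magic" bytes of a file instead of its name. It is only an illustration of the idea, not how IdentifyMimeType is implemented; the function name is hypothetical, only a handful of formats are covered, and the returned MIME strings are assumptions.

```python
def sniff_mime_type(data):
    """Guess a MIME type from the leading bytes of a file's content."""
    if data[:2] == b"\x1f\x8b":                      # gzip magic number
        return "application/gzip"
    if data[:4] in (b"PK\x03\x04", b"PK\x05\x06"):   # zip (regular or empty)
        return "application/zip"
    if data[:6] == b"Rar!\x1a\x07":                  # rar signature prefix
        return "application/x-rar-compressed"
    if data[257:262] == b"ustar":                    # tar header magic at offset 257
        return "application/x-tar"
    # Anything else (e.g. .txt or .json) is treated as not compressed.
    return "application/octet-stream"
```

Detecting by content has the advantage that a wrongly named file (for example a gzip file without a .gz extension) is still recognized.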
Created on ‎11-09-2017 08:32 AM - edited ‎08-18-2019 12:24 AM
I managed to solve it by using the IdentifyMimeType processors.
But there is still a setting that causes my hard drive to fill up completely.
I guess UnpackContent keeps receiving new files to extract, but I need to keep the source files when the GetFile processor picks them up. Do you have any idea what I need to change?
Created on ‎11-08-2017 01:00 PM - edited ‎08-18-2019 12:24 AM
Thanks so far.
But do I still need both processors (UnpackContent and CompressContent) for my use case?
I am not sure how this should work: if I add a mime.type attribute, will the UnpackContent processor do what I want?
I tried to set the mime.type attribute to application/${filename:substringAfterLast('.')} and it extracted .zip and .tar successfully, but I still got those compressed .gz files.
It looks like this:
Created ‎11-08-2017 01:51 PM
I am not sure I understand your use case. Why do you use UnpackContent after CompressContent? CompressContent can decompress your .gz file with the decompress option.
Created ‎11-09-2017 08:34 AM
But I did not manage to extract .tar or .zip files with CompressContent.
I have many different files (.tar.gz, .tar, .zip, ...) which should all be decompressed/extracted in the end.
My idea was to first identify all .gz files and decompress them (see the first two processors in my screenshot), and then extract all the other files (.tar and .zip), which happens in the following two processors.
For example: receiving a 'test.tar.gz', decompressing it to 'test.tar', and then extracting it to 'test'.
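The two-stage idea described here (decompress first, then extract) can be sketched outside NiFi in plain Python. This is only an illustration of the control flow under stated assumptions, not a NiFi flow definition: the function name is hypothetical, formats are detected by magic bytes, .rar is not handled, nested archives are unpacked as well, and there is no protection against very large or malicious archives.

```python
import gzip
import io
import tarfile
import zipfile
from typing import Dict


def unpack_fully(name: str, data: bytes) -> Dict[str, bytes]:
    """Decompress/extract repeatedly until only plain files remain."""
    if data[:2] == b"\x1f\x8b":  # gzip: decompress, then look again
        inner_name = name[:-3] if name.endswith(".gz") else name
        return unpack_fully(inner_name, gzip.decompress(data))

    if data[:4] == b"PK\x03\x04":  # zip: extract every regular entry
        result: Dict[str, bytes] = {}
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    result.update(unpack_fully(info.filename, zf.read(info)))
        return result

    if data[257:262] == b"ustar":  # tar: extract every regular file
        result = {}
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tf:
            for member in tf.getmembers():
                if member.isfile():
                    content = tf.extractfile(member).read()
                    result.update(unpack_fully(member.name, content))
        return result

    # Not a recognized container: pass the file through unchanged.
    return {name: data}
```

With this structure a 'test.tar.gz' first becomes 'test.tar' and is then extracted into its member files, while a plain .json or .txt file passes through untouched, which mirrors the behaviour asked for in the original question.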
