Support Questions

MmSs · ‎09-08-2023

Hi, I am using UnpackContent to extract files from a zip and then load them to another location. I am seeing corruption in the files going to that other location, which I think is caused by a race condition on the extracted files. The zip files have a directory and file structure that looks something like this:

file1.zip -> data/file.txt

file2.zip -> data/file.txt

where the contents of file.txt in the two zip files are different but the files are named the same. We use UnpackContent with Packaging Format 'zip' and File Filter 'data' to only get files in the data directory.

This works fine when processing individual files, but now that we have scaled this up, there appears to be an issue where the file extracted from file1.zip gets overwritten by the file extracted from file2.zip. When we then copy the file over, its content is a corrupted mix of the two.

Looking at the attributes after the UnpackContent, I see absolute.path is something like <NiFi Location>/data/file.txt after extraction, so I think the files are being corrupted because both are extracted to the same path before they can be moved to the next location. Is there any way to change where UnpackContent puts these files so they don't clobber each other? Maybe something like <NiFi Location>/file1/data/file.txt?

SAMSAL · ‎09-08-2023

Hi @MmSs ,

Can you provide more details about your flow? For example, what do you do after the UnpackContent processor, and what processor/flow are you using to copy the file to the final location? If you are using PutFile after the UnpackContent, files with the same filename from different packages will overwrite each other. If at the same time you have another process/flow like GetFile/ListFile/FetchFile copying the unzipped files to the final target, then you are copying files that are constantly being overwritten (through UnpackContent->PutFile), and this of course will corrupt the files. The solution would be to save the file after the UnpackContent under a unique name or path to prevent the conflict. You can use some of the attributes written by the UnpackContent processor and by other upstream processors to help come up with a unique name for the unzipped files (segment.original.filename, path, filename, fragment.identifier, fragment.index, etc.).
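To illustrate the naming idea, here is a minimal Python sketch (not NiFi code). The `unique_name` helper and the attribute values are hypothetical and only mimic the kind of attributes listed above; in a real flow the same result would more likely come from an Expression Language value on UpdateAttribute or on the destination property.

```python
import posixpath


def unique_name(attrs):
    """Hypothetical helper: derive a per-archive name from FlowFile attributes."""
    # Assumed value: segment.original.filename holds the source archive name.
    archive = attrs["segment.original.filename"]
    # Drop the extension so "file1.zip" becomes the prefix "file1".
    stem = archive.rsplit(".", 1)[0]
    # Assumed value: path holds the directory of the entry inside the archive.
    return posixpath.join(stem, attrs["path"], attrs["filename"])


# Made-up attribute sets for the two archives from the question.
from_zip1 = {
    "segment.original.filename": "file1.zip",
    "path": "data/",
    "filename": "file.txt",
}
from_zip2 = {
    "segment.original.filename": "file2.zip",
    "path": "data/",
    "filename": "file.txt",
}

name1 = unique_name(from_zip1)
name2 = unique_name(from_zip2)
print(name1)
print(name2)
```

With the archive name as a prefix, the two identically named entries end up under different paths, so neither write can replace the other.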

If you find this helpful, please accept the solution.

Thanks

MmSs · ‎09-08-2023

After the UnpackContent, we actually write them to S3 with PutS3Object. The S3 keys are generated to be unique from context, as you suggest, but the problem is that the underlying file content is corrupted if file2.zip is extracted while the file1.zip content is waiting to be written to S3.

MmSs · ‎09-11-2023

@SAMSAL just for my understanding, and as a potential workaround: when I call UnpackContent, are the items themselves stored in memory or on disk? I'm assuming that if they are in memory, I could call UpdateAttribute on absolute.path to make it unique immediately after UnpackContent and the rest of my flow would still work? Is there no physical file to move on disk after the unpack?

MattWho · ‎09-12-2023

@MmSs

NiFi is data agnostic. To NiFi, the content of a FlowFile is just bits. To remain data agnostic, NiFi uses what it calls a "FlowFile". A FlowFile consists of two parts: FlowFile attributes/metadata (persisted in the FlowFile repository and held in JVM heap memory) and FlowFile content (stored in content claims within the content repository). This way NiFi core does not need to care or know anything about the format of the data/content. It becomes the responsibility of an individual processor component that needs to read or manipulate the content to understand the bits of content. The NiFi FlowFile metadata simply records in which content claim the bits exist, at what offset within the claim the content starts, and the number of bits that follow. As far as directory paths go, these are just additional attributes on a FlowFile and have no bearing on NiFi's persistent storage of the FlowFile's content in the content repository.
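As a toy model of that description (a sketch only, not how NiFi is actually implemented), the following Python shows two FlowFiles with identical filename and path attributes pointing at different regions of one shared content claim. The class and function names are made up for illustration, and the sketch counts bytes for simplicity.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ContentClaim:
    """Hypothetical stand-in for one claim in the content repository."""
    data: bytearray = field(default_factory=bytearray)


@dataclass
class FlowFile:
    """Hypothetical FlowFile: attributes plus a pointer into a claim."""
    attributes: dict
    claim: ContentClaim
    offset: int
    length: int


def write_flowfile(claim, content, attributes):
    # Content is appended to the claim; nothing is written to attributes["path"].
    offset = len(claim.data)
    claim.data.extend(content)
    attrs = dict(attributes, uuid=str(uuid.uuid4()))
    return FlowFile(attrs, claim, offset, len(content))


def read_content(flowfile):
    # Content is located by claim, offset and length, never by filename.
    start = flowfile.offset
    return bytes(flowfile.claim.data[start:start + flowfile.length])


claim = ContentClaim()
same_attrs = {"filename": "file.txt", "path": "data/"}
ff1 = write_flowfile(claim, b"content from file1.zip", same_attrs)
ff2 = write_flowfile(claim, b"content from file2.zip", same_attrs)

print(read_content(ff1))
print(read_content(ff2))
```

In this model, changing the path or filename attribute only changes a dictionary entry; the stored content is untouched because lookup goes through the claim, offset and length.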

As far as UnpackContent goes, the processor will process zip1 and zip2 separately. Unpacked content from zip1 is written to a new FlowFile, and the same holds true for zip2. So if you stop the processor immediately after your UnpackContent processor and send your zip1 and zip2 FlowFiles through, you can list the queue on the outbound relationship to inspect them before further processing. You'll be able to view the content and the metadata for each output FlowFile. NiFi does not care if there are multiple FlowFiles with the same filename, as NiFi tracks each of them with a unique UUID. What you describe (zip1 content, already queued in the inbound connection to PutS3Object, being corrupted when zip2 is then extracted) is not possible. Run both zip1 and zip2 through your dataflow with PutS3Object stopped and inspect the FlowFiles as they sit queued before PutS3Object is started. Are the queued files on the same node in your NiFi cluster? Is your PutS3Object using "${filename}" as the object key? What happens if you use "${filename}-${uuid}" instead? My guess is that the issue is in your PutS3Object configuration, leading to corruption on write to S3.
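The "processed separately" point can be imitated outside NiFi with a short Python sketch. This is not the UnpackContent implementation: the `make_zip` and `unpack` helpers are hypothetical, the archives are built in memory with made-up contents, and the File Filter is approximated with a regular-expression search on the entry name, which is an assumption about how the real filter matches.

```python
import io
import re
import zipfile


def make_zip(entries):
    """Build a zip archive in memory from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()


def unpack(zip_bytes, file_filter=r"data"):
    """Return (entry name, content) pairs for entries matching the filter."""
    pattern = re.compile(file_filter)
    unpacked = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir() or not pattern.search(info.filename):
                continue
            # Each entry is read into its own bytes object; no shared path on disk.
            unpacked.append((info.filename, zf.read(info)))
    return unpacked


zip1 = make_zip({"data/file.txt": b"one", "other/x.txt": b"skip"})
zip2 = make_zip({"data/file.txt": b"two"})

out1 = unpack(zip1)
out2 = unpack(zip2)
print(out1)
print(out2)
```

Both results carry the same entry name, yet each holds the content of its own archive, which is the behaviour to look for when inspecting the queued FlowFiles.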

So your issue seems more likely to be a flow design issue than a processor or NiFi FlowFile handling issue. Sharing all the processors you are using in your dataflow and their configuration may help in pinpointing your design issue.

If any of the suggestions/solutions provided helped you with your issue, please take a moment to log in and click "Accept as Solution" on one or more of them.

Thank you,
Matt

SAMSAL · ‎09-11-2023

Looking at the UnpackContent processor code, it seems to be writing the items into memory.

https://github.com/kiranjilla/nifi-xom/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standa...

The absolute.path appears to be getting set there. I am not sure if the proposed solution will work. @MattWho, @cotopaul, @steven-matison, can you guys help with this?

Cloudera Community

UnpackContent overwriting data