
How does NiFi handle huge 1 TB files?

Contributor

Hi

I'm looking into a requirement to transfer 1 TB files using chunking in NiFi.

Each file also has 20 items of metadata associated with it that must remain intact, so both the metadata and the data need to survive breaking the file into 1000 chunks (chunking) and re-assembling it at the destination (de-chunking). Also, is the metadata for the large file duplicated onto each of the 1000 chunks, or does each chunk carry only a subset of the metadata?

Someone mentioned that NiFi passes the file chunk data through JVM memory on its way to the content repository.

Can I confirm whether file chunks pass through JVM memory as they are written to the content repository, for a large file (or any file, for that matter)? I was fairly sure they don't, because otherwise the JVM heap size (limited by machine RAM) would limit how much large-file data could be read in, and that would limit large-file transfer speed. Is that correct?

I'm trying to confirm my understanding of how NiFi handles these large files.

Any help appreciated.

 


Master Collaborator

Hello @zzzz77

Glad to have you on the community. 

What you are asking can be done with this kind of flow:
GetFile → SplitContent → Transfer → MergeContent → PutFile

The SplitContent will split the file, and the attributes will get duplicated onto every chunk, because they are saved on the FlowFile, not in the content.
Additional fragment attributes will be added to track the split.

The MergeContent will rebuild the content and restore the original attributes properly, so the metadata will not be lost.
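
As a rough illustration of why the metadata survives, here is a minimal Python sketch of the fragment bookkeeping. It is not NiFi code; it just mimics what SplitContent and MergeContent (Defragment strategy) do by carrying a full copy of the attributes plus fragment.* attributes on each chunk:

```python
# Illustrative sketch only -- NOT NiFi code. Mimics the attribute bookkeeping
# used by SplitContent / MergeContent (Defragment) so metadata survives chunking.
import uuid

def split(content: bytes, attributes: dict, chunk_size: int):
    """Each chunk gets a full copy of the original attributes plus fragment.* attributes."""
    chunks = [content[i:i + chunk_size] for i in range(0, len(content), chunk_size)]
    fragment_id = str(uuid.uuid4())
    flowfiles = []
    for index, chunk in enumerate(chunks):
        attrs = dict(attributes)                    # full copy of the original metadata
        attrs["fragment.identifier"] = fragment_id  # ties the chunks together
        attrs["fragment.index"] = str(index)        # position for reassembly
        attrs["fragment.count"] = str(len(chunks))  # total number of chunks
        flowfiles.append((attrs, chunk))
    return flowfiles

def merge(flowfiles):
    """Reassemble in fragment.index order and keep the original attributes."""
    ordered = sorted(flowfiles, key=lambda ff: int(ff[0]["fragment.index"]))
    content = b"".join(chunk for _, chunk in ordered)
    attrs = {k: v for k, v in ordered[0][0].items() if not k.startswith("fragment.")}
    return attrs, content

# 20 metadata items survive a split into 4 chunks and a merge.
original_attrs = {f"meta.{i}": f"value-{i}" for i in range(20)}
restored_attrs, restored = merge(split(b"x" * 1000, original_attrs, 250))
assert restored == b"x" * 1000 and restored_attrs == original_attrs
```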


Regards,
Andrés Fallas
--
Was your question answered? Please take some time to click on "Accept as Solution" below this post.
If you find a reply useful, say thanks by clicking on the thumbs-up button.

Master Mentor

@zzzz77 

FlowFile metadata/attributes are held in NiFi heap memory. For queued FlowFiles, there is a configurable swap threshold in nifi.properties that will swap batches of 10,000 FlowFiles' worth of metadata/attributes to disk when the threshold is met. This swapping is there to minimize excessive heap usage when queues grow large. The NiFi content is not held in heap memory; however, some processors may need to read the content into heap memory to perform their function. You will notice, if you look at an individual component's documentation, that a "System Resource Considerations" section exists. If heap memory usage is a concern for that processor, it will be documented there.
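
For reference, this swap threshold is configured in nifi.properties. The entry looks roughly like the following; the exact default depends on your NiFi release, so treat the value as illustrative and check your version's Admin Guide:

```
# nifi.properties (illustrative value -- confirm the default for your release)
nifi.queue.swap.threshold=20000
```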

SplitContent processor docs example:

[Screenshot: "System Resource Considerations" section from the SplitContent processor documentation]

Processors like SplitContent will hold all the FlowFile metadata/attributes (not content) for every split FlowFile being produced in heap memory until all the output FlowFiles have been produced and committed to the downstream connection. These FlowFiles cannot be swapped to disk until they are committed to the downstream connection. So if a SplitContent were to produce 50,000 split FlowFiles, the attributes for all 50,000 would be held in heap. After they are committed to the downstream connection, 40,000 of those would get swapped to disk based on the default swap threshold. So the heap impact would spike but not persist.
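
To get a rough feel for the size of that spike, here is a back-of-envelope estimate. The per-attribute size used below is an assumption for illustration, not a measured NiFi figure:

```python
# Rough, illustrative heap estimate for attributes held during a split.
# bytes_per_attr is an assumed average key+value size, not a NiFi-measured value.
splits = 50_000            # FlowFiles produced by one SplitContent execution
attrs_per_flowfile = 20    # the 20 metadata items from the question
bytes_per_attr = 100       # assumed average size of one attribute entry in heap
estimate = splits * attrs_per_flowfile * bytes_per_attr
print(f"~{estimate / 1024**2:.0f} MB of attribute data held in heap")  # ~95 MB
```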

Since you have not shared the specifics of the dataflow in question (which processors you are using), I can't provide any specific feedback. Where are the chunking and de-chunking happening? It sounds like this may be happening at the source and at the destination, with NiFi just moving the chunks from source to destination. How are you sending the chunks to NiFi and transferring them to the destination?

 

Please help our community grow. If any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to log in and click "Accept as Solution" on one or more of them.

Thank you,
Matt