NiFi for batch ingest

Super Collaborator

Hi All,

What would be the side effects of doing batch ingestion through NiFi? Let's say large file copies: if we ingest them through NiFi, how would it behave? The same question applies to large DB copies.

Thanks,

Avijeet

8 REPLIES


Hi @Avijeet Dash. Generally speaking, NiFi will handle that absolutely fine; I've seen it used to move very large video files with no issue. You'll need to ensure that the nodes have sufficient disk, CPU, and memory to support the file sizes you're interested in, but otherwise there are no major concerns!

Hope that helps!

Super Mentor

@Avijeet Dash

Is the intent to manipulate these large files in any way once they have been ingested into NiFi? NiFi has no problem ingesting files of any type or size, provided sufficient space exists in the content repository to store that data.

For performance, NiFi only passes FlowFile references between processors within the NiFi dataflow. Even if you "clone" a large file down two or more dataflow paths, this only results in an additional FlowFile reference to the same content in the content repository. All FlowFile references to the same content must be resolved before the actual content is removed from the repository.

That being said, NiFi provides a multitude of processors for manipulating the content of FlowFiles. Any time you modify the content of a FlowFile, a new FlowFile is created along with the new content. This is important because, following that new content creation, you still have the original as well as the new version of the content in your content repository. So if the content is going to be manipulated, you must plan accordingly to make sure you have sufficient repository storage.
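For example (hypothetical numbers, not from the original reply): running a 10 GB FlowFile through a content-modifying processor such as CompressContent briefly leaves both the 10 GB original and the new compressed copy in the content repository until the original claim is released, so peak repository usage is higher than either version alone.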

JVM memory comes into the mix most noticeably when splitting large content into many smaller pieces. If you plan on producing more than, say, 10,000 individual FlowFiles from a single large FlowFile, you will likely need to allocate additional JVM memory to your NiFi.
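One common way around that heap pressure (a general community pattern, assuming line-oriented text and the standard SplitText processor; adapt to your data) is to split in two stages rather than one:

GetFile -> SplitText (Line Split Count = 10000) -> SplitText (Line Split Count = 1) -> ...

The first split produces a manageable number of 10,000-line FlowFiles, and the second splits each of those down to single lines, so no single split operation has to hold hundreds of thousands of FlowFile objects in memory at once.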

As you can see, a lot more needs to be considered beyond just the size of the data being ingested when planning out your NiFi needs.

Hope this helps,

Thanks,

Matt

Super Collaborator

Thanks @Matt Clarke - so if I pick one large file and write it to HDFS, it will create one FlowFile in the content repository and keep the content there for some duration until it is cleaned up? And the same applies to one large SQL read written into NiFi?

Super Mentor

@Avijeet Dash Once a file is ingested into NiFi and becomes a FlowFile, its content will remain in NiFi's content repository until all active FlowFiles in your dataflow(s) that point at that content claim have been satisfied. By satisfied, I mean they have reached a point in your dataflow(s) where those FlowFiles have been auto-terminated. If content archiving is enabled in your NiFi, the content will be moved to an archive directory once no active FlowFiles point at it any longer. The length of time it is retained in the archive directory is determined by the archive configuration properties in the nifi.properties file. The defaults are archiving enabled, with retention set to 12 hours or 50% disk utilization, whichever is reached first.
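For reference, these are the relevant properties in nifi.properties, shown with the default values mentioned above (double-check them against your own NiFi version, as defaults have shifted between releases):

nifi.content.repository.archive.enabled=true
nifi.content.repository.archive.max.retention.period=12 hours
nifi.content.repository.archive.max.usage.percentage=50%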

Thanks,

Matt

Master Guru

Define large file? Do you mean megs? gigs? terabytes? petabytes?

For 100+ megs, NiFi blasts through that. Gigs are fine too. If terabytes are streaming in, that's fine too.

Once you get huge, you will need many cores, a lot of RAM, and many nodes; it will take horsepower similar to what was required to do the same processing on Hadoop nodes. A few dozen nodes with 256 GB of RAM and 32 cores each will handle almost anything.

https://nifi.apache.org/docs.html

You will be limited by disk speed (~50 MB/s on a typical spinning disk) and by the network. Network saturation could be a worry first.

Get 10 Gb/s or faster networks, and SSDs or faster drives.
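As a rough back-of-the-envelope check (my numbers, not the poster's): at ~50 MB/s of sustained disk throughput, a single 1 TB file takes roughly 1,000,000 MB / 50 MB/s = 20,000 seconds, or about 5.5 hours, just to write into the content repository once, while a 10 Gb/s network (~1.25 GB/s) could deliver that same file in under 15 minutes. Whichever of the two is slower on your hardware, multiplied by the number of concurrent transfers per node, sets your real ceiling.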


Thoughts on "best practices" for ingesting large files into NiFi? No transformations - this is an archiving use case.

An external application would need to send us the file, and get an OID in return for future retrieval. Would this flow work? Other ideas?

1. App sends message on special RabbitMQ queue with name and location of file to archive. Could be SFTP, shared disk, etc.

2. NiFi gets file, stores in Azure BLOB Storage.

3. NiFi returns OID to calling application via RabbitMQ.

Appreciate your thoughts.

Master Guru

Sure, lots of people do that one; it takes something like 4 processors. You can cluster it for lots of files. Make sure you have fast disks and a fast network, as those are your bottlenecks.
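One possible chain, assuming the RabbitMQ-plus-Azure flow described above (these are standard NiFi processor names; adapt the details to your setup):

ConsumeAMQP (read the "file ready" message from RabbitMQ) -> FetchSFTP or FetchFile (pull the file from the stated location) -> PutAzureBlobStorage (write it to Azure Blob Storage) -> PublishAMQP (return the OID / blob name on the reply queue)

with an EvaluateJsonPath or ExtractText step in between if the queue message needs parsing to extract the file path.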


Thanks. Any other common patterns for this use case?