Handling 10k files in Nifi

Hi,

I have a requirement to handle 10k files in NiFi in parallel. What's the best way to handle this scenario? For example, if I use the GetFile processor and point it at a loading directory, will it pick up the 10k files one by one, or will they be handled in parallel? Are there any properties or settings that would let all 10k files be loaded in parallel? It doesn't have to be GetFile; ListFile followed by FetchFile would also do.

Another question: does NiFi create a single JVM instance to handle all the FlowFiles being generated, or are multiple JVMs created for different processes?

1 ACCEPTED SOLUTION

@yeah thatguy

10K FlowFiles in NiFi is nothing in terms of load.

NiFi processors run on system threads. Each processor can be configured with multiple "concurrent tasks", which lets a single processor essentially run multiple times at the same time. However, I would never try to schedule one processor with 10,000 concurrent tasks (I don't know of any server that has 10,000 CPU cores).
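To illustrate why a handful of concurrent tasks is enough, here is a minimal Python sketch of the general idea (a bounded pool of threads working through 10,000 items in rapid succession). It is not NiFi code: the function name run_with_concurrent_tasks, the PeakCounter helper, and the pool size of 8 are all assumptions made up for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PeakCounter:
    """Tracks how many tasks are running at once and remembers the maximum."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)

    def __exit__(self, *exc):
        with self._lock:
            self.active -= 1


def run_with_concurrent_tasks(items, concurrent_tasks, work):
    """Apply work() to every item using at most concurrent_tasks threads."""
    counter = PeakCounter()

    def tracked(item):
        with counter:
            return work(item)

    # The pool size plays the role of the "concurrent tasks" setting here.
    with ThreadPoolExecutor(max_workers=concurrent_tasks) as pool:
        results = list(pool.map(tracked, items))
    return results, counter.peak


# 10,000 items are processed, but never more than 8 at the same instant.
results, peak = run_with_concurrent_tasks(range(10_000), 8, lambda n: n * 2)
```

All 10,000 items get processed, yet the number of threads running at any moment never exceeds the configured limit.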

Can you elaborate on your use case and why you must load all 10k files in parallel versus rapid succession?

Processors are designed in a variety of ways depending on their function. Some processors work on one FlowFile at a time, while others work on batches of FlowFiles.

GetFile has a configurable Batch Size, which controls the maximum number of files retrieved per processor execution. All files in a batch are committed as FlowFiles in NiFi at the same time upon ingestion. You could configure smaller batches and multiple concurrent tasks on this processor.
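As a rough picture of what batching means for 10k files, the Python sketch below only shows the grouping arithmetic. It says nothing about how GetFile actually selects files; the batches helper, the file names, and the batch size of 100 are assumptions chosen for illustration.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical file names standing in for the contents of the input directory.
file_names = [f"file_{n:05d}.dat" for n in range(10_000)]

# Each inner list stands for what one processor execution would pick up.
executions = list(batches(file_names, 100))
```

With these assumed numbers, the 10,000 files are spread over 100 executions of 100 files each, and those executions can be shared among several concurrent tasks.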

The ListFile processor retrieves a complete listing of all files in the target directory and then creates a single 0-byte FlowFile for each of them. The complete batch is committed to the success relationship at the same time.

The FetchFile processor retrieves the content of each listed file and inserts that content into the FlowFile. This processor is a good candidate for multiple concurrent tasks.
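The list-then-fetch split can be sketched in Python as follows. This is only an analogy for the pattern, not how NiFi implements it: the dictionaries stand in for FlowFiles, and the names list_files, fetch_file, and list_then_fetch, as well as the default of 4 workers, are assumptions for this example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_files(directory):
    """Listing step: one lightweight record per file, with no content read yet."""
    entries = sorted(os.scandir(directory), key=lambda e: e.name)
    return [
        {"path": e.path, "filename": e.name, "content": b""}
        for e in entries
        if e.is_file()
    ]


def fetch_file(record):
    """Fetch step: read the file named in the record and attach its content."""
    with open(record["path"], "rb") as handle:
        return {**record, "content": handle.read()}


def list_then_fetch(directory, concurrent_tasks=4):
    """List once, then fetch the contents with several workers in parallel."""
    records = list_files(directory)
    with ThreadPoolExecutor(max_workers=concurrent_tasks) as pool:
        return list(pool.map(fetch_file, records))
```

The listing step is cheap because it never opens a file, so it makes sense to run it once and put the parallelism on the fetch step, where the actual I/O happens.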

Each instance of NiFi runs in its own single JVM. Only FlowFile attributes live in JVM heap memory (FlowFile attributes are also persisted to disk). To help protect the JVM from OOM errors, NiFi will swap FlowFiles to disk if a connection's queue exceeds the configurable swapping threshold. The default swapping threshold is 20,000 and is set in the nifi.properties file (nifi.queue.swap.threshold). This setting is per connection, not for the entire NiFi dataflow(s).
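The idea behind a per-connection swap threshold can be pictured with a toy Python queue. This is a deliberately simplified illustration and not NiFi's actual swap mechanism: the SwappingQueue class, the queue.swap file name, and the JSON-lines format are all assumptions, and the sketch only covers adding entries, not reading them back in order.

```python
import json
import os
from collections import deque


class SwappingQueue:
    """Keeps up to swap_threshold attribute maps in memory, spills the rest to disk."""

    def __init__(self, swap_threshold, swap_dir):
        self.swap_threshold = swap_threshold
        self.in_memory = deque()
        # Hypothetical swap file; one JSON object per line.
        self.swap_path = os.path.join(swap_dir, "queue.swap")
        self.swapped = 0

    def put(self, attributes):
        if len(self.in_memory) < self.swap_threshold:
            self.in_memory.append(attributes)
        else:
            # Over the threshold: write to disk instead of growing the heap.
            with open(self.swap_path, "a", encoding="utf-8") as handle:
                handle.write(json.dumps(attributes) + "\n")
            self.swapped += 1
```

Under this model, queuing 25,000 entries with a threshold of 20,000 would leave 20,000 in memory and 5,000 on disk, and every connection would get its own independent queue and threshold.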

FlowFile Content is written to the NiFi content repository. It is then only accessed when a processor performs a function that requires it to read or modify that content.

NiFi's JVM heap memory defaults to only 512 MB, but is configurable via NiFi's bootstrap.conf file.

Thanks,

Matt
