This is only an empirical observation, but it seems that NiFi cannot list a large volume of data without crashing the cluster. The last time, my cluster crashed after the ListHDFS processor listed 160,000 small files.
Are there parameters to update in NiFi's default configuration?
At the moment my cluster is composed of 3 VM nodes, each with 2 CPUs and 4 GB of RAM.
The ListHDFS processor reads in the entire listing, creating a FlowFile for each file listed, before committing them to the success relationship. The FlowFiles' attributes reside in NiFi's JVM heap memory during this time. Once they are committed to the success relationship, NiFi will swap FlowFile attributes to disk based on the swap threshold configured in the nifi.properties file (default 20,000). This swapping only occurs for FlowFiles in a queue, not during the listing phase.
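For reference, the swap threshold mentioned above is controlled by this property in nifi.properties (shown here with its default value; tune it to your environment):

```properties
# conf/nifi.properties
# Number of FlowFiles allowed in a single connection's queue before
# NiFi starts swapping FlowFile attributes out of heap to disk.
# Note: this applies only to queued FlowFiles, not to the listing phase.
nifi.queue.swap.threshold=20000
```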
Unfortunately, there is no configuration change you can make to ListHDFS itself that will change this behavior.
- You can avoid the OOM by increasing the heap memory allocated to NiFi in the bootstrap.conf file.
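As a sketch, the heap is set via the JVM arguments in bootstrap.conf. The values below are illustrative, not a recommendation; with 4 GB VMs you must leave headroom for the OS and other processes:

```properties
# conf/bootstrap.conf
# Default is typically 512 MB (-Xms512m / -Xmx512m).
# Example raising min and max heap to 2 GB (adjust for your nodes):
java.arg.2=-Xms2g
java.arg.3=-Xmx2g
```

A restart of NiFi is required for bootstrap.conf changes to take effect.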
- If your listing spans numerous subdirectories, you could replace your single ListHDFS processor with multiple ListHDFS processors, each pointing at a different subdirectory. This decreases the size of the listing each one creates before committing to its success relationship; after commit, swapping takes place on the connections. You could pass all of these ListHDFS processors' success relationships to a funnel before continuing on into your existing dataflow.
You may want to open an Apache Jira suggesting a change to how the NiFi ListHDFS processor works, to help address the OOM condition that can occur.