NIFI : index large volume

Explorer

Hi all,

This is only an empirical observation, but it seems that NiFi cannot index large volumes of data without crashing the cluster.

The last time, my cluster crashed after the ListHDFS processor listed 160,000 small files.

Are there some parameters to update in the default NiFi configuration?

At the moment my cluster is composed of 3 VM nodes, each with 2 CPUs and 4 GB of RAM.

thanks

3 REPLIES 3

Re: NIFI : index large volume

Master Guru

@mayki wogno

What do you see in the nifi-app.log when your NiFi cluster "crashes"?

Are you seeing any OOM errors?

Thanks,

Matt

Re: NIFI : index large volume

Explorer

Hi Matt.

The last time, it seems there was an 'out of memory' error.

Re: NIFI : index large volume

Master Guru
@mayki wogno

The ListHDFS processor reads in the entire listing, creating a FlowFile for each file listed, before committing them all to the success relationship. The FlowFiles' attributes reside in NiFi's JVM heap memory during this time. Once they are committed to the success relationship, NiFi will swap FlowFile attributes to disk based on the swap threshold configured in the nifi.properties file (default 20,000). This swapping only occurs for FlowFiles sitting in a connection queue, not during the listing phase.
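For reference, the swap threshold mentioned above is controlled by a single property in nifi.properties; the value shown is the documented default, so treat this as an illustrative fragment rather than a tuning recommendation:

```properties
# nifi.properties -- per-connection queue size at which NiFi begins
# swapping FlowFile attributes from JVM heap out to disk (default 20,000).
nifi.queue.swap.threshold=20000
```

Lowering this value reduces heap pressure on queued FlowFiles, but as noted above it does not help during the listing phase itself.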

Unfortunately, there is no configuration change you can make to ListHDFS that will change this behavior.

- You can avoid the OOM by increasing the heap memory allocated to NiFi in the bootstrap.conf file.

- If your listing spans numerous subdirectories, you could replace your one ListHDFS processor with multiple ListHDFS processors (each pointing at a different subdirectory). This would decrease the size of the listing each one creates before committing to its success relationship; after that, swapping will take place on the connections. You could route all these ListHDFS processors' success relationships to a funnel before continuing on into your existing dataflow.
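The heap increase from the first suggestion is made via the JVM arguments in conf/bootstrap.conf. A minimal sketch, assuming the stock argument numbering (the defaults are 512m; the 4g below is only an example, size it to what your VMs can actually spare):

```properties
# conf/bootstrap.conf -- JVM memory settings for the NiFi process.
# Initial heap size (default: -Xms512m)
java.arg.2=-Xms4g
# Maximum heap size (default: -Xmx512m)
java.arg.3=-Xmx4g
```

A NiFi restart is required for bootstrap.conf changes to take effect. Note that with 4 GB VMs, raising the heap this far would starve the OS, so you would likely also need larger nodes.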

You may want to open an Apache Jira suggesting a change to how the NiFi ListHDFS processor works to help address the OOM condition that can occur.

Thanks,

Matt
