
NiFi how to store the events in memory and disk

New Contributor

Can someone explain in detail how NiFi processors like GetFile or QueryDatabaseTable store rows when the next processor is not available to receive or process any data? Does the data get piled up in memory and then get swapped to disk when its size exceeds some threshold? Is there a risk of running out of memory or losing data?

1 ACCEPTED SOLUTION

Super Mentor
@Shengjie Min

When backpressure kicks in because a configured threshold has been met on a connection, the source processor of that connection (whether it is GetFile or QueryDatabaseTable) is no longer allowed to run until the queue drops back below the backpressure threshold. These thresholds are soft limits so as not to cause any data loss. Let's assume you have an object threshold of 100 set on a connection and that connection currently has 99 FlowFiles queued. If the upstream processor handles batches of files at a time, the entire batch will be processed and placed on the connection. Say the batch was 100 FlowFiles: your connection would then have 199 FlowFiles queued on it. Because of the object threshold setting, the source processor would not be allowed to run again until the queue dropped below 100.
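The soft-limit behavior described above can be sketched in a few lines. This is a hedged illustration only, not NiFi internals; the `Connection` class and its method names are invented for the example:

```python
# Illustrative sketch (NOT NiFi code): how a soft object-count
# backpressure threshold behaves. All names here are hypothetical.

class Connection:
    def __init__(self, object_threshold):
        self.object_threshold = object_threshold
        self.queue = []

    def source_may_run(self):
        # The backpressure check happens BEFORE the source runs, so a
        # whole batch can still push the queue past the threshold.
        return len(self.queue) < self.object_threshold

    def enqueue_batch(self, flowfiles):
        # Soft limit: the entire batch is accepted rather than
        # dropping FlowFiles mid-batch, so no data is lost.
        self.queue.extend(flowfiles)

conn = Connection(object_threshold=100)
conn.queue = [f"ff-{i}" for i in range(99)]     # 99 FlowFiles queued

assert conn.source_may_run()                    # 99 < 100, source runs
conn.enqueue_batch([f"batch-{i}" for i in range(100)])
print(len(conn.queue))                          # 199 queued
print(conn.source_may_run())                    # False until it drains below 100
```

The key point the sketch captures is that the check is a gate on *starting* the source, not a hard cap on the queue, which is why the count can temporarily exceed the threshold.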

Data ingested by NiFi is written to the content repository, and the FlowFile attributes about that content are written to the FlowFile repository. The FlowFile itself (its attributes and metadata) also remains in heap memory and is what is passed from processor to processor in your dataflow. To reduce the likelihood of NiFi running out of heap memory, NiFi is configured to swap FlowFiles out of heap to disk should the number of FlowFiles queued on a single connection exceed a configurable value. The default swapping threshold is 20,000 FlowFiles per connection and is set in the nifi.properties file.
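For reference, the swap threshold is controlled by a single line in nifi.properties; the default typically looks like this (the exact surrounding entries may differ between NiFi versions):

```properties
# Per-connection FlowFile count above which queued FlowFiles are
# swapped from heap to disk (default shown)
nifi.queue.swap.threshold=20000
```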

Heap memory usage is something every person who builds dataflows must take into account. While FlowFiles in general use very little heap space, there is nothing stopping a user from designing a dataflow that writes very large FlowFile attributes. A user could, for example, use the ExtractText processor to read the entire content (data) of a FlowFile into a FlowFile attribute. Depending on the size of that content, the FlowFile could get very large.
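To make the cost concrete, here is a hedged illustration (again not NiFi internals) of why copying content into an attribute inflates per-FlowFile heap usage; the dictionaries stand in for FlowFile attribute maps:

```python
# Illustrative only: FlowFile attributes live in heap, while content
# normally stays on disk in the content repository.
import sys

small_attrs = {"filename": "data.csv", "path": "/in/"}   # typical tiny attributes
content = "x" * 10_000_000                               # 10 MB of file content

# An ExtractText-style copy of the whole content into an attribute
# means those 10 MB now also live in heap for this FlowFile.
bloated_attrs = dict(small_attrs, content=content)

print(sys.getsizeof(bloated_attrs["content"]) > 10_000_000)   # True
```

Multiply that by thousands of queued FlowFiles and a 512 MB heap disappears quickly, which is the scenario the paragraph above warns about.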

By default, NiFi comes configured to use a heap of only 512 MB. This value can be adjusted in NiFi's bootstrap.conf file.
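The heap size is set via JVM arguments in bootstrap.conf; the defaults typically look like the lines below (the `java.arg.N` numbering can vary between installs):

```properties
# JVM memory settings in conf/bootstrap.conf (defaults shown;
# raise -Xmx for heavier dataflows)
java.arg.2=-Xms512m
java.arg.3=-Xmx512m
```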

Thank you,

Matt

