Support Questions

obaid_salikeen · ‎08-09-2016

Hello to this awesome community,

While playing with Apache NiFi (a cluster with 5 slaves and 6GB RAM per slave), we managed to crash the cluster a bunch of times through human error. Example: we loaded a compressed file (100MB) into NiFi (GetFTP) and uncompressed it (500MB). Splitting this file (SplitText) caused the cluster to crash, since it had 1-2 million lines, which translated to 1-2 million FlowFiles. Setting back pressure didn't help in this case, since SplitText still generates 2 million+ FlowFiles regardless. In production, you wouldn't want your cluster to choke or crash when this kind of mistake is made.

1. NiFi gives us this awesome flexibility to write these workflows. However, are there any guidelines we could follow to safeguard against human errors and make NiFi suitable for general-purpose use (like sharing a NiFi cluster among teams)?

Thanks

bbende · ‎08-09-2016

In the case of SplitText, the usual approach when splitting large files is to use two instances of SplitText, where the first one might split to 10-20k lines per flow file, and then the second splits down to 1 line. This avoids producing millions of flow files in one execution of the processor.
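To make the arithmetic behind the two-stage split concrete, here is a rough sketch in plain Python. It is not NiFi code and does not use any NiFi API; `split_lines` and `two_stage_fanout` are hypothetical helper names made up for illustration. It only shows how chunking first bounds how many outputs any single split step has to produce.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def split_lines(lines: Iterable[str], lines_per_chunk: int) -> Iterator[List[str]]:
    """Yield successive chunks of at most `lines_per_chunk` lines."""
    if lines_per_chunk < 1:
        raise ValueError("lines_per_chunk must be at least 1")
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, lines_per_chunk))
        if not chunk:
            return
        yield chunk


def two_stage_fanout(total_lines: int, first_stage_lines: int) -> Tuple[int, int]:
    """Simulate a two-stage split (hypothetical helper, for illustration only).

    Returns a pair: the number of chunks the first stage produces, and the
    largest number of single-line outputs any one second-stage run produces.
    """
    lines = (f"line {i}" for i in range(total_lines))
    stage_one_chunks = 0
    largest_stage_two_run = 0
    for chunk in split_lines(lines, first_stage_lines):
        stage_one_chunks += 1
        # Second stage: split this one chunk down to one line per output.
        outputs = sum(1 for _ in split_lines(chunk, 1))
        largest_stage_two_run = max(largest_stage_two_run, outputs)
    return stage_one_chunks, largest_stage_two_run
```

With 2 million lines and a first stage of 10,000 lines per chunk, the first stage produces 200 chunks and no single second-stage run produces more than 10,000 outputs, instead of one step producing 2 million at once.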

For some other processors, the description commonly includes a warning if the processor reads the whole flow file into memory, so that the user is aware that if they send in 2GB of data, it's going to use 2GB of the heap, or cause an OOM if that much isn't available. Whenever possible, processors should perform their processing in a streaming fashion to avoid taking up large chunks of memory.
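As a generic illustration of the difference between streaming and loading everything into memory, here is a minimal Python sketch. This is not how any particular NiFi processor is implemented; the function names and the 64KB buffer size are assumptions chosen for the example.

```python
import io
from typing import BinaryIO


def count_lines_streaming(stream: BinaryIO, buffer_size: int = 64 * 1024) -> int:
    """Count newline bytes while reading at most `buffer_size` bytes at a time."""
    count = 0
    while True:
        block = stream.read(buffer_size)
        if not block:
            return count
        count += block.count(b"\n")


def count_lines_in_memory(stream: BinaryIO) -> int:
    """Same result, but the entire content is read into memory first."""
    return stream.read().count(b"\n")


if __name__ == "__main__":
    data = b"first\nsecond\nthird\n"
    # Both approaches agree; only their peak memory use differs.
    print(count_lines_streaming(io.BytesIO(data), buffer_size=4))
    print(count_lines_in_memory(io.BytesIO(data)))
```

The streaming version's memory use is bounded by the buffer size regardless of input size, while the in-memory version needs roughly as much memory as the input itself.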

As far as sharing the cluster among teams, NiFi doesn't really have resource isolation, but NiFi 1.0.0 (initial BETA released yesterday) is going to introduce a fine-grained security model so that different teams and people can be granted access to different parts of the flow. Team 1 might only have access to Process Group 1, and Team 2 might only have access to Process Group 2, so each team can't see what the other team is doing or change their flow.

obaid_salikeen · ‎08-09-2016

Very exciting! Regarding NiFi 1.0, do you know roughly when it will be released? Cannot wait to try it out. Is it stable enough to run workflows?

bbende · ‎08-10-2016

It is still considered an unstable beta release, so it is not recommended for production, but it is stable enough to run in a test/dev environment. Can't really give a specific timeline, but it shouldn't be too far away. The community is already working on the remaining issues and anything found from testing the beta.

Cloudera Community

Apache Nifi- How to add safeguards against human-errors