Created 03-29-2017 10:17 AM
After running for about a week, our NiFi cluster becomes very unstable. Nodes disconnect and reconnect every 5 to 30 minutes, and processors stop working properly as well. Restarting all 3 nodes resolves the issue.
Restarting NiFi weekly is not a good solution, but it is the only approach that works for us right now.
An example log file from one of the nodes is attached (nifi-app.zip).
Created 03-30-2017 02:04 PM
There are multiple reasons your cluster could become unstable. Without having more information about your flow and resources available on the nodes, I would only be able to guess what the issue might be.
What version of NiFi are you running?
Created 03-31-2017 12:49 PM
Our NiFi has 8 GB of heap. The NiFi version is 188.8.131.52.
Created 03-31-2017 01:10 PM
Are the systems NiFi is running on physical servers or VMs?
How many CPUs per system?
How are the disks configured? Multiple partitions or a single partition?
Are the ZooKeeper servers embedded or are they on separate systems?
What is the volume of data on the systems when you see the nodes disconnect?
There are errors in the log from the DetectDuplicate processor. Have you tried to address that issue?
There are also a lot of socket timeout exceptions.
Created 03-31-2017 02:21 PM
Our NiFi is co-located with other Hadoop components. These are physical servers.
24 cores per machine.
ZooKeeper is not embedded, but it runs on these same machines.
The errors on the DetectDuplicate processor are symptoms of this issue. The socket timeouts are too.
We have 3 process groups on our NiFi cluster. Their templates are attached (templates.zip).
Created 04-03-2017 01:21 PM
I was able to load only two of the process groups. One has a custom processor named JsonDateEdit.
The DistributedMapCacheClientService controller service needs a DistributedMapCacheServer controller service.
I would attempt to determine why there are so many socket connection issues and eliminate them. Try reducing the number of Max Total Connections on your DBCPConnectionPool controller service to see if it reduces the warnings for socket issues. The socket issues are most likely what is causing the stability problem.
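One way to start narrowing down the socket issues is to see when they cluster in time. The following is only a rough Python sketch, not an official NiFi tool. It assumes each log entry in nifi-app.log begins with a timestamp such as "2017-03-29 10:17:00,123" (NiFi's default logging pattern) and that the timeouts appear as the text "SocketTimeoutException". The function names `count_per_hour` and `report` are made up for this example.

```python
import re
from collections import Counter

# Assumption: log entries start with "YYYY-MM-DD HH:MM:SS", as in NiFi's
# default logback pattern. The capture group keeps the date and the hour.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}")


def count_per_hour(lines, needle="SocketTimeoutException"):
    """Count lines containing `needle`, grouped by date and hour.

    Stack trace lines carry no timestamp of their own, so each match is
    attributed to the most recent timestamp seen above it.
    """
    counts = Counter()
    current_hour = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            current_hour = match.group(1)
        if needle in line and current_hour is not None:
            counts[current_hour] += 1
    return counts


def report(path, needle="SocketTimeoutException"):
    """Print one line per hour with the number of matching log lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for hour, total in sorted(count_per_hour(handle, needle).items()):
            print(f"{hour}:00  {total}")
```

Calling `report("nifi-app.log")` on each node and comparing the hours with the most hits against the times the nodes disconnected may show whether the timeouts come before the disconnects or follow them. Passing a different `needle`, for example "DetectDuplicate", gives the same breakdown for the processor errors.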