We have 2 clusters (6 instances each) running NiFi 1.1.2 + JDK 8u121 on CentOS Linux.
The traffic is divided between those 2 clusters:
1. EAST cluster: 2700 TPS
2. WEST cluster: 980 TPS
We have tried to migrate to NiFi 1.2.0, 1.3.0, and 1.4.0, but each time the cluster with the higher TPS (EAST) got stuck after 4 hours of intensive traffic, and its web console became unresponsive.
I've tried many things to fix this, but the only improvement I got was extending the time before it fails from 4 to 6 hours.
Our current instances run on AWS, and each EC2 instance (c5.2xlarge) has 8 vCPUs and 16 GB of RAM.
I've also tried c5.4xlarge (double the CPU and RAM), but I got the same outcome.
I don't have a clue what the issue is. I have a Datadog dashboard that tracks some Java heap metrics, but everything looks normal.
What should I do to find out why the newer versions are failing, even on the bigger instances? Is it memory, disk space, or stuck threads? Why does the old NiFi cluster configuration work better than a newer NiFi?
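To make the three suspects above (memory, disk space, stuck threads) concrete, here is a minimal sketch of how they can be sampled with the standard `java.lang.management` API. It is only an illustration, not something taken from NiFi: the class name `JvmHealthSnapshot` and the `"."` path are hypothetical, and the code inspects the JVM it runs in, so checking a real NiFi node would mean reading the same MXBeans over remote JMX or taking a thread dump with `jstack` instead.

```java
import java.io.File;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: samples heap, disk, and thread state of the current JVM.
public class JvmHealthSnapshot {

    // Heap bytes currently in use, as reported by the memory MXBean.
    static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    // Usable bytes on the partition holding the given path.
    // The path is an assumption; point it at the repository directories in use.
    static long usableDiskBytes(String path) {
        return new File(path).getUsableSpace();
    }

    // Number of live threads per state (RUNNABLE, BLOCKED, WAITING, ...).
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    // Threads involved in a monitor or ownable-synchronizer deadlock, 0 if none.
    static int deadlockedThreadCount() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.println("Heap used (bytes):   " + heapUsedBytes());
        System.out.println("Disk usable (bytes): " + usableDiskBytes("."));
        System.out.println("Threads by state:    " + countByState());
        System.out.println("Deadlocked threads:  " + deadlockedThreadCount());
    }
}
```

Sampling this periodically and comparing the numbers from the first hour with those shortly before the hang would show whether the BLOCKED or WAITING thread counts climb, or whether free disk space shrinks, while the heap still looks normal.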
I hope you can help me with this.