1 node in the cluster getting excessive timeout errors

New Contributor

NiFi version 1.9.0.3.4.1.1-4

 

One server in my 3-node cluster tends to get an excessive number of timeout errors. Whenever this node is the master/coordinator, data processing is very slow. If the NiFi service is restarted while this node is master/coordinator, the server comes back up with an "unable to create native thread" error. This happens on only one node in the cluster; all other nodes work as intended.

 

This setup worked for months with no issues. The only change we made, switching to round-robin load balancing, has since been reverted. The affected node is showing low utilization. We are lost, and any help is greatly appreciated.

One last note: we have sensitive data flowing through this cluster, so getting full logs is not easy for us. I have attached what I could.

 

top - 15:08:53 up 3 days, 0 min,  1 user,  load average: 0.39, 0.50, 0.63
Tasks: 410 total,   1 running, 409 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.2 us,  0.8 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 39604499+total, 54695292 free, 19844422+used, 14290547+buff/cache
KiB Swap:  4194300 total,  4194300 free,        0 used. 19644432+avail Mem

1 REPLY

Master Guru

Check your timeouts.

Turn off or fix any firewalls.

Test the same network calls from other machines. The problem could also be the SFTP server you are reading from.
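
One way to test this without touching the flow is a small TCP probe run from each node in turn. The sketch below is only an illustration, not something NiFi provides: the function name `can_connect` is made up, and the port is assumed to be the default SSH/SFTP port 22.

```python
import socket


def can_connect(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    This only checks that the port is reachable. It does not log in or
    speak the SFTP protocol.
    """
    try:
        # create_connection raises an OSError subclass on timeout,
        # refusal, or name-resolution failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("your-sftp-host")` on every node (the host name here is a placeholder) should show whether only the problem node fails to reach the server, which would point at that node's network path or firewall rules rather than at the SFTP server itself.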

 

 

Connection timed out (Connection timed out); routing to comms.failure: java.io.IOException: Failed to obtain connection to remote host due to com.jcraft.jsch.JSchException: java.net.ConnectException: Connection timed out (Connection timed out)
java.io.IOException: Failed to obtain connection to remote host due to com.jcraft.jsch.JSchException: java.net.Connect

 

Increase the timeouts for the network calls. How many NICs do you have, and are they 10 Gb or faster?

 

How much RAM do the nodes have? I recommend 32 GB of RAM, with most of it allocated to the JVM, and 30-32 cores.

 

The best practice is to run Cloudera Flow Management on a cluster managed by Cloudera Manager, which will make sure everything is running properly.

 

You can also restart the nodes to get a different leading node. With SFTP you usually have only one node making the calls, which is why that one node gets the timeout errors when calling the SFTP server. Increase the timeout; your SFTP server may be slow, offline, or blocked by a firewall, gateway, proxy, or the Linux network configuration.
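
To get a rough idea of how large the timeout needs to be, you could time a handful of connection attempts from the affected node. This is a minimal sketch under the same assumptions as any plain TCP check: `connect_times` is a hypothetical helper, port 22 is assumed, and it measures only the TCP connect, not the SSH handshake or login.

```python
import socket
import time


def connect_times(host, port=22, attempts=5, timeout=30.0):
    """Time several TCP connects to host:port.

    Returns a list with one entry per attempt: the connect duration in
    seconds, or None if the attempt failed or timed out.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results.append(time.monotonic() - start)
        except OSError:
            results.append(None)
    return results
```

If the successful attempts take close to the timeout currently configured in the flow, raising it is a reasonable next step. If most attempts come back as None, the server is more likely unreachable from that node than merely slow.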