I am installing an 8-node HDP 2.4 cluster, administered with Ambari 2.2. When running service checks for the following services:
I get the following error message:
Python script has been killed due to timeout after waiting 300 secs
That is the only error message shown on stderr or stdout. I have checked here and here and followed the advice given there to increase the agent task timeout. However, this has not improved things at all. Does anybody have any advice on how I can resolve this?
Usually, when you get a timeout, the solution is not to increase the timeout but to find the underlying problem. Look in your logs for the command the script is running and try running it yourself. A timeout can have many causes (DNS, iptables, etc.), and they may be unrelated to one another.
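To rule out the DNS and firewall causes quickly, something like the following could be run from the Ambari server or from any node. It is only a rough sketch: `check_endpoint` is a helper name made up for this example, and the hosts and ports to test are whatever your own cluster uses.

```python
import socket


def check_endpoint(host, port, timeout=5.0):
    """Return (resolved_ip, reachable) for host:port.

    resolved_ip is None when the DNS lookup fails. reachable is False when
    the TCP connection is refused, filtered (e.g. by iptables), or times out.
    """
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        # Name resolution failed: points at DNS or /etc/hosts.
        return None, False
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        # Covers connection refused, unreachable network, and timeouts.
        return ip, False


# Example call (hypothetical hostname and port, replace with your own):
# print(check_endpoint("node1.example.com", 8088))
```

If a host resolves but is not reachable on a port that the service should be listening on, that suggests a firewall rule or a stopped daemon rather than a timeout that is simply too short.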
It is usually best to check the slave logs to see whether the slave services are still running. If the NodeManagers are all down, for example, the YARN-dependent service checks will time out: a job can be submitted to the ResourceManager, but it is never actually run.
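One way to see at a glance whether any NodeManagers are up is to ask the ResourceManager for its node list. The sketch below assumes the ResourceManager REST API is reachable at its `/ws/v1/cluster/nodes` endpoint and that the web port is the common default of 8088; both are assumptions about your setup, and the function names are made up for this example.

```python
import json
from collections import Counter
from urllib.request import urlopen


def count_node_states(payload):
    """Count NodeManagers by state in a /ws/v1/cluster/nodes response."""
    # "nodes" can be null when no NodeManager has registered at all.
    nodes = (payload.get("nodes") or {}).get("node") or []
    return Counter(node.get("state", "UNKNOWN") for node in nodes)


def fetch_node_states(rm_url, timeout=10):
    """Fetch the node list from the ResourceManager and count the states.

    rm_url is the ResourceManager web address, e.g. "http://rm-host:8088"
    (hypothetical host; the port is an assumption).
    """
    url = rm_url.rstrip("/") + "/ws/v1/cluster/nodes"
    with urlopen(url, timeout=timeout) as resp:
        return count_node_states(json.load(resp))


# Example call (hypothetical address, replace with your own):
# print(fetch_node_states("http://rm-host:8088"))
```

If the result contains no nodes in the RUNNING state, submitted jobs will sit waiting for containers, which matches a service check that hangs until the 300-second limit.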