Created 09-06-2020 06:54 AM
Sometimes, when all VMs hosting Cloudera-managed clusters are rebooted, or when a successful installation is restored from a snapshot on the VMs, the Cloudera agent fails to start.
I analyzed the cloudera-scm-agent.out log and saw that the Cloudera agent was down on some of the machines. The cause is that the startup order of BIND and the Cloudera agent is not strictly enforced during boot. The agent needs the local BIND server to be running, because it queries for the hostname of the machine and the DNS server configured in resolv.conf is 127.0.0.1, so on machines where the agent came up before BIND it failed to start. Is there a way to configure the Cloudera agent to retry a specific number of times, with a timeout, before it fails?
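To make the behaviour I am after concrete, here is a minimal Python sketch of a bounded retry: it keeps trying to resolve the local hostname and gives up after a fixed number of attempts. This is not an existing Cloudera agent option. The function name `wait_for_dns` and the `retries`, `delay`, `resolve` and `sleep` parameters are my own assumptions for illustration, and the default values are arbitrary.

```python
import socket
import time


def wait_for_dns(hostname=None, retries=10, delay=3.0,
                 resolve=socket.gethostbyname, sleep=time.sleep):
    """Return True once `hostname` resolves, or False after `retries` failed attempts.

    Hypothetical helper, not part of the Cloudera agent. `resolve` and `sleep`
    are injectable so the retry logic can be exercised without a real DNS server.
    """
    name = hostname or socket.gethostname()
    for attempt in range(1, retries + 1):
        try:
            resolve(name)  # raises OSError (e.g. socket.gaierror) while DNS is down
            return True
        except OSError:
            if attempt < retries:
                sleep(delay)  # wait before the next attempt, but not after the last
    return False
```

Something like this could run before the agent is started, for example from a wrapper script that only launches the agent when the function returns True. The worst-case wait is roughly `(retries - 1) * delay` seconds plus the time spent in each failed lookup.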