These four example messages appear over and over in the name service log:
SCM 5.7.0 & CDH 5.7.0 on RedHat 6.7.
Another test system, almost identical to the troubled system, can restart its cluster in about 5 minutes.
$ hdfs fsck /
reports a healthy filesystem and 0 under-replicated blocks.
The system always recovers; however, every restart takes about an hour.
We have redundant directories on different partitions for the NameNode fsimage, along with another host running the secondary name service.
Checkpoints complete once safe mode is off.
Those messages are normal NameNode operation, so for us to help at all we would need to see logs that occur only during the safe-mode period.
Cluster size and the amount of metadata do increase startup time.
Additionally, the NameNode uses significantly more memory while starting up. So look for the NameNode garbage-collecting excessively during startup, as that will also significantly delay your startup. You can check this with a JMX client like VisualVM, or with Java built-ins like jstat -gccause <pid>
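A minimal sketch of watching GC during startup, assuming the stock NameNode main class is visible to pgrep and that the JDK's jstat is on PATH (the 5-second interval and sample count are arbitrary placeholders):

```shell
#!/usr/bin/env bash
# Sample the NameNode JVM's GC cause every 5 seconds, 10 samples.
# The pgrep pattern assumes the default NameNode main class; adjust
# it to however the process appears on your host.
NN_PID=$(pgrep -f 'org.apache.hadoop.hdfs.server.namenode.NameNode' | head -n1)
if [ -n "$NN_PID" ]; then
  # Frequent full GCs / a rapidly growing FGCT column here point at
  # GC pressure that will stretch out safe mode.
  jstat -gccause "$NN_PID" 5000 10
else
  echo "No NameNode JVM found on this host"
fi
```

If the FGC/FGCT columns climb steadily while safe mode is on, the heap is likely undersized for the block count.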
We increased the heap, and safe mode now lasts 17 minutes, which is an improvement.
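For anyone following along: outside of Cloudera Manager, the NameNode heap is typically raised in hadoop-env.sh along these lines (the 8 GB figure is a placeholder, not a sizing recommendation; with Cloudera Manager you would set the NameNode's Java heap size in the service configuration instead):

```shell
# hadoop-env.sh -- illustrative only; size the heap to your block count.
# Matching -Xms to -Xmx avoids heap-resize pauses during startup.
export HADOOP_NAMENODE_OPTS="-Xms8g -Xmx8g"
```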
Now that we have switched from RPMs to parcel packages, we will implement high availability for the name service (NN x 2 instead of NN + SNN) and adopt a restart procedure with manual failovers, which will hopefully avoid this long restart going forward.
This slow cluster restart continues to be a problem for us, especially after a RedHat reboot (to apply security updates).
In the logs it initially says there are 0 datanodes, then eventually 1, then 2, and so on (we have only 5 datanodes in this small cluster).
In the meanwhile it goes crazy with 'under replicated blocks'. In reality there were no under-replicated blocks prior to the cluster stop, but since the NameNode does not immediately find them at startup, it launches a big block-recovery activity that possibly hinders recovery/startup times.
1) Why is it slow to discover restarted datanodes, and
2) is there a way to delay under-replicated block recovery after a restart (for, say, 15 minutes)?
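For what it's worth, the knobs that influence (1) and (2) live in hdfs-site.xml; the values below are illustrative assumptions, not tuned recommendations:

```xml
<!-- hdfs-site.xml: illustrative values, not recommendations -->
<property>
  <name>dfs.namenode.safemode.extension</name>
  <!-- Keep the NameNode in safe mode 15 min (in ms) after the block
       threshold is reached; replication work cannot start earlier. -->
  <value>900000</value>
</property>
<property>
  <name>dfs.blockreport.initialDelay</name>
  <!-- Spread the DataNodes' initial block reports over up to 120 s
       instead of having them all arrive at once. -->
  <value>120</value>
</property>
<property>
  <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
  <!-- Throttles how much re-replication work is handed to DataNodes
       per heartbeat interval (default 2); lower is gentler at startup. -->
  <value>2</value>
</property>
```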
This appears to have a follow-on problem: once the cluster is finally recovered and considered green status, some Impala daemons are non-functional.
In this case:
3) there does not appear to be a canary test for every Impala node, and
4) Impala daemons that are dead are marked green/healthy, but are unable to respond to queries.
Our work-around for this is to bounce Impala once more at the end of every restart. Then all is well.
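Instead of bouncing the whole service blindly, a post-restart canary along these lines can identify exactly which daemons are dead (the hostnames are hypothetical placeholders; it assumes impala-shell is on PATH):

```shell
#!/usr/bin/env bash
# Post-restart canary: run a trivial query against every impalad and
# flag daemons that cannot answer. Replace HOSTS with your own nodes.
HOSTS="dn1 dn2 dn3 dn4 dn5"
results=""
for h in $HOSTS; do
  if impala-shell -i "$h" --quiet -q 'select 1' >/dev/null 2>&1; then
    results="$results $h:OK"
  else
    results="$results $h:DOWN"
  fi
done
echo "$results"
```

Daemons flagged DOWN can then be restarted individually instead of cycling the entire Impala service.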
All of this behavior happens with parcels 5.8.2 as well as 5.12.1 (slow starts, and inaccurate Impala health).
RedHat 6.9 with the latest security updates.