Explorer
Posts: 27
Registered: ‎04-08-2016

Every cluster restart, the Name Service stays in safe mode for an hour before the cluster finally goes green

These four example messages appear over and over in the name service log:

 

  • 2016-06-12 00:15:06,486 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 10.10.10.84:50010 is added to blk_1074147634_1099546079234{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-6cb61d85-d3d4-4e98-ae2d-0313ab8941c4:NORMAL:10.10.10.83:50010|RBW], ReplicaUnderConstruction[[DISK]DS-d7077514-495d-42b2-a2ec-9f76fcc43db4:NORMAL:10.10.10.80:50010|RBW], ReplicaUnderConstruction[[DISK]DS-4b57fed3-06c0-444e-b217-e67f36edff97:NORMAL:10.10.10.84:50010|RBW]]} size 0
  • 2016-06-12 15:23:56,496 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1074163953_1099546095553 10.10.10.85:50010 10.10.10.84:50010 10.10.10.83:50010
  • 2016-06-12 15:24:27,731 INFO BlockStateChange: BLOCK* BlockManager: ask 10.10.10.80:50010 to delete [blk_1074163959_1099546095559]
  • 2016-06-13 12:12:19,705 INFO BlockStateChange: BLOCK* ask 10.10.10.83:50010 to replicate blk_1073950158_1099545881758 to datanode(s) 10.10.10.85:50010

 

SCM 5.7.0 & CDH 5.7.0 on RedHat 6.7.

 

Another test system, almost identical to the troubled system, can restart its cluster in about 5 minutes.

 

$ hdfs fsck /

 

reports a healthy system and 0 under-replicated blocks.

 

This system always recovers; however, every restart takes an hour.

 

We have redundant directories on different partitions for the NameNode fsimage, along with another host running the Secondary NameNode.
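For reference, that redundancy is the usual comma-separated dfs.namenode.name.dir setting in hdfs-site.xml (the paths below are placeholders, not our real mount points):

<property>
  <name>dfs.namenode.name.dir</name>
  <!-- two directories on different partitions; the NameNode writes fsimage/edits to both -->
  <value>/data1/dfs/nn,/data2/dfs/nn</value>
</property>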

 

Checkpoints complete once safe mode is off.
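While we wait this out, we watch safe mode from the command line (standard dfsadmin subcommands; "wait" simply blocks until safe mode turns off):

$ hdfs dfsadmin -safemode get

$ hdfs dfsadmin -safemode wait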

Expert Contributor
Posts: 101
Registered: ‎01-24-2014

Re: Every cluster restart, the Name Service stays in safe mode for an hour before the cluster finally goes green

Those messages are normal NameNode operation, so for us to help at all, we would need to see log entries that occur only during the safe-mode period.

 

General Advice:

 

The size of the cluster and the amount of metadata do increase startup time.

 

Additionally, the NameNode uses significantly more memory while starting up, so look for the NameNode garbage collecting excessively during startup, as that will also significantly delay it. You can check this with a JMX client like VisualVM or with Java built-ins like jstat -gccause <pid>.
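For example, something along these lines (a sketch; the pgrep pattern assumes the NameNode main class appears on the process command line, and 5000 is a 5-second sample interval in milliseconds):

$ NN_PID=$(pgrep -f org.apache.hadoop.hdfs.server.namenode.NameNode)

$ jstat -gccause "$NN_PID" 5000

If the FGC (full GC count) column climbs steadily during startup, the heap is too small.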

Explorer
Posts: 27
Registered: ‎04-08-2016

Re: Every cluster restart, the Name Service stays in safe mode for an hour before the cluster finally goes green

Increasing the heap helped: safe mode now lasts 17 minutes, which is an improvement.

 

Now that we have switched from RPMs to parcel packages, we will implement high availability for the name service (2 NameNodes instead of NameNode + Secondary NameNode) and adopt a restart procedure with manual failovers, which will hopefully avoid this long restart going forward. A sketch of that procedure is below.
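The rough plan (a sketch; nn1/nn2 are placeholder NameNode IDs for our nameservice, the haadmin subcommands themselves are standard):

$ hdfs haadmin -getServiceState nn1

$ hdfs haadmin -failover nn1 nn2

i.e. make nn2 active, restart nn1 while it is standby, then fail back and restart nn2.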

 

Thanks.

Explorer
Posts: 27
Registered: ‎04-08-2016

Re: Every cluster restart, the Name Service stays in safe mode for an hour before the cluster finally goes green

This slow cluster restart continues to be a problem for us, especially after a RedHat reboot (to apply security updates).

 

Initially the logs say there are 0 DataNodes, then eventually 1, then 2, and so on (we only have 5 DataNodes in this small cluster).
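The same count can be watched outside the logs, assuming this version's report still prints a "Live datanodes" summary line:

$ hdfs dfsadmin -report | grep -i 'live datanodes'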

 

In the meantime it goes crazy with 'under replicated blocks'. In reality there were no under-replicated blocks prior to the cluster stop, but since the NameNode does not immediately find all the replicas at startup, it appears to launch a big block-recovery activity, possibly hindering recovery/startup times.

 

1) Why is it slow to discover restarted DataNodes, and

2) Is there a way to delay under-replicated block recovery after a restart (for, say, 15 minutes)?
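On question 2, one knob we are considering (a sketch; 900000 ms = 15 minutes is just our guess, and whether this fully avoids the replication storm is an assumption) is keeping the NameNode in safe mode longer, since replication work is not scheduled while safe mode is on:

<property>
  <name>dfs.namenode.safemode.extension</name>
  <!-- stay in safe mode 15 minutes after the block threshold is reached (milliseconds) -->
  <value>900000</value>
</property>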

 

This appears to cause another follow-on problem: once the cluster has finally recovered and is considered green, some Impala daemons are non-functional.

 

In this case:

 

3) It does not appear there is a canary test for every Impala node, and

4) Impala daemons which are dead are marked green/healthy but unable to respond to queries.

 

Our work-around is, after every restart, to bounce Impala once more at the end. Then all is well.
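As a manual canary (a sketch; the host names are placeholders for our 5 nodes, and -i/-B/-q are standard impala-shell options), we run a trivial query against each daemon before trusting the green status:

for host in dn1 dn2 dn3 dn4 dn5; do
  # 21000 is the default impala-shell port on each impalad
  impala-shell -i "$host:21000" -B -q 'select 1' >/dev/null 2>&1 \
    && echo "$host OK" || echo "$host DEAD"
done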

 

All of this behavior (slow starts, inaccurate Impala health) happens with parcels 5.8.2 as well as 5.12.1.

 

RedHat 6.9 with the latest security updates.