Namenode/Datanode slow start (6+ hours)

Hi,

I am relatively new to setting up and administering HDFS and find myself administering a proof-of-concept HDFS setup. We have a single machine (128 cores, 512 GB RAM, 15 TB storage) that is running the namenode, datanode, secondary namenode, etc. There are about 23 million blocks on this server, occupying just under 1 TB of disk space. The block count is unusually high because the data records are quite small.
The issue we are facing is that the namenode/datanode startup is taking very long (6+ hours). My guess is that the datanode is taking excessively long to prepare the block report and sync it with the namenode, but that's just a guess.
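To sanity-check the safe-mode/block-report theory, I have been polling the cluster while it starts. These are stock HDFS admin commands plus jstat; `<datanode-pid>` is a placeholder for the actual DataNode process id:

```shell
# Safe mode stays ON until the namenode has processed the full block report
hdfs dfsadmin -safemode get

# Live datanodes and per-node block counts as seen by the namenode
hdfs dfsadmin -report

# Watch heap occupancy / GC activity of the DataNode JVM while it scans the volume
jstat -gcutil <datanode-pid> 5000
```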
The CPU usage is fairly low while this happens, and the namenode heap (128 GB) stays around 40% the whole time. I see the following in the namenode logs:

2019-10-23 12:39:15,761 INFO ipc.Server (Server.java:logException(2428)) - IPC Server handler 2330 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 22.243.33.26:41190 Call#1303 Retry#0: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
The reported blocks 0 needs additional 23891420 blocks to reach the threshold 1.0000 of total blocks 23891419.
The number of live datanodes 0 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
2019-10-23 12:40:15,784 INFO ipc.Server (Server.java:logException(2428)) - IPC Server handler 2330 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 22.243.33.26:41322 Call#1305 Retry#0: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
The reported blocks 0 needs additional 23891420 blocks to reach the threshold 1.0000 of total blocks 23891419.
The number of live datanodes 0 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
2019-10-23 12:41:15,807 INFO ipc.Server (Server.java:logException(2428)) - IPC Server handler 898 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from 22.243.33.26:41488 Call#1307 Retry#0: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
The reported blocks 0 needs additional 23891420 blocks to reach the threshold 1.0000 of total blocks 23891419.
The number of live datanodes 0 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.

While the datanode logs show the following:

2019-10-23 12:42:06,445 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1241ms
GC pool 'ParNew' had collection(s): count=1 time=1624ms
2019-10-23 12:42:08,209 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1264ms
GC pool 'ParNew' had collection(s): count=1 time=1616ms
2019-10-23 12:42:10,209 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1499ms
GC pool 'ParNew' had collection(s): count=1 time=1858ms
2019-10-23 12:42:12,405 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1695ms
GC pool 'ParNew' had collection(s): count=1 time=2054ms
2019-10-23 12:42:14,270 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1364ms
GC pool 'ParNew' had collection(s): count=1 time=1742ms
2019-10-23 12:42:16,152 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1381ms
GC pool 'ParNew' had collection(s): count=1 time=1756ms
2019-10-23 12:42:18,128 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1475ms
GC pool 'ParNew' had collection(s): count=1 time=1802ms
2019-10-23 12:42:20,126 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1498ms
GC pool 'ParNew' had collection(s): count=1 time=1872ms
2019-10-23 12:42:21,999 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1372ms
GC pool 'ParNew' had collection(s): count=1 time=1746ms
2019-10-23 12:42:24,108 INFO util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1608ms
GC pool 'ParNew' had collection(s): count=1 time=1985ms

During this time, the datanode process appears to be running in Ambari, but not live:

[Screenshot: Annotation 2019-10-23 125852.png - Ambari shows the DataNode process as started, but not live]

Edit: When the datanode finishes whatever it's doing, I see the following in the datanode logs:

    GC pool 'ParNew' had collection(s): count=1 time=1864ms
    2019-10-23 13:05:26,724 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1597ms
    GC pool 'ParNew' had collection(s): count=1 time=1979ms
    2019-10-23 13:05:26,747 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(193)) - Time to add replicas to map for block pool BP-814497463-127.0.0.1-1558792659773 on volume /hadoop/hdfs/data/current: 12658576ms
    2019-10-23 13:05:26,748 INFO  impl.FsDatasetImpl (FsVolumeList.java:getAllVolumesMap(219)) - Total time to add all replicas to map: 12658578ms
    2019-10-23 13:05:26,794 INFO  datanode.VolumeScanner (VolumeScanner.java:findNextUsableBlockIter(383)) - Now scanning bpid BP-814497463-127.0.0.1-1558792659773 on volume /hadoop/hdfs/data
    2019-10-23 13:05:26,796 INFO  datanode.DirectoryScanner (DirectoryScanner.java:start(334)) - Periodic Directory Tree Verification scan starting at 1571848750796 with interval 21600000
    2019-10-23 13:05:26,800 INFO  datanode.DataNode (BPServiceActor.java:register(715)) - Block pool BP-814497463-127.0.0.1-1558792659773 (Datanode Uuid 0aff4a22-3f1a-485b-9aec-46fd881dfab0) service to xxxx.corp.com/22.243.33.26:8020 beginning handshake with NN
    2019-10-23 13:05:26,939 INFO  datanode.DataNode (BPServiceActor.java:register(734)) - Block pool Block pool BP-814497463-127.0.0.1-1558792659773 (Datanode Uuid 0aff4a22-3f1a-485b-9aec-46fd881dfab0) service to xxxx.corp.com/22.243.33.26:8020 successfully registered with NN
    2019-10-23 13:05:26,940 INFO  datanode.DataNode (DataNode.java:registerBlockPoolWithSecretManager(1443)) - Block token params received from NN: for block pool BP-814497463-127.0.0.1-1558792659773 keyUpdateInterval=600 min(s), tokenLifetime=600 min(s)
    2019-10-23 13:05:26,952 INFO  block.BlockTokenSecretManager (BlockTokenSecretManager.java:addKeys(193)) - Setting block keys
    2019-10-23 13:05:26,952 INFO  datanode.DataNode (BPServiceActor.java:offerService(569)) - For namenode xxxx.corp.com/22.243.33.26:8020 using BLOCKREPORT_INTERVAL of 21600000msec CACHEREPORT_INTERVAL of 10000msec Initial delay: 120000msec; heartBeatInterval=3000
    2019-10-23 13:05:27,066 INFO  datanode.DataNode (BPOfferService.java:updateActorStatesFromHeartbeat(513)) - Namenode Block pool BP-814497463-127.0.0.1-1558792659773 (Datanode Uuid 0aff4a22-3f1a-485b-9aec-46fd881dfab0) service to xxxx.corp.com/22.243.33.26:8020 trying to claim ACTIVE state with txid=198275130
    2019-10-23 13:05:27,066 INFO  datanode.DataNode (BPOfferService.java:updateActorStatesFromHeartbeat(525)) - Acknowledging ACTIVE Namenode Block pool BP-814497463-127.0.0.1-1558792659773 (Datanode Uuid 0aff4a22-3f1a-485b-9aec-46fd881dfab0) service to xxxx.corp.com/22.243.33.26:8020
    2019-10-23 13:05:33,542 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1316ms
    GC pool 'ParNew' had collection(s): count=1 time=1645ms
    2019-10-23 13:05:41,591 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1547ms
    GC pool 'ParNew' had collection(s): count=1 time=1859ms
    2019-10-23 13:05:49,487 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1393ms
    GC pool 'ParNew' had collection(s): count=1 time=1802ms
    2019-10-23 13:05:57,902 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1413ms
    GC pool 'ParNew' had collection(s): count=1 time=1889ms
    2019-10-23 13:06:01,296 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1391ms
    GC pool 'ParNew' had collection(s): count=1 time=1801ms
    2019-10-23 13:06:04,065 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1268ms
    GC pool 'ParNew' had collection(s): count=1 time=1761ms
    2019-10-23 13:06:05,775 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1209ms
    GC pool 'ParNew' had collection(s): count=1 time=1633ms
    2019-10-23 13:06:05,785 INFO  web.DatanodeHttpServer (SimpleHttpProxyHandler.java:exceptionCaught(147)) - Proxy for / failed. cause:
    java.io.IOException: Connection reset by peer
            at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
            at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
            at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
            at sun.nio.ch.IOUtil.read(IOUtil.java:192)
            at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
            at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:368)
            at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:891)
            at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:243)
            at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
            at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646)
            at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581)
            at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
            at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460)
            at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
            at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
            at java.lang.Thread.run(Thread.java:745)
    2019-10-23 13:06:07,425 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1148ms
    GC pool 'ParNew' had collection(s): count=1 time=1567ms
    2019-10-23 13:06:10,114 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1688ms
    GC pool 'ParNew' had collection(s): count=1 time=1888ms
    2019-10-23 13:06:10,253 INFO  datanode.DataNode (BPServiceActor.java:blockReport(395)) - Unsuccessfully sent block report 0x31d5073bda87,  containing 1 storage report(s), of which we sent 0. The reports had 23891422 total blocks and used 0 RPC(s). This took 5603 msec to generate and 6144 msecs for RPC and NN processing. Got back no commands.
    2019-10-23 13:06:10,254 WARN  datanode.DataNode (BPServiceActor.java:offerService(681)) - IOException in offerService
    java.io.EOFException: End of File Exception between local host is: "xxxx.corp.com/hidden.private.ip.address"; destination host is: "xxxx.corp.com":8020; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
            at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
            at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
            at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
            at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
            at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
            at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765)
            at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1558)
            at org.apache.hadoop.ipc.Client.call(Client.java:1498)
            at org.apache.hadoop.ipc.Client.call(Client.java:1398)
            at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
            at com.sun.proxy.$Proxy15.blockReport(Unknown Source)
            at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:216)
            at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:377)
            at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:653)
            at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:793)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.EOFException
            at java.io.DataInputStream.readInt(DataInputStream.java:392)
            at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1119)
            at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1014)
    2019-10-23 13:06:13,307 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(196)) - Detected pause in JVM or host machine (eg GC): pause of approximately 1193ms

And the namenode shows the following output:

2019-10-23 13:05:16,516 INFO  ipc.Server (Server.java:logException(2428)) - IPC Server handler 986 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from hidden.private.ip.address:46066 Call#1355 Retry#0: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
    The reported blocks 0 needs additional 23891420 blocks to reach the threshold 1.0000 of total blocks 23891419.
    The number of live datanodes 0 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
    2019-10-23 13:05:26,854 INFO  hdfs.StateChange (DatanodeManager.java:registerDatanode(954)) - BLOCK* registerDatanode: from DatanodeRegistration(hidden.private.ip.address:50010, datanodeUuid=0aff4a22-3f1a-485b-9aec-46fd881dfab0, infoPort=50075, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-56;cid=CID-49b9105e-fc0d-4ea4-9d2f-caceb95ce4bb;nsid=875919329;c=0) storage 0aff4a22-3f1a-485b-9aec-46fd881dfab0
    2019-10-23 13:05:26,860 INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateHeartbeatState(401)) - Number of failed storage changes from 0 to 0
    2019-10-23 13:05:26,916 INFO  net.NetworkTopology (NetworkTopology.java:add(427)) - Adding a new node: /default-rack/hidden.private.ip.address:50010
    2019-10-23 13:05:26,917 INFO  blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:registerNode(205)) - Registered DN 0aff4a22-3f1a-485b-9aec-46fd881dfab0 (hidden.private.ip.address:50010).
    2019-10-23 13:05:26,919 INFO  hdfs.StateChange (FSNamesystem.java:reportStatus(5952)) - STATE* Safe mode ON.
    The reported blocks 0 needs additional 23891420 blocks to reach the threshold 1.0000 of total blocks 23891419.
    The number of live datanodes 1 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
    2019-10-23 13:05:27,028 INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateHeartbeatState(401)) - Number of failed storage changes from 0 to 0
    2019-10-23 13:05:27,032 INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(854)) - Adding new storage ID DS-ea7ed3be-90ad-4424-a00c-577601814d81 for DN hidden.private.ip.address:50010
    2019-10-23 13:06:08,024 WARN  ipc.Server (Server.java:checkDataLength(1728)) - Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address
    2019-10-23 13:06:08,026 INFO  ipc.Server (Server.java:doRead(1006)) - Socket Reader #1 for port 8020: readAndProcess from client hidden.private.ip.address threw exception [java.io.IOException: Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address]
    java.io.IOException: Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address
            at org.apache.hadoop.ipc.Server$Connection.checkDataLength(Server.java:1729)
            at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1791)
            at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:995)
            at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:852)
            at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:823)
    2019-10-23 13:06:16,541 INFO  ipc.Server (Server.java:logException(2428)) - IPC Server handler 2890 on 8020, call org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol.rollEditLog from hidden.private.ip.address:46214 Call#1357 Retry#0: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
    The reported blocks 0 needs additional 23891420 blocks to reach the threshold 1.0000 of total blocks 23891419.
    The number of live datanodes 1 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
    2019-10-23 13:06:22,081 WARN  ipc.Server (Server.java:checkDataLength(1728)) - Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address
    2019-10-23 13:06:22,081 INFO  ipc.Server (Server.java:doRead(1006)) - Socket Reader #1 for port 8020: readAndProcess from client hidden.private.ip.address threw exception [java.io.IOException: Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address]
    java.io.IOException: Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address
            at org.apache.hadoop.ipc.Server$Connection.checkDataLength(Server.java:1729)
            at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1791)
            at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:995)
            at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:852)
            at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:823)
    2019-10-23 13:06:33,432 WARN  ipc.Server (Server.java:checkDataLength(1728)) - Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address
    2019-10-23 13:06:33,432 INFO  ipc.Server (Server.java:doRead(1006)) - Socket Reader #1 for port 8020: readAndProcess from client hidden.private.ip.address threw exception [java.io.IOException: Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address]
    java.io.IOException: Requested data length 291562685 is longer than maximum configured RPC length 291562661.  RPC came from hidden.private.ip.address
            at org.apache.hadoop.ipc.Server$Connection.checkDataLength(Server.java:1729)
            at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1791)
            at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:995)
            at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:852)
            at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:823)

Questions:

1. Is this startup time normal for such volumes of data?
2. Is there anything I could do to debug or troubleshoot this? My guess at this point is that it points to a heap issue, since the GC pause messages keep appearing every few seconds.
3. Are there any recommendations to improve the startup time?
4. Are there any guidelines to estimate/benchmark the startup time given the size of the data or the number of blocks?
5. Why does the IPC data packet length keep increasing? I had to increase it twice before - is it somehow correlated to the size of the data/number of blocks?

Thank you in advance.
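
For context on the IPC data length question: the property I've had to raise is `ipc.maximum.data.length` in core-site.xml (default 64 MB). The namenode warning above shows the serialized block report (~278 MB for the ~23.9M blocks) has again outgrown the configured maximum. The value below is illustrative, not a recommendation:

```xml
<!-- core-site.xml: maximum size of a single IPC message the NameNode
     will accept. The datanode's full block report must fit in one
     message; with ~23.9M blocks ours is ~278 MB and grows with the
     block count. 384 MB here is an illustrative headroom value. -->
<property>
  <name>ipc.maximum.data.length</name>
  <value>402653184</value>
</property>
```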