
NameNode and Standby NameNode stopped automatically

Rising Star

Hi,

I have 1 active NameNode, 1 standby NameNode, 4 DataNodes, and 3 JournalNodes.

Before enabling NameNode HA, the cluster was working fine. Since we enabled NameNode HA, we have been facing a problem where the NameNode stops automatically. The following logs were found on the NameNode:

2018-03-16 05:49:34,799 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1648)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-03-16 05:49:35,684 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 16048 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.10.20.5:8485]
2018-03-16 05:49:36,686 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 17050 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.10.20.5:8485]
2018-03-16 05:49:37,688 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 18052 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.10.20.5:8485]
2018-03-16 05:49:37,850 INFO BlockStateChange (BlockManager.java:computeReplicationWorkForBlocks(1648)) - BLOCK* neededReplications = 0, pendingReplications = 0.
2018-03-16 05:49:38,690 WARN client.QuorumJournalManager (QuorumCall.java:waitFor(134)) - Waited 19054 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [10.10.20.5:8485]

2018-03-16 05:49:39,637 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.10.20.5:8485, 10.10.20.15:8485, 10.10.20.13:8485], stream=QuorumOutputStream starting at txid 5667564))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
    at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:707)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:641)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3722)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:912)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:548)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)
2018-03-16 05:49:39,638 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(72)) - Aborting QuorumOutputStream starting at txid 5667564
2018-03-16 05:49:39,647 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1
2018-03-16 05:49:39,663 INFO namenode.NameNode (LogAdapter.java:info(47)) - SHUTDOWN_MSG:
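A note on the quorum arithmetic behind this failure: with 3 JournalNodes, the quorum journal manager needs acknowledgements from a majority, i.e. floor(3/2) + 1 = 2 nodes, before an edit-log flush can succeed. The log above shows only one JournalNode (10.10.20.5:8485) responding, so once dfs.qjournal.write-txns.timeout.ms (default 20000 ms, matching the timeout=20000 ms in the log) expires, the flush fails and the NameNode exits by design, since it cannot safely continue without a durable edit log.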

Please assist!

1 ACCEPTED SOLUTION

Rising Star

Hi,

I am able to telnet and ping the target machine, both by IP and by hostname.

The problem seems to have been resolved. Today the HA NameNodes are in a running state. I made some changes; let me share them with you all.

1. First, I have 3 ZooKeeper servers running. I had moved the ZooKeeper server from server 1 to server 4, but ha.zookeeper.quorum was still server1.zk.com:2181,server2.zk.com:2181,server3.zk.com:2181 even after restarting all services.

I changed it manually to server4.zk.com:2181,server2.zk.com:2181,server3.zk.com:2181.

2. Second, I made some modifications to the ZooKeeper configuration in zoo.cfg:

changed the defaults to syncLimit=15, tickTime=4000, and initLimit=30.

Now a follower may take up to syncLimit * tickTime = 15 * 4000 ms = 60000 ms (60 s) to sync with the leader before it is dropped.

3. I defined the suggested properties in the configuration files (a combined sketch of the edited files appears after this list):

hdfs-site.xml

dfs.qjournal.start-segment.timeout.ms=90000
dfs.qjournal.select-input-streams.timeout.ms=90000
dfs.qjournal.write-txns.timeout.ms=90000

core-site.xml

ipc.client.connect.timeout=90000

4. Of the three JournalNodes, two are installed on the two NameNode hosts and the third on a DataNode server.
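For reference, here is a minimal sketch of how these changes might look in the actual files. The hostnames and property values are the ones quoted in this thread; the dataDir and server.N lines in zoo.cfg are illustrative assumptions, not taken from the original post.

zoo.cfg (on each ZooKeeper host):

tickTime=4000
initLimit=30
syncLimit=15
# Assumed example paths/hosts below; substitute your actual data directory and ensemble members
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=server4.zk.com:2888:3888
server.2=server2.zk.com:2888:3888
server.3=server3.zk.com:2888:3888

hdfs-site.xml:

<property>
  <name>dfs.qjournal.start-segment.timeout.ms</name>
  <value>90000</value>
</property>
<property>
  <name>dfs.qjournal.select-input-streams.timeout.ms</name>
  <value>90000</value>
</property>
<property>
  <name>dfs.qjournal.write-txns.timeout.ms</name>
  <value>90000</value>
</property>

core-site.xml:

<property>
  <name>ha.zookeeper.quorum</name>
  <value>server4.zk.com:2181,server2.zk.com:2181,server3.zk.com:2181</value>
</property>
<property>
  <name>ipc.client.connect.timeout</name>
  <value>90000</value>
</property>

Restart the affected services after editing so the new quorum and timeouts take effect.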

Today everything seems to be working fine.


14 REPLIES


Master Mentor

@Vinay K

@Sandeep Kumar

It's great that your problem has been resolved. However, it isn't normal to credit yourself with the correct answer when other HCC members contributed to it, namely Sandeep and me.

Your solution includes points that were suggested by me:

1. Quorum of ZooKeeper
3. Changes in the hdfs-site.xml/core-site.xml config
4. JournalNodes (Sandeep too)

So, taking the above into account, I guess someone else merited the points 🙂

Rising Star

@Geoffrey Shelton

If you are not comfortable with that, I can remove point 2. I found the 2nd point in the ZooKeeper documentation.

Master Mentor

@Vinay K

It's not important to me, but ethically it matters. I am fine with it all 🙂


Rising Star

@Geoffrey S. O.

Maybe that point is not meaningful.

Thanks for helping me. You guys spend your precious time in the community; that is appreciated.