Ambari cluster: both NameNodes are in standby

We start the services in our Ambari cluster in the following order (after a reboot):

1. Start ZooKeeper

2. Start the JournalNodes

3. Start the NameNodes (on the master01 machine and on the master02 machine)


We noticed that both NameNodes are in standby.

How can we force one of the nodes to become active?

From the log:

 tail -200 hadoop-hdfs-namenode-master03.sys65.com.log

rics to be sent will be discarded. This message will be skipped for the next 20 times.
2017-12-04 18:56:03,649 WARN  namenode.FSEditLog (JournalSet.java:selectInputStreams(280)) - Unable to determine input streams from QJM to [152.87.28.153:8485, 152.87.28.152:8485, 152.87.27.162:8485]. Skipping.
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:278)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1590)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1614)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:251)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:402)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:355)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:372)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:476)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:368)
2017-12-04 18:56:03,650 INFO  namenode.FSNamesystem (FSNamesystem.java:writeUnlock(1658)) - FSNamesystem write lock held for 20005 ms via
java.lang.Thread.getStackTrace(Thread.java:1556)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:945)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1658)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:285)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:402)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:355)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:372)
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:476)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:368)
        Number of suppressed write-lock reports: 0
        Longest write-lock held interval: 20005


2017-12-04 19:03:43,792 INFO  ha.EditLogTailer (EditLogTailer.java:triggerActiveLogRoll(323)) - Triggering log roll on remote NameNode
2017-12-04 19:03:43,820 INFO  ha.EditLogTailer (EditLogTailer.java:triggerActiveLogRoll(334)) - Skipping log roll. Remote node is not in Active state: Operation category JOURNAL is not supported in state standby
2017-12-04 19:03:49,824 INFO  client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 6001 ms (timeout=20000 ms) for a response for selectInputStreams. Succeeded so far:
2017-12-04 19:03:50,825 INFO  client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 7003 ms (timeout=20000 ms) for a response for selectInputStreams. Succeeded so far:



Michael-Bronson
1 ACCEPTED SOLUTION

Master Mentor
@Michael Bronson

We see the following error in your NameNode log:

2017-12-05 21:46:14,814 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.164.28.153:8485, 10.164.28.152:8485, 10.164.27.162:8485], stream=null))
java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)


This indicates that the JournalNodes have some issue, and hence the NameNode is not coming up.


This kind of error can be caused by corruption of the 'edits_inprogress_xxxxxxx' file on a JournalNode.


So please check whether the 'edits_inprogress_xxxxxxx' files on the JournalNodes are corrupt; corrupt files need to be removed.
Please move the corrupt "edits_inprogress" file to /tmp (which also keeps a backup of it), or copy the journal edits directory ("/hadoop/hdfs/journal/XXXXXXX/current") from a functioning JournalNode to this node, and then restart the JournalNode and NameNode services. You can check the JournalNode logs on all 3 nodes to find out which JournalNode is running without errors.
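
As a rough sketch of the "move the corrupt file aside" step above: the helper below moves every 'edits_inprogress_*' file from a JournalNode's "current" directory into a backup directory rather than deleting it. The function name backup_inprogress_edits and the backup location are hypothetical, and the sketch assumes the JournalNode on that host has been stopped before any file is moved.

```shell
# Hypothetical helper, not part of Hadoop: move in-progress edit segments
# out of a JournalNode "current" directory into a backup directory.
# Assumption: the JournalNode on this host is stopped while this runs.
backup_inprogress_edits() {
    journal_current_dir=$1   # e.g. /hadoop/hdfs/journal/XXXXXXX/current
    backup_dir=$2            # e.g. a dated directory under /tmp

    if [ ! -d "$journal_current_dir" ]; then
        echo "no such directory: $journal_current_dir" >&2
        return 1
    fi
    mkdir -p "$backup_dir" || return 1

    moved=0
    for f in "$journal_current_dir"/edits_inprogress_*; do
        [ -e "$f" ] || continue          # the glob matched nothing
        mv -- "$f" "$backup_dir"/ || return 1
        echo "moved $(basename "$f")"
        moved=$((moved + 1))
    done
    echo "total moved: $moved"
}
```

Finalized segments (files named 'edits_<start>-<end>') and the VERSION file are left untouched, so only the in-progress segment is taken out of play. Afterwards, restart the JournalNode and the NameNode as described above.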


24 REPLIES

How do I put the NameNode log in "tail" mode?

Second, how do I force the port to start?

Michael-Bronson

From the log I can see that:

Getting jmx metrics from NN failed. URL: http://<master>:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
Traceback (most recent call last):
Michael-Bronson

Master Mentor

@Michael Bronson

On your NameNode host you will find the NameNode log file, which you can follow with tail as shown here.

Example:

# tail -f /var/log/hadoop/hdfs/hadoop-hdfs-namenode-xxxxxxxxxxxxxx.log


The JMX URL error shows that fetching the JMX metrics from the NameNode failed because port 50070 seems to be down.


Regarding your query: "how to force the port to start?"

The only way to make sure that the port is opened properly is to ensure that the NameNode starts without any error. So please check the NameNode log to see whether it contains any errors.
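
To confirm whether the port is reachable at all, a small check like the following can be run on the NameNode host. This is only a sketch: port_is_open is a hypothetical helper, and it assumes bash (for the /dev/tcp pseudo-device) and the coreutils "timeout" command are available.

```shell
# Hypothetical helper: succeed (exit 0) if host:port accepts a TCP connection.
# Assumes bash and coreutils `timeout` are installed.
port_is_open() {
    host=$1
    port=$2
    # Inside bash -c, $0 is the host and $1 is the port.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example:
#   port_is_open localhost 50070 && echo "open" || echo "closed"
```

A "closed" result here says only that nothing is listening yet; the reason still has to come from the NameNode log.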


Master Mentor

@Michael Bronson

As the JournalNodes and the ZooKeeper Failover Controllers (ZKFC) are not running, please restart those components first.

It will be best to restart the whole HDFS service from the Ambari UI.

Ambari UI --> HDFS --> "Service Actions" (drop-down) --> Restart All

Then please check whether all components come up fine.

Please share the complete logs of all the components (such as the NameNode, JournalNode, and ZKFC logs) that fail to restart successfully.
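
Once the restart has finished, the HA state of each NameNode can be checked from the command line. The sketch below wraps "hdfs haadmin -getServiceState"; the helper name print_ha_states is hypothetical, and the service IDs "nn1" and "nn2" are assumptions (the real IDs are listed in the dfs.ha.namenodes.<nameservice> property of hdfs-site.xml).

```shell
# Hypothetical helper: print the HA state of each NameNode service ID.
# "hdfs haadmin -getServiceState <id>" prints "active" or "standby".
print_ha_states() {
    for nn_id in "$@"; do
        # If the command fails (for example the NameNode RPC port is down),
        # report the NameNode as unreachable instead of aborting.
        state=$(hdfs haadmin -getServiceState "$nn_id" 2>/dev/null) || state="unreachable"
        echo "$nn_id: $state"
    done
}

# Example (run as the hdfs user on a cluster node; IDs are assumed):
#   print_ha_states nn1 nn2
```

With automatic failover enabled, the ZKFCs normally elect the active NameNode, so two "standby" results usually point at a ZKFC, ZooKeeper, or JournalNode problem rather than at something to be forced by hand.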


Master Mentor

@Michael Bronson

You are right that we first need to focus on what is blocking the port, or why the port does not start.

To find that out, we will need to look at the NameNode logs to determine whether a port conflict is being logged, or whether there are any errors or exceptions that prevent the NameNode from opening the port successfully.
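
A quick way to pull the relevant lines out of a large NameNode log is sketched below. The helper name scan_namenode_log is hypothetical; the patterns assume the usual log4j layout (a level such as FATAL or ERROR surrounded by spaces) and the standard Java "BindException" / "Address already in use" messages for a port conflict.

```shell
# Hypothetical helper: show the most recent FATAL/ERROR lines and any
# port-bind failures from a NameNode log file.
scan_namenode_log() {
    log_file=$1
    max_lines=${2:-20}    # how many matching lines to show (default 20)
    grep -E ' (FATAL|ERROR) |BindException|Address already in use' "$log_file" \
        | tail -n "$max_lines"
}

# Example (log file name is elided as in the tail example earlier in the thread):
#   scan_namenode_log /var/log/hadoop/hdfs/hadoop-hdfs-namenode-xxxxxxxxxxxxxx.log
```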

The errors are shown below.

How can we understand from these errors why the port is down?

    org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
2017-12-05 20:33:23,716 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [34.98.28.153:8485, 34.98.28.152:8485, 34.98.27.162:8485], stream=null))
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
    org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
2017-12-05 21:03:41,334 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [34.98.28.153:8485, 34.98.28.152:8485, 34.98.27.162:8485], stream=null))
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
Michael-Bronson

The errors are:

2017-12-05 21:46:14,814 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [100.164.28.153:8485, 100.164.28.152:8485, 100.164.27.162:8485], stream=null))
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)

I also checked the port with telnet:

telnet localhost 50070
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
Michael-Bronson

The picture for now is that the JournalNodes are running, and the ZooKeeper Failover Controllers are running as well.

Second, we performed a full restart more than twice, but without results.

Michael-Bronson

@Jay maybe we first need to focus on what is blocking the port, or why the port does not start.

Michael-Bronson

The logs are attached: namenodelog.txt

Michael-Bronson