Created on 12-04-2017 07:00 PM - edited 08-17-2019 08:07 PM
We start the services in our Ambari cluster in the following order (after a reboot):
1. Start ZooKeeper
2. Start the JournalNodes
3. Start the NameNodes (on the master01 machine and on the master02 machine)
We noticed that both NameNodes are in standby state.
How can we force one of the NameNodes to become active?
From the log:
tail -200 hadoop-hdfs-namenode-master03.sys65.com.log

...rics to be sent will be discarded. This message will be skipped for the next 20 times.
2017-12-04 18:56:03,649 WARN namenode.FSEditLog (JournalSet.java:selectInputStreams(280)) - Unable to determine input streams from QJM to [152.87.28.153:8485, 152.87.28.152:8485, 152.87.27.162:8485]. Skipping.
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:471)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:278)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1590)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1614)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:251)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:402)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:355)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:372)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:476)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:368)
2017-12-04 18:56:03,650 INFO namenode.FSNamesystem (FSNamesystem.java:writeUnlock(1658)) - FSNamesystem write lock held for 20005 ms via
java.lang.Thread.getStackTrace(Thread.java:1556)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:945)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1658)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:285)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:402)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:355)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:372)
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:476)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:368)
Number of suppressed write-lock reports: 0
Longest write-lock held interval: 20005
2017-12-04 19:03:43,792 INFO ha.EditLogTailer (EditLogTailer.java:triggerActiveLogRoll(323)) - Triggering log roll on remote NameNode
2017-12-04 19:03:43,820 INFO ha.EditLogTailer (EditLogTailer.java:triggerActiveLogRoll(334)) - Skipping log roll. Remote node is not in Active state: Operation category JOURNAL is not supported in state standby
2017-12-04 19:03:49,824 INFO client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 6001 ms (timeout=20000 ms) for a response for selectInputStreams. Succeeded so far:
2017-12-04 19:03:50,825 INFO client.QuorumJournalManager (QuorumCall.java:waitFor(136)) - Waited 7003 ms (timeout=20000 ms) for a response for selectInputStreams. Succeeded so far:
Created 12-05-2017 10:31 PM
We see the following error in your NameNode log:
2017-12-05 21:46:14,814 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.164.28.153:8485, 10.164.28.152:8485, 10.164.27.162:8485], stream=null))
java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond.
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
This indicates that the JournalNodes have issues, which is why the NameNode is not coming up.
This kind of error is typically caused by corruption of an 'edits_inprogress_xxxxxxx' file on a JournalNode.
So please check whether the 'edits_inprogress_xxxxxxx' files on the JournalNodes are corrupt; if they are, those files need to be removed.
Please move (or take a backup of) the corrupt "edits_inprogress" file to /tmp, or copy the edits directory ("/hadoop/hdfs/journal/XXXXXXX/current") from a functioning JournalNode to this node, and then restart the JournalNode and NameNode services. You can check the JournalNode logs on all 3 nodes to find out which JournalNode is functioning fine without errors.
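For example, a rough sketch of those recovery steps, assuming the default HDP journal directory layout and hdfs:hadoop ownership (the "XXXXXXX" nameservice directory and the healthy JournalNode hostname are placeholders):
On the problematic JournalNode, move the suspect in-progress segment out of the way:
# mv /hadoop/hdfs/journal/XXXXXXX/current/edits_inprogress_* /tmp/
Or replace the whole edits directory with a copy from a healthy JournalNode:
# mv /hadoop/hdfs/journal/XXXXXXX/current /hadoop/hdfs/journal/XXXXXXX/current.bad
# scp -r GOOD_JOURNALNODE_HOST:/hadoop/hdfs/journal/XXXXXXX/current /hadoop/hdfs/journal/XXXXXXX/
# chown -R hdfs:hadoop /hadoop/hdfs/journal/XXXXXXX/current
Then restart the JournalNode and the NameNodes from Ambari.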
.
Created 12-05-2017 06:18 PM
From the error message, it looks like some of the services might not be running. Can you please make sure that ZooKeeper and the JournalNodes are indeed running before starting the NameNode?
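For example, a quick generic check (2181 and 8485 are the default ZooKeeper and JournalNode ports; adjust if yours differ):
Check that the ZooKeeper and JournalNode processes exist on each master:
# ps -ef | egrep -i 'journalnode|quorumpeer' | grep -v grep
ZooKeeper should answer "imok" on its client port:
# echo ruok | nc localhost 2181
Each JournalNode should be listening on 8485:
# netstat -tnlpa | grep 8485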
Created 12-05-2017 06:35 PM
Not resolved yet?
Created 12-05-2017 06:40 PM
Yes, it is still not resolved. Both NameNodes either do not start up at all or start up as standby.
Created 12-05-2017 06:56 PM
It looks like you are using IP addresses instead of FQDNs (hostnames) for your components.
Example:
QJM to [152.87.28.153:8485, 152.87.28.152:8485, 152.87.27.162:8485]
.
Please make sure to use hostnames (FQDNs) when defining the addresses of your HDFS components. Do not use IP addresses.
Using a proper FQDN (hostname -f) is one of the major requirements for an HDFS cluster managed by Ambari.
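For example, a quick check, assuming the usual /etc/hadoop/conf location for the client configs:
Each host should report its fully qualified, lowercase name:
# hostname -f
And the HDFS configs should not contain raw IP addresses:
# grep -E '([0-9]{1,3}\.){3}[0-9]{1,3}' /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml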
.
Also, please check whether your QJM (JournalNode) processes are running fine on the mentioned hosts. Have the QJMs opened port "8485" properly? Are you noticing any errors in the QJM logs?
# netstat -tnlpa | grep 8485
# tail -f /var/log/hadoop/hdfs/hadoop-hdfs-journalnode-xxxxxxxxxxxx.log
.
Created 12-05-2017 07:17 PM
Yes, we get the following on all master servers:
netstat -tnlpa | grep 8485
tcp 0 0 0.0.0.0:8485 0.0.0.0:* LISTEN 14395/java
Created 12-05-2017 07:37 PM
Please check your hdfs-site and core-site configurations to confirm that you are using hostnames, not IP addresses, for the components.
Also, please double check that all hostnames are in lowercase (mixed-case or uppercase hostnames will cause such issues). Properties like "dfs.namenode.http-address", "dfs.namenode.http-address.$SERVICE_NAME.nn1", etc. should contain hostnames (not IP addresses).
Also, there should be no firewall issues when accessing the NameNode UI / JMX from the Ambari server host.
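For example, from the Ambari server host (the NameNode hostname below is a placeholder; 50070 is the default HTTP port):
# curl -s 'http://NAMENODE_FQDN:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' | head
# nc -zv NAMENODE_FQDN 50070
If these hang or are refused while the NameNode is up, a firewall/iptables rule is the likely cause.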
Created 12-05-2017 08:56 PM
Dear Jay, I checked everything you mentioned and it seems OK (yes, we use only hostnames in the XML files). Regarding "JMX from the Ambari server host" - what do we need to check there?
Second, I have been on this case for more than two days now. How can we debug it more deeply?
Created 12-05-2017 09:16 PM
I found something.
Reference - https://ambari.apache.org/1.2.3/installing-hadoop-using-ambari/content/reference_chap2_1.html
netstat -tnlpa | grep 50070
does not return any output, and this API also returns no output:
curl -s 'http://<master>:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
Created 12-05-2017 09:22 PM
Based on the "netstat" output we can see that port 50070 is not open on the NameNode host, which indicates that the NameNode might not have come up successfully.
So please check the NameNode logs first to see whether any errors are preventing the NameNode process from coming up cleanly, or whether there is any issue opening port 50070.
I suggest putting the NameNode log in "tail" mode and then restarting the whole HDFS service from the Ambari UI.
Created 12-05-2017 09:32 PM
How do I put the NameNode log in "tail" mode?
Second, how can I force the port to open?
Created 12-05-2017 09:37 PM
From the log I can see that:
Getting jmx metrics from NN failed. URL: http://<master>:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem Traceback (most recent call last):
Created 12-05-2017 09:41 PM
On your NameNode host you will find the NameNode log file (named like the example below), which you need to tail as follows:
Example:
# tail -f /var/log/hadoop/hdfs/hadoop-hdfs-namenode-xxxxxxxxxxxxxx.log
.
The JMX error shows that fetching the JMX metrics from the NameNode failed because port 50070 seems to be down.
.
Regarding your query: "how to force the port to start?"
>>>> The only way to make sure the port opens properly is to ensure that the NameNode starts fine without any errors. So please check the NameNode log to see whether there are any errors.
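For example (the log file name below follows the same placeholder pattern as above):
After a NameNode restart, port 50070 should be listening:
# netstat -tnlpa | grep 50070
And the log should be free of ERROR/FATAL entries:
# grep -E 'ERROR|FATAL' /var/log/hadoop/hdfs/hadoop-hdfs-namenode-xxxxxxxxxxxxxx.log | tail -20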
Created 12-05-2017 09:57 PM
Since the JournalNodes are not running, and neither are the ZooKeeper Failover Controllers (ZKFC), please restart those components first.
It will be best to try restarting the whole HDFS service from the Ambari UI:
Ambari UI --> HDFS --> "Service Actions" (drop-down) --> Restart All
Then please check whether all the components come up fine.
Please share the complete logs of all the components that fail to restart successfully (for example the NameNode, JournalNode, and ZKFC logs).
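For reference, with a default HDP layout the logs are usually found here (the hostname suffixes in the file names will differ per node):
# ls -l /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log
# ls -l /var/log/hadoop/hdfs/hadoop-hdfs-journalnode-*.log
# ls -l /var/log/hadoop/hdfs/hadoop-hdfs-zkfc-*.log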
Created 12-05-2017 10:09 PM
You are right that we first need to focus on what is blocking the port, or why the port does not open.
In order to find that out, we will need to see the NameNode logs to determine whether any port conflict is being logged, or whether there are any errors/exceptions preventing the NameNode from opening the port successfully.
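For example (50070 and 8020 are the default NameNode HTTP and RPC ports; adjust if yours are customized):
Check whether another process already holds the NameNode ports:
# netstat -tnlpa | egrep ':50070|:8020'
And look for bind errors in the NameNode log:
# grep -iE 'BindException|address already in use' /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log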
Created 12-05-2017 09:53 PM
The errors are below.
So how can we tell from these errors why the port is down?
    org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
2017-12-05 20:33:23,716 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [34.98.28.153:8485, 34.98.28.152:8485, 34.98.27.162:8485], stream=null))
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
    org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
2017-12-05 21:03:41,334 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [34.98.28.153:8485, 34.98.28.152:8485, 34.98.27.162:8485], stream=null))
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
Created 12-05-2017 09:56 PM
The errors are:
2017-12-05 21:46:14,814 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(398)) - Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [100.164.28.153:8485, 100.164.28.152:8485, 100.164.27.162:8485], stream=null))
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
I also checked this:
telnet localhost 50070
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
Created on 12-05-2017 10:01 PM - edited 08-17-2019 08:07 PM
The picture for now is: the JournalNodes are running, and the ZooKeeper Failover Controllers are running as well.
Second, we performed a full restart more than twice, but without any results.
Created 12-05-2017 10:07 PM
@Jay, maybe we need to focus first on what is blocking the port, or why the port does not open.
Created 12-05-2017 10:18 PM
The logs: namenodelog.txt
Created 12-05-2017 10:05 PM
Good to see that both ZKFCs, all 3 JournalNodes, and all 4 DataNodes are now running (green).
Regarding both NameNodes being down: we will need to investigate the NameNode logs in order to find out why they are not running.
So can you please share/attach the complete NameNode logs?