Failed to find datanode, suggest to check cluster health. excludeDatanodes=null

Explorer

Hello guys,

I have set up a cluster, and HDFS is up and running.

However, MapReduce2 and Spark do not start. They fail with the error below:

{
  "RemoteException": {
    "exception": "IOException", 
    "javaClassName": "java.io.IOException", 
    "message": "Failed to find datanode, suggest to check cluster health. excludeDatanodes=null"
  }
}

HBase also fails with the error below:

org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /apps/hbase/data/.tmp/hbase.version could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.

Somehow the datanodes are not detected.

I updated the hosts file on all nodes with the private IP and private DNS name, but I still get the same error.

Can someone please help me with this?

Thanks,

Nirmal J

1 ACCEPTED SOLUTION

Master Mentor

@Nirmal J

I see that the rpc-address is set to the short hostname, but the "hostname -f" output on the NameNode shows "ip-10-0-223-116.ec2.internal". So ideally dfs.namenode.rpc-address should use "ip-10-0-223-116.ec2.internal:8020" instead of "ip-10-0-223-116:8020".

# grep -B 2 -A 2 'rpc-address' /Users/jsensharma/Downloads/41589-hdfs-site.xml 
  <property>
  	<name>dfs.namenode.rpc-address</name>
  	<value>ip-10-0-223-116:8020</value>
  </property>

The same applies to "fs.defaultFS": it should not use "localhost". That seems to be causing the issue here, and it should also be changed to the hostname of the NameNode.

# grep -B 2 -A 2 'localhost' /Users/jsensharma/Downloads/41588-core-site.xml
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:8020</value>
      <final>true</final>
    </property>


It is also strange that other addresses, like the following, are set to "localhost":

# grep -B 2 -A 2 'localhost' /Users/jsensharma/Downloads/41589-hdfs-site.xml 
  <property>
  	<name>dfs.namenode.http-address</name>
  	<value>localhost:50070</value>
  	<final>true</final>
  	</property>

  <property>
  	<name>dfs.namenode.https-address</name>
  	<value>localhost:50470</value>
  </property>

  <property>
  	<name>dfs.namenode.secondary.http-address</name>
  	<value>localhost:50090</value>
  </property>





So can you please try this:


1. Edit the "/etc/hosts" file on all hosts (including the DataNodes and the NameNode) and add this line in addition to what is already there. Please keep "ip-10-0-223-116.ec2.internal" first after the IP address:

10.0.223.116     ip-10-0-223-116.ec2.internal    ip-10-0-223-116


2. Edit your "hdfs-site.xml" file and replace the "localhost" addresses mentioned above with "ip-10-0-223-116.ec2.internal" (if that does not work, try "ip-10-0-223-116" next).

3. Restart your HDFS services after making these changes. I am not sure whether you should use the "ip-10-0-223-116.ec2.internal" or the "ip-10-0-223-116" hostname. Based on the "hostname -f" output it should be "ip-10-0-223-116.ec2.internal"; if that does not work, try both hostnames one by one to see which one works.
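
To spot any remaining "localhost" values before restarting, something like the following might help. It is only a sketch: find_localhost_props is a helper name made up for illustration (it is not part of Hadoop), it assumes each <name> and <value> tag sits on its own line as in the snippets above, and the /etc/hadoop/conf paths in the usage comments are an assumption about where your configs live.

```shell
# Hypothetical helper (not part of Hadoop): print the name of every property
# whose <value> line contains "localhost". Assumes one tag per line.
find_localhost_props() {
  awk '
    /<name>/ {
      # Remember the most recent property name, stripped of its tags.
      name = $0
      sub(/.*<name>/, "", name)
      sub(/<\/name>.*/, "", name)
    }
    /<value>/ && /localhost/ { print name }
  ' "$1"
}

# Example usage (paths are an assumption; adjust to your installation):
# find_localhost_props /etc/hadoop/conf/hdfs-site.xml
# find_localhost_props /etc/hadoop/conf/core-site.xml
```

Any property name it prints is a candidate for replacement with the NameNode hostname.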


23 REPLIES

Explorer

@Aditya Sirna The error seems weird.

1) The NameNode service is up and running on the master

[cloudbreak@ip-10-0-223-116 ~]$ sudo netstat -nltp| grep :8020

tcp        0      0 127.0.0.1:8020              0.0.0.0:*                   LISTEN      6689/java

2) Port 8020 is open for all (screenshot: screen-shot-2017-10-27-at-124602-pm.png)

3) I can ping from master to slave using the private DNS

4) But telnet is getting connection refused

[cloudbreak@ip-10-0-109-247 hdfs]$ telnet 10.0.223.116 8020

Trying 10.0.223.116...

telnet: connect to address 10.0.223.116: Connection refused

5) The DataNode logs show the error below:

2017-10-27 05:24:15,073 INFO  ipc.Client (Client.java:handleConnectionFailure(906)) - Retrying connect to server: ip-10-0-223-116.ec2.internal/10.0.223.116:8020. Already tried 48 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)

Super Guru

@Nirmal J,

Can you set dfs.namenode.rpc-address to ip-10-0-223-116 and restart the NameNode?

I hope you have the mapping below in /etc/hosts:

10.0.223.116 ip-10-0-223-116

Explorer

This didn't work @Aditya Sirna. Still the same issue.

It seems like some kind of firewall issue.

"telnet 10.0.223.116 8020" connects from the NameNode, but not from the DataNode.

Master Mentor

@Nirmal J

In one of your previous updates you mentioned: "I can ping from master to slave using the private DNS, but telnet is getting connection refused".


That can happen if there is a firewall issue. Can you please also check that "iptables" (or firewalld) is stopped on the NameNode host and on all the DataNode hosts? Please share the output of the following from the DataNode host:

# service iptables status
# service iptables stop
# ping 10.0.223.116
# ping ip-10-0-223-116
# telnet 10.0.223.116 8020
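
If telnet is not installed on the DataNode host, a rough equivalent can be sketched in shell. This is only an illustration: port_open is a made-up helper name, and it assumes that bash (for its /dev/tcp redirection) and the coreutils timeout command are available on the host.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port can be opened
# within 3 seconds. Relies on bash's /dev/tcp redirection and on the coreutils
# `timeout` command; both are assumed to be present.
port_open() {
  # Inside the child bash, $0 is the host and $1 is the port.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example usage from the DataNode host (address taken from this thread):
# if port_open 10.0.223.116 8020; then
#   echo "NameNode RPC port is reachable"
# else
#   echo "connection refused, filtered, or timed out"
# fi
```

A failure here does not say why the connection failed, so it complements the iptables check rather than replacing it.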


Also, the "/etc/hosts" file across the cluster should look almost identical, so that every node in the cluster can resolve and connect to every other node using its FQDN/hostname.

Also, please check that "hostname -f" returns the correct FQDN on every host, especially on the NameNode and on the DataNode that is failing to connect.

# hostname -f
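
To compare the hosts files across nodes, a small check along these lines could be run on each host. It is a sketch under assumptions: first_name_for_ip is a helper name invented for illustration, and it simply prints the first hostname listed after a given IP in a hosts-format file.

```shell
# Hypothetical helper: print the first hostname listed for an IP address in a
# hosts-format file. Usage: first_name_for_ip <ip> [hosts_file]
first_name_for_ip() {
  # Comment lines never match because their first field is not the IP.
  awk -v ip="$1" '$1 == ip { print $2; exit }' "${2:-/etc/hosts}"
}

# Example usage on any node (IP and FQDN taken from this thread):
# [ "$(first_name_for_ip 10.0.223.116)" = "ip-10-0-223-116.ec2.internal" ] \
#   && echo "hosts entry OK" || echo "FQDN is missing or not listed first"
```

If the output differs between nodes, the hosts files are not consistent and name resolution may differ from host to host.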

Master Mentor

@Nirmal J

Also, in the "/etc/hosts" file content you shared earlier, I do not see the following address mapped anywhere:

10.0.223.116	  ip-10-0-65-59.ec2.internal	ip-10-0-65-59.ec2

Which one is your NameNode address: "34.231.98.121" or "10.0.223.116"?

Master Mentor

@Nirmal J

Can you also share the output of the following commands from the NameNode host?

# service iptables status
# hostname -f
# netstat -tnlpa | grep `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid`

Explorer

@Jay SenSharma

[cloudbreak@ip-10-0-223-116 ~]$ sudo service iptables status

iptables: Firewall is not running.

[cloudbreak@ip-10-0-223-116 ~]$ hostname -f

ip-10-0-223-116.ec2.internal

[cloudbreak@ip-10-0-223-116 ~]$ sudo netstat -tnlpa | grep `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid`

tcp        0      0 127.0.0.1:8020              0.0.0.0:*                   LISTEN      29280/java
tcp        0      0 127.0.0.1:50070             0.0.0.0:*                   LISTEN      29280/java
tcp        0      0 127.0.0.1:20103             0.0.0.0:*                   LISTEN      29280/java

Master Mentor

@Nirmal J

It is strange to see the address 127.0.0.1:8020 in your output. It means that port 8020 is listening only on the loopback address 127.0.0.1, so it cannot be reached from a remote machine.

Ideally the NameNode output should look something like the following:

# netstat -tnlpa | grep `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid`
tcp        0      0 172.27.10.2:8020             0.0.0.0:*                   LISTEN      509/java            
tcp        0      0 172.27.10.2:50070            0.0.0.0:*                   LISTEN      509/java      

Can you please check and share the core-site.xml / hdfs-site.xml so we can see whether the NameNode address is correct and whether it is listening on the hostname (FQDN) and not on 127.0.0.1?
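
After changing the config and restarting, a quick way to confirm that nothing is still bound to loopback might look like this. It is a sketch, not an official tool: loopback_listeners is a made-up helper name, it only looks at IPv4 addresses, and it assumes the usual "netstat -tnlp" column layout (local address in column 4, state in column 6).

```shell
# Hypothetical helper: read `netstat -tnlp` output on stdin and print the
# local address of every LISTEN socket bound to an IPv4 loopback address.
loopback_listeners() {
  awk '$6 == "LISTEN" && $4 ~ /^127\./ { print $4 }'
}

# Example usage on the NameNode host (pid file path as used in this thread):
# netstat -tnlp | grep `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid` | loopback_listeners
```

If this still prints 127.0.0.1:8020 after the restart, the DataNodes will not be able to reach the NameNode RPC port.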

Master Mentor

@Nirmal J

Please check the ifconfig output on the NameNode to see the actual IP Address that it is supposed to be using.

Example:

# ifconfig
eth0  Link encap:Ethernet  HWaddr 12:42:AA:11:00:02  
  inet addr:172.27.10.2  Bcast:0.0.0.0  Mask:255.255.0.0
.
.
lo  Link encap:Local Loopback  
  inet addr:127.0.0.1  Mask:255.0.0.0

Explorer

@Jay SenSharma

Please find attached the core-site.xml, hdfs-site.xml, and the /etc/hosts files from all nodes:

hosts-master.txt, hosts-slave-1.txt, hosts-slave-2.txt, core-site.xml, hdfs-site.xml

Thanks,

Nirmal J