
Failed to find datanode, suggest to check cluster health. excludeDatanodes=null

Explorer

Hello guys,

I have setup a cluster and HDFS is up and running.

But MapReduce2 and Spark are not starting. They give the error below.

{
  "RemoteException": {
    "exception": "IOException", 
    "javaClassName": "java.io.IOException", 
    "message": "Failed to find datanode, suggest to check cluster health. excludeDatanodes=null"
  }
}

HBase is also getting the error below:

org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /apps/hbase/data/.tmp/hbase.version could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.

Somehow the datanodes are not detected.

I updated the hosts file on all nodes with the private IP and private DNS, but I am still getting the same error.

Can someone please help me with this?

Thanks,

Nirmal J

1 ACCEPTED SOLUTION

Master Mentor

@Nirmal J

I see that the rpc-address is set to a short hostname, but the actual "hostname -f" output on the NameNode is "ip-10-0-223-116.ec2.internal". So ideally the following rpc-address should be using "ip-10-0-223-116.ec2.internal:8020" instead of "ip-10-0-223-116:8020".

# grep -B 2 -A 2 'rpc-address' /Users/jsensharma/Downloads/41589-hdfs-site.xml 
  <property>
  	<name>dfs.namenode.rpc-address</name>
  	<value>ip-10-0-223-116:8020</value>
  </property>

.

Same with "fs.defaultFS": it should not be using "localhost". That seems to be causing the issue here, and it should also be changed to the hostname of the NameNode.

grep -B 2 -A 2 'localhost' /Users/jsensharma/Downloads/41588-core-site.xml 
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:8020</value>
      <final>true</final>
    </property>


It is also strange that other addresses, like the following, are set to "localhost":

# grep -B 2 -A 2 'localhost' /Users/jsensharma/Downloads/41589-hdfs-site.xml 
  <property>
  	<name>dfs.namenode.http-address</name>
  	<value>localhost:50070</value>
  	<final>true</final>
  	</property>

  <property>
  	<name>dfs.namenode.https-address</name>
  	<value>localhost:50470</value>
  </property>

  <property>
  	<name>dfs.namenode.secondary.http-address</name>
  	<value>localhost:50090</value>
  </property>





So can you please try this:


1. Edit your "/etc/hosts" file on all of the hosts (including the DataNodes and the NameNode) and add the line below in addition to what you already have there. Please keep "ip-10-0-223-116.ec2.internal" as the first name after the IP address in this file:

10.0.223.116     ip-10-0-223-116.ec2.internal    ip-10-0-223-116


2. Edit your "hdfs-site.xml" file and replace the "localhost" addresses mentioned above with "ip-10-0-223-116.ec2.internal" (if that does not work, try "ip-10-0-223-116" on the next attempt).

3. Restart your HDFS services after making these changes. I am not sure whether you should be using "ip-10-0-223-116.ec2.internal" or "ip-10-0-223-116", but based on the "hostname -f" output it should be "ip-10-0-223-116.ec2.internal"; if that does not work, try both hostnames one by one to see which one works.

.
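A rough verification sketch after making those changes (the commands are standard HDFS CLI; the hostname is the one from this thread, so substitute your own):

```shell
# Sketch only: run on the NameNode host after restarting HDFS.

# 1. The NameNode hostname should resolve to the private IP, not 127.0.0.1
getent hosts ip-10-0-223-116.ec2.internal

# 2. The effective fs.defaultFS should point at the NameNode, not localhost
hdfs getconf -confKey fs.defaultFS

# 3. The DataNodes should now register; "Live datanodes" should be non-zero
hdfs dfsadmin -report | grep 'Live datanodes'
```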



Super Guru

It looks like HDFS is not actually running. The message you copied from HBase states that there are no datanodes running.

Look at the DataNode logs and check the NameNode WebUI to try to identify why the DataNodes aren't running/included.
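As a sketch of that check (the log path follows common HDP defaults and may differ on your installation; replace <namenode-host> with your actual NameNode hostname):

```shell
# Look for registration/connection errors in the DataNode log
tail -n 100 /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log

# Ask the NameNode how many DataNodes it considers live
# (default web UI port 50070; NumLiveDataNodes is in FSNamesystemState)
curl -s 'http://<namenode-host>:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'
```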

Explorer

Thanks @Josh Elser,

I see the below error in the DataNode logs:

2017-10-26 16:53:28,746 INFO ipc.Server (Server.java:run(900)) - IPC Server listener on 8010: starting

2017-10-26 16:53:29,892 INFO ipc.Client (Client.java:handleConnectionFailure(906)) - Retrying connect to server: localhost/127.0.0.1:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)

Is it because the NameNode address is specified incorrectly in some config?

Thanks,

Nirmal J

Super Guru

@Nirmal J,

From the above log, it looks like there is an issue with the /etc/hosts configuration. Can you please try:

telnet <namenode-host> 8020

The DataNode is pointing to localhost/127.0.0.1:8020 instead of the NameNode host.
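One way to confirm that from a DataNode host (a sketch; substitute your actual NameNode hostname for <namenode-host>):

```shell
# What does this host resolve the NameNode name to?
# It should print the private IP, not 127.0.0.1.
getent hosts <namenode-host>

# Is the NameNode RPC port reachable from this DataNode?
telnet <namenode-host> 8020    # or: nc -vz <namenode-host> 8020
```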

Thanks,

Aditya

Explorer

@Aditya Sirna Thanks for your time.

telnet gives a connection refused error,

but the service is running on port 8020 on the NameNode.

Here is my master node's /etc/hosts:

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4

::1 localhost6 localhost6.localdomain6

10.0.65.59 ip-10-0-65-59.ec2.internal ip-10-0-65-59

10.0.125.23 ip-10-0-125-23.ec2.internal

10.0.190.115 ip-10-0-190-115.ec2.internal

And this is my slave node's /etc/hosts:

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4

::1 localhost6 localhost6.localdomain6

10.0.125.23 ip-10-0-125-23.ec2.internal ip-10-0-125-23

10.0.65.59 ip-10-0-65-59.ec2.internal

10.0.190.115 ip-10-0-190-115.ec2.internal
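The two files above can be cross-checked with a small sanity script (a sketch: it only verifies that every 10.x cluster entry lists the *.ec2.internal FQDN directly after the IP, which is what the accepted solution asks for; the here-doc entries are copied from the master node's file):

```shell
# Prints "ok" for each cluster entry whose second field is the FQDN, and "BAD"
# for any entry where a short name comes before the *.ec2.internal name.
check_hosts() {
  awk '/^10\./ { if ($2 ~ /\.ec2\.internal$/) print "ok: " $0; else print "BAD: " $0 }'
}

# Entries copied from the master node's /etc/hosts above:
check_hosts <<'EOF'
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
10.0.65.59 ip-10-0-65-59.ec2.internal ip-10-0-65-59
10.0.125.23 ip-10-0-125-23.ec2.internal
10.0.190.115 ip-10-0-190-115.ec2.internal
EOF
```

In this layout all three cluster lines print "ok"; a line such as "10.0.65.59 ip-10-0-65-59 ip-10-0-65-59.ec2.internal" would print "BAD".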

Thanks,

Nirmal J

Super Guru

@Nirmal J

Can you please check the value of dfs.namenode.rpc-address in Advanced hdfs-site.xml?

Thanks, Aditya

Explorer

@Aditya Sirna

It is localhost:8020

Thanks,

Nirmal J

Super Guru

Please change it to the hostname and try restarting the services. Make sure that "telnet hostname 8020" works.

Explorer

@Aditya Sirna I tried giving the NameNode's public IP there, but then the NameNode itself failed to start:

Call From ip-10-0-65-59.ec2.internal/10.0.65.59 to 34.231.98.121:8020 failed on connection exception: java.net.ConnectException: Connection refused;
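That failure is expected on EC2: a public IP is NAT-mapped and is not configured on the instance's network interface, so the NameNode cannot bind to it. A quick way to see which address port 8020 is actually bound to (a sketch, run on the NameNode host):

```shell
# If this shows 127.0.0.1:8020, only local clients can connect and remote
# DataNodes will get "Connection refused". It should show the private IP
# (or 0.0.0.0) once the configs use the NameNode's hostname.
ss -tlnp | grep 8020      # or: netstat -tlnp | grep 8020
```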

Super Guru

@Nirmal J,

I would recommend using hostnames rather than IPs. If you want to use IPs instead, use either all private IPs or all public IPs. As you have configured private IPs in /etc/hosts, it is better to use private IPs, but my recommendation is to go with hostnames.

Also, make sure that you are able to ping the private IPs and hostnames from one box to another.
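That check can be sketched as a small loop (the hostnames below are the ones from this thread; adjust to your own cluster):

```shell
# Run from each host; every cluster member should be resolvable and reachable.
for h in ip-10-0-65-59.ec2.internal ip-10-0-125-23.ec2.internal ip-10-0-190-115.ec2.internal; do
  if ping -c 1 -W 2 "$h" >/dev/null 2>&1; then
    echo "reachable:   $h"
  else
    echo "UNREACHABLE: $h"
  fi
done
```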