Created 10-26-2017 03:01 PM
Hello guys,
I have set up a cluster and HDFS is up and running.
However, MapReduce2 and Spark are not starting. They fail with the error below:
{ "RemoteException": { "exception": "IOException", "javaClassName": "java.io.IOException", "message": "Failed to find datanode, suggest to check cluster health. excludeDatanodes=null" } }
HBase is also failing with the error below:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /apps/hbase/data/.tmp/hbase.version could only be replicated to 0 nodes instead of minReplication (=1).There are 0 datanode(s) running and no node(s) are excluded in this operation.
Somehow the DataNodes are not being detected.
I updated the hosts file on all nodes with the private IP and private DNS name, but I still get the same error.
Can someone please help me with this?
Thanks,
Nirmal J
Created 10-27-2017 09:50 AM
I see that the rpc-address is set to the short hostname, but the actual "hostname -f" output on the NameNode is "ip-10-0-223-116.ec2.internal". So ideally the following rpc-address should use "ip-10-0-223-116.ec2.internal:8020" instead of "ip-10-0-223-116:8020".
# grep -B 2 -A 2 'rpc-address' /Users/jsensharma/Downloads/41589-hdfs-site.xml
    <property>
      <name>dfs.namenode.rpc-address</name>
      <value>ip-10-0-223-116:8020</value>
    </property>
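To check this mismatch quickly, here is a minimal Python sketch (the function name "rpc_host_matches_fqdn" is hypothetical, not part of Hadoop). It strips the port from a "host:port" value and compares the host with the FQDN. When no FQDN is passed, it assumes that Python's "socket.getfqdn()" reports the same name as "hostname -f", which is usual but not guaranteed.

```python
import socket
from typing import Optional


def rpc_host_matches_fqdn(rpc_address: str, fqdn: Optional[str] = None) -> bool:
    """Return True if the host part of a "host:port" value equals the FQDN.

    Assumption: socket.getfqdn() reports the same name as `hostname -f`;
    pass `fqdn` explicitly to compare against a known value instead.
    """
    host = rpc_address.rsplit(":", 1)[0]  # strip the ":8020" port suffix
    if fqdn is None:
        fqdn = socket.getfqdn()
    # DNS names are case-insensitive, so compare them in lower case.
    return host.lower() == fqdn.lower()


if __name__ == "__main__":
    # Values from this thread: the configured rpc-address and `hostname -f`.
    print(rpc_host_matches_fqdn("ip-10-0-223-116:8020",
                                "ip-10-0-223-116.ec2.internal"))  # prints False
```

With the values from this thread it returns False, because "ip-10-0-223-116" is only the short name.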
The same applies to "fs.defaultFS": it should not use "localhost". This seems to be causing the issue here, and it should also be changed to the hostname of the NameNode.
# grep -B 2 -A 2 'localhost' /Users/jsensharma/Downloads/41588-core-site.xml
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:8020</value>
      <final>true</final>
    </property>
It is also strange that other addresses, like the following, are set to "localhost":
# grep -B 2 -A 2 'localhost' /Users/jsensharma/Downloads/41589-hdfs-site.xml
    <property>
      <name>dfs.namenode.http-address</name>
      <value>localhost:50070</value>
      <final>true</final>
    </property>
    <property>
      <name>dfs.namenode.https-address</name>
      <value>localhost:50470</value>
    </property>
    <property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>localhost:50090</value>
    </property>
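To find every such entry at once instead of grepping for one property at a time, a small Python sketch can parse a "*-site.xml" file and list the properties whose values point at "localhost" or "127.0.0.1". The function name and the sample below are hypothetical. The sample is rebuilt from the grep output above, wrapped in the standard "<configuration>" root element.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Values containing these strings point at the loopback interface only.
LOOPBACK_MARKERS = ("localhost", "127.0.0.1")


def find_loopback_properties(site_xml: str) -> Dict[str, str]:
    """Map property name -> value for every property whose value uses loopback."""
    root = ET.fromstring(site_xml)
    hits = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if any(marker in value for marker in LOOPBACK_MARKERS):
            hits[name] = value
    return hits


# Hypothetical sample reconstructed from the grep output (relevant properties only).
SAMPLE_HDFS_SITE = """<configuration>
  <property><name>dfs.namenode.rpc-address</name><value>ip-10-0-223-116:8020</value></property>
  <property><name>dfs.namenode.http-address</name><value>localhost:50070</value><final>true</final></property>
  <property><name>dfs.namenode.https-address</name><value>localhost:50470</value></property>
  <property><name>dfs.namenode.secondary.http-address</name><value>localhost:50090</value></property>
</configuration>"""

if __name__ == "__main__":
    for name, value in find_loopback_properties(SAMPLE_HDFS_SITE).items():
        print(f"{name} = {value}")
```

The same function works on "core-site.xml", where it would report "fs.defaultFS = hdfs://localhost:8020".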
So can you please try the following:
1. On all hosts (including the DataNodes and the NameNode), add the following line to the "/etc/hosts" file, in addition to the entries already there. Keep "ip-10-0-223-116.ec2.internal" as the first name after the IP address:
10.0.223.116 ip-10-0-223-116.ec2.internal ip-10-0-223-116
2. Edit your "hdfs-site.xml" file and replace the "localhost" addresses mentioned above with "ip-10-0-223-116.ec2.internal" (if that does not work, try "ip-10-0-223-116" next). Do the same for "fs.defaultFS" in "core-site.xml".
3. Restart your HDFS services after making these changes. I am not sure whether you should use "ip-10-0-223-116.ec2.internal" or "ip-10-0-223-116" as the hostname, but based on the "hostname -f" output it should be "ip-10-0-223-116.ec2.internal". If that does not work, try both hostnames one at a time to see which one works.
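Step 2 can also be scripted. The sketch below (a hypothetical helper "replace_localhost", standard library only) replaces "localhost" inside every <value> element with whichever hostname you want to try. Switching between "ip-10-0-223-116.ec2.internal" and "ip-10-0-223-116" then only means passing a different argument. This assumes you edit the XML files directly. Keep a backup, because ElementTree does not preserve XML comments or the XML declaration.

```python
import xml.etree.ElementTree as ET


def replace_localhost(site_xml: str, new_host: str) -> str:
    """Return the config with "localhost" in every <value> replaced by new_host.

    Note: ElementTree drops XML comments and the <?xml ...?> declaration,
    so keep a backup of the original file before writing the result back.
    """
    root = ET.fromstring(site_xml)
    for value_el in root.iter("value"):
        if value_el.text and "localhost" in value_el.text:
            value_el.text = value_el.text.replace("localhost", new_host)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    core_site = ("<configuration><property><name>fs.defaultFS</name>"
                 "<value>hdfs://localhost:8020</value><final>true</final>"
                 "</property></configuration>")
    # First try the FQDN reported by `hostname -f`; fall back to the short name.
    print(replace_localhost(core_site, "ip-10-0-223-116.ec2.internal"))
```

After writing the result back to the file, restart HDFS as in step 3.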
Created 10-27-2017 09:02 AM
Your NameNode port shows as 127.0.0.1:8020, which indicates that it is listening only on the "127.0.0.1" (loopback) IP. Hence port 8020 cannot be reached remotely.
So you should check whether the "dfs.namenode.rpc-address" property of your HDFS is configured properly as "$HOSTNAME:8020",
and whether that hostname resolves to the IP address of the NameNode host.
Can you please check and share your core-site.xml and hdfs-site.xml, so we can see whether the NameNode address is correct and whether it is listening on the hostname (FQDN) rather than on 127.0.0.1?
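The loopback point can be illustrated with Python's "ipaddress" module. This minimal sketch (a hypothetical function "is_loopback_only", IPv4 "host:port" strings only) reports whether a listen address is reachable only from the same machine. Such an address might come, for example, from the Local Address column of "netstat -tnlp" on the NameNode. When given a hostname instead of an IP, it resolves the name with "socket.gethostbyname". On typical Linux setups that lookup consults /etc/hosts, so a hostname wrongly mapped to 127.0.0.1 is caught too.

```python
import ipaddress
import socket


def is_loopback_only(listen_address: str) -> bool:
    """True if an IPv4 "host:port" listen address is reachable only from this machine.

    0.0.0.0 (all interfaces) and a real interface IP are not loopback-only.
    """
    host = listen_address.rsplit(":", 1)[0]
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A hostname: resolve it via the local resolver (usually /etc/hosts first).
        ip = ipaddress.ip_address(socket.gethostbyname(host))
    return ip.is_loopback


if __name__ == "__main__":
    for addr in ("127.0.0.1:8020", "10.0.223.116:8020", "0.0.0.0:8020"):
        print(addr, "loopback only" if is_loopback_only(addr) else "not loopback-only")
```

A result of "not loopback-only" does not prove the port is reachable, because a firewall or security group can still block it. It only rules out the 127.0.0.1 problem described above.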
This normally happens in one of two cases:
1. The /etc/hosts entry on the NameNode is not correct.
2. The addresses of the other network interfaces are not resolving.
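Case 1 can be checked mechanically. The sketch below assumes the usual /etc/hosts layout: the IP, then the canonical name, then aliases, with "#" starting a comment. It flags two problems. The first is the NameNode's IP not listing the FQDN first. The second is the NameNode's hostname being mapped to a loopback address, which is one common way a daemon ends up listening on 127.0.0.1. The function names and sample files are hypothetical.

```python
from typing import Iterator, List, Optional, Tuple


def _entries(hosts_text: str) -> Iterator[Tuple[str, List[str]]]:
    """Yield (ip, [names]) for each non-comment line of /etc/hosts-style text."""
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            ip, *names = line.split()
            yield ip, names


def canonical_name(hosts_text: str, ip: str) -> Optional[str]:
    """First hostname listed for `ip`, i.e. the name that should be the FQDN."""
    for entry_ip, names in _entries(hosts_text):
        if entry_ip == ip and names:
            return names[0]
    return None


def names_on_loopback(hosts_text: str, hostnames: List[str]) -> List[str]:
    """Which of `hostnames` are mapped to a loopback address (127.x or ::1)."""
    bad = []
    for entry_ip, names in _entries(hosts_text):
        if entry_ip.startswith("127.") or entry_ip == "::1":
            bad.extend(n for n in names if n in hostnames)
    return bad


# Hypothetical samples: the layout suggested in this thread, and a broken one.
GOOD_HOSTS = """127.0.0.1   localhost localhost.localdomain
10.0.223.116 ip-10-0-223-116.ec2.internal ip-10-0-223-116
"""
BAD_HOSTS = "127.0.0.1 localhost ip-10-0-223-116\n"

if __name__ == "__main__":
    names = ["ip-10-0-223-116.ec2.internal", "ip-10-0-223-116"]
    print(canonical_name(GOOD_HOSTS, "10.0.223.116"))  # ip-10-0-223-116.ec2.internal
    print(names_on_loopback(BAD_HOSTS, names))          # ['ip-10-0-223-116']
```

To check a real host, you could pass the contents of "/etc/hosts" read on the NameNode and on each DataNode.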
Created 10-27-2017 10:36 AM
Created 10-27-2017 10:38 AM
Good to know that the issue is resolved.
Since the issue is resolved, it would be great if you could mark this HCC thread as answered by clicking the "Accept" button on the correct answer. That way, other HCC users can quickly find the solution when they encounter the same issue.