Created 10-26-2017 03:01 PM
Hello guys,
I have set up a cluster and HDFS is up and running.
But MapReduce2 and Spark are not starting. They give the error below:
{
  "RemoteException": {
    "exception": "IOException",
    "javaClassName": "java.io.IOException",
    "message": "Failed to find datanode, suggest to check cluster health. excludeDatanodes=null"
  }
}
HBase is also getting the error below:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /apps/hbase/data/.tmp/hbase.version could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
Somehow the DataNodes are not being detected.
I updated the /etc/hosts file on all nodes with the private IP and private DNS, but I am still getting the same error.
Can someone please help me with this?
Thanks,
Nirmal J
Created 10-27-2017 09:50 AM
I see that the rpc-address is set to the short hostname, but the actual "hostname -f" output on the NameNode is "ip-10-0-223-116.ec2.internal". So ideally the rpc-address should be "ip-10-0-223-116.ec2.internal:8020" instead of "ip-10-0-223-116:8020".
# grep -B 2 -A 2 'rpc-address' /Users/jsensharma/Downloads/41589-hdfs-site.xml
<property>
  <name>dfs.namenode.rpc-address</name>
  <value>ip-10-0-223-116:8020</value>
</property>
Same with "fs.defaultFS": it should not be using "localhost". That seems to be causing the issue here, and it should also be changed to the hostname of the NameNode.
grep -B 2 -A 2 'localhost' /Users/jsensharma/Downloads/41588-core-site.xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:8020</value>
  <final>true</final>
</property>
It is also strange that other addresses, like the following, are set to "localhost":
# grep -B 2 -A 2 'localhost' /Users/jsensharma/Downloads/41589-hdfs-site.xml
<property>
  <name>dfs.namenode.http-address</name>
  <value>localhost:50070</value>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.https-address</name>
  <value>localhost:50470</value>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>localhost:50090</value>
</property>
So can you please try this:
1. Edit the "/etc/hosts" file on all hosts (including the DataNodes and the NameNode) and add the line below in addition to what you already have there. Please keep "ip-10-0-223-116.ec2.internal" first, right after the IP address:
10.0.223.116 ip-10-0-223-116.ec2.internal ip-10-0-223-116
2. Edit your "hdfs-site.xml" and replace the "localhost" addresses mentioned above with "ip-10-0-223-116.ec2.internal" (if that does not work, try "ip-10-0-223-116" on the next attempt).
3. Restart your HDFS services after making these changes. I am not sure whether you should be using "ip-10-0-223-116.ec2.internal" or "ip-10-0-223-116", but based on the "hostname -f" output it should be "ip-10-0-223-116.ec2.internal". If that does not work, try each hostname in turn to see which one works.
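Once the edits are in, a quick scan can confirm that no "localhost" value survived. A minimal sketch, using a stand-in file in place of the real config (on an HDP install the files are usually under /etc/hadoop/conf/, but adjust the path to your cluster):

```shell
# Stand-in for the real hdfs-site.xml; point the grep at
# /etc/hadoop/conf/hdfs-site.xml (and core-site.xml) on your cluster.
cat > /tmp/hdfs-site-sample.xml <<'EOF'
<property>
  <name>dfs.namenode.rpc-address</name>
  <value>ip-10-0-223-116.ec2.internal:8020</value>
</property>
EOF

# Any hit here means a NameNode address is still bound to localhost.
if grep -q 'localhost' /tmp/hdfs-site-sample.xml; then
  echo "localhost still present -- fix the config"
else
  echo "no localhost entries"
fi
```

Run the same grep against both hdfs-site.xml and core-site.xml before restarting the services.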
Created 10-26-2017 03:44 PM
It looks like HDFS is not actually running. The message you copied from HBase states that there are no datanodes running.
Look at the DataNode logs and check the NameNode WebUI to try to identify why the DataNodes aren't running/included.
Created 10-26-2017 04:57 PM
Thanks @Josh Elser,
I see the below error in the datanode logs
2017-10-26 16:53:28,746 INFO ipc.Server (Server.java:run(900)) - IPC Server listener on 8010: starting
2017-10-26 16:53:29,892 INFO ipc.Client (Client.java:handleConnectionFailure(906)) - Retrying connect to server: localhost/127.0.0.1:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
Is it because the NameNode IP is specified incorrectly in some config?
Thanks,
Nirmal J
Created 10-26-2017 05:32 PM
From the above log, it looks like there is an issue with the /etc/hosts configuration. Can you please try:
telnet <namenode-host> 8020
The datanode is pointing to localhost/127.0.0.1:8020 instead of the name node host.
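A quick way to see which address the DataNodes will dial is to read fs.defaultFS straight out of core-site.xml. A sketch, using a stand-in file in place of the real /etc/hadoop/conf/core-site.xml (the path on your install may differ); the value shown is the one from this thread:

```shell
# Stand-in for the real core-site.xml.
cat > /tmp/core-site-sample.xml <<'EOF'
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:8020</value>
</property>
EOF

# Grab the <value> that follows the fs.defaultFS <name> tag.
val=$(grep -A 1 '<name>fs.defaultFS</name>' /tmp/core-site-sample.xml \
  | grep -o '<value>[^<]*</value>')
echo "$val"   # a localhost value here explains the retry loop in the log
```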
Thanks,
Aditya
Created 10-26-2017 05:39 PM
@Aditya Sirna Thanks for your time.
telnet gives a "connection refused" error.
But the service is running on port 8020 on the NameNode.
Here is my master node /etc/hosts:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost6 localhost6.localdomain6
10.0.65.59 ip-10-0-65-59.ec2.internal ip-10-0-65-59
10.0.125.23 ip-10-0-125-23.ec2.internal
10.0.190.115 ip-10-0-190-115.ec2.internal
And this is my slave node /etc/hosts:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost6 localhost6.localdomain6
10.0.125.23 ip-10-0-125-23.ec2.internal ip-10-0-125-23
10.0.65.59 ip-10-0-65-59.ec2.internal
10.0.190.115 ip-10-0-190-115.ec2.internal
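For reference, a quick check that each entry in those files lists the .ec2.internal FQDN first after the IP. A sketch run against a stand-in file rather than the real /etc/hosts (drop the localhost lines for this check):

```shell
# Stand-in for /etc/hosts, using the entries from this thread.
cat > /tmp/hosts-sample <<'EOF'
10.0.65.59 ip-10-0-65-59.ec2.internal ip-10-0-65-59
10.0.125.23 ip-10-0-125-23.ec2.internal
10.0.190.115 ip-10-0-190-115.ec2.internal
EOF

# Flag any entry whose first name is not the .ec2.internal FQDN.
awk '{ print $1, (($2 ~ /\.ec2\.internal$/) ? "OK" : "check FQDN order") }' /tmp/hosts-sample
```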
Thanks,
Nirmal J
Created 10-26-2017 05:53 PM
Can you please check the value of dfs.namenode.rpc-address in Advanced hdfs-site.xml?
Thanks, Aditya
Created 10-26-2017 05:55 PM
Created 10-26-2017 05:56 PM
Please change it to the hostname and try restarting the services. Make sure that "telnet <hostname> 8020" works.
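If telnet is not installed on the host, bash's built-in /dev/tcp redirection can do the same port probe. A sketch; the hostname below is the NameNode from this thread, so substitute your own:

```shell
# Probe the NameNode RPC port without telnet. Swap in your NameNode hostname.
NN_HOST="ip-10-0-65-59.ec2.internal"
if timeout 3 bash -c "cat < /dev/null > /dev/tcp/$NN_HOST/8020" 2>/dev/null; then
  port_state="open"
else
  port_state="closed or unreachable"
fi
echo "port 8020 on $NN_HOST: $port_state"
```

"open" here means a TCP connect succeeded; "closed or unreachable" covers refused connections, DNS failures, and timeouts alike, so follow up with ping and /etc/hosts checks to tell them apart.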
Created 10-27-2017 04:07 AM
@Aditya Sirna I tried giving the NameNode public IP there, but then the NameNode itself failed to start:
Call From ip-10-0-65-59.ec2.internal/10.0.65.59 to 34.231.98.121:8020 failed on connection exception: java.net.ConnectException: Connection refused;
Created 10-27-2017 04:19 AM
I would recommend using hostnames rather than IPs. If you want to use IPs, use either all private IPs or all public IPs. Since you have configured all private IPs in /etc/hosts, it is better to use private IPs. But my recommendation is to go with hostnames.
Also, make sure that you are able to ping the private IPs and hostnames from one box to another.
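That reachability check can be scripted; a sketch that loops over the hostnames from this thread (substitute your own list, and note that the -W timeout flag is the Linux iputils variant of ping):

```shell
# Ping each cluster host once; a short timeout keeps failures quick.
checked=0
for h in ip-10-0-65-59.ec2.internal ip-10-0-125-23.ec2.internal ip-10-0-190-115.ec2.internal; do
  if ping -c 1 -W 2 "$h" > /dev/null 2>&1; then
    echo "$h reachable"
  else
    echo "$h NOT reachable"
  fi
  checked=$((checked+1))
done
echo "checked $checked hosts"
```

Run it from every node in both directions; one-way reachability is a common symptom of a half-edited /etc/hosts.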