Created 10-26-2017 03:01 PM
Hello guys,
I have setup a cluster and HDFS is up and running.
But MapReduce2 and Spark are not starting. They give the error below.
{ "RemoteException": { "exception": "IOException", "javaClassName": "java.io.IOException", "message": "Failed to find datanode, suggest to check cluster health. excludeDatanodes=null" } }
HBase is also reporting the error below:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /apps/hbase/data/.tmp/hbase.version could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
Somehow the datanodes are not detected.
I updated the hosts file on all nodes with the private IP and private DNS, but I am still getting the same error.
Can someone please help me with this?
Thanks,
Nirmal J
Created 10-27-2017 09:50 AM
I see that the rpc-address is set to the short hostname, but the actual "hostname -f" output on the NameNode shows "ip-10-0-223-116.ec2.internal". So ideally the rpc-address below should be using "ip-10-0-223-116.ec2.internal:8020" instead of "ip-10-0-223-116:8020".
# grep -B 2 -A 2 'rpc-address' /Users/jsensharma/Downloads/41589-hdfs-site.xml
<property>
  <name>dfs.namenode.rpc-address</name>
  <value>ip-10-0-223-116:8020</value>
</property>
The same applies to "fs.defaultFS": it should not be using "localhost". This seems to be causing the issue here, and it should also be changed to the hostname of the NameNode:
# grep -B 2 -A 2 'localhost' /Users/jsensharma/Downloads/41588-core-site.xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:8020</value>
  <final>true</final>
</property>
It is also strange that other addresses, like the following, are set to "localhost":
# grep -B 2 -A 2 'localhost' /Users/jsensharma/Downloads/41589-hdfs-site.xml
<property>
  <name>dfs.namenode.http-address</name>
  <value>localhost:50070</value>
  <final>true</final>
</property>
<property>
  <name>dfs.namenode.https-address</name>
  <value>localhost:50470</value>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>localhost:50090</value>
</property>
So can you please try this:
1. Edit the "/etc/hosts" file on all hosts (including the DataNodes and the NameNode) and add the line below, in addition to what is already there. Please keep "ip-10-0-223-116.ec2.internal" first after the IP address in this file:
10.0.223.116 ip-10-0-223-116.ec2.internal ip-10-0-223-116
2. Edit your "hdfs-site.xml" file and replace the "localhost" addresses mentioned above with "ip-10-0-223-116.ec2.internal" (if that does not work, then try "ip-10-0-223-116" on the next attempt).
3. Restart your HDFS services after making these changes. I am not sure whether you should be using the "ip-10-0-223-116.ec2.internal" or "ip-10-0-223-116" hostname, but based on the "hostname -f" output it should be "ip-10-0-223-116.ec2.internal". If that does not work, try both hostnames one by one to see which one works.
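After the restart, a quick way to confirm both halves of the fix (does the hostname resolve via /etc/hosts, and is the NameNode port reachable over TCP, as telnet would check) is a small script. This is only a diagnostic sketch; `check_endpoint` is a hypothetical helper, and the hostname and port are the values from this thread:

```python
import socket

def check_endpoint(host, port, timeout=3):
    """Resolve a hostname (via /etc/hosts and DNS) and attempt a
    TCP connection, roughly what `telnet host port` does."""
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return (None, False)          # name does not resolve at all
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return (ip, True)         # resolved and port is open
    except OSError:
        return (ip, False)            # resolved, but connection failed

if __name__ == "__main__":
    # NameNode FQDN and RPC port taken from this thread
    ip, ok = check_endpoint("ip-10-0-223-116.ec2.internal", 8020)
    print(f"resolved to {ip}, reachable: {ok}")
```

Run it from a DataNode host: `(None, False)` points at a name-resolution problem, while `(ip, False)` points at a firewall or bind-address problem.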
Created 10-27-2017 07:23 AM
@Aditya Sirna The error seems weird.
1) The NameNode service is up and running on the master:
[cloudbreak@ip-10-0-223-116 ~]$ sudo netstat -nltp | grep :8020
tcp  0  0  127.0.0.1:8020  0.0.0.0:*  LISTEN  6689/java
2) Port 8020 is open to all (see the attached screen-shot-2017-10-27-at-124602-pm.png)
3) I can ping from master to slave using the private DNS
4) But telnet gets connection refused:
[cloudbreak@ip-10-0-109-247 hdfs]$ telnet 10.0.223.116 8020
Trying 10.0.223.116...
telnet: connect to address 10.0.223.116: Connection refused
5) The DataNode logs have the error below:
2017-10-27 05:24:15,073 INFO ipc.Client (Client.java:handleConnectionFailure(906)) - Retrying connect to server: ip-10-0-223-116.ec2.internal/10.0.223.116:8020. Already tried 48 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
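The retry policy named in that log line does exactly what its name suggests: retry a failing connect a bounded number of times with a fixed sleep between attempts. A minimal sketch of that behavior (not the actual Hadoop implementation; `retry_with_fixed_sleep` is a hypothetical stand-in):

```python
import time

def retry_with_fixed_sleep(op, max_retries=50, sleep_s=1.0):
    """Sketch of RetryUpToMaximumCountWithFixedSleep semantics:
    call op() until it succeeds or max_retries attempts are used,
    sleeping a fixed interval between failed attempts."""
    for attempt in range(1, max_retries + 1):
        try:
            return op()
        except OSError:
            if attempt == max_retries:
                raise                  # attempts exhausted, give up
            time.sleep(sleep_s)
```

So "Already tried 48 time(s) ... maxRetries=50" means the DataNode is two failed connects away from giving up on the NameNode entirely.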
Created 10-27-2017 07:56 AM
Can you set dfs.namenode.rpc-address to ip-10-0-223-116 and restart the NameNode?
I hope you have the mapping below in /etc/hosts:
10.0.223.116 ip-10-0-223-116
Created 10-27-2017 08:29 AM
This didn't work @Aditya Sirna. Still the same issue.
It seems like some kind of firewall issue:
telnet 10.0.223.116 8020 connects from the NameNode, but not from the DataNode.
Created 10-27-2017 08:07 AM
In one of your previous updates you mentioned: "I can ping from master to slave using the private DNS, But telnet is getting connection refused".
That can happen if there is a firewall issue. So can you also please check whether "iptables" (or firewalld) is stopped on the NameNode host and on all of the DataNode hosts? Please share the output of the following from the DataNode host:
# service iptables status
# service iptables stop
# ping 10.0.223.116
# ping ip-10-0-223-116
# telnet 10.0.223.116 8020
Also, the "/etc/hosts" file should look almost identical across the cluster, so that every node can resolve and connect to every other node using its FQDN/hostname.
Please also check whether "hostname -f" returns the correct FQDN on every host, especially the NameNode and the DataNode that is failing to connect:
# hostname -f
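The "almost identical /etc/hosts everywhere" check can be automated with a small parser. This is an illustrative sketch only; `parse_hosts` is a hypothetical helper, and the sample uses the IP and hostnames from this thread:

```python
def parse_hosts(text):
    """Parse /etc/hosts content into a {name: ip} mapping,
    skipping comments and blank lines. All aliases on a line
    map to that line's IP."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        parts = line.split()
        ip, names = parts[0], parts[1:]
        for name in names:
            mapping[name] = ip
    return mapping

sample = """
127.0.0.1 localhost
10.0.223.116 ip-10-0-223-116.ec2.internal ip-10-0-223-116
"""
hosts = parse_hosts(sample)
# both the FQDN and the short name should map to the NameNode IP
assert hosts["ip-10-0-223-116.ec2.internal"] == "10.0.223.116"
assert hosts["ip-10-0-223-116"] == "10.0.223.116"
```

Running this against the hosts file collected from each node would quickly show whether any node maps the NameNode's name to a different (or missing) IP.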
Created 10-27-2017 08:12 AM
Also, based on your previously shared "/etc/hosts" file content, I do not see the following address mapped anywhere:
10.0.223.116 ip-10-0-65-59.ec2.internal ip-10-0-65-59.ec2
Is your NameNode address "34.231.98.121" or "10.0.223.116"?
Created 10-27-2017 08:15 AM
Can you also share the output of the following commands from the NameNode host?
# service iptables status
# hostname -f
# netstat -tnlpa | grep `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid`
Created 10-27-2017 08:32 AM
[cloudbreak@ip-10-0-223-116 ~]$ sudo service iptables status
iptables: Firewall is not running.
[cloudbreak@ip-10-0-223-116 ~]$ hostname -f
ip-10-0-223-116.ec2.internal
[cloudbreak@ip-10-0-223-116 ~]$ sudo netstat -tnlpa | grep `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid`
tcp  0  0  127.0.0.1:8020   0.0.0.0:*  LISTEN  29280/java
tcp  0  0  127.0.0.1:50070  0.0.0.0:*  LISTEN  29280/java
tcp  0  0  127.0.0.1:20103  0.0.0.0:*  LISTEN  29280/java
Created 10-27-2017 08:50 AM
It is strange to see the address 127.0.0.1:8020 in your output: it means port 8020 is listening only on the loopback address, so it cannot be accessed from a remote machine.
Ideally the NameNode address should be showing something like following:
# netstat -tnlpa | grep `cat /var/run/hadoop/hdfs/hadoop-hdfs-namenode.pid`
tcp  0  0  172.27.10.2:8020   0.0.0.0:*  LISTEN  509/java
tcp  0  0  172.27.10.2:50070  0.0.0.0:*  LISTEN  509/java
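The loopback-versus-interface distinction can be demonstrated in a few lines: a socket bound to 127.0.0.1 only accepts loopback connections, while one bound to 0.0.0.0 (or the host's own address) is reachable from other machines. This is illustrative only, not Hadoop code; `bound_address` is a hypothetical helper:

```python
import socket

def bound_address(host):
    """Bind a listening TCP socket to the given address and
    report the local address the OS actually assigned it."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, 0))       # port 0: let the OS pick a free port
    s.listen(1)
    addr = s.getsockname()[0]
    s.close()
    return addr

print(bound_address("127.0.0.1"))  # loopback only: invisible to remote hosts
print(bound_address("0.0.0.0"))    # all interfaces: reachable remotely
```

The DataNode's "Connection refused" is exactly what a remote client sees when the server socket is bound the first way, which matches the 127.0.0.1:8020 line in the netstat output above.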
Can you please check and share the core-site.xml / hdfs-site.xml, so we can see whether the NameNode address is correct and it is listening on the hostname (FQDN) rather than on 127.0.0.1?
Created 10-27-2017 08:54 AM
Please check the ifconfig output on the NameNode to see the actual IP address it is supposed to be using.
Example:
# ifconfig
eth0  Link encap:Ethernet  HWaddr 12:42:AA:11:00:02
      inet addr:172.27.10.2  Bcast:0.0.0.0  Mask:255.255.0.0
...
lo    Link encap:Local Loopback
      inet addr:127.0.0.1  Mask:255.0.0.0
Created 10-27-2017 09:31 AM
Please find attached the core-site.xml, hdfs-site.xml, and /etc/hosts from all nodes:
hosts-master.txt hosts-slave-1.txt hosts-slave-2.txt core-site.xml hdfs-site.xml
Thanks,
Nirmal J