Created 08-12-2016 11:14 PM
I have created a 5-node cluster on AWS, and one of the DataNodes is showing as down in Ambari. I have logged in to the node and run ambari-agent status; it shows as running. Please help me resolve the issue.
Created 08-13-2016 01:11 AM
Can you please share more information? When you say one node is down, do you mean the DataNode process? Can we see the HDFS logs from the /var/log folder? Is the node able to talk to Ambari? What kind of instance is this? Some low-end instances share network bandwidth and other resources with applications other than yours. Those applications may at times use up resources and impact your system. If that's the case, the node will show up as working again as soon as resources become available. Is it possible to restart the node? I know in AWS it's not as simple a decision as on an on-prem cluster, but sometimes that might be it.
Created 08-13-2016 01:15 AM
Can you please check logs on that datanode?
Also, run hdfs dfsadmin -report to check whether the DataNode is really down or it's an Ambari glitch.
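For example, a rough way to check from a saved report whether a given host shows up among the live DataNodes. The helper name and temp-file path below are my own, not part of any Hadoop tooling:

```shell
# First capture the report (requires the hdfs client on the node):
#   hdfs dfsadmin -report > /tmp/dfs-report.txt
# Then check for the host. datanode_is_live is a hypothetical helper name.
datanode_is_live() {
  # $1 = saved report file, $2 = hostname (or substring) to look for.
  # Note: a dead node can also print a "Hostname:" line under the
  # "Dead datanodes" section, so confirm which section it appears in.
  grep '^Hostname:' "$1" | grep -q "$2"
}
```

If the host is missing from the live list, the DataNode process itself is down and it is not just an Ambari display issue.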
Created 08-13-2016 05:40 PM
Check with the HDFS user:
$ sudo su - hdfs
$ hdfs dfsadmin -report (from the output, verify the list of available DataNodes)
Verify whether the node is listed or not.
Then go to the log directory and check the DataNode log, for example:
# less /var/log/hadoop/hdfs/hadoop-hdfs-datanode-<hostname>.log
It will give you some more information about the issue.
If helpful, your comment and accept are appreciated.
Created 08-15-2016 03:08 AM
Hi,
I have logged into the server and run hdfs dfsadmin -report; below is the output:
[centos@ip-172-31-9-98 ~]$ sudo su hdfs
[hdfs@ip-172-31-9-98 centos]$ hdfs dfsadmin -report
Configured Capacity: 28984442880 (26.99 GB)
Present Capacity: 7856140288 (7.32 GB)
DFS Remaining: 5172633600 (4.82 GB)
DFS Used: 2683506688 (2.50 GB)
DFS Used%: 34.16%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (4):
Name: 172.31.58.15:50010 (ip-172-31-58-15.ec2.internal)
Hostname: ip-172-31-58-15.ec2.internal
Decommission Status : Normal
Configured Capacity: 7246110720 (6.75 GB)
DFS Used: 585097216 (557.99 MB)
Non DFS Used: 3310796800 (3.08 GB)
DFS Remaining: 3350216704 (3.12 GB)
DFS Used%: 8.07%
DFS Remaining%: 46.23%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 2
Last contact: Mon Aug 15 01:45:33 UTC 2016
Name: 172.31.6.230:50010 (ip-172-31-6-230.ec2.internal)
Hostname: ip-172-31-6-230.ec2.internal
Decommission Status : Normal
Configured Capacity: 7246110720 (6.75 GB)
DFS Used: 894488576 (853.05 MB)
Non DFS Used: 6351622144 (5.92 GB)
DFS Remaining: 0 (0 B)
DFS Used%: 12.34%
DFS Remaining%: 0.00%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 2
Last contact: Mon Aug 15 01:45:31 UTC 2016
Name: 172.31.9.97:50010 (ip-172-31-9-97.ec2.internal)
Hostname: ip-172-31-9-97.ec2.internal
Decommission Status : Normal
Configured Capacity: 7246110720 (6.75 GB)
DFS Used: 894484480 (853.05 MB)
Non DFS Used: 5936037888 (5.53 GB)
DFS Remaining: 415588352 (396.34 MB)
DFS Used%: 12.34%
DFS Remaining%: 5.74%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 2
Last contact: Mon Aug 15 01:45:32 UTC 2016
Name: 172.31.58.16:50010 (ip-172-31-58-16.ec2.internal)
Hostname: ip-172-31-58-16.ec2.internal
Decommission Status : Normal
Configured Capacity: 7246110720 (6.75 GB)
DFS Used: 309436416 (295.10 MB)
Non DFS Used: 5529845760 (5.15 GB)
DFS Remaining: 1406828544 (1.31 GB)
DFS Used%: 4.27%
DFS Remaining%: 19.41%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 2
Last contact: Mon Aug 15 01:45:33 UTC 2016
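The report above lists only 4 live DataNodes, and ip-172-31-9-98 itself is not among them, so the DataNode process on this host does appear to be down. A quick sanity check on a saved report (the helper name is my own):

```shell
# Each live DataNode contributes one "Name:" line under "Live datanodes",
# so counting them gives the live-node count Ambari should agree with.
# (A real report may also list "Name:" lines under "Dead datanodes";
# capture only the live section if your cluster has dead nodes.)
#   hdfs dfsadmin -report > /tmp/dfs-report.txt   # capture first
count_live_datanodes() {
  grep -c '^Name:' "$1"   # $1 = saved report file
}
```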
I have gone through the HDFS logs on the server. It is showing a Java I/O exception (ConnectException). Below is the log output; please go through it and help me out.
[hdfs@ip-172-31-9-98 centos]$ cd /var/lo
local/ lock/ log/
[hdfs@ip-172-31-9-98 centos]$ cd /var/log/
[hdfs@ip-172-31-9-98 log]$ ls
ambari-agent              cron-20160814      messages-20160807
ambari-metrics-collector  cups               messages-20160814
ambari-metrics-monitor    dmesg              oozie
anaconda.ifcfg.log        dmesg.old          secure
anaconda.log              dracut.log         secure-20160807
anaconda.program.log      falcon             secure-20160814
anaconda.storage.log      hadoop             spark
anaconda.syslog           hadoop-mapreduce   spooler
anaconda.yum.log          hadoop-yarn        spooler-20160807
audit                     hive               spooler-20160814
boot.log                  hive-hcatalog      tallylog
btmp                      lastlog            wtmp
cloud-init.log            maillog            yum.log
cloud-init-output.log     maillog-20160807   zookeeper
cron                      maillog-20160814
cron-20160807             messages
[hdfs@ip-172-31-9-98 log]$ cd hadoop
[hdfs@ip-172-31-9-98 hadoop]$ ls
hdfs  mapreduce  root  yarn
[hdfs@ip-172-31-9-98 hadoop]$ cd hdfs/
[hdfs@ip-172-31-9-98 hdfs]$ ls
gc.log-201608031630
gc.log-201608031641
gc.log-201608081832
gc.log-201608082306
gc.log-201608141850
hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.log
hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.out
hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.out.1
hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.out.2
hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.out.3
hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.out.4
hdfs-audit.log
SecurityAuth.audit
[hdfs@ip-172-31-9-98 hdfs]$ tail -100f hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.log
2016-08-08 18:39:26,009 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(214)) - Unable to send metrics to collector by address:http://ip-172-31-9-98.ec2.internal:6188/ws/v1/timeline/metrics
2016-08-08 18:40:26,008 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(439)) - I/O exception (java.net.ConnectException) caught when processing request: Connection refused
2016-08-08 18:40:26,008 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(445)) - Retrying request
2016-08-08 18:40:26,008 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(439)) - I/O exception (java.net.ConnectException) caught when processing request: Connection refused
2016-08-08 18:40:26,008 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(445)) - Retrying request
2016-08-08 18:40:26,008 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(439)) - I/O exception (java.net.ConnectException) caught when processing request: Connection refused
2016-08-08 18:40:26,009 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(445)) - Retrying request
2016-08-08 18:40:26,009 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(214)) - Unable to send metrics to collector by address:http://ip-172-31-9-98.ec2.internal:6188/ws/v1/timeline/metrics
2016-08-08 18:40:26,009 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(439)) - I/O exception (java.net.ConnectException) caught when processing request: Connection refused
2016-08-08 18:40:26,009 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(445)) - Retrying request
Created 08-15-2016 05:19 AM
What type of instances are these? It seems like a simple connection issue. This might just be because of the lower end instances being used.
Is this the data node that's down?
http://ip-172-31-9-98.ec2.internal:6188/ws/v1/timeline/metrics
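Port 6188 there is the Ambari Metrics Collector endpoint the sink is trying to reach, and the log shows "Connection refused", so the collector is likely not listening. One way to confirm from the node, as a small bash probe (the function name is my own):

```shell
# Probe a TCP port using bash's built-in /dev/tcp (bash-specific, no nc
# required). Returns 0 if a connection succeeds, non-zero on refusal.
check_port() {
  local host=$1 port=$2
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Example against the collector host from the log (adjust to your cluster):
# check_port ip-172-31-9-98.ec2.internal 6188 && echo open || echo refused
```

If the port is closed, restart the Ambari Metrics Collector from Ambari and watch whether the DataNode's metrics warnings stop. Note this metrics failure by itself would not mark the DataNode as down; the dfsadmin report above is the better indicator of that.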