One Data node is down in the cluster

New Contributor

I have created a 5-node cluster on AWS, and one of the DataNodes is showing as down in Ambari. I logged in to the node and ran ambari-agent status; it shows the agent as running. Please help me resolve the issue.

1 ACCEPTED SOLUTION

Super Guru
@venkat v

Can you please share more information? When you say one node is down, do you mean the DataNode process? Can we see the HDFS logs from the /var/log folder? Is the node able to talk to Ambari? What kind of instance is this? Some low-end instances share network bandwidth and other resources with applications other than yours. Those applications may at times be using resources and impacting your system. If that's the case, the node will show up as working again as soon as resources become available. Is it possible to restart the node? I know that in AWS it's not as simple a decision as on an on-prem cluster, but sometimes that might be all it takes.

5 REPLIES

@venkat v

Can you please check the logs on that DataNode?

Also, run hdfs dfsadmin -report to check whether the DataNode is really down or whether this is just an Ambari glitch.
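
If it helps, here is a rough sketch of how the report could be checked for one specific host instead of reading it by eye. It assumes the report uses the "Live datanodes" / "Dead datanodes" section headings and "Hostname:" lines seen in Hadoop 2.x output; the function names live_datanodes and is_live are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of Hadoop or Ambari.

# Print the hostnames listed under "Live datanodes" in a dfsadmin report.
# Reads the report text on stdin, so it also works on saved output.
live_datanodes() {
  awk '
    /^Live datanodes/ { live = 1; next }   # entering the live section
    /^Dead datanodes/ { live = 0 }         # everything after this is not live
    live && $1 == "Hostname:" { print $2 }
  '
}

# Exit 0 if the given hostname appears in the live list read from stdin.
is_live() {
  live_datanodes | grep -qxF -- "$1"
}

# Usage, as the hdfs user on a cluster node:
#   hdfs dfsadmin -report | live_datanodes
#   hdfs dfsadmin -report | is_live ip-172-31-9-98.ec2.internal && echo live || echo "not live"
```

If the host is missing from the live list, the NameNode is not receiving heartbeats from it, so the problem is more than an Ambari display issue.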

Expert Contributor

Check as the HDFS user:

$ sudo su - hdfs

$ hdfs dfsadmin -report

From the output, verify the list of available DataNodes and check which node is missing.

Then go to the log directory and look at the DataNode log:

# cd /var/log/hadoop/hdfs

It will give you some more information about the issue.
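
To get a quick overview of a large DataNode log before reading it line by line, something like the sketch below may help. It assumes the log layout that appears in this thread (date, time, level, logger) and the /var/log/hadoop/hdfs directory from the listing further down; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Count log lines per level (INFO, WARN, ERROR, ...) from a Hadoop log on stdin.
# Only lines that start with a YYYY-MM-DD date are counted, which skips
# stack-trace continuation lines.
log_level_counts() {
  awk '$1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ { n[$3]++ }
       END { for (lvl in n) print lvl, n[lvl] }' | LC_ALL=C sort
}

# Print only the ERROR and FATAL lines from a log on stdin.
serious_lines() {
  awk '$3 == "ERROR" || $3 == "FATAL"'
}

# Usage, on the affected node:
#   cat /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log | log_level_counts
#   cat /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log | serious_lines | tail -20
```

ERROR or FATAL lines near the end of the log are usually the best place to start.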

If this is helpful, your comment and accept are appreciated.

New Contributor

Hi,

I have logged into the server and run hdfs dfsadmin -report; below is the output.

[centos@ip-172-31-9-98 ~]$ sudo su hdfs

[hdfs@ip-172-31-9-98 centos]$ hdfs dfsadmin -report

Configured Capacity: 28984442880 (26.99 GB)

Present Capacity: 7856140288 (7.32 GB)

DFS Remaining: 5172633600 (4.82 GB)

DFS Used: 2683506688 (2.50 GB)

DFS Used%: 34.16%

Under replicated blocks: 0

Blocks with corrupt replicas: 0

Missing blocks: 0

Missing blocks (with replication factor 1): 0

-------------------------------------------------

Live datanodes (4):

Name: 172.31.58.15:50010 (ip-172-31-58-15.ec2.internal)

Hostname: ip-172-31-58-15.ec2.internal

Decommission Status : Normal

Configured Capacity: 7246110720 (6.75 GB)

DFS Used: 585097216 (557.99 MB)

Non DFS Used: 3310796800 (3.08 GB)

DFS Remaining: 3350216704 (3.12 GB)

DFS Used%: 8.07%

DFS Remaining%: 46.23%

Configured Cache Capacity: 0 (0 B)

Cache Used: 0 (0 B)

Cache Remaining: 0 (0 B)

Cache Used%: 100.00%

Cache Remaining%: 0.00%

Xceivers: 2

Last contact: Mon Aug 15 01:45:33 UTC 2016

Name: 172.31.6.230:50010 (ip-172-31-6-230.ec2.internal)

Hostname: ip-172-31-6-230.ec2.internal

Decommission Status : Normal

Configured Capacity: 7246110720 (6.75 GB)

DFS Used: 894488576 (853.05 MB)

Non DFS Used: 6351622144 (5.92 GB)

DFS Remaining: 0 (0 B)

DFS Used%: 12.34%

DFS Remaining%: 0.00%

Configured Cache Capacity: 0 (0 B)

Cache Used: 0 (0 B)

Cache Remaining: 0 (0 B)

Cache Used%: 100.00%

Cache Remaining%: 0.00%

Xceivers: 2

Last contact: Mon Aug 15 01:45:31 UTC 2016

Name: 172.31.9.97:50010 (ip-172-31-9-97.ec2.internal)

Hostname: ip-172-31-9-97.ec2.internal

Decommission Status : Normal

Configured Capacity: 7246110720 (6.75 GB)

DFS Used: 894484480 (853.05 MB)

Non DFS Used: 5936037888 (5.53 GB)

DFS Remaining: 415588352 (396.34 MB)

DFS Used%: 12.34%

DFS Remaining%: 5.74%

Configured Cache Capacity: 0 (0 B)

Cache Used: 0 (0 B)

Cache Remaining: 0 (0 B)

Cache Used%: 100.00%

Cache Remaining%: 0.00%

Xceivers: 2

Last contact: Mon Aug 15 01:45:32 UTC 2016

Name: 172.31.58.16:50010 (ip-172-31-58-16.ec2.internal)

Hostname: ip-172-31-58-16.ec2.internal

Decommission Status : Normal

Configured Capacity: 7246110720 (6.75 GB)

DFS Used: 309436416 (295.10 MB)

Non DFS Used: 5529845760 (5.15 GB)

DFS Remaining: 1406828544 (1.31 GB)

DFS Used%: 4.27%

DFS Remaining%: 19.41%

Configured Cache Capacity: 0 (0 B)

Cache Used: 0 (0 B)

Cache Remaining: 0 (0 B)

Cache Used%: 100.00%

Cache Remaining%: 0.00%

Xceivers: 2

Last contact: Mon Aug 15 01:45:33 UTC 2016

I have gone through the HDFS logs on the server. They show an I/O exception (Java). Below is the log output; please go through it and help me out.

[hdfs@ip-172-31-9-98 centos]$ cd /var/lo

local/ lock/ log/

[hdfs@ip-172-31-9-98 centos]$ cd /var/log/

[hdfs@ip-172-31-9-98 log]$ ls

ambari-agent cron-20160814 messages-20160807

ambari-metrics-collector cups messages-20160814

ambari-metrics-monitor dmesg oozie

anaconda.ifcfg.log dmesg.old secure

anaconda.log dracut.log secure-20160807

anaconda.program.log falcon secure-20160814

anaconda.storage.log hadoop spark

anaconda.syslog hadoop-mapreduce spooler

anaconda.yum.log hadoop-yarn spooler-20160807

audit hive spooler-20160814

boot.log hive-hcatalog tallylog

btmp lastlog wtmp

cloud-init.log maillog yum.log

cloud-init-output.log maillog-20160807 zookeeper

cron maillog-20160814

cron-20160807 messages

[hdfs@ip-172-31-9-98 log]$ cd hadoop

[hdfs@ip-172-31-9-98 hadoop]$ ls

hdfs mapreduce root yarn

[hdfs@ip-172-31-9-98 hadoop]$ cd hdfs/

[hdfs@ip-172-31-9-98 hdfs]$ ls

gc.log-201608031630

gc.log-201608031641

gc.log-201608081832

gc.log-201608082306

gc.log-201608141850

hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.log

hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.out

hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.out.1

hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.out.2

hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.out.3

hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.out.4

hdfs-audit.log

SecurityAuth.audit

[hdfs@ip-172-31-9-98 hdfs]$ tail -100f hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.log

2016-08-08 18:39:26,009 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(214)) - Unable to send metrics to collector by address:http://ip-172-31-9-98.ec2.internal:6188/ws/v1/timeline/metrics

2016-08-08 18:40:26,008 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(439)) - I/O exception (java.net.ConnectException) caught when processing request: Connection refused

2016-08-08 18:40:26,008 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(445)) - Retrying request

2016-08-08 18:40:26,008 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(439)) - I/O exception (java.net.ConnectException) caught when processing request: Connection refused

2016-08-08 18:40:26,008 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(445)) - Retrying request

2016-08-08 18:40:26,008 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(439)) - I/O exception (java.net.ConnectException) caught when processing request: Connection refused

2016-08-08 18:40:26,009 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(445)) - Retrying request

2016-08-08 18:40:26,009 WARN timeline.HadoopTimelineMetricsSink (HadoopTimelineMetricsSink.java:putMetrics(214)) - Unable to send metrics to collector by address:http://ip-172-31-9-98.ec2.internal:6188/ws/v1/timeline/metrics

2016-08-08 18:40:26,009 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(439)) - I/O exception (java.net.ConnectException) caught when processing request: Connection refused

2016-08-08 18:40:26,009 INFO httpclient.HttpMethodDirector (HttpMethodDirector.java:executeWithRetry(445)) - Retrying request

Super Guru

@venkat v

What type of instances are these? It looks like a simple connection issue, which might just be down to the lower-end instances being used.

Is this the DataNode that's down?

http://ip-172-31-9-98.ec2.internal:6188/ws/v1/timeline/metrics
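
For what it's worth, port 6188 in that URL is, as far as I know, the default Ambari Metrics Collector port, and the class logging the warning is HadoopTimelineMetricsSink. So these messages may only mean that the metrics collector is unreachable, not that the DataNode itself has failed. A rough sketch for pulling the unreachable endpoint out of the log so it can be tested directly (the function name failed_collectors is made up for this example):

```shell
#!/usr/bin/env bash
# Sketch only: helper name is hypothetical.

# Print each distinct host:port that the metrics sink reported it could not
# reach, based on the "Unable to send metrics to collector by address:" lines
# in a log read from stdin.
failed_collectors() {
  sed -n 's|.*Unable to send metrics to collector by address:http://\([^/]*\)/.*|\1|p' \
    | LC_ALL=C sort -u
}

# Usage, on the affected node:
#   failed_collectors < /var/log/hadoop/hdfs/hadoop-hdfs-datanode-ip-172-31-9-98.ec2.internal.log
#
# Then test the endpoint by hand; "Connection refused" from curl means nothing
# is listening on that port:
#   curl -sS -o /dev/null http://ip-172-31-9-98.ec2.internal:6188/ws/v1/timeline/metrics
```

If the collector is simply stopped, starting it from Ambari should stop these warnings, but the DataNode state still needs to be checked separately.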