New Contributor
Posts: 22
Registered: ‎02-11-2014

2 dead nodes - but still getting written to

Hello.

 

I am using the free version of CDH 4 on CentOS 6.4, and I have a problem that I find strange. I have rebuilt the OS and reinstalled each of the dead nodes in case something funky happened, but the problem remains.

 

Cloudera Manager shows each of my dead nodes as "Good Health, Started, Good," but dfsadmin -report shows that they are "dead."

However, when I write files to HDFS, my two dead nodes still get data written to them.
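
One way to double-check which DataNodes the replicas of a freshly written file actually land on is fsck with block locations (the /tmp/testfile path below is just an example):

# write a small test file, then list the DataNodes holding its block replicas
hadoop fs -put /etc/hosts /tmp/testfile
sudo -u hdfs hdfs fsck /tmp/testfile -files -blocks -locations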

 

Any suggestions?

Brian

 

 

 

----------------------------------------------------------------------------

[root@n2.company.com finalized]# pwd

/dfs/dn/current/BP-1978397931-192.168.129.13-1383064468925/current/finalized

[root@n2.company.com finalized]# du -hs .

3.2G    .

[root@n2.company.com finalized]# du -hs .

3.4G    .
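
One way to confirm that these really are new HDFS blocks arriving (and roughly when) is to look for recently modified block files under the DataNode data directory, for example:

# block files written to this DataNode within the last 10 minutes
find /dfs/dn/current -name 'blk_*' -mmin -10 | head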

 

-------------------------------------------------------------------------

dfsadmin -report (I left out the good nodes)

 

Datanodes available: 16 (18 total, 2 dead)

 

Dead datanodes:
Name: 192.168.129.1:50010 (n1.company.com)
Hostname: 192.168.129.1
Decommission Status : Normal
Configured Capacity: 0 (0 B)
DFS Used: 0 (0 B)
Non DFS Used: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used%: 100.00%
DFS Remaining%: 0.00%
Last contact: Wed Dec 31 19:00:00 EST 1969


Name: 192.168.129.2:50010 (n2.company.com)
Hostname: 192.168.129.2
Decommission Status : Normal
Configured Capacity: 0 (0 B)
DFS Used: 0 (0 B)
Non DFS Used: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used%: 100.00%
DFS Remaining%: 0.00%
Last contact: Wed Dec 31 19:00:00 EST 1969

New Contributor
Posts: 22
Registered: ‎02-11-2014

Re: 2 dead nodes - but still getting written to

Some more information: I found that the files in /etc/hadoop/conf on these 2 dead nodes were "stock" files.

They had no configuration information for my environment.

 

So I did "Deploy Client Configuration" and the correct conf files were placed in /etc/hadoop/conf.
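
As a quick sanity check that the deployed client config points at the right NameNode (the property names below are the usual CDH4 ones; your NameNode URI will differ):

# show the NameNode URI the client configuration points to
grep -A1 'fs.defaultFS\|fs.default.name' /etc/hadoop/conf/core-site.xml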

 

I then restarted HDFS; no change, they still show as dead.

I then restarted the cluster; still no change, they show as dead in dfsadmin -report, but Cloudera Manager says they are healthy.

 

Brian

Posts: 416
Topics: 51
Kudos: 86
Solutions: 49
Registered: ‎06-26-2013

Re: 2 dead nodes - but still getting written to

Do those two nodes, by any chance, have iptables enabled?

 

sudo chkconfig iptables --list

sudo service iptables status

 

What you're seeing is that Cloudera Manager is able to communicate with its SCM agent on those nodes and start/stop the datanodes, so it thinks things are fine, but those DNs don't seem to be able to check in with the NN.  I would also check the datanode logs in /var/log/hadoop-hdfs/hadoop-cmf-hdfs1-DATANODE-cm46nn.demo.dev.log.out to see what they're complaining about.  Maybe they are getting denied by the NN?  Looking for the wrong IP address for the NN?
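
Something like this on each of the two nodes should surface the relevant messages quickly (the exact log file name will vary per host):

# scan the DataNode log for recent registration / connection problems
grep -iE 'error|exception|refused|denied' /var/log/hadoop-hdfs/*DATANODE*.log.out | tail -50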

 

Also make sure those two nodes aren't overriding any settings in your hdfs1->Configuration pages in CM and aren't in the "dfs_hosts_exclude.txt" safety valve property for any reason.

 

If all else fails, make sure these nodes have their hostnames tied to an actual routable IP address, not loopback.  So make sure their hostnames aren't on the 127.0.0.1 line in /etc/hosts, etc.
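
A couple of quick checks on each node, for example:

# the FQDN should resolve to the node's real NIC address, not 127.0.0.1
hostname -f
getent hosts $(hostname -f)
ip addr show | grep 'inet '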

New Contributor
Posts: 22
Registered: ‎02-11-2014

Re: 2 dead nodes - but still getting written to

iptables and SELinux are both off.

Let me look into what you said to check, and if anything becomes apparent I will report back.

Thanks
New Contributor
Posts: 22
Registered: ‎02-11-2014

Re: 2 dead nodes - but still getting written to

The hosts file is fine.

 

This is from the NameNode log on the namenode host (hadoop-cmf-hdfs1-NAMENODE-xxxx.log.out).

The log entries below are from when I start the HDFS process on 192.168.129.1.

 

 

2014-02-13 17:09:47,638 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* registerDatanode: from DatanodeRegistration(192.168.129.1, storageID=DS-463268316-192.168.129.1-50010-1392235735193, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=cluster5;nsid=2147262545;c=0) storage DS-463268316-192.168.129.1-50010-1392235735193
2014-02-13 17:09:47,639 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default/192.168.129.1:50010
2014-02-13 17:09:47,639 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default/192.168.129.1:50010
2014-02-13 17:09:47,684 INFO BlockStateChange: BLOCK* processReport: from DatanodeRegistration(192.168.129.1, storageID=DS-463268316-192.168.129.1-50010-1392235735193, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=cluster5;nsid=2147262545;c=0), blocks: 2, processing time: 0 msecs

Posts: 416
Topics: 51
Kudos: 86
Solutions: 49
Registered: ‎06-26-2013

Re: 2 dead nodes - but still getting written to

OK, it looks like the DN is actually registering with the NN.   Do you have NN HA configured?  And what version of CDH is this?

 

Can you open up the NN's web UI and tell us if these DNs are showing up in both the live AND dead nodes list?
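
If it's easier than clicking through the UI, the same live/dead lists are exposed by the NameNode's JMX servlet (assuming the default web port 50070; replace <namenode> with your NN host):

# LiveNodes / DeadNodes as reported by the NameNode itself
curl -s 'http://<namenode>:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo'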

 

New Contributor
Posts: 22
Registered: ‎02-11-2014

Re: 2 dead nodes - but still getting written to

2.0.0-cdh4.5.0  (Cloudera Express). I do not think HA is an option with this version.

 

From NN UI.

Live Nodes     :    16 (Decommissioned: 0)
Dead Nodes     :    2 (Decommissioned: 0)

Posts: 416
Topics: 51
Kudos: 86
Solutions: 49
Registered: ‎06-26-2013

Re: 2 dead nodes - but still getting written to

OK, and those two machines only show up once in CM too?  If you look at the Hosts page in CM, there aren't any duplicate entries for these machines?

Posts: 416
Topics: 51
Kudos: 86
Solutions: 49
Registered: ‎06-26-2013

Re: 2 dead nodes - but still getting written to

Also, can you run Host Inspector from that page and see if it returns any warnings about these nodes?

New Contributor
Posts: 22
Registered: ‎02-11-2014

Re: 2 dead nodes - but still getting written to

I checked the hosts page.

There are no duplicate host names or IP addresses.

 

There are no errors that I see with the Host Inspector. There are about 8 machines that I built at the same time (the namenode, the 2 dead nodes, and 6 alive ones) that have the exact same CDH version. There are about 7 other machines that have a slightly older version of CDH, and one node that is mixed.

 

 

Validations
    Inspector ran on all 17 hosts.
    Individual hosts resolved their own hostnames correctly.
    No errors were found while looking for conflicting init scripts.
    No errors were found while checking /etc/hosts.
    All hosts resolved localhost to 127.0.0.1.
    All hosts checked resolved each other's hostnames correctly.
    Host clocks are approximately in sync (within ten minutes).
    Host time zones are consistent across the cluster.
    The group oozie is missing on the following hosts:
    The group hue is missing on the following hosts:
    The user oozie is missing on the following hosts:
    The user hue is missing on the following hosts:
    No kernel versions that are known to be bad are running.
    No performance concerns with Transparent Huge Pages settings.
    1 hosts are reporting with MIXED CDH version
    There are mismatched versions across the system. See details below for details on which hosts are running what versions of components.
    All managed hosts have consistent versions of Java.
    All checked Cloudera Management Daemons versions are consistent with the server.
    All checked Cloudera Management Agents versions are consistent with the server.
