While hot-swapping one of the HDFS disks, we ran into an unusual situation. We removed the failed disk from the HDFS configuration normally, but some HBase processes were still trying to access it:
[root@hostname ~]# lsof | grep "/dfs/2/dn"
java 25935 hbase 285r REG 8,33 973 38930647 /dfs/2/dn/dn/current/BP-854419853-192.168.100.101-1450451673611/current/finalized/subdir
java 25935 hbase 286r REG 8,33 15 38930648 /dfs/2/dn/dn/current/BP-854419853-192.168.100.101-1450451673611/current/finalized/subdir
java 25935 hbase 293r REG 8,33 1041 38930649 /dfs/2/dn/dn/current/BP-854419853-192.168.100.101-1450451673611/current/finalized/subdir
java 25935 hbase 294r REG 8,33 19 38930650 /dfs/2/dn/dn/current/BP-854419853-192.168.100.101-1450451673611/current/finalized/subdir
java 25935 hbase 299r REG 8,33 2509 38930671 /dfs/2/dn/dn/current/BP-854419853-192.168.100.101-1450451673611/current/finalized/subdir
java 25935 hbase 300r REG 8,33 27 38930672 /dfs/2/dn/dn/current/BP-854419853-192.168.100.101-1450451673611/current/finalized/subdir
jsvc 32041 hdfs 212u REG 8,33 11 38930670 /dfs/2/dn/dn/current/BP-854419853-192.168.100.101-1450451673611/current/rbw/blk_10738239
jsvc 32041 hdfs 213r REG 8,33 83 38930669 /dfs/2/dn/dn/current/BP-854419853-192.168.100.101-1450451673611/current/rbw/blk_10738239
We could simply kill them, but a quick "ps -ef" revealed that those processes belonged to an active HBase RegionServer. Lacking a better solution, we restarted the RegionServer and the processes disappeared as expected. The problem is that, because of the active file handles on the faulty mountpoint, the OS won't let us unmount it (we use CentOS 6.x). Furthermore, killing the hung processes can take down a healthy HBase instance (we tried that too).
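For anyone hitting the same wall, this is roughly how we confirmed what was blocking the unmount (the /dfs/2 path is from our setup; adjust to your failed mountpoint):
[root@hostname ~]# fuser -vm /dfs/2
[root@hostname ~]# lsof /dfs/2
If none of the holders can be stopped cleanly, a lazy unmount detaches the filesystem and frees it once the last handle closes (use with care on a live DataNode):
[root@hostname ~]# umount -l /dfs/2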
Does anybody know what could cause such behavior? We reproduced it three times, on different servers. It's not a big deal to restart a service instance like the RegionServer (or any other redundant CDH service), but the hot-swap procedure doesn't mention that this could be required, right? (http://www.cloudera.com/documentation/enterprise/latest/topics/admin_dn_swap.html)
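In case it helps anyone, on a package-based CDH install the RegionServer alone can be bounced like this (Cloudera Manager users would do the equivalent from the UI):
[root@hostname ~]# service hbase-regionserver restart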
Any thoughts on this? I've managed to reproduce the issue several times, and it doesn't look like it's related to HBase only.
When a disk fails, HDFS processes keep holding on to it even after the HDFS data directories are refreshed. Restarting the HDFS DataNode cleans this up, but ..
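For context, the "directories refresh" above is the documented hot-swap reconfiguration: we remove the failed disk from dfs.datanode.data.dir and then trigger the DataNode reconfiguration, roughly like this (dn-host.example.com:50020 is a placeholder for the DataNode's IPC address; 50020 is the default on CDH 5):
[root@hostname ~]# hdfs dfsadmin -reconfig datanode dn-host.example.com:50020 start
[root@hostname ~]# hdfs dfsadmin -reconfig datanode dn-host.example.com:50020 status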
@mat15, I have moved your topic to our Storage board in the hopes that the experts there can confirm my suspicion: your issue looks related to a Technical Service Bulletin we released to our customers, whereby HDFS can run into issues when a disk is swapped out on a DataNode. The public JIRA capturing the issue is HDFS-7960; it should contain the details you need.