Created 01-15-2014 06:51 PM
I have a cluster of about 20 datanodes. Suppose I need to shut about half of them off, let's say to move them across the room. I have the impression that the correct action is to stop all services on all nodes, including the Primary NameNode, and then shut down the nodes to be moved. Question 1) Is this correct? 2) Is there a risk of losing any data? (Of course I have to ask.) 3) Is the restart procedure just to boot the nodes, then start all services on all nodes? And 4) as I don't believe the cluster has ever been rebooted, can we test this procedure by stopping and starting all services on one node at a time while leaving the others running?
Created 01-16-2014 08:30 AM
I'll take a stab at addressing these questions:
1) Yes, you will need to shut down all Hadoop services on all nodes before you perform a move like this. Otherwise HDFS will naturally attempt to re-replicate all the data that was residing on the 10 datanodes you shut down. Since that is half your cluster, it's likely that some blocks could not be re-replicated because the only copies of those blocks resided on those 10 nodes, so your HDFS would go into safe mode due to under-replicated/missing blocks. There is no risk of data loss, since the blocks are still on the disks of the powered-off nodes; it's just not the way you'd like to do it.
2) If you properly shut down all services before doing the move, there is no risk of data loss. Just be sure your move doesn't entail giving the machines new IP addresses/hostnames, as this is an entirely different operation that requires a careful migration process.
3) Yes. Boot the nodes, then start all services on all nodes.
4) As stated in my response #1, you will get data replication churn in your cluster if you shut down individual datanodes. Cloudera Manager (enterprise) supports a rolling restart of your services if you'd like to maximize uptime; otherwise the NameNode will try to re-replicate data if you stop even a single node, at least once a certain timeout is reached. With the default heartbeat settings, that is about ten and a half minutes before the node is declared dead and its blocks begin to re-replicate to other nodes.
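To put rough numbers on answers 1 and 4, here is a back-of-the-envelope sketch in Python. It rests on assumptions that are not from your cluster: a replication factor of 3 (the HDFS default), replicas placed uniformly at random across datanodes (real placement is rack-aware, so treat the result as an estimate only), and the stock values of `dfs.heartbeat.interval` (3 seconds) and `dfs.namenode.heartbeat.recheck-interval` (300000 ms). The function names are made up for illustration.

```python
from math import comb  # requires Python 3.8+


def fraction_blocks_offline(total_nodes, nodes_down, replication=3):
    """Estimate the fraction of blocks with every replica on a stopped node.

    Assumes uniform random replica placement (ignores rack awareness).
    """
    if nodes_down < replication:
        # Not enough stopped nodes to hold all replicas of any block.
        return 0.0
    return comb(nodes_down, replication) / comb(total_nodes, replication)


def dead_node_timeout_seconds(heartbeat_interval_s=3, recheck_interval_ms=300_000):
    """Seconds of missed heartbeats before the NameNode marks a datanode dead.

    Uses 2 * recheck interval + 10 * heartbeat interval, with the
    default property values as (assumed) arguments.
    """
    return (2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s) / 1000


if __name__ == "__main__":
    share = fraction_blocks_offline(total_nodes=20, nodes_down=10)
    print(f"Blocks with no live replica: about {share:.1%}")
    print(f"Dead-node timeout: {dead_node_timeout_seconds() / 60} minutes")
```

Under those assumptions, roughly one block in ten would have no live replica while half the cluster is off, and a single stopped datanode has about 10.5 minutes before re-replication starts. If your settings or replication factor differ, plug in your own values.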
Created 01-22-2014 04:56 PM
Thanks for the info.